2025-12-15 · 1 min read

TabLogs OCR Extraction Pipeline

How we built the OCR extraction pipeline at TabLogs using PyTorch and ONNX Runtime

Overview

At TabLogs, we needed a robust pipeline for extracting structured data from scanned logistics documents. The challenge: handling varied layouts, noisy scans, and multilingual text across thousands of daily shipments.

Architecture

The pipeline consists of three stages:

  1. Detection — CRAFT-based text detection locates text regions
  2. Recognition — A CRNN model reads the detected regions
  3. Structuring — Rule-based post-processing maps text to fields
pipeline.py

from torchwisdom.models import CRAFTDetector, CRNNRecognizer

def extract_document(image_path: str) -> dict:
    # Load pretrained detection and recognition models
    detector = CRAFTDetector.from_pretrained("craft-v2")
    recognizer = CRNNRecognizer.from_pretrained("crnn-multilang")

    # Detect text regions, then recognize the text in each crop
    regions = detector.detect(image_path)
    texts = [recognizer.recognize(r) for r in regions]

    # Rule-based post-processing maps recognized text to document fields
    return structure_fields(texts, regions)
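The structuring stage called at the end of the pipeline is rule-based. A minimal sketch of what such post-processing can look like is below; the field names and regex patterns are invented for illustration (the real rules would be more extensive and would also use the region geometry passed in):

```python
import re

# Hypothetical field patterns for illustration only
FIELD_PATTERNS = {
    "tracking_number": re.compile(r"\b[A-Z]{2}\d{9}\b"),
    "ship_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def structure_fields(texts, regions):
    """Map recognized text snippets to named document fields (first match wins).

    `regions` is unused in this sketch; a real implementation would use the
    region coordinates to disambiguate fields by position on the page.
    """
    fields = {}
    for text in texts:
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.search(text)
            if match and name not in fields:
                fields[name] = match.group(0)
    return fields
```

First-match-wins keeps the sketch deterministic when the same pattern fires in several regions.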

ONNX Export for Production

For production deployment, we export both models to ONNX. Running them under ONNX Runtime gives us roughly a 3x inference speedup compared to native PyTorch.

export.sh

python export_to_onnx.py --model craft-v2 --output models/craft.onnx
python export_to_onnx.py --model crnn-multilang --output models/crnn.onnx
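On the serving side, the exported recognizer can be loaded with ONNX Runtime and its per-timestep outputs collapsed with greedy CTC decoding, which is the standard decode for CRNN-style models. This is a sketch under assumptions: the charset, the model's input/output shapes, and the function names are illustrative, not the actual TabLogs code.

```python
def ctc_greedy_decode(indices, charset, blank=0):
    """Collapse repeated labels and drop blanks (greedy CTC decoding)."""
    chars, prev = [], None
    for i in indices:
        if i != blank and i != prev:
            chars.append(charset[i - 1])  # label 1 maps to charset[0]; 0 is the blank
        prev = i
    return "".join(chars)

def recognize_crop(crop, model_path="models/crnn.onnx"):
    """Run one text-region crop through the exported CRNN (assumed shapes)."""
    # Heavy dependencies imported lazily; only needed at inference time
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    # Assumes the model takes a single crop and returns (1, timesteps, num_classes)
    logits = sess.run(None, {input_name: crop[None].astype(np.float32)})[0]
    best = logits[0].argmax(axis=-1).tolist()
    return ctc_greedy_decode(best, charset="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
```

In production you would create the `InferenceSession` once at startup rather than per call; it is shown inline here only to keep the sketch self-contained.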

Results

The pipeline processes ~2,000 documents per hour on a single GPU instance, with 97.3% field extraction accuracy on our benchmark dataset. The ONNX deployment reduced inference costs by 60%.

Key Learnings

  • Start with a strong detection model — recognition accuracy depends heavily on clean text region crops
  • ONNX export is worth the effort for any production ML system
  • Build evaluation metrics early — you can't improve what you don't measure
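On the last point, a field-level accuracy metric like the 97.3% figure above is cheap to build early. A minimal sketch (not the actual TabLogs evaluation code) that scores exact-match agreement between predicted and gold fields:

```python
def field_accuracy(predictions, gold):
    """Fraction of gold fields whose predicted value matches exactly.

    `predictions` and `gold` are parallel lists of {field: value} dicts,
    one dict per document.
    """
    total = correct = 0
    for pred_doc, gold_doc in zip(predictions, gold):
        for field, value in gold_doc.items():
            total += 1
            if pred_doc.get(field) == value:
                correct += 1
    return correct / total if total else 0.0
```

Scoring against the gold fields (rather than the predicted ones) means missing fields count as errors instead of being silently skipped.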