Overview
At TabLogs, we needed a robust pipeline for extracting structured data from scanned logistics documents. The challenge: handling varied layouts, noisy scans, and multilingual text across thousands of daily shipments.
Architecture
The pipeline consists of three stages:
- Detection — CRAFT-based text detection locates text regions
- Recognition — A CRNN model reads the detected regions
- Structuring — Rule-based post-processing maps text to fields
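The structuring stage is the simplest to illustrate. Here is a minimal sketch of what a rule-based `structure_fields` might look like — the field names and patterns below are hypothetical examples, not TabLogs' production rules, and the sketch ignores region geometry, which real rules often use for positional disambiguation:

```python
import re

# Hypothetical field patterns; real logistics documents need far more rules.
FIELD_PATTERNS = {
    "tracking_number": re.compile(r"\b\d{10,14}\b"),
    "date": re.compile(r"\b\d{2}[./-]\d{2}[./-]\d{4}\b"),
    "weight_kg": re.compile(r"\b\d+(?:\.\d+)?\s*kg\b", re.IGNORECASE),
}

def structure_fields(texts: list[str], regions: list) -> dict:
    """Map recognized text snippets to named fields via first-match rules."""
    fields: dict = {}
    for text in texts:
        for name, pattern in FIELD_PATTERNS.items():
            if name in fields:
                continue  # keep the first match per field
            match = pattern.search(text)
            if match:
                fields[name] = match.group(0)
    return fields
```

For example, `structure_fields(["Weight: 12.5 kg", "Shipped 01/02/2024"], [])` picks out the weight and date fields and leaves `tracking_number` unset.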
```python
from torchwisdom.models import CRAFTDetector, CRNNRecognizer

def extract_document(image_path: str) -> dict:
    # Stages 1 and 2: detect text regions, then read each one
    detector = CRAFTDetector.from_pretrained("craft-v2")
    recognizer = CRNNRecognizer.from_pretrained("crnn-multilang")
    regions = detector.detect(image_path)
    texts = [recognizer.recognize(r) for r in regions]
    # Stage 3: rule-based mapping of recognized text to document fields
    return structure_fields(texts, regions)
```

ONNX Export for Production
For production deployment, we export models to ONNX format. This gives us ~3x inference speedup with ONNX Runtime compared to native PyTorch.
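The export script itself is thin: the core of such a script is a single `torch.onnx.export` call. A minimal sketch, assuming a CRAFT-style detector that takes one image tensor — the input shape, tensor names, and dynamic axes below are illustrative assumptions, not our exact export code:

```python
def export_to_onnx(model, output_path: str, input_shape=(1, 3, 768, 768)):
    """Trace the model with a dummy input and serialize it to ONNX."""
    import torch  # local import: torch is only needed when the export runs

    model.eval()
    dummy = torch.randn(*input_shape)  # illustrative input size for a detector
    torch.onnx.export(
        model,
        dummy,
        output_path,
        input_names=["image"],
        output_names=["score_map"],
        # Allow variable batch size and image dimensions at inference time
        dynamic_axes={"image": {0: "batch", 2: "height", 3: "width"}},
        opset_version=17,
    )
```

Dynamic axes let the exported graph accept variable image sizes, which matters for scans of mixed page formats.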
```bash
python export_to_onnx.py --model craft-v2 --output models/craft.onnx
python export_to_onnx.py --model crnn-multilang --output models/crnn.onnx
```

Results
The pipeline processes ~2,000 documents per hour on a single GPU instance, with 97.3% field extraction accuracy on our benchmark dataset. The ONNX deployment reduced inference costs by 60%.
Key Learnings
- Start with a strong detection model — recognition accuracy depends heavily on clean text region crops
- ONNX export is worth the effort for latency-sensitive production deployments; for us it paid for itself in inference cost alone
- Build evaluation metrics early — you can't improve what you don't measure
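On the last point: even a simple per-field exact-match accuracy, computed over a small labeled set, is enough to catch regressions early. A minimal sketch, assuming predictions and gold labels are parallel lists of field dicts (the format is an assumption, not our benchmark harness):

```python
def field_accuracy(predictions: list[dict], gold: list[dict]) -> dict[str, float]:
    """Per-field exact-match accuracy over a labeled document set."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for pred, truth in zip(predictions, gold):
        for field, expected in truth.items():
            total[field] = total.get(field, 0) + 1
            if pred.get(field) == expected:
                correct[field] = correct.get(field, 0) + 1
    # Accuracy is computed over fields present in the gold labels
    return {f: correct.get(f, 0) / total[f] for f in total}
```

Breaking accuracy out per field, rather than reporting one aggregate number, is what tells you whether a model change helped dates at the expense of tracking numbers.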