9.3 KiB
Semi-Automatic Labeling Strategy Analysis
1. Current Pipeline Overview
CSV (field values)
|
v
Autolabel CLI
|- PDF render (300 DPI)
|- Text extraction (PDF text layer or PaddleOCR)
|- FieldMatcher.find_matches() [5 strategies]
| |- ExactMatcher (priority 1)
| |- ConcatenatedMatcher (multi-token)
| |- FuzzyMatcher (Amount, dates only)
| |- SubstringMatcher (prevents false positives)
| |- FlexibleDateMatcher (fallback)
|
|- AnnotationGenerator
| |- PDF points -> pixels
| |- expand_bbox() [field-specific strategy]
| |- pixels -> YOLO normalized (0-1)
| |- Save to database
|
v
DBYOLODataset (training)
|- Load images + bboxes from DB
|- Re-apply expand_bbox()
|- YOLO training
|
v
Inference
|- YOLO detect -> pixel bboxes
|- Crop region -> OCR extract text
|- Normalize & validate
2. Current Expansion Strategy Analysis
2.1 Field-Specific Parameters
| Field | Scale X | Scale Y | Extra Top | Extra Left | Extra Right | Max Pad X | Max Pad Y |
|---|---|---|---|---|---|---|---|
| ocr_number | 1.15 | 1.80 | 0.60 | - | - | 50 | 140 |
| bankgiro | 1.45 | 1.35 | - | 0.80 | - | 160 | 90 |
| plusgiro | 1.45 | 1.35 | - | 0.80 | - | 160 | 90 |
| invoice_date | 1.25 | 1.55 | 0.40 | - | - | 80 | 110 |
| invoice_due_date | 1.30 | 1.65 | 0.45 | 0.35 | - | 100 | 120 |
| amount | 1.20 | 1.35 | - | - | 0.30 | 70 | 80 |
| invoice_number | 1.20 | 1.50 | 0.40 | - | - | 80 | 100 |
| supplier_org_number | 1.25 | 1.40 | 0.30 | 0.20 | - | 90 | 90 |
| customer_number | 1.25 | 1.45 | 0.35 | 0.25 | - | 90 | 100 |
| payment_line | 1.10 | 1.20 | - | - | - | 40 | 40 |
2.2 Design Rationale
The expansion is designed based on Swedish invoice layout patterns:
- Dates: Labels ("Fakturadatum") typically sit above the value -> extra top
- Giro accounts: Prefix ("BG:", "PG:") sits to the left -> extra left
- Amount: Currency suffix ("SEK", "kr") to the right -> extra right
- Payment line: Machine-readable, self-contained -> minimal expansion
2.3 Strengths
- Field-specific directional expansion - matches Swedish invoice conventions
- Max padding caps - prevents runaway expansion into neighboring fields
- Center-point scaling with directional compensation - geometrically sound
- Image boundary clamping - prevents out-of-bounds coordinates
2.4 Potential Issues
| Issue | Risk Level | Description |
|---|---|---|
| Over-expansion | HIGH | OCR number 1.80x Y-scale could capture adjacent fields |
| Inconsistent training vs inference bbox | MEDIUM | Model trained on expanded boxes, inference returns raw detection |
| No expansion at inference OCR crop | MEDIUM | Detected bbox may clip text edges without post-expansion |
| Max padding in pixels vs DPI-dependent | LOW | 140px at 300DPI != 140px at 150DPI |
3. Industry Best Practices (Research Findings)
3.1 Labeling: Tight vs. Loose Bounding Boxes
Consensus: Annotate tight bounding boxes around the value text only.
- FUNSD/CORD benchmarks annotate keys and values as separate entities
- Loose boxes "introduce background noise and can mislead the model" (V7 Labs, LabelVisor)
- IoU discrepancies from loose boxes degrade mAP during training
However, for YOLO + OCR pipelines, tight-only creates a problem:
- YOLO predicts slightly imprecise boxes (typical IoU 0.7-0.9)
- If the predicted box clips even slightly, OCR misses characters
- Solution: Label tight, expand at inference OR label with controlled padding
3.2 The Two Dominant Strategies
Strategy A: Tight Label + Inference-Time Expansion (Recommended by research)
Label: [ 2024-01-15 ] (tight around value)
Inference: [ [2024-01-15] ] + pad -> OCR
- Clean, consistent annotations
- Requires post-detection padding before OCR crop
- Used by: Microsoft Document Intelligence, Nanonets
Strategy B: Expanded Label at Training Time (Current project approach)
Label: [ Fakturadatum: 2024-01-15 ] (includes context)
Inference: YOLO detects full region -> OCR extracts from region
- Model learns spatial context (label + value)
- Larger, more variable boxes
- OCR must filter out label text from extracted content
3.3 OCR Padding Requirements
Tesseract: Requires ~10px white border for reliable segmentation (PSM 7-10).
PaddleOCR: det_db_unclip_ratio parameter (default 1.5) controls detection expansion.
Key insight: Even after YOLO detection, OCR engines need some padding around text to work reliably.
3.4 State-of-the-Art Comparison
| System | Bbox Strategy | Field Definition |
|---|---|---|
| LayoutLM | Word-level bboxes from OCR | Token classification (BIO tagging) |
| Donut | No bboxes (end-to-end) | Internal attention mechanism |
| Microsoft DocAI | Field-level, tight | Post-expansion for OCR |
| YOLO + OCR (this project) | Field-level, expanded | Field-specific directional expansion |
4. Recommendations
4.1 Short-Term (Current Architecture)
A. Add Inference-Time OCR Padding
Currently, the detected bbox is sent directly to OCR. Add a small uniform padding (5-10%) before cropping for OCR:
# In field_extractor.py, before OCR crop:
pad_ratio = 0.05 # 5% expansion
w_pad = (x2 - x1) * pad_ratio
h_pad = (y2 - y1) * pad_ratio
crop_x1 = max(0, x1 - w_pad)
crop_y1 = max(0, y1 - h_pad)
crop_x2 = min(img_w, x2 + w_pad)
crop_y2 = min(img_h, y2 + h_pad)
B. Reduce Training-Time Expansion Ratios
Current ratios (especially OCR number 1.80x Y, Bankgiro 1.45x X) are aggressive. Proposed reduction:
| Field | Current Scale Y | Proposed Scale Y | Rationale |
|---|---|---|---|
| ocr_number | 1.80 | 1.40 | 1.80 is too aggressive, captures neighbors |
| bankgiro | 1.35 | 1.25 | Reduce vertical over-expansion |
| invoice_due_date | 1.65 | 1.45 | Tighten vertical |
Principle: shift expansion work from training-time to inference-time.
C. Add Label Visualization Quality Check
Before training, sample 50-100 annotated images and visually inspect:
- Are expanded bboxes capturing only the target field?
- Are any bboxes overlapping with adjacent fields?
- Are any values being clipped?
4.2 Medium-Term (Architecture Improvements)
D. Two-Stage Detection Strategy
Stage 1: YOLO detects field regions (current)
Stage 2: Within each detection, use PaddleOCR text detection
to find the precise text boundary
Stage 3: Extract text from refined boundary
Benefits:
- YOLO handles field classification (what)
- PaddleOCR handles text localization (where exactly)
- Eliminates the "tight vs loose" dilemma entirely
E. Label Both Key and Value Separately
Add new annotation classes: invoice_date_label, invoice_date_value
- Model learns to find both the label and value
- Use spatial relationship (label -> value) for more robust extraction
- Aligns with FUNSD benchmark approach
F. Confidence-Weighted Expansion
Scale expansion by detection confidence:
# Higher confidence = tighter crop (model is sure)
# Lower confidence = wider crop (give OCR more context)
expansion = base_expansion * (1.5 - confidence)
4.3 Long-Term (Next Generation)
G. Move to LayoutLM-Style Token Classification
- Replace YOLO field detection with token-level classification
- Each OCR word gets classified as B-field/I-field/O
- Eliminates bbox expansion entirely
- Better for fields with complex layouts
H. End-to-End with Donut/Pix2Struct
- No separate OCR step
- Model directly outputs structured fields from image
- Zero bbox concerns
- Requires more training data and compute
5. Recommended Action Plan
Phase 1: Validate Current Labels (1-2 days)
- Build label visualization script
- Sample 100 documents across all field types
- Identify over-expansion and clipping cases
- Document per-field accuracy of current expansion
Phase 2: Tune Expansion Parameters (2-3 days)
- Reduce aggressive expansion ratios (OCR number, bankgiro)
- Add inference-time OCR padding (5-10%)
- Re-train model with adjusted labels
- Compare mAP and field extraction accuracy
Phase 3: Two-Stage Refinement (1 week)
- Implement PaddleOCR text detection within YOLO detection
- Use text detection bbox for precise OCR crop
- Keep YOLO expansion for classification only
Phase 4: Evaluation (ongoing)
- Track per-field extraction accuracy on test set
- A/B test tight vs expanded labels
- Build regression test suite for labeling quality
6. Summary
| Aspect | Current Approach | Best Practice | Gap |
|---|---|---|---|
| Labeling | Value + expansion at label time | Tight value + inference expansion | Medium |
| Expansion | Field-specific directional | Field-specific directional | Aligned |
| Inference OCR crop | Raw detection bbox | Detection + padding | Needs padding |
| Expansion ratios | Aggressive (up to 1.80x) | Moderate (1.10-1.30x) | Over-expanded |
| Visualization QC | None | Regular sampling | Missing |
| Coordinate consistency | PDF points -> pixels | Consistent DPI | Check needed |
Bottom line: The architecture (field-specific directional expansion) is sound and aligns with best practices. The main improvements are:
- Reduce expansion aggressiveness during training labels
- Add OCR padding at inference time
- Add label quality visualization for validation
- Longer term: consider two-stage detection or token classification