Files
invoice-master-poc-v2/LABELING_STRATEGY_ANALYSIS.md
Yaojia Wang ad5ed46b4c WIP
2026-02-11 23:40:38 +01:00

9.3 KiB

Semi-Automatic Labeling Strategy Analysis

1. Current Pipeline Overview

CSV (field values)
  |
  v
Autolabel CLI
  |- PDF render (300 DPI)
  |- Text extraction (PDF text layer or PaddleOCR)
  |- FieldMatcher.find_matches() [5 strategies]
  |   |- ExactMatcher (priority 1)
  |   |- ConcatenatedMatcher (multi-token)
  |   |- FuzzyMatcher (Amount, dates only)
  |   |- SubstringMatcher (prevents false positives)
  |   |- FlexibleDateMatcher (fallback)
  |
  |- AnnotationGenerator
  |   |- PDF points -> pixels
  |   |- expand_bbox() [field-specific strategy]
  |   |- pixels -> YOLO normalized (0-1)
  |   |- Save to database
  |
  v
DBYOLODataset (training)
  |- Load images + bboxes from DB
  |- Re-apply expand_bbox()
  |- YOLO training
  |
  v
Inference
  |- YOLO detect -> pixel bboxes
  |- Crop region -> OCR extract text
  |- Normalize & validate

2. Current Expansion Strategy Analysis

2.1 Field-Specific Parameters

Field Scale X Scale Y Extra Top Extra Left Extra Right Max Pad X Max Pad Y
ocr_number 1.15 1.80 0.60 - - 50 140
bankgiro 1.45 1.35 - 0.80 - 160 90
plusgiro 1.45 1.35 - 0.80 - 160 90
invoice_date 1.25 1.55 0.40 - - 80 110
invoice_due_date 1.30 1.65 0.45 0.35 - 100 120
amount 1.20 1.35 - - 0.30 70 80
invoice_number 1.20 1.50 0.40 - - 80 100
supplier_org_number 1.25 1.40 0.30 0.20 - 90 90
customer_number 1.25 1.45 0.35 0.25 - 90 100
payment_line 1.10 1.20 - - - 40 40

2.2 Design Rationale

The expansion is designed based on Swedish invoice layout patterns:

  • Dates: Labels ("Fakturadatum") typically sit above the value -> extra top
  • Giro accounts: Prefix ("BG:", "PG:") sits to the left -> extra left
  • Amount: Currency suffix ("SEK", "kr") to the right -> extra right
  • Payment line: Machine-readable, self-contained -> minimal expansion

2.3 Strengths

  1. Field-specific directional expansion - matches Swedish invoice conventions
  2. Max padding caps - prevents runaway expansion into neighboring fields
  3. Center-point scaling with directional compensation - geometrically sound
  4. Image boundary clamping - prevents out-of-bounds coordinates

2.4 Potential Issues

Issue Risk Level Description
Over-expansion HIGH OCR number 1.80x Y-scale could capture adjacent fields
Inconsistent training vs inference bbox MEDIUM Model trained on expanded boxes, inference returns raw detection
No expansion at inference OCR crop MEDIUM Detected bbox may clip text edges without post-expansion
Max padding in pixels vs DPI-dependent LOW 140px at 300DPI != 140px at 150DPI

3. Industry Best Practices (Research Findings)

3.1 Labeling: Tight vs. Loose Bounding Boxes

Consensus: Annotate tight bounding boxes around the value text only.

  • FUNSD/CORD benchmarks annotate keys and values as separate entities
  • Loose boxes "introduce background noise and can mislead the model" (V7 Labs, LabelVisor)
  • IoU discrepancies from loose boxes degrade mAP during training

However, for YOLO + OCR pipelines, tight-only creates a problem:

  • YOLO predicts slightly imprecise boxes (typical IoU 0.7-0.9)
  • If the predicted box clips even slightly, OCR misses characters
  • Solution: Label tight, expand at inference OR label with controlled padding

3.2 The Two Dominant Strategies

Strategy A: Tight Label + Inference-Time Expansion (Recommended by research)

Label:     [  2024-01-15  ]     (tight around value)
Inference: [ [2024-01-15] ] + pad -> OCR
  • Clean, consistent annotations
  • Requires post-detection padding before OCR crop
  • Used by: Microsoft Document Intelligence, Nanonets

Strategy B: Expanded Label at Training Time (Current project approach)

Label:     [  Fakturadatum: 2024-01-15  ]  (includes context)
Inference: YOLO detects full region -> OCR extracts from region
  • Model learns spatial context (label + value)
  • Larger, more variable boxes
  • OCR must filter out label text from extracted content

3.3 OCR Padding Requirements

Tesseract: Requires ~10px white border for reliable segmentation (PSM 7-10). PaddleOCR: det_db_unclip_ratio parameter (default 1.5) controls detection expansion.

Key insight: Even after YOLO detection, OCR engines need some padding around text to work reliably.

3.4 State-of-the-Art Comparison

System Bbox Strategy Field Definition
LayoutLM Word-level bboxes from OCR Token classification (BIO tagging)
Donut No bboxes (end-to-end) Internal attention mechanism
Microsoft DocAI Field-level, tight Post-expansion for OCR
YOLO + OCR (this project) Field-level, expanded Field-specific directional expansion

4. Recommendations

4.1 Short-Term (Current Architecture)

A. Add Inference-Time OCR Padding

Currently, the detected bbox is sent directly to OCR. Add a small uniform padding (5-10%) before cropping for OCR:

# In field_extractor.py, before OCR crop:
pad_ratio = 0.05  # 5% expansion
w_pad = (x2 - x1) * pad_ratio
h_pad = (y2 - y1) * pad_ratio
crop_x1 = max(0, x1 - w_pad)
crop_y1 = max(0, y1 - h_pad)
crop_x2 = min(img_w, x2 + w_pad)
crop_y2 = min(img_h, y2 + h_pad)

B. Reduce Training-Time Expansion Ratios

Current ratios (especially OCR number 1.80x Y, Bankgiro 1.45x X) are aggressive. Proposed reduction:

Field Current Scale Y Proposed Scale Y Rationale
ocr_number 1.80 1.40 1.80 is too aggressive, captures neighbors
bankgiro 1.35 1.25 Reduce vertical over-expansion
invoice_due_date 1.65 1.45 Tighten vertical

Principle: shift expansion work from training-time to inference-time.

C. Add Label Visualization Quality Check

Before training, sample 50-100 annotated images and visually inspect:

  • Are expanded bboxes capturing only the target field?
  • Are any bboxes overlapping with adjacent fields?
  • Are any values being clipped?

4.2 Medium-Term (Architecture Improvements)

D. Two-Stage Detection Strategy

Stage 1: YOLO detects field regions (current)
Stage 2: Within each detection, use PaddleOCR text detection
         to find the precise text boundary
Stage 3: Extract text from refined boundary

Benefits:

  • YOLO handles field classification (what)
  • PaddleOCR handles text localization (where exactly)
  • Eliminates the "tight vs loose" dilemma entirely

E. Label Both Key and Value Separately

Add new annotation classes: invoice_date_label, invoice_date_value

  • Model learns to find both the label and value
  • Use spatial relationship (label -> value) for more robust extraction
  • Aligns with FUNSD benchmark approach

F. Confidence-Weighted Expansion

Scale expansion by detection confidence:

# Higher confidence = tighter crop (model is sure)
# Lower confidence = wider crop (give OCR more context)
expansion = base_expansion * (1.5 - confidence)

4.3 Long-Term (Next Generation)

G. Move to LayoutLM-Style Token Classification

  • Replace YOLO field detection with token-level classification
  • Each OCR word gets classified as B-field/I-field/O
  • Eliminates bbox expansion entirely
  • Better for fields with complex layouts

H. End-to-End with Donut/Pix2Struct

  • No separate OCR step
  • Model directly outputs structured fields from image
  • Zero bbox concerns
  • Requires more training data and compute

Phase 1: Validate Current Labels (1-2 days)

  • Build label visualization script
  • Sample 100 documents across all field types
  • Identify over-expansion and clipping cases
  • Document per-field accuracy of current expansion

Phase 2: Tune Expansion Parameters (2-3 days)

  • Reduce aggressive expansion ratios (OCR number, bankgiro)
  • Add inference-time OCR padding (5-10%)
  • Re-train model with adjusted labels
  • Compare mAP and field extraction accuracy

Phase 3: Two-Stage Refinement (1 week)

  • Implement PaddleOCR text detection within YOLO detection
  • Use text detection bbox for precise OCR crop
  • Keep YOLO expansion for classification only

Phase 4: Evaluation (ongoing)

  • Track per-field extraction accuracy on test set
  • A/B test tight vs expanded labels
  • Build regression test suite for labeling quality

6. Summary

Aspect Current Approach Best Practice Gap
Labeling Value + expansion at label time Tight value + inference expansion Medium
Expansion Field-specific directional Field-specific directional Aligned
Inference OCR crop Raw detection bbox Detection + padding Needs padding
Expansion ratios Aggressive (up to 1.80x) Moderate (1.10-1.30x) Over-expanded
Visualization QC None Regular sampling Missing
Coordinate consistency PDF points -> pixels Consistent DPI Check needed

Bottom line: The architecture (field-specific directional expansion) is sound and aligns with best practices. The main improvements are:

  1. Reduce expansion aggressiveness during training labels
  2. Add OCR padding at inference time
  3. Add label quality visualization for validation
  4. Longer term: consider two-stage detection or token classification