Files

Yaojia Wang ad5ed46b4c WIP

2026-02-11 23:40:38 +01:00

9.3 KiB

Raw Blame History

Semi-Automatic Labeling Strategy Analysis

1. Current Pipeline Overview

CSV (field values)
  |
  v
Autolabel CLI
  |- PDF render (300 DPI)
  |- Text extraction (PDF text layer or PaddleOCR)
  |- FieldMatcher.find_matches() [5 strategies]
  |   |- ExactMatcher (priority 1)
  |   |- ConcatenatedMatcher (multi-token)
  |   |- FuzzyMatcher (Amount, dates only)
  |   |- SubstringMatcher (prevents false positives)
  |   |- FlexibleDateMatcher (fallback)
  |
  |- AnnotationGenerator
  |   |- PDF points -> pixels
  |   |- expand_bbox() [field-specific strategy]
  |   |- pixels -> YOLO normalized (0-1)
  |   |- Save to database
  |
  v
DBYOLODataset (training)
  |- Load images + bboxes from DB
  |- Re-apply expand_bbox()
  |- YOLO training
  |
  v
Inference
  |- YOLO detect -> pixel bboxes
  |- Crop region -> OCR extract text
  |- Normalize & validate

2. Current Expansion Strategy Analysis

2.1 Field-Specific Parameters

Field	Scale X	Scale Y	Extra Top	Extra Left	Extra Right	Max Pad X	Max Pad Y
ocr_number	1.15	1.80	0.60	-	-	50	140
bankgiro	1.45	1.35	-	0.80	-	160	90
plusgiro	1.45	1.35	-	0.80	-	160	90
invoice_date	1.25	1.55	0.40	-	-	80	110
invoice_due_date	1.30	1.65	0.45	0.35	-	100	120
amount	1.20	1.35	-	-	0.30	70	80
invoice_number	1.20	1.50	0.40	-	-	80	100
supplier_org_number	1.25	1.40	0.30	0.20	-	90	90
customer_number	1.25	1.45	0.35	0.25	-	90	100
payment_line	1.10	1.20	-	-	-	40	40

2.2 Design Rationale

The expansion is designed based on Swedish invoice layout patterns:

Dates: Labels ("Fakturadatum") typically sit above the value -> extra top
Giro accounts: Prefix ("BG:", "PG:") sits to the left -> extra left
Amount: Currency suffix ("SEK", "kr") to the right -> extra right
Payment line: Machine-readable, self-contained -> minimal expansion

2.3 Strengths

Field-specific directional expansion - matches Swedish invoice conventions
Max padding caps - prevents runaway expansion into neighboring fields
Center-point scaling with directional compensation - geometrically sound
Image boundary clamping - prevents out-of-bounds coordinates

2.4 Potential Issues

Issue	Risk Level	Description
Over-expansion	HIGH	OCR number 1.80x Y-scale could capture adjacent fields
Inconsistent training vs inference bbox	MEDIUM	Model trained on expanded boxes, inference returns raw detection
No expansion at inference OCR crop	MEDIUM	Detected bbox may clip text edges without post-expansion
Max padding in pixels vs DPI-dependent	LOW	140px at 300DPI != 140px at 150DPI

3. Industry Best Practices (Research Findings)

3.1 Labeling: Tight vs. Loose Bounding Boxes

Consensus: Annotate tight bounding boxes around the value text only.

FUNSD/CORD benchmarks annotate keys and values as separate entities
Loose boxes "introduce background noise and can mislead the model" (V7 Labs, LabelVisor)
IoU discrepancies from loose boxes degrade mAP during training

However, for YOLO + OCR pipelines, tight-only creates a problem:

YOLO predicts slightly imprecise boxes (typical IoU 0.7-0.9)
If the predicted box clips even slightly, OCR misses characters
Solution: Label tight, expand at inference OR label with controlled padding

3.2 The Two Dominant Strategies

Strategy A: Tight Label + Inference-Time Expansion (Recommended by research)

Label:     [  2024-01-15  ]     (tight around value)
Inference: [ [2024-01-15] ] + pad -> OCR

Clean, consistent annotations
Requires post-detection padding before OCR crop
Used by: Microsoft Document Intelligence, Nanonets

Strategy B: Expanded Label at Training Time (Current project approach)

Label:     [  Fakturadatum: 2024-01-15  ]  (includes context)
Inference: YOLO detects full region -> OCR extracts from region

Model learns spatial context (label + value)
Larger, more variable boxes
OCR must filter out label text from extracted content

3.3 OCR Padding Requirements

Tesseract: Requires ~10px white border for reliable segmentation (PSM 7-10). PaddleOCR: det_db_unclip_ratio parameter (default 1.5) controls detection expansion.

Key insight: Even after YOLO detection, OCR engines need some padding around text to work reliably.

3.4 State-of-the-Art Comparison

System	Bbox Strategy	Field Definition
LayoutLM	Word-level bboxes from OCR	Token classification (BIO tagging)
Donut	No bboxes (end-to-end)	Internal attention mechanism
Microsoft DocAI	Field-level, tight	Post-expansion for OCR
YOLO + OCR (this project)	Field-level, expanded	Field-specific directional expansion

4. Recommendations

4.1 Short-Term (Current Architecture)

A. Add Inference-Time OCR Padding

Currently, the detected bbox is sent directly to OCR. Add a small uniform padding (5-10%) before cropping for OCR:

# In field_extractor.py, before OCR crop:
pad_ratio = 0.05  # 5% expansion
w_pad = (x2 - x1) * pad_ratio
h_pad = (y2 - y1) * pad_ratio
crop_x1 = max(0, x1 - w_pad)
crop_y1 = max(0, y1 - h_pad)
crop_x2 = min(img_w, x2 + w_pad)
crop_y2 = min(img_h, y2 + h_pad)

B. Reduce Training-Time Expansion Ratios

Current ratios (especially OCR number 1.80x Y, Bankgiro 1.45x X) are aggressive. Proposed reduction:

Field	Current Scale Y	Proposed Scale Y	Rationale
ocr_number	1.80	1.40	1.80 is too aggressive, captures neighbors
bankgiro	1.35	1.25	Reduce vertical over-expansion
invoice_due_date	1.65	1.45	Tighten vertical

Principle: shift expansion work from training-time to inference-time.

C. Add Label Visualization Quality Check

Before training, sample 50-100 annotated images and visually inspect:

Are expanded bboxes capturing only the target field?
Are any bboxes overlapping with adjacent fields?
Are any values being clipped?

4.2 Medium-Term (Architecture Improvements)

D. Two-Stage Detection Strategy

Stage 1: YOLO detects field regions (current)
Stage 2: Within each detection, use PaddleOCR text detection
         to find the precise text boundary
Stage 3: Extract text from refined boundary

Benefits:

YOLO handles field classification (what)
PaddleOCR handles text localization (where exactly)
Eliminates the "tight vs loose" dilemma entirely

E. Label Both Key and Value Separately

Add new annotation classes: invoice_date_label, invoice_date_value

Model learns to find both the label and value
Use spatial relationship (label -> value) for more robust extraction
Aligns with FUNSD benchmark approach

F. Confidence-Weighted Expansion

Scale expansion by detection confidence:

# Higher confidence = tighter crop (model is sure)
# Lower confidence = wider crop (give OCR more context)
expansion = base_expansion * (1.5 - confidence)

4.3 Long-Term (Next Generation)

G. Move to LayoutLM-Style Token Classification

Replace YOLO field detection with token-level classification
Each OCR word gets classified as B-field/I-field/O
Eliminates bbox expansion entirely
Better for fields with complex layouts

H. End-to-End with Donut/Pix2Struct

No separate OCR step
Model directly outputs structured fields from image
Zero bbox concerns
Requires more training data and compute

5. Recommended Action Plan

Phase 1: Validate Current Labels (1-2 days)

Build label visualization script
Sample 100 documents across all field types
Identify over-expansion and clipping cases
Document per-field accuracy of current expansion

Phase 2: Tune Expansion Parameters (2-3 days)

Reduce aggressive expansion ratios (OCR number, bankgiro)
Add inference-time OCR padding (5-10%)
Re-train model with adjusted labels
Compare mAP and field extraction accuracy

Phase 3: Two-Stage Refinement (1 week)

Implement PaddleOCR text detection within YOLO detection
Use text detection bbox for precise OCR crop
Keep YOLO expansion for classification only

Phase 4: Evaluation (ongoing)

Track per-field extraction accuracy on test set
A/B test tight vs expanded labels
Build regression test suite for labeling quality

6. Summary

Aspect	Current Approach	Best Practice	Gap
Labeling	Value + expansion at label time	Tight value + inference expansion	Medium
Expansion	Field-specific directional	Field-specific directional	Aligned
Inference OCR crop	Raw detection bbox	Detection + padding	Needs padding
Expansion ratios	Aggressive (up to 1.80x)	Moderate (1.10-1.30x)	Over-expanded
Visualization QC	None	Regular sampling	Missing
Coordinate consistency	PDF points -> pixels	Consistent DPI	Check needed

Bottom line: The architecture (field-specific directional expansion) is sound and aligns with best practices. The main improvements are:

Reduce expansion aggressiveness during training labels
Add OCR padding at inference time
Add label quality visualization for validation
Longer term: consider two-stage detection or token classification

9.3 KiB Raw Blame History