# Semi-Automatic Labeling Strategy Analysis

## 1. Current Pipeline Overview

```
CSV (field values)
    |
    v
Autolabel CLI
    |- PDF render (300 DPI)
    |- Text extraction (PDF text layer or PaddleOCR)
    |- FieldMatcher.find_matches()  [5 strategies]
    |   |- ExactMatcher (priority 1)
    |   |- ConcatenatedMatcher (multi-token)
    |   |- FuzzyMatcher (Amount, dates only)
    |   |- SubstringMatcher (prevents false positives)
    |   |- FlexibleDateMatcher (fallback)
    |
    |- AnnotationGenerator
    |   |- PDF points -> pixels
    |   |- expand_bbox()  [field-specific strategy]
    |   |- pixels -> YOLO normalized (0-1)
    |   |- Save to database
    |
    v
DBYOLODataset (training)
    |- Load images + bboxes from DB
    |- Re-apply expand_bbox()
    |- YOLO training
    |
    v
Inference
    |- YOLO detect -> pixel bboxes
    |- Crop region -> OCR extract text
    |- Normalize & validate
```

---

## 2. Current Expansion Strategy Analysis

### 2.1 Field-Specific Parameters

| Field | Scale X | Scale Y | Extra Top | Extra Left | Extra Right | Max Pad X | Max Pad Y |
|---|---|---|---|---|---|---|---|
| ocr_number | 1.15 | 1.80 | 0.60 | - | - | 50 | 140 |
| bankgiro | 1.45 | 1.35 | - | 0.80 | - | 160 | 90 |
| plusgiro | 1.45 | 1.35 | - | 0.80 | - | 160 | 90 |
| invoice_date | 1.25 | 1.55 | 0.40 | - | - | 80 | 110 |
| invoice_due_date | 1.30 | 1.65 | 0.45 | 0.35 | - | 100 | 120 |
| amount | 1.20 | 1.35 | - | - | 0.30 | 70 | 80 |
| invoice_number | 1.20 | 1.50 | 0.40 | - | - | 80 | 100 |
| supplier_org_number | 1.25 | 1.40 | 0.30 | 0.20 | - | 90 | 90 |
| customer_number | 1.25 | 1.45 | 0.35 | 0.25 | - | 90 | 100 |
| payment_line | 1.10 | 1.20 | - | - | - | 40 | 40 |

### 2.2 Design Rationale

The expansion parameters are designed around Swedish invoice layout conventions:

- **Dates**: labels ("Fakturadatum") typically sit **above** the value -> extra top padding
- **Giro accounts**: the prefix ("BG:", "PG:") sits **to the left** -> extra left padding
- **Amount**: the currency suffix ("SEK", "kr") sits **to the right** -> extra right padding
- **Payment line**: machine-readable and self-contained -> minimal expansion
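The parameters and directional rationale above can be sketched as a small helper. This is a hypothetical reconstruction of how an `expand_bbox()`-style function might combine center-point scaling, directional extras, and max-padding caps; the function name matches the pipeline diagram, but the exact semantics and the `FIELD_PARAMS` layout are assumptions, not the project's actual implementation.

```python
# Hypothetical sketch of field-specific directional bbox expansion.
# Parameter layout and semantics are assumptions based on the table above.
FIELD_PARAMS = {
    # field: (scale_x, scale_y, extra_top, extra_left, extra_right, max_pad_x, max_pad_y)
    "ocr_number":   (1.15, 1.80, 0.60, 0.00, 0.00, 50, 140),
    "bankgiro":     (1.45, 1.35, 0.00, 0.80, 0.00, 160, 90),
    "amount":       (1.20, 1.35, 0.00, 0.00, 0.30, 70, 80),
    "payment_line": (1.10, 1.20, 0.00, 0.00, 0.00, 40, 40),
}

def expand_bbox(field, x1, y1, x2, y2, img_w, img_h):
    sx, sy, extra_top, extra_left, extra_right, max_px, max_py = FIELD_PARAMS[field]
    w, h = x2 - x1, y2 - y1
    # Symmetric center-point scaling, capped by the max-padding limits.
    pad_x = min((sx - 1) * w / 2, max_px)
    pad_y = min((sy - 1) * h / 2, max_py)
    # Directional extras push additional padding toward the label/prefix/suffix side.
    left   = min(pad_x + extra_left * w, max_px)
    right  = min(pad_x + extra_right * w, max_px)
    top    = min(pad_y + extra_top * h, max_py)
    bottom = pad_y
    # Clamp to image boundaries.
    return (max(0, x1 - left), max(0, y1 - top),
            min(img_w, x2 + right), min(img_h, y2 + bottom))
```

Under these assumptions, a `payment_line` box stays nearly tight while an `ocr_number` box grows mostly upward toward its label.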
### 2.3 Strengths

1. **Field-specific directional expansion** matches Swedish invoice conventions
2. **Max padding caps** prevent runaway expansion into neighboring fields
3. **Center-point scaling** with directional compensation is geometrically sound
4. **Image boundary clamping** prevents out-of-bounds coordinates

### 2.4 Potential Issues

| Issue | Risk Level | Description |
|---|---|---|
| Over-expansion | HIGH | The 1.80x Y-scale on ocr_number can capture adjacent fields |
| Training/inference bbox mismatch | MEDIUM | The model is trained on expanded boxes, but inference returns raw detections |
| No expansion at inference OCR crop | MEDIUM | A detected bbox may clip text edges without post-detection expansion |
| Pixel-based max padding is DPI-dependent | LOW | 140 px at 300 DPI covers a different physical area than 140 px at 150 DPI |

---

## 3. Industry Best Practices (Research Findings)

### 3.1 Labeling: Tight vs. Loose Bounding Boxes

**Consensus**: annotate **tight bounding boxes around the value text only**.

- FUNSD/CORD benchmarks annotate keys and values as **separate entities**
- Loose boxes "introduce background noise and can mislead the model" (V7 Labs, LabelVisor)
- IoU discrepancies from loose boxes degrade mAP during training

**However**, for YOLO + OCR pipelines, tight-only labeling creates a problem:

- YOLO predicts slightly imprecise boxes (typical IoU 0.7-0.9)
- If the predicted box clips even slightly, OCR misses characters
- Solution: **label tight, expand at inference** OR **label with controlled padding**

### 3.2 The Two Dominant Strategies

**Strategy A: Tight Label + Inference-Time Expansion** (recommended by research)

```
Label:     [ 2024-01-15 ]            (tight around value)
Inference: [  [2024-01-15]  ] + pad -> OCR
```

- Clean, consistent annotations
- Requires post-detection padding before the OCR crop
- Used by: Microsoft Document Intelligence, Nanonets

**Strategy B: Expanded Label at Training Time** (current project approach)

```
Label:     [ Fakturadatum: 2024-01-15 ]   (includes context)
Inference: YOLO detects full region ->
           OCR extracts from region
```

- The model learns spatial context (label + value)
- Larger, more variable boxes
- OCR must filter label text out of the extracted content

### 3.3 OCR Padding Requirements

**Tesseract**: requires roughly a 10 px white border for reliable segmentation (PSM 7-10).

**PaddleOCR**: the `det_db_unclip_ratio` parameter (default 1.5) controls detection expansion.

Key insight: even after YOLO detection, OCR engines need some padding around the text to work reliably.

### 3.4 State-of-the-Art Comparison

| System | Bbox Strategy | Field Definition |
|---|---|---|
| **LayoutLM** | Word-level bboxes from OCR | Token classification (BIO tagging) |
| **Donut** | No bboxes (end-to-end) | Internal attention mechanism |
| **Microsoft DocAI** | Field-level, tight | Post-expansion for OCR |
| **YOLO + OCR (this project)** | Field-level, expanded | Field-specific directional expansion |

---

## 4. Recommendations

### 4.1 Short-Term (Current Architecture)

#### A. Add Inference-Time OCR Padding

Currently the detected bbox is sent directly to OCR. Add a small uniform padding (5-10%) before cropping for OCR:

```python
# In field_extractor.py, before the OCR crop:
pad_ratio = 0.05  # 5% expansion on each side
w_pad = (x2 - x1) * pad_ratio
h_pad = (y2 - y1) * pad_ratio
crop_x1 = max(0, int(x1 - w_pad))
crop_y1 = max(0, int(y1 - h_pad))
crop_x2 = min(img_w, int(x2 + w_pad))
crop_y2 = min(img_h, int(y2 + h_pad))
```

#### B. Reduce Training-Time Expansion Ratios

Current ratios (especially the 1.80x Y-scale on ocr_number and the 1.45x X-scale on bankgiro) are aggressive. Proposed reductions:

| Field | Current Scale Y | Proposed Scale Y | Rationale |
|---|---|---|---|
| ocr_number | 1.80 | 1.40 | 1.80 is too aggressive and captures neighbors |
| bankgiro | 1.35 | 1.25 | Reduce vertical over-expansion |
| invoice_due_date | 1.65 | 1.45 | Tighten vertically |

Principle: **shift expansion work from training time to inference time**.
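A quick back-of-the-envelope check illustrates what the proposed reduction buys. The 40 px line height below is an assumed example value for a text line rendered at 300 DPI, and the padding logic mirrors the center-point scaling with directional top padding described in section 2; it is illustrative, not the project's code.

```python
# Illustrative effect of reducing ocr_number's Y-scale from 1.80 to 1.40.
# The 40 px line height is an assumed example; max_pad_y = 140 per the table.
def expanded_height(h, scale_y, extra_top=0.0, max_pad_y=140):
    pad = min((scale_y - 1) * h / 2, max_pad_y)  # symmetric padding per side
    top = min(pad + extra_top * h, max_pad_y)    # extra padding toward the label above
    return h + top + pad

h = 40  # assumed text-line height in pixels at 300 DPI
current  = expanded_height(h, scale_y=1.80, extra_top=0.60)  # current ocr_number params
proposed = expanded_height(h, scale_y=1.40, extra_top=0.60)  # proposed reduction
```

For this example the expanded box shrinks from roughly 96 px to roughly 80 px tall, meaningfully reducing the chance of swallowing an adjacent line while keeping generous top padding for the label.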
#### C. Add a Label Visualization Quality Check

Before training, sample 50-100 annotated images and visually inspect:

- Are expanded bboxes capturing only the target field?
- Do any bboxes overlap adjacent fields?
- Are any values being clipped?

### 4.2 Medium-Term (Architecture Improvements)

#### D. Two-Stage Detection Strategy

```
Stage 1: YOLO detects field regions (current)
Stage 2: Within each detection, use PaddleOCR text detection
         to find the precise text boundary
Stage 3: Extract text from the refined boundary
```

Benefits:

- YOLO handles field classification (what)
- PaddleOCR handles text localization (where, exactly)
- Eliminates the "tight vs loose" dilemma entirely

#### E. Label Both Key and Value Separately

Add new annotation classes: `invoice_date_label`, `invoice_date_value`.

- The model learns to find both the label and the value
- The spatial relationship (label -> value) supports more robust extraction
- Aligns with the FUNSD benchmark approach

#### F. Confidence-Weighted Expansion

Scale expansion by detection confidence:

```python
# Higher confidence = tighter crop (the model is sure)
# Lower confidence = wider crop (give OCR more context)
expansion = base_expansion * (1.5 - confidence)
```

### 4.3 Long-Term (Next Generation)

#### G. Move to LayoutLM-Style Token Classification

- Replace YOLO field detection with token-level classification
- Each OCR word is classified as B-field/I-field/O
- Eliminates bbox expansion entirely
- Better for fields with complex layouts

#### H. End-to-End with Donut/Pix2Struct

- No separate OCR step
- The model outputs structured fields directly from the image
- Zero bbox concerns
- Requires more training data and compute

---
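The confidence-weighted expansion idea from recommendation F can be fleshed out into a runnable sketch. The `base_expansion` default and the clamping of confidence to [0, 1] are assumptions added for safety; only the `base_expansion * (1.5 - confidence)` formula comes from the recommendation itself.

```python
# Sketch of confidence-weighted crop expansion (recommendation F).
# base_expansion default and confidence clamping are assumptions.
def confidence_weighted_expansion(confidence, base_expansion=0.10):
    """Return an expansion ratio that widens the OCR crop as confidence drops."""
    confidence = max(0.0, min(1.0, confidence))
    # confidence 1.0 -> 0.5 * base (tight crop, model is sure)
    # confidence 0.5 -> 1.0 * base (wider crop, give OCR more context)
    return base_expansion * (1.5 - confidence)
```

The returned ratio would replace the fixed `pad_ratio` in the inference-time padding of recommendation A, so low-confidence detections automatically get more surrounding context before OCR.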
## 5. Recommended Action Plan

### Phase 1: Validate Current Labels (1-2 days)

- [ ] Build a label visualization script
- [ ] Sample 100 documents across all field types
- [ ] Identify over-expansion and clipping cases
- [ ] Document per-field accuracy of the current expansion

### Phase 2: Tune Expansion Parameters (2-3 days)

- [ ] Reduce aggressive expansion ratios (ocr_number, bankgiro)
- [ ] Add inference-time OCR padding (5-10%)
- [ ] Re-train the model with adjusted labels
- [ ] Compare mAP and field extraction accuracy

### Phase 3: Two-Stage Refinement (1 week)

- [ ] Implement PaddleOCR text detection within YOLO detections
- [ ] Use the text-detection bbox for a precise OCR crop
- [ ] Keep YOLO expansion for classification only

### Phase 4: Evaluation (ongoing)

- [ ] Track per-field extraction accuracy on the test set
- [ ] A/B test tight vs expanded labels
- [ ] Build a regression test suite for labeling quality

---

## 6. Summary

| Aspect | Current Approach | Best Practice | Gap |
|---|---|---|---|
| **Labeling** | Value + expansion at label time | Tight value + inference expansion | Medium |
| **Expansion** | Field-specific directional | Field-specific directional | Aligned |
| **Inference OCR crop** | Raw detection bbox | Detection + padding | Needs padding |
| **Expansion ratios** | Aggressive (up to 1.80x) | Moderate (1.10-1.30x) | Over-expanded |
| **Visualization QC** | None | Regular sampling | Missing |
| **Coordinate consistency** | PDF points -> pixels | Consistent DPI | Check needed |

**Bottom line**: the architecture (field-specific directional expansion) is sound and aligns with best practices. The main improvements are:

1. **Reduce expansion aggressiveness** in training labels
2. **Add OCR padding** at inference time
3. **Add label quality visualization** for validation
4. Longer term: consider **two-stage detection** or **token classification**