Commit ad5ed46b4c (parent f1a7bfe6b7), Yaojia Wang, 2026-02-11 23:40:38 +01:00
117 changed files with 5741 additions and 7669 deletions

# Semi-Automatic Labeling Strategy Analysis
## 1. Current Pipeline Overview
```
CSV (field values)
|
v
Autolabel CLI
|- PDF render (300 DPI)
|- Text extraction (PDF text layer or PaddleOCR)
|- FieldMatcher.find_matches() [5 strategies]
| |- ExactMatcher (priority 1)
| |- ConcatenatedMatcher (multi-token)
| |- FuzzyMatcher (Amount, dates only)
| |- SubstringMatcher (prevents false positives)
| |- FlexibleDateMatcher (fallback)
|
|- AnnotationGenerator
| |- PDF points -> pixels
| |- expand_bbox() [field-specific strategy]
| |- pixels -> YOLO normalized (0-1)
| |- Save to database
|
v
DBYOLODataset (training)
|- Load images + bboxes from DB
|- Re-apply expand_bbox()
|- YOLO training
|
v
Inference
|- YOLO detect -> pixel bboxes
|- Crop region -> OCR extract text
|- Normalize & validate
```
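The "pixels -> YOLO normalized (0-1)" step in the pipeline above is the standard conversion; a minimal sketch (the function name is illustrative, not taken from the codebase):

```python
def pixels_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel bbox (x1, y1, x2, y2) to YOLO format:
    normalized (x_center, y_center, width, height), each in [0, 1]."""
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return x_center, y_center, width, height
```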
---
## 2. Current Expansion Strategy Analysis
### 2.1 Field-Specific Parameters
| Field | Scale X | Scale Y | Extra Top | Extra Left | Extra Right | Max Pad X | Max Pad Y |
|---|---|---|---|---|---|---|---|
| ocr_number | 1.15 | 1.80 | 0.60 | - | - | 50 | 140 |
| bankgiro | 1.45 | 1.35 | - | 0.80 | - | 160 | 90 |
| plusgiro | 1.45 | 1.35 | - | 0.80 | - | 160 | 90 |
| invoice_date | 1.25 | 1.55 | 0.40 | - | - | 80 | 110 |
| invoice_due_date | 1.30 | 1.65 | 0.45 | 0.35 | - | 100 | 120 |
| amount | 1.20 | 1.35 | - | - | 0.30 | 70 | 80 |
| invoice_number | 1.20 | 1.50 | 0.40 | - | - | 80 | 100 |
| supplier_org_number | 1.25 | 1.40 | 0.30 | 0.20 | - | 90 | 90 |
| customer_number | 1.25 | 1.45 | 0.35 | 0.25 | - | 90 | 100 |
| payment_line | 1.10 | 1.20 | - | - | - | 40 | 40 |
### 2.2 Design Rationale
The expansion parameters are tuned to Swedish invoice layout patterns:
- **Dates**: Labels ("Fakturadatum") typically sit **above** the value -> extra top
- **Giro accounts**: Prefix ("BG:", "PG:") sits **to the left** -> extra left
- **Amount**: Currency suffix ("SEK", "kr") to the **right** -> extra right
- **Payment line**: Machine-readable, self-contained -> minimal expansion
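The directional scheme above can be sketched as a single helper. This is illustrative only: the project's actual `expand_bbox()` signature and capping rules are not shown in this document, and image-boundary clamping is omitted for brevity. Parameters map to the columns of the table in 2.1 (e.g. ocr_number would pass `scale_y=1.80, extra_top=0.60, max_pad_y=140`):

```python
def expand_bbox(x1, y1, x2, y2, scale_x=1.0, scale_y=1.0,
                extra_top=0.0, extra_left=0.0, extra_right=0.0,
                max_pad_x=10**9, max_pad_y=10**9):
    """Center-point scaling plus one-sided extras, capped per side."""
    w, h = x2 - x1, y2 - y1
    # Symmetric center-point scaling, capped at the max padding
    pad_x = min(w * (scale_x - 1) / 2, max_pad_x)
    pad_y = min(h * (scale_y - 1) / 2, max_pad_y)
    # Directional extras expand one side only (fractions of box size)
    left = pad_x + w * extra_left
    right = pad_x + w * extra_right
    top = pad_y + h * extra_top
    return x1 - left, y1 - top, x2 + right, y2 + pad_y
```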
### 2.3 Strengths
1. **Field-specific directional expansion** - matches Swedish invoice conventions
2. **Max padding caps** - prevents runaway expansion into neighboring fields
3. **Center-point scaling** with directional compensation - geometrically sound
4. **Image boundary clamping** - prevents out-of-bounds coordinates
### 2.4 Potential Issues
| Issue | Risk Level | Description |
|---|---|---|
| Over-expansion | HIGH | OCR number 1.80x Y-scale could capture adjacent fields |
| Inconsistent training vs inference bbox | MEDIUM | Model trained on expanded boxes, inference returns raw detection |
| No expansion at inference OCR crop | MEDIUM | Detected bbox may clip text edges without post-expansion |
| Max padding fixed in pixels (DPI-dependent) | LOW | A 140 px cap tuned at 300 DPI covers twice the physical distance at 150 DPI |
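The DPI issue in the last row is straightforward to fix by expressing the pixel caps relative to the DPI they were tuned at; a sketch (names are illustrative):

```python
REFERENCE_DPI = 300  # the DPI at which the pixel caps in 2.1 were tuned

def scale_pad_for_dpi(max_pad_px, render_dpi, reference_dpi=REFERENCE_DPI):
    """Scale a pixel padding cap so it covers the same physical
    distance at any render DPI (140 px at 300 DPI = 70 px at 150 DPI)."""
    return max_pad_px * render_dpi / reference_dpi
```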
---
## 3. Industry Best Practices (Research Findings)
### 3.1 Labeling: Tight vs. Loose Bounding Boxes
**Consensus**: Annotate **tight bounding boxes around the value text only**.
- FUNSD/CORD benchmarks annotate keys and values as **separate entities**
- Loose boxes "introduce background noise and can mislead the model" (V7 Labs, LabelVisor)
- IoU discrepancies from loose boxes degrade mAP during training
**However**, for YOLO + OCR pipelines, tight-only creates a problem:
- YOLO predicts slightly imprecise boxes (typical IoU 0.7-0.9)
- If the predicted box clips even slightly, OCR misses characters
- Solution: **Label tight, expand at inference** OR **label with controlled padding**
### 3.2 The Two Dominant Strategies
**Strategy A: Tight Label + Inference-Time Expansion** (Recommended by research)
```
Label: [ 2024-01-15 ] (tight around value)
Inference: [ [2024-01-15] ] + pad -> OCR
```
- Clean, consistent annotations
- Requires post-detection padding before OCR crop
- Used by: Microsoft Document Intelligence, Nanonets
**Strategy B: Expanded Label at Training Time** (Current project approach)
```
Label: [ Fakturadatum: 2024-01-15 ] (includes context)
Inference: YOLO detects full region -> OCR extracts from region
```
- Model learns spatial context (label + value)
- Larger, more variable boxes
- OCR must filter out label text from extracted content
### 3.3 OCR Padding Requirements
**Tesseract**: Requires ~10px white border for reliable segmentation (PSM 7-10).
**PaddleOCR**: `det_db_unclip_ratio` parameter (default 1.5) controls detection expansion.
Key insight: Even after YOLO detection, OCR engines need some padding around text to work reliably.
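For the Tesseract case, the white border can be added to the crop itself rather than by widening the bbox; a sketch assuming the crop is a grayscale NumPy array:

```python
import numpy as np

def add_white_border(crop, pad=10):
    """Pad a grayscale crop with a white (255) border so Tesseract's
    line/word segmentation (PSM 7-10) sees background on every side."""
    return np.pad(crop, pad_width=pad, mode="constant", constant_values=255)
```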
### 3.4 State-of-the-Art Comparison
| System | Bbox Strategy | Field Definition |
|---|---|---|
| **LayoutLM** | Word-level bboxes from OCR | Token classification (BIO tagging) |
| **Donut** | No bboxes (end-to-end) | Internal attention mechanism |
| **Microsoft DocAI** | Field-level, tight | Post-expansion for OCR |
| **YOLO + OCR (this project)** | Field-level, expanded | Field-specific directional expansion |
---
## 4. Recommendations
### 4.1 Short-Term (Current Architecture)
#### A. Add Inference-Time OCR Padding
Currently, the detected bbox is sent directly to OCR. Add a small uniform padding (5-10%) before cropping for OCR:
```python
# In field_extractor.py, before the OCR crop:
pad_ratio = 0.05  # 5% expansion on each side
w_pad = (x2 - x1) * pad_ratio
h_pad = (y2 - y1) * pad_ratio
# Clamp to the image bounds and cast to int for pixel slicing
crop_x1 = int(max(0, x1 - w_pad))
crop_y1 = int(max(0, y1 - h_pad))
crop_x2 = int(min(img_w, x2 + w_pad))
crop_y2 = int(min(img_h, y2 + h_pad))
```
#### B. Reduce Training-Time Expansion Ratios
Current ratios (especially the 1.80x Y-scale for ocr_number and the 1.45x X-scale for bankgiro) are aggressive. Proposed reductions:
| Field | Current Scale Y | Proposed Scale Y | Rationale |
|---|---|---|---|
| ocr_number | 1.80 | 1.40 | 1.80 is too aggressive, captures neighbors |
| bankgiro | 1.35 | 1.25 | Reduce vertical over-expansion |
| invoice_due_date | 1.65 | 1.45 | Tighten vertical |
Principle: **shift expansion work from training-time to inference-time**.
#### C. Add Label Visualization Quality Check
Before training, sample 50-100 annotated images and visually inspect:
- Are expanded bboxes capturing only the target field?
- Are any bboxes overlapping with adjacent fields?
- Are any values being clipped?
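A minimal sketch of such a script, assuming annotations can be loaded as per-image lists of pixel-space tuples (the record format and the Pillow-based drawing are assumptions, not the project's API):

```python
import os
import random
from PIL import Image, ImageDraw

def render_qc_sample(records, sample_size=100, out_dir="qc_samples"):
    """Draw every annotation bbox on its source image for visual review.
    `records` maps an image path to a list of (field, x1, y1, x2, y2)."""
    os.makedirs(out_dir, exist_ok=True)
    for path in random.sample(list(records), min(sample_size, len(records))):
        img = Image.open(path).convert("RGB")
        draw = ImageDraw.Draw(img)
        for field, x1, y1, x2, y2 in records[path]:
            draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
            draw.text((x1, max(0, y1 - 14)), field, fill="red")
        img.save(os.path.join(out_dir, os.path.basename(path)))
```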
### 4.2 Medium-Term (Architecture Improvements)
#### D. Two-Stage Detection Strategy
```
Stage 1: YOLO detects field regions (current)
Stage 2: Within each detection, use PaddleOCR text detection
to find the precise text boundary
Stage 3: Extract text from refined boundary
```
Benefits:
- YOLO handles field classification (what)
- PaddleOCR handles text localization (where exactly)
- Eliminates the "tight vs loose" dilemma entirely
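The three stages can be sketched with the detector abstracted behind a callable; `detect_text_boxes` stands in for PaddleOCR's detection call (an assumption, not the project's API) and returns crop-relative pixel boxes:

```python
def refine_field_bbox(field_box, crop, detect_text_boxes):
    """Stage 2: run a text detector inside the YOLO field detection and
    return the union of detected text boxes in page coordinates."""
    x1, y1, _, _ = field_box
    text_boxes = detect_text_boxes(crop)  # [(tx1, ty1, tx2, ty2), ...]
    if not text_boxes:
        return field_box  # fall back to the raw YOLO box
    tx1 = min(b[0] for b in text_boxes)
    ty1 = min(b[1] for b in text_boxes)
    tx2 = max(b[2] for b in text_boxes)
    ty2 = max(b[3] for b in text_boxes)
    # Shift crop-relative coordinates back into page coordinates
    return x1 + tx1, y1 + ty1, x1 + tx2, y1 + ty2
```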
#### E. Label Both Key and Value Separately
Add new annotation classes: `invoice_date_label`, `invoice_date_value`
- Model learns to find both the label and value
- Use spatial relationship (label -> value) for more robust extraction
- Aligns with FUNSD benchmark approach
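The label -> value pairing could use a simple spatial score that prefers values below or to the right of the label, matching the Swedish layout conventions in 2.2; a sketch (the function and the scoring heuristic are assumptions, not FUNSD's method):

```python
def match_label_to_value(label_box, value_boxes):
    """Pick the value detection nearest to a label detection, preferring
    candidates below or to the right of the label."""
    lx = (label_box[0] + label_box[2]) / 2
    ly = (label_box[1] + label_box[3]) / 2

    def score(box):
        vx = (box[0] + box[2]) / 2
        vy = (box[1] + box[3]) / 2
        dist = ((vx - lx) ** 2 + (vy - ly) ** 2) ** 0.5
        # Penalize candidates above and to the left of the label
        penalty = 2.0 if (vy < ly and vx < lx) else 1.0
        return dist * penalty

    return min(value_boxes, key=score) if value_boxes else None
```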
#### F. Confidence-Weighted Expansion
Scale expansion by detection confidence:
```python
# Higher confidence = tighter crop (model is sure)
# Lower confidence = wider crop (give OCR more context)
# With confidence in [0, 1], the multiplier spans 0.5x-1.5x of base
expansion = base_expansion * (1.5 - confidence)
```
### 4.3 Long-Term (Next Generation)
#### G. Move to LayoutLM-Style Token Classification
- Replace YOLO field detection with token-level classification
- Each OCR word gets classified as B-field/I-field/O
- Eliminates bbox expansion entirely
- Better for fields with complex layouts
#### H. End-to-End with Donut/Pix2Struct
- No separate OCR step
- Model directly outputs structured fields from image
- Zero bbox concerns
- Requires more training data and compute
---
## 5. Recommended Action Plan
### Phase 1: Validate Current Labels (1-2 days)
- [ ] Build label visualization script
- [ ] Sample 100 documents across all field types
- [ ] Identify over-expansion and clipping cases
- [ ] Document per-field accuracy of current expansion
### Phase 2: Tune Expansion Parameters (2-3 days)
- [ ] Reduce aggressive expansion ratios (OCR number, bankgiro)
- [ ] Add inference-time OCR padding (5-10%)
- [ ] Re-train model with adjusted labels
- [ ] Compare mAP and field extraction accuracy
### Phase 3: Two-Stage Refinement (1 week)
- [ ] Implement PaddleOCR text detection within YOLO detection
- [ ] Use text detection bbox for precise OCR crop
- [ ] Keep YOLO expansion for classification only
### Phase 4: Evaluation (ongoing)
- [ ] Track per-field extraction accuracy on test set
- [ ] A/B test tight vs expanded labels
- [ ] Build regression test suite for labeling quality
---
## 6. Summary
| Aspect | Current Approach | Best Practice | Gap |
|---|---|---|---|
| **Labeling** | Value + expansion at label time | Tight value + inference expansion | Medium |
| **Expansion** | Field-specific directional | Field-specific directional | Aligned |
| **Inference OCR crop** | Raw detection bbox | Detection + padding | Needs padding |
| **Expansion ratios** | Aggressive (up to 1.80x) | Moderate (1.10-1.30x) | Over-expanded |
| **Visualization QC** | None | Regular sampling | Missing |
| **Coordinate consistency** | PDF points -> pixels | Consistent DPI | Check needed |
**Bottom line**: The architecture (field-specific directional expansion) is sound and aligns with best practices. The main improvements are:
1. **Reduce expansion aggressiveness** during training labels
2. **Add OCR padding** at inference time
3. **Add label quality visualization** for validation
4. Longer term: consider **two-stage detection** or **token classification**