invoice-master-poc-v2/INFERENCE_ANALYSIS_REPORT.md
Yaojia Wang ad5ed46b4c WIP
2026-02-11 23:40:38 +01:00

# Inference Analysis Report
Date: 2026-02-11

Sample: 39 PDFs (diverse sizes from 1783 total), invoice-sm120 environment
## Executive Summary
| Metric | Value |
|--------|-------|
| Total PDFs tested | 39 |
| Successful responses | 35 (89.7%) |
| Timeouts (>120s) | 4 (10.3%) |
| Pure fallback (all fields conf=0.500) | 15/35 (42.9%) |
| Full extraction (all expected fields) | 6/35 (17.1%) |
| supplier_org_number extraction rate | 0% (field-name mismatch, see Problem #7) |
| InvoiceDate extraction rate | 31.4% |
| OCR extraction rate | 31.4% |

**Root Cause**: A critical DPI mismatch bug causes 43% of successfully processed documents to lose all YOLO-detected field data and fall back to inaccurate regex patterns.

---
## Problem #1 (CRITICAL): DPI Mismatch - Field Extraction Failures
### Symptom
- 15/35 documents (43%) have ALL extracted fields at confidence=0.500 (fallback)
- YOLO detects fields correctly (6+ detections at conf 0.8-0.97) but text extraction returns nothing
- Example: `4f822b0d` has 6 YOLO detections but only 1 field extracted, via fallback
### Root Cause
**DPI not passed from pipeline to FieldExtractor**, causing a 2x coordinate scaling error.
```
pipeline.py:237 -> self.extractor = FieldExtractor(ocr_lang=ocr_lang, use_gpu=use_gpu)
^^^ DPI NOT PASSED! Defaults to 300
```
The chain:
1. `shared/config.py:22` defines `DEFAULT_DPI = 150`
2. `InferencePipeline.__init__()` receives `dpi=150` from `ModelConfig`
3. PDF rendered at **150 DPI** -> YOLO detections in 150 DPI pixel coordinates
4. `FieldExtractor` defaults to `dpi=300` (never receives the actual 150)
5. Coordinate conversion: `scale = 72 / self.dpi` = `72/300 = 0.24` instead of `72/150 = 0.48`
6. Every bounding-box coordinate comes out at **half** its true value in PDF point space, so the box shrinks and shifts toward the origin -> no tokens match -> empty extraction
7. Fallback regex triggers with conf=0.500
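
The scaling error in steps 5 and 6 can be reproduced in a few lines. This is a minimal sketch: `pixels_to_points` and the sample bounding box are hypothetical and are not taken from `FieldExtractor`; only the `72 / dpi` scale factor comes from the chain above.

```python
def pixels_to_points(bbox, dpi):
    """Convert a pixel-space bbox (x0, y0, x1, y1) to PDF points (72 per inch)."""
    scale = 72 / dpi
    return tuple(round(v * scale, 2) for v in bbox)


# Hypothetical YOLO detection on a page rendered at 150 DPI.
bbox_px = (500, 300, 900, 340)

correct = pixels_to_points(bbox_px, dpi=150)  # scale = 0.48
wrong = pixels_to_points(bbox_px, dpi=300)    # scale = 0.24 (the bug)

print(correct)  # (240.0, 144.0, 432.0, 163.2)
print(wrong)    # (120.0, 72.0, 216.0, 81.6)
```

With the wrong DPI the box lands on a different, smaller region of the page, so a token lookup in that region finds none of the text that YOLO actually detected.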
### Fix
**File**: `packages/backend/backend/pipeline/pipeline.py`, line 237
```python
# BEFORE (broken):
self.extractor = FieldExtractor(ocr_lang=ocr_lang, use_gpu=use_gpu)
# AFTER (fixed):
self.extractor = FieldExtractor(ocr_lang=ocr_lang, use_gpu=use_gpu, dpi=dpi)
```
### Impact
This single-line fix should move the ~43% of documents currently stuck in degraded fallback back to proper YOLO+OCR extraction.

---
## Problem #2 (HIGH): Fallback Amount Extraction Grabs Wrong Values
### Symptom
- 3 documents extracted Amount=1.00 when the actual amounts are far larger (e.g. 7500.00)
- Fallback regex matches the table column header "Summa" followed by the first value of the next row (e.g. "1") instead of the total
### Example
Document `2b7e4103` (Astra Football Club):
- Actual amount: **7 500,00 SEK**
- Extracted: **1.00** (from "Summa 1" where "1" is the article number in the next row)
### Root Cause
The fallback Amount regex in `pipeline.py:676`:
```python
r'(?:att\s*betala|summa|total|belopp)\s*[:.]?\s*([\d\s,\.]+)\s*(?:SEK|kr)?'
```
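matches "Summa" (column header) followed by "1" (the first value in the next row), because PaddleOCR produces tokens in position order. Since `\s*` also matches newlines, the match can cross from the header row into the row below. The greedy `[\d\s,\.]+` captures "1" and stops at "Medlemsavgift".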
### Fix
**File**: `packages/backend/backend/pipeline/pipeline.py`, lines 674-688
1. Require minimum amount value in fallback (e.g., > 10.00)
2. Require the matched amount to have a decimal separator (`,` or `.`) to avoid matching integers
3. Prefer "ATT BETALA" over "Summa" as the keyword (less ambiguous)
```python
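'Amount': [
    # Most specific keyword first: "att betala" labels the payable total.
    r'(?:att\s+betala)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
    # Amount followed by a currency marker at end of line ("$" needs re.MULTILINE to match per line).
    r'([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)\s*$',
    # Generic keywords last; the mandatory two-decimal suffix rejects bare integers such as "1".
    r'(?:summa|total|belopp)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
],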
```
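The three fixes can be combined as in the sketch below. It is an illustration under stated assumptions, not the pipeline's actual code: `fallback_amount`, `parse_amount`, and `MIN_AMOUNT` are hypothetical names, the threshold is taken from the "> 10.00" example above, and the sample text is a made-up layout modeled on document `2b7e4103`. The old pattern is included to show the failure it avoids.

```python
import re

OLD_PATTERN = r'(?:att\s*betala|summa|total|belopp)\s*[:.]?\s*([\d\s,\.]+)\s*(?:SEK|kr)?'

# Patterns from the fix above, tried in order of specificity.
AMOUNT_PATTERNS = [
    r'(?:att\s+betala)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
    r'([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)\s*$',
    r'(?:summa|total|belopp)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
]
MIN_AMOUNT = 10.00  # hypothetical threshold


def parse_amount(raw):
    """Turn '7 500,00' into 7500.0 (space as thousands separator, comma as decimal)."""
    compact = re.sub(r'\s', '', raw)
    return float(compact.replace(',', '.'))


def fallback_amount(text):
    """Return the first plausible amount, or None if no pattern yields one."""
    for pattern in AMOUNT_PATTERNS:
        for match in re.finditer(pattern, text, re.IGNORECASE | re.MULTILINE):
            value = parse_amount(match.group(1))
            if value > MIN_AMOUNT:
                return value
    return None


# Hypothetical OCR text: table header, one row, then the payable total.
sample = "Artikel Summa\n1 Medlemsavgift 7 500,00\nATT BETALA: 7 500,00 SEK"

old_capture = re.search(OLD_PATTERN, sample, re.IGNORECASE).group(1).strip()
print(old_capture)              # 1
print(fallback_amount(sample))  # 7500.0
```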
---
## Problem #3 (HIGH): Fallback Bankgiro Regex False Positives
### Symptom
- Document `2b7e4103` extracts Bankgiro=2546-1610 but the actual document has NO Bankgiro
- The document has Plusgiro=2131575-9 and Org.nr=802546-1610
### Root Cause
Fallback Bankgiro regex in `pipeline.py:681`:
```python
r'(\d{4}[-\s]\d{4})\s*(?=\s|$)'
```
matches the LAST 8 digits of org number "802546-1610" as "2546-1610".
### Fix
**File**: `packages/backend/backend/pipeline/pipeline.py`, line 681
Add a negative lookbehind and a negative lookahead so the pattern cannot match inside a longer number:
```python
'Bankgiro': [
    r'(?:bankgiro|bg)\s*[:.]?\s*(\d{3,4}[-\s]?\d{4})',
    r'(?<!\d)(\d{3,4}[-\s]\d{4})(?!\d)',  # Must not be preceded/followed by digits
],
```
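A quick check of the old and new unlabeled patterns against the identifiers from document `2b7e4103`. The surrounding label text and the positive example `123-4567` are made up for illustration.

```python
import re

OLD = r'(\d{4}[-\s]\d{4})\s*(?=\s|$)'
NEW = r'(?<!\d)(\d{3,4}[-\s]\d{4})(?!\d)'

# Org number and Plusgiro from the report; no Bankgiro is present.
text = "Org.nr 802546-1610 Plusgiro 2131575-9"

old_hits = re.findall(OLD, text)
new_hits = re.findall(NEW, text)

print(old_hits)  # ['2546-1610']  (tail of the org number)
print(new_hits)  # []

# A standalone number in Bankgiro shape is still found.
print(re.findall(NEW, "BG 123-4567"))  # ['123-4567']
```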
---
## Problem #4 (MEDIUM): OCR Number Minimum 5-Digit Requirement
### Symptom
- Document `2b7e4103` has OCR=3046 (4 digits) which is valid but rejected by normalizer
- `OcrNumberNormalizer` requires minimum 5 digits
### Root Cause
**File**: `packages/backend/backend/pipeline/normalizers/ocr_number.py`, line 32:
```python
if len(digits) < 5:
    return NormalizationResult.failure(f"Too few digits for OCR: {len(digits)}")
```
Swedish OCR numbers can be 2-25 digits. The 5-digit minimum is too restrictive.
### Fix
Lower the minimum to 2 digits, matching the 2-25 digit range stated above:
```python
if len(digits) < 2:
    return NormalizationResult.failure(f"Too few digits for OCR: {len(digits)}")
```
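A lower minimum lets more short digit strings through, so an extra guard may be worth considering. One option, sketched below, is a mod-10 (Luhn) check. This rests on an assumption the report does not state: that the OCR references in this dataset carry a Luhn check digit, as Swedish OCR references commonly do. `luhn_valid` is a hypothetical helper and is not part of `OcrNumberNormalizer`. The 4-digit value `3046` from document `2b7e4103` passes the check.

```python
def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the mod-10 (Luhn) check."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk right to left; double every second digit, starting with the one
    # immediately left of the check digit.
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


print(luhn_valid("3046"))  # True
print(luhn_valid("3047"))  # False
```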
---
## Problem #5 (MEDIUM): InvoiceNumber Extracts Year (2025, 2026)
### Symptom
- 2 documents extract year as invoice number: "2025", "2026"
- `dc35ee8e`: actual invoice number visible in PDF but normalizer picks up year
- `56cabf73`: InvoiceNumber=2026
### Root Cause
**File**: `packages/backend/backend/pipeline/normalizers/invoice_number.py`, lines 54-72
The "Pattern 3: Short digit sequence" strategy prefers shorter sequences. When the YOLO bbox contains both the year "2025" and the actual invoice number, the shorter "2025" (4 digits) wins over a longer sequence.
### Fix
Add year exclusion to Pattern 3:
```python
for seq in digit_sequences:
    if len(seq) == 8 and seq.startswith("20"):
        continue  # Skip YYYYMMDD dates
    if len(seq) == 4 and seq.startswith("20"):
        continue  # Skip year-only values (2024, 2025, 2026)
    if len(seq) > 10:
        continue
    valid_sequences.append(seq)
```
```
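One caveat with the exclusion: a bare "20xx" could occasionally be a genuine invoice number. A more conservative variant, sketched here, sets year-like sequences aside and uses them only when no other candidate survives. `filter_sequences` is a hypothetical wrapper; the surrounding normalizer code is not shown in this report, so the input and return shapes are assumptions.

```python
def filter_sequences(digit_sequences):
    """Filter candidate digit runs, treating year-like values as a last resort."""
    valid, year_like = [], []
    for seq in digit_sequences:
        if len(seq) == 8 and seq.startswith("20"):
            continue  # Skip YYYYMMDD dates
        if len(seq) > 10:
            continue
        if len(seq) == 4 and seq.startswith("20"):
            year_like.append(seq)  # Set aside instead of discarding
            continue
        valid.append(seq)
    # Fall back to year-like values only if nothing else survived.
    return valid or year_like


print(filter_sequences(["2025", "48213"]))  # ['48213']
print(filter_sequences(["2025"]))           # ['2025']
```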
---
## Problem #6 (MEDIUM): InvoiceNumber vs OCR Mismatch
### Symptom
- 5 documents show InvoiceNumber different from OCR number
- Example: `87f470da` InvoiceNumber=852460234111905 vs OCR=524602341119055
- Example: `8b0674be` InvoiceNumber=508021404131 vs OCR=50802140413
### Root Cause
These mismatches are expected: InvoiceNumber and OCR are read from DIFFERENT YOLO bounding boxes (different regions of the invoice). The InvoiceNumber normalizer picks a shorter sequence from the invoice_number bbox, while the OCR normalizer extracts from the ocr_number bbox. Cross-validation against payment_line should reconcile the two, but it is not running (0 documents show cross_validation results).
### Diagnosis Needed
Check why cross-validation / payment_line parsing isn't populating `result.cross_validation` even when payment_line is extracted.
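
While that diagnosis is pending, it may help to see how much the two values overlap. The sketch below uses the standard library's `difflib.SequenceMatcher` to find the longest run of digits shared by both strings. `longest_shared_run` is a hypothetical helper and is not part of the pipeline. The inputs are the two examples listed under Symptom.

```python
from difflib import SequenceMatcher


def longest_shared_run(a: str, b: str) -> str:
    """Return the longest contiguous run of characters present in both strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


# 87f470da: the values share 14 of their 15 digits, offset by one position.
print(longest_shared_run("852460234111905", "524602341119055"))  # 52460234111905

# 8b0674be: the OCR value is a prefix of the InvoiceNumber value.
print(longest_shared_run("508021404131", "50802140413"))  # 50802140413
```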

---
## Problem #7 (MEDIUM): supplier_org_number 0% Extraction Rate
### Symptom
- 0/35 documents extract supplier_org_number
- YOLO detects supplier_org_number in many documents (visible in detection classes)
- When extracted, the field appears as `supplier_organisation_number` (different name)
### Root Cause
This is primarily a reporting issue. The API returns the field as `supplier_organisation_number` (full spelling) from the `CLASS_TO_FIELD` mapping, but the analysis looked for `supplier_org_number`. In the actual data, 8/35 documents (22.9%) DO have `supplier_organisation_number` extracted.
The underlying issue remains, however: even when YOLO detects `supplier_org_number`, the DPI bug prevents text extraction for text PDFs.
### Fix
Already addressed by Problem #1 (DPI fix). Additionally, ensure consistent field naming in API documentation.

---
## Problem #8 (LOW): Timeout Failures (4/39 documents)
### Symptom
- 4 PDFs timed out at 120 seconds
- File sizes: 89KB, 169KB, 239KB, 970KB (not correlated with size)
### Root Cause
Likely multi-page PDFs or PDFs with complex layouts requiring extensive OCR. The 120s timeout in the test script may be too short for multi-page documents + full-page OCR fallback.
### Fix
1. Increase API timeout for multi-page PDFs
2. Add page limit or early termination for very large documents
3. Log page count in response to correlate with processing time
---
## Problem #9 (LOW): Non-Invoice Documents in Dataset
### Symptom
- `dccf6655`: 0 detections, 0 fields - this is a screenshot of UI buttons, NOT an invoice
### Fix
Add document classification as a pre-processing step to reject non-invoice documents before running the expensive YOLO + OCR pipeline.

---
## Problem #10 (LOW): InvoiceDueDate Before InvoiceDate
### Symptom
- Document `11de4d07`: InvoiceDate=2026-01-16, InvoiceDueDate=2025-12-01
- Due date is BEFORE invoice date, which is illogical
### Root Cause
Either the date normalizer swapped the values, or the YOLO model detected the wrong region for one of the dates. The DPI bug (Problem #1) may also affect date extraction from the correct regions.
### Fix
Add post-extraction validation: if InvoiceDueDate < InvoiceDate, either swap them or flag for review.
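
A minimal sketch of such a check, assuming both fields have already been parsed into `datetime.date` values. `check_date_order` is a hypothetical helper. It flags the inconsistency instead of swapping, since a swap could hide a mis-detected region.

```python
from datetime import date


def check_date_order(invoice_date: date, due_date: date) -> list[str]:
    """Return validation warnings (an empty list when the dates are consistent)."""
    warnings = []
    if due_date < invoice_date:
        warnings.append(
            f"InvoiceDueDate {due_date.isoformat()} is before "
            f"InvoiceDate {invoice_date.isoformat()}"
        )
    return warnings


# Values from document 11de4d07.
print(check_date_order(date(2026, 1, 16), date(2025, 12, 1)))
# ['InvoiceDueDate 2025-12-01 is before InvoiceDate 2026-01-16']
```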

---
## Priority Fix Order
| Priority | Fix | Impact | Effort |
|----------|-----|--------|--------|
| 1 | DPI mismatch (Problem #1) | 43% of docs recovered | 1 line change |
| 2 | Fallback amount regex (Problem #2) | 3+ docs with wrong amounts | Small regex fix |
| 3 | Fallback bankgiro regex (Problem #3) | False positive bankgiro | Small regex fix |
| 4 | OCR min digits (Problem #4) | Short OCR numbers supported | 1 line change |
| 5 | Year as InvoiceNumber (Problem #5) | 2+ docs | Small logic add |
| 6 | Date validation (Problem #10) | Logical consistency | Small validation add |
| 7 | Cross-validation (Problem #6) | Better field reconciliation | Investigation needed |
| 8 | Timeouts (Problem #8) | 4 docs | Config change |
| 9 | Document classification (Problem #9) | Filter non-invoices | Feature addition |
---
## Expected Re-run Results After Fix #1
After fixing the DPI mismatch alone, re-running the same 39 PDFs should show:
- Pure fallback rate dropping from 43% to near 0%
- InvoiceDate extraction rate improving from 31% to ~70%+
- OCR extraction rate improving from 31% to ~60%+
- Average confidence scores increasing significantly
- supplier_organisation_number extraction improving from 23% to ~60%+
---
## Detailed Per-PDF Results Summary
| PDF | Size | Time | Fields | Confidence | Issues |
|-----|------|------|--------|------------|--------|
| dccf6655 | 10KB | 17s | 0/0 | - | Not an invoice |
| 4f822b0d | 183KB | 37s | 1/6 | ALL 0.500 | DPI bug: 6 detections, 5 lost |
| d4af7848 | 55KB | 41s | 1/6 | ALL 0.500 | DPI bug: 6 detections, 5 lost |
| 19533483 | 262KB | 39s | 1/9 | ALL 0.500 | DPI bug: 9 detections, 8 lost |
| 2b7e4103 | 25KB | 47s | 3/6 | ALL 0.500 | DPI bug + Amount=1.00 wrong |
| 7717d293 | 34KB | 16s | 3/6 | ALL 0.500 | DPI bug + Amount=1.00 wrong |
| 3226ac59 | 66KB | 42s | 3/5 | ALL 0.500 | DPI bug + Amount=1.00 wrong |
| 0553e5c2 | 31KB | 18s | 3/6 | ALL 0.500 | DPI bug + BG=5000-0000 suspicious |
| 32e90db8 | 136KB | 40s | 3/7 | Mixed | Amount=2026.00 (year?) |
| dc35ee8e | 567KB | 83s | 7/9 | YOLO | InvoiceNumber=2025 (year) |
| 56cabf73 | 67KB | 19s | 5/6 | YOLO | InvoiceNumber=2026 (year) |
| 87f470da | 784KB | 42s | 9/14 | YOLO | InvNum vs OCR mismatch |
| 11de4d07 | 356KB | 68s | 5/3 | Mixed | DueDate < InvoiceDate |
| 0f9047a9 | 415KB | 22s | 8/6 | YOLO | Good extraction |
| 9d0b793c | 286KB | 18s | 8/8 | YOLO | Good extraction |
| 5604d375 | 915KB | 51s | 9/10 | YOLO | Good extraction |
| 87f470da | 784KB | 42s | 9/14 | YOLO | Good extraction |
| f40fd418 | 523KB | 90s | 9/9 | YOLO | Good extraction |
---
## Field Extraction Rate Summary
| Field | Present | Missing | Rate | Avg Conf |
|-------|---------|---------|------|----------|
| Bankgiro | 32 | 3 | 91.4% | 0.681 |
| InvoiceNumber | 28 | 7 | 80.0% | 0.695 |
| Amount | 27 | 8 | 77.1% | 0.726 |
| InvoiceDueDate | 13 | 22 | 37.1% | 0.883 |
| InvoiceDate | 11 | 24 | 31.4% | 0.879 |
| OCR | 11 | 24 | 31.4% | 0.900 |
| customer_number | 11 | 24 | 31.4% | 0.926 |
| payment_line | 9 | 26 | 25.7% | 0.938 |
| Plusgiro | 3 | 32 | 8.6% | 0.948 |
| supplier_org_number | 0 | 35 | 0.0% | 0.000 |

Note: Fields with high confidence but a low extraction rate (InvoiceDate 0.879, OCR 0.900, payment_line 0.938) are a signature of the DPI bug: when extraction works (via YOLO), confidence is high. The rate is low because a large share of documents fall back to regex, and these fields have no fallback regex pattern.