Inference Analysis Report
Date: 2026-02-11
Sample: 39 PDFs (diverse sizes from 1783 total), invoice-sm120 environment
Executive Summary
| Metric | Value |
|---|---|
| Total PDFs tested | 39 |
| Successful responses | 35 (89.7%) |
| Timeouts (>120s) | 4 (10.3%) |
| Pure fallback (all fields conf=0.500) | 15/35 (42.9%) |
| Full extraction (all expected fields) | 6/35 (17.1%) |
| supplier_org_number extraction rate | 0% |
| InvoiceDate extraction rate | 31.4% |
| OCR extraction rate | 31.4% |
Root Cause: A critical DPI mismatch bug causes 43% of successfully processed documents (15/35) to lose all YOLO-detected field data and fall back to inaccurate regex patterns.
Problem #1 (CRITICAL): DPI Mismatch - Field Extraction Failures
Symptom
- 15/35 documents (43%) have ALL extracted fields at confidence=0.500 (fallback)
- YOLO detects fields correctly (6+ detections at conf 0.8-0.97) but text extraction returns nothing
- Example: `4f822b0d` has 6 YOLO detections but only 1 field extracted via fallback
Root Cause
The DPI is not passed from the pipeline to FieldExtractor, causing a 2x coordinate scaling error.
pipeline.py:237 -> self.extractor = FieldExtractor(ocr_lang=ocr_lang, use_gpu=use_gpu)  # DPI NOT PASSED! Defaults to 300
The chain:
- `shared/config.py:22` defines `DEFAULT_DPI = 150`
- `InferencePipeline.__init__()` receives `dpi=150` from `ModelConfig`
- PDF rendered at 150 DPI -> YOLO detections in 150 DPI pixel coordinates
- `FieldExtractor` defaults to `dpi=300` (never receives the actual 150)
- Coordinate conversion: `scale = 72 / self.dpi` = `72/300 = 0.24` instead of `72/150 = 0.48`
- Bounding boxes are halved in PDF point space -> no tokens match -> empty extraction
- Fallback regex triggers with conf=0.500
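The arithmetic behind the scaling error can be sketched as follows. This is a minimal illustration, not the project's code: `pixels_to_points` and the sample bounding box are hypothetical, and only the `72 / dpi` scale factor is taken from the chain above.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

POINTS_PER_INCH = 72.0  # PDF user space uses 72 points per inch


def pixels_to_points(bbox_px: BBox, dpi: int) -> BBox:
    """Convert a pixel-space box to PDF points for a page rendered at `dpi`.

    Hypothetical helper; it only mirrors the `scale = 72 / dpi` conversion.
    """
    return tuple(v * POINTS_PER_INCH / dpi for v in bbox_px)


# Hypothetical YOLO detection on a page rendered at 150 DPI.
detection_px = (300.0, 600.0, 900.0, 660.0)

correct = pixels_to_points(detection_px, dpi=150)  # DPI actually used for rendering
wrong = pixels_to_points(detection_px, dpi=300)    # extractor default

print(correct)  # (144.0, 288.0, 432.0, 316.8)
print(wrong)    # (72.0, 144.0, 216.0, 158.4)
```

Every coordinate in `wrong` is half of the matching coordinate in `correct`, so the lookup region lands on a different part of the page and finds no text tokens.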
Fix
File: packages/backend/backend/pipeline/pipeline.py, line 237
# BEFORE (broken):
self.extractor = FieldExtractor(ocr_lang=ocr_lang, use_gpu=use_gpu)
# AFTER (fixed):
self.extractor = FieldExtractor(ocr_lang=ocr_lang, use_gpu=use_gpu, dpi=dpi)
Impact
This single-line fix should move the ~43% of documents (15/35) currently stuck on degraded fallback back to proper YOLO+OCR extraction.
Problem #2 (HIGH): Fallback Amount Extraction Grabs Wrong Values
Symptom
- 3 documents extracted Amount=1.00 when actual amounts are 7500.00, etc.
- Fallback regex matches table column header "Summa" followed by row quantity "1,00" instead of total
Example
Document 2b7e4103 (Astra Football Club):
- Actual amount: 7 500,00 SEK
- Extracted: 1.00 (from "Summa 1" where "1" is the article number in the next row)
Root Cause
The fallback Amount regex in pipeline.py:676:
r'(?:att\s*betala|summa|total|belopp)\s*[:.]?\s*([\d\s,\.]+)\s*(?:SEK|kr)?'
matches "Summa" (a table column header) followed by "1" (the first value in the next row), because PaddleOCR emits tokens in position order. The greedy `[\d\s,\.]+` captures "1" and stops at "Medlemsavgift".
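The mismatch is easy to reproduce in isolation. In this sketch the pattern is the one quoted above, while the OCR token stream is a hypothetical reconstruction and the case-insensitive flag is an assumption about how the pipeline compiles its fallback patterns.

```python
import re

# Pattern quoted from the report (pipeline.py:676).
OLD_AMOUNT = re.compile(
    r'(?:att\s*betala|summa|total|belopp)\s*[:.]?\s*([\d\s,\.]+)\s*(?:SEK|kr)?',
    re.IGNORECASE,  # assumption: fallback matching is case-insensitive
)

# Hypothetical OCR output in position order: the "Summa" column header is
# immediately followed by the first cell of the next table row.
ocr_text = "Summa 1 Medlemsavgift 7 500,00 ATT BETALA 7 500,00 SEK"

match = OLD_AMOUNT.search(ocr_text)
captured = match.group(1).strip()
print(captured)  # 1
```

The search stops at the first keyword hit, so the correct "ATT BETALA 7 500,00" further along the text is never considered.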
Fix
File: packages/backend/backend/pipeline/pipeline.py, lines 674-688
- Require minimum amount value in fallback (e.g., > 10.00)
- Require the matched amount to have a decimal separator (`,` or `.`) to avoid matching integers
- Prefer "ATT BETALA" over "Summa" as the keyword (less ambiguous)
'Amount': [
r'(?:att\s+betala)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
r'([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)\s*$',
r'(?:summa|total|belopp)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
],
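A minimal sketch of how the ordered patterns might be applied is given here. The three patterns are copied from the fix; `fallback_amount`, the regex flags, the `Decimal` conversion, and the sample text are assumptions made for illustration.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumption: case-insensitive, and `$` should match at line ends.
FLAGS = re.IGNORECASE | re.MULTILINE

# Patterns from the proposed fix, tried in priority order.
AMOUNT_PATTERNS = [
    re.compile(r'(?:att\s+betala)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?', FLAGS),
    re.compile(r'([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)\s*$', FLAGS),
    re.compile(r'(?:summa|total|belopp)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?', FLAGS),
]


def fallback_amount(text: str) -> Optional[Decimal]:
    """Return the first amount found by the ordered patterns, or None."""
    for pattern in AMOUNT_PATTERNS:
        match = pattern.search(text)
        if match:
            raw = "".join(match.group(1).split())  # drop thousands spaces
            return Decimal(raw.replace(",", "."))  # Swedish decimal comma
    return None


# Hypothetical OCR text for illustration.
print(fallback_amount("Summa 1 Medlemsavgift 7 500,00 ATT BETALA 7 500,00 SEK"))  # 7500.00
print(fallback_amount("Summa 1 Medlemsavgift"))  # None
```

Requiring two decimals means a bare integer after "Summa" no longer qualifies, and trying "ATT BETALA" first keeps the ambiguous keywords as a last resort.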
Problem #3 (HIGH): Fallback Bankgiro Regex False Positives
Symptom
- Document `2b7e4103` extracts Bankgiro=2546-1610, but the actual document has NO Bankgiro
- The document has Plusgiro=2131575-9 and Org.nr=802546-1610
Root Cause
Fallback Bankgiro regex in pipeline.py:681:
r'(\d{4}[-\s]\d{4})\s*(?=\s|$)'
matches the LAST 8 digits of org number "802546-1610" as "2546-1610".
Fix
File: packages/backend/backend/pipeline/pipeline.py, line 681
Add a negative lookbehind and a negative lookahead so the pattern cannot match inside a longer number:
'Bankgiro': [
r'(?:bankgiro|bg)\s*[:.]?\s*(\d{3,4}[-\s]?\d{4})',
r'(?<!\d)(\d{3,4}[-\s]\d{4})(?!\d)', # Must not be preceded/followed by digits
],
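The difference between the old and new patterns can be checked against the values from document 2b7e4103. In this sketch the patterns are copied from the report; `fallback_bankgiro`, the case-insensitive flag on the keyword pattern, and the `123-4567` sample are hypothetical.

```python
import re
from typing import Optional

# Old fallback pattern (pipeline.py:681).
OLD_BANKGIRO = re.compile(r'(\d{4}[-\s]\d{4})\s*(?=\s|$)')

# Patterns from the proposed fix, tried in order.
NEW_BANKGIRO = [
    re.compile(r'(?:bankgiro|bg)\s*[:.]?\s*(\d{3,4}[-\s]?\d{4})', re.IGNORECASE),
    re.compile(r'(?<!\d)(\d{3,4}[-\s]\d{4})(?!\d)'),
]


def fallback_bankgiro(text: str) -> Optional[str]:
    """Return the first Bankgiro candidate from the new patterns, or None."""
    for pattern in NEW_BANKGIRO:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


text = "Plusgiro 2131575-9 Org.nr 802546-1610"

print(OLD_BANKGIRO.search(text).group(1))         # 2546-1610 (false positive)
print(fallback_bankgiro(text))                    # None
print(fallback_bankgiro("Bankgiro: 123-4567"))    # 123-4567
```

The lookbehind rejects any candidate that starts in the middle of a digit run, which is what removes the tail of the org number.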
Problem #4 (MEDIUM): OCR Number Minimum 5-Digit Requirement
Symptom
- Document `2b7e4103` has OCR=3046 (4 digits), which is valid but rejected by the normalizer
- `OcrNumberNormalizer` requires a minimum of 5 digits
Root Cause
File: packages/backend/backend/pipeline/normalizers/ocr_number.py, line 32:
if len(digits) < 5:
    return NormalizationResult.failure(f"Too few digits for OCR: {len(digits)}")
Swedish OCR numbers can be 2-25 digits. The 5-digit minimum is too restrictive.
Fix
Lower minimum to 2 digits (or possibly 1 for very short OCR references):
if len(digits) < 2:
    return NormalizationResult.failure(f"Too few digits for OCR: {len(digits)}")
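The effect of the threshold can be sketched with a standalone stand-in. `normalize_ocr` is hypothetical and returns `None` on failure instead of the project's `NormalizationResult`. The 25-digit upper bound is an assumption derived from the 2-25 range stated above, not something the report confirms is enforced.

```python
import re
from typing import Optional

MIN_OCR_DIGITS = 2   # proposed minimum
MAX_OCR_DIGITS = 25  # assumption, taken from the "2-25 digits" range


def normalize_ocr(text: str, min_digits: int = MIN_OCR_DIGITS) -> Optional[str]:
    """Keep only the digits and accept them if the length is in range."""
    digits = re.sub(r"\D", "", text)
    if not (min_digits <= len(digits) <= MAX_OCR_DIGITS):
        return None
    return digits


print(normalize_ocr("3046"))                # 3046
print(normalize_ocr("3046", min_digits=5))  # None (old behavior)
```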
Problem #5 (MEDIUM): InvoiceNumber Extracts Year (2025, 2026)
Symptom
- 2 documents extract year as invoice number: "2025", "2026"
  - `dc35ee8e`: actual invoice number visible in PDF but normalizer picks up year
  - `56cabf73`: InvoiceNumber=2026
Root Cause
File: packages/backend/backend/pipeline/normalizers/invoice_number.py, lines 54-72
The "Pattern 3: Short digit sequence" strategy prefers shorter sequences. When the YOLO bbox contains both the year "2025" and the actual invoice number, the shorter "2025" (4 digits) wins over a longer sequence.
Fix
Add year exclusion to Pattern 3:
for seq in digit_sequences:
    if len(seq) == 8 and seq.startswith("20"):
        continue  # Skip YYYYMMDD dates
    if len(seq) == 4 and seq.startswith("20"):
        continue  # Skip year-only values (2024, 2025, 2026)
    if len(seq) > 10:
        continue
    valid_sequences.append(seq)
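A self-contained version of this filter makes the behavior easy to test. The function names and sample sequences are hypothetical, and picking the shortest survivor is an assumption based on the "prefers shorter sequences" description of Pattern 3.

```python
from typing import List, Optional


def filter_invoice_candidates(digit_sequences: List[str]) -> List[str]:
    """Drop sequences that look like dates, bare years, or overly long numbers."""
    valid_sequences = []
    for seq in digit_sequences:
        if len(seq) == 8 and seq.startswith("20"):
            continue  # skip YYYYMMDD dates
        if len(seq) == 4 and seq.startswith("20"):
            continue  # skip year-only values (2024, 2025, 2026)
        if len(seq) > 10:
            continue
        valid_sequences.append(seq)
    return valid_sequences


def pick_invoice_number(digit_sequences: List[str]) -> Optional[str]:
    """Assumed Pattern 3 behavior: prefer the shortest remaining sequence."""
    valid = filter_invoice_candidates(digit_sequences)
    return min(valid, key=len) if valid else None


print(pick_invoice_number(["2025", "458713"]))    # 458713
print(pick_invoice_number(["2026", "20260116"]))  # None
```

One trade-off is that a genuine 4-digit invoice number starting with "20" (for example 2041) would also be skipped, so the rule may need a narrower year range.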
Problem #6 (MEDIUM): InvoiceNumber vs OCR Mismatch
Symptom
- 5 documents show InvoiceNumber different from OCR number
- Example: `87f470da` InvoiceNumber=852460234111905 vs OCR=524602341119055
- Example: `8b0674be` InvoiceNumber=508021404131 vs OCR=50802140413
Root Cause
These mismatches are legitimate: InvoiceNumber and OCR are detected from DIFFERENT YOLO bounding boxes (different regions of the invoice). The InvoiceNumber normalizer picks a shorter sequence from the invoice_number bbox, while the OCR normalizer extracts from the ocr_number bbox. Cross-validation from payment_line should reconcile these, but cross-validation isn't running (0 documents show cross_validation results).
Diagnosis Needed
Check why cross-validation / payment_line parsing isn't populating result.cross_validation even when payment_line is extracted.
Problem #7 (MEDIUM): supplier_org_number 0% Extraction Rate
Symptom
- 0/35 documents extract supplier_org_number
- YOLO detects supplier_org_number in many documents (visible in detection classes)
- When extracted, the field appears as `supplier_organisation_number` (different name)
Root Cause
This is actually a reporting issue. The API returns the field as supplier_organisation_number (full spelling) from CLASS_TO_FIELD mapping, but the analysis expected supplier_org_number. Looking at the actual data, 8/35 documents DO have supplier_organisation_number extracted.
However, the underlying issue is that even when YOLO detects supplier_org_number, the DPI bug prevents text extraction for text PDFs.
Fix
Already addressed by Problem #1 (DPI fix). Additionally, ensure consistent field naming in API documentation.
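On the analysis side, one way to avoid this kind of reporting gap is to map API field names to the names the report expects before counting. The following is only a sketch: `FIELD_ALIASES`, `canonical_fields`, `extraction_rate`, and the placeholder responses are all hypothetical and do not come from the test script.

```python
from typing import Dict, List

# Hypothetical alias table: API field name -> name used in the analysis.
FIELD_ALIASES = {"supplier_organisation_number": "supplier_org_number"}


def canonical_fields(fields: Dict[str, str]) -> Dict[str, str]:
    """Rename aliased fields so both spellings count as the same field."""
    return {FIELD_ALIASES.get(name, name): value for name, value in fields.items()}


def extraction_rate(responses: List[Dict[str, str]], field: str) -> float:
    """Fraction of responses that contain a non-empty value for `field`."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if canonical_fields(r).get(field))
    return hits / len(responses)


# Placeholder data shaped like the finding above: 8 of 35 responses carry the field.
responses = [{"supplier_organisation_number": "<value>"}] * 8 + [{}] * 27
print(round(extraction_rate(responses, "supplier_org_number") * 100, 1))  # 22.9
```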
Problem #8 (LOW): Timeout Failures (4/39 documents)
Symptom
- 4 PDFs timed out at 120 seconds
- File sizes: 89KB, 169KB, 239KB, 970KB (not correlated with size)
Root Cause
Likely multi-page PDFs or PDFs with complex layouts requiring extensive OCR. The 120s timeout in the test script may be too short for multi-page documents + full-page OCR fallback.
Fix
- Increase API timeout for multi-page PDFs
- Add page limit or early termination for very large documents
- Log page count in response to correlate with processing time
Problem #9 (LOW): Non-Invoice Documents in Dataset
Symptom
- `dccf6655`: 0 detections, 0 fields. This is a screenshot of UI buttons, NOT an invoice
Fix
Add document classification as a pre-processing step to reject non-invoice documents before running the expensive YOLO + OCR pipeline.
Problem #10 (LOW): InvoiceDueDate Before InvoiceDate
Symptom
- Document `11de4d07`: InvoiceDate=2026-01-16, InvoiceDueDate=2025-12-01
- Due date is BEFORE invoice date, which is illogical
Root Cause
Either the date normalizer swapped the values, or the YOLO model detected the wrong region for one of the dates. The DPI bug (Problem #1) may also affect date extraction from the correct regions.
Fix
Add post-extraction validation: if InvoiceDueDate < InvoiceDate, either swap them or flag for review.
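A minimal sketch of such a check is shown here, using the flag-for-review option. `validate_date_order` and the warning text are hypothetical, and the dates are assumed to be already normalized to ISO format.

```python
from datetime import date
from typing import List, Optional


def validate_date_order(invoice_date: Optional[date], due_date: Optional[date]) -> List[str]:
    """Return review warnings when the due date precedes the invoice date."""
    warnings = []
    if invoice_date and due_date and due_date < invoice_date:
        warnings.append(
            f"InvoiceDueDate {due_date.isoformat()} is before "
            f"InvoiceDate {invoice_date.isoformat()}"
        )
    return warnings


# Values from document 11de4d07.
issues = validate_date_order(date.fromisoformat("2026-01-16"), date.fromisoformat("2025-12-01"))
print(issues)  # ['InvoiceDueDate 2025-12-01 is before InvoiceDate 2026-01-16']
```

Flagging is the more conservative choice. Swapping silently would hide the case where one of the two dates was read from the wrong region.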
Priority Fix Order
| Priority | Fix | Impact | Effort |
|---|---|---|---|
| 1 | DPI mismatch (Problem #1) | 43% of docs recovered | 1 line change |
| 2 | Fallback amount regex (Problem #2) | 3+ docs with wrong amounts | Small regex fix |
| 3 | Fallback bankgiro regex (Problem #3) | False positive bankgiro | Small regex fix |
| 4 | OCR min digits (Problem #4) | Short OCR numbers supported | 1 line change |
| 5 | Year as InvoiceNumber (Problem #5) | 2+ docs | Small logic add |
| 6 | Date validation (Problem #10) | Logical consistency | Small validation add |
| 7 | Cross-validation (Problem #6) | Better field reconciliation | Investigation needed |
| 8 | Timeouts (Problem #8) | 4 docs | Config change |
| 9 | Document classification (Problem #9) | Filter non-invoices | Feature addition |
Re-run Expected After Fix #1
After fixing the DPI mismatch alone, re-running the same 39 PDFs should show:
- Pure fallback rate dropping from 43% to near 0%
- InvoiceDate extraction rate improving from 31% to ~70%+
- OCR extraction rate improving from 31% to ~60%+
- Average confidence scores increasing significantly
- supplier_organisation_number extraction improving from 23% to ~60%+
Detailed Per-PDF Results Summary
| PDF | Size | Time | Fields | Confidence | Issues |
|---|---|---|---|---|---|
| dccf6655 | 10KB | 17s | 0/0 | - | Not an invoice |
| 4f822b0d | 183KB | 37s | 1/6 | ALL 0.500 | DPI bug: 6 detections, 5 lost |
| d4af7848 | 55KB | 41s | 1/6 | ALL 0.500 | DPI bug: 6 detections, 5 lost |
| 19533483 | 262KB | 39s | 1/9 | ALL 0.500 | DPI bug: 9 detections, 8 lost |
| 2b7e4103 | 25KB | 47s | 3/6 | ALL 0.500 | DPI bug + Amount=1.00 wrong |
| 7717d293 | 34KB | 16s | 3/6 | ALL 0.500 | DPI bug + Amount=1.00 wrong |
| 3226ac59 | 66KB | 42s | 3/5 | ALL 0.500 | DPI bug + Amount=1.00 wrong |
| 0553e5c2 | 31KB | 18s | 3/6 | ALL 0.500 | DPI bug + BG=5000-0000 suspicious |
| 32e90db8 | 136KB | 40s | 3/7 | Mixed | Amount=2026.00 (year?) |
| dc35ee8e | 567KB | 83s | 7/9 | YOLO | InvoiceNumber=2025 (year) |
| 56cabf73 | 67KB | 19s | 5/6 | YOLO | InvoiceNumber=2026 (year) |
| 87f470da | 784KB | 42s | 9/14 | YOLO | InvNum vs OCR mismatch |
| 11de4d07 | 356KB | 68s | 5/3 | Mixed | DueDate < InvoiceDate |
| 0f9047a9 | 415KB | 22s | 8/6 | YOLO | Good extraction |
| 9d0b793c | 286KB | 18s | 8/8 | YOLO | Good extraction |
| 5604d375 | 915KB | 51s | 9/10 | YOLO | Good extraction |
| 87f470da | 784KB | 42s | 9/14 | YOLO | Good extraction |
| f40fd418 | 523KB | 90s | 9/9 | YOLO | Good extraction |
Field Extraction Rate Summary
| Field | Present | Missing | Rate | Avg Conf |
|---|---|---|---|---|
| Bankgiro | 32 | 3 | 91.4% | 0.681 |
| InvoiceNumber | 28 | 7 | 80.0% | 0.695 |
| Amount | 27 | 8 | 77.1% | 0.726 |
| InvoiceDueDate | 13 | 22 | 37.1% | 0.883 |
| InvoiceDate | 11 | 24 | 31.4% | 0.879 |
| OCR | 11 | 24 | 31.4% | 0.900 |
| customer_number | 11 | 24 | 31.4% | 0.926 |
| payment_line | 9 | 26 | 25.7% | 0.938 |
| Plusgiro | 3 | 32 | 8.6% | 0.948 |
| supplier_org_number | 0 | 35 | 0.0% | 0.000 |
Note: Fields with high confidence but low extraction rate (InvoiceDate 0.879, OCR 0.900, payment_line 0.938) indicate the DPI bug: when extraction works (via YOLO), confidence is high. The low rate is because most documents fall back and these fields have no fallback regex pattern.