invoice-master-poc-v2/INFERENCE_ANALYSIS_REPORT.md
Yaojia Wang ad5ed46b4c WIP
2026-02-11 23:40:38 +01:00

Inference Analysis Report

Date: 2026-02-11
Sample: 39 PDFs (diverse sizes, drawn from 1783 total), invoice-sm120 environment

Executive Summary

| Metric | Value |
| --- | --- |
| Total PDFs tested | 39 |
| Successful responses | 35 (89.7%) |
| Timeouts (>120s) | 4 (10.3%) |
| Pure fallback (all fields conf=0.500) | 15/35 (42.9%) |
| Full extraction (all expected fields) | 6/35 (17.1%) |
| supplier_org_number extraction rate | 0% under this name; 8/35 (22.9%) as supplier_organisation_number, see Problem #7 |
| InvoiceDate extraction rate | 31.4% |
| OCR extraction rate | 31.4% |

Root Cause: A critical DPI mismatch bug causes 43% of documents to lose all YOLO-detected field data, falling back to inaccurate regex patterns.


Problem #1 (CRITICAL): DPI Mismatch - Field Extraction Failures

Symptom

  • 15/35 documents (43%) have ALL extracted fields at confidence=0.500 (fallback)
  • YOLO detects fields correctly (6+ detections at conf 0.8-0.97) but text extraction returns nothing
  • Example: 4f822b0d has 6 YOLO detections but only 1 field extracted, and that one via fallback

Root Cause

The pipeline does not pass its DPI to FieldExtractor, which causes a 2x coordinate scaling error.

pipeline.py:237  ->  self.extractor = FieldExtractor(ocr_lang=ocr_lang, use_gpu=use_gpu)
                     ^^^ DPI NOT PASSED! Defaults to 300

The chain:

  1. shared/config.py:22 defines DEFAULT_DPI = 150
  2. InferencePipeline.__init__() receives dpi=150 from ModelConfig
  3. PDF rendered at 150 DPI -> YOLO detections in 150 DPI pixel coordinates
  4. FieldExtractor defaults to dpi=300 (never receives the actual 150)
  5. Coordinate conversion: scale = 72 / self.dpi = 72/300 = 0.24 instead of 72/150 = 0.48
  6. Every bounding-box coordinate is therefore halved in PDF point space, so the box lands on the wrong region -> no tokens match -> empty extraction
  7. Fallback regex triggers with conf=0.500
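
The scaling error in steps 5 and 6 can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the project's code: `pixels_to_points` and the sample bounding box are hypothetical, and only the `72 / dpi` conversion is taken from the chain above.

```python
def pixels_to_points(bbox_px, dpi):
    """Convert a pixel-space bbox (x0, y0, x1, y1) to PDF points (72 per inch)."""
    scale = 72 / dpi
    return tuple(round(v * scale, 2) for v in bbox_px)


# Hypothetical YOLO detection on a page rendered at 150 DPI.
bbox_px = (500, 300, 900, 340)

correct = pixels_to_points(bbox_px, dpi=150)  # DPI the page was rendered at
wrong = pixels_to_points(bbox_px, dpi=300)    # FieldExtractor's default DPI

print("render DPI :", correct)
print("default DPI:", wrong)
```

With the default DPI every coordinate comes out at half its true value, so the box no longer covers the tokens it was drawn around.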

Fix

File: packages/backend/backend/pipeline/pipeline.py, line 237

# BEFORE (broken):
self.extractor = FieldExtractor(ocr_lang=ocr_lang, use_gpu=use_gpu)

# AFTER (fixed):
self.extractor = FieldExtractor(ocr_lang=ocr_lang, use_gpu=use_gpu, dpi=dpi)

Impact

This single-line fix should move the 15 pure-fallback documents (~43% of successful responses) from degraded fallback back to proper YOLO+OCR extraction.


Problem #2 (HIGH): Fallback Amount Extraction Grabs Wrong Values

Symptom

  • 3 documents extracted Amount=1.00 when actual amounts are 7500.00, etc.
  • Fallback regex matches the table column header "Summa" followed by the first number in the next row (the article number "1" in the example below) instead of the total

Example

Document 2b7e4103 (Astra Football Club):

  • Actual amount: 7 500,00 SEK
  • Extracted: 1.00 (from "Summa 1" where "1" is the article number in the next row)

Root Cause

The fallback Amount regex in pipeline.py:676:

r'(?:att\s*betala|summa|total|belopp)\s*[:.]?\s*([\d\s,\.]+)\s*(?:SEK|kr)?'

matches "Summa" (column header) followed by "1" (the first value in the next row), because PaddleOCR produces tokens in position order. The greedy [\d\s,\.]+ captures "1" and stops at "Medlemsavgift".

Fix

File: packages/backend/backend/pipeline/pipeline.py, lines 674-688

  1. Require minimum amount value in fallback (e.g., > 10.00)
  2. Require the matched amount to have a decimal separator (, or .) to avoid matching integers
  3. Prefer "ATT BETALA" over "Summa" as the keyword (less ambiguous)

Revised patterns covering items 2 and 3 (item 1 needs a separate check on the parsed value):

'Amount': [
    r'(?:att\s+betala)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
    r'([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)\s*$',
    r'(?:summa|total|belopp)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
],
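
The effect of the revised patterns can be checked against a flattened OCR string. The sketch below makes several assumptions: patterns are tried in list order with `re.IGNORECASE`, the sample text is a made-up reconstruction of the 2b7e4103 layout, and `fallback_amount` and `parse_amount` are hypothetical helpers that also apply the minimum-value check from item 1.

```python
import re

OLD_PATTERNS = [
    r'(?:att\s*betala|summa|total|belopp)\s*[:.]?\s*([\d\s,\.]+)\s*(?:SEK|kr)?',
]
NEW_PATTERNS = [
    r'(?:att\s+betala)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
    r'([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)\s*$',
    r'(?:summa|total|belopp)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
]


def parse_amount(raw):
    """Turn '7 500,00' into 7500.0 (whitespace removed, decimal comma to dot)."""
    return float(re.sub(r"\s", "", raw).replace(",", "."))


def fallback_amount(patterns, text, minimum=0.0):
    """Return the first parsed amount above `minimum`, trying patterns in order."""
    for pattern in patterns:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if not match:
            continue
        try:
            value = parse_amount(match.group(1))
        except ValueError:
            continue  # captured text was not a usable number
        if value > minimum:
            return value
    return None


# Made-up OCR text: column headers, one table row, then the total line.
ocr_text = "Art.nr Benämning Summa 1 Medlemsavgift 7 500,00 Att betala 7 500,00 SEK"

old_amount = fallback_amount(OLD_PATTERNS, ocr_text)
new_amount = fallback_amount(NEW_PATTERNS, ocr_text, minimum=10.0)
print(old_amount, new_amount)
```

On this text the old pattern stops at the article number after "Summa", while the revised list reaches the "Att betala" total first.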

Problem #3 (HIGH): Fallback Bankgiro Regex False Positives

Symptom

  • Document 2b7e4103 extracts Bankgiro=2546-1610 but the actual document has NO Bankgiro
  • The document has Plusgiro=2131575-9 and Org.nr=802546-1610

Root Cause

Fallback Bankgiro regex in pipeline.py:681:

r'(\d{4}[-\s]\d{4})\s*(?=\s|$)'

matches the LAST 8 digits of org number "802546-1610" as "2546-1610".

Fix

File: packages/backend/backend/pipeline/pipeline.py, line 681

Require a keyword where possible, and add a negative lookbehind and lookahead to the bare pattern so it cannot match inside a longer digit run:

'Bankgiro': [
    r'(?:bankgiro|bg)\s*[:.]?\s*(\d{3,4}[-\s]?\d{4})',
    r'(?<!\d)(\d{3,4}[-\s]\d{4})(?!\d)',  # Must not be preceded/followed by digits
],
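
A quick check of the old and revised patterns against the values quoted for 2b7e4103. This is a sketch under assumptions: `find_bankgiro` is a hypothetical helper, patterns are assumed to be tried in order with `re.IGNORECASE`, and the Bankgiro number in the second string is made up.

```python
import re

OLD_PATTERNS = [r'(\d{4}[-\s]\d{4})\s*(?=\s|$)']
NEW_PATTERNS = [
    r'(?:bankgiro|bg)\s*[:.]?\s*(\d{3,4}[-\s]?\d{4})',
    r'(?<!\d)(\d{3,4}[-\s]\d{4})(?!\d)',  # not preceded or followed by a digit
]


def find_bankgiro(patterns, text):
    """Return the first capture from the first pattern that matches, else None."""
    for pattern in patterns:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.group(1)
    return None


# Values from the report: a Plusgiro and an org number, but no Bankgiro.
no_bg_text = "Plusgiro: 2131575-9 Org.nr: 802546-1610"
# Same text with a made-up Bankgiro number added.
with_bg_text = "Bankgiro: 123-4567 Org.nr: 802546-1610"

print(find_bankgiro(OLD_PATTERNS, no_bg_text))   # tail of the org number
print(find_bankgiro(NEW_PATTERNS, no_bg_text))   # no match
print(find_bankgiro(NEW_PATTERNS, with_bg_text))
```

The lookbehind rejects "2546-1610" because it is preceded by the digit "0", so the org number no longer produces a Bankgiro value.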

Problem #4 (MEDIUM): OCR Number Minimum 5-Digit Requirement

Symptom

  • Document 2b7e4103 has OCR=3046 (4 digits) which is valid but rejected by normalizer
  • OcrNumberNormalizer requires minimum 5 digits

Root Cause

File: packages/backend/backend/pipeline/normalizers/ocr_number.py, line 32:

if len(digits) < 5:
    return NormalizationResult.failure(f"Too few digits for OCR: {len(digits)}")

Swedish OCR numbers can be 2-25 digits. The 5-digit minimum is too restrictive.

Fix

Lower the minimum to 2 digits, the shortest length in the 2-25 digit range above:

if len(digits) < 2:
    return NormalizationResult.failure(f"Too few digits for OCR: {len(digits)}")
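
Lowering the minimum accepts more short digit strings, so a checksum test may help keep false positives down. The sketch below is not part of the fix above: it assumes the OCR reference ends in a mod-10 (Luhn) check digit, which should be verified against the payees in the dataset, and `luhn_valid` and `plausible_ocr` are hypothetical helpers.

```python
def luhn_valid(digits: str) -> bool:
    """Mod-10 (Luhn) check: double every second digit counted from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def plausible_ocr(digits: str, min_len: int = 2, max_len: int = 25) -> bool:
    """Length check from the report (2-25 digits) plus the assumed Luhn check."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    return min_len <= len(digits) <= max_len and luhn_valid(digits)


print(plausible_ocr("3046"))  # OCR value from document 2b7e4103
print(plausible_ocr("3047"))  # same value with the last digit altered
print(plausible_ocr("7"))     # shorter than the 2-digit minimum
```

Under that assumption the 4-digit value from 2b7e4103 passes, while a string with an altered last digit is rejected.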

Problem #5 (MEDIUM): InvoiceNumber Extracts Year (2025, 2026)

Symptom

  • 2 documents extract year as invoice number: "2025", "2026"
  • dc35ee8e: actual invoice number visible in PDF but normalizer picks up year
  • 56cabf73: InvoiceNumber=2026

Root Cause

File: packages/backend/backend/pipeline/normalizers/invoice_number.py, lines 54-72

The "Pattern 3: Short digit sequence" strategy prefers shorter sequences. When the YOLO bbox contains both the year "2025" and the actual invoice number, the shorter "2025" (4 digits) wins over a longer sequence.

Fix

Add year exclusion to Pattern 3:

for seq in digit_sequences:
    if len(seq) == 8 and seq.startswith("20"):
        continue  # Skip YYYYMMDD dates
    if len(seq) == 4 and seq.startswith("20"):
        continue  # Skip year-only values (2024, 2025, 2026)
    if len(seq) > 10:
        continue
    valid_sequences.append(seq)
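
A self-contained version of the same filter, for trying it on sample text. This is a sketch: `candidate_invoice_numbers` and `pick_invoice_number` are hypothetical names, the digit sequences are assumed to come from a plain `\d+` scan, and picking the shortest candidate stands in for the "prefers shorter sequences" behaviour described above.

```python
import re


def candidate_invoice_numbers(text: str) -> list[str]:
    """Collect digit runs, dropping dates, year-like values and long runs."""
    valid_sequences = []
    for seq in re.findall(r"\d+", text):
        if len(seq) == 8 and seq.startswith("20"):
            continue  # Skip YYYYMMDD dates
        if len(seq) == 4 and seq.startswith("20"):
            continue  # Skip 4-digit values that look like a year (20xx)
        if len(seq) > 10:
            continue
        valid_sequences.append(seq)
    return valid_sequences


def pick_invoice_number(text: str):
    """Return the shortest remaining candidate, or None if nothing is left."""
    candidates = candidate_invoice_numbers(text)
    return min(candidates, key=len) if candidates else None


# Made-up bbox text containing both a year and an invoice number.
print(pick_invoice_number("Faktura 2025 nr 1234567"))
print(pick_invoice_number("20250115 2026"))
```

One trade-off to keep in mind: the 4-digit rule also drops genuine invoice numbers in the range 2000-2099, so a narrower year window may be preferable if such numbers occur in the dataset.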

Problem #6 (MEDIUM): InvoiceNumber vs OCR Mismatch

Symptom

  • 5 documents show InvoiceNumber different from OCR number
  • Example: 87f470da InvoiceNumber=852460234111905 vs OCR=524602341119055
  • Example: 8b0674be InvoiceNumber=508021404131 vs OCR=50802140413

Root Cause

These mismatches are expected at the extraction level: InvoiceNumber and OCR are detected from DIFFERENT YOLO bounding boxes (different regions of the invoice). The InvoiceNumber normalizer picks a shorter sequence from the invoice_number bbox, while the OCR normalizer extracts from the ocr_number bbox. Cross-validation against payment_line should reconcile the two, but cross-validation is not running: 0 documents show cross_validation results.

Diagnosis Needed

Check why cross-validation / payment_line parsing isn't populating result.cross_validation even when payment_line is extracted.


Problem #7 (MEDIUM): supplier_org_number 0% Extraction Rate

Symptom

  • 0/35 documents extract supplier_org_number
  • YOLO detects supplier_org_number in many documents (visible in detection classes)
  • When extracted, the field appears as supplier_organisation_number (different name)

Root Cause

This is actually a reporting issue. The API returns the field as supplier_organisation_number (full spelling) from CLASS_TO_FIELD mapping, but the analysis expected supplier_org_number. Looking at the actual data, 8/35 documents DO have supplier_organisation_number extracted.

However, the underlying issue is that even when YOLO detects supplier_org_number, the DPI bug prevents text extraction for text PDFs.

Fix

Already addressed by Problem #1 (DPI fix). Additionally, ensure consistent field naming in API documentation.


Problem #8 (LOW): Timeout Failures (4/39 documents)

Symptom

  • 4 PDFs timed out at 120 seconds
  • File sizes: 89KB, 169KB, 239KB, 970KB (not correlated with size)

Root Cause

Likely multi-page PDFs or PDFs with complex layouts requiring extensive OCR. The 120s timeout in the test script may be too short for multi-page documents + full-page OCR fallback.

Fix

  1. Increase API timeout for multi-page PDFs
  2. Add page limit or early termination for very large documents
  3. Log page count in response to correlate with processing time

Problem #9 (LOW): Non-Invoice Documents in Dataset

Symptom

  • dccf6655: 0 detections, 0 fields - this is a screenshot of UI buttons, NOT an invoice

Fix

Add document classification as a pre-processing step to reject non-invoice documents before running the expensive YOLO + OCR pipeline.


Problem #10 (LOW): InvoiceDueDate Before InvoiceDate

Symptom

  • Document 11de4d07: InvoiceDate=2026-01-16, InvoiceDueDate=2025-12-01
  • Due date is BEFORE invoice date, which is illogical

Root Cause

Either the date normalizer swapped the values, or the YOLO model detected the wrong region for one of the dates. The DPI bug (Problem #1) may also affect date extraction from the correct regions.

Fix

Add post-extraction validation: if InvoiceDueDate < InvoiceDate, either swap them or flag for review.
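
A minimal sketch of the "flag for review" variant of this check. The function name and the shape of the returned warnings are hypothetical, and swapping is left out because the data alone does not show which of the two dates is wrong.

```python
from datetime import date


def check_date_order(invoice_date, due_date):
    """Return a list of review warnings; empty when the dates are consistent."""
    if invoice_date is None or due_date is None:
        return []  # nothing to compare
    if due_date < invoice_date:
        return [
            f"InvoiceDueDate {due_date.isoformat()} is before "
            f"InvoiceDate {invoice_date.isoformat()}"
        ]
    return []


# Values reported for document 11de4d07.
warnings = check_date_order(
    date.fromisoformat("2026-01-16"),
    date.fromisoformat("2025-12-01"),
)
print(warnings)
```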


Priority Fix Order

| Priority | Fix | Impact | Effort |
| --- | --- | --- | --- |
| 1 | DPI mismatch (Problem #1) | 43% of docs recovered | 1 line change |
| 2 | Fallback amount regex (Problem #2) | 3+ docs with wrong amounts | Small regex fix |
| 3 | Fallback bankgiro regex (Problem #3) | False positive bankgiro | Small regex fix |
| 4 | OCR min digits (Problem #4) | Short OCR numbers supported | 1 line change |
| 5 | Year as InvoiceNumber (Problem #5) | 2+ docs | Small logic add |
| 6 | Date validation (Problem #10) | Logical consistency | Small validation add |
| 7 | Cross-validation (Problem #6) | Better field reconciliation | Investigation needed |
| 8 | Timeouts (Problem #8) | 4 docs | Config change |
| 9 | Document classification (Problem #9) | Filter non-invoices | Feature addition |

Re-run Expected After Fix #1

After fixing the DPI mismatch alone, re-running the same 39 PDFs should show:

  • Pure fallback rate dropping from 43% to near 0%
  • InvoiceDate extraction rate improving from 31% to ~70%+
  • OCR extraction rate improving from 31% to ~60%+
  • Average confidence scores increasing significantly
  • supplier_organisation_number extraction improving from 23% to ~60%+

Detailed Per-PDF Results Summary

| PDF | Size | Time | Fields (extracted/detected) | Confidence | Issues |
| --- | --- | --- | --- | --- | --- |
| dccf6655 | 10KB | 17s | 0/0 | - | Not an invoice |
| 4f822b0d | 183KB | 37s | 1/6 | ALL 0.500 | DPI bug: 6 detections, 5 lost |
| d4af7848 | 55KB | 41s | 1/6 | ALL 0.500 | DPI bug: 6 detections, 5 lost |
| 19533483 | 262KB | 39s | 1/9 | ALL 0.500 | DPI bug: 9 detections, 8 lost |
| 2b7e4103 | 25KB | 47s | 3/6 | ALL 0.500 | DPI bug + Amount=1.00 wrong |
| 7717d293 | 34KB | 16s | 3/6 | ALL 0.500 | DPI bug + Amount=1.00 wrong |
| 3226ac59 | 66KB | 42s | 3/5 | ALL 0.500 | DPI bug + Amount=1.00 wrong |
| 0553e5c2 | 31KB | 18s | 3/6 | ALL 0.500 | DPI bug + BG=5000-0000 suspicious |
| 32e90db8 | 136KB | 40s | 3/7 | Mixed | Amount=2026.00 (year?) |
| dc35ee8e | 567KB | 83s | 7/9 | YOLO | InvoiceNumber=2025 (year) |
| 56cabf73 | 67KB | 19s | 5/6 | YOLO | InvoiceNumber=2026 (year) |
| 87f470da | 784KB | 42s | 9/14 | YOLO | InvNum vs OCR mismatch; otherwise good extraction |
| 11de4d07 | 356KB | 68s | 5/3 | Mixed | DueDate < InvoiceDate |
| 0f9047a9 | 415KB | 22s | 8/6 | YOLO | Good extraction |
| 9d0b793c | 286KB | 18s | 8/8 | YOLO | Good extraction |
| 5604d375 | 915KB | 51s | 9/10 | YOLO | Good extraction |
| f40fd418 | 523KB | 90s | 9/9 | YOLO | Good extraction |

Field Extraction Rate Summary

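| Field | Present | Missing | Rate | Avg Conf |
| --- | --- | --- | --- | --- |
| Bankgiro | 32 | 3 | 91.4% | 0.681 |
| InvoiceNumber | 28 | 7 | 80.0% | 0.695 |
| Amount | 27 | 8 | 77.1% | 0.726 |
| InvoiceDueDate | 13 | 22 | 37.1% | 0.883 |
| InvoiceDate | 11 | 24 | 31.4% | 0.879 |
| OCR | 11 | 24 | 31.4% | 0.900 |
| customer_number | 11 | 24 | 31.4% | 0.926 |
| payment_line | 9 | 26 | 25.7% | 0.938 |
| Plusgiro | 3 | 32 | 8.6% | 0.948 |
| supplier_org_number | 0 | 35 | 0.0% | 0.000 |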

Note: Fields with high confidence but low extraction rate (InvoiceDate 0.879, OCR 0.900, payment_line 0.938) indicate the DPI bug: when extraction works (via YOLO), confidence is high. The low rate is because most documents fall back and these fields have no fallback regex pattern.