Yaojia Wang ad5ed46b4c WIP

2026-02-11 23:40:38 +01:00

Inference Analysis Report

Date: 2026-02-11
Sample: 39 PDFs (diverse sizes from 1783 total), invoice-sm120 environment

Executive Summary

Metric	Value
Total PDFs tested	39
Successful responses	35 (89.7%)
Timeouts (>120s)	4 (10.3%)
Pure fallback (all fields conf=0.500)	15/35 (42.9%)
Full extraction (all expected fields)	6/35 (17.1%)
supplier_org_number extraction rate	0% (field-name mismatch in the analysis; see Problem #7)
InvoiceDate extraction rate	31.4%
OCR extraction rate	31.4%

Root Cause: A critical DPI mismatch bug causes 15 of the 35 successful responses (43%) to lose all YOLO-detected field data and fall back to inaccurate regex patterns.

Problem #1 (CRITICAL): DPI Mismatch - Field Extraction Failures

Symptom

15/35 documents (43%) have ALL extracted fields at confidence=0.500 (fallback)
YOLO detects fields correctly (6+ detections at conf 0.8-0.97) but text extraction returns nothing
Example: 4f822b0d has 6 YOLO detections but only 1 field extracted, and that one came from the fallback

Root Cause

The DPI is not passed from the pipeline to FieldExtractor, which causes a 2x coordinate scaling error.

pipeline.py:237  ->  self.extractor = FieldExtractor(ocr_lang=ocr_lang, use_gpu=use_gpu)
                     ^^^ DPI NOT PASSED! Defaults to 300

The chain:

shared/config.py:22 defines DEFAULT_DPI = 150
InferencePipeline.__init__() receives dpi=150 from ModelConfig
PDF rendered at 150 DPI -> YOLO detections in 150 DPI pixel coordinates
FieldExtractor defaults to dpi=300 (never receives the actual 150)
Coordinate conversion: scale = 72 / self.dpi = 72/300 = 0.24 instead of 72/150 = 0.48
Bounding boxes are halved in PDF point space -> no tokens match -> empty extraction
Fallback regex triggers with conf=0.500
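The arithmetic in the chain above can be checked with a minimal sketch. The helper name pixels_to_points and the sample bounding box are hypothetical; only the 72 / dpi scale factor and the two DPI values come from the report.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF coordinates use 72 points per inch

BBox = Tuple[float, float, float, float]


def pixels_to_points(bbox_px: BBox, dpi: int) -> BBox:
    """Convert a pixel-space box (x0, y0, x1, y1) to PDF points."""
    scale = POINTS_PER_INCH / dpi
    return tuple(round(v * scale, 2) for v in bbox_px)


# Hypothetical YOLO detection on a page rendered at 150 DPI.
bbox_px = (300.0, 600.0, 900.0, 660.0)

correct = pixels_to_points(bbox_px, dpi=150)  # DPI the page was rendered at
wrong = pixels_to_points(bbox_px, dpi=300)    # FieldExtractor default

print(correct)  # (144.0, 288.0, 432.0, 316.8)
print(wrong)    # (72.0, 144.0, 216.0, 158.4)
```

Every coordinate in the second tuple is half of the first, so the box lands in the wrong part of the page and overlaps none of the tokens it was meant to cover.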

Fix

File: packages/backend/backend/pipeline/pipeline.py, line 237

# BEFORE (broken):
self.extractor = FieldExtractor(ocr_lang=ocr_lang, use_gpu=use_gpu)

# AFTER (fixed):
self.extractor = FieldExtractor(ocr_lang=ocr_lang, use_gpu=use_gpu, dpi=dpi)

Impact

This single-line fix should move the ~43% of successful responses that are currently in pure fallback back to proper YOLO+OCR extraction.

Problem #2 (HIGH): Fallback Amount Extraction Grabs Wrong Values

Symptom

3 documents extracted Amount=1.00 although the actual totals are far larger (e.g., 7 500,00)
Fallback regex matches table column header "Summa" followed by row quantity "1,00" instead of total

Example

Document 2b7e4103 (Astra Football Club):

Actual amount: 7 500,00 SEK
Extracted: 1.00 (from "Summa 1" where "1" is the article number in the next row)

Root Cause

The fallback Amount regex in pipeline.py:676:

r'(?:att\s*betala|summa|total|belopp)\s*[:.]?\s*([\d\s,\.]+)\s*(?:SEK|kr)?'

matches "Summa" (the column header) followed by "1" (the first token of the next row), because PaddleOCR emits tokens in position order. The greedy character class [\d\s,\.]+ captures "1" and stops at the first non-matching character, the "M" of "Medlemsavgift".
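The false match can be reproduced in isolation. The OCR text below is a hypothetical reconstruction of the table layout described above, and the re.IGNORECASE flag is an assumption about how the pipeline compiles the pattern; only the pattern itself is taken from the report.

```python
import re

# Fallback Amount pattern as quoted in the report (flag is an assumption).
OLD_AMOUNT_PATTERN = re.compile(
    r'(?:att\s*betala|summa|total|belopp)\s*[:.]?\s*([\d\s,\.]+)\s*(?:SEK|kr)?',
    re.IGNORECASE,
)

# Hypothetical OCR output: header row, then the first line item.
ocr_text = "Artikel Benämning Antal Summa\n1 Medlemsavgift 1,00 7 500,00"

match = OLD_AMOUNT_PATTERN.search(ocr_text)
captured = match.group(1).strip()
print(captured)  # 1
```

The keyword "Summa" is satisfied by the column header, and the newline between the header and the next row is consumed by \s*, so the first digit of the following row becomes the "amount".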

Fix

File: packages/backend/backend/pipeline/pipeline.py, lines 674-688

Require minimum amount value in fallback (e.g., > 10.00); this is a check on the parsed value and is not expressed in the patterns below
Require the matched amount to have a decimal separator (, or .) to avoid matching integers
Prefer "ATT BETALA" over "Summa" as the keyword (less ambiguous)

'Amount': [
    r'(?:att\s+betala)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
    r'([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)\s*$',
    r'(?:summa|total|belopp)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
],
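A minimal sketch of how the three proposed patterns and the minimum-value check could work together. The function name extract_fallback_amount, the 10.00 threshold constant, and the regex flags are assumptions for illustration; the patterns are the ones listed above.

```python
import re
from decimal import Decimal
from typing import Optional

# Proposed patterns from the report, tried in priority order.
AMOUNT_PATTERNS = [
    r'(?:att\s+betala)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
    r'([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)\s*$',
    r'(?:summa|total|belopp)\s*[:.]?\s*([\d\s]+[,\.]\d{2})\s*(?:SEK|kr)?',
]
MIN_FALLBACK_AMOUNT = Decimal("10.00")  # threshold suggested in the report


def extract_fallback_amount(text: str) -> Optional[Decimal]:
    """Return the first plausible amount, or None if nothing qualifies."""
    for pattern in AMOUNT_PATTERNS:
        for match in re.finditer(pattern, text, re.IGNORECASE | re.MULTILINE):
            # "7 500,00" -> "7500.00"
            raw = re.sub(r"\s", "", match.group(1)).replace(",", ".")
            amount = Decimal(raw)
            if amount > MIN_FALLBACK_AMOUNT:
                return amount
    return None


header_only = "Artikel Benämning Antal Summa\n1 Medlemsavgift 1,00 7 500,00"
with_total = header_only + "\nAtt betala: 7 500,00 SEK"

print(extract_fallback_amount(with_total))   # 7500.00
print(extract_fallback_amount(header_only))  # None
```

With the decimal separator required, "Summa" followed by a bare "1" no longer matches, so the header-only text yields no amount rather than a wrong one.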

Problem #3 (HIGH): Fallback Bankgiro Regex False Positives

Symptom

Document 2b7e4103 extracts Bankgiro=2546-1610 but the actual document has NO Bankgiro
The document has Plusgiro=2131575-9 and Org.nr=802546-1610

Root Cause

Fallback Bankgiro regex in pipeline.py:681:

r'(\d{4}[-\s]\d{4})\s*(?=\s|$)'

matches the LAST 8 digits of org number "802546-1610" as "2546-1610".

Fix

File: packages/backend/backend/pipeline/pipeline.py, line 681

Add a negative lookbehind and a negative lookahead so the pattern cannot match inside a longer number, and try a keyword-anchored pattern first:

'Bankgiro': [
    r'(?:bankgiro|bg)\s*[:.]?\s*(\d{3,4}[-\s]?\d{4})',
    r'(?<!\d)(\d{3,4}[-\s]\d{4})(?!\d)',  # Must not be preceded/followed by digits
],
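The difference between the old and the proposed patterns can be seen on the values from document 2b7e4103. This is a sketch: the helper find_bankgiro and the sample text are hypothetical, and re.IGNORECASE on the keyword pattern is an assumption.

```python
import re
from typing import Optional

# Old fallback pattern as quoted in the report.
OLD_BANKGIRO = re.compile(r'(\d{4}[-\s]\d{4})\s*(?=\s|$)')

# Proposed patterns: keyword-anchored first, then digit-bounded.
NEW_BANKGIRO_PATTERNS = [
    re.compile(r'(?:bankgiro|bg)\s*[:.]?\s*(\d{3,4}[-\s]?\d{4})', re.IGNORECASE),
    re.compile(r'(?<!\d)(\d{3,4}[-\s]\d{4})(?!\d)'),
]


def find_bankgiro(text: str) -> Optional[str]:
    """Return the first Bankgiro candidate found by the new patterns."""
    for pattern in NEW_BANKGIRO_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


text = "Org.nr: 802546-1610 Plusgiro: 2131575-9"

old_hit = OLD_BANKGIRO.search(text)
print(old_hit.group(1))     # 2546-1610 (tail of the org number)
print(find_bankgiro(text))  # None
```

The lookbehind rejects "2546-1610" because it is preceded by the digit "0", and the Plusgiro value never qualifies because only one digit follows its hyphen.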

Problem #4 (MEDIUM): OCR Number Minimum 5-Digit Requirement

Symptom

Document 2b7e4103 has OCR=3046 (4 digits) which is valid but rejected by normalizer
OcrNumberNormalizer requires minimum 5 digits

Root Cause

File: packages/backend/backend/pipeline/normalizers/ocr_number.py, line 32:

if len(digits) < 5:
    return NormalizationResult.failure(f"Too few digits for OCR: {len(digits)}")

Swedish OCR numbers can be 2-25 digits. The 5-digit minimum is too restrictive.

Fix

Lower minimum to 2 digits (or possibly 1 for very short OCR references):

if len(digits) < 2:
    return NormalizationResult.failure(f"Too few digits for OCR: {len(digits)}")
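Lowering the minimum length makes short digit runs more likely to be accepted by mistake. One possible mitigation, sketched below, is to pair the 2-25 digit bound with an optional modulus-10 (Luhn) check, since OCR references commonly end in such a check digit. Whether every supplier in this dataset follows that convention is an assumption, and check_ocr_candidate and luhn_valid are hypothetical helpers, not part of OcrNumberNormalizer.

```python
import re
from typing import Optional

MIN_OCR_DIGITS = 2   # lower bound stated in the report
MAX_OCR_DIGITS = 25  # upper bound stated in the report


def luhn_valid(digits: str) -> bool:
    """Modulus-10 check: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def check_ocr_candidate(text: str, require_luhn: bool = False) -> Optional[str]:
    """Return the digit string if it passes the length (and Luhn) checks."""
    digits = re.sub(r"\D", "", text)
    if not MIN_OCR_DIGITS <= len(digits) <= MAX_OCR_DIGITS:
        return None
    if require_luhn and not luhn_valid(digits):
        return None
    return digits


print(check_ocr_candidate("3046"))                     # 3046
print(check_ocr_candidate("3046", require_luhn=True))  # 3046
print(check_ocr_candidate("7"))                        # None
```

The 4-digit value from document 2b7e4103 passes both checks, while a value such as 3047 would be rejected when the Luhn check is enabled.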

Problem #5 (MEDIUM): InvoiceNumber Extracts Year (2025, 2026)

Symptom

2 documents extract year as invoice number: "2025", "2026"
dc35ee8e: actual invoice number visible in PDF but normalizer picks up year
56cabf73: InvoiceNumber=2026

Root Cause

File: packages/backend/backend/pipeline/normalizers/invoice_number.py, lines 54-72

The "Pattern 3: Short digit sequence" strategy prefers shorter sequences. When the YOLO bbox contains both the year "2025" and the actual invoice number, the shorter "2025" (4 digits) wins over a longer sequence.

Fix

Add year exclusion to Pattern 3:

for seq in digit_sequences:
    if len(seq) == 8 and seq.startswith("20"):
        continue  # Skip YYYYMMDD dates
    if len(seq) == 4 and seq.startswith("20"):
        continue  # Skip year-only values (2024, 2025, 2026)
    if len(seq) > 10:
        continue
    valid_sequences.append(seq)
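The same filter as a self-contained function that can be unit-tested apart from the normalizer. The function name is hypothetical; the three skip rules are copied from the snippet above.

```python
from typing import List


def filter_invoice_number_candidates(digit_sequences: List[str]) -> List[str]:
    """Drop digit runs that look like dates, bare years, or overlong IDs."""
    valid_sequences = []
    for seq in digit_sequences:
        if len(seq) == 8 and seq.startswith("20"):
            continue  # Skip YYYYMMDD dates
        if len(seq) == 4 and seq.startswith("20"):
            continue  # Skip year-only values (2024, 2025, 2026)
        if len(seq) > 10:
            continue  # Too long for this short-sequence strategy
        valid_sequences.append(seq)
    return valid_sequences


candidates = ["2025", "20250115", "123456", "12345678901"]
print(filter_invoice_number_candidates(candidates))  # ['123456']
```

One trade-off to keep in mind: a genuine 4-digit invoice number that starts with "20" would also be skipped by this rule.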

Problem #6 (MEDIUM): InvoiceNumber vs OCR Mismatch

Symptom

5 documents show InvoiceNumber different from OCR number
Example: 87f470da InvoiceNumber=852460234111905 vs OCR=524602341119055
Example: 8b0674be InvoiceNumber=508021404131 vs OCR=50802140413

Root Cause

These mismatches are expected: InvoiceNumber and OCR are read from DIFFERENT YOLO bounding boxes (different regions of the invoice). The InvoiceNumber normalizer picks a shorter sequence from the invoice_number bbox, while the OCR normalizer extracts from the ocr_number bbox. Cross-validation against payment_line should reconcile the two values, but cross-validation is not running (0 documents show cross_validation results).

Diagnosis Needed

Check why cross-validation / payment_line parsing isn't populating result.cross_validation even when payment_line is extracted.
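While that investigation is pending, a small diagnostic can quantify how much two mismatching values have in common. This sketch uses difflib from the standard library; the helper name longest_shared_run is hypothetical and is not part of the pipeline.

```python
from difflib import SequenceMatcher


def longest_shared_run(a: str, b: str) -> str:
    """Return the longest contiguous substring that a and b share."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


# Values from documents 87f470da and 8b0674be.
run_1 = longest_shared_run("852460234111905", "524602341119055")
run_2 = longest_shared_run("508021404131", "50802140413")

print(run_1, len(run_1))  # 52460234111905 14
print(run_2, len(run_2))  # 50802140413 11
```

In both quoted examples the two fields differ only by a digit at one end, which suggests a mismatch of this kind is something cross-validation could plausibly reconcile once it runs.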

Problem #7 (MEDIUM): supplier_org_number 0% Extraction Rate

Symptom

0/35 documents extract supplier_org_number
YOLO detects supplier_org_number in many documents (visible in detection classes)
When extracted, the field appears as supplier_organisation_number (different name)

Root Cause

This is actually a reporting issue. The API returns the field as supplier_organisation_number (full spelling) from CLASS_TO_FIELD mapping, but the analysis expected supplier_org_number. Looking at the actual data, 8/35 documents DO have supplier_organisation_number extracted.

However, the underlying issue is that even when YOLO detects supplier_org_number, the DPI bug prevents text extraction for text PDFs.

Fix

Already addressed by Problem #1 (DPI fix). Additionally, ensure consistent field naming in API documentation.
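On the analysis side, the naming mismatch can be avoided by resolving aliases before counting. This is a minimal sketch: FIELD_ALIASES, canonical_field, extraction_rate, and the sample results are hypothetical; only the two field names come from the report.

```python
from typing import Dict, List

# Name used by the analysis -> name returned by the API.
FIELD_ALIASES = {"supplier_org_number": "supplier_organisation_number"}


def canonical_field(name: str) -> str:
    """Map an alias to the field name the API actually returns."""
    return FIELD_ALIASES.get(name, name)


def extraction_rate(results: List[Dict[str, str]], field: str) -> float:
    """Fraction of results in which the field is present and non-empty."""
    if not results:
        return 0.0
    key = canonical_field(field)
    hits = sum(1 for fields in results if fields.get(key))
    return hits / len(results)


# Hypothetical per-document field dictionaries.
results = [
    {"supplier_organisation_number": "802546-1610"},
    {"Amount": "7500.00"},
    {},
    {},
]
print(extraction_rate(results, "supplier_org_number"))  # 0.25
```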

Problem #8 (LOW): Timeout Failures (4/39 documents)

Symptom

4 PDFs timed out at 120 seconds
File sizes: 89KB, 169KB, 239KB, 970KB (not correlated with size)

Root Cause

Likely multi-page PDFs or PDFs with complex layouts requiring extensive OCR. The 120s timeout in the test script may be too short for multi-page documents + full-page OCR fallback.

Fix

Increase API timeout for multi-page PDFs
Add page limit or early termination for very large documents
Log page count in response to correlate with processing time

Problem #9 (LOW): Non-Invoice Documents in Dataset

Symptom

dccf6655: 0 detections, 0 fields - this is a screenshot of UI buttons, NOT an invoice

Fix

Add document classification as a pre-processing step to reject non-invoice documents before running the expensive YOLO + OCR pipeline.

Problem #10 (LOW): InvoiceDueDate Before InvoiceDate

Symptom

Document 11de4d07: InvoiceDate=2026-01-16, InvoiceDueDate=2025-12-01
Due date is BEFORE invoice date, which is illogical

Root Cause

Either the date normalizer swapped the values, or the YOLO model detected the wrong region for one of the dates. The DPI bug (Problem #1) may also affect date extraction from the correct regions.

Fix

Add post-extraction validation: if InvoiceDueDate < InvoiceDate, either swap them or flag for review.
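A minimal sketch of the flag-for-review variant. The function name check_date_order and the warning text are hypothetical. Flagging is shown rather than swapping because the root cause (swapped values versus a wrong region) is not yet established.

```python
from datetime import date
from typing import Optional


def check_date_order(invoice_date: Optional[date],
                     due_date: Optional[date]) -> Optional[str]:
    """Return a warning if the due date precedes the invoice date."""
    if invoice_date is None or due_date is None:
        return None  # Nothing to compare
    if due_date < invoice_date:
        return (
            f"InvoiceDueDate {due_date.isoformat()} is before "
            f"InvoiceDate {invoice_date.isoformat()}"
        )
    return None


# Values from document 11de4d07.
warning = check_date_order(date(2026, 1, 16), date(2025, 12, 1))
print(warning)
# InvoiceDueDate 2025-12-01 is before InvoiceDate 2026-01-16
```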

Priority Fix Order

Priority	Fix	Impact	Effort
1	DPI mismatch (Problem #1)	43% of docs recovered	1 line change
2	Fallback amount regex (Problem #2)	3+ docs with wrong amounts	Small regex fix
3	Fallback bankgiro regex (Problem #3)	False positive bankgiro	Small regex fix
4	OCR min digits (Problem #4)	Short OCR numbers supported	1 line change
5	Year as InvoiceNumber (Problem #5)	2+ docs	Small logic add
6	Date validation (Problem #10)	Logical consistency	Small validation add
7	Cross-validation (Problem #6)	Better field reconciliation	Investigation needed
8	Timeouts (Problem #8)	4 docs	Config change
9	Document classification (Problem #9)	Filter non-invoices	Feature addition

Re-run Expected After Fix #1

After fixing the DPI mismatch alone, re-running the same 39 PDFs should show:

Pure fallback rate dropping from 43% to near 0%
InvoiceDate extraction rate improving from 31% to ~70%+
OCR extraction rate improving from 31% to ~60%+
Average confidence scores increasing significantly
supplier_organisation_number extraction improving from 23% to ~60%+

Detailed Per-PDF Results Summary

PDF	Size	Time	Fields (extracted/YOLO detections)	Confidence	Issues
dccf6655	10KB	17s	0/0	-	Not an invoice
4f822b0d	183KB	37s	1/6	ALL 0.500	DPI bug: 6 detections, 5 lost
d4af7848	55KB	41s	1/6	ALL 0.500	DPI bug: 6 detections, 5 lost
19533483	262KB	39s	1/9	ALL 0.500	DPI bug: 9 detections, 8 lost
2b7e4103	25KB	47s	3/6	ALL 0.500	DPI bug + Amount=1.00 wrong
7717d293	34KB	16s	3/6	ALL 0.500	DPI bug + Amount=1.00 wrong
3226ac59	66KB	42s	3/5	ALL 0.500	DPI bug + Amount=1.00 wrong
0553e5c2	31KB	18s	3/6	ALL 0.500	DPI bug + BG=5000-0000 suspicious
32e90db8	136KB	40s	3/7	Mixed	Amount=2026.00 (year?)
dc35ee8e	567KB	83s	7/9	YOLO	InvoiceNumber=2025 (year)
56cabf73	67KB	19s	5/6	YOLO	InvoiceNumber=2026 (year)
87f470da	784KB	42s	9/14	YOLO	InvNum vs OCR mismatch
11de4d07	356KB	68s	5/3	Mixed	DueDate < InvoiceDate
0f9047a9	415KB	22s	8/6	YOLO	Good extraction
9d0b793c	286KB	18s	8/8	YOLO	Good extraction
5604d375	915KB	51s	9/10	YOLO	Good extraction
87f470da	784KB	42s	9/14	YOLO	Good extraction
f40fd418	523KB	90s	9/9	YOLO	Good extraction

Field Extraction Rate Summary

Field	Present	Missing	Rate	Avg Conf
Bankgiro	32	3	91.4%	0.681
InvoiceNumber	28	7	80.0%	0.695
Amount	27	8	77.1%	0.726
InvoiceDueDate	13	22	37.1%	0.883
InvoiceDate	11	24	31.4%	0.879
OCR	11	24	31.4%	0.900
customer_number	11	24	31.4%	0.926
payment_line	9	26	25.7%	0.938
Plusgiro	3	32	8.6%	0.948
supplier_org_number	0	35	0.0%	0.000

Note: Fields with high confidence but a low extraction rate (InvoiceDate 0.879, OCR 0.900, payment_line 0.938) are consistent with the DPI bug: when extraction works (via YOLO), confidence is high. The rate is low because most documents fall back to regex, and these fields have no fallback regex pattern.
