Fix: Enable substring matching for OCR, InvoiceNumber, Bankgiro, Plusgiro
Previously substring matching was only enabled for date fields, causing
OCR values embedded in longer tokens like "Fakturanummer: 2465027205" to
not be matched.

Changes:
- Extended Strategy 4 (substring match) to numeric fields
- Updated _find_substring_matches to support OCR, InvoiceNumber,
  Bankgiro, Plusgiro

This should significantly improve match rates for these fields.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -106,8 +106,9 @@ class FieldMatcher:
                 fuzzy_matches = self._find_fuzzy_matches(page_tokens, value, field_name)
                 matches.extend(fuzzy_matches)
 
-            # Strategy 4: Substring match (for dates embedded in longer text)
-            if field_name in ('InvoiceDate', 'InvoiceDueDate'):
+            # Strategy 4: Substring match (for values embedded in longer text)
+            # e.g., "Fakturanummer: 2465027205" should match OCR value "2465027205"
+            if field_name in ('InvoiceDate', 'InvoiceDueDate', 'InvoiceNumber', 'OCR', 'Bankgiro', 'Plusgiro'):
                 substring_matches = self._find_substring_matches(page_tokens, value, field_name)
                 matches.extend(substring_matches)
 
@@ -240,16 +241,19 @@ class FieldMatcher:
         """
         Find value as a substring within longer tokens.
 
-        Handles cases like 'Fakturadatum: 2026-01-09' where the date
-        is embedded in a longer text string.
+        Handles cases like:
+        - 'Fakturadatum: 2026-01-09' where the date is embedded
+        - 'Fakturanummer: 2465027205' where OCR/invoice number is embedded
+        - 'OCR: 1234567890' where reference number is embedded
 
-        Uses lower score (0.75) than exact match to prefer exact matches.
-        Only matches if the value appears as a distinct segment (not part of a number).
+        Uses lower score (0.75-0.85) than exact match to prefer exact matches.
+        Only matches if the value appears as a distinct segment (not part of a larger number).
         """
         matches = []
 
-        # Only use for date fields - other fields risk false positives
-        if field_name not in ('InvoiceDate', 'InvoiceDueDate'):
+        # Supported fields for substring matching
+        supported_fields = ('InvoiceDate', 'InvoiceDueDate', 'InvoiceNumber', 'OCR', 'Bankgiro', 'Plusgiro')
+        if field_name not in supported_fields:
             return matches
 
         for token in tokens:
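The docstring's rule that a value must appear "as a distinct segment (not part of a larger number)" could be checked roughly as follows. This is a minimal sketch and not the code from this commit: the helper name `contains_as_distinct_segment` is hypothetical, tokens are assumed to be plain strings, and the 0.75-0.85 scoring is left out because the diff does not show how it is computed.

```python
import re


def contains_as_distinct_segment(token_text: str, value: str) -> bool:
    """Return True if `value` occurs inside `token_text` with no digit
    directly before or after it (hypothetical helper, for illustration)."""
    if not value or value == token_text:
        # Assumption: exact matches are handled by an earlier strategy,
        # so only genuinely embedded values count here.
        return False
    # Negative lookbehind/lookahead reject hits that sit inside a longer
    # run of digits, e.g. "2465027205" inside "12465027205".
    pattern = r"(?<!\d)" + re.escape(value) + r"(?!\d)"
    return re.search(pattern, token_text) is not None


# Embedded invoice number: accepted.
print(contains_as_distinct_segment("Fakturanummer: 2465027205", "2465027205"))
# Same digits inside a larger number: rejected.
print(contains_as_distinct_segment("Ref 12465027205", "2465027205"))
```

Guarding both edges with digit lookarounds is what keeps the newly enabled numeric fields (OCR, InvoiceNumber, Bankgiro, Plusgiro) from matching fragments of unrelated longer numbers, which was the false-positive risk noted in the removed comment.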