- Update paddlepaddle from >=2.5.0 to >=3.0.0,<3.3.0
- Update paddleocr from >=2.7.0 to >=3.0.0
- Update paddlepaddle-gpu from >=2.5.0 to >=3.0.0,<3.3.0
Note: PaddlePaddle 3.3.0 has a OneDNN bug that breaks CPU inference
(ConvertPirAttribute2RuntimeAttribute not implemented). Pinning to <3.3.0
until the bug is fixed upstream.
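
For illustration only, the same constraint could be checked at runtime. This
is a minimal sketch and not part of this change: `is_supported_paddle_version`
is a hypothetical helper, and it assumes plain `X.Y.Z` style version strings
(pre-release suffixes are ignored).

```python
import re


def is_supported_paddle_version(version: str) -> bool:
    """Return True if `version` satisfies >=3.0.0,<3.3.0 (hypothetical helper)."""
    parts = []
    for piece in version.split(".")[:3]:
        # Keep only the leading digits of each component, e.g. "0rc1" -> 0.
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    # Pad short versions such as "3.1" to three components.
    while len(parts) < 3:
        parts.append(0)
    return (3, 0, 0) <= tuple(parts) < (3, 3, 0)
```

A caller would pass in whatever version string the installed package reports.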
This upgrade enables PP-StructureV3 for table extraction and uses
PP-OCRv5 for improved text recognition accuracy. The existing codebase
is already compatible with the 3.x API (predict() method and new
response format).
Verified:
- PaddleOCR import works
- PPStructureV3 is available
- OCREngine initializes correctly
- Inference API returns correct field extractions
- 2117 unit tests pass
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add MachineCodeParser for Swedish invoice payment line parsing
- Fix OCR Reference extraction by normalizing account number spaces
- Add cross-validation tests for pipeline and field_extractor
- Update UI layout for compact upload and full-width results
Key changes:
- machine_code_parser.py: Handle spaces in Bankgiro numbers (e.g. "78 2 1 713")
- pipeline.py: Override OCR and Amount from payment_line; BG/PG are compared only
- field_extractor.py: Improved invoice number normalization
- app.py: Responsive UI layout changes
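
The space normalization for Bankgiro numbers can be sketched roughly as
follows. This is an assumption about the approach, not the code in
machine_code_parser.py; `collapse_digit_spaces` is a hypothetical name.

```python
import re


def collapse_digit_spaces(text: str) -> str:
    """Remove whitespace that sits directly between two digits."""
    # Lookbehind/lookahead keep the digits themselves out of the match,
    # so only the whitespace between them is deleted.
    return re.sub(r"(?<=\d)\s+(?=\d)", "", text)


print(collapse_digit_spaces("78 2 1 713"))  # 7821713
```

A rule this simple also merges two separate numbers that happen to be
adjacent, so it probably only makes sense on a span already identified as a
single account number.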
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, substring matching was enabled only for date fields, so
OCR values embedded in longer tokens such as "Fakturanummer: 2465027205"
were not matched.
Changes:
- Extended Strategy 4 (substring match) to numeric fields
- Updated _find_substring_matches to support OCR, InvoiceNumber, Bankgiro, Plusgiro
This should significantly improve match rates for these fields.
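
A rough sketch of substring matching for numeric fields, for illustration.
This is not the actual `_find_substring_matches`; `sketch_substring_matches`
is a hypothetical name, and the digit-boundary check is an assumption about
how false positives inside longer numbers might be avoided.

```python
import re


def sketch_substring_matches(tokens, value):
    """Return tokens that contain `value` as a standalone digit run."""
    # Reject hits where the value is flanked by another digit, so that
    # "2465027205" does not match inside "12465027205".
    pattern = re.compile(rf"(?<!\d){re.escape(value)}(?!\d)")
    return [token for token in tokens if pattern.search(token)]


tokens = ["Fakturanummer: 2465027205", "12465027205", "Datum"]
print(sketch_substring_matches(tokens, "2465027205"))
# ['Fakturanummer: 2465027205']
```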
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Features:
- Auto-labeling pipeline: CSV values -> PDF search -> YOLO annotations
- Flexible date matching: year-month match, nearby date tolerance
- PDF text extraction with PyMuPDF
- OCR support for scanned documents (PaddleOCR)
- YOLO training and inference pipeline
- 7 field types: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount
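
The last step of the auto-labeling pipeline, turning a matched bounding box
into a YOLO annotation line, might look like the sketch below. The class
order, the `to_yolo_line` name, and the `(x0, y0, x1, y1)` box layout are
assumptions for illustration; only the normalized
`class x_center y_center width height` label format is standard YOLO.

```python
# Assumed class order: the field types in the order listed above.
FIELD_CLASSES = [
    "InvoiceNumber", "InvoiceDate", "InvoiceDueDate",
    "OCR", "Bankgiro", "Plusgiro", "Amount",
]


def to_yolo_line(field, bbox, page_width, page_height):
    """Format one annotation as 'class x_center y_center width height'."""
    x0, y0, x1, y1 = bbox  # absolute page coordinates, top-left origin
    class_id = FIELD_CLASSES.index(field)
    # YOLO labels are normalized to the 0..1 range by the page dimensions.
    x_center = (x0 + x1) / 2 / page_width
    y_center = (y0 + y1) / 2 / page_height
    width = (x1 - x0) / page_width
    height = (y1 - y0) / page_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


print(to_yolo_line("OCR", (100, 200, 300, 400), 1000, 1000))
# 3 0.200000 0.300000 0.200000 0.200000
```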
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>