Implement center-point-based bbox scaling with directional compensation
to capture field labels that typically appear above or to the left of
field values. This improves YOLO training data quality by including
contextual information around field values.
Key changes:
- Add shared.bbox module with ScaleStrategy dataclass and expand_bbox function
- Define field-specific strategies (ocr_number, bankgiro, invoice_date, etc.)
- Support manual_mode for minimal padding (no scaling)
- Integrate expand_bbox into AnnotationGenerator
- Add FIELD_TO_CLASS mapping for field_name to class_name lookup
- Add comprehensive tests with 100% coverage (45 tests)
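A minimal sketch of what center-point scaling with directional compensation might look like. Only the names ScaleStrategy, expand_bbox and manual_mode come from this commit; the dataclass fields, default values, argument order and the top-left image origin are assumptions made for illustration, not the actual shared.bbox implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleStrategy:
    # All field names and defaults below are illustrative assumptions.
    scale_x: float = 1.5         # width multiplier applied around the center
    scale_y: float = 1.5         # height multiplier applied around the center
    shift_left: float = 0.25     # extra growth to the left, as a fraction of width
    shift_up: float = 0.25       # extra growth upward, as a fraction of height
    manual_padding: float = 2.0  # fixed padding in pixels used in manual mode


def expand_bbox(bbox, strategy, image_size, manual_mode=False):
    """Return an expanded (x0, y0, x1, y1) box clamped to the image.

    Assumes pixel coordinates with the origin at the top-left corner,
    so "above" means a smaller y value.
    """
    x0, y0, x1, y1 = bbox
    img_w, img_h = image_size
    if manual_mode:
        # Manual annotations are trusted: pad slightly, do not scale.
        pad = strategy.manual_padding
        nx0, ny0, nx1, ny1 = x0 - pad, y0 - pad, x1 + pad, y1 + pad
    else:
        w, h = x1 - x0, y1 - y0
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        half_w = w * strategy.scale_x / 2
        half_h = h * strategy.scale_y / 2
        # Scale around the center, then grow further left and up,
        # where field labels usually sit.
        nx0 = cx - half_w - w * strategy.shift_left
        ny0 = cy - half_h - h * strategy.shift_up
        nx1 = cx + half_w
        ny1 = cy + half_h
    return (
        max(0.0, nx0),
        max(0.0, ny0),
        min(float(img_w), nx1),
        min(float(img_h), ny1),
    )
```

In this sketch a field-specific strategy would simply be another ScaleStrategy instance with different ratios, for example a wider shift_left for fields whose label sits on the same line.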
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add validation that checks PDF files start with '%PDF' magic bytes
before accepting uploads. This prevents attackers from uploading
malicious files (executables, scripts) by renaming them to .pdf.
- Add validate_pdf_magic_bytes() function with clear error messages
- Integrate validation in upload_document endpoint after file read
- Add comprehensive test coverage (13 test cases)
Addresses medium-risk security issue from code review.
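A minimal sketch of the magic-byte check, for illustration. The function name is taken from this commit; the signature, the ValueError exception type and the message wording are assumptions and may differ from the real implementation.

```python
PDF_MAGIC = b"%PDF"


def validate_pdf_magic_bytes(content: bytes) -> None:
    """Raise ValueError unless content begins with the PDF signature.

    Hypothetical signature: the real function may return a result
    object or raise a framework-specific HTTP error instead.
    """
    if not content:
        raise ValueError("Uploaded file is empty.")
    if not content.startswith(PDF_MAGIC):
        raise ValueError("File is not a valid PDF: missing '%PDF' header.")
```

Checking the content rather than the file extension is what defeats a renamed executable: the name can be changed freely, the leading bytes cannot without breaking the file.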
- Update paddlepaddle from >=2.5.0 to >=3.0.0,<3.3.0
- Update paddleocr from >=2.7.0 to >=3.0.0
- Update paddlepaddle-gpu from >=2.5.0 to >=3.0.0,<3.3.0
Note: PaddlePaddle 3.3.0 has a OneDNN bug that breaks CPU inference
(ConvertPirAttribute2RuntimeAttribute not implemented). Pin to <3.3.0
until the bug is fixed upstream.
This upgrade enables PP-StructureV3 for table extraction and uses
PP-OCRv5 for improved text recognition accuracy. The existing codebase
is already compatible with the 3.x API (predict() method and new
response format).
Verified:
- PaddleOCR import works
- PPStructureV3 is available
- OCREngine initializes correctly
- Inference API returns correct field extractions
- 2117 unit tests pass
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add MachineCodeParser for Swedish invoice payment line parsing
- Fix OCR Reference extraction by normalizing account number spaces
- Add cross-validation tests for pipeline and field_extractor
- Update UI layout for compact upload and full-width results
Key changes:
- machine_code_parser.py: Handle spaces in Bankgiro numbers (e.g. "78 2 1 713")
- pipeline.py: Override OCR and Amount from payment_line; use BG/PG for comparison only
- field_extractor.py: Improved invoice number normalization
- app.py: Responsive UI layout changes
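A minimal sketch of the space normalization for Bankgiro numbers, for illustration. The helper names are hypothetical and not taken from machine_code_parser.py; the 7-or-8-digit length check reflects the usual Bankgiro format and is an assumption about what the parser accepts.

```python
import re
from typing import Optional


def normalize_account_digits(raw: str) -> str:
    """Drop every non-digit so OCR-inserted spaces do not split the number."""
    return re.sub(r"\D", "", raw)


def format_bankgiro(raw: str) -> Optional[str]:
    """Return the number as NNN-NNNN or NNNN-NNNN, or None if the length is off."""
    digits = normalize_account_digits(raw)
    if len(digits) not in (7, 8):
        return None
    # The last four digits always follow the hyphen.
    return f"{digits[:-4]}-{digits[-4:]}"
```

With this sketch, format_bankgiro("78 2 1 713") yields "782-1713".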
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, substring matching was only enabled for date fields, so
OCR values embedded in longer tokens like "Fakturanummer: 2465027205"
were not matched.
Changes:
- Extended Strategy 4 (substring match) to numeric fields
- Updated _find_substring_matches to support OCR, InvoiceNumber, Bankgiro, Plusgiro
This should significantly improve match rates for these fields.
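A minimal sketch of substring matching for a numeric value inside a longer token. The function name is hypothetical and this is not the body of _find_substring_matches; the digit-boundary guard is an assumption added here to avoid matching a value that is only part of a longer number.

```python
import re


def find_substring_match(token_text: str, value: str) -> bool:
    """True if value occurs in token_text and is not part of a longer digit run."""
    # Lookarounds reject hits that have a digit directly before or after.
    pattern = rf"(?<!\d){re.escape(value)}(?!\d)"
    return re.search(pattern, token_text) is not None
```

Under these assumptions, "2465027205" matches the token "Fakturanummer: 2465027205", while the shorter value "2465" does not, because it is followed by more digits.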
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Features:
- Auto-labeling pipeline: CSV values -> PDF search -> YOLO annotations
- Flexible date matching: year-month match, nearby date tolerance
- PDF text extraction with PyMuPDF
- OCR support for scanned documents (PaddleOCR)
- YOLO training and inference pipeline
- 7 field types: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount
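A minimal sketch of how flexible date matching could rank candidates, for illustration. The function name, the three-day default tolerance and the returned labels are assumptions; only the two ideas (year-month match, nearby date tolerance) come from the feature list above.

```python
from datetime import date
from typing import Optional


def match_date(expected: date, candidate: date, tolerance_days: int = 3) -> Optional[str]:
    """Classify how closely candidate matches expected, strongest match first."""
    if candidate == expected:
        return "exact"
    if abs((candidate - expected).days) <= tolerance_days:
        return "nearby"  # within the assumed tolerance window
    if (candidate.year, candidate.month) == (expected.year, expected.month):
        return "year_month"  # same month, day differs by more than the tolerance
    return None
```

Returning a label rather than a boolean lets a caller prefer exact hits and fall back to weaker ones only when nothing better is found on the page.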
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>