3 Commits

Author SHA1 Message Date
Yaojia Wang
1b7c61cdd8 Enable GPU by default for PaddleOCR
- Changed use_gpu default from False to True
- Added use_gpu parameter to PaddleOCR init
- Added show_log=False to reduce log noise

GPU acceleration significantly improves OCR performance and
reduces memory pressure when processing scanned PDFs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 18:29:02 +01:00
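
A minimal sketch of the configuration change described in the commit above, not the repository's actual code. The helper names `build_paddleocr_kwargs` and `create_ocr_engine` are hypothetical. It assumes a PaddleOCR 2.x release, where the constructor accepts `use_gpu` and `show_log`; other releases may configure the device and logging differently.

```python
from typing import Any, Dict


def build_paddleocr_kwargs(use_gpu: bool = True) -> Dict[str, Any]:
    """Collect the constructor arguments named in the commit message."""
    return {
        "use_gpu": use_gpu,  # default changed from False to True
        "show_log": False,   # reduce log noise
    }


def create_ocr_engine(use_gpu: bool = True):
    """Instantiate PaddleOCR lazily so the package is imported only when needed."""
    # Assumption: paddleocr 2.x is installed and accepts these keyword arguments.
    from paddleocr import PaddleOCR

    return PaddleOCR(**build_paddleocr_kwargs(use_gpu=use_gpu))
```

Keeping `use_gpu` as a parameter rather than hard-coding it lets callers on CPU-only machines opt out with `create_ocr_engine(use_gpu=False)`.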
Yaojia Wang
dd69fbe9ed Fix: Enable substring matching for OCR, InvoiceNumber, Bankgiro, Plusgiro
Previously, substring matching was enabled only for date fields, so
OCR values embedded in longer tokens such as "Fakturanummer: 2465027205"
were not matched.

Changes:
- Extended Strategy 4 (substring match) to numeric fields
- Updated _find_substring_matches to support OCR, InvoiceNumber, Bankgiro, Plusgiro

This should significantly improve match rates for these fields.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 17:49:27 +01:00
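
The substring strategy described in the commit above could look roughly like the following. This is an illustrative sketch, not the repository's `_find_substring_matches`: the function name `find_substring_matches`, its signature, and the field-set constants are hypothetical, and the digit-boundary guard is an added assumption meant to stop a value from matching inside a longer number.

```python
import re
from typing import List, Tuple

# Fields the commit says now use substring matching, plus the date fields
# that already did. The grouping into two sets is an assumption.
NUMERIC_FIELDS = {"OCR", "InvoiceNumber", "Bankgiro", "Plusgiro"}
DATE_FIELDS = {"InvoiceDate", "InvoiceDueDate"}


def find_substring_matches(
    tokens: List[str], value: str, field: str
) -> List[Tuple[int, str]]:
    """Return (index, token) pairs whose text contains `value`."""
    if not value or field not in NUMERIC_FIELDS | DATE_FIELDS:
        return []
    # Assumption: require that the value is not surrounded by other digits,
    # so "2465027205" does not match inside "12465027205".
    pattern = re.compile(r"(?<!\d)" + re.escape(value) + r"(?!\d)")
    return [(i, tok) for i, tok in enumerate(tokens) if pattern.search(tok)]
```

With the token from the commit message, `find_substring_matches(["Fakturanummer: 2465027205"], "2465027205", "OCR")` returns that token at index 0, while a field outside the two sets, such as `Amount`, returns an empty list.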
Yaojia Wang
8938661850 Initial commit: Invoice field extraction system using YOLO + OCR
Features:
- Auto-labeling pipeline: CSV values -> PDF search -> YOLO annotations
- Flexible date matching: year-month match, nearby date tolerance
- PDF text extraction with PyMuPDF
- OCR support for scanned documents (PaddleOCR)
- YOLO training and inference pipeline
- 7 field types: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 17:44:14 +01:00
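
To illustrate the last step of the auto-labeling pipeline listed above (a matched value's bounding box becoming a YOLO annotation), here is a hedged sketch. The function name `to_yolo_line`, the class-ID order (taken from the order in which the commit lists the field types), and the pixel-coordinate bounding box layout are assumptions. The output follows the standard YOLO label format: class ID, then normalized center x, center y, width, and height.

```python
from typing import Tuple

# Assumption: class IDs follow the order in which the commit lists the fields.
FIELD_CLASSES = [
    "InvoiceNumber",
    "InvoiceDate",
    "InvoiceDueDate",
    "OCR",
    "Bankgiro",
    "Plusgiro",
    "Amount",
]


def to_yolo_line(
    field: str,
    bbox: Tuple[float, float, float, float],
    page_width: float,
    page_height: float,
) -> str:
    """Convert an (x0, y0, x1, y1) box into one YOLO label line."""
    x0, y0, x1, y1 = bbox
    class_id = FIELD_CLASSES.index(field)  # raises ValueError for unknown fields
    # YOLO expects the box center and size, normalized to the image dimensions.
    x_center = (x0 + x1) / 2 / page_width
    y_center = (y0 + y1) / 2 / page_height
    width = (x1 - x0) / page_width
    height = (y1 - y0) / page_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
```

For example, `to_yolo_line("OCR", (100, 200, 300, 250), 1000, 1000)` yields `3 0.200000 0.225000 0.200000 0.050000`.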