invoice-master-poc-v2

9 Commits 2 Branches 0 Tags

Author	SHA1	Message	Date
Yaojia Wang	4ea4bc96d4	Add payment line parser and fix OCR override from payment_line - Add MachineCodeParser for Swedish invoice payment line parsing - Fix OCR Reference extraction by normalizing account number spaces - Add cross-validation tests for pipeline and field_extractor - Update UI layout for compact upload and full-width results Key changes: - machine_code_parser.py: Handle spaces in Bankgiro numbers (e.g. "78 2 1 713") - pipeline.py: OCR and Amount override from payment_line, BG/PG comparison only - field_extractor.py: Improved invoice number normalization - app.py: Responsive UI layout changes Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 21:47:02 +01:00
Yaojia Wang	e9460e9f34	code issue fix	2026-01-17 18:55:46 +01:00
Yaojia Wang	510890d18c	Add claude config	2026-01-17 18:55:25 +01:00
Yaojia Wang	425b8fdedf	WIP	2026-01-16 23:10:01 +01:00
Yaojia Wang	53d1e8db25	Enhance.	2026-01-15 23:02:38 +01:00
Yaojia Wang	b26fd61852	WOP	2026-01-13 00:10:27 +01:00
Yaojia Wang	1b7c61cdd8	Enable GPU by default for PaddleOCR - Changed use_gpu default from False to True - Added use_gpu parameter to PaddleOCR init - Added show_log=False to reduce log noise GPU acceleration significantly improves OCR performance and reduces memory pressure when processing scanned PDFs. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 18:29:02 +01:00
Yaojia Wang	dd69fbe9ed	Fix: Enable substring matching for OCR, InvoiceNumber, Bankgiro, Plusgiro Previously substring matching was only enabled for date fields, causing OCR values embedded in longer tokens like "Fakturanummer: 2465027205" to not be matched. Changes: - Extended Strategy 4 (substring match) to numeric fields - Updated _find_substring_matches to support OCR, InvoiceNumber, Bankgiro, Plusgiro This should significantly improve match rates for these fields. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 17:49:27 +01:00
Yaojia Wang	8938661850	Initial commit: Invoice field extraction system using YOLO + OCR Features: - Auto-labeling pipeline: CSV values -> PDF search -> YOLO annotations - Flexible date matching: year-month match, nearby date tolerance - PDF text extraction with PyMuPDF - OCR support for scanned documents (PaddleOCR) - YOLO training and inference pipeline - 7 field types: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 17:44:14 +01:00
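
Two of the commits above describe string-handling fixes: dd69fbe9ed extends substring matching to numeric fields so that a value embedded in a token such as "Fakturanummer: 2465027205" is found, and 4ea4bc96d4 handles spaces inside Bankgiro numbers such as "78 2 1 713". The sketch below illustrates both ideas in isolation. It is not taken from the repository: the function names `normalize_account_digits` and `find_substring_match` are hypothetical, and the digit-boundary guard is an assumption about how one might avoid matching inside a longer number, not something the commit messages state.

```python
import re
from typing import Optional


def normalize_account_digits(raw: str) -> str:
    """Drop whitespace and hyphens so "78 2 1 713" becomes "7821713".

    Hypothetical helper; the real machine_code_parser.py may differ.
    """
    return re.sub(r"[\s\-]", "", raw)


def find_substring_match(token: str, value: str) -> Optional[int]:
    """Return the index where `value` occurs inside `token`, or None.

    Assumption: a hit is rejected when it is directly preceded or
    followed by another digit, so a value does not match inside a
    longer digit run.
    """
    pattern = r"(?<!\d)" + re.escape(value) + r"(?!\d)"
    match = re.search(pattern, token)
    return match.start() if match else None


if __name__ == "__main__":
    # Example strings taken from the commit messages above.
    print(normalize_account_digits("78 2 1 713"))
    print(find_substring_match("Fakturanummer: 2465027205", "2465027205"))
```

The lookbehind and lookahead keep the match strict for numeric fields, where a partial hit inside a longer number would produce a wrong label. Whether the project applies the same guard is not visible from the log.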