invoice-master-poc-v2

Author	SHA1	Message	Date
Yaojia Wang	d8f2acb762	fix: change default OCR language from English to Swedish. Project targets Swedish invoice extraction. PaddleOCR sv model provides better recognition of Swedish-specific characters (å, ä, ö).	2026-02-12 23:19:51 +01:00
Yaojia Wang	58d36c8927	WIP	2026-02-12 23:06:00 +01:00
Yaojia Wang	ad5ed46b4c	WIP	2026-02-11 23:40:38 +01:00
Yaojia Wang	f1a7bfe6b7	WIP	2026-02-07 13:56:00 +01:00
Yaojia Wang	0990239e9c	feat: add field-specific bbox expansion strategies for YOLO training. Implement center-point based bbox scaling with directional compensation to capture field labels that typically appear above or to the left of field values. This improves YOLO training data quality by including contextual information around field values. Key changes: - Add shared.bbox module with ScaleStrategy dataclass and expand_bbox function - Define field-specific strategies (ocr_number, bankgiro, invoice_date, etc.) - Support manual_mode for minimal padding (no scaling) - Integrate expand_bbox into AnnotationGenerator - Add FIELD_TO_CLASS mapping for field_name to class_name lookup - Comprehensive tests with 100% coverage (45 tests). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 22:56:52 +01:00
Yaojia Wang	8723ef4653	refactor: split line_items_extractor into smaller modules with comprehensive tests. - Extract models.py (LineItem, LineItemsResult dataclasses) - Extract html_table_parser.py (ColumnMapper, HtmlTableParser) - Extract merged_cell_handler.py (MergedCellHandler for PP-StructureV3 merged cells) - Reduce line_items_extractor.py from 971 to 396 lines - Add constants for magic numbers (MIN_AMOUNT_THRESHOLD, ROW_GROUPING_THRESHOLD, etc.) - Fix row grouping algorithm in text_line_items_extractor.py - Demote INFO logs to DEBUG level in structure_detector.py - Add 209 tests achieving 85%+ coverage on main modules. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 23:02:00 +01:00
Yaojia Wang	c2c8f2dd04	WIP	2026-02-03 22:29:53 +01:00
Yaojia Wang	4c7fc3015c	fix: add PDF magic bytes validation to prevent file type spoofing. Add validation that checks PDF files start with '%PDF' magic bytes before accepting uploads. This prevents attackers from uploading malicious files (executables, scripts) by renaming them to .pdf. - Add validate_pdf_magic_bytes() function with clear error messages - Integrate validation in upload_document endpoint after file read - Add comprehensive test coverage (13 test cases). Addresses medium-risk security issue from code review.	2026-02-03 22:28:24 +01:00
Yaojia Wang	183d3503ef	Prepare for opencode	2026-02-03 22:03:44 +01:00
Yaojia Wang	729d96f59e	Merge branch 'feature/paddleocr-upgrade'	2026-02-03 21:28:33 +01:00
Yaojia Wang	35988b1ebf	Update paddle, and support invoice line item	2026-02-03 21:28:06 +01:00
Yaojia Wang	c4e3773df1	feat: upgrade PaddlePaddle and PaddleOCR to 3.x. - Update paddlepaddle from >=2.5.0 to >=3.0.0,<3.3.0 - Update paddleocr from >=2.7.0 to >=3.0.0 - Update paddlepaddle-gpu from >=2.5.0 to >=3.0.0,<3.3.0. Note: PaddlePaddle 3.3.0 has an OneDNN bug that breaks CPU inference (ConvertPirAttribute2RuntimeAttribute not implemented). Using <3.3.0 until the bug is fixed upstream. This upgrade enables PP-StructureV3 for table extraction and uses PP-OCRv5 for improved text recognition accuracy. The existing codebase is already compatible with the 3.x API (predict() method and new response format). Verified: - PaddleOCR import works - PPStructureV3 is available - OCREngine initializes correctly - Inference API returns correct field extractions - 2117 unit tests pass. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 12:15:02 +01:00
Yaojia Wang	883fab5c4a	Add plan file	2026-02-01 23:41:46 +01:00
Yaojia Wang	45d74d048a	Update readme	2026-02-01 23:41:38 +01:00
Yaojia Wang	b602d0a340	re-structure	2026-02-01 22:55:31 +01:00
Yaojia Wang	400b12a967	Add more tests	2026-02-01 22:40:41 +01:00
Yaojia Wang	a564ac9d70	WIP	2026-02-01 18:51:54 +01:00
Yaojia Wang	4126196dea	Add report	2026-02-01 01:49:50 +01:00
Yaojia Wang	a516de4320	WIP	2026-02-01 00:08:40 +01:00
Yaojia Wang	33ada0350d	WIP	2026-01-30 00:44:21 +01:00
Yaojia Wang	d2489a97d4	Remove not used file	2026-01-27 23:58:39 +01:00
Yaojia Wang	d6550375b0	restructure project	2026-01-27 23:58:17 +01:00
Yaojia Wang	58bf75db68	WIP	2026-01-27 00:47:10 +01:00
Yaojia Wang	e83a0cae36	Add claude config	2026-01-25 16:17:39 +01:00
Yaojia Wang	d5101e3604	Add claude config	2026-01-25 16:17:23 +01:00
Yaojia Wang	e599424a92	Re-structure the project.	2026-01-25 15:21:11 +01:00
Yaojia Wang	8fd61ea928	WIP	2026-01-22 22:03:24 +01:00
Yaojia Wang	4ea4bc96d4	Add payment line parser and fix OCR override from payment_line. - Add MachineCodeParser for Swedish invoice payment line parsing - Fix OCR Reference extraction by normalizing account number spaces - Add cross-validation tests for pipeline and field_extractor - Update UI layout for compact upload and full-width results. Key changes: - machine_code_parser.py: Handle spaces in Bankgiro numbers (e.g. "78 2 1 713") - pipeline.py: OCR and Amount override from payment_line, BG/PG comparison only - field_extractor.py: Improved invoice number normalization - app.py: Responsive UI layout changes. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 21:47:02 +01:00
Yaojia Wang	e9460e9f34	code issue fix	2026-01-17 18:55:46 +01:00
Yaojia Wang	510890d18c	Add claude config	2026-01-17 18:55:25 +01:00
Yaojia Wang	425b8fdedf	WIP	2026-01-16 23:10:01 +01:00
Yaojia Wang	53d1e8db25	Enhance.	2026-01-15 23:02:38 +01:00
Yaojia Wang	b26fd61852	WOP	2026-01-13 00:10:27 +01:00
Yaojia Wang	1b7c61cdd8	Enable GPU by default for PaddleOCR. - Changed use_gpu default from False to True - Added use_gpu parameter to PaddleOCR init - Added show_log=False to reduce log noise. GPU acceleration significantly improves OCR performance and reduces memory pressure when processing scanned PDFs. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 18:29:02 +01:00
Yaojia Wang	dd69fbe9ed	Fix: Enable substring matching for OCR, InvoiceNumber, Bankgiro, Plusgiro. Previously substring matching was only enabled for date fields, causing OCR values embedded in longer tokens like "Fakturanummer: 2465027205" to not be matched. Changes: - Extended Strategy 4 (substring match) to numeric fields - Updated _find_substring_matches to support OCR, InvoiceNumber, Bankgiro, Plusgiro. This should significantly improve match rates for these fields. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 17:49:27 +01:00
Yaojia Wang	8938661850	Initial commit: Invoice field extraction system using YOLO + OCR. Features: - Auto-labeling pipeline: CSV values -> PDF search -> YOLO annotations - Flexible date matching: year-month match, nearby date tolerance - PDF text extraction with PyMuPDF - OCR support for scanned documents (PaddleOCR) - YOLO training and inference pipeline - 7 field types: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 17:44:14 +01:00

36 Commits
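
Note on commit 4c7fc3015c: the magic-bytes check it describes (reject uploads that do not start with '%PDF') could look roughly like the sketch below. Only the function name validate_pdf_magic_bytes() and the '%PDF' prefix come from the commit message. The signature, the use of ValueError, and the error wording are assumptions made for illustration and may differ from the repository's actual code.

```python
# Hypothetical sketch; not the repository's implementation.
PDF_MAGIC = b"%PDF"  # signature named in the commit message


def validate_pdf_magic_bytes(content: bytes) -> None:
    """Raise ValueError unless the uploaded bytes begin with the PDF signature.

    Assumption: the caller has already read the upload into memory as bytes.
    """
    if len(content) < len(PDF_MAGIC):
        raise ValueError("File is too small to be a valid PDF")
    if not content.startswith(PDF_MAGIC):
        # A renamed executable or script fails here regardless of its extension.
        raise ValueError("File does not start with the %PDF magic bytes")
```

A caller would run this right after reading the upload and turn the ValueError into a client error response. The check only confirms the header, so it does not prove that the rest of the file is a well-formed PDF.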