36 Commits

Author SHA1 Message Date
Yaojia Wang
d8f2acb762 fix: change default OCR language from English to Swedish
Project targets Swedish invoice extraction. PaddleOCR sv model provides
better recognition of Swedish-specific characters (å, ä, ö).
2026-02-12 23:19:51 +01:00
Yaojia Wang
58d36c8927 WIP 2026-02-12 23:06:00 +01:00
Yaojia Wang
ad5ed46b4c WIP 2026-02-11 23:40:38 +01:00
Yaojia Wang
f1a7bfe6b7 WIP 2026-02-07 13:56:00 +01:00
Yaojia Wang
0990239e9c feat: add field-specific bbox expansion strategies for YOLO training
Implement center-point based bbox scaling with directional compensation
to capture field labels that typically appear above or to the left of
field values. This improves YOLO training data quality by including
contextual information around field values.

Key changes:
- Add shared.bbox module with ScaleStrategy dataclass and expand_bbox function
- Define field-specific strategies (ocr_number, bankgiro, invoice_date, etc.)
- Support manual_mode for minimal padding (no scaling)
- Integrate expand_bbox into AnnotationGenerator
- Add FIELD_TO_CLASS mapping for field_name to class_name lookup
- Comprehensive tests with 100% coverage (45 tests)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 22:56:52 +01:00
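
The center-point scaling with directional compensation described in 0990239e9c can be sketched roughly as follows. Only the names `ScaleStrategy` and `expand_bbox` come from the commit message; the fields, the signature, and the numbers are assumptions made for illustration and may differ from the real shared.bbox module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleStrategy:
    """Hypothetical field layout; the real dataclass may differ."""
    scale_x: float = 1.0     # width multiplier applied around the center
    scale_y: float = 1.0     # height multiplier applied around the center
    extra_left: float = 0.0  # extra left padding, as a fraction of the original width
    extra_top: float = 0.0   # extra top padding, as a fraction of the original height


def expand_bbox(bbox, strategy, image_width, image_height):
    """Scale a (x0, y0, x1, y1) box around its center, then push the
    left and top edges further out to catch a label placed there."""
    x0, y0, x1, y1 = bbox
    width, height = x1 - x0, y1 - y0
    cx, cy = x0 + width / 2, y0 + height / 2

    half_w = width * strategy.scale_x / 2
    half_h = height * strategy.scale_y / 2

    new_x0 = cx - half_w - width * strategy.extra_left
    new_y0 = cy - half_h - height * strategy.extra_top
    new_x1 = cx + half_w
    new_y1 = cy + half_h

    # Clamp to the image so the YOLO annotation stays valid.
    return (
        max(0.0, new_x0),
        max(0.0, new_y0),
        min(float(image_width), new_x1),
        min(float(image_height), new_y1),
    )


# Illustrative strategy only; not the values used in the repository.
example_strategy = ScaleStrategy(scale_x=1.5, scale_y=2.0, extra_left=0.5, extra_top=1.0)
expanded = expand_bbox((100, 100, 200, 120), example_strategy, 1000, 1000)
```

Keeping the compensation as a fraction of the original box size lets one strategy work across fields of different sizes; whether the repository does the same is not stated in the commit.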
Yaojia Wang
8723ef4653 refactor: split line_items_extractor into smaller modules with comprehensive tests
- Extract models.py (LineItem, LineItemsResult dataclasses)
- Extract html_table_parser.py (ColumnMapper, HtmlTableParser)
- Extract merged_cell_handler.py (MergedCellHandler for PP-StructureV3 merged cells)
- Reduce line_items_extractor.py from 971 to 396 lines
- Add constants for magic numbers (MIN_AMOUNT_THRESHOLD, ROW_GROUPING_THRESHOLD, etc.)
- Fix row grouping algorithm in text_line_items_extractor.py
- Demote INFO logs to DEBUG level in structure_detector.py
- Add 209 tests achieving 85%+ coverage on main modules

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 23:02:00 +01:00
Yaojia Wang
c2c8f2dd04 WIP 2026-02-03 22:29:53 +01:00
Yaojia Wang
4c7fc3015c fix: add PDF magic bytes validation to prevent file type spoofing
Add validation that checks PDF files start with '%PDF' magic bytes
before accepting uploads. This prevents attackers from uploading
malicious files (executables, scripts) by renaming them to .pdf.

- Add validate_pdf_magic_bytes() function with clear error messages
- Integrate validation in upload_document endpoint after file read
- Add comprehensive test coverage (13 test cases)

Addresses medium-risk security issue from code review.
2026-02-03 22:28:24 +01:00
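
A minimal sketch of the check described in 4c7fc3015c, assuming the upload handler already holds the file content as bytes. The function name comes from the commit message; the signature, the `InvalidPdfError` exception, and the error texts are hypothetical.

```python
PDF_MAGIC = b"%PDF"


class InvalidPdfError(ValueError):
    """Hypothetical exception type; the real code may raise something else."""


def validate_pdf_magic_bytes(content: bytes) -> None:
    """Raise InvalidPdfError unless the content starts with the %PDF header."""
    if len(content) < len(PDF_MAGIC):
        raise InvalidPdfError("File is too short to be a PDF")
    if not content.startswith(PDF_MAGIC):
        raise InvalidPdfError("File does not start with the %PDF header")
```

A check like this only rejects files whose leading bytes are wrong; it does not prove that the rest of the file is a well-formed PDF.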
Yaojia Wang
183d3503ef Prepare for opencode 2026-02-03 22:03:44 +01:00
Yaojia Wang
729d96f59e Merge branch 'feature/paddleocr-upgrade' 2026-02-03 21:28:33 +01:00
Yaojia Wang
35988b1ebf Update paddle, and support invoice line item 2026-02-03 21:28:06 +01:00
Yaojia Wang
c4e3773df1 feat: upgrade PaddlePaddle and PaddleOCR to 3.x
- Update paddlepaddle from >=2.5.0 to >=3.0.0,<3.3.0
- Update paddleocr from >=2.7.0 to >=3.0.0
- Update paddlepaddle-gpu from >=2.5.0 to >=3.0.0,<3.3.0

Note: PaddlePaddle 3.3.0 has a OneDNN bug that breaks CPU inference
(ConvertPirAttribute2RuntimeAttribute not implemented). Using <3.3.0
until the bug is fixed upstream.

This upgrade enables PP-StructureV3 for table extraction and uses
PP-OCRv5 for improved text recognition accuracy. The existing codebase
is already compatible with the 3.x API (predict() method and new
response format).

Verified:
- PaddleOCR import works
- PPStructureV3 is available
- OCREngine initializes correctly
- Inference API returns correct field extractions
- 2117 unit tests pass

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 12:15:02 +01:00
Yaojia Wang
883fab5c4a Add plan file 2026-02-01 23:41:46 +01:00
Yaojia Wang
45d74d048a Update readme 2026-02-01 23:41:38 +01:00
Yaojia Wang
b602d0a340 re-structure 2026-02-01 22:55:31 +01:00
Yaojia Wang
400b12a967 Add more tests 2026-02-01 22:40:41 +01:00
Yaojia Wang
a564ac9d70 WIP 2026-02-01 18:51:54 +01:00
Yaojia Wang
4126196dea Add report 2026-02-01 01:49:50 +01:00
Yaojia Wang
a516de4320 WIP 2026-02-01 00:08:40 +01:00
Yaojia Wang
33ada0350d WIP 2026-01-30 00:44:21 +01:00
Yaojia Wang
d2489a97d4 Remove unused file 2026-01-27 23:58:39 +01:00
Yaojia Wang
d6550375b0 restructure project 2026-01-27 23:58:17 +01:00
Yaojia Wang
58bf75db68 WIP 2026-01-27 00:47:10 +01:00
Yaojia Wang
e83a0cae36 Add claude config 2026-01-25 16:17:39 +01:00
Yaojia Wang
d5101e3604 Add claude config 2026-01-25 16:17:23 +01:00
Yaojia Wang
e599424a92 Re-structure the project. 2026-01-25 15:21:11 +01:00
Yaojia Wang
8fd61ea928 WIP 2026-01-22 22:03:24 +01:00
Yaojia Wang
4ea4bc96d4 Add payment line parser and fix OCR override from payment_line
- Add MachineCodeParser for Swedish invoice payment line parsing
- Fix OCR Reference extraction by normalizing account number spaces
- Add cross-validation tests for pipeline and field_extractor
- Update UI layout for compact upload and full-width results

Key changes:
- machine_code_parser.py: Handle spaces in Bankgiro numbers (e.g. "78 2 1 713")
- pipeline.py: OCR and Amount override from payment_line, BG/PG comparison only
- field_extractor.py: Improved invoice number normalization
- app.py: Responsive UI layout changes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 21:47:02 +01:00
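
The space handling mentioned for machine_code_parser.py in 4ea4bc96d4 might look like the sketch below. The helper name `normalize_bankgiro` is hypothetical, and the sketch assumes only that a Bankgiro number has 7 or 8 digits and is conventionally written with a hyphen before the last four.

```python
import re
from typing import Optional


def normalize_bankgiro(raw: str) -> Optional[str]:
    """Collapse OCR-inserted spaces and return the number as NNN-NNNN or
    NNNN-NNNN, or None if the digit count does not fit a Bankgiro number."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, hyphens, and other noise
    if len(digits) not in (7, 8):
        return None
    return f"{digits[:-4]}-{digits[-4:]}"


normalized = normalize_bankgiro("78 2 1 713")
```

Normalizing both the payment-line value and the extracted field to the same form makes a BG/PG comparison a plain string equality.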
Yaojia Wang
e9460e9f34 code issue fix 2026-01-17 18:55:46 +01:00
Yaojia Wang
510890d18c Add claude config 2026-01-17 18:55:25 +01:00
Yaojia Wang
425b8fdedf WIP 2026-01-16 23:10:01 +01:00
Yaojia Wang
53d1e8db25 Enhance. 2026-01-15 23:02:38 +01:00
Yaojia Wang
b26fd61852 WIP 2026-01-13 00:10:27 +01:00
Yaojia Wang
1b7c61cdd8 Enable GPU by default for PaddleOCR
- Changed use_gpu default from False to True
- Added use_gpu parameter to PaddleOCR init
- Added show_log=False to reduce log noise

GPU acceleration significantly improves OCR performance and
reduces memory pressure when processing scanned PDFs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 18:29:02 +01:00
Yaojia Wang
dd69fbe9ed Fix: Enable substring matching for OCR, InvoiceNumber, Bankgiro, Plusgiro
Previously substring matching was only enabled for date fields, causing
OCR values embedded in longer tokens like "Fakturanummer: 2465027205"
to not be matched.

Changes:
- Extended Strategy 4 (substring match) to numeric fields
- Updated _find_substring_matches to support OCR, InvoiceNumber, Bankgiro, Plusgiro

This should significantly improve match rates for these fields.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 17:49:27 +01:00
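
One way to read the substring strategy from dd69fbe9ed is sketched below. The function name echoes `_find_substring_matches` from the commit message, but the signature and the digit-boundary rule are assumptions; the commit does not say how partial-number matches are avoided.

```python
import re
from typing import List


def find_substring_matches(tokens: List[str], value: str) -> List[int]:
    """Return indexes of tokens that contain `value` inside a longer string.

    The lookarounds require that the match is not flanked by other digits,
    so "2465027205" is found in "Fakturanummer: 2465027205" but not in
    "12465027205".
    """
    pattern = re.compile(rf"(?<!\d){re.escape(value)}(?!\d)")
    return [
        index
        for index, token in enumerate(tokens)
        if token != value and pattern.search(token)
    ]


matches = find_substring_matches(
    ["Fakturanummer: 2465027205", "12465027205", "Datum", "2465027205"],
    "2465027205",
)
```

Exact matches are skipped here on the assumption that an earlier strategy already handles them.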
Yaojia Wang
8938661850 Initial commit: Invoice field extraction system using YOLO + OCR
Features:
- Auto-labeling pipeline: CSV values -> PDF search -> YOLO annotations
- Flexible date matching: year-month match, nearby date tolerance
- PDF text extraction with PyMuPDF
- OCR support for scanned documents (PaddleOCR)
- YOLO training and inference pipeline
- 7 field types: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 17:44:14 +01:00