11 Commits

Author SHA1 Message Date
Yaojia Wang
0990239e9c feat: add field-specific bbox expansion strategies for YOLO training
Implement center-point based bbox scaling with directional compensation
to capture field labels that typically appear above or to the left of
field values. This improves YOLO training data quality by including
contextual information around field values.

Key changes:
- Add shared.bbox module with ScaleStrategy dataclass and expand_bbox function
- Define field-specific strategies (ocr_number, bankgiro, invoice_date, etc.)
- Support manual_mode for minimal padding (no scaling)
- Integrate expand_bbox into AnnotationGenerator
- Add FIELD_TO_CLASS mapping for field_name to class_name lookup
- Comprehensive tests with 100% coverage (45 tests)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 22:56:52 +01:00
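
The center-point scaling with directional compensation described in commit 0990239e9c could look roughly like the sketch below. Only the names ScaleStrategy, expand_bbox, and manual_mode come from the commit message. The dataclass fields (scale_x, scale_y, extra_left, extra_top, manual_padding), the function signature, and the clamping to image bounds are assumptions made for illustration and may not match the actual shared.bbox module.

```python
from dataclasses import dataclass
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


@dataclass(frozen=True)
class ScaleStrategy:
    # All field names below are hypothetical, chosen for this sketch.
    scale_x: float = 1.0         # width multiplier applied around the center
    scale_y: float = 1.0         # height multiplier applied around the center
    extra_left: float = 0.0      # extra growth to the left, as a fraction of width
    extra_top: float = 0.0       # extra growth upward, as a fraction of height
    manual_padding: float = 2.0  # fixed pixel padding used in manual mode


def expand_bbox(
    bbox: BBox,
    image_size: Tuple[int, int],
    strategy: ScaleStrategy,
    manual_mode: bool = False,
) -> BBox:
    """Expand a field-value box so it also covers a nearby label."""
    x0, y0, x1, y1 = bbox
    img_w, img_h = image_size

    if manual_mode:
        # Manual annotations are trusted: add minimal padding, no scaling.
        pad = strategy.manual_padding
        nx0, ny0, nx1, ny1 = x0 - pad, y0 - pad, x1 + pad, y1 + pad
    else:
        w, h = x1 - x0, y1 - y0
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        half_w = w * strategy.scale_x / 2
        half_h = h * strategy.scale_y / 2
        # Scale symmetrically around the center, then push the left and
        # top edges further out, since labels tend to sit left or above.
        nx0 = cx - half_w - w * strategy.extra_left
        ny0 = cy - half_h - h * strategy.extra_top
        nx1 = cx + half_w
        ny1 = cy + half_h

    # Clamp to the image so the result stays a valid box for YOLO labels.
    return (
        max(0.0, nx0),
        max(0.0, ny0),
        min(float(img_w), nx1),
        min(float(img_h), ny1),
    )
```

A per-field table such as {"invoice_date": ScaleStrategy(scale_x=1.5, extra_left=0.5)} (values invented here) would then let each field class receive its own expansion.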
Yaojia Wang
8723ef4653 refactor: split line_items_extractor into smaller modules with comprehensive tests
- Extract models.py (LineItem, LineItemsResult dataclasses)
- Extract html_table_parser.py (ColumnMapper, HtmlTableParser)
- Extract merged_cell_handler.py (MergedCellHandler for PP-StructureV3 merged cells)
- Reduce line_items_extractor.py from 971 to 396 lines
- Add constants for magic numbers (MIN_AMOUNT_THRESHOLD, ROW_GROUPING_THRESHOLD, etc.)
- Fix row grouping algorithm in text_line_items_extractor.py
- Demote INFO logs to DEBUG level in structure_detector.py
- Add 209 tests achieving 85%+ coverage on main modules

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 23:02:00 +01:00
Yaojia Wang
c2c8f2dd04 WIP 2026-02-03 22:29:53 +01:00
Yaojia Wang
4c7fc3015c fix: add PDF magic bytes validation to prevent file type spoofing
Add validation that checks PDF files start with '%PDF' magic bytes
before accepting uploads. This prevents attackers from uploading
malicious files (executables, scripts) by renaming them to .pdf.

- Add validate_pdf_magic_bytes() function with clear error messages
- Integrate validation in upload_document endpoint after file read
- Add comprehensive test coverage (13 test cases)

Addresses medium-risk security issue from code review.
2026-02-03 22:28:24 +01:00
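
The magic-bytes check from commit 4c7fc3015c can be sketched as follows. The function name validate_pdf_magic_bytes and the '%PDF' prefix come from the commit message. The signature, the InvalidPdfError exception, and the error wording are assumptions, not the project's actual code.

```python
PDF_MAGIC = b"%PDF"  # every PDF file begins with this signature


class InvalidPdfError(ValueError):
    """Hypothetical exception type; the real project may use another."""


def validate_pdf_magic_bytes(content: bytes) -> None:
    """Raise InvalidPdfError unless the content starts with the PDF signature."""
    if not content:
        raise InvalidPdfError("Uploaded file is empty.")
    if not content.startswith(PDF_MAGIC):
        # The file name and declared content type are ignored on purpose:
        # both are supplied by the client and are trivial to spoof.
        raise InvalidPdfError(
            "File content does not start with '%PDF'; it is not a valid PDF."
        )
```

In an upload handler this would run right after the request body is read and before anything is written to storage, so a renamed executable is rejected early. The check only confirms the header; it does not prove the rest of the file is a well-formed or harmless PDF.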
Yaojia Wang
35988b1ebf Update paddle, and support invoice line item 2026-02-03 21:28:06 +01:00
Yaojia Wang
b602d0a340 re-structure 2026-02-01 22:55:31 +01:00
Yaojia Wang
400b12a967 Add more tests 2026-02-01 22:40:41 +01:00
Yaojia Wang
a564ac9d70 WIP 2026-02-01 18:51:54 +01:00
Yaojia Wang
a516de4320 WIP 2026-02-01 00:08:40 +01:00
Yaojia Wang
33ada0350d WIP 2026-01-30 00:44:21 +01:00
Yaojia Wang
d6550375b0 restructure project 2026-01-27 23:58:17 +01:00