Commit Graph

6 Commits

Author SHA1 Message Date
Yaojia Wang
f1a7bfe6b7 WIP 2026-02-07 13:56:00 +01:00
Yaojia Wang
8723ef4653 refactor: split line_items_extractor into smaller modules with comprehensive tests
- Extract models.py (LineItem, LineItemsResult dataclasses)
- Extract html_table_parser.py (ColumnMapper, HtmlTableParser)
- Extract merged_cell_handler.py (MergedCellHandler for PP-StructureV3 merged cells)
- Reduce line_items_extractor.py from 971 to 396 lines
- Add constants for magic numbers (MIN_AMOUNT_THRESHOLD, ROW_GROUPING_THRESHOLD, etc.)
- Fix row grouping algorithm in text_line_items_extractor.py
- Demote INFO logs to DEBUG level in structure_detector.py
- Add 209 tests achieving 85%+ coverage on main modules

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 23:02:00 +01:00
Yaojia Wang
c2c8f2dd04 WIP 2026-02-03 22:29:53 +01:00
Yaojia Wang
4c7fc3015c fix: add PDF magic bytes validation to prevent file type spoofing
Add validation that checks PDF files start with '%PDF' magic bytes
before accepting uploads. This prevents attackers from uploading
malicious files (executables, scripts) by renaming them to .pdf.

- Add validate_pdf_magic_bytes() function with clear error messages
- Integrate validation in upload_document endpoint after file read
- Add comprehensive test coverage (13 test cases)

Addresses medium-risk security issue from code review.
2026-02-03 22:28:24 +01:00
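
The check described in this commit can be sketched as follows. This is a minimal illustration, not the committed code: only the function name validate_pdf_magic_bytes() and the '%PDF' prefix come from the commit message. The signature, the InvalidPdfError exception, and the error wording are assumptions.

```python
# Hypothetical sketch; only the function name and the '%PDF' prefix
# are taken from the commit message.
PDF_MAGIC_BYTES = b"%PDF"


class InvalidPdfError(ValueError):
    """Assumed exception type for uploads that are not real PDFs."""


def validate_pdf_magic_bytes(content: bytes) -> None:
    """Raise InvalidPdfError unless content starts with the PDF signature."""
    if len(content) < len(PDF_MAGIC_BYTES):
        raise InvalidPdfError("File is too small to be a valid PDF.")
    if not content.startswith(PDF_MAGIC_BYTES):
        raise InvalidPdfError(
            "File content does not start with the '%PDF' signature."
        )
```

Called on the bytes read from an upload, the function returns nothing for a real PDF header and raises for a renamed executable or script, so an upload endpoint could turn the exception into a client error response.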
Yaojia Wang
35988b1ebf Update paddle and support invoice line items 2026-02-03 21:28:06 +01:00
Yaojia Wang
b602d0a340 re-structure 2026-02-01 22:55:31 +01:00