Yaojia Wang
8723ef4653
refactor: split line_items_extractor into smaller modules with comprehensive tests
- Extract models.py (LineItem, LineItemsResult dataclasses)
- Extract html_table_parser.py (ColumnMapper, HtmlTableParser)
- Extract merged_cell_handler.py (MergedCellHandler for PP-StructureV3 merged cells)
- Reduce line_items_extractor.py from 971 to 396 lines
- Add constants for magic numbers (MIN_AMOUNT_THRESHOLD, ROW_GROUPING_THRESHOLD, etc.)
- Fix row grouping algorithm in text_line_items_extractor.py
- Demote INFO logs to DEBUG level in structure_detector.py
- Add 209 tests achieving 85%+ coverage on main modules
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 23:02:00 +01:00
Yaojia Wang
c2c8f2dd04
WIP
2026-02-03 22:29:53 +01:00
Yaojia Wang
4c7fc3015c
fix: add PDF magic bytes validation to prevent file type spoofing
Add validation that checks PDF files start with '%PDF' magic bytes
before accepting uploads. This prevents attackers from uploading
malicious files (executables, scripts) by renaming them to .pdf.
- Add validate_pdf_magic_bytes() function with clear error messages
- Integrate validation in upload_document endpoint after file read
- Add comprehensive test coverage (13 test cases)
Addresses medium-risk security issue from code review.
2026-02-03 22:28:24 +01:00
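The magic-bytes check described in commit 4c7fc3015c can be sketched roughly as follows. This is a minimal illustration, not the committed code: only the function name validate_pdf_magic_bytes() comes from the commit message. The signature, the InvalidPdfError exception class, the PDF_MAGIC constant, and the error wording are all assumptions made for the example.

```python
# Hypothetical sketch of a PDF magic-bytes check.
# Only the function name is taken from the commit message; everything
# else (signature, exception type, messages) is assumed for illustration.

PDF_MAGIC = b"%PDF"  # the leading bytes the commit message says are checked


class InvalidPdfError(ValueError):
    """Raised when uploaded content does not look like a PDF (assumed name)."""


def validate_pdf_magic_bytes(content: bytes) -> None:
    """Raise InvalidPdfError unless `content` starts with the PDF magic bytes."""
    if not content:
        raise InvalidPdfError("Uploaded file is empty.")
    if not content.startswith(PDF_MAGIC):
        # A renamed executable or script fails here regardless of its extension.
        raise InvalidPdfError(
            "File content does not start with '%PDF'; it is not a valid PDF."
        )
```

In an upload handler, a check like this would run on the bytes already read from the request, before anything is stored or parsed. The file extension and the client-supplied content type are not consulted, which is what defeats a simple rename to .pdf.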
Yaojia Wang
35988b1ebf
Update paddle, and support invoice line item
2026-02-03 21:28:06 +01:00
Yaojia Wang
b602d0a340
re-structure
2026-02-01 22:55:31 +01:00
Yaojia Wang
400b12a967
Add more tests
2026-02-01 22:40:41 +01:00
Yaojia Wang
a564ac9d70
WIP
2026-02-01 18:51:54 +01:00
Yaojia Wang
a516de4320
WIP
2026-02-01 00:08:40 +01:00
Yaojia Wang
33ada0350d
WIP
2026-01-30 00:44:21 +01:00
Yaojia Wang
d6550375b0
restructure project
2026-01-27 23:58:17 +01:00
Yaojia Wang
58bf75db68
WIP
2026-01-27 00:47:10 +01:00
Yaojia Wang
e599424a92
Re-structure the project.
2026-01-25 15:21:11 +01:00