Initial commit: Invoice field extraction system using YOLO + OCR

Features:
- Auto-labeling pipeline: CSV values -> PDF search -> YOLO annotations
- Flexible date matching: year-month match, nearby date tolerance
- PDF text extraction with PyMuPDF
- OCR support for scanned documents (PaddleOCR)
- YOLO training and inference pipeline
- 7 field types: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount
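
The flexible date matching listed above could look roughly like the sketch below. It is not taken from the committed code: the function name `match_date`, the 3-day default tolerance, and the order in which the rules are tried are all assumptions made for illustration.

```python
from datetime import date
from typing import Optional


def match_date(expected: date, candidate: date, tolerance_days: int = 3) -> Optional[str]:
    """Classify how a date found in the PDF relates to the CSV value.

    Hypothetical helper: the name, the default tolerance, and the
    rule order are assumptions, not the repository's actual logic.
    """
    # Strongest match: the two dates are identical.
    if candidate == expected:
        return "exact"
    # Nearby date tolerance: within a few days either side.
    if abs((candidate - expected).days) <= tolerance_days:
        return "nearby"
    # Year-month match: same calendar month, any day.
    if (candidate.year, candidate.month) == (expected.year, expected.month):
        return "year_month"
    return None
```

Returning a label rather than a boolean lets a caller prefer an exact hit over a looser one when several candidate dates appear on the same page.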

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Yaojia Wang
2026-01-10 17:44:14 +01:00
commit 8938661850
35 changed files with 5020 additions and 0 deletions

requirements.txt

@@ -0,0 +1,22 @@
# Invoice Master POC v2 - Dependencies

# PDF Processing
PyMuPDF>=1.23.0 # PDF rendering and text extraction

# OCR
paddlepaddle>=2.5.0 # PaddlePaddle framework
paddleocr>=2.7.0 # PaddleOCR

# YOLO
ultralytics>=8.1.0 # YOLOv8 (YOLO11 models require >=8.3.0)

# Image Processing
Pillow>=10.0.0 # Image handling
numpy>=1.24.0 # Array operations
opencv-python>=4.8.0 # Image processing

# Data Processing
pyyaml>=6.0 # YAML config files

# Utilities
tqdm>=4.65.0 # Progress bars
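
The auto-labeling step described in the commit message ends in YOLO annotations, which ultralytics reads as one text line per box: a class index followed by the box center, width, and height, each normalized to the page size. A minimal sketch of that conversion follows. The class order in `FIELD_CLASSES` simply mirrors the order of the field list in the commit message, and the helper name `to_yolo_line` is hypothetical; the repository's actual class mapping may differ.

```python
# Assumed class order: copied from the field list in the commit message.
FIELD_CLASSES = [
    "InvoiceNumber", "InvoiceDate", "InvoiceDueDate",
    "OCR", "Bankgiro", "Plusgiro", "Amount",
]


def to_yolo_line(field, bbox, page_width, page_height):
    """Format one bounding box as a YOLO detection label line.

    bbox is (x0, y0, x1, y1) in the same units as the page size,
    with the origin at the top-left corner (an assumption here).
    """
    x0, y0, x1, y1 = bbox
    class_id = FIELD_CLASSES.index(field)
    # YOLO stores the box center and size as fractions of the page.
    x_center = (x0 + x1) / 2 / page_width
    y_center = (y0 + y1) / 2 / page_height
    width = (x1 - x0) / page_width
    height = (y1 - y0) / page_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


print(to_yolo_line("Amount", (100, 200, 300, 250), 600, 800))
# prints "6 0.333333 0.281250 0.333333 0.062500"
```

Because the values are normalized, the same label line stays valid whether the page is later rendered at 150 or 300 DPI.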