Initial commit: Invoice field extraction system using YOLO + OCR

Features:
- Auto-labeling pipeline: CSV values -> PDF search -> YOLO annotations
- Flexible date matching: year-month match, nearby date tolerance
- PDF text extraction with PyMuPDF
- OCR support for scanned documents (PaddleOCR)
- YOLO training and inference pipeline
- 7 field types: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR (payment reference number, not optical character recognition), Bankgiro, Plusgiro, Amount
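
The labeling and date-matching steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the function names `to_yolo_line` and `dates_match`, the class-id order in `FIELD_CLASSES` (taken from the order of the field list), and the 3-day default tolerance are all assumptions. The bounding box is assumed to come from a text search on the PDF page, in page coordinates `(x0, y0, x1, y1)`.

```python
from datetime import date

# Assumed class-id order: simply the order in which the commit message lists the fields.
FIELD_CLASSES = [
    "InvoiceNumber", "InvoiceDate", "InvoiceDueDate",
    "OCR", "Bankgiro", "Plusgiro", "Amount",
]


def to_yolo_line(class_id, bbox, page_width, page_height):
    """Turn a box in page coordinates into one YOLO label line.

    YOLO labels are "class x_center y_center width height",
    with all four box values normalized to the range [0, 1].
    """
    x0, y0, x1, y1 = bbox
    x_center = (x0 + x1) / 2 / page_width
    y_center = (y0 + y1) / 2 / page_height
    width = (x1 - x0) / page_width
    height = (y1 - y0) / page_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


def dates_match(expected, found, tolerance_days=3):
    """Hypothetical reading of "flexible date matching".

    Accept an exact match, a match on year and month,
    or a date within `tolerance_days` of the expected one.
    """
    if expected == found:
        return True
    if (expected.year, expected.month) == (found.year, found.month):
        return True
    return abs((expected - found).days) <= tolerance_days


if __name__ == "__main__":
    class_id = FIELD_CLASSES.index("InvoiceDate")
    print(to_yolo_line(class_id, (100, 200, 300, 250), 1000, 1000))
    print(dates_match(date(2025, 1, 31), date(2025, 2, 2)))
```

A looser tolerance produces more labels but also more false matches, so the threshold would need tuning against the actual CSV data.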

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Yaojia Wang
2026-01-10 17:44:14 +01:00
commit 8938661850
35 changed files with 5020 additions and 0 deletions

.gitignore

@@ -0,0 +1,71 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
ENV/
env/
.venv/

# IDE
.idea/
.vscode/
*.swp
*.swo
*~

# Data files (large files)
data/raw_pdfs/
data/dataset/train/images/
data/dataset/val/images/
data/dataset/test/images/
data/dataset/train/labels/
data/dataset/val/labels/
data/dataset/test/labels/
*.pdf
*.png
*.jpg
*.jpeg

# Model weights
models/weights/
runs/
*.pt
*.onnx

# Reports and logs
reports/*.jsonl
logs/
*.log

# Jupyter
.ipynb_checkpoints/

# OS
.DS_Store
Thumbs.db

# Credentials
.env
*.key
*.pem
credentials.json
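
Because the data and weights directories above are ignored, they will not exist in a fresh clone. A small helper along these lines could recreate them. This is a sketch only: the function name `create_ignored_dirs` is hypothetical, and the directory list is copied from the ignore rules above rather than from any setup script in the repository.

```python
from pathlib import Path

# Directories taken from the ignore rules above; Git will not track or create them.
IGNORED_DIRS = [
    "data/raw_pdfs",
    "data/dataset/train/images",
    "data/dataset/val/images",
    "data/dataset/test/images",
    "data/dataset/train/labels",
    "data/dataset/val/labels",
    "data/dataset/test/labels",
    "models/weights",
    "logs",
]


def create_ignored_dirs(root="."):
    """Create the ignored working directories under `root` if they are missing."""
    created = []
    for rel in IGNORED_DIRS:
        path = Path(root) / rel
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


if __name__ == "__main__":
    for path in create_ignored_dirs():
        print(path)
```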