Invoice Master POC v2
Swedish Invoice Field Extraction System - YOLO and PaddleOCR extract structured data from Swedish PDF invoices.
Architecture
PDF → PyMuPDF (DPI=150) → YOLO Detection → PaddleOCR → Field Extraction → Normalization → Output
Project Structure
packages/
├── backend/ # FastAPI web server + inference pipeline
│ └── pipeline/ # YOLO detector → OCR → field extractor → value selector → normalizers
├── shared/ # Common utilities (bbox, OCR, field mappings)
└── training/ # YOLO training data generation (annotation, dataset)
tests/ # Mirrors packages/ structure
Pipeline Flow (process_pdf)
- YOLO detects field regions on rendered PDF page
- PaddleOCR extracts text from detected bboxes
- Field extractor maps detections to invoice fields via CLASS_TO_FIELD
- Value selector picks best candidate per field (confidence + validation)
- Normalizers clean values (dates, amounts, invoice numbers)
- Fallback regex extraction if key fields missing
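The last four steps above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: `Candidate`, `select_fields`, `apply_fallback`, the normalizers, the class indices in this `CLASS_TO_FIELD`, and the `Fakturanummer` pattern are all hypothetical stand-ins. The real mapping lives in `packages/shared/shared/fields/mappings.py` and may look different.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional

# Hypothetical subset of the class-to-field mapping (assumed class indices).
CLASS_TO_FIELD: dict[int, str] = {0: "invoice_number", 1: "invoice_date", 2: "amount"}


@dataclass(frozen=True)
class Candidate:
    class_id: int      # YOLO class index of the detected region
    text: str          # OCR text read from that region
    confidence: float  # detection confidence in [0, 1]


def normalize_amount(raw: str) -> Optional[str]:
    """Turn a Swedish-style amount such as '1 234,50 kr' into '1234.50'."""
    cleaned = re.sub(r"[^\d,.]", "", raw).replace(",", ".")
    if not re.fullmatch(r"\d+(\.\d{1,2})?", cleaned):
        return None
    return f"{float(cleaned):.2f}"


def normalize_date(raw: str) -> Optional[str]:
    """Return an ISO date, trying a few assumed input formats."""
    for fmt in ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_invoice_number(raw: str) -> Optional[str]:
    """Strip whitespace; an empty result counts as invalid."""
    return re.sub(r"\s+", "", raw) or None


NORMALIZERS: dict[str, Callable[[str], Optional[str]]] = {
    "invoice_number": normalize_invoice_number,
    "invoice_date": normalize_date,
    "amount": normalize_amount,
}

# Assumed fallback pattern; 'Fakturanummer' is Swedish for 'invoice number'.
FALLBACK_PATTERNS: dict[str, re.Pattern[str]] = {
    "invoice_number": re.compile(r"Fakturanummer[:\s]+(\S+)", re.IGNORECASE),
}


def select_fields(candidates: list[Candidate], threshold: float = 0.5) -> dict[str, str]:
    """Keep the highest-confidence candidate per field that also normalizes cleanly."""
    best: dict[str, tuple[float, str]] = {}
    for cand in candidates:
        field = CLASS_TO_FIELD.get(cand.class_id)
        if field is None or cand.confidence < threshold:
            continue
        value = NORMALIZERS[field](cand.text)
        if value is None:  # failed validation, so never selected
            continue
        if field not in best or cand.confidence > best[field][0]:
            best[field] = (cand.confidence, value)
    return {field: value for field, (_, value) in best.items()}


def apply_fallback(fields: dict[str, str], page_text: str) -> dict[str, str]:
    """Fill missing fields by regex, validating each fallback value again."""
    result = dict(fields)
    for field, pattern in FALLBACK_PATTERNS.items():
        if field in result:
            continue
        match = pattern.search(page_text)
        if match:
            value = NORMALIZERS[field](match.group(1))
            if value is not None:
                result[field] = value
    return result
```

Validation is folded into selection here, so a high-confidence detection whose text fails normalization cannot displace a lower-confidence one that parses. Fallback values pass through the same normalizers before they are accepted.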
Tech Stack
| Component | Technology |
|---|---|
| Object Detection | YOLO (Ultralytics >= 8.4.0) |
| OCR | PaddleOCR v5 (PP-OCRv5) |
| PDF Rendering | PyMuPDF (fitz), DPI=150 |
| Database | PostgreSQL + psycopg2 |
| Web | FastAPI + Uvicorn |
| ML | PyTorch + CUDA 12.x |
WSL Environment (REQUIRED)
ALL Python commands MUST use this prefix:
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-sm120 && <command>"
NEVER run Python directly in Windows PowerShell/CMD.
Project Rules
- Python 3.10, type hints on all function signatures
- No `print()` in production code - use `logging` module
- Validation with `pydantic` or `dataclasses`
- Error handling with `try/except` (not try/catch)
- Run tests: `pytest --cov=packages tests/`
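A function written to these rules might look roughly like the sketch below. `InvoiceTotal` and `parse_total` are hypothetical names chosen for illustration and are not part of the codebase. The sketch uses the `dataclasses` option; `pydantic` would serve equally well.

```python
import logging
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class InvoiceTotal:
    """Hypothetical validated value object (illustration only)."""

    amount: Decimal
    currency: str = "SEK"

    def __post_init__(self) -> None:
        # Validation happens at construction time.
        if self.amount < 0:
            raise ValueError(f"amount must be non-negative, got {self.amount}")


def parse_total(raw: str) -> Optional[InvoiceTotal]:
    """Parse a raw amount string; log failures instead of printing them."""
    try:
        return InvoiceTotal(amount=Decimal(raw.strip()))
    except (InvalidOperation, ValueError) as exc:
        logger.warning("Could not parse total %r: %s", raw, exc)
        return None
```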
Key Files
| File | Purpose |
|---|---|
| `packages/backend/backend/pipeline/pipeline.py` | Main inference pipeline |
| `packages/backend/backend/pipeline/field_extractor.py` | YOLO → field mapping |
| `packages/backend/backend/pipeline/value_selector.py` | Best candidate selection |
| `packages/shared/shared/fields/mappings.py` | CLASS_TO_FIELD mapping |
| `packages/shared/shared/ocr/paddle_ocr.py` | OCRToken definition |
| `packages/shared/shared/bbox/` | Bbox expansion strategies |
Environment Variables
# Required
DB_PASSWORD=
# Optional (with defaults)
DB_HOST=192.168.68.31
DB_PORT=5432
DB_NAME=docmaster
DB_USER=docmaster
MODEL_PATH=runs/train/invoice_fields/weights/best.pt
CONFIDENCE_THRESHOLD=0.5
SERVER_HOST=0.0.0.0
SERVER_PORT=8000
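One way to read these variables with their defaults is sketched below. The `Settings` dataclass and `load_settings` function are assumptions made for illustration; the project's actual configuration code is not shown here and may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object mirroring the variables listed above."""

    db_password: str
    db_host: str = "192.168.68.31"
    db_port: int = 5432
    db_name: str = "docmaster"
    db_user: str = "docmaster"
    model_path: str = "runs/train/invoice_fields/weights/best.pt"
    confidence_threshold: float = 0.5
    server_host: str = "0.0.0.0"
    server_port: int = 8000


# Environment variable name -> (Settings attribute, converter).
_ENV_FIELDS: dict[str, tuple[str, type]] = {
    "DB_HOST": ("db_host", str),
    "DB_PORT": ("db_port", int),
    "DB_NAME": ("db_name", str),
    "DB_USER": ("db_user", str),
    "MODEL_PATH": ("model_path", str),
    "CONFIDENCE_THRESHOLD": ("confidence_threshold", float),
    "SERVER_HOST": ("server_host", str),
    "SERVER_PORT": ("server_port", int),
}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment, failing fast if DB_PASSWORD is unset."""
    password = env.get("DB_PASSWORD", "")
    if not password:
        raise RuntimeError("DB_PASSWORD is required but not set")
    overrides = {
        attr: convert(env[name])
        for name, (attr, convert) in _ENV_FIELDS.items()
        if name in env
    }
    return Settings(db_password=password, **overrides)
```

Passing the environment in as a mapping keeps the loader easy to exercise in tests without touching `os.environ`.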
Auto-trigger Rules (ALWAYS FOLLOW - even after context compaction)
These rules MUST be followed regardless of conversation history:
- New feature or bug fix → MUST use tdd-guide agent (write tests first)
- When writing code → MUST follow coding standards skill for the target language:
  - Python → `python-patterns` (PEP 8, type hints, Pythonic idioms)
  - C# → `dotnet-skills:coding-standards` (records, pattern matching, modern C#)
  - TS/JS → `coding-standards` (universal best practices)
- After writing/modifying code → MUST use code-reviewer agent
- Before git commit → MUST use security-reviewer agent
- When build/test fails → MUST use build-error-resolver agent
- After context compaction → read MEMORY.md to restore session state
Plan Completion Protocol
After completing any plan or major task:
1. Test - Run `pytest` to confirm all tests pass
2. Security review - Use security-reviewer agent on changed files
3. Fix loop - If security review reports CRITICAL or HIGH issues:
   - Fix the issues
   - Re-run tests (back to step 1)
   - Re-run security review (back to step 2)
   - Repeat until no CRITICAL/HIGH issues remain
4. Commit - Auto-commit with conventional commit message (`feat:`, `fix:`, `refactor:`, etc.). Stage only the files changed in this task, not unrelated files
5. Save - Write a summary to MEMORY.md including: what was done, files changed, decisions made, remaining work
6. Suggest clear - Tell the user: "Plan complete. Recommend `/clear` to free context for the next task."
7. Do NOT start a new task in the same context - wait for user to /clear first
This keeps each plan in a fresh context window for maximum quality.
Known Issues
- Pre-existing test failures: `test_s3.py`, `test_azure.py` (missing boto3/azure) - safe to ignore
- Always re-run dedup/validation after fallback adds new fields
- PDF DPI must be 150 (not 300) for correct bbox alignment
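
The DPI constraint is easier to see with numbers. PDF coordinates are measured in points (72 per inch), so a page rendered at 150 DPI is scaled by 150/72, and a pixel bbox must be mapped back with that same factor. The helper below is a hypothetical sketch (`pixels_to_points` is not a project function) showing how assuming the wrong DPI shifts every coordinate. Whether this is the only cause of the misalignment at 300 DPI is an assumption.

```python
PDF_POINTS_PER_INCH = 72.0
RENDER_DPI = 150  # the DPI the pipeline renders pages at

BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def pixels_to_points(bbox: BBox, dpi: int = RENDER_DPI) -> BBox:
    """Map a bbox from rendered-image pixels back to PDF points."""
    x0, y0, x1, y1 = bbox
    return (
        x0 * PDF_POINTS_PER_INCH / dpi,
        y0 * PDF_POINTS_PER_INCH / dpi,
        x1 * PDF_POINTS_PER_INCH / dpi,
        y1 * PDF_POINTS_PER_INCH / dpi,
    )


# A box detected on a 150 DPI render:
detected: BBox = (150.0, 300.0, 450.0, 600.0)
correct = pixels_to_points(detected, dpi=150)  # (72.0, 144.0, 216.0, 288.0)
wrong = pixels_to_points(detected, dpi=300)    # (36.0, 72.0, 108.0, 144.0)
```

With a mismatched DPI the box lands at half its true position and size, so any text lookup against the PDF would hit the wrong region.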