# Invoice Master POC v2

Swedish Invoice Field Extraction System - YOLO + PaddleOCR extracts structured data from Swedish PDF invoices.

## Architecture

```
PDF → PyMuPDF (DPI=150) → YOLO Detection → PaddleOCR → Field Extraction → Normalization → Output
```

### Project Structure

```
packages/
├── backend/          # FastAPI web server + inference pipeline
│   └── pipeline/     # YOLO detector → OCR → field extractor → value selector → normalizers
├── shared/           # Common utilities (bbox, OCR, field mappings)
└── training/         # YOLO training data generation (annotation, dataset)
tests/                # Mirrors packages/ structure
```

### Pipeline Flow (process_pdf)

1. YOLO detects field regions on rendered PDF page
2. PaddleOCR extracts text from detected bboxes
3. Field extractor maps detections to invoice fields via CLASS_TO_FIELD
4. Value selector picks best candidate per field (confidence + validation)
5. Normalizers clean values (dates, amounts, invoice numbers)
6. Fallback regex extraction if key fields missing

## Tech Stack

| Component | Technology |
|-----------|------------|
| Object Detection | YOLO (Ultralytics >= 8.4.0) |
| OCR | PaddleOCR v5 (PP-OCRv5) |
| PDF | PyMuPDF (fitz), DPI=150 |
| Database | PostgreSQL + psycopg2 |
| Web | FastAPI + Uvicorn |
| ML | PyTorch + CUDA 12.x |

## WSL Environment (REQUIRED)

ALL Python commands MUST use this prefix:

```bash
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-sm120 && "
```

NEVER run Python directly in Windows PowerShell/CMD.
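
For scripts that launch project commands from the Windows side, the required prefix can be wrapped in a small helper. The following is a minimal sketch, not part of the repository: `wsl_command` and `CONDA_SETUP` are hypothetical names, and the sketch assumes the command to run is appended after the final `&&` of the prefix shown above.

```python
# Hypothetical helper (not from the repository): builds the argv list for
# running a command inside the WSL conda environment described above.

# Setup portion of the documented prefix, copied verbatim.
CONDA_SETUP = (
    "source ~/miniconda3/etc/profile.d/conda.sh "
    "&& conda activate invoice-sm120"
)


def wsl_command(command: str) -> list[str]:
    """Return an argv list that runs `command` through the WSL prefix."""
    if not command.strip():
        raise ValueError("command must not be empty")
    # Assumption: the command goes after the trailing `&&` of the prefix.
    return ["wsl", "bash", "-c", f"{CONDA_SETUP} && {command}"]
```

The returned list can be passed to `subprocess.run(..., check=True)`, which avoids an extra layer of shell quoting on the Windows side.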

## Project Rules

- Python 3.10, type hints on all function signatures
- No `print()` in production code - use `logging` module
- Validation with `pydantic` or `dataclasses`
- Error handling with `try/except` (not try/catch)
- Run tests: `pytest --cov=packages tests/`

## Key Files

| File | Purpose |
|------|---------|
| `packages/backend/backend/pipeline/pipeline.py` | Main inference pipeline |
| `packages/backend/backend/pipeline/field_extractor.py` | YOLO → field mapping |
| `packages/backend/backend/pipeline/value_selector.py` | Best candidate selection |
| `packages/shared/shared/fields/mappings.py` | CLASS_TO_FIELD mapping |
| `packages/shared/shared/ocr/paddle_ocr.py` | OCRToken definition |
| `packages/shared/shared/bbox/` | Bbox expansion strategies |

## Environment Variables

```bash
# Required
DB_PASSWORD=

# Optional (with defaults)
DB_HOST=192.168.68.31
DB_PORT=5432
DB_NAME=docmaster
DB_USER=docmaster
MODEL_PATH=runs/train/invoice_fields/weights/best.pt
CONFIDENCE_THRESHOLD=0.5
SERVER_HOST=0.0.0.0
SERVER_PORT=8000
```

## Auto-trigger Rules (ALWAYS FOLLOW - even after context compaction)

These rules MUST be followed regardless of conversation history:

- New feature or bug fix → MUST use **tdd-guide** agent (write tests first)
- When writing code → MUST follow coding standards skill for the target language:
  - Python → `python-patterns` (PEP 8, type hints, Pythonic idioms)
  - C# → `dotnet-skills:coding-standards` (records, pattern matching, modern C#)
  - TS/JS → `coding-standards` (universal best practices)
- After writing/modifying code → MUST use **code-reviewer** agent
- Before git commit → MUST use **security-reviewer** agent
- When build/test fails → MUST use **build-error-resolver** agent
- After context compaction → read MEMORY.md to restore session state

## Plan Completion Protocol

After completing any plan or major task:

1. **Test** - Run `pytest` to confirm all tests pass
2. **Security review** - Use **security-reviewer** agent on changed files
3. **Fix loop** - If security review reports CRITICAL or HIGH issues:
   - Fix the issues
   - Re-run tests (back to step 1)
   - Re-run security review (back to step 2)
   - Repeat until no CRITICAL/HIGH issues remain
4. **Commit** - Auto-commit with conventional commit message (`feat:`, `fix:`, `refactor:`, etc.). Stage only the files changed in this task, not unrelated files
5. **Save** - Write a summary to MEMORY.md including: what was done, files changed, decisions made, remaining work
6. **Suggest clear** - Tell the user: "Plan complete. Recommend `/clear` to free context for the next task."
7. **Do NOT start a new task** in the same context - wait for user to /clear first

This keeps each plan in a fresh context window for maximum quality.

## Known Issues

- Pre-existing test failures: `test_s3.py`, `test_azure.py` (missing boto3/azure) - safe to ignore
- Always re-run dedup/validation after fallback adds new fields
- PDF DPI must be 150 (not 300) for correct bbox alignment
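
The DPI constraint in the last item is easier to see with numbers. The sketch below is illustrative only: `BBox` and `pixels_to_points` are hypothetical names, not the project's `shared/bbox` code. It relies on the fact that default PDF user space measures 72 points per inch, so a pixel coordinate on a page rendered at a given DPI maps back to points by a factor of `72 / dpi`. Interpreting the same pixel box at 300 DPI instead of 150 halves every coordinate, which is one plausible way the misalignment arises.

```python
from dataclasses import dataclass

PDF_POINTS_PER_INCH = 72.0  # default PDF user space: 1 point = 1/72 inch
RENDER_DPI = 150            # render DPI required by this project


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box; hypothetical type used only for this illustration."""
    x0: float
    y0: float
    x1: float
    y1: float


def pixels_to_points(bbox: BBox, dpi: int = RENDER_DPI) -> BBox:
    """Map a bbox from rendered-image pixels back to PDF points."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")

    def convert(value: float) -> float:
        # pixels * (points per inch) / (pixels per inch) = points
        return value * PDF_POINTS_PER_INCH / dpi

    return BBox(convert(bbox.x0), convert(bbox.y0),
                convert(bbox.x1), convert(bbox.y1))


# A box detected on a page rendered at 150 DPI.
detected = BBox(150, 300, 450, 600)
at_150 = pixels_to_points(detected)           # BBox(72.0, 144.0, 216.0, 288.0)
at_300 = pixels_to_points(detected, dpi=300)  # BBox(36.0, 72.0, 108.0, 144.0)
```

The second result shows the failure mode: the box lands at half the intended offset and half the intended size when the DPI assumed for the conversion differs from the DPI used for rendering.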