invoice-master-poc-v2/.claude/CLAUDE.md
Yaojia Wang 58d36c8927 WIP
2026-02-12 23:06:00 +01:00

Invoice Master POC v2

Swedish invoice field extraction system: YOLO and PaddleOCR extract structured data from Swedish PDF invoices.

Architecture

PDF → PyMuPDF (DPI=150) → YOLO Detection → PaddleOCR → Field Extraction → Normalization → Output

Project Structure

packages/
├── backend/    # FastAPI web server + inference pipeline
│   └── pipeline/   # YOLO detector → OCR → field extractor → value selector → normalizers
├── shared/     # Common utilities (bbox, OCR, field mappings)
└── training/   # YOLO training data generation (annotation, dataset)
tests/          # Mirrors packages/ structure

Pipeline Flow (process_pdf)

  1. YOLO detects field regions on rendered PDF page
  2. PaddleOCR extracts text from detected bboxes
  3. Field extractor maps detections to invoice fields via CLASS_TO_FIELD
  4. Value selector picks best candidate per field (confidence + validation)
  5. Normalizers clean values (dates, amounts, invoice numbers)
  6. Fallback regex extraction runs if key fields are still missing
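
The six steps above can be sketched as follows. This is a minimal, self-contained illustration rather than the project's code: `Detection`, `extract_fields`, `select_best`, `normalize_amount`, `process`, the two example `CLASS_TO_FIELD` entries, and the fallback regex are all hypothetical stand-ins, and the YOLO and OCR stages (steps 1 and 2) are replaced by pre-computed inputs.

```python
import re
from dataclasses import dataclass

# Hypothetical entries; the real mapping lives in shared/fields/mappings.py.
CLASS_TO_FIELD = {0: "invoice_number", 1: "amount"}


@dataclass(frozen=True)
class Detection:
    """One detected box plus the OCR text read from it (steps 1-2, stubbed)."""
    class_id: int
    confidence: float
    text: str


def extract_fields(detections: list[Detection]) -> dict[str, list[Detection]]:
    """Step 3: group detections by invoice field via CLASS_TO_FIELD."""
    candidates: dict[str, list[Detection]] = {}
    for det in detections:
        field = CLASS_TO_FIELD.get(det.class_id)
        if field is not None:  # unknown classes are dropped
            candidates.setdefault(field, []).append(det)
    return candidates


def select_best(candidates: list[Detection]) -> Detection:
    """Step 4: pick the highest-confidence candidate (validation omitted)."""
    return max(candidates, key=lambda d: d.confidence)


def normalize_amount(raw: str) -> str:
    """Step 5: turn a Swedish-style amount such as '1 234,50' into '1234.50'."""
    return raw.replace("\u00a0", "").replace(" ", "").replace(",", ".")


def process(detections: list[Detection], full_text: str) -> dict[str, str]:
    """Run steps 3-6 on already-detected, already-OCRed input."""
    result: dict[str, str] = {}
    for field, cands in extract_fields(detections).items():
        value = select_best(cands).text.strip()
        result[field] = normalize_amount(value) if field == "amount" else value
    # Step 6: regex fallback when a key field is still missing.
    if "invoice_number" not in result:
        match = re.search(r"Fakturanummer[:\s]+(\d+)", full_text)
        if match:
            result["invoice_number"] = match.group(1)
    return result
```

In this sketch the fallback only fills fields that the detector missed, so a detected value always takes precedence over a regex match.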

Tech Stack

| Component | Technology |
|---|---|
| Object Detection | YOLO (Ultralytics >= 8.4.0) |
| OCR | PaddleOCR v5 (PP-OCRv5) |
| PDF | PyMuPDF (fitz), DPI=150 |
| Database | PostgreSQL + psycopg2 |
| Web | FastAPI + Uvicorn |
| ML | PyTorch + CUDA 12.x |

WSL Environment (REQUIRED)

ALL Python commands MUST use this prefix:

wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-sm120 && <command>"

NEVER run Python directly in Windows PowerShell/CMD.
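
When a script on the Windows side needs to launch a command, the prefix can be assembled programmatically instead of being retyped. The helper below is a small sketch; `wsl_argv` and the two constants are hypothetical names, and only the command string itself comes from the rule above.

```python
# Pieces of the required prefix, copied from the rule above.
CONDA_INIT = "source ~/miniconda3/etc/profile.d/conda.sh"
CONDA_ENV = "invoice-sm120"


def wsl_argv(command: str) -> list[str]:
    """Build the argument list that runs `command` inside the WSL conda env."""
    script = f"{CONDA_INIT} && conda activate {CONDA_ENV} && {command}"
    # Passing a list (not one shell string) avoids an extra layer of quoting.
    return ["wsl", "bash", "-c", script]
```

A typical call would be `subprocess.run(wsl_argv("pytest --cov=packages tests/"), check=True)`.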

Project Rules

  • Python 3.10, type hints on all function signatures
  • No print() in production code - use logging module
  • Validation with pydantic or dataclasses
  • Error handling with try/except (not try/catch)
  • Run tests: pytest --cov=packages tests/
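
A short sketch of what these rules look like together in one function: type hints on every signature, a dataclass for validation, `logging` instead of `print()`, and `try/except` for error handling. `InvoiceDate` and `parse_invoice_date` are hypothetical names used only for illustration.

```python
import logging
from dataclasses import dataclass
from datetime import date
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class InvoiceDate:
    """Validated invoice date; construction fails on an impossible date."""
    year: int
    month: int
    day: int

    def __post_init__(self) -> None:
        # datetime.date raises ValueError for combinations such as 02-30.
        date(self.year, self.month, self.day)


def parse_invoice_date(raw: str) -> Optional[InvoiceDate]:
    """Parse 'YYYY-MM-DD'; on failure, log a warning and return None."""
    try:
        year, month, day = (int(part) for part in raw.split("-"))
        return InvoiceDate(year, month, day)
    except ValueError as exc:
        logger.warning("Rejected invoice date %r: %s", raw, exc)
        return None
```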

Key Files

| File | Purpose |
|---|---|
| packages/backend/backend/pipeline/pipeline.py | Main inference pipeline |
| packages/backend/backend/pipeline/field_extractor.py | YOLO → field mapping |
| packages/backend/backend/pipeline/value_selector.py | Best candidate selection |
| packages/shared/shared/fields/mappings.py | CLASS_TO_FIELD mapping |
| packages/shared/shared/ocr/paddle_ocr.py | OCRToken definition |
| packages/shared/shared/bbox/ | Bbox expansion strategies |

Environment Variables

# Required
DB_PASSWORD=

# Optional (with defaults)
DB_HOST=192.168.68.31
DB_PORT=5432
DB_NAME=docmaster
DB_USER=docmaster
MODEL_PATH=runs/train/invoice_fields/weights/best.pt
CONFIDENCE_THRESHOLD=0.5
SERVER_HOST=0.0.0.0
SERVER_PORT=8000
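
One way to read these variables is a small typed loader that applies the defaults listed above and fails fast when `DB_PASSWORD` is absent. This is a sketch under assumptions: `Settings` and `load_settings` are hypothetical names, and the project may load its configuration differently.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_password: str
    db_host: str
    db_port: int
    db_name: str
    db_user: str
    model_path: str
    confidence_threshold: float
    server_host: str
    server_port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from `env`, using the documented defaults."""
    password = env.get("DB_PASSWORD", "")
    if not password:
        # The only required variable: refuse to start without it.
        raise RuntimeError("DB_PASSWORD is required but not set")
    return Settings(
        db_password=password,
        db_host=env.get("DB_HOST", "192.168.68.31"),
        db_port=int(env.get("DB_PORT", "5432")),
        db_name=env.get("DB_NAME", "docmaster"),
        db_user=env.get("DB_USER", "docmaster"),
        model_path=env.get(
            "MODEL_PATH", "runs/train/invoice_fields/weights/best.pt"
        ),
        confidence_threshold=float(env.get("CONFIDENCE_THRESHOLD", "0.5")),
        server_host=env.get("SERVER_HOST", "0.0.0.0"),
        server_port=int(env.get("SERVER_PORT", "8000")),
    )
```

Accepting the environment as a parameter keeps the loader easy to test, since a plain dict can stand in for `os.environ`.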

Auto-trigger Rules (ALWAYS FOLLOW - even after context compaction)

These rules MUST be followed regardless of conversation history:

  • New feature or bug fix → MUST use tdd-guide agent (write tests first)
  • When writing code → MUST follow coding standards skill for the target language:
    • Python → python-patterns (PEP 8, type hints, Pythonic idioms)
    • C# → dotnet-skills:coding-standards (records, pattern matching, modern C#)
    • TS/JS → coding-standards (universal best practices)
  • After writing/modifying code → MUST use code-reviewer agent
  • Before git commit → MUST use security-reviewer agent
  • When build/test fails → MUST use build-error-resolver agent
  • After context compaction → read MEMORY.md to restore session state

Plan Completion Protocol

After completing any plan or major task:

  1. Test - Run pytest to confirm all tests pass
  2. Security review - Use security-reviewer agent on changed files
  3. Fix loop - If security review reports CRITICAL or HIGH issues:
    • Fix the issues
    • Re-run tests (back to step 1)
    • Re-run security review (back to step 2)
    • Repeat until no CRITICAL/HIGH issues remain
  4. Commit - Auto-commit with conventional commit message (feat:, fix:, refactor:, etc.). Stage only the files changed in this task, not unrelated files
  5. Save - Write a summary to MEMORY.md including: what was done, files changed, decisions made, remaining work
  6. Suggest clear - Tell the user: "Plan complete. Recommend /clear to free context for the next task."
  7. Do NOT start a new task in the same context - wait for user to /clear first

This keeps each plan in a fresh context window for maximum quality.
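
The control flow of steps 1 to 3 can be written down as a loop. The sketch below is illustrative only: `run_fix_loop` and its three callables are hypothetical placeholders for running pytest, invoking the security-reviewer agent, and applying fixes, and the `max_rounds` cap is an added assumption that the protocol itself does not specify.

```python
from typing import Callable

BLOCKING = {"CRITICAL", "HIGH"}


def run_fix_loop(
    run_tests: Callable[[], bool],
    security_review: Callable[[], list[str]],
    fix_issues: Callable[[list[str]], None],
    max_rounds: int = 5,  # assumption: a safety cap, not part of the protocol
) -> bool:
    """Return True once tests pass and no CRITICAL/HIGH findings remain."""
    for _ in range(max_rounds):
        if not run_tests():            # step 1
            return False
        severities = security_review() # step 2
        blocking = [s for s in severities if s in BLOCKING]
        if not blocking:
            return True                # ready for step 4 (commit)
        fix_issues(blocking)           # step 3, then loop back to step 1
    return False
```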

Known Issues

  • Pre-existing test failures: test_s3.py and test_azure.py fail because the boto3 and azure packages are not installed; these failures are safe to ignore
  • Always re-run deduplication and validation after the regex fallback adds new fields
  • PDF DPI must be 150 (not 300) for correct bbox alignment
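
The DPI rule is easier to see with the arithmetic written out. PDF coordinates are measured in points (1/72 inch), so a pixel bbox from the rendered page only maps back to the right place on the PDF when it is scaled with the same DPI the page was rendered at. The sketch below assumes bboxes are `(x0, y0, x1, y1)` pixel coordinates; `pixels_to_points` is a hypothetical helper, not a function from the project.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch
RENDER_DPI = 150          # must match the DPI used to render the page

BBox = tuple[float, float, float, float]


def pixels_to_points(bbox: BBox, dpi: int = RENDER_DPI) -> BBox:
    """Convert a pixel bbox on the rendered image to PDF points."""
    x0, y0, x1, y1 = (v * PDF_POINTS_PER_INCH / dpi for v in bbox)
    return (x0, y0, x1, y1)


pixel_box = (150, 300, 450, 600)
correct = pixels_to_points(pixel_box)         # scaled with the render DPI
wrong = pixels_to_points(pixel_box, dpi=300)  # scaled with a mismatched DPI
```

Here `correct` is `(72.0, 144.0, 216.0, 288.0)` while `wrong` is `(36.0, 72.0, 108.0, 144.0)`: the same pixel box lands on a region half the size and in a different place when the DPI assumption is off.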