# Invoice Master POC v2

Swedish Invoice Field Extraction System - YOLO + PaddleOCR extracts structured data from Swedish PDF invoices.

## Architecture

```
PDF → PyMuPDF (DPI=150) → YOLO Detection → PaddleOCR → Field Extraction → Normalization → Output
```

### Project Structure

```
packages/
├── backend/          # FastAPI web server + inference pipeline
│   └── pipeline/     # YOLO detector → OCR → field extractor → value selector → normalizers
├── shared/           # Common utilities (bbox, OCR, field mappings)
└── training/         # YOLO training data generation (annotation, dataset)
tests/                # Mirrors packages/ structure
```

### Pipeline Flow (`process_pdf`)

1. YOLO detects field regions on the rendered PDF page
2. PaddleOCR extracts text from the detected bboxes
3. Field extractor maps detections to invoice fields via `CLASS_TO_FIELD`
4. Value selector picks the best candidate per field (confidence + validation)
5. Normalizers clean values (dates, amounts, invoice numbers)
6. Fallback regex extraction runs if key fields are missing

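Steps 3 to 6 can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the `Detection` shape, the class IDs in `CLASS_TO_FIELD`, the `Fakturanummer` regex, and the amount normalizer are all assumptions made for the example. The real mapping lives in `packages/shared/shared/fields/mappings.py`.

```python
import re
from dataclasses import dataclass

# Hypothetical class IDs; the real CLASS_TO_FIELD is defined in the shared package.
CLASS_TO_FIELD: dict[int, str] = {0: "invoice_number", 1: "invoice_date", 2: "amount"}


@dataclass(frozen=True)
class Detection:
    """One YOLO detection plus the OCR text read from its bbox (assumed shape)."""

    class_id: int
    confidence: float
    text: str


def normalize_amount(raw: str) -> str:
    """Turn '1 234,50 kr' into '1234.50' (assumes spaces as thousands separators)."""
    digits = re.sub(r"[^\d,.-]", "", raw)
    return digits.replace(",", ".")


def normalize(fields: dict[str, str]) -> dict[str, str]:
    """Step 5: clean values. Idempotent, so it is safe to re-run after fallback."""
    cleaned = {name: value.strip() for name, value in fields.items()}
    if "amount" in cleaned:
        cleaned["amount"] = normalize_amount(cleaned["amount"])
    return cleaned


def select_best(detections: list[Detection], threshold: float = 0.5) -> dict[str, str]:
    """Steps 3-4: map classes to fields, keep the highest-confidence candidate."""
    best: dict[str, Detection] = {}
    for det in detections:
        field = CLASS_TO_FIELD.get(det.class_id)
        if field is None or det.confidence < threshold or not det.text.strip():
            continue
        if field not in best or det.confidence > best[field].confidence:
            best[field] = det
    return {field: det.text for field, det in best.items()}


def extract_fields(detections: list[Detection], full_text: str) -> dict[str, str]:
    fields = normalize(select_best(detections))
    # Step 6: regex fallback for a missing key field (illustrative pattern only).
    if "invoice_number" not in fields:
        match = re.search(r"Fakturanummer[:\s]+(\d+)", full_text)
        if match:
            fields["invoice_number"] = match.group(1)
            fields = normalize(fields)  # re-run cleaning after fallback adds a field
    return fields
```

Calling `extract_fields` with two competing `amount` detections keeps only the higher-confidence one, and a missing invoice number is recovered from the page text when the fallback pattern matches.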
## Tech Stack

| Component | Technology |
|-----------|------------|
| Object Detection | YOLO (Ultralytics >= 8.4.0) |
| OCR | PaddleOCR v5 (PP-OCRv5) |
| PDF | PyMuPDF (fitz), DPI=150 |
| Database | PostgreSQL + psycopg2 |
| Web | FastAPI + Uvicorn |
| ML | PyTorch + CUDA 12.x |

## WSL Environment (REQUIRED)

ALL Python commands MUST use this prefix:

```bash
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-sm120 && <command>"
```

NEVER run Python directly in Windows PowerShell/CMD.

## Project Rules

- Python 3.10, type hints on all function signatures
- No `print()` in production code - use `logging` module
- Validation with `pydantic` or `dataclasses`
- Error handling with `try/except` (not try/catch)
- Run tests: `pytest --cov=packages tests/`

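A small sketch of code that follows these rules: type hints on every signature, `logging` instead of `print()`, validation in a dataclass, and `try/except` for error handling. The `Candidate` class and `safe_candidate` helper are hypothetical names used only for illustration.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class Candidate:
    """A field value candidate; validated on construction."""

    field: str
    text: str
    confidence: float

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")


def safe_candidate(field: str, text: str, confidence: float) -> Optional[Candidate]:
    """Build a Candidate, logging and returning None when validation fails."""
    try:
        return Candidate(field=field, text=text, confidence=confidence)
    except ValueError:
        logger.warning("Rejected candidate for %s: confidence=%r", field, confidence)
        return None
```

Invalid input is logged and dropped rather than printed or allowed to crash the caller, which matches the rules above.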
## Key Files

| File | Purpose |
|------|---------|
| `packages/backend/backend/pipeline/pipeline.py` | Main inference pipeline |
| `packages/backend/backend/pipeline/field_extractor.py` | YOLO → field mapping |
| `packages/backend/backend/pipeline/value_selector.py` | Best candidate selection |
| `packages/shared/shared/fields/mappings.py` | CLASS_TO_FIELD mapping |
| `packages/shared/shared/ocr/paddle_ocr.py` | OCRToken definition |
| `packages/shared/shared/bbox/` | Bbox expansion strategies |

## Environment Variables

```bash
# Required
DB_PASSWORD=

# Optional (with defaults)
DB_HOST=192.168.68.31
DB_PORT=5432
DB_NAME=docmaster
DB_USER=docmaster
MODEL_PATH=runs/train/invoice_fields/weights/best.pt
CONFIDENCE_THRESHOLD=0.5
SERVER_HOST=0.0.0.0
SERVER_PORT=8000
```

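One way these variables might be read at startup is sketched below. The `Settings` dataclass and `load_settings` function are hypothetical names, not part of the project. The defaults mirror the values listed above, and a missing `DB_PASSWORD` fails fast.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Typed view of the environment; defaults mirror the documented values."""

    db_password: str
    db_host: str = "192.168.68.31"
    db_port: int = 5432
    db_name: str = "docmaster"
    db_user: str = "docmaster"
    model_path: str = "runs/train/invoice_fields/weights/best.pt"
    confidence_threshold: float = 0.5
    server_host: str = "0.0.0.0"
    server_port: int = 8000


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, raising if the required password is absent."""
    password = env.get("DB_PASSWORD", "")
    if not password:
        raise RuntimeError("DB_PASSWORD is required")
    d = Settings(db_password=password)  # carries the defaults
    return Settings(
        db_password=password,
        db_host=env.get("DB_HOST", d.db_host),
        db_port=int(env.get("DB_PORT", str(d.db_port))),
        db_name=env.get("DB_NAME", d.db_name),
        db_user=env.get("DB_USER", d.db_user),
        model_path=env.get("MODEL_PATH", d.model_path),
        confidence_threshold=float(
            env.get("CONFIDENCE_THRESHOLD", str(d.confidence_threshold))
        ),
        server_host=env.get("SERVER_HOST", d.server_host),
        server_port=int(env.get("SERVER_PORT", str(d.server_port))),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test, for example `load_settings({"DB_PASSWORD": "secret", "DB_PORT": "6543"})`.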
## Auto-trigger Rules (ALWAYS FOLLOW - even after context compaction)

These rules MUST be followed regardless of conversation history:

- New feature or bug fix → MUST use **tdd-guide** agent (write tests first)
- When writing code → MUST follow coding standards skill for the target language:
  - Python → `python-patterns` (PEP 8, type hints, Pythonic idioms)
  - C# → `dotnet-skills:coding-standards` (records, pattern matching, modern C#)
  - TS/JS → `coding-standards` (universal best practices)
- After writing/modifying code → MUST use **code-reviewer** agent
- Before git commit → MUST use **security-reviewer** agent
- When build/test fails → MUST use **build-error-resolver** agent
- After context compaction → read MEMORY.md to restore session state

## Plan Completion Protocol

After completing any plan or major task:

1. **Test** - Run `pytest` to confirm all tests pass
2. **Security review** - Use **security-reviewer** agent on changed files
3. **Fix loop** - If security review reports CRITICAL or HIGH issues:
   - Fix the issues
   - Re-run tests (back to step 1)
   - Re-run security review (back to step 2)
   - Repeat until no CRITICAL/HIGH issues remain
4. **Commit** - Auto-commit with conventional commit message (`feat:`, `fix:`, `refactor:`, etc.). Stage only the files changed in this task, not unrelated files
5. **Save** - Write a summary to MEMORY.md including: what was done, files changed, decisions made, remaining work
6. **Suggest clear** - Tell the user: "Plan complete. Recommend `/clear` to free context for the next task."
7. **Do NOT start a new task** in the same context - wait for user to `/clear` first

This keeps each plan in a fresh context window for maximum quality.

## Known Issues

- Pre-existing test failures: `test_s3.py`, `test_azure.py` (missing boto3/azure) - safe to ignore
- Always re-run dedup/validation after fallback adds new fields
- PDF DPI must be 150 (not 300) for correct bbox alignment
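
The DPI constraint can be illustrated with a coordinate conversion. PDF coordinates are measured in points (1/72 inch), so a bbox detected on a rendered image only maps back to the page correctly when the conversion uses the same DPI as the render. The sketch below assumes bboxes are `(x0, y0, x1, y1)` pixel tuples; `pixels_to_points` is a hypothetical helper, not a function from this project.

```python
PDF_POINTS_PER_INCH = 72.0
RENDER_DPI = 150  # must match the DPI used when rasterizing the page

BBox = tuple[float, float, float, float]


def pixels_to_points(bbox: BBox, dpi: int = RENDER_DPI) -> BBox:
    """Convert a pixel-space bbox from a page rendered at `dpi` to PDF points."""
    x0, y0, x1, y1 = (v * PDF_POINTS_PER_INCH / dpi for v in bbox)
    return (x0, y0, x1, y1)


# A box detected on a 150-DPI render maps to the right place on the page...
correct = pixels_to_points((150, 300, 450, 600), dpi=150)
# ...but treating the same pixels as 300-DPI output halves every coordinate.
misaligned = pixels_to_points((150, 300, 450, 600), dpi=300)
```

Here `correct` is `(72.0, 144.0, 216.0, 288.0)` while `misaligned` is `(36.0, 72.0, 108.0, 144.0)`, which shows the kind of offset a DPI mismatch would produce.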