2026-01-25 16:17:23 +01:00

Invoice Master POC v2

Swedish invoice field extraction system built on YOLOv11 and PaddleOCR. It extracts structured data from Swedish PDF invoices.

Tech Stack

Component         Technology
Object Detection  YOLOv11 (Ultralytics)
OCR Engine        PaddleOCR v5 (PP-OCRv5)
PDF Processing    PyMuPDF (fitz)
Database          PostgreSQL + psycopg2
Web Framework     FastAPI + Uvicorn
Deep Learning     PyTorch + CUDA 12.x

WSL Environment (REQUIRED)

Prefix ALL commands with:

wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-py311 && <command>"

NEVER run Python commands directly in Windows PowerShell/CMD.
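
When commands are launched from a Windows-side script, the prefix can be assembled programmatically. The following is a minimal sketch, not part of the project: the helper names `wsl_command` and `run_in_wsl` are hypothetical, and the conda paths are copied from the prefix above.

```python
import subprocess

# Values taken from the required prefix above.
CONDA_INIT = "source ~/miniconda3/etc/profile.d/conda.sh"
CONDA_ENV = "invoice-py311"


def wsl_command(command: str) -> list[str]:
    """Wrap a shell command so it runs in WSL with the conda env active."""
    script = f"{CONDA_INIT} && conda activate {CONDA_ENV} && {command}"
    return ["wsl", "bash", "-c", script]


def run_in_wsl(command: str) -> int:
    """Run the wrapped command and return its exit code (requires WSL)."""
    return subprocess.run(wsl_command(command), check=False).returncode
```

Passing the arguments as a list avoids a second layer of quoting on the Windows side; for example, `run_in_wsl("pytest --cov=src")` would run the test suite inside the environment.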

Project-Specific Rules

  • Python 3.11+ with type hints
  • No print() in production code; use the logging module instead
  • Run tests: pytest --cov=src
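
The first two rules can be illustrated together. This is a minimal sketch using only the standard library; the function name `summarize` and the log format are illustrative choices, not project conventions.

```python
import logging

# One logger per module, named after the module.
logger = logging.getLogger(__name__)


def summarize(path: str, pages: int) -> str:
    """Build a summary string and log it instead of printing it."""
    summary = f"{path}: {pages} page(s)"
    # Lazy %-style arguments: the message is only formatted if it is emitted.
    logger.info("Processed %s", summary)
    return summary


if __name__ == "__main__":
    # Configure logging once, at the entry point, not inside library code.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    summarize("invoice.pdf", 2)
```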

File Structure

src/
├── cli/              # autolabel, train, infer, serve
├── pdf/              # extractor, renderer, detector
├── ocr/              # PaddleOCR wrapper, machine_code_parser
├── inference/        # pipeline, yolo_detector, field_extractor
├── normalize/        # Per-field normalizers
├── matcher/          # Exact, substring, fuzzy strategies
├── processing/       # CPU/GPU pool architecture
├── web/              # FastAPI app, routes, services, schemas
├── utils/            # validators, text_cleaner, fuzzy_matcher
└── data/             # Database operations
tests/                # Mirror of src structure
runs/train/           # Training outputs
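
To make the matcher/ entry concrete, the three strategies can be tried in order of strictness. The sketch below is an assumption about how such a cascade might look, not the project's implementation: the function name `match_value`, the 0.8 threshold, and the use of `difflib` for fuzzy matching are all illustrative.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def match_value(
    expected: str, candidate: str, fuzzy_threshold: float = 0.8
) -> str | None:
    """Return the first strategy that matches, or None if none does."""
    a = expected.strip().lower()
    b = candidate.strip().lower()
    if not a or not b:
        return None
    if a == b:
        return "exact"
    if a in b:
        return "substring"
    # Similarity ratio in [0, 1]; the threshold is a hypothetical default.
    if SequenceMatcher(None, a, b).ratio() >= fuzzy_threshold:
        return "fuzzy"
    return None
```

Ordering the checks from strictest to loosest means the returned label also says how much the match can be trusted.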

Supported Fields

ID  Field                         Description
0   invoice_number                Invoice number
1   invoice_date                  Invoice date
2   invoice_due_date              Due date
3   ocr_number                    OCR reference (Swedish payment)
4   bankgiro                      Bankgiro account
5   plusgiro                      Plusgiro account
6   amount                        Amount
7   supplier_organisation_number  Supplier org number
8   payment_line                  Payment line (machine-readable)
9   customer_number               Customer number
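
In code, the table above amounts to a lookup between numeric IDs and field names. The sketch below assumes the IDs are the detector's class indices; the names `FIELD_NAMES`, `FIELD_IDS`, and `field_name` are hypothetical and may not match the project's own constants.

```python
# ID -> field name, copied from the table above.
FIELD_NAMES: dict[int, str] = {
    0: "invoice_number",
    1: "invoice_date",
    2: "invoice_due_date",
    3: "ocr_number",
    4: "bankgiro",
    5: "plusgiro",
    6: "amount",
    7: "supplier_organisation_number",
    8: "payment_line",
    9: "customer_number",
}

# Reverse lookup: field name -> ID.
FIELD_IDS: dict[str, int] = {name: idx for idx, name in FIELD_NAMES.items()}


def field_name(class_id: int) -> str:
    """Translate a numeric ID into its field name, rejecting unknown IDs."""
    try:
        return FIELD_NAMES[class_id]
    except KeyError:
        raise ValueError(f"Unknown field ID: {class_id}") from None
```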

Key Patterns

Inference Result

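from dataclasses import dataclass

# CrossValidationResult is defined elsewhere in the project.


@dataclass
class InferenceResult:
    document_id: str
    document_type: str  # "invoice" or "letter"
    fields: dict[str, str]  # field name -> extracted value
    confidence: dict[str, float]  # field name -> confidence score
    cross_validation: CrossValidationResult | None
    processing_time_ms: float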
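
Because `fields` and `confidence` share the same keys, a consumer can filter extracted values by score. The following is a sketch of one possible consumer, not the pipeline's own logic: the helper name `confident_fields` is hypothetical, and the 0.5 default simply mirrors the documented CONFIDENCE_THRESHOLD default.

```python
def confident_fields(
    fields: dict[str, str],
    confidence: dict[str, float],
    threshold: float = 0.5,
) -> dict[str, str]:
    """Keep only the fields whose confidence meets the threshold."""
    return {
        name: value
        for name, value in fields.items()
        # A field without a score is treated as having zero confidence.
        if confidence.get(name, 0.0) >= threshold
    }
```

Given an `InferenceResult` named `result`, the call would be `confident_fields(result.fields, result.confidence)`.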

API Schemas

See src/web/schemas.py for request/response models.

Environment Variables

# Required
DB_PASSWORD=

# Optional (with defaults)
DB_HOST=192.168.68.31
DB_PORT=5432
DB_NAME=docmaster
DB_USER=docmaster
MODEL_PATH=runs/train/invoice_fields/weights/best.pt
CONFIDENCE_THRESHOLD=0.5
SERVER_HOST=0.0.0.0
SERVER_PORT=8000
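
A loader for these variables might look like the sketch below. It is an illustration under assumptions, not the project's configuration code: the `Settings` class and `load_settings` function are hypothetical names, and only the variable names and defaults come from the list above.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_password: str  # required, no default
    db_host: str = "192.168.68.31"
    db_port: int = 5432
    db_name: str = "docmaster"
    db_user: str = "docmaster"
    model_path: str = "runs/train/invoice_fields/weights/best.pt"
    confidence_threshold: float = 0.5
    server_host: str = "0.0.0.0"
    server_port: int = 8000


def load_settings(env: Mapping[str, str] | None = None) -> Settings:
    """Read settings from the environment, failing fast if DB_PASSWORD is unset."""
    env = os.environ if env is None else env
    password = env.get("DB_PASSWORD", "")
    if not password:
        raise RuntimeError("DB_PASSWORD is required but not set")
    d = Settings(db_password=password)  # source of default values
    return Settings(
        db_password=password,
        db_host=env.get("DB_HOST", d.db_host),
        db_port=int(env.get("DB_PORT", d.db_port)),
        db_name=env.get("DB_NAME", d.db_name),
        db_user=env.get("DB_USER", d.db_user),
        model_path=env.get("MODEL_PATH", d.model_path),
        confidence_threshold=float(
            env.get("CONFIDENCE_THRESHOLD", d.confidence_threshold)
        ),
        server_host=env.get("SERVER_HOST", d.server_host),
        server_port=int(env.get("SERVER_PORT", d.server_port)),
    )
```

Accepting an optional mapping keeps the loader testable without touching the real process environment.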

CLI Commands

# Auto-labeling
python -m src.cli.autolabel --dual-pool --cpu-workers 3 --gpu-workers 1

# Training
python -m src.cli.train --model yolo11n.pt --epochs 100 --batch 16 --name invoice_fields

# Inference
python -m src.cli.infer --model runs/train/invoice_fields/weights/best.pt --input invoice.pdf --gpu

# Web Server
python run_server.py --port 8000

API Endpoints

Method  Endpoint                    Description
GET     /                           Web UI
GET     /api/v1/health              Health check
POST    /api/v1/infer               Process invoice
GET     /api/v1/results/{filename}  Get visualization
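
A quick way to confirm the server is reachable is to call the health endpoint. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes the endpoint returns a JSON body, which is not confirmed here (see src/web/schemas.py for the actual response models).

```python
import json
from urllib.request import urlopen


def endpoint_url(path: str, host: str = "localhost", port: int = 8000) -> str:
    """Build a full URL for one of the endpoints listed above."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def check_health(
    host: str = "localhost", port: int = 8000, timeout: float = 5.0
) -> dict:
    """GET /api/v1/health and decode the body (assumed to be JSON)."""
    url = endpoint_url("/api/v1/health", host, port)
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `check_health()` should return the decoded response; a connection failure raises `urllib.error.URLError`.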

Current Status

  • Tests: 688 passing
  • Coverage: 37%
  • Model: 93.5% mAP@0.5
  • Documents Labeled: 9,738

Quick Start

# Start server
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-py311 && cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && python run_server.py"

# Run tests
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-py311 && cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && pytest"

# Access UI: http://localhost:8000