
Claude Code Instructions - Invoice Master POC v2

Environment Requirements

Important: This project MUST run in a WSL + Conda environment.

| Requirement | Details |
| --- | --- |
| WSL | WSL 2 with Ubuntu 22.04+ |
| Conda | Miniconda or Anaconda |
| Python | 3.10+ (managed by Conda) |
| GPU | NVIDIA drivers on Windows + CUDA in WSL |

```bash
# Verify environment before running any commands
uname -a              # Should show "Linux"
conda --version       # Should show conda version
conda activate <env>  # Activate project environment
which python          # Should point to conda environment
```

All commands must be executed in WSL terminal with Conda environment activated.


Project Overview

Automated invoice field extraction system for Swedish PDF invoices:

  • YOLO Object Detection (YOLOv8/v11) for field region detection
  • PaddleOCR for text extraction
  • Multi-strategy matching for field validation

Stack: Python 3.10+ | PyTorch | Ultralytics | PaddleOCR | PyMuPDF

Target Fields: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount
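
As a sketch, the target fields could live in an Enum (per the constants guideline below); the FieldType name and members here are illustrative, not taken from the codebase:

```python
from enum import Enum


class FieldType(str, Enum):
    """Target invoice fields (illustrative; values mirror the list above)."""

    INVOICE_NUMBER = "InvoiceNumber"
    INVOICE_DATE = "InvoiceDate"
    INVOICE_DUE_DATE = "InvoiceDueDate"
    OCR = "OCR"  # Swedish OCR payment reference, not the OCR engine
    BANKGIRO = "Bankgiro"
    PLUSGIRO = "Plusgiro"
    AMOUNT = "Amount"
```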


Architecture Principles

SOLID

  • Single Responsibility: Each module handles one concern
  • Open/Closed: Extend via new strategies, not by modifying existing code
  • Liskov Substitution: Use Protocol/ABC for interchangeable components
  • Interface Segregation: Small, focused interfaces
  • Dependency Inversion: Depend on abstractions, inject dependencies (see the sketch below)
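
A minimal sketch of these principles together, assuming a hypothetical MatchStrategy protocol: new strategies extend behavior without touching existing code (Open/Closed), and FieldMatcher depends only on the abstraction, with strategies injected (Dependency Inversion):

```python
from __future__ import annotations

from typing import Protocol


class MatchStrategy(Protocol):
    """Interface for field-matching strategies (hypothetical)."""

    def match(self, field: str, candidates: list[str]) -> str | None: ...


class ExactMatch:
    """One concrete strategy; others can be added without edits elsewhere."""

    def match(self, field: str, candidates: list[str]) -> str | None:
        return field if field in candidates else None


class FieldMatcher:
    """Depends only on the MatchStrategy abstraction; strategies are injected."""

    def __init__(self, strategies: list[MatchStrategy]) -> None:
        self._strategies = strategies

    def match(self, field: str, candidates: list[str]) -> str | None:
        for strategy in self._strategies:  # try strategies in priority order
            if (result := strategy.match(field, candidates)) is not None:
                return result
        return None
```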

Project Structure

```
src/
├── cli/           # Entry points only, no business logic
├── pdf/           # PDF processing (extraction, rendering, detection)
├── ocr/           # OCR engines (PaddleOCR wrapper)
├── normalize/     # Field normalization and validation
├── matcher/       # Multi-strategy field matching
├── yolo/          # YOLO annotation and dataset building
├── inference/     # Inference pipeline
└── data/          # Data loading and reporting
```

Configuration

  • configs/default.yaml — All tunable parameters
  • config.py — Sensitive data (credentials; load from environment variables)
  • Never hardcode magic numbers
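
One way to surface configs/default.yaml as typed, immutable settings; a sketch assuming PyYAML and hypothetical dpi / confidence_threshold keys:

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import yaml


@dataclass(frozen=True, slots=True)
class AppConfig:
    """Typed view over configs/default.yaml (keys here are hypothetical)."""

    dpi: int
    confidence_threshold: float


def load_config(path: Path = Path("configs/default.yaml")) -> AppConfig:
    with path.open(encoding="utf-8") as fh:
        raw = yaml.safe_load(fh)
    # Assumes the file contains exactly the keys AppConfig declares.
    return AppConfig(**raw)
```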

Python Standards

Required

  • Type hints on all public functions (PEP 484/585)
  • Docstrings in Google style (PEP 257)
  • Dataclasses for data structures (frozen=True, slots=True when immutable)
  • Protocol for interfaces (PEP 544)
  • Enum for constants
  • pathlib.Path instead of string paths
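
A short example combining several of these rules (the Token dataclass and count_pages function are illustrative, assuming a recent PyMuPDF):

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import fitz  # PyMuPDF


@dataclass(frozen=True, slots=True)
class Token:
    """A positioned text token extracted from a PDF page."""

    text: str
    x0: float
    y0: float


def count_pages(pdf_path: Path) -> int:
    """Count the pages in a PDF.

    Args:
        pdf_path: Path to the PDF file.

    Returns:
        Number of pages in the document.
    """
    with fitz.open(pdf_path) as doc:
        return doc.page_count
```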

Naming Conventions

| Type | Convention | Example |
| --- | --- | --- |
| Functions/Variables | `snake_case` | `extract_tokens`, `page_count` |
| Classes | `PascalCase` | `FieldMatcher`, `AutoLabelReport` |
| Constants | `UPPER_SNAKE` | `DEFAULT_DPI`, `FIELD_TYPES` |
| Private | `_prefix` | `_parse_date`, `_cache` |

Import Order (isort)

  1. from __future__ import annotations
  2. Standard library
  3. Third-party
  4. Local modules
  5. if TYPE_CHECKING: block
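
As it might look at the top of a module in this repo (the local imports are illustrative):

```python
from __future__ import annotations   # 1. future import

import logging                       # 2. standard library
from typing import TYPE_CHECKING

import numpy as np                   # 3. third-party

from src.pdf import render           # 4. local modules (illustrative)

if TYPE_CHECKING:                    # 5. imports needed only for type hints
    from src.matcher import FieldMatcher
```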

Code Quality Tools

| Tool | Purpose | Config |
| --- | --- | --- |
| Black | Formatting | `line-length=100` |
| Ruff | Linting | E, F, W, I, N, D, UP, B, C4, SIM, ARG, PTH |
| MyPy | Type checking | `strict=true` |
| Pytest | Testing | `tests/` directory |

Error Handling

  • Use custom exception hierarchy (base: InvoiceMasterError)
  • Use logging instead of print (logger = logging.getLogger(__name__))
  • Implement graceful degradation with fallback strategies
  • Use context managers for resource cleanup
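
A sketch of the hierarchy and fallback pattern; only InvoiceMasterError is named above, while the OcrError subclass and the two engine stubs are hypothetical:

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


class InvoiceMasterError(Exception):
    """Base exception for the project."""


class OcrError(InvoiceMasterError):
    """OCR extraction failed (hypothetical subclass)."""


def _primary_ocr(image_path: str) -> str:   # stand-in for the real engine
    raise OcrError("engine unavailable")


def _fallback_ocr(image_path: str) -> str:  # stand-in fallback strategy
    return ""


def extract_text(image_path: str) -> str:
    """Try the primary OCR engine; degrade gracefully to the fallback."""
    try:
        return _primary_ocr(image_path)
    except OcrError:
        logger.warning("Primary OCR failed for %s; using fallback", image_path)
        return _fallback_ocr(image_path)
```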

Machine Learning Standards

Data Management

  • Immutable raw data: Never modify data/raw/
  • Version datasets: Track with checksum and metadata
  • Reproducible splits: Use fixed random seed (42)
  • Split ratios: 80% train / 10% val / 10% test
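
A reproducible 80/10/10 split might look like this sketch (seed 42 as above; the helper name is hypothetical):

```python
from __future__ import annotations

import random
from pathlib import Path


def split_dataset(
    files: list[Path], seed: int = 42
) -> tuple[list[Path], list[Path], list[Path]]:
    """Shuffle deterministically, then split 80% train / 10% val / 10% test."""
    rng = random.Random(seed)  # fixed seed => reproducible splits
    shuffled = sorted(files)   # sort first so input order is irrelevant
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (
        shuffled[:n_train],
        shuffled[n_train : n_train + n_val],
        shuffled[n_train + n_val :],
    )
```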

YOLO Training

  • Disable flips for text detection (fliplr=0.0, flipud=0.0)
  • Use early stopping (patience=20)
  • Enable AMP for faster training (amp=true)
  • Save checkpoints periodically (save_period=10)
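
With the Ultralytics API, those settings map onto train() arguments roughly as below; the checkpoint and dataset paths are placeholders:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder checkpoint
model.train(
    data="datasets/invoices/data.yaml",  # placeholder dataset config
    epochs=100,
    fliplr=0.0,   # no horizontal flips: mirrored text is never valid
    flipud=0.0,   # no vertical flips
    patience=20,  # early stopping
    amp=True,     # mixed precision for faster training
    save_period=10,  # periodic checkpoints
)
```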

Reproducibility

  • Set random seeds: random, numpy, torch
  • Enable deterministic mode: torch.backends.cudnn.deterministic = True
  • Track experiment config: model, epochs, batch_size, learning_rate, dataset_version, git_commit
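
A typical seed helper covering all three libraries, as a sketch:

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed every RNG the pipeline touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # trade speed for determinism
```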

Evaluation Metrics

  • Precision, Recall, F1 Score
  • mAP@0.5, mAP@0.5:0.95
  • Per-class AP

Testing Standards

Structure

```
tests/
├── unit/           # Isolated, fast tests
├── integration/    # Multi-module tests
├── e2e/            # End-to-end workflow tests
├── fixtures/       # Test data
└── conftest.py     # Shared fixtures
```

Practices

  • Follow AAA pattern: Arrange, Act, Assert
  • Use parametrized tests for multiple inputs
  • Use fixtures for shared setup
  • Use mocking for external dependencies
  • Mark slow tests with @pytest.mark.slow
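
A parametrized test following AAA; normalize_amount here is a hypothetical stand-in for a real normalizer:

```python
import pytest


def normalize_amount(raw: str) -> float:
    """Stand-in normalizer: Swedish "1 234,56" -> 1234.56."""
    return float(raw.replace("\xa0", "").replace(" ", "").replace(",", "."))


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("1 234,56", 1234.56),
        ("100,00", 100.0),
    ],
)
def test_normalize_amount(raw: str, expected: float) -> None:
    # Arrange: inputs come from the parametrize table
    # Act
    result = normalize_amount(raw)
    # Assert
    assert result == pytest.approx(expected)
```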

Performance

  • Parallel processing: Use ProcessPoolExecutor with progress tracking
  • Lazy loading: Use @cached_property for expensive resources
  • Generators: Use for large datasets to save memory
  • Batch processing: Process items in batches when possible
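
A sketch of parallel processing with progress tracking; process_pdf is a hypothetical worker, and progress comes from as_completed:

```python
from __future__ import annotations

import logging
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

logger = logging.getLogger(__name__)


def process_pdf(path: Path) -> int:
    """Hypothetical CPU-bound worker (here: just the file size in bytes)."""
    return len(path.read_bytes())


def process_all(paths: list[Path]) -> list[int]:
    results: list[int] = []
    # Under WSL (fork start method) this runs as-is; call it from a
    # __main__-guarded entry point to stay portable to spawn platforms.
    with ProcessPoolExecutor() as pool:
        futures = {pool.submit(process_pdf, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            results.append(future.result())
            logger.info("progress: %d/%d", done, len(futures))
    return results
```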

Security

  • Never commit: credentials, API keys, .env files
  • Use environment variables for sensitive config
  • Validate paths: Prevent path traversal attacks
  • Validate inputs: At system boundaries
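
Path validation might use Path.resolve() plus is_relative_to() (Python 3.9+), as in this sketch:

```python
from pathlib import Path


def safe_resolve(base_dir: Path, user_path: str) -> Path:
    """Resolve a user-supplied path and reject escapes from base_dir."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # blocks "../" traversal
        raise ValueError(f"Path escapes {base}: {user_path!r}")
    return candidate
```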

Commands

| Task | Command |
| --- | --- |
| Run autolabel | `python run_autolabel.py` |
| Train YOLO | `python -m src.cli.train --config configs/training.yaml` |
| Run inference | `python -m src.cli.infer --model models/best.pt` |
| Run tests | `pytest tests/ -v` |
| Coverage | `pytest tests/ --cov=src --cov-report=html` |
| Format | `black src/ tests/` |
| Lint | `ruff check src/ tests/ --fix` |
| Type check | `mypy src/` |

DO NOT

  • Hardcode file paths or magic numbers
  • Use print() for logging
  • Skip type hints on public APIs
  • Write functions longer than 50 lines
  • Mix business logic with I/O
  • Commit credentials or .env files
  • Use # type: ignore without explanation
  • Use mutable default arguments
  • Catch bare except:
  • Use flip augmentation for text detection

DO

  • Use type hints everywhere
  • Write descriptive docstrings
  • Log with appropriate levels
  • Use dataclasses for data structures
  • Use enums for constants
  • Use Protocol for interfaces
  • Set random seeds for reproducibility
  • Track experiment configurations
  • Use context managers for resources
  • Validate inputs at boundaries