
Claude Code Instructions - Invoice Master POC v2

Environment Requirements

Important: This project MUST run in a WSL + Conda environment.

| Requirement | Details |
| --- | --- |
| WSL | WSL 2 with Ubuntu 22.04+ |
| Conda | Miniconda or Anaconda |
| Python | 3.10+ (managed by Conda) |
| GPU | NVIDIA drivers on Windows + CUDA in WSL |

```bash
# Verify environment before running any commands
uname -a              # Should show "Linux"
conda --version       # Should show conda version
conda activate <env>  # Activate project environment
which python          # Should point to conda environment
```

All commands must be executed in WSL terminal with Conda environment activated.


Project Overview

Automated invoice field extraction system for Swedish PDF invoices:

  • YOLO Object Detection (YOLOv8/v11) for field region detection
  • PaddleOCR for text extraction
  • Multi-strategy matching for field validation

Stack: Python 3.10+ | PyTorch | Ultralytics | PaddleOCR | PyMuPDF

Target Fields: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount
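
As a sketch, the target fields could live in an Enum (per the constants guideline below); the FieldType name and members here are illustrative, not taken from the codebase:

```python
from enum import Enum


class FieldType(str, Enum):
    """Target invoice fields (illustrative; values mirror the list above)."""

    INVOICE_NUMBER = "InvoiceNumber"
    INVOICE_DATE = "InvoiceDate"
    INVOICE_DUE_DATE = "InvoiceDueDate"
    OCR = "OCR"  # Swedish OCR payment reference, not the OCR engine
    BANKGIRO = "Bankgiro"
    PLUSGIRO = "Plusgiro"
    AMOUNT = "Amount"
```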


Architecture Principles

SOLID

  • Single Responsibility: Each module handles one concern
  • Open/Closed: Extend via new strategies, not by modifying existing code
  • Liskov Substitution: Use Protocol/ABC for interchangeable components
  • Interface Segregation: Small, focused interfaces
  • Dependency Inversion: Depend on abstractions, inject dependencies (see the sketch below)
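
A minimal sketch of these principles together, assuming a hypothetical MatchStrategy protocol: new strategies extend behavior without touching existing code (Open/Closed), and FieldMatcher depends only on the abstraction, with strategies injected (Dependency Inversion):

```python
from __future__ import annotations

from typing import Protocol


class MatchStrategy(Protocol):
    """Interface for field-matching strategies (hypothetical)."""

    def match(self, field: str, candidates: list[str]) -> str | None: ...


class ExactMatch:
    """One concrete strategy; others can be added without edits elsewhere."""

    def match(self, field: str, candidates: list[str]) -> str | None:
        return field if field in candidates else None


class FieldMatcher:
    """Depends only on the MatchStrategy abstraction; strategies are injected."""

    def __init__(self, strategies: list[MatchStrategy]) -> None:
        self._strategies = strategies

    def match(self, field: str, candidates: list[str]) -> str | None:
        for strategy in self._strategies:  # try strategies in priority order
            if (result := strategy.match(field, candidates)) is not None:
                return result
        return None
```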

Project Structure

```
src/
├── cli/           # Entry points only, no business logic
├── pdf/           # PDF processing (extraction, rendering, detection)
├── ocr/           # OCR engines (PaddleOCR wrapper)
├── normalize/     # Field normalization and validation
├── matcher/       # Multi-strategy field matching
├── yolo/          # YOLO annotation and dataset building
├── inference/     # Inference pipeline
└── data/          # Data loading and reporting
```

Configuration

  • configs/default.yaml — All tunable parameters
  • config.py — Sensitive data (credentials; load from environment variables)
  • Never hardcode magic numbers
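
One way to surface configs/default.yaml as typed, immutable settings; a sketch assuming PyYAML and hypothetical dpi / confidence_threshold keys:

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import yaml


@dataclass(frozen=True, slots=True)
class AppConfig:
    """Typed view over configs/default.yaml (keys here are hypothetical)."""

    dpi: int
    confidence_threshold: float


def load_config(path: Path = Path("configs/default.yaml")) -> AppConfig:
    with path.open(encoding="utf-8") as fh:
        raw = yaml.safe_load(fh)
    # Assumes the file contains exactly the keys AppConfig declares.
    return AppConfig(**raw)
```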

Python Standards

Required

  • Type hints on all public functions (PEP 484/585)
  • Docstrings in Google style (PEP 257)
  • Dataclasses for data structures (frozen=True, slots=True when immutable)
  • Protocol for interfaces (PEP 544)
  • Enum for constants
  • pathlib.Path instead of string paths
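
A short example combining several of these rules (the Token dataclass and count_pages function are illustrative, assuming a recent PyMuPDF):

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import fitz  # PyMuPDF


@dataclass(frozen=True, slots=True)
class Token:
    """A positioned text token extracted from a PDF page."""

    text: str
    x0: float
    y0: float


def count_pages(pdf_path: Path) -> int:
    """Count the pages in a PDF.

    Args:
        pdf_path: Path to the PDF file.

    Returns:
        Number of pages in the document.
    """
    with fitz.open(pdf_path) as doc:
        return doc.page_count
```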

Naming Conventions

| Type | Convention | Example |
| --- | --- | --- |
| Functions/Variables | `snake_case` | `extract_tokens`, `page_count` |
| Classes | `PascalCase` | `FieldMatcher`, `AutoLabelReport` |
| Constants | `UPPER_SNAKE` | `DEFAULT_DPI`, `FIELD_TYPES` |
| Private | `_prefix` | `_parse_date`, `_cache` |

Import Order (isort)

  1. from __future__ import annotations
  2. Standard library
  3. Third-party
  4. Local modules
  5. if TYPE_CHECKING: block
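
As it might look at the top of a module in this repo (the local imports are illustrative):

```python
from __future__ import annotations   # 1. future import

import logging                       # 2. standard library
from typing import TYPE_CHECKING

import numpy as np                   # 3. third-party

from src.pdf import render           # 4. local modules (illustrative)

if TYPE_CHECKING:                    # 5. imports needed only for type hints
    from src.matcher import FieldMatcher
```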

Code Quality Tools

| Tool | Purpose | Config |
| --- | --- | --- |
| Black | Formatting | `line-length=100` |
| Ruff | Linting | E, F, W, I, N, D, UP, B, C4, SIM, ARG, PTH |
| MyPy | Type checking | `strict=true` |
| Pytest | Testing | `tests/` directory |

Error Handling

  • Use custom exception hierarchy (base: InvoiceMasterError)
  • Use logging instead of print (logger = logging.getLogger(__name__))
  • Implement graceful degradation with fallback strategies
  • Use context managers for resource cleanup
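
A sketch of the hierarchy and fallback pattern; only InvoiceMasterError is named above, while the OcrError subclass and the two engine stubs are hypothetical:

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


class InvoiceMasterError(Exception):
    """Base exception for the project."""


class OcrError(InvoiceMasterError):
    """OCR extraction failed (hypothetical subclass)."""


def _primary_ocr(image_path: str) -> str:   # stand-in for the real engine
    raise OcrError("engine unavailable")


def _fallback_ocr(image_path: str) -> str:  # stand-in fallback strategy
    return ""


def extract_text(image_path: str) -> str:
    """Try the primary OCR engine; degrade gracefully to the fallback."""
    try:
        return _primary_ocr(image_path)
    except OcrError:
        logger.warning("Primary OCR failed for %s; using fallback", image_path)
        return _fallback_ocr(image_path)
```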

Machine Learning Standards

Data Management

  • Immutable raw data: Never modify data/raw/
  • Version datasets: Track with checksum and metadata
  • Reproducible splits: Use fixed random seed (42)
  • Split ratios: 80% train / 10% val / 10% test
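
A reproducible 80/10/10 split might look like this sketch (seed 42 as above; the helper name is hypothetical):

```python
from __future__ import annotations

import random
from pathlib import Path


def split_dataset(
    files: list[Path], seed: int = 42
) -> tuple[list[Path], list[Path], list[Path]]:
    """Shuffle deterministically, then split 80% train / 10% val / 10% test."""
    rng = random.Random(seed)  # fixed seed => reproducible splits
    shuffled = sorted(files)   # sort first so input order is irrelevant
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (
        shuffled[:n_train],
        shuffled[n_train : n_train + n_val],
        shuffled[n_train + n_val :],
    )
```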

YOLO Training

  • Disable flips for text detection (fliplr=0.0, flipud=0.0)
  • Use early stopping (patience=20)
  • Enable AMP for faster training (amp=true)
  • Save checkpoints periodically (save_period=10)
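
With the Ultralytics API, those settings map onto train() arguments roughly as below; the checkpoint and dataset paths are placeholders:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder checkpoint
model.train(
    data="datasets/invoices/data.yaml",  # placeholder dataset config
    epochs=100,
    fliplr=0.0,   # no horizontal flips: mirrored text is never valid
    flipud=0.0,   # no vertical flips
    patience=20,  # early stopping
    amp=True,     # mixed precision for faster training
    save_period=10,  # periodic checkpoints
)
```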

Reproducibility

  • Set random seeds: random, numpy, torch
  • Enable deterministic mode: torch.backends.cudnn.deterministic = True
  • Track experiment config: model, epochs, batch_size, learning_rate, dataset_version, git_commit
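
A typical seed helper covering all three libraries, as a sketch:

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed every RNG the pipeline touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # trade speed for determinism
```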

Evaluation Metrics

  • Precision, Recall, F1 Score
  • mAP@0.5, mAP@0.5:0.95
  • Per-class AP

Testing Standards

Structure

```
tests/
├── unit/           # Isolated, fast tests
├── integration/    # Multi-module tests
├── e2e/            # End-to-end workflow tests
├── fixtures/       # Test data
└── conftest.py     # Shared fixtures
```

Practices

  • Follow AAA pattern: Arrange, Act, Assert
  • Use parametrized tests for multiple inputs
  • Use fixtures for shared setup
  • Use mocking for external dependencies
  • Mark slow tests with @pytest.mark.slow
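
A parametrized test following AAA; normalize_amount here is a hypothetical stand-in for a real normalizer:

```python
import pytest


def normalize_amount(raw: str) -> float:
    """Stand-in normalizer: Swedish "1 234,56" -> 1234.56."""
    return float(raw.replace("\xa0", "").replace(" ", "").replace(",", "."))


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("1 234,56", 1234.56),
        ("100,00", 100.0),
    ],
)
def test_normalize_amount(raw: str, expected: float) -> None:
    # Arrange: inputs come from the parametrize table
    # Act
    result = normalize_amount(raw)
    # Assert
    assert result == pytest.approx(expected)
```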

Performance

  • Parallel processing: Use ProcessPoolExecutor with progress tracking
  • Lazy loading: Use @cached_property for expensive resources
  • Generators: Use for large datasets to save memory
  • Batch processing: Process items in batches when possible
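
A sketch of parallel processing with progress tracking; process_pdf is a hypothetical worker, and progress comes from as_completed:

```python
from __future__ import annotations

import logging
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

logger = logging.getLogger(__name__)


def process_pdf(path: Path) -> int:
    """Hypothetical CPU-bound worker (here: just the file size in bytes)."""
    return len(path.read_bytes())


def process_all(paths: list[Path]) -> list[int]:
    results: list[int] = []
    # Under WSL (fork start method) this runs as-is; call it from a
    # __main__-guarded entry point to stay portable to spawn platforms.
    with ProcessPoolExecutor() as pool:
        futures = {pool.submit(process_pdf, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            results.append(future.result())
            logger.info("progress: %d/%d", done, len(futures))
    return results
```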

Security

  • Never commit: credentials, API keys, .env files
  • Use environment variables for sensitive config
  • Validate paths: Prevent path traversal attacks
  • Validate inputs: At system boundaries
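
Path validation might use Path.resolve() plus is_relative_to() (Python 3.9+), as in this sketch:

```python
from pathlib import Path


def safe_resolve(base_dir: Path, user_path: str) -> Path:
    """Resolve a user-supplied path and reject escapes from base_dir."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # blocks "../" traversal
        raise ValueError(f"Path escapes {base}: {user_path!r}")
    return candidate
```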

Commands

| Task | Command |
| --- | --- |
| Run autolabel | `python run_autolabel.py` |
| Train YOLO | `python -m src.cli.train --config configs/training.yaml` |
| Run inference | `python -m src.cli.infer --model models/best.pt` |
| Run tests | `pytest tests/ -v` |
| Coverage | `pytest tests/ --cov=src --cov-report=html` |
| Format | `black src/ tests/` |
| Lint | `ruff check src/ tests/ --fix` |
| Type check | `mypy src/` |

DO NOT

  • Hardcode file paths or magic numbers
  • Use print() for logging
  • Skip type hints on public APIs
  • Write functions longer than 50 lines
  • Mix business logic with I/O
  • Commit credentials or .env files
  • Use # type: ignore without explanation
  • Use mutable default arguments
  • Catch bare except:
  • Use flip augmentation for text detection

DO

  • Use type hints everywhere
  • Write descriptive docstrings
  • Log with appropriate levels
  • Use dataclasses for data structures
  • Use enums for constants
  • Use Protocol for interfaces
  • Set random seeds for reproducibility
  • Track experiment configurations
  • Use context managers for resources
  • Validate inputs at boundaries