# Claude Code Instructions - Invoice Master POC v2

## Environment Requirements

> **IMPORTANT**: This project MUST run in a **WSL + Conda** environment.

| Requirement | Details |
|-------------|---------|
| **WSL** | WSL 2 with Ubuntu 22.04+ |
| **Conda** | Miniconda or Anaconda |
| **Python** | 3.10+ (managed by Conda) |
| **GPU** | NVIDIA drivers on Windows + CUDA in WSL |

```bash
# Verify environment before running any commands
uname -a          # Should show "Linux"
conda --version   # Should show conda version
conda activate    # Activate project environment
which python      # Should point to conda environment
```

**All commands must be executed in a WSL terminal with the Conda environment activated.**

---

## Project Overview

**Automated invoice field extraction system** for Swedish PDF invoices:

- **YOLO Object Detection** (YOLOv8/v11) for field region detection
- **PaddleOCR** for text extraction
- **Multi-strategy matching** for field validation

**Stack**: Python 3.10+ | PyTorch | Ultralytics | PaddleOCR | PyMuPDF

**Target Fields**: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount

---

## Architecture Principles

### SOLID

- **Single Responsibility**: Each module handles one concern
- **Open/Closed**: Extend via new strategies, not by modifying existing code
- **Liskov Substitution**: Use Protocol/ABC for interchangeable components
- **Interface Segregation**: Small, focused interfaces
- **Dependency Inversion**: Depend on abstractions, inject dependencies

### Project Structure

```
src/
├── cli/          # Entry points only, no business logic
├── pdf/          # PDF processing (extraction, rendering, detection)
├── ocr/          # OCR engines (PaddleOCR wrapper)
├── normalize/    # Field normalization and validation
├── matcher/      # Multi-strategy field matching
├── yolo/         # YOLO annotation and dataset building
├── inference/    # Inference pipeline
└── data/         # Data loading and reporting
```

### Configuration

- `configs/default.yaml` — All tunable parameters
- `config.py` — Sensitive data (credentials, use environment variables)
- Never hardcode magic numbers

---

## Python Standards

### Required

- **Type hints** on all public functions (PEP 484/585)
- **Docstrings** in Google style (PEP 257)
- **Dataclasses** for data structures (`frozen=True, slots=True` when immutable)
- **Protocol** for interfaces (PEP 544)
- **Enum** for constants
- **pathlib.Path** instead of string paths

### Naming Conventions

| Type | Convention | Example |
|------|------------|---------|
| Functions/Variables | snake_case | `extract_tokens`, `page_count` |
| Classes | PascalCase | `FieldMatcher`, `AutoLabelReport` |
| Constants | UPPER_SNAKE | `DEFAULT_DPI`, `FIELD_TYPES` |
| Private | _prefix | `_parse_date`, `_cache` |

### Import Order (isort)

1. `from __future__ import annotations`
2. Standard library
3. Third-party
4. Local modules
5. `if TYPE_CHECKING:` block
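### Putting It Together (illustrative sketch)

The snippet below is a minimal sketch of the conventions above: the isort import order, a `TYPE_CHECKING` block, a frozen/slotted dataclass, an `Enum`, a `Protocol` interface, Google-style docstrings, and `pathlib.Path`. The names (`FieldType`, `FieldCandidate`, `Matcher`, `load_candidates`) are hypothetical examples, not existing project modules.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import TYPE_CHECKING, Protocol

if TYPE_CHECKING:
    from collections.abc import Sequence

logger = logging.getLogger(__name__)


class FieldType(str, Enum):
    """Supported invoice field types (subset shown)."""

    INVOICE_NUMBER = "InvoiceNumber"
    AMOUNT = "Amount"


@dataclass(frozen=True, slots=True)
class FieldCandidate:
    """A candidate value for an invoice field.

    Attributes:
        field_type: The field this candidate belongs to.
        text: Raw text extracted from the document.
        confidence: Detection confidence in [0, 1].
    """

    field_type: FieldType
    text: str
    confidence: float


class Matcher(Protocol):
    """Interface for interchangeable matching strategies."""

    def match(self, candidates: Sequence[FieldCandidate]) -> FieldCandidate | None:
        """Return the best candidate, or None if nothing matches."""
        ...


def load_candidates(path: Path) -> list[FieldCandidate]:
    """Load field candidates from a report file.

    Args:
        path: Location of the report as a pathlib.Path, not a string.

    Returns:
        Parsed candidates, or an empty list if the file is missing.
    """
    if not path.exists():
        logger.warning("Report not found: %s", path)
        return []
    # Parsing is omitted in this sketch.
    return []
```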
### Code Quality Tools

| Tool | Purpose | Config |
|------|---------|--------|
| Black | Formatting | line-length=100 |
| Ruff | Linting | E, F, W, I, N, D, UP, B, C4, SIM, ARG, PTH |
| MyPy | Type checking | strict=true |
| Pytest | Testing | tests/ directory |

---

## Error Handling

- Use a **custom exception hierarchy** (base: `InvoiceMasterError`)
- Use **logging** instead of print (`logger = logging.getLogger(__name__)`)
- Implement **graceful degradation** with fallback strategies
- Use **context managers** for resource cleanup

---

## Machine Learning Standards

### Data Management

- **Immutable raw data**: Never modify `data/raw/`
- **Version datasets**: Track with checksum and metadata
- **Reproducible splits**: Use a fixed random seed (42)
- **Split ratios**: 80% train / 10% val / 10% test

### YOLO Training

- **Disable flips** for text detection (`fliplr=0.0, flipud=0.0`)
- **Use early stopping** (`patience=20`)
- **Enable AMP** for faster training (`amp=true`)
- **Save checkpoints** periodically (`save_period=10`)

### Reproducibility

- Set random seeds: `random`, `numpy`, `torch`
- Enable deterministic mode: `torch.backends.cudnn.deterministic = True`
- Track experiment config: model, epochs, batch_size, learning_rate, dataset_version, git_commit

### Evaluation Metrics

- Precision, Recall, F1 Score
- mAP@0.5, mAP@0.5:0.95
- Per-class AP

---

## Testing Standards

### Structure

```
tests/
├── unit/         # Isolated, fast tests
├── integration/  # Multi-module tests
├── e2e/          # End-to-end workflow tests
├── fixtures/     # Test data
└── conftest.py   # Shared fixtures
```

### Practices

- Follow the **AAA pattern**: Arrange, Act, Assert
- Use **parametrized tests** for multiple inputs
- Use **fixtures** for shared setup
- Use **mocking** for external dependencies
- Mark slow tests with `@pytest.mark.slow`

---

## Performance

- **Parallel processing**: Use `ProcessPoolExecutor` with progress tracking
- **Lazy loading**: Use `@cached_property` for expensive resources
- **Generators**: Use for large datasets to save memory
- **Batch processing**: Process items in batches when possible

---

## Security

- **Never commit**: credentials, API keys, `.env` files
- **Use environment variables** for sensitive config
- **Validate paths**: Prevent path traversal attacks
- **Validate inputs**: At system boundaries

---

## Commands

| Task | Command |
|------|---------|
| Run autolabel | `python run_autolabel.py` |
| Train YOLO | `python -m src.cli.train --config configs/training.yaml` |
| Run inference | `python -m src.cli.infer --model models/best.pt` |
| Run tests | `pytest tests/ -v` |
| Coverage | `pytest tests/ --cov=src --cov-report=html` |
| Format | `black src/ tests/` |
| Lint | `ruff check src/ tests/ --fix` |
| Type check | `mypy src/` |

---

## DO NOT

- Hardcode file paths or magic numbers
- Use `print()` for logging
- Skip type hints on public APIs
- Write functions longer than 50 lines
- Mix business logic with I/O
- Commit credentials or `.env` files
- Use `# type: ignore` without explanation
- Use mutable default arguments
- Catch bare `except:`
- Use flip augmentation for text detection

## DO

- Use type hints everywhere
- Write descriptive docstrings
- Log with appropriate levels
- Use dataclasses for data structures
- Use enums for constants
- Use Protocol for interfaces
- Set random seeds for reproducibility
- Track experiment configurations
- Use context managers for resources
- Validate inputs at boundaries
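---

## Appendix: Error Handling Sketch

A minimal, self-contained illustration of the error-handling rules above: a custom exception hierarchy rooted at `InvoiceMasterError`, logging instead of `print()`, and graceful degradation to a fallback strategy. Only `InvoiceMasterError` is defined by this document; `OcrError`, `extract_text`, and the engine helpers are hypothetical names used for the example.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


class InvoiceMasterError(Exception):
    """Base class for all project-specific exceptions."""


class OcrError(InvoiceMasterError):
    """Raised when an OCR engine fails on a page (hypothetical subclass)."""


def extract_text(page_id: int, *, use_fallback: bool = True) -> str:
    """Extract text with graceful degradation to a fallback engine.

    Args:
        page_id: Identifier of the page to process.
        use_fallback: Whether to try the fallback engine after a failure.

    Returns:
        Extracted text, or an empty string if every strategy fails.
    """
    try:
        return _primary_engine(page_id)
    except OcrError:
        logger.warning("Primary OCR failed on page %d", page_id)
        if not use_fallback:
            raise
    try:
        return _fallback_engine(page_id)
    except OcrError:
        logger.error("All OCR strategies failed on page %d", page_id)
        return ""


def _primary_engine(page_id: int) -> str:
    # Simulated failure so the fallback path is exercised.
    raise OcrError(f"simulated failure on page {page_id}")


def _fallback_engine(page_id: int) -> str:
    return f"text from page {page_id} (fallback engine)"


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    logger.info("Result: %r", extract_text(1))
```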