# Claude Code Instructions - Invoice Master POC v2
## Environment Requirements
> **IMPORTANT**: This project MUST run in a **WSL + Conda** environment.
| Requirement | Details |
|-------------|---------|
| **WSL** | WSL 2 with Ubuntu 22.04+ |
| **Conda** | Miniconda or Anaconda |
| **Python** | 3.10+ (managed by Conda) |
| **GPU** | NVIDIA drivers on Windows + CUDA in WSL |
```bash
# Verify environment before running any commands
uname -a # Should show "Linux"
conda --version # Should show conda version
conda activate <env> # Activate project environment
which python # Should point to conda environment
```
**All commands must be executed in WSL terminal with Conda environment activated.**
---
## Project Overview
**Automated invoice field extraction system** for Swedish PDF invoices:
- **YOLO Object Detection** (YOLOv8/v11) for field region detection
- **PaddleOCR** for text extraction
- **Multi-strategy matching** for field validation
**Stack**: Python 3.10+ | PyTorch | Ultralytics | PaddleOCR | PyMuPDF
**Target Fields**: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount
---
## Architecture Principles
### SOLID
- **Single Responsibility**: Each module handles one concern
- **Open/Closed**: Extend via new strategies rather than modifying existing code
- **Liskov Substitution**: Use Protocol/ABC for interchangeable components
- **Interface Segregation**: Small, focused interfaces
- **Dependency Inversion**: Depend on abstractions, inject dependencies
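
A minimal sketch of these principles using `Protocol`; the `MatchStrategy` interface, `FirstNonEmptyStrategy`, and the `match` signature are illustrative, not the project's actual API:

```python
from __future__ import annotations

from typing import Protocol


class MatchStrategy(Protocol):
    """Interface for one field-matching strategy (Interface Segregation)."""

    def match(self, field_type: str, candidates: list[str]) -> str | None:
        """Return the best candidate for the field, or None."""
        ...


class FirstNonEmptyStrategy:
    """Toy strategy: accept the first non-empty candidate."""

    def match(self, field_type: str, candidates: list[str]) -> str | None:
        return next((c for c in candidates if c.strip()), None)


class FieldMatcher:
    """Depends on the abstraction; new strategies extend behavior (Open/Closed)."""

    def __init__(self, strategies: list[MatchStrategy]) -> None:
        self._strategies = strategies  # injected (Dependency Inversion)

    def match(self, field_type: str, candidates: list[str]) -> str | None:
        for strategy in self._strategies:
            result = strategy.match(field_type, candidates)
            if result is not None:
                return result
        return None
```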
### Project Structure
```
src/
├── cli/ # Entry points only, no business logic
├── pdf/ # PDF processing (extraction, rendering, detection)
├── ocr/ # OCR engines (PaddleOCR wrapper)
├── normalize/ # Field normalization and validation
├── matcher/ # Multi-strategy field matching
├── yolo/ # YOLO annotation and dataset building
├── inference/ # Inference pipeline
└── data/ # Data loading and reporting
```
### Configuration
- `configs/default.yaml` — All tunable parameters
- `config.py` — Sensitive data (credentials); load these from environment variables
- Never hardcode magic numbers
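
A minimal loader sketch, assuming PyYAML is installed; the `INVOICE_MASTER_API_KEY` variable and `api_key` key are hypothetical placeholders:

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Any

import yaml  # PyYAML


def load_config(path: Path = Path("configs/default.yaml")) -> dict[str, Any]:
    """Load tunable parameters from YAML; secrets come from the environment."""
    with path.open(encoding="utf-8") as f:
        config: dict[str, Any] = yaml.safe_load(f)
    # Never store credentials in YAML or source control.
    config["api_key"] = os.environ.get("INVOICE_MASTER_API_KEY")
    return config
```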
---
## Python Standards
### Required
- **Type hints** on all public functions (PEP 484/585)
- **Docstrings** in Google style (PEP 257)
- **Dataclasses** for data structures (`frozen=True, slots=True` when immutable)
- **Protocol** for interfaces (PEP 544)
- **Enum** for constants
- **pathlib.Path** instead of string paths
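
A short sketch combining these requirements; the `FieldType` values mirror the target fields above, while `Token` and `extract_tokens` are illustrative names:

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class FieldType(Enum):
    """Target fields as an Enum rather than loose string constants."""

    INVOICE_NUMBER = "InvoiceNumber"
    AMOUNT = "Amount"


@dataclass(frozen=True, slots=True)
class Token:
    """A single OCR token located on a rendered page.

    Attributes:
        text: Raw text content of the token.
        page: Zero-based page index.
        source: Path of the originating PDF.
    """

    text: str
    page: int
    source: Path


def extract_tokens(pdf_path: Path) -> list[Token]:
    """Extract OCR tokens from a PDF.

    Args:
        pdf_path: Location of the input PDF.

    Returns:
        Tokens found in the document (stubbed out here).
    """
    return []
```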
### Naming Conventions
| Type | Convention | Example |
|------|------------|---------|
| Functions/Variables | snake_case | `extract_tokens`, `page_count` |
| Classes | PascalCase | `FieldMatcher`, `AutoLabelReport` |
| Constants | UPPER_SNAKE | `DEFAULT_DPI`, `FIELD_TYPES` |
| Private | _prefix | `_parse_date`, `_cache` |
### Import Order (isort)
1. `from __future__ import annotations`
2. Standard library
3. Third-party
4. Local modules
5. `if TYPE_CHECKING:` block
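
An example module header following this order; `src.matcher` and `src.ocr` are illustrative local modules based on the structure above:

```python
from __future__ import annotations

# 2. Standard library
import logging
from pathlib import Path
from typing import TYPE_CHECKING

# 3. Third-party
import numpy as np

# 4. Local modules
from src.matcher import FieldMatcher

# 5. Type-only imports (no runtime cost, avoids circular imports)
if TYPE_CHECKING:
    from src.ocr import OcrEngine
```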
### Code Quality Tools
| Tool | Purpose | Config |
|------|---------|--------|
| Black | Formatting | line-length=100 |
| Ruff | Linting | E, F, W, I, N, D, UP, B, C4, SIM, ARG, PTH |
| MyPy | Type checking | strict=true |
| Pytest | Testing | tests/ directory |
---
## Error Handling
- Use **custom exception hierarchy** (base: `InvoiceMasterError`)
- Use **logging** instead of print (`logger = logging.getLogger(__name__)`)
- Implement **graceful degradation** with fallback strategies
- Use **context managers** for resource cleanup
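
A minimal sketch of the pattern; only `InvoiceMasterError` is the project's named base class, while `OcrError` and the stub functions are hypothetical:

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


class InvoiceMasterError(Exception):
    """Base class for all project-specific errors."""


class OcrError(InvoiceMasterError):
    """Raised when an OCR engine fails on a page."""


def _primary_ocr(page_image: bytes) -> str:
    """Stub standing in for the real PaddleOCR call."""
    raise OcrError("primary engine unavailable")


def extract_text(page_image: bytes) -> str:
    """Extract text, degrading gracefully to an empty result on failure."""
    try:
        return _primary_ocr(page_image)
    except OcrError:
        logger.warning("Primary OCR failed, using fallback", exc_info=True)
        return ""  # fallback result instead of crashing the pipeline
```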
---
## Machine Learning Standards
### Data Management
- **Immutable raw data**: Never modify `data/raw/`
- **Version datasets**: Track with checksum and metadata
- **Reproducible splits**: Use fixed random seed (42)
- **Split ratios**: 80% train / 10% val / 10% test
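
A reproducible split sketch under these rules; the function name and signature are illustrative:

```python
from __future__ import annotations

import random
from pathlib import Path


def split_dataset(
    items: list[Path], seed: int = 42
) -> tuple[list[Path], list[Path], list[Path]]:
    """Shuffle deterministically and split 80/10/10 into train/val/test."""
    rng = random.Random(seed)  # fixed seed -> identical split every run
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_val = int(len(shuffled) * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train : n_train + n_val]
    test = shuffled[n_train + n_val :]  # remainder, roughly 10%
    return train, val, test
```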
### YOLO Training
- **Disable flips** for text detection (`fliplr=0.0, flipud=0.0`)
- **Use early stopping** (`patience=20`)
- **Enable AMP** for faster training (`amp=True`)
- **Save checkpoints** periodically (`save_period=10`)
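
A hedged example using the Ultralytics Python API with the settings above; weights and dataset paths are placeholders:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # or a YOLOv11 variant
model.train(
    data="data/dataset.yaml",  # placeholder dataset config
    epochs=100,
    patience=20,     # early stopping
    fliplr=0.0,      # text is orientation-sensitive: no horizontal flips
    flipud=0.0,      # no vertical flips
    amp=True,        # mixed precision for faster training
    save_period=10,  # checkpoint every 10 epochs
    seed=42,
)
```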
### Reproducibility
- Set random seeds: `random`, `numpy`, `torch`
- Enable deterministic mode: `torch.backends.cudnn.deterministic = True`
- Track experiment config: model, epochs, batch_size, learning_rate, dataset_version, git_commit
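
A typical seed helper implementing these points; the `set_seed` name is illustrative:

```python
from __future__ import annotations

import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed all RNGs and force deterministic cuDNN kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # autotuner breaks determinism
```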
### Evaluation Metrics
- Precision, Recall, F1 Score
- mAP@0.5, mAP@0.5:0.95
- Per-class AP
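
A sketch of collecting these metrics via Ultralytics validation, assuming a trained checkpoint exists; paths are placeholders and attribute names follow the Ultralytics detection-metrics API as commonly documented:

```python
from ultralytics import YOLO

model = YOLO("models/best.pt")  # trained checkpoint (placeholder path)
metrics = model.val(data="data/dataset.yaml")
print(f"precision    : {metrics.box.mp:.3f}")  # mean precision
print(f"recall       : {metrics.box.mr:.3f}")  # mean recall
print(f"mAP@0.5      : {metrics.box.map50:.3f}")
print(f"mAP@0.5:0.95 : {metrics.box.map:.3f}")
print(f"per-class AP : {metrics.box.maps}")
```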
---
## Testing Standards
### Structure
```
tests/
├── unit/ # Isolated, fast tests
├── integration/ # Multi-module tests
├── e2e/ # End-to-end workflow tests
├── fixtures/ # Test data
└── conftest.py # Shared fixtures
```
### Practices
- Follow **AAA pattern**: Arrange, Act, Assert
- Use **parametrized tests** for multiple inputs
- Use **fixtures** for shared setup
- Use **mocking** for external dependencies
- Mark slow tests with `@pytest.mark.slow`
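
A small parametrized test in AAA form; `normalize_amount` here is a toy stand-in, not the project's real normalizer:

```python
import pytest


def normalize_amount(raw: str) -> str:
    """Toy stand-in: strip thousands spaces, use dot decimals."""
    return raw.replace(" ", "").replace(",", ".")


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("1 234,50", "1234.50"),
        ("1234.50", "1234.50"),
    ],
)
def test_normalize_amount(raw: str, expected: str) -> None:
    # Arrange: inputs supplied by parametrization
    # Act
    result = normalize_amount(raw)
    # Assert
    assert result == expected
```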
---
## Performance
- **Parallel processing**: Use `ProcessPoolExecutor` with progress tracking
- **Lazy loading**: Use `@cached_property` for expensive resources
- **Generators**: Use for large datasets to save memory
- **Batch processing**: Process items in batches when possible
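
A sketch combining parallelism with logged progress; `process_one` is a stand-in for real per-file work:

```python
from __future__ import annotations

import logging
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

logger = logging.getLogger(__name__)


def process_one(pdf_path: Path) -> str:
    """Stand-in for real per-file work (render, OCR, match)."""
    return pdf_path.name


def process_all(pdf_paths: list[Path]) -> list[str]:
    """Run per-file work in parallel and log progress as futures finish."""
    results: list[str] = []
    with ProcessPoolExecutor() as pool:
        futures = {pool.submit(process_one, p): p for p in pdf_paths}
        for done, future in enumerate(as_completed(futures), start=1):
            results.append(future.result())
            logger.info("Processed %d/%d", done, len(futures))
    return results


if __name__ == "__main__":  # guard required when spawning worker processes
    logging.basicConfig(level=logging.INFO)
    process_all([Path("a.pdf"), Path("b.pdf")])
```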
---
## Security
- **Never commit**: credentials, API keys, `.env` files
- **Use environment variables** for sensitive config
- **Validate paths**: Prevent path traversal attacks
- **Validate inputs**: At system boundaries
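
A path-validation sketch to prevent traversal; the function name is illustrative, and `Path.is_relative_to` needs Python 3.9+, which the 3.10+ requirement above satisfies:

```python
from __future__ import annotations

from pathlib import Path


def resolve_within(base_dir: Path, user_path: str) -> Path:
    """Resolve a user-supplied path and reject escapes from base_dir."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"Path escapes base directory: {user_path}")
    return candidate
```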
---
## Commands
| Task | Command |
|------|---------|
| Run autolabel | `python run_autolabel.py` |
| Train YOLO | `python -m src.cli.train --config configs/training.yaml` |
| Run inference | `python -m src.cli.infer --model models/best.pt` |
| Run tests | `pytest tests/ -v` |
| Coverage | `pytest tests/ --cov=src --cov-report=html` |
| Format | `black src/ tests/` |
| Lint | `ruff check src/ tests/ --fix` |
| Type check | `mypy src/` |
---
## DO NOT
- Hardcode file paths or magic numbers
- Use `print()` for logging
- Skip type hints on public APIs
- Write functions longer than 50 lines
- Mix business logic with I/O
- Commit credentials or `.env` files
- Use `# type: ignore` without explanation
- Use mutable default arguments
- Catch bare `except:`
- Use flip augmentation for text detection
## DO
- Use type hints everywhere
- Write descriptive docstrings
- Log with appropriate levels
- Use dataclasses for data structures
- Use enums for constants
- Use Protocol for interfaces
- Set random seeds for reproducibility
- Track experiment configurations
- Use context managers for resources
- Validate inputs at boundaries