# Claude Code Instructions - Invoice Master POC v2
## Environment Requirements
> **IMPORTANT**: This project MUST run in a **WSL + Conda** environment.
| Requirement | Details |
|-------------|---------|
| **WSL** | WSL 2 with Ubuntu 22.04+ |
| **Conda** | Miniconda or Anaconda |
| **Python** | 3.10+ (managed by Conda) |
| **GPU** | NVIDIA drivers on Windows + CUDA in WSL |
```bash
# Verify environment before running any commands
uname -a # Should show "Linux"
conda --version # Should show conda version
conda activate <env> # Activate project environment
which python # Should point to conda environment
```
**All commands must be executed in WSL terminal with Conda environment activated.**
---
## Project Overview
**Automated invoice field extraction system** for Swedish PDF invoices:
- **YOLO Object Detection** (YOLOv8/v11) for field region detection
- **PaddleOCR** for text extraction
- **Multi-strategy matching** for field validation
**Stack**: Python 3.10+ | PyTorch | Ultralytics | PaddleOCR | PyMuPDF
**Target Fields**: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount
---
## Architecture Principles
### SOLID
- **Single Responsibility**: Each module handles one concern
- **Open/Closed**: Extend via new strategies rather than modifying existing code
- **Liskov Substitution**: Use Protocol/ABC for interchangeable components
- **Interface Segregation**: Small, focused interfaces
- **Dependency Inversion**: Depend on abstractions, inject dependencies
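
A minimal sketch of these principles using `Protocol`; the `MatchStrategy` interface, `FirstNonEmptyStrategy`, and the `match` signature are illustrative, not the project's actual API:

```python
from __future__ import annotations

from typing import Protocol


class MatchStrategy(Protocol):
    """Interface for one field-matching strategy (Interface Segregation)."""

    def match(self, field_type: str, candidates: list[str]) -> str | None:
        """Return the best candidate for the field, or None."""
        ...


class FirstNonEmptyStrategy:
    """Toy strategy: accept the first non-empty candidate."""

    def match(self, field_type: str, candidates: list[str]) -> str | None:
        return next((c for c in candidates if c.strip()), None)


class FieldMatcher:
    """Depends on the abstraction; new strategies extend behavior (Open/Closed)."""

    def __init__(self, strategies: list[MatchStrategy]) -> None:
        self._strategies = strategies  # injected (Dependency Inversion)

    def match(self, field_type: str, candidates: list[str]) -> str | None:
        for strategy in self._strategies:
            result = strategy.match(field_type, candidates)
            if result is not None:
                return result
        return None
```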
### Project Structure
```
src/
├── cli/ # Entry points only, no business logic
├── pdf/ # PDF processing (extraction, rendering, detection)
├── ocr/ # OCR engines (PaddleOCR wrapper)
├── normalize/ # Field normalization and validation
├── matcher/ # Multi-strategy field matching
├── yolo/ # YOLO annotation and dataset building
├── inference/ # Inference pipeline
└── data/ # Data loading and reporting
```
### Configuration
- `configs/default.yaml` — All tunable parameters
- `config.py` — Sensitive data (credentials); load these from environment variables
- Never hardcode magic numbers
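
A minimal loader sketch, assuming PyYAML is installed; the `INVOICE_MASTER_API_KEY` variable and `api_key` key are hypothetical placeholders:

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Any

import yaml  # PyYAML


def load_config(path: Path = Path("configs/default.yaml")) -> dict[str, Any]:
    """Load tunable parameters from YAML; secrets come from the environment."""
    with path.open(encoding="utf-8") as f:
        config: dict[str, Any] = yaml.safe_load(f)
    # Never store credentials in YAML or source control.
    config["api_key"] = os.environ.get("INVOICE_MASTER_API_KEY")
    return config
```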
---
## Python Standards
### Required
- **Type hints** on all public functions (PEP 484/585)
- **Docstrings** in Google style (PEP 257)
- **Dataclasses** for data structures (`frozen=True, slots=True` when immutable)
- **Protocol** for interfaces (PEP 544)
- **Enum** for constants
- **pathlib.Path** instead of string paths
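
A short sketch combining these requirements; the `FieldType` values mirror the target fields above, while `Token` and `extract_tokens` are illustrative names:

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class FieldType(Enum):
    """Target fields as an Enum rather than loose string constants."""

    INVOICE_NUMBER = "InvoiceNumber"
    AMOUNT = "Amount"


@dataclass(frozen=True, slots=True)
class Token:
    """A single OCR token located on a rendered page.

    Attributes:
        text: Raw text content of the token.
        page: Zero-based page index.
        source: Path of the originating PDF.
    """

    text: str
    page: int
    source: Path


def extract_tokens(pdf_path: Path) -> list[Token]:
    """Extract OCR tokens from a PDF.

    Args:
        pdf_path: Location of the input PDF.

    Returns:
        Tokens found in the document (stubbed out here).
    """
    return []
```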
### Naming Conventions
| Type | Convention | Example |
|------|------------|---------|
| Functions/Variables | snake_case | `extract_tokens`, `page_count` |
| Classes | PascalCase | `FieldMatcher`, `AutoLabelReport` |
| Constants | UPPER_SNAKE | `DEFAULT_DPI`, `FIELD_TYPES` |
| Private | _prefix | `_parse_date`, `_cache` |
### Import Order (isort)
1. `from __future__ import annotations`
2. Standard library
3. Third-party
4. Local modules
5. `if TYPE_CHECKING:` block
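
An example module header following this order; `src.matcher` and `src.ocr` are illustrative local modules based on the structure above:

```python
from __future__ import annotations

# 2. Standard library
import logging
from pathlib import Path
from typing import TYPE_CHECKING

# 3. Third-party
import numpy as np

# 4. Local modules
from src.matcher import FieldMatcher

# 5. Type-only imports (no runtime cost, avoids circular imports)
if TYPE_CHECKING:
    from src.ocr import OcrEngine
```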
### Code Quality Tools
| Tool | Purpose | Config |
|------|---------|--------|
| Black | Formatting | line-length=100 |
| Ruff | Linting | E, F, W, I, N, D, UP, B, C4, SIM, ARG, PTH |
| MyPy | Type checking | strict=true |
| Pytest | Testing | tests/ directory |
---
## Error Handling
- Use **custom exception hierarchy** (base: `InvoiceMasterError`)
- Use **logging** instead of print (`logger = logging.getLogger(__name__)`)
- Implement **graceful degradation** with fallback strategies
- Use **context managers** for resource cleanup
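
A minimal sketch of the pattern; only `InvoiceMasterError` is the project's named base class, while `OcrError` and the stub functions are hypothetical:

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


class InvoiceMasterError(Exception):
    """Base class for all project-specific errors."""


class OcrError(InvoiceMasterError):
    """Raised when an OCR engine fails on a page."""


def _primary_ocr(page_image: bytes) -> str:
    """Stub standing in for the real PaddleOCR call."""
    raise OcrError("primary engine unavailable")


def extract_text(page_image: bytes) -> str:
    """Extract text, degrading gracefully to an empty result on failure."""
    try:
        return _primary_ocr(page_image)
    except OcrError:
        logger.warning("Primary OCR failed, using fallback", exc_info=True)
        return ""  # fallback result instead of crashing the pipeline
```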
---
## Machine Learning Standards
### Data Management
- **Immutable raw data**: Never modify `data/raw/`
- **Version datasets**: Track with checksum and metadata
- **Reproducible splits**: Use fixed random seed (42)
- **Split ratios**: 80% train / 10% val / 10% test
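
A reproducible split sketch under these rules; the function name and signature are illustrative:

```python
from __future__ import annotations

import random
from pathlib import Path


def split_dataset(
    items: list[Path], seed: int = 42
) -> tuple[list[Path], list[Path], list[Path]]:
    """Shuffle deterministically and split 80/10/10 into train/val/test."""
    rng = random.Random(seed)  # fixed seed -> identical split every run
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_val = int(len(shuffled) * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train : n_train + n_val]
    test = shuffled[n_train + n_val :]  # remainder, roughly 10%
    return train, val, test
```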
### YOLO Training
- **Disable flips** for text detection (`fliplr=0.0, flipud=0.0`)
- **Use early stopping** (`patience=20`)
- **Enable AMP** for faster training (`amp=True`)
- **Save checkpoints** periodically (`save_period=10`)
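
A hedged example using the Ultralytics Python API with the settings above; weights and dataset paths are placeholders:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # or a YOLOv11 variant
model.train(
    data="data/dataset.yaml",  # placeholder dataset config
    epochs=100,
    patience=20,     # early stopping
    fliplr=0.0,      # text is orientation-sensitive: no horizontal flips
    flipud=0.0,      # no vertical flips
    amp=True,        # mixed precision for faster training
    save_period=10,  # checkpoint every 10 epochs
    seed=42,
)
```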
### Reproducibility
- Set random seeds: `random`, `numpy`, `torch`
- Enable deterministic mode: `torch.backends.cudnn.deterministic = True`
- Track experiment config: model, epochs, batch_size, learning_rate, dataset_version, git_commit
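
A typical seed helper implementing these points; the `set_seed` name is illustrative:

```python
from __future__ import annotations

import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed all RNGs and force deterministic cuDNN kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # autotuner breaks determinism
```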
### Evaluation Metrics
- Precision, Recall, F1 Score
- mAP@0.5, mAP@0.5:0.95
- Per-class AP
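
A sketch of collecting these metrics via Ultralytics validation, assuming a trained checkpoint exists; paths are placeholders and attribute names follow the Ultralytics detection-metrics API as commonly documented:

```python
from ultralytics import YOLO

model = YOLO("models/best.pt")  # trained checkpoint (placeholder path)
metrics = model.val(data="data/dataset.yaml")
print(f"precision    : {metrics.box.mp:.3f}")  # mean precision
print(f"recall       : {metrics.box.mr:.3f}")  # mean recall
print(f"mAP@0.5      : {metrics.box.map50:.3f}")
print(f"mAP@0.5:0.95 : {metrics.box.map:.3f}")
print(f"per-class AP : {metrics.box.maps}")
```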
---
## Testing Standards
### Structure
```
tests/
├── unit/ # Isolated, fast tests
├── integration/ # Multi-module tests
├── e2e/ # End-to-end workflow tests
├── fixtures/ # Test data
└── conftest.py # Shared fixtures
```
### Practices
- Follow **AAA pattern**: Arrange, Act, Assert
- Use **parametrized tests** for multiple inputs
- Use **fixtures** for shared setup
- Use **mocking** for external dependencies
- Mark slow tests with `@pytest.mark.slow`
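
A small parametrized test in AAA form; `normalize_amount` here is a toy stand-in, not the project's real normalizer:

```python
import pytest


def normalize_amount(raw: str) -> str:
    """Toy stand-in: strip thousands spaces, use dot decimals."""
    return raw.replace(" ", "").replace(",", ".")


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("1 234,50", "1234.50"),
        ("1234.50", "1234.50"),
    ],
)
def test_normalize_amount(raw: str, expected: str) -> None:
    # Arrange: inputs supplied by parametrization
    # Act
    result = normalize_amount(raw)
    # Assert
    assert result == expected
```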
---
## Performance
- **Parallel processing**: Use `ProcessPoolExecutor` with progress tracking
- **Lazy loading**: Use `@cached_property` for expensive resources
- **Generators**: Use for large datasets to save memory
- **Batch processing**: Process items in batches when possible
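
A sketch combining parallelism with logged progress; `process_one` is a stand-in for real per-file work:

```python
from __future__ import annotations

import logging
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

logger = logging.getLogger(__name__)


def process_one(pdf_path: Path) -> str:
    """Stand-in for real per-file work (render, OCR, match)."""
    return pdf_path.name


def process_all(pdf_paths: list[Path]) -> list[str]:
    """Run per-file work in parallel and log progress as futures finish."""
    results: list[str] = []
    with ProcessPoolExecutor() as pool:
        futures = {pool.submit(process_one, p): p for p in pdf_paths}
        for done, future in enumerate(as_completed(futures), start=1):
            results.append(future.result())
            logger.info("Processed %d/%d", done, len(futures))
    return results


if __name__ == "__main__":  # guard required when spawning worker processes
    logging.basicConfig(level=logging.INFO)
    process_all([Path("a.pdf"), Path("b.pdf")])
```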
---
## Security
- **Never commit**: credentials, API keys, `.env` files
- **Use environment variables** for sensitive config
- **Validate paths**: Prevent path traversal attacks
- **Validate inputs**: At system boundaries
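
A path-validation sketch to prevent traversal; the function name is illustrative, and `Path.is_relative_to` needs Python 3.9+, which the 3.10+ requirement above satisfies:

```python
from __future__ import annotations

from pathlib import Path


def resolve_within(base_dir: Path, user_path: str) -> Path:
    """Resolve a user-supplied path and reject escapes from base_dir."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"Path escapes base directory: {user_path}")
    return candidate
```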
---
## Commands
| Task | Command |
|------|---------|
| Run autolabel | `python run_autolabel.py` |
| Train YOLO | `python -m src.cli.train --config configs/training.yaml` |
| Run inference | `python -m src.cli.infer --model models/best.pt` |
| Run tests | `pytest tests/ -v` |
| Coverage | `pytest tests/ --cov=src --cov-report=html` |
| Format | `black src/ tests/` |
| Lint | `ruff check src/ tests/ --fix` |
| Type check | `mypy src/` |
---
## DO NOT
- Hardcode file paths or magic numbers
- Use `print()` for logging
- Skip type hints on public APIs
- Write functions longer than 50 lines
- Mix business logic with I/O
- Commit credentials or `.env` files
- Use `# type: ignore` without explanation
- Use mutable default arguments
- Catch bare `except:`
- Use flip augmentation for text detection
## DO
- Use type hints everywhere
- Write descriptive docstrings
- Log with appropriate levels
- Use dataclasses for data structures
- Use enums for constants
- Use Protocol for interfaces
- Set random seeds for reproducibility
- Track experiment configurations
- Use context managers for resources
- Validate inputs at boundaries