WOP

2026-01-13 00:10:27 +01:00
parent 1b7c61cdd8
commit b26fd61852
43 changed files with 7751 additions and 578 deletions
--- a/claude.md
+++ b/claude.md
@@ -0,0 +1,216 @@
+# Claude Code Instructions - Invoice Master POC v2
+
+## Environment Requirements
+
+> **IMPORTANT**: This project MUST run in **WSL + Conda** environment.
+
+| Requirement | Details |
+|-------------|---------|
+| **WSL** | WSL 2 with Ubuntu 22.04+ |
+| **Conda** | Miniconda or Anaconda |
+| **Python** | 3.10+ (managed by Conda) |
+| **GPU** | NVIDIA drivers on Windows + CUDA in WSL |
+
+```bash
+# Verify environment before running any commands
+uname -a              # Should show "Linux"
+conda --version       # Should show conda version
+conda activate <env>  # Activate project environment
+which python          # Should point to conda environment
+```
+
+**All commands must be executed in WSL terminal with Conda environment activated.**
+
+---
+
+## Project Overview
+
+**Automated invoice field extraction system** for Swedish PDF invoices:
+- **YOLO Object Detection** (YOLOv8/v11) for field region detection
+- **PaddleOCR** for text extraction
+- **Multi-strategy matching** for field validation
+
+**Stack**: Python 3.10+ | PyTorch | Ultralytics | PaddleOCR | PyMuPDF
+
+**Target Fields**: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount
+
+---
+
+## Architecture Principles
+
+### SOLID
+- **Single Responsibility**: Each module handles one concern
+- **Open/Closed**: Extend via new strategies, not modifying existing code
+- **Liskov Substitution**: Use Protocol/ABC for interchangeable components
+- **Interface Segregation**: Small, focused interfaces
+- **Dependency Inversion**: Depend on abstractions, inject dependencies
+
+### Project Structure
+```
+src/
+├── cli/           # Entry points only, no business logic
+├── pdf/           # PDF processing (extraction, rendering, detection)
+├── ocr/           # OCR engines (PaddleOCR wrapper)
+├── normalize/     # Field normalization and validation
+├── matcher/       # Multi-strategy field matching
+├── yolo/          # YOLO annotation and dataset building
+├── inference/     # Inference pipeline
+└── data/          # Data loading and reporting
+```
+
+### Configuration
+- `configs/default.yaml` — All tunable parameters
+- `config.py` — Sensitive data (credentials, use environment variables)
+- Never hardcode magic numbers
+
+---
+
+## Python Standards
+
+### Required
+- **Type hints** on all public functions (PEP 484/585)
+- **Docstrings** in Google style (PEP 257)
+- **Dataclasses** for data structures (`frozen=True, slots=True` when immutable)
+- **Protocol** for interfaces (PEP 544)
+- **Enum** for constants
+- **pathlib.Path** instead of string paths
+
+### Naming Conventions
+| Type | Convention | Example |
+|------|------------|---------|
+| Functions/Variables | snake_case | `extract_tokens`, `page_count` |
+| Classes | PascalCase | `FieldMatcher`, `AutoLabelReport` |
+| Constants | UPPER_SNAKE | `DEFAULT_DPI`, `FIELD_TYPES` |
+| Private | _prefix | `_parse_date`, `_cache` |
+
+### Import Order (isort)
+1. `from __future__ import annotations`
+2. Standard library
+3. Third-party
+4. Local modules
+5. `if TYPE_CHECKING:` block
+
+### Code Quality Tools
+| Tool | Purpose | Config |
+|------|---------|--------|
+| Black | Formatting | line-length=100 |
+| Ruff | Linting | E, F, W, I, N, D, UP, B, C4, SIM, ARG, PTH |
+| MyPy | Type checking | strict=true |
+| Pytest | Testing | tests/ directory |
+
+---
+
+## Error Handling
+
+- Use **custom exception hierarchy** (base: `InvoiceMasterError`)
+- Use **logging** instead of print (`logger = logging.getLogger(__name__)`)
+- Implement **graceful degradation** with fallback strategies
+- Use **context managers** for resource cleanup
+
+---
+
+## Machine Learning Standards
+
+### Data Management
+- **Immutable raw data**: Never modify `data/raw/`
+- **Version datasets**: Track with checksum and metadata
+- **Reproducible splits**: Use fixed random seed (42)
+- **Split ratios**: 80% train / 10% val / 10% test
+
+### YOLO Training
+- **Disable flips** for text detection (`fliplr=0.0, flipud=0.0`)
+- **Use early stopping** (`patience=20`)
+- **Enable AMP** for faster training (`amp=true`)
+- **Save checkpoints** periodically (`save_period=10`)
+
+### Reproducibility
+- Set random seeds: `random`, `numpy`, `torch`
+- Enable deterministic mode: `torch.backends.cudnn.deterministic = True`
+- Track experiment config: model, epochs, batch_size, learning_rate, dataset_version, git_commit
+
+### Evaluation Metrics
+- Precision, Recall, F1 Score
+- mAP@0.5, mAP@0.5:0.95
+- Per-class AP
+
+---
+
+## Testing Standards
+
+### Structure
+```
+tests/
+├── unit/           # Isolated, fast tests
+├── integration/    # Multi-module tests
+├── e2e/            # End-to-end workflow tests
+├── fixtures/       # Test data
+└── conftest.py     # Shared fixtures
+```
+
+### Practices
+- Follow **AAA pattern**: Arrange, Act, Assert
+- Use **parametrized tests** for multiple inputs
+- Use **fixtures** for shared setup
+- Use **mocking** for external dependencies
+- Mark slow tests with `@pytest.mark.slow`
+
+---
+
+## Performance
+
+- **Parallel processing**: Use `ProcessPoolExecutor` with progress tracking
+- **Lazy loading**: Use `@cached_property` for expensive resources
+- **Generators**: Use for large datasets to save memory
+- **Batch processing**: Process items in batches when possible
+
+---
+
+## Security
+
+- **Never commit**: credentials, API keys, `.env` files
+- **Use environment variables** for sensitive config
+- **Validate paths**: Prevent path traversal attacks
+- **Validate inputs**: At system boundaries
+
+---
+
+## Commands
+
+| Task | Command |
+|------|---------|
+| Run autolabel | `python run_autolabel.py` |
+| Train YOLO | `python -m src.cli.train --config configs/training.yaml` |
+| Run inference | `python -m src.cli.infer --model models/best.pt` |
+| Run tests | `pytest tests/ -v` |
+| Coverage | `pytest tests/ --cov=src --cov-report=html` |
+| Format | `black src/ tests/` |
+| Lint | `ruff check src/ tests/ --fix` |
+| Type check | `mypy src/` |
+
+---
+
+## DO NOT
+
+- Hardcode file paths or magic numbers
+- Use `print()` for logging
+- Skip type hints on public APIs
+- Write functions longer than 50 lines
+- Mix business logic with I/O
+- Commit credentials or `.env` files
+- Use `# type: ignore` without explanation
+- Use mutable default arguments
+- Catch bare `except:`
+- Use flip augmentation for text detection
+
+## DO
+
+- Use type hints everywhere
+- Write descriptive docstrings
+- Log with appropriate levels
+- Use dataclasses for data structures
+- Use enums for constants
+- Use Protocol for interfaces
+- Set random seeds for reproducibility
+- Track experiment configurations
+- Use context managers for resources
+- Validate inputs at boundaries