# Claude Code Instructions - Invoice Master POC v2

## Environment Requirements

> **IMPORTANT**: This project MUST run in a **WSL + Conda** environment.

| Requirement | Details |
|-------------|---------|
| **WSL** | WSL 2 with Ubuntu 22.04+ |
| **Conda** | Miniconda or Anaconda |
| **Python** | 3.10+ (managed by Conda) |
| **GPU** | NVIDIA drivers on Windows + CUDA in WSL |

```bash
# Verify environment before running any commands
uname -a              # Should show "Linux"
conda --version       # Should show conda version
conda activate <env>  # Activate project environment
which python          # Should point to conda environment
```

**All commands must be executed in a WSL terminal with the Conda environment activated.**

---

## Project Overview

**Automated invoice field extraction system** for Swedish PDF invoices:
- **YOLO Object Detection** (YOLOv8/v11) for field region detection
- **PaddleOCR** for text extraction
- **Multi-strategy matching** for field validation

**Stack**: Python 3.10+ | PyTorch | Ultralytics | PaddleOCR | PyMuPDF

**Target Fields**: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount

---

## Architecture Principles

### SOLID
- **Single Responsibility**: Each module handles one concern
- **Open/Closed**: Extend via new strategies, not by modifying existing code
- **Liskov Substitution**: Use Protocol/ABC for interchangeable components
- **Interface Segregation**: Small, focused interfaces
- **Dependency Inversion**: Depend on abstractions, inject dependencies
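
The principles above can be sketched with a small strategy interface. This is only an illustration: `MatchStrategy`, `ExactMatch`, `DigitsOnlyMatch`, and the `matches` method are hypothetical names, not the project's actual API. New behaviour is added by writing another strategy class, and `FieldMatcher` receives its strategies through the constructor instead of creating them.

```python
from dataclasses import dataclass
from typing import Protocol


class MatchStrategy(Protocol):
    """Hypothetical interface: one way of comparing an expected value to OCR text."""

    def match(self, expected: str, candidate: str) -> bool: ...


class ExactMatch:
    """Strategy that accepts only identical strings."""

    def match(self, expected: str, candidate: str) -> bool:
        return expected == candidate


class DigitsOnlyMatch:
    """Strategy that compares digits only, ignoring spaces and dashes."""

    def match(self, expected: str, candidate: str) -> bool:
        expected_digits = "".join(ch for ch in expected if ch.isdigit())
        candidate_digits = "".join(ch for ch in candidate if ch.isdigit())
        return bool(expected_digits) and expected_digits == candidate_digits


@dataclass(frozen=True)
class FieldMatcher:
    """Depends on the MatchStrategy abstraction; strategies are injected."""

    strategies: tuple[MatchStrategy, ...]

    def matches(self, expected: str, candidate: str) -> bool:
        # Any single strategy accepting the pair counts as a match.
        return any(s.match(expected, candidate) for s in self.strategies)
```
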

### Project Structure
```
src/
├── cli/          # Entry points only, no business logic
├── pdf/          # PDF processing (extraction, rendering, detection)
├── ocr/          # OCR engines (PaddleOCR wrapper)
├── normalize/    # Field normalization and validation
├── matcher/      # Multi-strategy field matching
├── yolo/         # YOLO annotation and dataset building
├── inference/    # Inference pipeline
└── data/         # Data loading and reporting
```

### Configuration
- `configs/default.yaml`: All tunable parameters
- `config.py`: Sensitive data (credentials, read from environment variables)
- Never hardcode magic numbers

---

## Python Standards

### Required
- **Type hints** on all public functions (PEP 484/585)
- **Docstrings** in Google style (PEP 257)
- **Dataclasses** for data structures (`frozen=True, slots=True` when immutable)
- **Protocol** for interfaces (PEP 544)
- **Enum** for constants
- **pathlib.Path** instead of string paths

### Naming Conventions
| Type | Convention | Example |
|------|------------|---------|
| Functions/Variables | snake_case | `extract_tokens`, `page_count` |
| Classes | PascalCase | `FieldMatcher`, `AutoLabelReport` |
| Constants | UPPER_SNAKE | `DEFAULT_DPI`, `FIELD_TYPES` |
| Private | _prefix | `_parse_date`, `_cache` |

### Import Order (isort)
1. `from __future__ import annotations`
2. Standard library
3. Third-party
4. Local modules
5. `if TYPE_CHECKING:` block

### Code Quality Tools
| Tool | Purpose | Config |
|------|---------|--------|
| Black | Formatting | line-length=100 |
| Ruff | Linting | E, F, W, I, N, D, UP, B, C4, SIM, ARG, PTH |
| MyPy | Type checking | strict=true |
| Pytest | Testing | tests/ directory |

---

## Error Handling

- Use **custom exception hierarchy** (base: `InvoiceMasterError`)
- Use **logging** instead of print (`logger = logging.getLogger(__name__)`)
- Implement **graceful degradation** with fallback strategies
- Use **context managers** for resource cleanup
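
The first three points might look like the sketch below. `InvoiceMasterError` is the base class named above; `ExtractionError` and `extract_with_fallback` are hypothetical names, and the strategies are plain callables standing in for real extractors.

```python
import logging
from collections.abc import Callable, Sequence

logger = logging.getLogger(__name__)


class InvoiceMasterError(Exception):
    """Base class for all project-specific errors."""


class ExtractionError(InvoiceMasterError):
    """Hypothetical subclass: a single extraction strategy failed."""


def extract_with_fallback(source: str, strategies: Sequence[Callable[[str], str]]) -> str:
    """Try each strategy in order and return the first successful result."""
    for strategy in strategies:
        try:
            return strategy(source)
        except ExtractionError as exc:
            # Degrade gracefully: log the failure and move on to the next strategy.
            logger.warning("Strategy %s failed: %s", strategy.__name__, exc)
    raise InvoiceMasterError(f"All {len(strategies)} strategies failed for {source!r}")
```
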

---

## Machine Learning Standards

### Data Management
- **Immutable raw data**: Never modify `data/raw/`
- **Version datasets**: Track with checksum and metadata
- **Reproducible splits**: Use a fixed random seed (42)
- **Split ratios**: 80% train / 10% val / 10% test
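
One way to get a reproducible 80/10/10 split and a dataset checksum is sketched here. The function names and the choice of SHA-256 over sorted document IDs are assumptions for illustration, not the project's actual scheme.

```python
import hashlib
import random
from collections.abc import Iterable

SPLIT_SEED = 42
SPLIT_RATIOS = (0.8, 0.1, 0.1)  # train / val / test


def split_dataset(doc_ids: Iterable[str], seed: int = SPLIT_SEED) -> dict[str, list[str]]:
    """Shuffle with a dedicated, seeded RNG so the split never depends on input order."""
    ordered = sorted(doc_ids)
    random.Random(seed).shuffle(ordered)
    n_train = round(len(ordered) * SPLIT_RATIOS[0])
    n_val = round(len(ordered) * SPLIT_RATIOS[1])
    return {
        "train": ordered[:n_train],
        "val": ordered[n_train:n_train + n_val],
        "test": ordered[n_train + n_val:],  # remainder goes to test
    }


def dataset_checksum(doc_ids: Iterable[str]) -> str:
    """Hash the sorted IDs so the same set of documents gives the same checksum."""
    digest = hashlib.sha256()
    for doc_id in sorted(doc_ids):
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```
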

### YOLO Training
- **Disable flips** for text detection (`fliplr=0.0, flipud=0.0`)
- **Use early stopping** (`patience=20`)
- **Enable AMP** for faster training (`amp=True`)
- **Save checkpoints** periodically (`save_period=10`)

### Reproducibility
- Set random seeds: `random`, `numpy`, `torch`
- Enable deterministic mode: `torch.backends.cudnn.deterministic = True`
- Track experiment config: model, epochs, batch_size, learning_rate, dataset_version, git_commit
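
A possible seeding helper is sketched below. `seed_everything` is a hypothetical name. NumPy and PyTorch are seeded only if they are installed, so the function also runs in a bare environment, and turning off `cudnn.benchmark` is an addition beyond the list above that is commonly paired with deterministic mode.

```python
import importlib.util
import random


def seed_everything(seed: int = 42) -> list[str]:
    """Seed every available RNG and return the names of the libraries seeded."""
    random.seed(seed)
    seeded = ["random"]

    if importlib.util.find_spec("numpy") is not None:
        import numpy as np

        np.random.seed(seed)
        seeded.append("numpy")

    if importlib.util.find_spec("torch") is not None:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False  # assumption: not required by the list above
        seeded.append("torch")

    return seeded
```
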

### Evaluation Metrics
- Precision, Recall, F1 Score
- mAP@0.5, mAP@0.5:0.95
- Per-class AP
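
For reference, precision, recall, and F1 follow directly from true-positive, false-positive, and false-negative counts. The helper below is a small sketch with a hypothetical name; the mAP figures need the full ranked detection list and are not computed here.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from detection counts.

    Each value falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```
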

---

## Testing Standards

### Structure
```
tests/
├── unit/           # Isolated, fast tests
├── integration/    # Multi-module tests
├── e2e/            # End-to-end workflow tests
├── fixtures/       # Test data
└── conftest.py     # Shared fixtures
```

### Practices
- Follow **AAA pattern**: Arrange, Act, Assert
- Use **parametrized tests** for multiple inputs
- Use **fixtures** for shared setup
- Use **mocking** for external dependencies
- Mark slow tests with `@pytest.mark.slow`
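
A short sketch of the AAA pattern with parametrization and a fixture is shown below. `normalize_amount` is a hypothetical function under test, written inline so the example is self-contained; in the project it would be imported from the relevant module.

```python
from decimal import Decimal

import pytest


def normalize_amount(text: str) -> Decimal:
    """Hypothetical function under test: parse a Swedish-formatted amount."""
    cleaned = text.replace("\u00a0", "").replace(" ", "").replace(",", ".")
    return Decimal(cleaned)


@pytest.fixture
def sample_amounts() -> list[str]:
    """Shared setup: raw strings as they might come out of OCR."""
    return ["1 234,50", "99,00"]


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("1 234,50", Decimal("1234.50")),
        ("99,00", Decimal("99.00")),
        ("100", Decimal("100")),
    ],
)
def test_normalize_amount(raw: str, expected: Decimal) -> None:
    # Arrange: inputs are supplied by parametrize
    # Act
    result = normalize_amount(raw)
    # Assert
    assert result == expected


def test_total_of_sample_amounts(sample_amounts: list[str]) -> None:
    # Arrange: the fixture provides the raw strings
    # Act
    total = sum(normalize_amount(raw) for raw in sample_amounts)
    # Assert
    assert total == Decimal("1333.50")
```
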

---

## Performance

- **Parallel processing**: Use `ProcessPoolExecutor` with progress tracking
- **Lazy loading**: Use `@cached_property` for expensive resources
- **Generators**: Use for large datasets to save memory
- **Batch processing**: Process items in batches when possible
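
The lazy-loading, generator, and batching points can be sketched as follows. `batched` and `OcrEngine` are hypothetical names, and the dictionary returned by `model` is a stand-in for an expensive model load, not a real PaddleOCR object. Parallel execution is left out of this sketch.

```python
from collections.abc import Iterable, Iterator
from functools import cached_property
from itertools import islice


def batched(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield lists of at most batch_size items without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


class OcrEngine:
    """Loads its model on first access only."""

    def __init__(self) -> None:
        self.load_count = 0

    @cached_property
    def model(self) -> dict[str, str]:
        # Stand-in for an expensive load; runs once per instance.
        self.load_count += 1
        return {"name": "stub-model"}
```
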

---

## Security

- **Never commit**: credentials, API keys, `.env` files
- **Use environment variables** for sensitive config
- **Validate paths**: Prevent path traversal attacks
- **Validate inputs**: At system boundaries
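
Path validation could be done along these lines. `resolve_inside` is a hypothetical helper; it resolves the requested path and rejects anything that ends up outside the allowed base directory, including `..` segments and absolute paths.

```python
from pathlib import Path


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path under base_dir, refusing to leave base_dir.

    Raises:
        ValueError: If the resolved path is outside base_dir.
    """
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # Python 3.9+
        raise ValueError(f"Path escapes {base}: {user_path!r}")
    return candidate
```
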

---

## Commands

| Task | Command |
|------|---------|
| Run autolabel | `python run_autolabel.py` |
| Train YOLO | `python -m src.cli.train --config configs/training.yaml` |
| Run inference | `python -m src.cli.infer --model models/best.pt` |
| Run tests | `pytest tests/ -v` |
| Coverage | `pytest tests/ --cov=src --cov-report=html` |
| Format | `black src/ tests/` |
| Lint | `ruff check src/ tests/ --fix` |
| Type check | `mypy src/` |

---

## DO NOT

- Hardcode file paths or magic numbers
- Use `print()` for logging
- Skip type hints on public APIs
- Write functions longer than 50 lines
- Mix business logic with I/O
- Commit credentials or `.env` files
- Use `# type: ignore` without explanation
- Use mutable default arguments
- Catch bare `except:`
- Use flip augmentation for text detection

## DO

- Use type hints everywhere
- Write descriptive docstrings
- Log with appropriate levels
- Use dataclasses for data structures
- Use enums for constants
- Use Protocol for interfaces
- Set random seeds for reproducibility
- Track experiment configurations
- Use context managers for resources
- Validate inputs at boundaries