# Claude Code Instructions - Invoice Master POC v2
## Environment Requirements

**Important**: This project MUST run in a WSL + Conda environment.
| Requirement | Details |
|---|---|
| WSL | WSL 2 with Ubuntu 22.04+ |
| Conda | Miniconda or Anaconda |
| Python | 3.10+ (managed by Conda) |
| GPU | NVIDIA drivers on Windows + CUDA in WSL |
```bash
# Verify environment before running any commands
uname -a              # Should show "Linux"
conda --version       # Should show conda version
conda activate <env>  # Activate project environment
which python          # Should point to conda environment
```
All commands must be executed in WSL terminal with Conda environment activated.
## Project Overview
Automated invoice field extraction system for Swedish PDF invoices:
- YOLO Object Detection (YOLOv8/v11) for field region detection
- PaddleOCR for text extraction
- Multi-strategy matching for field validation
**Stack**: Python 3.10+ | PyTorch | Ultralytics | PaddleOCR | PyMuPDF

**Target Fields**: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount
## Architecture Principles

### SOLID
- Single Responsibility: Each module handles one concern
- Open/Closed: Extend via new strategies, not modifying existing code
- Liskov Substitution: Use Protocol/ABC for interchangeable components
- Interface Segregation: Small, focused interfaces
- Dependency Inversion: Depend on abstractions, inject dependencies
## Project Structure

```
src/
├── cli/        # Entry points only, no business logic
├── pdf/        # PDF processing (extraction, rendering, detection)
├── ocr/        # OCR engines (PaddleOCR wrapper)
├── normalize/  # Field normalization and validation
├── matcher/    # Multi-strategy field matching
├── yolo/       # YOLO annotation and dataset building
├── inference/  # Inference pipeline
└── data/       # Data loading and reporting
```
## Configuration

- `configs/default.yaml`: All tunable parameters
- `config.py`: Sensitive data (credentials, use environment variables)
- Never hardcode magic numbers
## Python Standards

### Required

- Type hints on all public functions (PEP 484/585)
- Docstrings in Google style (PEP 257)
- Dataclasses for data structures (`frozen=True, slots=True` when immutable)
- Protocol for interfaces (PEP 544)
- Enum for constants
- `pathlib.Path` instead of string paths
### Naming Conventions
| Type | Convention | Example |
|---|---|---|
| Functions/Variables | snake_case | extract_tokens, page_count |
| Classes | PascalCase | FieldMatcher, AutoLabelReport |
| Constants | UPPER_SNAKE | DEFAULT_DPI, FIELD_TYPES |
| Private | _prefix | _parse_date, _cache |
### Import Order (isort)

1. `from __future__ import annotations`
2. Standard library
3. Third-party
4. Local modules
5. `if TYPE_CHECKING:` block
## Code Quality Tools
| Tool | Purpose | Config |
|---|---|---|
| Black | Formatting | line-length=100 |
| Ruff | Linting | E, F, W, I, N, D, UP, B, C4, SIM, ARG, PTH |
| MyPy | Type checking | strict=true |
| Pytest | Testing | tests/ directory |
## Error Handling

- Use a custom exception hierarchy (base: `InvoiceMasterError`)
- Use logging instead of print (`logger = logging.getLogger(__name__)`)
- Implement graceful degradation with fallback strategies
- Use context managers for resource cleanup
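The hierarchy and fallback pattern could look like this. Only `InvoiceMasterError` comes from the document; the subclasses and the `extract_with_fallback` helper are illustrative assumptions:

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


class InvoiceMasterError(Exception):
    """Base class for all project-specific errors."""


class PdfExtractionError(InvoiceMasterError):
    """Raised when no usable text can be obtained from a PDF page."""


class OcrError(InvoiceMasterError):
    """Raised when the OCR engine fails on a rendered page."""


def extract_with_fallback(pdf_text: str | None, ocr_text: str | None) -> str:
    """Graceful degradation: prefer the embedded PDF text layer, fall back
    to OCR output, and raise a project error only when both are missing."""
    if pdf_text:
        return pdf_text
    if ocr_text:
        logger.warning("No embedded text layer; falling back to OCR output")
        return ocr_text
    raise PdfExtractionError("No text available from PDF layer or OCR")
```

Callers can catch `InvoiceMasterError` at the pipeline boundary instead of bare `except:`.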
## Machine Learning Standards

### Data Management

- Immutable raw data: Never modify `data/raw/`
- Version datasets: Track with checksum and metadata
- Reproducible splits: Use a fixed random seed (42)
- Split ratios: 80% train / 10% val / 10% test
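A deterministic split under these rules can be sketched as follows (the function name is an assumption; the seed and ratios come from the list above):

```python
from __future__ import annotations

import random
from pathlib import Path


def split_dataset(
    items: list[Path],
    seed: int = 42,
    ratios: tuple[float, float, float] = (0.8, 0.1, 0.1),
) -> tuple[list[Path], list[Path], list[Path]]:
    """Deterministic 80/10/10 split: the same seed always yields the same split."""
    # Sort first so the result is independent of the caller's input order.
    shuffled = sorted(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train : n_train + n_val]
    test = shuffled[n_train + n_val :]
    return train, val, test
```

Record the seed and ratios alongside the dataset checksum so any split can be reproduced later.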
### YOLO Training

- Disable flips for text detection (`fliplr=0.0, flipud=0.0`)
- Use early stopping (`patience=20`)
- Enable AMP for faster training (`amp=true`)
- Save checkpoints periodically (`save_period=10`)
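Put together, `configs/training.yaml` might look like the sketch below. The keys are standard Ultralytics train arguments; the model, dataset path, and numeric values other than those listed above are illustrative and should be tuned per experiment:

```yaml
# configs/training.yaml -- illustrative values
model: yolov8s.pt
data: datasets/invoices/data.yaml
epochs: 100
batch: 16
seed: 42
# Flips would mirror characters, so disable them for text detection
fliplr: 0.0
flipud: 0.0
patience: 20      # early stopping
amp: true         # mixed-precision training
save_period: 10   # checkpoint every 10 epochs
```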
### Reproducibility

- Set random seeds: `random`, `numpy`, `torch`
- Enable deterministic mode: `torch.backends.cudnn.deterministic = True`
- Track experiment config: model, epochs, batch_size, learning_rate, dataset_version, git_commit
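The seed-setting steps above can be collected into one helper. The numpy/torch parts are guarded so the sketch also runs where those packages are absent:

```python
from __future__ import annotations

import os
import random


def set_seed(seed: int = 42) -> None:
    """Seed every RNG the pipeline touches."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Deterministic mode trades some speed for reproducibility.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
```

Call `set_seed(42)` once at the start of every training and evaluation run, and log the seed with the rest of the experiment config.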
### Evaluation Metrics
- Precision, Recall, F1 Score
- mAP@0.5, mAP@0.5:0.95
- Per-class AP
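For reference, the precision/recall/F1 triple reduces to a few lines given per-class counts (the function name is an assumption):

```python
from __future__ import annotations


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute detection metrics from true positives, false positives,
    and false negatives, returning 0.0 where a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

For example, 8 correct detections with 2 false positives and 2 misses gives precision, recall, and F1 all equal to 0.8. mAP values come from the Ultralytics validation output rather than hand-rolled code.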
## Testing Standards

### Structure

```
tests/
├── unit/          # Isolated, fast tests
├── integration/   # Multi-module tests
├── e2e/           # End-to-end workflow tests
├── fixtures/      # Test data
└── conftest.py    # Shared fixtures
```
### Practices

- Follow the AAA pattern: Arrange, Act, Assert
- Use parametrized tests for multiple inputs
- Use fixtures for shared setup
- Use mocking for external dependencies
- Mark slow tests with `@pytest.mark.slow`
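An AAA-style parametrized test might look like this. The `normalize_bankgiro` function and its inputs are hypothetical stand-ins for whatever actually lives in `src/normalize/`:

```python
import pytest


def normalize_bankgiro(raw: str) -> str:
    """Hypothetical function under test: keep only the digits
    of a Bankgiro number."""
    return "".join(ch for ch in raw if ch.isdigit())


@pytest.mark.parametrize(
    "raw, expected",
    [
        ("5050-1055", "50501055"),
        ("BG 5050-1055", "50501055"),
        ("50501055", "50501055"),
    ],
)
def test_normalize_bankgiro(raw: str, expected: str) -> None:
    # Arrange: inputs supplied by parametrization above.
    # Act:
    result = normalize_bankgiro(raw)
    # Assert:
    assert result == expected
```

One parametrized test covers several inputs while keeping each case visible in the failure report.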
## Performance

- Parallel processing: Use `ProcessPoolExecutor` with progress tracking
- Lazy loading: Use `@cached_property` for expensive resources
- Generators: Use for large datasets to save memory
- Batch processing: Process items in batches when possible
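The lazy-loading and batching points can be sketched together; `OcrEngine` and `batched` are illustrative names, not the project's API:

```python
from __future__ import annotations

from functools import cached_property
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items without materialising the input,
    so arbitrarily large datasets stream through in constant memory."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


class OcrEngine:
    """Lazy loading: the heavy model is built on first access, then cached."""

    @cached_property
    def model(self) -> object:
        # Placeholder for constructing e.g. a PaddleOCR instance.
        return object()
```

`batched(pages, 16)` pairs naturally with `ProcessPoolExecutor.map` for parallel per-batch processing.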
## Security

- Never commit: credentials, API keys, `.env` files
- Use environment variables for sensitive config
- Validate paths: Prevent path traversal attacks
- Validate inputs: At system boundaries
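Path validation can be done by resolving against a trusted base directory; the helper name is an assumption, and `Path.is_relative_to` requires Python 3.9+ (satisfied by the 3.10+ requirement above):

```python
from __future__ import annotations

from pathlib import Path


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Resolve a user-supplied relative path and reject anything that
    escapes base_dir (e.g. via '..' components or symlink-free tricks)."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"Path escapes {base}: {user_path!r}")
    return candidate
```

Apply this at every boundary that accepts file names from outside the system, such as CLI arguments or uploaded invoice paths.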
## Commands
| Task | Command |
|---|---|
| Run autolabel | python run_autolabel.py |
| Train YOLO | python -m src.cli.train --config configs/training.yaml |
| Run inference | python -m src.cli.infer --model models/best.pt |
| Run tests | pytest tests/ -v |
| Coverage | pytest tests/ --cov=src --cov-report=html |
| Format | black src/ tests/ |
| Lint | ruff check src/ tests/ --fix |
| Type check | mypy src/ |
## DO NOT

- Hardcode file paths or magic numbers
- Use `print()` for logging
- Skip type hints on public APIs
- Write functions longer than 50 lines
- Mix business logic with I/O
- Commit credentials or `.env` files
- Use `# type: ignore` without explanation
- Use mutable default arguments
- Catch bare `except:`
- Use flip augmentation for text detection
## DO
- Use type hints everywhere
- Write descriptive docstrings
- Log with appropriate levels
- Use dataclasses for data structures
- Use enums for constants
- Use Protocol for interfaces
- Set random seeds for reproducibility
- Track experiment configurations
- Use context managers for resources
- Validate inputs at boundaries