WOP
216
claude.md
Normal file
@@ -0,0 +1,216 @@

# Claude Code Instructions - Invoice Master POC v2

## Environment Requirements

> **IMPORTANT**: This project MUST run in a **WSL + Conda** environment.

| Requirement | Details |
|-------------|---------|
| **WSL** | WSL 2 with Ubuntu 22.04+ |
| **Conda** | Miniconda or Anaconda |
| **Python** | 3.10+ (managed by Conda) |
| **GPU** | NVIDIA drivers on Windows + CUDA in WSL |

```bash
# Verify environment before running any commands
uname -a              # Should show "Linux"
conda --version       # Should show conda version
conda activate <env>  # Activate project environment
which python          # Should point to conda environment
```

**All commands must be executed in a WSL terminal with the Conda environment activated.**
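
The same checks can be scripted so that entry points fail fast. The sketch below is an illustration only: `check_environment` is a hypothetical helper, and it assumes that WSL kernels report "microsoft" in their release string and that `conda activate` sets `CONDA_DEFAULT_ENV`.

```python
from __future__ import annotations

import os
import platform


def check_environment(
    system: str | None = None,
    kernel_release: str | None = None,
    conda_env: str | None = None,
) -> list[str]:
    """Return a list of problems; an empty list means every check passed.

    The arguments exist so the checks can be tested without a real WSL host.
    When omitted, the values are read from the running system.
    """
    system = platform.system() if system is None else system
    kernel_release = platform.release() if kernel_release is None else kernel_release
    conda_env = os.environ.get("CONDA_DEFAULT_ENV", "") if conda_env is None else conda_env

    problems: list[str] = []
    if system != "Linux":
        problems.append(f"expected Linux, found {system}")
    # Assumption: WSL kernels carry "microsoft" in the release string.
    if "microsoft" not in kernel_release.lower():
        problems.append("kernel release does not look like WSL")
    if not conda_env:
        problems.append("no Conda environment is active (CONDA_DEFAULT_ENV is unset)")
    return problems
```

Calling `check_environment()` with no arguments inspects the current machine; a CLI entry point could log the returned problems and exit when the list is not empty.
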

---

## Project Overview

**Automated invoice field extraction system** for Swedish PDF invoices:
- **YOLO Object Detection** (YOLOv8/v11) for field region detection
- **PaddleOCR** for text extraction
- **Multi-strategy matching** for field validation

**Stack**: Python 3.10+ | PyTorch | Ultralytics | PaddleOCR | PyMuPDF

**Target Fields**: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount

---

## Architecture Principles

### SOLID
- **Single Responsibility**: Each module handles one concern
- **Open/Closed**: Extend via new strategies rather than by modifying existing code
- **Liskov Substitution**: Use Protocol/ABC for interchangeable components
- **Interface Segregation**: Small, focused interfaces
- **Dependency Inversion**: Depend on abstractions, inject dependencies
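
As a rough illustration of how these principles combine, the sketch below defines a small strategy interface with `typing.Protocol` and injects the strategies into a matcher. All names and signatures here (`MatchStrategy`, `ExactMatch`, `DigitsOnlyMatch`, and this `FieldMatcher` constructor) are hypothetical and are not taken from the project code.

```python
from __future__ import annotations

from typing import Protocol


class MatchStrategy(Protocol):
    """Hypothetical interface: any object with this method is a strategy."""

    def match(self, expected: str, candidate: str) -> bool: ...


class ExactMatch:
    """Matches only when both strings are identical."""

    def match(self, expected: str, candidate: str) -> bool:
        return expected == candidate


class DigitsOnlyMatch:
    """Compares digits only, so "5050-1055" matches "50501055"."""

    def match(self, expected: str, candidate: str) -> bool:
        left = "".join(ch for ch in expected if ch.isdigit())
        right = "".join(ch for ch in candidate if ch.isdigit())
        return left != "" and left == right


class FieldMatcher:
    """Depends on the MatchStrategy abstraction; strategies are injected."""

    def __init__(self, strategies: list[MatchStrategy]) -> None:
        self._strategies = strategies

    def matches(self, expected: str, candidate: str) -> bool:
        # A new strategy can be added without touching this class.
        return any(s.match(expected, candidate) for s in self._strategies)
```

Adding a third strategy only requires a new class with a `match` method; neither the matcher nor the existing strategies change.
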

### Project Structure
```
src/
├── cli/          # Entry points only, no business logic
├── pdf/          # PDF processing (extraction, rendering, detection)
├── ocr/          # OCR engines (PaddleOCR wrapper)
├── normalize/    # Field normalization and validation
├── matcher/      # Multi-strategy field matching
├── yolo/         # YOLO annotation and dataset building
├── inference/    # Inference pipeline
└── data/         # Data loading and reporting
```

### Configuration
- `configs/default.yaml`: All tunable parameters
- `config.py`: Sensitive settings such as credentials; read the values from environment variables instead of storing them in the file
- Never hardcode magic numbers
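
A minimal sketch of reading sensitive settings from the environment follows. The variable names `INVOICE_MASTER_USER` and `INVOICE_MASTER_PASSWORD`, the `Credentials` class, and `load_credentials` are assumptions made for illustration; the real `config.py` may look different.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names, chosen only for this example.
REQUIRED_VARIABLES = ("INVOICE_MASTER_USER", "INVOICE_MASTER_PASSWORD")


@dataclass(frozen=True)
class Credentials:
    username: str
    password: str


def load_credentials(env: Mapping[str, str] | None = None) -> Credentials:
    """Build Credentials from environment variables, failing loudly if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return Credentials(
        username=env["INVOICE_MASTER_USER"],
        password=env["INVOICE_MASTER_PASSWORD"],
    )
```

Passing the mapping as an argument keeps the function testable without touching the real process environment.
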

---

## Python Standards

### Required
- **Type hints** on all public functions (PEP 484/585)
- **Docstrings** in Google style (PEP 257)
- **Dataclasses** for data structures (`frozen=True, slots=True` when immutable)
- **Protocol** for interfaces (PEP 544)
- **Enum** for constants
- **pathlib.Path** instead of string paths

### Naming Conventions
| Type | Convention | Example |
|------|------------|---------|
| Functions/Variables | snake_case | `extract_tokens`, `page_count` |
| Classes | PascalCase | `FieldMatcher`, `AutoLabelReport` |
| Constants | UPPER_SNAKE | `DEFAULT_DPI`, `FIELD_TYPES` |
| Private | _prefix | `_parse_date`, `_cache` |

### Import Order (isort)
1. `from __future__ import annotations`
2. Standard library
3. Third-party
4. Local modules
5. `if TYPE_CHECKING:` block
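
A module header laid out in this order might look like the sketch below. To keep it runnable with the standard library alone, the third-party and local groups are shown only as comments; `iter_pdfs` and the `src.pdf` import in the comment are hypothetical.

```python
from __future__ import annotations

# Standard library
import logging
from pathlib import Path
from typing import TYPE_CHECKING

# Third-party imports would form the next group (for example: import numpy as np).
# Local imports would follow (for example, a hypothetical: from src.pdf import renderer).

if TYPE_CHECKING:
    # Needed for annotations only, so it is skipped at runtime.
    from collections.abc import Iterator

logger = logging.getLogger(__name__)


def iter_pdfs(root: Path) -> Iterator[Path]:
    """Yield the PDF files below root in sorted order."""
    yield from sorted(root.rglob("*.pdf"))
```

Because of the `__future__` import, the `Iterator` annotation is never evaluated at runtime, which is what makes the `TYPE_CHECKING` block safe.
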

### Code Quality Tools
| Tool | Purpose | Config |
|------|---------|--------|
| Black | Formatting | line-length=100 |
| Ruff | Linting | E, F, W, I, N, D, UP, B, C4, SIM, ARG, PTH |
| MyPy | Type checking | strict=true |
| Pytest | Testing | tests/ directory |

---

## Error Handling

- Use **custom exception hierarchy** (base: `InvoiceMasterError`)
- Use **logging** instead of print (`logger = logging.getLogger(__name__)`)
- Implement **graceful degradation** with fallback strategies
- Use **context managers** for resource cleanup
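
The sketch below combines three of these points: a small exception hierarchy rooted at `InvoiceMasterError`, module-level logging, and a fallback loop that degrades gracefully. `ExtractionError` and `extract_with_fallback` are hypothetical names used only for illustration.

```python
from __future__ import annotations

import logging
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


class InvoiceMasterError(Exception):
    """Base class for all project-specific errors."""


class ExtractionError(InvoiceMasterError):
    """Hypothetical subclass: a single extraction strategy failed."""


def extract_with_fallback(
    extractors: Sequence[Callable[[str], str]], text: str
) -> str | None:
    """Try each extractor in order; return the first result, or None if all fail."""
    for extractor in extractors:
        try:
            return extractor(text)
        except ExtractionError as exc:
            # Expected failure: log it and fall through to the next strategy.
            name = getattr(extractor, "__name__", repr(extractor))
            logger.warning("extractor %s failed: %s", name, exc)
    logger.error("all %d extractors failed", len(extractors))
    return None
```

Only `ExtractionError` is caught here, so programming errors such as `TypeError` still surface instead of being hidden by the fallback.
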

---

## Machine Learning Standards

### Data Management
- **Immutable raw data**: Never modify `data/raw/`
- **Version datasets**: Track with checksum and metadata
- **Reproducible splits**: Use fixed random seed (42)
- **Split ratios**: 80% train / 10% val / 10% test
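
One way to get a reproducible 80/10/10 split and a dataset checksum is sketched below. `split_dataset` and `file_checksum` are hypothetical helpers, and SHA-256 is an assumed choice because the text does not name a checksum algorithm.

```python
from __future__ import annotations

import hashlib
import random
from pathlib import Path

SPLIT_SEED = 42
TRAIN_RATIO = 0.8
VAL_RATIO = 0.1  # the remainder becomes the test split


def split_dataset(
    items: list[str], seed: int = SPLIT_SEED
) -> tuple[list[str], list[str], list[str]]:
    """Return (train, val, test) lists that depend only on the items and the seed."""
    shuffled = sorted(items)  # sort first so the input order cannot change the result
    random.Random(seed).shuffle(shuffled)  # private generator, global state untouched
    n_train = int(len(shuffled) * TRAIN_RATIO)
    n_val = int(len(shuffled) * VAL_RATIO)
    return (
        shuffled[:n_train],
        shuffled[n_train : n_train + n_val],
        shuffled[n_train + n_val :],
    )


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the checksum next to the split metadata makes it possible to detect later that a file under `data/raw/` was changed.
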

### YOLO Training
- **Disable flips** for text detection (`fliplr=0.0, flipud=0.0`)
- **Use early stopping** (`patience=20`)
- **Enable AMP** for faster training (`amp=true`)
- **Save checkpoints** periodically (`save_period=10`)
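
These settings can be kept in one place and merged into the training call. The sketch below only builds the argument dictionary, so it runs without Ultralytics installed; `build_train_args` is a hypothetical helper, and the dataset path and weights file in the closing comment are placeholders.

```python
from __future__ import annotations

from typing import Any

# The values listed above, expressed as Ultralytics training arguments.
TEXT_DETECTION_OVERRIDES: dict[str, Any] = {
    "fliplr": 0.0,      # no horizontal flips: mirrored text is not valid text
    "flipud": 0.0,      # no vertical flips
    "patience": 20,     # early stopping
    "amp": True,        # automatic mixed precision
    "save_period": 10,  # checkpoint every 10 epochs
}


def build_train_args(data_yaml: str, epochs: int, **extra: Any) -> dict[str, Any]:
    """Merge project defaults with per-run arguments and keep flips disabled."""
    args: dict[str, Any] = {"data": data_yaml, "epochs": epochs, **TEXT_DETECTION_OVERRIDES}
    args.update(extra)
    if args["fliplr"] != 0.0 or args["flipud"] != 0.0:
        raise ValueError("flip augmentation must stay disabled for text detection")
    return args


# Possible usage, assuming Ultralytics is installed and the paths exist:
#   from ultralytics import YOLO
#   YOLO("yolov8n.pt").train(**build_train_args("path/to/dataset.yaml", epochs=100))
```
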

### Reproducibility
- Set random seeds: `random`, `numpy`, `torch`
- Enable deterministic mode: `torch.backends.cudnn.deterministic = True`
- Track experiment config: model, epochs, batch_size, learning_rate, dataset_version, git_commit
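
A seeding helper along these lines is sketched below. `set_seed` is a hypothetical name. NumPy and PyTorch are imported lazily so the function still runs where they are not installed, and disabling `cudnn.benchmark` is an addition that is commonly paired with deterministic mode rather than something the list above requires.

```python
from __future__ import annotations

import random


def set_seed(seed: int = 42) -> None:
    """Seed every random number generator that is available in this environment."""
    random.seed(seed)

    try:
        import numpy as np
    except ImportError:
        pass  # NumPy not installed: nothing to seed
    else:
        np.random.seed(seed)

    try:
        import torch
    except ImportError:
        pass  # PyTorch not installed: nothing to seed
    else:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # has no effect when CUDA is unavailable
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False  # assumption: avoid autotuned kernels
```

Calling `set_seed()` once at the start of every entry point, before any data loading, keeps runs comparable.
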

### Evaluation Metrics
- Precision, Recall, F1 Score
- mAP@0.5, mAP@0.5:0.95
- Per-class AP
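
For reference, precision, recall, and F1 follow directly from the true positive, false positive, and false negative counts. The helper below is a minimal sketch with a hypothetical name; the mAP figures are normally taken from the detector's own validation output and are not computed here.

```python
from __future__ import annotations


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from detection counts.

    Each metric is defined as 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```
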

---

## Testing Standards

### Structure
```
tests/
├── unit/          # Isolated, fast tests
├── integration/   # Multi-module tests
├── e2e/           # End-to-end workflow tests
├── fixtures/      # Test data
└── conftest.py    # Shared fixtures
```

### Practices
- Follow **AAA pattern**: Arrange, Act, Assert
- Use **parametrized tests** for multiple inputs
- Use **fixtures** for shared setup
- Use **mocking** for external dependencies
- Mark slow tests with `@pytest.mark.slow`
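
The AAA pattern is easiest to see in a small test. In the sketch below, `normalize_amount` is a hypothetical helper written only to give the test something to exercise; the test is a plain function, so pytest would collect it without any extra imports.

```python
from __future__ import annotations


def normalize_amount(raw: str) -> float:
    """Hypothetical helper: parse a Swedish-formatted amount such as "1 234,50"."""
    cleaned = raw.replace("\u00a0", "").replace(" ", "").replace(",", ".")
    return float(cleaned)


def test_normalize_amount_handles_thousands_separator() -> None:
    # Arrange: an amount with a space as thousands separator and a decimal comma
    raw = "1 234,50"

    # Act
    result = normalize_amount(raw)

    # Assert
    assert result == 1234.50
```

Keeping the three phases visually separate makes it obvious which line is the behavior under test when a failure is reported.
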

---

## Performance

- **Parallel processing**: Use `ProcessPoolExecutor` with progress tracking
- **Lazy loading**: Use `@cached_property` for expensive resources
- **Generators**: Use for large datasets to save memory
- **Batch processing**: Process items in batches when possible
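
Three of these techniques fit in a short sketch: a generator that yields fixed-size batches, and a resource that is loaded lazily through `functools.cached_property`. `batched` and `OcrEngine` are hypothetical names, and the string returned by `model` stands in for an expensive model load.

```python
from __future__ import annotations

from functools import cached_property
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of at most `size` items without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


class OcrEngine:
    """Hypothetical wrapper whose model is loaded on first access only."""

    load_count = 0  # counts how often the expensive load actually ran

    @cached_property
    def model(self) -> str:
        OcrEngine.load_count += 1
        return "loaded-model"  # stand-in for an expensive load
```

Because `batched` is a generator, it also works on inputs that are themselves generators, such as a lazy directory scan.
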

---

## Security

- **Never commit**: credentials, API keys, `.env` files
- **Use environment variables** for sensitive config
- **Validate paths**: Prevent path traversal attacks
- **Validate inputs**: At system boundaries
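
Path validation can be done with `pathlib` alone by resolving the candidate and checking that it stays below the allowed base directory. `resolve_inside` is a hypothetical helper, and `Path.is_relative_to` requires Python 3.9 or newer.

```python
from __future__ import annotations

from pathlib import Path


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path below base_dir, rejecting anything that escapes it."""
    base = base_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes {base}: {user_path}")
    return candidate
```

An absolute `user_path` replaces the base when joined, so it is rejected by the same check unless it already points inside the base directory.
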

---

## Commands

| Task | Command |
|------|---------|
| Run autolabel | `python run_autolabel.py` |
| Train YOLO | `python -m src.cli.train --config configs/training.yaml` |
| Run inference | `python -m src.cli.infer --model models/best.pt` |
| Run tests | `pytest tests/ -v` |
| Coverage | `pytest tests/ --cov=src --cov-report=html` |
| Format | `black src/ tests/` |
| Lint | `ruff check src/ tests/ --fix` |
| Type check | `mypy src/` |

---

## DO NOT

- Hardcode file paths or magic numbers
- Use `print()` for logging
- Skip type hints on public APIs
- Write functions longer than 50 lines
- Mix business logic with I/O
- Commit credentials or `.env` files
- Use `# type: ignore` without explanation
- Use mutable default arguments
- Catch bare `except:`
- Use flip augmentation for text detection
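
Two of these rules are easy to get wrong, so a brief sketch of the safe forms may help. Both functions are hypothetical examples: the first replaces a mutable default argument with `None`, and the second catches a named exception instead of using a bare `except:`.

```python
from __future__ import annotations


def add_token(token: str, tokens: list[str] | None = None) -> list[str]:
    """Append to a fresh list per call instead of a shared default list."""
    tokens = [] if tokens is None else tokens
    tokens.append(token)
    return tokens


def parse_int(text: str) -> int | None:
    """Return the parsed integer, or None when the text is not a number."""
    try:
        return int(text)
    except ValueError:  # a named exception, so KeyboardInterrupt and bugs still surface
        return None
```

With a default of `[]`, the second call to `add_token("b")` would return a list that still holds the token from the first call, because the default object is created once at definition time.
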

## DO

- Use type hints everywhere
- Write descriptive docstrings
- Log with appropriate levels
- Use dataclasses for data structures
- Use enums for constants
- Use Protocol for interfaces
- Set random seeds for reproducibility
- Track experiment configurations
- Use context managers for resources
- Validate inputs at boundaries