Add claude config

2026-01-25 16:17:23 +01:00
parent e599424a92
commit d5101e3604
40 changed files with 5559 additions and 1378 deletions
--- a/.claude/skills/tdd-workflow/SKILL.md
+++ b/.claude/skills/tdd-workflow/SKILL.md
@@ -0,0 +1,553 @@
+---
+name: tdd-workflow
+description: Use this skill when writing new features, fixing bugs, or refactoring code. Enforces test-driven development with 80%+ coverage including unit, integration, and E2E tests.
+---
+
+# Test-Driven Development Workflow
+
+TDD principles for Python/FastAPI development with pytest.
+
+## When to Activate
+
+- Writing new features or functionality
+- Fixing bugs or issues
+- Refactoring existing code
+- Adding API endpoints
+- Creating new field extractors or normalizers
+
+## Core Principles
+
+### 1. Tests BEFORE Code
+ALWAYS write tests first, then implement code to make tests pass.
+
+### 2. Coverage Requirements
+- Minimum 80% coverage (unit + integration + E2E)
+- All edge cases covered
+- Error scenarios tested
+- Boundary conditions verified
+
+### 3. Test Types
+
+#### Unit Tests
+- Individual functions and utilities
+- Normalizers and validators
+- Parsers and extractors
+- Pure functions
+
+#### Integration Tests
+- API endpoints
+- Database operations
+- OCR + YOLO pipeline
+- Service interactions
+
+#### E2E Tests
+- Complete inference pipeline
+- PDF → Fields workflow
+- API health and inference endpoints
+
+## TDD Workflow Steps
+
+### Step 1: Write User Journeys
+```
+As a [role], I want to [action], so that [benefit]
+
+Example:
+As an invoice processor, I want to extract Bankgiro from payment_line,
+so that I can cross-validate OCR results.
+```
+
+### Step 2: Generate Test Cases
+For each user journey, create comprehensive test cases:
+
+```python
+import pytest
+
+class TestPaymentLineParser:
+    """Tests for payment_line parsing and field extraction."""
+
+    def test_parse_payment_line_extracts_bankgiro(self):
+        """Should extract Bankgiro from valid payment line."""
+        # Test implementation
+        pass
+
+    def test_parse_payment_line_handles_missing_checksum(self):
+        """Should handle payment lines without checksum."""
+        pass
+
+    def test_parse_payment_line_validates_checksum(self):
+        """Should validate checksum when present."""
+        pass
+
+    def test_parse_payment_line_returns_none_for_invalid(self):
+        """Should return None for invalid payment lines."""
+        pass
+```
+
+### Step 3: Run Tests (They Should Fail)
+```bash
+pytest tests/test_ocr/test_machine_code_parser.py -v
+# Tests should fail: nothing is implemented yet
+```
+
+### Step 4: Implement Code
+Write minimal code to make tests pass:
+
+```python
+def parse_payment_line(line: str) -> PaymentLineData | None:
+    """Parse Swedish payment line and extract fields."""
+    # Implementation guided by tests
+    pass
+```
+
+### Step 5: Run Tests Again
+```bash
+pytest tests/test_ocr/test_machine_code_parser.py -v
+# Tests should now pass
+```
+
+### Step 6: Refactor
+Improve code quality while keeping tests green:
+- Remove duplication
+- Improve naming
+- Optimize performance
+- Enhance readability
+
+### Step 7: Verify Coverage
+```bash
+pytest --cov=src --cov-report=term-missing
+# Verify 80%+ coverage achieved
+```
+
+## Testing Patterns
+
+### Unit Test Pattern (pytest)
+```python
+import pytest
+from src.normalize.bankgiro_normalizer import normalize_bankgiro
+
+class TestBankgiroNormalizer:
+    """Tests for Bankgiro normalization."""
+
+    def test_normalize_removes_hyphens(self):
+        """Should remove hyphens from Bankgiro."""
+        result = normalize_bankgiro("123-4566")
+        assert result == "1234566"
+
+    def test_normalize_removes_spaces(self):
+        """Should remove spaces from Bankgiro."""
+        result = normalize_bankgiro("123 4566")
+        assert result == "1234566"
+
+    def test_normalize_validates_length(self):
+        """Should validate Bankgiro is 7-8 digits."""
+        result = normalize_bankgiro("123456")  # 6 digits
+        assert result is None
+
+    def test_normalize_validates_checksum(self):
+        """Should validate Luhn checksum."""
+        result = normalize_bankgiro("1234568")  # Invalid checksum (Luhn sum is 32)
+        assert result is None
+
+    # Valid inputs use 1234566, which passes the Luhn check (sum is 30)
+    @pytest.mark.parametrize("input_value,expected", [
+        ("123-4566", "1234566"),
+        ("1234566", "1234566"),
+        ("123 4566", "1234566"),
+        ("BG 123-4566", "1234566"),
+    ])
+    def test_normalize_various_formats(self, input_value, expected):
+        """Should handle various input formats."""
+        result = normalize_bankgiro(input_value)
+        assert result == expected
+```
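The checksum test above presumes a Luhn (mod-10) check. A minimal sketch of such a check follows; `luhn_is_valid` is a hypothetical helper name that does not appear in this commit, and the project's `normalize_bankgiro` may well be implemented differently.

```python
def luhn_is_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn (mod-10) check.

    Hypothetical helper for illustration only.
    """
    # Reject empty input and anything that is not plain ASCII digits.
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9  # equivalent to adding the two digits of the product
        total += value
    return total % 10 == 0
```

With this helper, `luhn_is_valid("1234566")` is true and `luhn_is_valid("1234568")` is false, so a normalizer built on it would accept the first number and return `None` for the second.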
+
+### API Integration Test Pattern
+```python
+import pytest
+from fastapi.testclient import TestClient
+from src.web.app import app
+
+@pytest.fixture
+def client():
+    return TestClient(app)
+
+class TestHealthEndpoint:
+    """Tests for /api/v1/health endpoint."""
+
+    def test_health_returns_200(self, client):
+        """Should return 200 OK."""
+        response = client.get("/api/v1/health")
+        assert response.status_code == 200
+
+    def test_health_returns_status(self, client):
+        """Should return health status."""
+        response = client.get("/api/v1/health")
+        data = response.json()
+        assert data["status"] == "healthy"
+        assert "model_loaded" in data
+
+class TestInferEndpoint:
+    """Tests for /api/v1/infer endpoint."""
+
+    def test_infer_requires_file(self, client):
+        """Should require file upload."""
+        response = client.post("/api/v1/infer")
+        assert response.status_code == 422
+
+    def test_infer_rejects_non_pdf(self, client):
+        """Should reject non-PDF files."""
+        response = client.post(
+            "/api/v1/infer",
+            files={"file": ("test.txt", b"not a pdf", "text/plain")}
+        )
+        assert response.status_code == 400
+
+    def test_infer_returns_fields(self, client, sample_invoice_pdf):
+        """Should return extracted fields."""
+        with open(sample_invoice_pdf, "rb") as f:
+            response = client.post(
+                "/api/v1/infer",
+                files={"file": ("invoice.pdf", f, "application/pdf")}
+            )
+        assert response.status_code == 200
+        data = response.json()
+        assert data["success"] is True
+        assert "fields" in data
+```
+
+### E2E Test Pattern
+```python
+import pytest
+import httpx
+from pathlib import Path
+
+@pytest.fixture(scope="module")
+def running_server():
+    """Base URL of the server under test; it must already be running."""
+    # This fixture does not start the server: start it before running E2E tests
+    base_url = "http://localhost:8000"
+    yield base_url
+
+class TestInferencePipeline:
+    """E2E tests for complete inference pipeline."""
+
+    def test_health_check(self, running_server):
+        """Should pass health check."""
+        response = httpx.get(f"{running_server}/api/v1/health")
+        assert response.status_code == 200
+        data = response.json()
+        assert data["status"] == "healthy"
+        assert data["model_loaded"] is True
+
+    def test_pdf_inference_returns_fields(self, running_server):
+        """Should extract fields from PDF."""
+        pdf_path = Path("tests/fixtures/sample_invoice.pdf")
+        with open(pdf_path, "rb") as f:
+            response = httpx.post(
+                f"{running_server}/api/v1/infer",
+                files={"file": ("invoice.pdf", f, "application/pdf")}
+            )
+
+        assert response.status_code == 200
+        data = response.json()
+        assert data["success"] is True
+        assert "fields" in data
+        assert len(data["fields"]) > 0
+
+    def test_cross_validation_included(self, running_server):
+        """Should include cross-validation for invoices with payment_line."""
+        pdf_path = Path("tests/fixtures/invoice_with_payment_line.pdf")
+        with open(pdf_path, "rb") as f:
+            response = httpx.post(
+                f"{running_server}/api/v1/infer",
+                files={"file": ("invoice.pdf", f, "application/pdf")}
+            )
+        assert response.status_code == 200
+        data = response.json()
+        assert data["fields"].get("payment_line"), "fixture PDF should yield a payment_line"
+        assert "cross_validation" in data
+```
+
+## Test File Organization
+
+```
+tests/
+├── conftest.py                    # Shared fixtures
+├── fixtures/                      # Test data files
+│   ├── sample_invoice.pdf
+│   └── invoice_with_payment_line.pdf
+├── test_cli/
+│   └── test_infer.py
+├── test_pdf/
+│   ├── test_extractor.py
+│   └── test_renderer.py
+├── test_ocr/
+│   ├── test_paddle_ocr.py
+│   └── test_machine_code_parser.py
+├── test_inference/
+│   ├── test_pipeline.py
+│   ├── test_yolo_detector.py
+│   └── test_field_extractor.py
+├── test_normalize/
+│   ├── test_bankgiro_normalizer.py
+│   ├── test_date_normalizer.py
+│   └── test_amount_normalizer.py
+├── test_web/
+│   ├── test_routes.py
+│   └── test_services.py
+└── e2e/
+    └── test_inference_e2e.py
+```
+
+## Mocking External Services
+
+### Mock PaddleOCR
+```python
+import pytest
+from unittest.mock import Mock, patch
+
+@pytest.fixture
+def mock_paddle_ocr():
+    """Mock PaddleOCR for unit tests."""
+    with patch("src.ocr.paddle_ocr.PaddleOCR") as mock:
+        instance = Mock()
+        instance.ocr.return_value = [
+            [
+                [[[0, 0], [100, 0], [100, 20], [0, 20]], ("Invoice Number", 0.95)],
+                [[[0, 30], [100, 30], [100, 50], [0, 50]], ("INV-2024-001", 0.98)]
+            ]
+        ]
+        mock.return_value = instance
+        yield instance
+```
+
+### Mock YOLO Model
+```python
+@pytest.fixture
+def mock_yolo_model():
+    """Mock YOLO model for unit tests."""
+    with patch("src.inference.yolo_detector.YOLO") as mock:
+        instance = Mock()
+        # Mock detection results
+        instance.return_value = Mock(
+            boxes=Mock(
+                xyxy=[[10, 20, 100, 50]],
+                conf=[0.95],
+                cls=[0]  # invoice_number class
+            )
+        )
+        mock.return_value = instance
+        yield instance
+```
+
+### Mock Database
+```python
+@pytest.fixture
+def mock_db_connection():
+    """Mock database connection for unit tests."""
+    with patch("src.data.db.get_db_connection") as mock:
+        conn = Mock()
+        cursor = Mock()
+        cursor.fetchall.return_value = [
+            ("doc-123", "processed", {"invoice_number": "INV-001"})
+        ]
+        cursor.fetchone.return_value = ("doc-123",)
+        conn.cursor.return_value.__enter__ = Mock(return_value=cursor)
+        conn.cursor.return_value.__exit__ = Mock(return_value=False)
+        mock.return_value.__enter__ = Mock(return_value=conn)
+        mock.return_value.__exit__ = Mock(return_value=False)
+        yield conn
+```
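As a possible simplification of the database mock above: `MagicMock` preconfigures `__enter__` and `__exit__`, so the nested context managers can be reached through `return_value` chains instead of hand-built `Mock` objects. The sketch below is self-contained; `fetch_document_ids`, the `documents` table, and the injected `get_connection` callable are assumptions made for illustration and are not taken from the project.

```python
from unittest.mock import MagicMock


def fetch_document_ids(get_connection):
    """Hypothetical code under test: reads ids through nested context managers."""
    with get_connection() as conn:
        with conn.cursor() as cursor:
            cursor.execute("SELECT id FROM documents")
            return [row[0] for row in cursor.fetchall()]


def test_fetch_document_ids_uses_cursor():
    # MagicMock supports the context-manager protocol out of the box,
    # so each `with` target is just __enter__.return_value.
    get_connection = MagicMock()
    conn = get_connection.return_value.__enter__.return_value
    cursor = conn.cursor.return_value.__enter__.return_value
    cursor.fetchall.return_value = [("doc-123",), ("doc-456",)]

    assert fetch_document_ids(get_connection) == ["doc-123", "doc-456"]
    cursor.execute.assert_called_once_with("SELECT id FROM documents")
```

Passing the connection factory in as an argument keeps the example free of `patch`; in the project, the same `MagicMock` chain could be attached to whatever target `patch` replaces.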
+
+## Test Coverage Verification
+
+### Run Coverage Report
+```bash
+# Run with coverage
+pytest --cov=src --cov-report=term-missing
+
+# Generate HTML report
+pytest --cov=src --cov-report=html
+# Open htmlcov/index.html in browser
+```
+
+### Coverage Configuration (pyproject.toml)
+```toml
+[tool.coverage.run]
+source = ["src"]
+omit = ["*/__init__.py", "*/test_*.py"]
+
+[tool.coverage.report]
+fail_under = 80
+show_missing = true
+exclude_lines = [
+    "pragma: no cover",
+    "if TYPE_CHECKING:",
+    "raise NotImplementedError",
+]
+```
+
+## Common Testing Mistakes to Avoid
+
+### WRONG: Testing Implementation Details
+```python
+# Don't test internal state
+def test_parser_internal_state():
+    parser = PaymentLineParser()
+    parser._parse("...")
+    assert parser._groups == [...]  # Internal state
+```
+
+### CORRECT: Test Public Interface
+```python
+# Test what users see
+def test_parser_extracts_bankgiro():
+    result = parse_payment_line("...")
+    assert result.bankgiro == "1234567"
+```
+
+### WRONG: No Test Isolation
+```python
+# Tests depend on each other
+class TestDocuments:
+    def test_creates_document(self):
+        create_document(...)  # Creates in DB
+
+    def test_updates_document(self):
+        update_document(...)  # Depends on previous test
+```
+
+### CORRECT: Independent Tests
+```python
+# Each test sets up its own data
+class TestDocuments:
+    def test_creates_document(self, mock_db):
+        result = create_document(...)
+        assert result.id is not None
+
+    def test_updates_document(self, mock_db):
+        # Create own test data
+        doc = create_document(...)
+        result = update_document(doc.id, ...)
+        assert result.status == "updated"
+```
+
+### WRONG: Testing Too Much
+```python
+# One test doing everything
+def test_full_invoice_processing():
+    # Load PDF
+    # Extract images
+    # Run YOLO
+    # Run OCR
+    # Normalize fields
+    # Save to DB
+    # Return response
+```
+
+### CORRECT: Focused Tests
+```python
+def test_yolo_detects_invoice_number():
+    """Test only YOLO detection."""
+    result = detector.detect(image)
+    assert any(d.label == "invoice_number" for d in result)
+
+def test_ocr_extracts_text():
+    """Test only OCR extraction."""
+    result = ocr.extract(image, bbox)
+    assert result == "INV-2024-001"
+
+def test_normalizer_formats_date():
+    """Test only date normalization."""
+    result = normalize_date("2024-01-15")
+    assert result == "2024-01-15"
+```
+
+## Fixtures (conftest.py)
+
+```python
+import pytest
+from pathlib import Path
+from fastapi.testclient import TestClient
+
+@pytest.fixture
+def sample_invoice_pdf(tmp_path: Path) -> Path:
+    """Copy the sample invoice PDF into a per-test temporary directory."""
+    src = Path("tests/fixtures/sample_invoice.pdf")
+    if not src.exists():
+        pytest.fail(f"Missing test fixture: {src}")
+    pdf_path = tmp_path / "invoice.pdf"
+    pdf_path.write_bytes(src.read_bytes())
+    return pdf_path
+
+@pytest.fixture
+def client():
+    """FastAPI test client."""
+    from src.web.app import app
+    return TestClient(app)
+
+@pytest.fixture
+def sample_payment_line() -> str:
+    """Sample Swedish payment line for testing."""
+    return "1234567#0000000012345#230115#00012345678901234567#1"
+```
+
+## Continuous Testing
+
+### Watch Mode During Development
+```bash
+# Using pytest-watch
+ptw -- tests/test_ocr/
+# Tests run automatically on file changes
+```
+
+### Pre-Commit Hook
+```yaml
+# .pre-commit-config.yaml
+repos:
+  - repo: local
+    hooks:
+      - id: pytest
+        name: pytest
+        entry: pytest --tb=short -q
+        language: system
+        pass_filenames: false
+        always_run: true
+```
+
+### CI/CD Integration (GitHub Actions)
+```yaml
+- name: Run Tests
+  run: |
+    pytest --cov=src --cov-report=xml
+
+- name: Upload Coverage
+  uses: codecov/codecov-action@v3
+  with:
+    file: coverage.xml
+```
+
+## Best Practices
+
+1. **Write Tests First** - Always TDD
+2. **One Behavior Per Test** - Several asserts are fine when they verify the same behavior
+3. **Descriptive Test Names** - `test_<what>_<condition>_<expected>`
+4. **Arrange-Act-Assert** - Clear test structure
+5. **Mock External Dependencies** - Isolate unit tests
+6. **Test Edge Cases** - None, empty, invalid, boundary
+7. **Test Error Paths** - Not just happy paths
+8. **Keep Tests Fast** - Unit tests < 50ms each
+9. **Clean Up After Tests** - Use fixtures with cleanup
+10. **Review Coverage Reports** - Identify gaps
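Practices 3, 4, and 6 can be sketched together as follows. `digits_only` is a hypothetical helper invented for this illustration and is not part of the project; the test names follow the `test_<what>_<condition>_<expected>` pattern loosely.

```python
from typing import Optional


def digits_only(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: keep only the digits, or return None if there are none."""
    if not raw:
        return None
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits or None


def test_digits_only_with_separators_returns_digits():
    # Arrange: input containing a prefix, a space, and a hyphen
    raw = "BG 123-4566"
    # Act: call the function under test exactly once
    result = digits_only(raw)
    # Assert: check the single behavior this test is about
    assert result == "1234566"


def test_digits_only_without_digits_returns_none():
    # Edge cases: None, empty, whitespace only, and text with no digits
    for raw in (None, "", "   ", "BG"):
        assert digits_only(raw) is None
```

The loop keeps the sketch free of third-party imports; in a pytest suite the edge cases would more naturally be expressed with `@pytest.mark.parametrize`, as in the unit test pattern earlier in this file.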
+
+## Success Metrics
+
+- 80%+ code coverage achieved
+- All tests passing (green)
+- No skipped or disabled tests
+- Fast test execution (< 60s for unit tests)
+- E2E tests cover critical inference flow
+- Tests catch bugs before production
+
+---
+
+**Remember**: Tests are not optional. They are the safety net that enables confident refactoring, rapid development, and production reliability.