invoice-master-poc-v2/.claude/skills/coding-standards/SKILL.md

---
name: coding-standards
description: Universal coding standards, best practices, and patterns for Python, FastAPI, and data processing development.
---

# Coding Standards & Best Practices

Python coding standards for the Invoice Master project.

## Code Quality Principles

### 1. Readability First
- Code is read more than written
- Clear variable and function names
- Self-documenting code preferred over comments
- Consistent formatting (follow PEP 8)

### 2. KISS (Keep It Simple, Stupid)
- Simplest solution that works
- Avoid over-engineering
- No premature optimization
- Easy to understand > clever code

### 3. DRY (Don't Repeat Yourself)
- Extract common logic into functions
- Create reusable utilities
- Share modules across the codebase
- Avoid copy-paste programming

### 4. YAGNI (You Aren't Gonna Need It)
- Don't build features before they're needed
- Avoid speculative generality
- Add complexity only when required
- Start simple, refactor when needed

## Python Standards

### Variable Naming

```python
# GOOD: Descriptive names
invoice_number = "INV-2024-001"
is_valid_document = True
total_confidence_score = 0.95

# BAD: Unclear names
inv = "INV-2024-001"
flag = True
x = 0.95
```

### Function Naming

```python
# GOOD: Verb-noun pattern with type hints
def extract_invoice_fields(pdf_path: Path) -> dict[str, str]:
    """Extract fields from invoice PDF."""
    ...

def calculate_confidence(predictions: list[float]) -> float:
    """Calculate average confidence score."""
    ...

def is_valid_bankgiro(value: str) -> bool:
    """Check if value is valid Bankgiro number."""
    ...

# BAD: Unclear or noun-only
def invoice(path):
    ...

def confidence(p):
    ...

def bankgiro(v):
    ...
```

### Type Hints (REQUIRED)

```python
# GOOD: Full type annotations
from typing import Optional
from pathlib import Path
from dataclasses import dataclass

@dataclass
class InferenceResult:
    document_id: str
    fields: dict[str, str]
    confidence: dict[str, float]
    processing_time_ms: float

def process_document(
    pdf_path: Path,
    confidence_threshold: float = 0.5
) -> InferenceResult:
    """Process PDF and return extracted fields."""
    ...

# BAD: No type hints
def process_document(pdf_path, confidence_threshold=0.5):
    ...
```

### Immutability Pattern (CRITICAL)

```python
# GOOD: Create new objects, don't mutate
def update_fields(fields: dict[str, str], updates: dict[str, str]) -> dict[str, str]:
    return {**fields, **updates}

def add_item(items: list[str], new_item: str) -> list[str]:
    return [*items, new_item]

# BAD: Direct mutation
def update_fields(fields: dict[str, str], updates: dict[str, str]) -> dict[str, str]:
    fields.update(updates)  # MUTATION!
    return fields

def add_item(items: list[str], new_item: str) -> list[str]:
    items.append(new_item)  # MUTATION!
    return items
```

### Error Handling

```python
import logging

logger = logging.getLogger(__name__)

# GOOD: Comprehensive error handling with logging
def load_model(model_path: Path) -> Model:
    """Load YOLO model from path."""
    try:
        if not model_path.exists():
            raise FileNotFoundError(f"Model not found: {model_path}")

        model = YOLO(str(model_path))
        logger.info(f"Model loaded: {model_path}")
        return model
    except Exception as e:
        logger.error(f"Failed to load model: {e}")
        raise RuntimeError(f"Model loading failed: {model_path}") from e

# BAD: No error handling
def load_model(model_path):
    return YOLO(str(model_path))

# BAD: Bare except
def load_model(model_path):
    try:
        return YOLO(str(model_path))
    except:  # Never use bare except!
        return None
```

### Async Best Practices

```python
import asyncio

# GOOD: Parallel execution when possible
async def process_batch(pdf_paths: list[Path]) -> list[InferenceResult]:
    tasks = [process_document(path) for path in pdf_paths]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Handle exceptions
    valid_results = []
    for path, result in zip(pdf_paths, results):
        if isinstance(result, Exception):
            logger.error(f"Failed to process {path}: {result}")
        else:
            valid_results.append(result)
    return valid_results

# BAD: Sequential when unnecessary
async def process_batch(pdf_paths: list[Path]) -> list[InferenceResult]:
    results = []
    for path in pdf_paths:
        result = await process_document(path)
        results.append(result)
    return results
```

### Context Managers

```python
from contextlib import contextmanager
from pathlib import Path
import tempfile

# GOOD: Proper resource management
@contextmanager
def temp_pdf_copy(pdf_path: Path):
    """Create temporary copy of PDF for processing."""
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_path.read_bytes())
        tmp_path = Path(tmp.name)
    try:
        yield tmp_path
    finally:
        tmp_path.unlink(missing_ok=True)

# Usage
with temp_pdf_copy(original_pdf) as tmp_pdf:
    result = process_pdf(tmp_pdf)
```

## FastAPI Best Practices

### Route Structure

```python
from fastapi import APIRouter, HTTPException, Depends, Query, File, UploadFile
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1", tags=["inference"])

class InferenceResponse(BaseModel):
    success: bool
    document_id: str
    fields: dict[str, str]
    confidence: dict[str, float]
    processing_time_ms: float

@router.post("/infer", response_model=InferenceResponse)
async def infer_document(
    file: UploadFile = File(...),
    confidence_threshold: float = Query(0.5, ge=0.0, le=1.0)
) -> InferenceResponse:
    """Process invoice PDF and extract fields."""
    if not file.filename.endswith(".pdf"):
        raise HTTPException(status_code=400, detail="Only PDF files accepted")

    result = await inference_service.process(file, confidence_threshold)
    return InferenceResponse(
        success=True,
        document_id=result.document_id,
        fields=result.fields,
        confidence=result.confidence,
        processing_time_ms=result.processing_time_ms
    )
```

### Input Validation with Pydantic

```python
from pydantic import BaseModel, Field, field_validator
from datetime import date
import re

class InvoiceData(BaseModel):
    invoice_number: str = Field(..., min_length=1, max_length=50)
    invoice_date: date
    amount: float = Field(..., gt=0)
    bankgiro: str | None = None
    ocr_number: str | None = None

    @field_validator("bankgiro")
    @classmethod
    def validate_bankgiro(cls, v: str | None) -> str | None:
        if v is None:
            return None
        # Bankgiro: 7-8 digits
        cleaned = re.sub(r"[^0-9]", "", v)
        if not (7 <= len(cleaned) <= 8):
            raise ValueError("Bankgiro must be 7-8 digits")
        return cleaned

    @field_validator("ocr_number")
    @classmethod
    def validate_ocr(cls, v: str | None) -> str | None:
        if v is None:
            return None
        # OCR: 2-25 digits
        cleaned = re.sub(r"[^0-9]", "", v)
        if not (2 <= len(cleaned) <= 25):
            raise ValueError("OCR must be 2-25 digits")
        return cleaned
```

### Response Format

```python
from pydantic import BaseModel
from typing import Generic, TypeVar

T = TypeVar("T")

class ApiResponse(BaseModel, Generic[T]):
    success: bool
    data: T | None = None
    error: str | None = None
    meta: dict | None = None

# Success response
return ApiResponse(
    success=True,
    data=result,
    meta={"processing_time_ms": elapsed_ms}
)

# Error response
return ApiResponse(
    success=False,
    error="Invalid PDF format"
)
```

## File Organization

### Project Structure

```
src/
├── cli/                  # Command-line interfaces
│   ├── autolabel.py
│   ├── train.py
│   └── infer.py
├── pdf/                  # PDF processing
│   ├── extractor.py
│   └── renderer.py
├── ocr/                  # OCR processing
│   ├── paddle_ocr.py
│   └── machine_code_parser.py
├── inference/            # Inference pipeline
│   ├── pipeline.py
│   ├── yolo_detector.py
│   └── field_extractor.py
├── normalize/            # Field normalization
│   ├── base.py
│   ├── date_normalizer.py
│   └── amount_normalizer.py
├── web/                  # FastAPI application
│   ├── app.py
│   ├── routes.py
│   ├── services.py
│   └── schemas.py
└── utils/                # Shared utilities
    ├── validators.py
    ├── text_cleaner.py
    └── logging.py
tests/                    # Mirror of src structure
    ├── test_pdf/
    ├── test_ocr/
    └── test_inference/
```

### File Naming

```
src/ocr/paddle_ocr.py           # snake_case for modules
src/inference/yolo_detector.py  # snake_case for modules
tests/test_paddle_ocr.py        # test_ prefix for tests
config.py                       # snake_case for config
```

### Module Size Guidelines

- **Maximum**: 800 lines per file
- **Typical**: 200-400 lines per file
- **Functions**: Max 50 lines each
- Extract utilities when modules grow too large

## Comments & Documentation

### When to Comment

```python
# GOOD: Explain WHY, not WHAT
# Swedish Bankgiro uses Luhn algorithm with weight [1,2,1,2...]
def validate_bankgiro_checksum(bankgiro: str) -> bool:
    ...

# Payment line format: 7 groups separated by #, checksum at end
def parse_payment_line(line: str) -> PaymentLineData:
    ...

# BAD: Stating the obvious
# Increment counter by 1
count += 1

# Set name to user's name
name = user.name
```

### Docstrings for Public APIs

```python
def extract_invoice_fields(
    pdf_path: Path,
    confidence_threshold: float = 0.5,
    use_gpu: bool = True
) -> InferenceResult:
    """Extract structured fields from Swedish invoice PDF.

    Uses YOLOv11 for field detection and PaddleOCR for text extraction.
    Applies field-specific normalization and validation.

    Args:
        pdf_path: Path to the invoice PDF file.
        confidence_threshold: Minimum confidence for field detection (0.0-1.0).
        use_gpu: Whether to use GPU acceleration.

    Returns:
        InferenceResult containing extracted fields and confidence scores.

    Raises:
        FileNotFoundError: If PDF file doesn't exist.
        ProcessingError: If OCR or detection fails.

    Example:
        >>> result = extract_invoice_fields(Path("invoice.pdf"))
        >>> print(result.fields["invoice_number"])
        "INV-2024-001"
    """
    ...
```

## Performance Best Practices

### Caching

```python
from functools import lru_cache
from cachetools import TTLCache

# Static data: LRU cache
@lru_cache(maxsize=100)
def get_field_config(field_name: str) -> FieldConfig:
    """Load field configuration (cached)."""
    return load_config(field_name)

# Dynamic data: TTL cache
_document_cache = TTLCache(maxsize=1000, ttl=300)  # 5 minutes

def get_document_cached(doc_id: str) -> Document | None:
    if doc_id in _document_cache:
        return _document_cache[doc_id]

    doc = repo.find_by_id(doc_id)
    if doc:
        _document_cache[doc_id] = doc
    return doc
```

### Database Queries

```python
# GOOD: Select only needed columns
cur.execute("""
    SELECT id, status, fields->>'invoice_number'
    FROM documents
    WHERE status = %s
    LIMIT %s
""", ('processed', 10))

# BAD: Select everything
cur.execute("SELECT * FROM documents")

# GOOD: Batch operations
cur.executemany(
    "INSERT INTO labels (doc_id, field, value) VALUES (%s, %s, %s)",
    [(doc_id, f, v) for f, v in fields.items()]
)

# BAD: Individual inserts in loop
for field, value in fields.items():
    cur.execute("INSERT INTO labels ...", (doc_id, field, value))
```

### Lazy Loading

```python
class InferencePipeline:
    def __init__(self, model_path: Path):
        self.model_path = model_path
        self._model: YOLO | None = None
        self._ocr: PaddleOCR | None = None

    @property
    def model(self) -> YOLO:
        """Lazy load YOLO model."""
        if self._model is None:
            self._model = YOLO(str(self.model_path))
        return self._model

    @property
    def ocr(self) -> PaddleOCR:
        """Lazy load PaddleOCR."""
        if self._ocr is None:
            self._ocr = PaddleOCR(use_angle_cls=True, lang="latin")
        return self._ocr
```

## Testing Standards

### Test Structure (AAA Pattern)

```python
def test_extract_bankgiro_valid():
    # Arrange
    text = "Bankgiro: 123-4567"

    # Act
    result = extract_bankgiro(text)

    # Assert
    assert result == "1234567"

def test_extract_bankgiro_invalid_returns_none():
    # Arrange
    text = "No bankgiro here"

    # Act
    result = extract_bankgiro(text)

    # Assert
    assert result is None
```

### Test Naming

```python
# GOOD: Descriptive test names
def test_parse_payment_line_extracts_all_fields(): ...
def test_parse_payment_line_handles_missing_checksum(): ...
def test_validate_ocr_returns_false_for_invalid_checksum(): ...

# BAD: Vague test names
def test_parse(): ...
def test_works(): ...
def test_payment_line(): ...
```

### Fixtures

```python
import pytest
from pathlib import Path

@pytest.fixture
def sample_invoice_pdf(tmp_path: Path) -> Path:
    """Create sample invoice PDF for testing."""
    pdf_path = tmp_path / "invoice.pdf"
    # Create test PDF...
    return pdf_path

@pytest.fixture
def inference_pipeline(sample_model_path: Path) -> InferencePipeline:
    """Create inference pipeline with test model."""
    return InferencePipeline(sample_model_path)

def test_process_invoice(inference_pipeline, sample_invoice_pdf):
    result = inference_pipeline.process(sample_invoice_pdf)
    assert result.fields.get("invoice_number") is not None
```

## Code Smell Detection

### 1. Long Functions

```python
# BAD: Function > 50 lines
def process_document():
    # 100 lines of code...

# GOOD: Split into smaller functions
def process_document(pdf_path: Path) -> InferenceResult:
    image = render_pdf(pdf_path)
    detections = detect_fields(image)
    ocr_results = extract_text(image, detections)
    fields = normalize_fields(ocr_results)
    return build_result(fields)
```

### 2. Deep Nesting

```python
# BAD: 5+ levels of nesting
if document:
    if document.is_valid:
        if document.has_fields:
            if field in document.fields:
                if document.fields[field]:
                    # Do something

# GOOD: Early returns
if not document:
    return None
if not document.is_valid:
    return None
if not document.has_fields:
    return None
if field not in document.fields:
    return None
if not document.fields[field]:
    return None

# Do something
```

### 3. Magic Numbers

```python
# BAD: Unexplained numbers
if confidence > 0.5:
    ...
time.sleep(3)

# GOOD: Named constants
CONFIDENCE_THRESHOLD = 0.5
RETRY_DELAY_SECONDS = 3

if confidence > CONFIDENCE_THRESHOLD:
    ...
time.sleep(RETRY_DELAY_SECONDS)
```

### 4. Mutable Default Arguments

```python
# BAD: Mutable default argument
def process_fields(fields: list = []):  # DANGEROUS!
    fields.append("new_field")
    return fields

# GOOD: Use None as default
def process_fields(fields: list | None = None) -> list:
    if fields is None:
        fields = []
    return [*fields, "new_field"]
```

## Logging Standards

```python
import logging

# Module-level logger
logger = logging.getLogger(__name__)

# GOOD: Appropriate log levels
logger.debug("Processing document: %s", doc_id)
logger.info("Document processed successfully: %s", doc_id)
logger.warning("Low confidence score: %.2f", confidence)
logger.error("Failed to process document: %s", error)

# GOOD: Structured logging with extra data
logger.info(
    "Inference complete",
    extra={
        "document_id": doc_id,
        "field_count": len(fields),
        "processing_time_ms": elapsed_ms
    }
)

# BAD: Using print()
print(f"Processing {doc_id}")  # Never in production!
```

**Remember**: Code quality is not negotiable. Clear, maintainable Python code with proper type hints enables confident development and refactoring.