WIP

2026-02-12 23:06:00 +01:00
parent ad5ed46b4c
commit 58d36c8927
26 changed files with 3903 additions and 2551 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -1,66 +1,72 @@
 # Invoice Master POC v2

-Swedish Invoice Field Extraction System - YOLO26 + PaddleOCR extracts structured data from Swedish PDF invoices.
+Swedish Invoice Field Extraction System - YOLO + PaddleOCR extracts structured data from Swedish PDF invoices.
+
+## Architecture
+
+```
+PDF → PyMuPDF (DPI=150) → YOLO Detection → PaddleOCR → Field Extraction → Normalization → Output
+```
+
+### Project Structure
+
+```
+packages/
+├── backend/    # FastAPI web server + inference pipeline
+│   └── pipeline/   # YOLO detector → OCR → field extractor → value selector → normalizers
+├── shared/     # Common utilities (bbox, OCR, field mappings)
+└── training/   # YOLO training data generation (annotation, dataset)
+tests/          # Mirrors packages/ structure
+```
+
+### Pipeline Flow (process_pdf)
+
+1. YOLO detects field regions on rendered PDF page
+2. PaddleOCR extracts text from detected bboxes
+3. Field extractor maps detections to invoice fields via CLASS_TO_FIELD
+4. Value selector picks best candidate per field (confidence + validation)
+5. Normalizers clean values (dates, amounts, invoice numbers)
+6. Fallback regex extraction if key fields missing

 ## Tech Stack

 | Component | Technology |
 |-----------|------------|
-| Object Detection | YOLO26 (Ultralytics >= 8.4.0) |
-| OCR Engine | PaddleOCR v5 (PP-OCRv5) |
-| PDF Processing | PyMuPDF (fitz) |
+| Object Detection | YOLO (Ultralytics >= 8.4.0) |
+| OCR | PaddleOCR v5 (PP-OCRv5) |
+| PDF | PyMuPDF (fitz), DPI=150 |
 | Database | PostgreSQL + psycopg2 |
-| Web Framework | FastAPI + Uvicorn |
-| Deep Learning | PyTorch + CUDA 12.x |
+| Web | FastAPI + Uvicorn |
+| ML | PyTorch + CUDA 12.x |

 ## WSL Environment (REQUIRED)

-**Prefix ALL commands with:**
+ALL Python commands MUST use this prefix:

 ```bash
 wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-sm120 && <command>"
 ```

-**NEVER run Python commands directly in Windows PowerShell/CMD.**
+NEVER run Python directly in Windows PowerShell/CMD.

-## Project-Specific Rules
+## Project Rules

- Python 3.11+ with type hints
- No print() in production - use logging
- Run tests: `pytest --cov=src`
+- Python 3.10, type hints on all function signatures
+- No `print()` in production code - use `logging` module
+- Validation with `pydantic` or `dataclasses`
+- Error handling with `try/except` (not try/catch)
+- Run tests: `pytest --cov=packages tests/`

-## Critical Rules
+## Key Files

-### Code Organization
-
- Many small files over few large files
- High cohesion, low coupling
- 200-400 lines typical, 800 max per file
- Organize by feature/domain, not by type
-
-### Code Style
-
- No emojis in code, comments, or documentation
- Immutability always - never mutate objects or arrays
- No console.log in production code
- Proper error handling with try/catch
- Input validation with Zod or similar
-
-### Testing
-
- TDD: Write tests first
- 80% minimum coverage
- Unit tests for utilities
- Integration tests for APIs
- E2E tests for critical flows
-
-### Security
-
- No hardcoded secrets
- Environment variables for sensitive data
- Validate all user inputs
- Parameterized queries only
- CSRF protection enabled
+| File | Purpose |
+|------|---------|
+| `packages/backend/backend/pipeline/pipeline.py` | Main inference pipeline |
+| `packages/backend/backend/pipeline/field_extractor.py` | YOLO → field mapping |
+| `packages/backend/backend/pipeline/value_selector.py` | Best candidate selection |
+| `packages/shared/shared/fields/mappings.py` | CLASS_TO_FIELD mapping |
+| `packages/shared/shared/ocr/paddle_ocr.py` | OCRToken definition |
+| `packages/shared/shared/bbox/` | Bbox expansion strategies |

 ## Environment Variables

@@ -78,18 +84,41 @@ CONFIDENCE_THRESHOLD=0.5
 SERVER_HOST=0.0.0.0
 SERVER_PORT=8000
 ```
-## Available Commands

- `/tdd` - Test-driven development workflow
- `/plan` - Create implementation plan
- `/code-review` - Review code quality
- `/build-fix` - Fix build errors
+## Auto-trigger Rules (ALWAYS FOLLOW - even after context compaction)

-## Git Workflow
+These rules MUST be followed regardless of conversation history:

- Conventional commits: `feat:`, `fix:`, `refactor:`, `docs:`, `test:`
- Never commit to main directly
- PRs require review
- All tests must pass before merge
+- New feature or bug fix → MUST use **tdd-guide** agent (write tests first)
+- When writing code → MUST follow coding standards skill for the target language:
+  - Python → `python-patterns` (PEP 8, type hints, Pythonic idioms)
+  - C# → `dotnet-skills:coding-standards` (records, pattern matching, modern C#)
+  - TS/JS → `coding-standards` (universal best practices)
+- After writing/modifying code → MUST use **code-reviewer** agent
+- Before git commit → MUST use **security-reviewer** agent
+- When build/test fails → MUST use **build-error-resolver** agent
+- After context compaction → read MEMORY.md to restore session state

-Push the code before review and fix finished.
+## Plan Completion Protocol
+
+After completing any plan or major task:
+
+1. **Test** - Run `pytest` to confirm all tests pass
+2. **Security review** - Use **security-reviewer** agent on changed files
+3. **Fix loop** - If security review reports CRITICAL or HIGH issues:
+   - Fix the issues
+   - Re-run tests (back to step 1)
+   - Re-run security review (back to step 2)
+   - Repeat until no CRITICAL/HIGH issues remain
+4. **Commit** - Auto-commit with conventional commit message (`feat:`, `fix:`, `refactor:`, etc.). Stage only the files changed in this task, not unrelated files
+5. **Save** - Write a summary to MEMORY.md including: what was done, files changed, decisions made, remaining work
+6. **Suggest clear** - Tell the user: "Plan complete. Recommend `/clear` to free context for the next task."
+7. **Do NOT start a new task** in the same context - wait for user to /clear first
+
+This keeps each plan in a fresh context window for maximum quality.
+
+## Known Issues
+
+- Pre-existing test failures: `test_s3.py`, `test_azure.py` (missing boto3/azure) - safe to ignore
+- Always re-run dedup/validation after fallback adds new fields
+- PDF DPI must be 150 (not 300) for correct bbox alignment
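
Note on the DPI known issue above: PDF coordinates are measured in points (1/72 inch), so a bbox detected on a page rendered at 150 DPI maps back to PDF space by a factor of 72/150. The sketch below illustrates why assuming 300 DPI misaligns boxes by a factor of two. The names `pixels_to_points`, `RENDER_DPI`, and `BBox` are hypothetical and chosen for illustration; they are not taken from `packages/shared/shared/bbox/`.

```python
# Hypothetical illustration only: these names do not come from the repository.
PDF_POINTS_PER_INCH = 72.0  # PDF user space unit: 1 point = 1/72 inch
RENDER_DPI = 150            # rendering DPI stated in CLAUDE.md

# (x0, y0, x1, y1) in either pixels or points
BBox = tuple[float, float, float, float]


def pixels_to_points(bbox_px: BBox, dpi: int = RENDER_DPI) -> BBox:
    """Convert a bbox from rendered-image pixels to PDF points."""
    x0, y0, x1, y1 = bbox_px
    return (
        x0 * PDF_POINTS_PER_INCH / dpi,
        y0 * PDF_POINTS_PER_INCH / dpi,
        x1 * PDF_POINTS_PER_INCH / dpi,
        y1 * PDF_POINTS_PER_INCH / dpi,
    )


# A box detected on a page rendered at 150 DPI.
detected = (150.0, 300.0, 450.0, 600.0)

correct = pixels_to_points(detected, dpi=150)
wrong = pixels_to_points(detected, dpi=300)  # assumes the wrong render DPI

# With the wrong DPI every coordinate is half of what it should be,
# so the box lands on a different part of the page.
assert all(w == c / 2 for w, c in zip(wrong, correct))
```

If the renderer and the bbox code disagree on DPI, the error is a uniform scale and not a small offset, which is consistent with the alignment problem the note describes.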
--- a/.claude/rules/coding-style.md
+++ b/.claude/rules/coding-style.md
@@ -1,37 +0,0 @@
-# Python Coding Style
-
-> This file extends [common/coding-style.md](../common/coding-style.md) with Python specific content.
-
-## Standards
-
- Follow **PEP 8** conventions
- Use **type annotations** on all function signatures
-
-## Immutability
-
-Prefer immutable data structures:
-
-```python
-from dataclasses import dataclass
-
-@dataclass(frozen=True)
-class User:
-    name: str
-    email: str
-
-from typing import NamedTuple
-
-class Point(NamedTuple):
-    x: float
-    y: float
-```
-
-## Formatting
-
- **black** for code formatting
- **isort** for import sorting
- **ruff** for linting
-
-## Reference
-
-See skill: `python-patterns` for comprehensive Python idioms and patterns.
--- a/.claude/rules/hooks.md
+++ b/.claude/rules/hooks.md
@@ -1,14 +0,0 @@
-# Python Hooks
-
-> This file extends [common/hooks.md](../common/hooks.md) with Python specific content.
-
-## PostToolUse Hooks
-
-Configure in `~/.claude/settings.json`:
-
- **black/ruff**: Auto-format `.py` files after edit
- **mypy/pyright**: Run type checking after editing `.py` files
-
-## Warnings
-
- Warn about `print()` statements in edited files (use `logging` module instead)
--- a/.claude/rules/patterns.md
+++ b/.claude/rules/patterns.md
@@ -1,34 +0,0 @@
-# Python Patterns
-
-> This file extends [common/patterns.md](../common/patterns.md) with Python specific content.
-
-## Protocol (Duck Typing)
-
-```python
-from typing import Protocol
-
-class Repository(Protocol):
-    def find_by_id(self, id: str) -> dict | None: ...
-    def save(self, entity: dict) -> dict: ...
-```
-
-## Dataclasses as DTOs
-
-```python
-from dataclasses import dataclass
-
-@dataclass
-class CreateUserRequest:
-    name: str
-    email: str
-    age: int | None = None
-```
-
-## Context Managers & Generators
-
- Use context managers (`with` statement) for resource management
- Use generators for lazy evaluation and memory-efficient iteration
-
-## Reference
-
-See skill: `python-patterns` for comprehensive patterns including decorators, concurrency, and package organization.
--- a/.claude/rules/security.md
+++ b/.claude/rules/security.md
@@ -1,25 +0,0 @@
-# Python Security
-
-> This file extends [common/security.md](../common/security.md) with Python specific content.
-
-## Secret Management
-
-```python
-import os
-from dotenv import load_dotenv
-
-load_dotenv()
-
-api_key = os.environ["OPENAI_API_KEY"]  # Raises KeyError if missing
-```
-
-## Security Scanning
-
- Use **bandit** for static security analysis:
-  ```bash
-  bandit -r src/
-  ```
-
-## Reference
-
-See skill: `django-security` for Django-specific security guidelines (if applicable).
--- a/.claude/rules/testing.md
+++ b/.claude/rules/testing.md
@@ -1,33 +0,0 @@
-# Python Testing
-
-> This file extends [common/testing.md](../common/testing.md) with Python specific content.
-
-## Framework
-
-Use **pytest** as the testing framework.
-
-## Coverage
-
-```bash
-pytest --cov=src --cov-report=term-missing
-```
-
-## Test Organization
-
-Use `pytest.mark` for test categorization:
-
-```python
-import pytest
-
-@pytest.mark.unit
-def test_calculate_total():
-    ...
-
-@pytest.mark.integration
-def test_database_connection():
-    ...
-```
-
-## Reference
-
-See skill: `python-testing` for detailed pytest patterns and fixtures.