WIP
@@ -1,66 +1,72 @@
# Invoice Master POC v2

Swedish Invoice Field Extraction System - YOLO26 + PaddleOCR extracts structured data from Swedish PDF invoices.
Swedish Invoice Field Extraction System - YOLO + PaddleOCR extracts structured data from Swedish PDF invoices.

## Architecture

```
PDF → PyMuPDF (DPI=150) → YOLO Detection → PaddleOCR → Field Extraction → Normalization → Output
```

### Project Structure

```
packages/
├── backend/          # FastAPI web server + inference pipeline
│   └── pipeline/     # YOLO detector → OCR → field extractor → value selector → normalizers
├── shared/           # Common utilities (bbox, OCR, field mappings)
└── training/         # YOLO training data generation (annotation, dataset)
tests/                # Mirrors packages/ structure
```

### Pipeline Flow (process_pdf)

1. YOLO detects field regions on rendered PDF page
2. PaddleOCR extracts text from detected bboxes
3. Field extractor maps detections to invoice fields via CLASS_TO_FIELD
4. Value selector picks best candidate per field (confidence + validation)
5. Normalizers clean values (dates, amounts, invoice numbers)
6. Fallback regex extraction if key fields missing

## Tech Stack

| Component | Technology |
|-----------|------------|
| Object Detection | YOLO26 (Ultralytics >= 8.4.0) |
| OCR Engine | PaddleOCR v5 (PP-OCRv5) |
| PDF Processing | PyMuPDF (fitz) |
| Object Detection | YOLO (Ultralytics >= 8.4.0) |
| OCR | PaddleOCR v5 (PP-OCRv5) |
| PDF | PyMuPDF (fitz), DPI=150 |
| Database | PostgreSQL + psycopg2 |
| Web Framework | FastAPI + Uvicorn |
| Deep Learning | PyTorch + CUDA 12.x |
| Web | FastAPI + Uvicorn |
| ML | PyTorch + CUDA 12.x |

## WSL Environment (REQUIRED)

**Prefix ALL commands with:**
ALL Python commands MUST use this prefix:

```bash
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-sm120 && <command>"
```
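
When the prefix has to be applied from a script rather than typed by hand, one option is to build the argument list programmatically. The helper below is a hypothetical sketch (`wsl_argv` is not part of the project); it only assembles the arguments and leaves execution to the caller.

```python
CONDA_PREFIX = (
    "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-sm120"
)


def wsl_argv(command: str) -> list[str]:
    """Return an argv list that runs `command` inside the conda env under WSL."""
    # Passing a list (not one shell string) to subprocess.run avoids having to
    # escape the outer double quotes used in the manual form of the command.
    return ["wsl", "bash", "-c", f"{CONDA_PREFIX} && {command}"]


argv = wsl_argv("pytest --cov=packages tests/")
```

The resulting list can be handed to `subprocess.run(argv, check=True)` on a Windows host where `wsl` is on the PATH.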

**NEVER run Python commands directly in Windows PowerShell/CMD.**
NEVER run Python directly in Windows PowerShell/CMD.

## Project-Specific Rules
## Project Rules

- Python 3.11+ with type hints
- No print() in production - use logging
- Run tests: `pytest --cov=src`
- Python 3.10, type hints on all function signatures
- No `print()` in production code - use `logging` module
- Validation with `pydantic` or `dataclasses`
- Error handling with `try/except` (not try/catch)
- Run tests: `pytest --cov=packages tests/`

## Critical Rules
## Key Files

### Code Organization

- Many small files over few large files
- High cohesion, low coupling
- 200-400 lines typical, 800 max per file
- Organize by feature/domain, not by type

### Code Style

- No emojis in code, comments, or documentation
- Immutability always - never mutate objects or arrays
- No console.log in production code
- Proper error handling with try/catch
- Input validation with Zod or similar

### Testing

- TDD: Write tests first
- 80% minimum coverage
- Unit tests for utilities
- Integration tests for APIs
- E2E tests for critical flows

### Security

- No hardcoded secrets
- Environment variables for sensitive data
- Validate all user inputs
- Parameterized queries only
- CSRF protection enabled

| File | Purpose |
|------|---------|
| `packages/backend/backend/pipeline/pipeline.py` | Main inference pipeline |
| `packages/backend/backend/pipeline/field_extractor.py` | YOLO → field mapping |
| `packages/backend/backend/pipeline/value_selector.py` | Best candidate selection |
| `packages/shared/shared/fields/mappings.py` | CLASS_TO_FIELD mapping |
| `packages/shared/shared/ocr/paddle_ocr.py` | OCRToken definition |
| `packages/shared/shared/bbox/` | Bbox expansion strategies |

## Environment Variables

@@ -78,18 +84,41 @@ CONFIDENCE_THRESHOLD=0.5
```
SERVER_HOST=0.0.0.0
SERVER_PORT=8000
```
## Available Commands

- `/tdd` - Test-driven development workflow
- `/plan` - Create implementation plan
- `/code-review` - Review code quality
- `/build-fix` - Fix build errors

## Auto-trigger Rules (ALWAYS FOLLOW - even after context compaction)

## Git Workflow
These rules MUST be followed regardless of conversation history:

- Conventional commits: `feat:`, `fix:`, `refactor:`, `docs:`, `test:`
- Never commit to main directly
- PRs require review
- All tests must pass before merge
- New feature or bug fix → MUST use **tdd-guide** agent (write tests first)
- When writing code → MUST follow coding standards skill for the target language:
  - Python → `python-patterns` (PEP 8, type hints, Pythonic idioms)
  - C# → `dotnet-skills:coding-standards` (records, pattern matching, modern C#)
  - TS/JS → `coding-standards` (universal best practices)
- After writing/modifying code → MUST use **code-reviewer** agent
- Before git commit → MUST use **security-reviewer** agent
- When build/test fails → MUST use **build-error-resolver** agent
- After context compaction → read MEMORY.md to restore session state

Push the code before review and fix finished.

## Plan Completion Protocol

After completing any plan or major task:

1. **Test** - Run `pytest` to confirm all tests pass
2. **Security review** - Use **security-reviewer** agent on changed files
3. **Fix loop** - If security review reports CRITICAL or HIGH issues:
   - Fix the issues
   - Re-run tests (back to step 1)
   - Re-run security review (back to step 2)
   - Repeat until no CRITICAL/HIGH issues remain
4. **Commit** - Auto-commit with conventional commit message (`feat:`, `fix:`, `refactor:`, etc.). Stage only the files changed in this task, not unrelated files
5. **Save** - Write a summary to MEMORY.md including: what was done, files changed, decisions made, remaining work
6. **Suggest clear** - Tell the user: "Plan complete. Recommend `/clear` to free context for the next task."
7. **Do NOT start a new task** in the same context - wait for user to /clear first

This keeps each plan in a fresh context window for maximum quality.

## Known Issues

- Pre-existing test failures: `test_s3.py`, `test_azure.py` (missing boto3/azure) - safe to ignore
- Always re-run dedup/validation after fallback adds new fields
- PDF DPI must be 150 (not 300) for correct bbox alignment
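
The DPI note can be made concrete with a little arithmetic. PDF coordinates are expressed in points (72 per inch), so a box detected on a rendered page has to be scaled by `72 / dpi` to map back onto the PDF. The sketch below assumes the misalignment comes from exactly this scale mismatch, which the note itself does not spell out; `pixels_to_points` is a hypothetical helper.

```python
PDF_POINTS_PER_INCH = 72  # PDF user-space units per inch

BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def pixels_to_points(bbox: BBox, dpi: int = 150) -> BBox:
    """Convert a pixel-space bbox from a page rendered at `dpi` to PDF points."""
    x0, y0, x1, y1 = (value * PDF_POINTS_PER_INCH / dpi for value in bbox)
    return (x0, y0, x1, y1)


detected = (150.0, 300.0, 600.0, 450.0)        # pixels on a 150-DPI render
correct = pixels_to_points(detected, dpi=150)  # (72.0, 144.0, 288.0, 216.0)
wrong = pixels_to_points(detected, dpi=300)    # (36.0, 72.0, 144.0, 108.0)
```

Converting with the wrong DPI halves every coordinate, so the box lands over a different region of the page and the OCR step would read the wrong text.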

@@ -1,37 +0,0 @@
# Python Coding Style

> This file extends [common/coding-style.md](../common/coding-style.md) with Python specific content.

## Standards

- Follow **PEP 8** conventions
- Use **type annotations** on all function signatures

## Immutability

Prefer immutable data structures:

```python
from dataclasses import dataclass
from typing import NamedTuple


@dataclass(frozen=True)
class User:
    name: str
    email: str


class Point(NamedTuple):
    x: float
    y: float
```
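
A brief usage sketch of the frozen dataclass above: instead of mutating an instance, `dataclasses.replace` returns a modified copy, and direct assignment raises `FrozenInstanceError`. The example values are made up.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class User:
    name: str
    email: str


user = User(name="Ada", email="ada@example.com")

# "Updating" a frozen instance means building a new one with some fields changed.
updated = replace(user, email="ada@example.org")

try:
    user.email = "someone@example.com"  # not allowed on a frozen dataclass
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False
```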

## Formatting

- **black** for code formatting
- **isort** for import sorting
- **ruff** for linting

## Reference

See skill: `python-patterns` for comprehensive Python idioms and patterns.

@@ -1,14 +0,0 @@
# Python Hooks

> This file extends [common/hooks.md](../common/hooks.md) with Python specific content.

## PostToolUse Hooks

Configure in `~/.claude/settings.json`:

- **black/ruff**: Auto-format `.py` files after edit
- **mypy/pyright**: Run type checking after editing `.py` files

## Warnings

- Warn about `print()` statements in edited files (use `logging` module instead)

@@ -1,34 +0,0 @@
# Python Patterns

> This file extends [common/patterns.md](../common/patterns.md) with Python specific content.

## Protocol (Duck Typing)

```python
from typing import Protocol


class Repository(Protocol):
    def find_by_id(self, id: str) -> dict | None: ...
    def save(self, entity: dict) -> dict: ...
```

## Dataclasses as DTOs

```python
from dataclasses import dataclass


@dataclass
class CreateUserRequest:
    name: str
    email: str
    age: int | None = None
```

## Context Managers & Generators

- Use context managers (`with` statement) for resource management
- Use generators for lazy evaluation and memory-efficient iteration
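
Both points can be illustrated in a few lines. In this sketch, `opened` is a hypothetical context manager that guarantees the stream is closed, and `non_empty_lines` is a generator that yields lines one at a time instead of building a list up front. An in-memory `StringIO` stands in for a real file, and the sample text is invented.

```python
import io
from collections.abc import Iterator
from contextlib import contextmanager


@contextmanager
def opened(stream: io.StringIO) -> Iterator[io.StringIO]:
    """Yield the stream and close it on exit, even if the body raises."""
    try:
        yield stream
    finally:
        stream.close()


def non_empty_lines(stream: io.StringIO) -> Iterator[str]:
    """Lazily yield stripped, non-blank lines."""
    for line in stream:
        stripped = line.strip()
        if stripped:
            yield stripped


with opened(io.StringIO("Faktura 1001\n\nBelopp 250,00\n")) as handle:
    lines = list(non_empty_lines(handle))
```

The generator must be consumed inside the `with` block, because the stream is closed as soon as the block exits.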

## Reference

See skill: `python-patterns` for comprehensive patterns including decorators, concurrency, and package organization.

@@ -1,25 +0,0 @@
# Python Security

> This file extends [common/security.md](../common/security.md) with Python specific content.

## Secret Management

```python
import os

from dotenv import load_dotenv

load_dotenv()

api_key = os.environ["OPENAI_API_KEY"]  # Raises KeyError if missing
```

## Security Scanning

- Use **bandit** for static security analysis:
  ```bash
  bandit -r src/
  ```

## Reference

See skill: `django-security` for Django-specific security guidelines (if applicable).

@@ -1,33 +0,0 @@
# Python Testing

> This file extends [common/testing.md](../common/testing.md) with Python specific content.

## Framework

Use **pytest** as the testing framework.

## Coverage

```bash
pytest --cov=src --cov-report=term-missing
```

## Test Organization

Use `pytest.mark` for test categorization:

```python
import pytest


@pytest.mark.unit
def test_calculate_total():
    ...


@pytest.mark.integration
def test_database_connection():
    ...
```
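
Custom marks such as `unit` and `integration` trigger an unknown-mark warning unless they are registered. One way to register them, sketched here under the assumption that the project has no marker configuration yet, is a `pytest_configure` hook in `conftest.py`; the marker descriptions are placeholders.

```python
# conftest.py (sketch)


def pytest_configure(config) -> None:
    """Register the custom markers so pytest recognizes them."""
    config.addinivalue_line("markers", "unit: fast tests with no external services")
    config.addinivalue_line(
        "markers", "integration: tests that need a database or network"
    )
```

With the markers registered, `pytest -m unit` runs only the unit tests and `pytest -m "not integration"` skips the slower ones.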

## Reference

See skill: `python-testing` for detailed pytest patterns and fixtures.