invoice-master-poc-v2/docs/plans/business-invoice-plan.md
2026-02-01 23:41:46 +01:00

# Business Invoice Features Implementation Plan
## Overview
Add Business-user features to Invoice Master: line items extraction, multi-rate VAT information extraction, and cross-validation.
## Current State
- **Basic field extraction**: YOLO + PaddleOCR, 10 classes (invoice_number, date, amount, etc.)
- **Cross-validation**: payment_line validates OCR/Amount/Bankgiro
- **PaddlePaddle**: 2.5.0 (must be upgraded to 3.0 for PP-StructureV3)
- **Test coverage**: 688 tests, 37% coverage
## Target Architecture
```
PDF/Image
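├─► YOLO (existing) ──► basic field detection
│     - invoice_number, date, amount, etc.
├─► PP-StructureV3 ──► automatic table detection + structure parsing
│     - Line items extraction
│     - SLANet row/column recognition
├─► Regex engine ──► VAT information extraction
│     - Multi-rate breakdown (25%, 12%, 6%)
│     - Moms (VAT) amount matching
└─► Cross-validation engine ──► multi-source validation
      - Math check: incl = excl + vat
      - Line-item totals vs. VAT summary section
      - Confidence scoring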
```
## Implementation Phases
### Phase 1: Environment & PP-StructureV3 POC
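**Goal**: On a separate branch, verify whether PP-StructureV3 can correctly detect tables in Swedish invoices.
**Tasks**:
1. Create the `feature/business-invoice` branch
2. Upgrade dependencies:
   - `paddlepaddle>=3.0.0`
   - `paddleocr>=3.0.0`
3. Create a PP-StructureV3 wrapper:
   - `src/table/structure_detector.py`
4. Test table detection on 5-10 real invoices
5. Verify compatibility with the existing YOLO pipeline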
**Critical Files**:
- [requirements.txt](../../requirements.txt)
- [pyproject.toml](../../pyproject.toml)
- New: `src/table/structure_detector.py`
**Verification**:
```bash
# Test in the WSL environment
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
conda activate invoice-py311 && \
python -c 'from paddleocr import PPStructureV3; print(\"OK\")'"
```
---
### Phase 2: Line Items Extraction
**Goal**: Extract structured line-item data from the detected table regions.
**Data Structures**:
```python
from dataclasses import dataclass


@dataclass
class LineItem:
    """A single line item."""
    row_index: int
    description: str
    quantity: float | None
    unit_price: str | None
    amount: str
    vat_rate: float | None  # [To be verified] whether a per-line VAT rate exists
    confidence: float


@dataclass
class LineItemsResult:
    """Result of line-item extraction."""
    items: list[LineItem]
    table_bbox: tuple[float, float, float, float]  # table region
    header_row: list[str] | None  # table header
    raw_html: str  # raw PP-Structure output
```
**Tasks**:
1. Implement `src/table/line_items_extractor.py`:
   - PP-StructureV3 table detection
   - Parse the HTML output into structured data
   - Smart column-name matching (Description, Qty, Price, Amount, Moms%)
2. Implement the column-name mapping:
```python
COLUMN_MAPPINGS = {
    'description': ['beskrivning', 'artikel', 'produkt', 'tjänst', 'text'],
    'quantity': ['antal', 'qty', 'st', 'pcs'],
    'unit_price': ['á-pris', 'pris', 'styckpris', 'enhetspris'],
    'amount': ['belopp', 'summa', 'total', 'netto'],
    'vat_rate': ['moms', 'moms%', 'vat', 'skatt'],
}
```
3. Unit test coverage
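
The column-name mapping above could be applied to a detected header row roughly as follows. This is a minimal sketch, not the planned `column_mapper.py`: the function names `normalize_header`, `map_columns`, and `has_vat_rate_column` are hypothetical, and the two-pass strategy (exact match first, then substring match for aliases of at least three characters) is an assumption chosen to keep short aliases such as `st` from matching inside unrelated words.

```python
COLUMN_MAPPINGS = {
    "description": ["beskrivning", "artikel", "produkt", "tjänst", "text"],
    "quantity": ["antal", "qty", "st", "pcs"],
    "unit_price": ["á-pris", "pris", "styckpris", "enhetspris"],
    "amount": ["belopp", "summa", "total", "netto"],
    "vat_rate": ["moms", "moms%", "vat", "skatt"],
}

MIN_SUBSTRING_ALIAS_LEN = 3  # assumption: shorter aliases only match exactly


def normalize_header(cell: str) -> str:
    """Lowercase a header cell and drop surrounding whitespace and trailing ':' or '.'."""
    return cell.strip().lower().rstrip(":.")


def map_columns(header_row: list[str]) -> dict[str, int]:
    """Return {canonical_field: column_index} for the headers that can be matched."""
    normalized = [normalize_header(cell) for cell in header_row]
    mapping: dict[str, int] = {}

    # Pass 1: exact alias matches.
    for idx, cell in enumerate(normalized):
        for field_name, aliases in COLUMN_MAPPINGS.items():
            if field_name not in mapping and cell in aliases:
                mapping[field_name] = idx
                break

    # Pass 2: substring matches for columns and fields still unassigned.
    used = set(mapping.values())
    for idx, cell in enumerate(normalized):
        if idx in used:
            continue
        for field_name, aliases in COLUMN_MAPPINGS.items():
            if field_name in mapping:
                continue
            long_aliases = [a for a in aliases if len(a) >= MIN_SUBSTRING_ALIAS_LEN]
            if any(alias in cell for alias in long_aliases):
                mapping[field_name] = idx
                break
    return mapping


def has_vat_rate_column(header_row: list[str]) -> bool:
    """True when the header suggests that each line carries its own VAT rate."""
    return "vat_rate" in map_columns(header_row)
```

`has_vat_rate_column` is one way to make the runtime decision between the per-line and summary-only VAT strategies from the table header alone.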
**Critical Files**:
- New: `src/table/line_items_extractor.py`
- New: `src/table/column_mapper.py`
- New: `tests/table/test_line_items_extractor.py`
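**[To be verified]**: Whether each line item is annotated with a VAT rate
- Real invoice samples need to be checked
- Implement both strategies and choose between them at runtime based on the table header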
---
### Phase 3: VAT Information Extraction
**Goal**: Extract multi-rate VAT information from the full OCR text.
**Data Structures**:
```python
from dataclasses import dataclass


@dataclass
class VATBreakdown:
    """Breakdown for a single VAT rate."""
    rate: float  # 25.0, 12.0, 6.0, 0.0
    base_amount: str  # tax base (excl. VAT)
    vat_amount: str  # VAT amount
    source: str  # 'regex' | 'line_items'


@dataclass
class VATSummary:
    """Complete VAT summary."""
    breakdowns: list[VATBreakdown]
    total_excl_vat: str | None
    total_vat: str | None
    total_incl_vat: str | None  # should match the existing amount field
    confidence: float
```
**Tasks**:
1. Implement `src/vat/vat_extractor.py`:
   - Multi-rate regex pattern matching
   - Swedish keyword recognition
2. Regex pattern library:
```python
VAT_PATTERNS = [
    # Moms 25%: 2 500,00 (Underlag 10 000,00)
    r"Moms\s*(\d+)\s*%\s*:?\s*([\d\s,\.]+)(?:.*?[Uu]nderlag\s*([\d\s,\.]+))?",
    # Varav moms 25% 2 500,00
    r"Varav\s+moms\s+(\d+)\s*%\s*([\d\s,\.]+)",
    # 25% moms: 2 500,00
    r"(\d+)\s*%\s*moms\s*:?\s*([\d\s,\.]+)",
    # Summa exkl. moms: 10 000,00
    r"[Ss]umma\s+exkl\.?\s+moms\s*:?\s*([\d\s,\.]+)",
    # Summa moms: 2 500,00
    r"[Ss]umma\s+moms\s*:?\s*([\d\s,\.]+)",
]
```
3. Amount parser (handles Swedish number formats):
   - `1 234,56` → `1234.56`
   - `1.234,56` → `1234.56`
4. Unit tests
**Critical Files**:
- New: `src/vat/vat_extractor.py`
- New: `src/vat/amount_parser.py`
- New: `tests/vat/test_vat_extractor.py`
---
### Phase 4: Cross-Validation Engine
**Goal**: Cross-validate across multiple sources to ensure 99%+ accuracy.
**Data Structures**:
```python
from dataclasses import dataclass


@dataclass
class VATValidationResult:
    """Result of VAT cross-validation."""
    is_valid: bool
    confidence_score: float  # 0.0 - 1.0
    # Math checks
    math_checks: list[dict]  # per rate: base × rate / 100 = vat?
    total_check: bool  # incl = excl + total_vat?
    # Source comparison
    line_items_vs_summary: bool | None  # line-item totals = VAT summary section?
    amount_consistency: bool | None  # total_incl_vat = existing amount field?
    # Manual review
    needs_review: bool
    review_reasons: list[str]
```
**Validation Logic**:
```
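1. Math checks (most reliable)
   - Per rate: base_amount × rate / 100 = vat_amount (±0.01 tolerance)
   - Total: sum(base) + sum(vat) = total_incl_vat
2. Source cross-validation
   - Line items aggregated by rate vs. the VAT summary breakdown
   - total_incl_vat vs. the existing amount field (extracted by YOLO)
3. Confidence scoring
   - All checks pass: 0.95+
   - Math checks pass but sources disagree: 0.7-0.9
   - Math checks fail: 0.3-0.5, needs_review=True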
```
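
The math checks and confidence bands above might look like the following in code. This is a sketch under stated assumptions: rates are percentages (25 for 25%), so the per-rate check divides by 100; amounts are already parsed to `Decimal`; and the concrete scores 0.95, 0.8, and 0.4 are representative picks from the bands in the plan, not values the plan fixes. All function names are hypothetical.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.01")


def rate_check(base: Decimal, rate_percent: Decimal, vat: Decimal) -> bool:
    """base × rate / 100 should equal the stated VAT amount within the tolerance."""
    return abs(base * rate_percent / Decimal(100) - vat) <= TOLERANCE


def total_check(bases: list[Decimal], vats: list[Decimal], total_incl: Decimal) -> bool:
    """sum(base) + sum(vat) should equal the total including VAT."""
    computed = sum(bases, Decimal(0)) + sum(vats, Decimal(0))
    return abs(computed - total_incl) <= TOLERANCE


def score(math_ok: bool, sources_ok: bool) -> tuple[float, bool]:
    """Return (confidence, needs_review); the exact values are assumptions."""
    if not math_ok:
        return 0.4, True
    if not sources_ok:
        return 0.8, False
    return 0.95, False


def validate(
    breakdowns: list[tuple[Decimal, Decimal, Decimal]],  # (rate_percent, base, vat)
    total_incl: Decimal,
    sources_ok: bool,
) -> tuple[float, bool]:
    """Run the per-rate and total checks, then map the outcome to a confidence band."""
    per_rate_ok = all(rate_check(base, rate, vat) for rate, base, vat in breakdowns)
    totals_ok = total_check(
        [base for _, base, _ in breakdowns],
        [vat for _, _, vat in breakdowns],
        total_incl,
    )
    return score(per_rate_ok and totals_ok, sources_ok)
```

Using `Decimal` rather than `float` keeps the ±0.01 tolerance meaningful, since binary floating point cannot represent most two-decimal amounts exactly.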
**Tasks**:
1. Extend the existing `CrossValidationResult`:
   - Add `vat_validation: VATValidationResult`
2. Implement `src/validation/vat_validator.py`
3. Integrate into `InferencePipeline._merge_fields()`
4. Unit tests + integration tests
**Critical Files**:
- Modify: [packages/backend/backend/pipeline/pipeline.py](../../packages/backend/backend/pipeline/pipeline.py)
- New: `src/validation/vat_validator.py`
- New: `tests/validation/test_vat_validator.py`
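
The source cross-check between line items and the VAT summary could be sketched as below. It assumes every line item carries a VAT rate and that its amount is a net (excl. VAT) value already parsed to `Decimal`; neither is confirmed by the plan, which lists per-line rates as an open question. The function names are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def sum_line_items_by_rate(
    items: list[tuple[Decimal, Decimal]],  # (vat_rate_percent, net_amount)
) -> dict[Decimal, Decimal]:
    """Aggregate net line-item amounts per VAT rate."""
    totals: dict[Decimal, Decimal] = defaultdict(lambda: Decimal("0"))
    for rate, amount in items:
        totals[rate] += amount
    return dict(totals)


def line_items_match_summary(
    items: list[tuple[Decimal, Decimal]],
    summary_bases: dict[Decimal, Decimal],  # rate -> tax base from the VAT summary
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """True when both sources list the same rates with matching tax bases."""
    totals = sum_line_items_by_rate(items)
    if set(totals) != set(summary_bases):
        return False
    return all(abs(totals[rate] - summary_bases[rate]) <= tolerance for rate in totals)
```

When the table has no per-line rate column this comparison cannot run, which is why `line_items_vs_summary` is typed `bool | None`.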
---
### Phase 5: Pipeline Integration
**Goal**: Integrate all components into the existing inference pipeline.
**Tasks**:
1. Extend `InferenceResult`:
```python
@dataclass
class InferenceResult:
    # ... existing fields ...
    line_items: list[LineItem] | None = None
    vat_summary: VATSummary | None = None
    vat_validation: VATValidationResult | None = None
```
2. Modify `InferencePipeline.process_pdf()`:
```python
def process_pdf(self, pdf_path, document_id=None, extract_line_items=False):
    # 1. Existing YOLO + OCR flow
    # 2. if extract_line_items:
    #        PP-StructureV3 table extraction
    #        VAT regex extraction
    #        Cross-validation
    ...
```
3. Update the API schema:
   - `POST /api/v1/infer` gains an `extract_line_items` parameter
   - The response gains `line_items` and `vat_summary` fields
4. Frontend display (optional, later iteration)
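
The opt-in behaviour of `extract_line_items` can be illustrated with a small stand-alone sketch. It is not the real `InferencePipeline`: the `fields` attribute stands in for the existing result fields that the plan elides, and `process_document`, `run_base`, and `run_business` are hypothetical names for the existing YOLO + OCR flow and the new table/VAT/validation stages.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class InferenceResult:
    # Stand-in for the existing fields, which the plan does not list.
    fields: dict[str, Any] = field(default_factory=dict)
    # New fields default to None so existing callers see no change.
    line_items: Any = None
    vat_summary: Any = None
    vat_validation: Any = None


def process_document(
    pdf_path: str,
    run_base: Callable[[str], dict[str, Any]],
    run_business: Callable[[str, dict[str, Any]], tuple[Any, Any, Any]],
    extract_line_items: bool = False,
) -> InferenceResult:
    """Always run the base flow; run the business stages only when requested."""
    result = InferenceResult(fields=run_base(pdf_path))
    if extract_line_items:
        items, summary, validation = run_business(pdf_path, result.fields)
        result.line_items = items
        result.vat_summary = summary
        result.vat_validation = validation
    return result
```

Keeping the flag off by default means the slower table extraction only runs for requests that ask for it, and responses for existing clients keep their current shape apart from the new null fields.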
**Critical Files**:
- Modify: [packages/backend/backend/pipeline/pipeline.py](../../packages/backend/backend/pipeline/pipeline.py)
- Modify: [packages/backend/backend/web/app.py](../../packages/backend/backend/web/app.py)
- Modify: [frontend/src/api/types.ts](../../frontend/src/api/types.ts)
---
### Phase 6: Testing & Validation
**Goal**: Reach 80%+ test coverage and verify accuracy on real invoices.
**Tasks**:
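1. Unit tests:
   - `test_line_items_extractor.py`
   - `test_vat_extractor.py`
   - `test_vat_validator.py`
2. Integration tests:
   - Full pipeline tests
   - Multi-rate invoice test cases
3. Real-invoice validation:
   - Select 50+ invoices that contain line items
   - Manually annotate ground truth
   - Compute accuracy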
**Verification Commands**:
```bash
# Run the tests
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
conda activate invoice-py311 && \
cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && \
pytest tests/table tests/vat tests/validation --cov=src -v"
# Test on real invoices
wsl bash -c "... && python -m src.cli.infer \
--model runs/train/invoice_fields/weights/best.pt \
--input test_invoices/ \
--extract-line-items \
--output results.json"
```
---
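## Open Questions (To Be Verified)
| Question | Status | How to verify |
|----------|--------|---------------|
| Is each line item annotated with a VAT rate? | To be verified | Check 10+ real invoice samples |
| How accurate is PP-StructureV3 table detection on Swedish invoices? | To be verified | Phase 1 POC |
| Can PaddlePaddle 3.0 coexist with PyTorch? | To be verified | Phase 1 environment test |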
---
## Risk Mitigation
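| Risk | Impact | Mitigation |
|------|--------|------------|
| PP-StructureV3 table detection performs poorly | High | Fall back to YOLO for table-region detection + PP-Structure for parsing |
| PaddlePaddle upgrade breaks existing functionality | High | Develop on a separate branch; merge only after thorough testing |
| Multi-rate regex patterns have incomplete coverage | Medium | Collect more invoice samples and iterate on the patterns |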
---
## Success Criteria
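- [ ] PP-StructureV3 table detection accuracy > 95%
- [ ] Line items extraction accuracy > 90%
- [ ] VAT information extraction accuracy > 95%
- [ ] Cross-validation covers all mathematical relationships
- [ ] Test coverage > 80%
- [ ] Existing basic-field functionality is unaffected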
---
## Timeline Estimate
| Phase | Work |
|-------|------|
| Phase 1 | Environment upgrade + POC validation |
| Phase 2 | Line items extraction |
| Phase 3 | VAT information extraction |
| Phase 4 | Cross-validation engine |
| Phase 5 | Pipeline integration |
| Phase 6 | Testing + validation |