Add plan file

Update readme
2026-02-01 23:41:46 +01:00 · 2026-02-01 23:41:38 +01:00
2 changed files with 356 additions and 12 deletions
--- a/README.md
+++ b/README.md
@@ -19,14 +19,14 @@
 packages/
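 ├── shared/      # Shared library (PDF, OCR, normalization, matching, storage, training)
 ├── training/    # Training service (GPU, started on demand)
-└── inference/   # Inference service (always running)
+└── backend/     # Backend service (Web API + inference, always running)
 frontend/        # React frontend (Vite + TypeScript + TailwindCSS)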
 ```

 | Service | Deployment target | GPU | Lifecycle |
 |------|---------|-----|---------|
 | **Frontend** | Vercel / Nginx | No | Always on |
-| **Inference** | Azure App Service / AWS | Optional | Always on, 24/7 |
+| **Backend** | Azure App Service / AWS | Optional | Always on, 24/7 |
 | **Training** | Azure ACI / AWS ECS | Required | Started/destroyed on demand |

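 The two services communicate through a shared PostgreSQL database. The inference service triggers training jobs via the API, and the training service picks up jobs from the database and runs them.
@@ -37,8 +37,8 @@ frontend/        # React frontend (Vite + TypeScript + TailwindCSS)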
 |------|------|
 | **Annotated documents** | 9,738 (9,709 successful) |
 | **Overall field match rate** | 94.8% (82,604/87,121) |
-| **Tests** | 1,601 passed |
-| **Test coverage** | 28% |
+| **Tests** | 2,058 passed |
+| **Test coverage** | 60% |
 | **模型 mAP@0.5** | 93.5% |

 **Match rate per field:**
@@ -86,7 +86,7 @@ cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2
 # 4. Install the three packages (editable mode)
 pip install -e packages/shared
 pip install -e packages/training
-pip install -e packages/inference
+pip install -e packages/backend
 ```

 ## Project Structure
@@ -119,11 +119,11 @@ invoice-master-poc-v2/
 │   │       ├── processing/         # CPU/GPU worker pool, task dispatcher
 │   │       └── data/               # training_db, autolabel_report
 │   │
-│   └── inference/                  # Inference service (always running)
+│   └── backend/                    # Backend service (Web API + inference, always running)
 │       ├── setup.py
 │       ├── Dockerfile
 │       ├── run_server.py           # Web server entry point
-│       └── inference/
+│       └── backend/
 │           ├── cli/                # infer, serve
 │           ├── pipeline/           # YOLO detection, field extraction, parsers
 │           ├── web/                # FastAPI application
@@ -224,7 +224,7 @@ python -m training.cli.train \

 ```bash
 # Command-line inference
-python -m inference.cli.infer \
+python -m backend.cli.infer \
    --model runs/train/invoice_fields/weights/best.pt \
    --input path/to/invoice.pdf \
    --output result.json \
@@ -371,8 +371,8 @@ print(f"Customer Number: {result}")  # "UMJ 436-R"
 | Component | Configuration location |
 |------|---------|
 | Global constants | `packages/shared/shared/config.py` -> `DEFAULT_DPI = 150` |
-| Web inference | `packages/inference/inference/web/config.py` -> `ModelConfig.dpi` |
-| CLI inference | `python -m inference.cli.infer --dpi 150` |
+| Web inference | `packages/backend/backend/web/config.py` -> `ModelConfig.dpi` |
+| CLI inference | `python -m backend.cli.infer --dpi 150` |
 | Auto-labeling | `packages/shared/shared/config.py` -> `AUTOLABEL['dpi']` |

 ## Database Architecture
@@ -427,9 +427,9 @@ DB_PASSWORD=xxx pytest tests/ --cov=packages --cov-report=term-missing

 | Metric | Value |
 |------|------|
-| **Total tests** | 1,601 |
+| **Total tests** | 2,058 |
 | **Pass rate** | 100% |
-| **Coverage** | 28% |
+| **Coverage** | 60% |

 ## Storage Abstraction Layer

--- a/docs/plans/business-invoice-plan.md
+++ b/docs/plans/business-invoice-plan.md
@@ -0,0 +1,344 @@
+# Business Invoice Features Implementation Plan
+
+## Overview
+
+Add Business-user features to Invoice Master: line items extraction, multi-rate VAT information extraction, and cross-validation.
+
+## Current State
+
+- **Basic field extraction**: YOLO + PaddleOCR, 10 classes (invoice_number, date, amount, etc.)
+- **Cross-validation**: payment_line validates OCR/Amount/Bankgiro
+- **PaddlePaddle**: 2.5.0 (must be upgraded to 3.0 for PP-StructureV3)
+- **Test coverage**: 688 tests, 37% coverage
+
+## Target Architecture
+
+```
+PDF/Image
+    │
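+    ├─► YOLO (existing) ──► Basic field detection
+    │                       - invoice_number, date, amount, etc.
+    │
+    ├─► PP-StructureV3 ──► Automatic table detection + structure parsing
+    │                       - Line items extraction
+    │                       - SLANet row/column recognition
+    │
+    ├─► Regex engine ──► VAT information extraction
+    │                    - Multi-rate breakdown (25%, 12%, 6%)
+    │                    - Moms amount matching
+    │
+    └─► Cross-validation engine ──► Multi-source validation
+                                    - Math check: incl = excl + vat
+                                    - Line items totals vs. VAT summary section
+                                    - Confidence scoring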
+```
+
+## Implementation Phases
+
+### Phase 1: Environment & PP-StructureV3 POC
+
+**Goal**: On a separate branch, verify whether PP-StructureV3 can correctly detect tables in Swedish invoices
+
+**Tasks**:
+1. Create the `feature/business-invoice` branch
+2. Upgrade dependencies:
+   - `paddlepaddle>=3.0.0`
+   - `paddleocr>=3.0.0`
+3. Create a PP-StructureV3 wrapper:
+   - `src/table/structure_detector.py`
+4. Test table detection on 5-10 real invoices
+5. Verify compatibility with the existing YOLO pipeline
+
+**Critical Files**:
+- [requirements.txt](../../requirements.txt)
+- [pyproject.toml](../../pyproject.toml)
+- New: `src/table/structure_detector.py`
+
+**Verification**:
+```bash
+# Test in the WSL environment
+wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
+  conda activate invoice-py311 && \
+  python -c 'from paddleocr import PPStructureV3; print(\"OK\")'"
+```
+
+---
+
+### Phase 2: Line Items Extraction
+
+**Goal**: Extract structured line item data from the detected table regions
+
+**Data Structures**:
+```python
+@dataclass
+class LineItem:
+    """A single line item."""
+    row_index: int
+    description: str
+    quantity: float | None
+    unit_price: str | None
+    amount: str
+    vat_rate: float | None      # [To verify] whether a per-line VAT rate exists
+    confidence: float
+
+@dataclass
+class LineItemsResult:
+    """Line items extraction result."""
+    items: list[LineItem]
+    table_bbox: tuple[float, float, float, float]  # table region
+    header_row: list[str] | None                    # header row
+    raw_html: str                                   # raw PP-Structure output
+```
+
+**Tasks**:
+1. Implement `src/table/line_items_extractor.py`:
+   - PP-StructureV3 table detection
+   - Parse HTML into structured data
+   - Smart column-name matching (Description, Qty, Price, Amount, Moms%)
+2. Implement the column-name mapping:
+   ```python
+   COLUMN_MAPPINGS = {
+       'description': ['beskrivning', 'artikel', 'produkt', 'tjänst', 'text'],
+       'quantity': ['antal', 'qty', 'st', 'pcs'],
+       'unit_price': ['á-pris', 'pris', 'styckpris', 'enhetspris'],
+       'amount': ['belopp', 'summa', 'total', 'netto'],
+       'vat_rate': ['moms', 'moms%', 'vat', 'skatt'],
+   }
+   ```
+3. Unit test coverage
+
+**Critical Files**:
+- New: `src/table/line_items_extractor.py`
+- New: `src/table/column_mapper.py`
+- New: `tests/table/test_line_items_extractor.py`
+
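+**[To verify]**: Whether each line item row states a VAT rate
+- Requires checking real invoice samples
+- Implement both strategies and choose at runtime based on the table header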
+
+---
+
+### Phase 3: VAT Information Extraction
+
+**Goal**: Extract multi-rate VAT information from the full OCR text
+
+**Data Structures**:
+```python
+@dataclass
+class VATBreakdown:
+    """Breakdown for a single VAT rate."""
+    rate: float              # 25.0, 12.0, 6.0, 0.0
+    base_amount: str         # tax base (excl. VAT)
+    vat_amount: str          # VAT amount
+    source: str              # 'regex' | 'line_items'
+
+@dataclass
+class VATSummary:
+    """Complete VAT summary."""
+    breakdowns: list[VATBreakdown]
+    total_excl_vat: str | None
+    total_vat: str | None
+    total_incl_vat: str | None      # should match the existing amount field
+    confidence: float
+```
+
+**Tasks**:
+1. Implement `src/vat/vat_extractor.py`:
+   - Multi-rate regex pattern matching
+   - Swedish keyword recognition
+2. Regex pattern library:
+   ```python
+   VAT_PATTERNS = [
+       # Moms 25%: 2 500,00 (Underlag 10 000,00)
+       r"Moms\s*(\d+)\s*%\s*:?\s*([\d\s,\.]+)(?:.*?[Uu]nderlag\s*([\d\s,\.]+))?",
+       # Varav moms 25% 2 500,00
+       r"Varav\s+moms\s+(\d+)\s*%\s*([\d\s,\.]+)",
+       # 25% moms: 2 500,00
+       r"(\d+)\s*%\s*moms\s*:?\s*([\d\s,\.]+)",
+       # Summa exkl. moms: 10 000,00
+       r"[Ss]umma\s+exkl\.?\s+moms\s*:?\s*([\d\s,\.]+)",
+       # Summa moms: 2 500,00
+       r"[Ss]umma\s+moms\s*:?\s*([\d\s,\.]+)",
+   ]
+   ```
+3. Amount parser (handles Swedish number formats):
+   - `1 234,56` → `1234.56`
+   - `1.234,56` → `1234.56`
+4. Unit tests
+
+**Critical Files**:
+- New: `src/vat/vat_extractor.py`
+- New: `src/vat/amount_parser.py`
+- New: `tests/vat/test_vat_extractor.py`
+
+---
+
+### Phase 4: Cross-Validation Engine
+
+**Goal**: Multi-source cross-validation to reach 99%+ accuracy
+
+**Data Structures**:
+```python
+@dataclass
+class VATValidationResult:
+    """VAT cross-validation result."""
+    is_valid: bool
+    confidence_score: float          # 0.0 - 1.0
+
+    # Math validation
+    math_checks: list[dict]          # per rate: base × rate / 100 = vat?
+    total_check: bool                # incl = excl + total_vat?
+
+    # Source comparison
+    line_items_vs_summary: bool | None   # line items totals = VAT summary section?
+    amount_consistency: bool | None       # total_incl_vat = existing amount field?
+
+    # Needs manual review
+    needs_review: bool
+    review_reasons: list[str]
+```
+
+**Validation Logic**:
+```
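+1. Math validation (most reliable)
+   - Per rate: base_amount × rate / 100 = vat_amount (±0.01 tolerance)
+   - Total: sum(base) + sum(vat) = total_incl_vat
+
+2. Source cross-validation
+   - Line items summed per rate vs. the breakdown in the VAT summary section
+   - total_incl_vat vs. the existing amount field (extracted by YOLO)
+
+3. Confidence scoring
+   - All checks pass: 0.95+
+   - Math checks pass but sources disagree: 0.7-0.9
+   - Math checks fail: 0.3-0.5, needs_review=True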
+```
+
+**Tasks**:
+1. Extend the existing `CrossValidationResult`:
+   - Add `vat_validation: VATValidationResult`
+2. Implement `src/validation/vat_validator.py`
+3. Integrate into `InferencePipeline._merge_fields()`
+4. Unit tests + integration tests
+
+**Critical Files**:
+- Modify: [packages/backend/backend/pipeline/pipeline.py](../../packages/backend/backend/pipeline/pipeline.py)
+- New: `src/validation/vat_validator.py`
+- New: `tests/validation/test_vat_validator.py`
+
+---
+
+### Phase 5: Pipeline Integration
+
+**Goal**: Integrate all components into the existing inference pipeline
+
+**Tasks**:
+1. Extend `InferenceResult`:
+   ```python
+   @dataclass
+   class InferenceResult:
+       # ... existing fields ...
+       line_items: list[LineItem] | None = None
+       vat_summary: VATSummary | None = None
+       vat_validation: VATValidationResult | None = None
+   ```
+
+2. Modify `InferencePipeline.process_pdf()`:
+   ```python
+   def process_pdf(self, pdf_path, document_id=None, extract_line_items=False):
+       # 1. Existing YOLO + OCR flow
+       # 2. if extract_line_items:
+       #        PP-StructureV3 table extraction
+       #        VAT regex extraction
+       #        Cross-validation
+   ```
+
+3. Update the API schema:
+   - `POST /api/v1/infer` gains an `extract_line_items` parameter
+   - The response gains `line_items` and `vat_summary` fields
+
+4. Frontend display (optional, later iteration)
+
+**Critical Files**:
+- Modify: [packages/backend/backend/pipeline/pipeline.py](../../packages/backend/backend/pipeline/pipeline.py)
+- Modify: [packages/backend/backend/web/app.py](../../packages/backend/backend/web/app.py)
+- Modify: [frontend/src/api/types.ts](../../frontend/src/api/types.ts)
+
+---
+
+### Phase 6: Testing & Validation
+
+**Goal**: Ensure 80%+ test coverage and verify accuracy on real invoices
+
+**Tasks**:
+1. Unit tests:
+   - `test_line_items_extractor.py`
+   - `test_vat_extractor.py`
+   - `test_vat_validator.py`
+2. Integration tests:
+   - Full pipeline test
+   - Multi-rate invoice test cases
+3. Real-invoice validation:
+   - Select 50+ invoices containing line items
+   - Manually annotate ground truth
+   - Compute accuracy
+
+**Verification Commands**:
+```bash
+# Run the tests
+wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
+  conda activate invoice-py311 && \
+  cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && \
+  pytest tests/table tests/vat tests/validation --cov=src -v"
+
+# Test on real invoices
+wsl bash -c "... && python -m src.cli.infer \
+  --model runs/train/invoice_fields/weights/best.pt \
+  --input test_invoices/ \
+  --extract-line-items \
+  --output results.json"
+```
+
+---
+
+## Open Questions (To Be Verified)
+
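+| Question | Status | How to verify |
+|------|------|---------|
+| Does each line item row state a VAT rate? | To verify | Check 10+ real invoice samples |
+| PP-StructureV3 table detection accuracy on Swedish invoices? | To verify | Phase 1 POC |
+| Can PaddlePaddle 3.0 coexist with PyTorch? | To verify | Phase 1 environment test |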
+
+---
+
+## Risk Mitigation
+
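+| Risk | Impact | Mitigation |
+|------|------|---------|
+| PP-StructureV3 table detection performs poorly | High | Fall back to YOLO for table-region detection + PP-Structure for parsing |
+| PaddlePaddle upgrade breaks existing functionality | High | Develop on a separate branch; merge only after thorough testing |
+| Multi-rate regexes have incomplete coverage | Medium | Collect more invoice samples and iterate on the regex patterns |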
+
+---
+
+## Success Criteria
+
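+- [ ] PP-StructureV3 table detection accuracy > 95%
+- [ ] Line items extraction accuracy > 90%
+- [ ] VAT information extraction accuracy > 95%
+- [ ] Cross-validation covers all mathematical relationships
+- [ ] Test coverage > 80%
+- [ ] Existing basic-field functionality is unaffected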
+
+---
+
+## Timeline Estimate
+
+| Phase | Work |
+|-------|---------|
+| Phase 1 | Environment upgrade + POC validation |
+| Phase 2 | Line items extraction |
+| Phase 3 | VAT information extraction |
+| Phase 4 | Cross-validation engine |
+| Phase 5 | Pipeline integration |
+| Phase 6 | Testing + validation |
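
Note on Phase 3: the plan lists an amount parser for Swedish number formats (`src/vat/amount_parser.py`) but gives no code for it. A minimal sketch of what it might look like follows. The function name `parse_swedish_amount` is hypothetical, and the sketch assumes that a comma, when present, is always the decimal separator and that spaces and dots are then thousands separators. It returns a `Decimal` rather than a float so later VAT arithmetic stays exact.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_swedish_amount(text: str) -> Optional[Decimal]:
    """Parse amounts such as '1 234,56' or '1.234,56' into a Decimal.

    Hypothetical helper, not part of the plan. Assumes a comma, when
    present, is the decimal separator; dots are then thousands separators.
    Returns None when the text is not a parseable number.
    """
    # Drop all whitespace, including non-breaking spaces used as separators.
    cleaned = re.sub(r"\s", "", text)
    if "," in cleaned:
        # '1.234,56' -> '1234,56' -> '1234.56'
        cleaned = cleaned.replace(".", "").replace(",", ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

Input without a comma (for example `1234.56`) is passed to `Decimal` unchanged, so a bare `1.234` would be read as a decimal value; real invoices may need a stricter rule there.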
Author	SHA1	Message	Date
Yaojia Wang	883fab5c4a	Add plan file	2026-02-01 23:41:46 +01:00
Yaojia Wang	45d74d048a	Update readme	2026-02-01 23:41:38 +01:00
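
Note on Phase 4: the math validation described in the plan (per-rate check with a ±0.01 tolerance, plus a total check) could be sketched roughly as below. The function name `check_vat_math` and the tuple layout are assumptions made for illustration; the plan's `VATBreakdown` stores amounts as strings, so this sketch assumes they have already been parsed into `Decimal` values and that the rate is given in percent (25 for 25%).

```python
from decimal import Decimal
from typing import List, Optional, Tuple

TOLERANCE = Decimal("0.01")  # the ±0.01 tolerance stated in the plan


def check_vat_math(
    breakdowns: List[Tuple[Decimal, Decimal, Decimal]],
    total_incl_vat: Optional[Decimal] = None,
) -> Tuple[List[bool], Optional[bool]]:
    """Run the per-rate and total checks (hypothetical helper).

    Each breakdown is (rate_percent, base_amount, vat_amount).
    Returns one bool per rate, plus the total check (None if no total given).
    """
    # Per rate: base_amount * rate / 100 should equal vat_amount.
    rate_checks = [
        abs(base * rate / Decimal(100) - vat) <= TOLERANCE
        for rate, base, vat in breakdowns
    ]
    total_check = None
    if total_incl_vat is not None:
        # Total: sum(base) + sum(vat) should equal total_incl_vat.
        expected = sum((base + vat for _, base, vat in breakdowns), Decimal(0))
        total_check = abs(expected - total_incl_vat) <= TOLERANCE
    return rate_checks, total_check
```

For an invoice with 10 000,00 at 25% and 1 000,00 at 12%, the expected total including VAT is 13 620,00, and both per-rate checks pass. How these booleans map onto the 0.95+ / 0.7-0.9 / 0.3-0.5 confidence bands is left to the plan's scoring step.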