kai/invoice-master-poc-v2

Yaojia Wang 35988b1ebf Update paddle, and support invoice line item

2026-02-03 21:28:06 +01:00

Business Invoice Features Implementation Plan

Overview

Add business-user features to Invoice Master: line item extraction, multi-rate VAT extraction, and cross-validation.

Current State

Basic field extraction: YOLO + PaddleOCR, 10 classes (invoice_number, date, amount, etc.)
Cross-validation: payment_line validates OCR/Amount/Bankgiro
PaddlePaddle: 2.5.0 (must be upgraded to 3.0 for PP-StructureV3)
Test coverage: 688 tests, 37% coverage

Target Architecture

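PDF/Image
    │
    ├─► YOLO (existing) ──► Basic field detection
    │                       - invoice_number, date, amount, etc.
    │
    ├─► PP-StructureV3 ──► Automatic table detection + structure parsing
    │                       - Line items extraction
    │                       - SLANet row/column recognition
    │
    ├─► Regex engine ──► VAT information extraction
    │                    - Multi-rate breakdown (25%, 12%, 6%)
    │                    - Moms amount matching
    │
    └─► Cross-validation engine ──► Multi-source validation
                                    - Math check: incl = excl + vat
                                    - Line items totals vs. VAT summary section
                                    - Confidence scoring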

Implementation Phases

Phase 1: Environment & PP-StructureV3 POC

Goal: On a separate branch, verify that PP-StructureV3 can correctly detect tables in Swedish invoices

Status: COMPLETED

Completed:

Created TableDetector wrapper class with TDD approach
29 unit tests passing, 84% coverage
Supports wired and wireless table detection
Lazy initialization pattern for PP-StructureV3
PaddleX 3.x API support (LayoutParsingResultV2)
Used existing invoice-sm120 conda environment (PaddlePaddle 3.3, PaddleOCR 3.3.1)
Tested with real Swedish invoices - 10 tables detected across 5 PDFs
HTML table structure extraction working (pred_html)
Cell-level OCR text extraction working (table_ocr_pred)
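
The lazy initialization pattern mentioned above could look roughly like the sketch below. It is not the actual TableDetector code: the class name `LazyTableDetector` and the `engine_factory` hook are hypothetical, and the factory stands in for whatever constructs the PP-StructureV3 pipeline. The point is only that the expensive engine is built on first use, which keeps unit tests fast and lets them inject a fake engine.

```python
from typing import Any, Callable, Optional


class LazyTableDetector:
    """Defers building the (expensive) table engine until it is first needed."""

    def __init__(self, engine_factory: Callable[[], Any]) -> None:
        # engine_factory is a hypothetical hook; in a real wrapper it would
        # construct the PP-StructureV3 pipeline.
        self._engine_factory = engine_factory
        self._engine: Optional[Any] = None

    @property
    def engine(self) -> Any:
        # Build the engine once, on first access, then reuse it.
        if self._engine is None:
            self._engine = self._engine_factory()
        return self._engine

    def detect(self, image: Any) -> Any:
        # Delegate to the engine; the engine is assumed to be callable.
        return self.engine(image)
```

In a test, `engine_factory` can return a simple function, so no Paddle model is loaded.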

Files Created:

packages/backend/backend/table/__init__.py
packages/backend/backend/table/structure_detector.py
tests/table/__init__.py
tests/table/test_structure_detector.py
scripts/ppstructure_poc.py (POC test script)

POC Results:

Total PDFs tested: 5
Total tables detected: 10
  12d321cb-4a3a-47c6-90aa-890cecd13d91.pdf: 4 tables (14, 20, 10, 12 cells)
  3c8d2673-42f7-4474-82ff-4480d6aee632.pdf: 1 table (25 cells)
  52bb76c4-5a43-4c5a-81e0-d9a04002fcb1.pdf: 0 tables (letter, not invoice)
  7d18a79e-7b1e-4daf-8560-f10ab04f265d.pdf: 4 tables (14, 20, 10, 12 cells)
  87b95d60-d980-4037-b1b5-ba2b5d14ecc8.pdf: 1 table (25 cells)

Verification Commands:

# Run tests
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
  conda activate invoice-py311 && \
  cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && \
  pytest tests/table/ -v"

# Run POC with real invoices
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
  conda activate invoice-sm120 && \
  cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && \
  python scripts/ppstructure_poc.py"

Phase 2: Line Items Extraction

Goal: Extract structured line item data from the detected table regions

Status: COMPLETED

Completed:

Created LineItemsExtractor class with TDD approach
19 unit tests passing, 93% coverage
Supports reversed tables (header at bottom - PP-StructureV3 quirk)
Swedish column name mapping (Beskrivning, Antal, Belopp, etc.)
HTMLTableParser for table structure parsing
Automatic header detection from row content
Tested with real Swedish invoices
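
As an illustration of HTML table parsing combined with header detection from row content, here is a minimal sketch using only the standard library. It is not the project's HTMLTableParser: `RowCollector`, `split_header`, and the `HEADER_KEYWORDS` subset are hypothetical names chosen for this example. The header is taken to be the row containing the most known column names, wherever it sits, which also covers the header-at-bottom case.

```python
from html.parser import HTMLParser
from typing import Optional

# Illustrative subset of Swedish header words (assumption, not the full mapping).
HEADER_KEYWORDS = {"beskrivning", "antal", "belopp", "pris", "moms"}


class RowCollector(HTMLParser):
    """Collects the cell text of every <tr> in an HTML table."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def split_header(html: str) -> tuple[Optional[list[str]], list[list[str]]]:
    """Return (header row or None, remaining rows in their original order)."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    rows = parser.rows
    if not rows:
        return None, []

    def score(row):
        return sum(cell.lower() in HEADER_KEYWORDS for cell in row)

    best = max(range(len(rows)), key=lambda i: score(rows[i]))
    if score(rows[best]) == 0:
        return None, rows  # no recognizable header
    return rows[best], rows[:best] + rows[best + 1:]
```

Whether the data rows of a reversed table also need reordering is left open here; the sketch keeps them in the order they were emitted.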

Files Created:

packages/backend/backend/table/line_items_extractor.py
tests/table/test_line_items_extractor.py
scripts/ppstructure_line_items_poc.py (POC test script)

Data Structures:

from dataclasses import dataclass

@dataclass
class LineItem:
    """A single line item."""
    row_index: int
    description: str
    quantity: float | None
    unit_price: str | None
    amount: str
    vat_rate: float | None      # [to be verified] whether a per-line VAT rate exists
    confidence: float

@dataclass
class LineItemsResult:
    """Line item extraction result."""
    items: list[LineItem]
    table_bbox: tuple[float, float, float, float]  # table region
    header_row: list[str] | None                    # table header
    raw_html: str                                   # raw PP-Structure output

Tasks:

Implement src/table/line_items_extractor.py:
- PP-StructureV3 table detection
- Parse HTML into structured data
- Smart column-name matching (Description, Qty, Price, Amount, Moms%)

Implement the column-name mapping:

COLUMN_MAPPINGS = {
    'description': ['beskrivning', 'artikel', 'produkt', 'tjänst', 'text'],
    'quantity': ['antal', 'qty', 'st', 'pcs'],
    'unit_price': ['á-pris', 'pris', 'styckpris', 'enhetspris'],
    'amount': ['belopp', 'summa', 'total', 'netto'],
    'vat_rate': ['moms', 'moms%', 'vat', 'skatt'],
}

Unit test coverage
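
One way the column-name mapping could be applied to a detected header row is sketched below. The `map_columns` helper is hypothetical and uses exact matching after lowercasing and removing whitespace; the real extractor may use fuzzier matching. The dictionary is copied from the plan above.

```python
COLUMN_MAPPINGS = {
    'description': ['beskrivning', 'artikel', 'produkt', 'tjänst', 'text'],
    'quantity': ['antal', 'qty', 'st', 'pcs'],
    'unit_price': ['á-pris', 'pris', 'styckpris', 'enhetspris'],
    'amount': ['belopp', 'summa', 'total', 'netto'],
    'vat_rate': ['moms', 'moms%', 'vat', 'skatt'],
}


def map_columns(header: list[str]) -> dict[str, int]:
    """Return {field name: column index} for header cells with a known alias."""
    mapping: dict[str, int] = {}
    for index, cell in enumerate(header):
        # Normalize: lowercase and drop all whitespace ("Moms %" -> "moms%").
        name = "".join(cell.lower().split())
        for field, aliases in COLUMN_MAPPINGS.items():
            if field not in mapping and name in aliases:
                mapping[field] = index
                break
    return mapping
```

A header without a `vat_rate` entry in the result would indicate that the table carries no per-line VAT rate, which is the runtime check the open question below refers to.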

Critical Files:

New: src/table/line_items_extractor.py
New: src/table/column_mapper.py
New: tests/table/test_line_items_extractor.py

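[To be verified]: whether each line item row states its VAT rate

Real invoice samples need to be checked
Implement both strategies and choose at runtime based on the table header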

Phase 3: VAT Information Extraction

Goal: Extract multi-rate VAT information from the full OCR text

Status: COMPLETED

Completed:

Created VATExtractor class with TDD approach
21 unit tests passing, 96% coverage
AmountParser for Swedish/European number formats
Multiple VAT rate extraction (25%, 12%, 6%, 0%)
Multiple regex patterns for different Swedish formats
Confidence score calculation based on extracted data
Mathematical consistency verification
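
To make the regex-based extraction concrete, the sketch below applies the first pattern from the pattern library in this plan to OCR text. The helper `find_vat_lines` and the returned dictionary shape are assumptions for illustration, not the VATExtractor API. Matching is done line by line because the amount character class includes `\s`, which would otherwise let a match run across a line break into the next line's digits.

```python
import re

# First pattern from this plan; groups: rate, VAT amount, optional base (Underlag).
MOMS_PATTERN = re.compile(
    r"Moms\s*(\d+)\s*%\s*:?\s*([\d\s,\.]+)(?:.*?[Uu]nderlag\s*([\d\s,\.]+))?"
)


def find_vat_lines(text: str) -> list[dict]:
    """Return one dict per line that matches the 'Moms NN%' pattern."""
    results = []
    for line in text.splitlines():
        match = MOMS_PATTERN.search(line)
        if match is None:
            continue
        rate, vat_amount, base_amount = match.groups()
        results.append({
            "rate": float(rate),
            # Amounts stay as raw strings; parsing is a separate step.
            "vat_amount": vat_amount.strip(),
            "base_amount": base_amount.strip() if base_amount else None,
        })
    return results
```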

Files Created:

packages/backend/backend/vat/__init__.py
packages/backend/backend/vat/vat_extractor.py
tests/vat/__init__.py
tests/vat/test_vat_extractor.py

Data Structures:

from dataclasses import dataclass

@dataclass
class VATBreakdown:
    """Breakdown for a single VAT rate."""
    rate: float              # 25.0, 12.0, 6.0, 0.0
    base_amount: str         # tax base (excluding VAT)
    vat_amount: str          # VAT amount
    source: str              # 'regex' | 'line_items'

@dataclass
class VATSummary:
    """Complete VAT summary."""
    breakdowns: list[VATBreakdown]
    total_excl_vat: str | None
    total_vat: str | None
    total_incl_vat: str | None      # should match the existing amount field
    confidence: float

Tasks:

Implement src/vat/vat_extractor.py:
- Multi-rate regex pattern matching
- Swedish keyword recognition

Regex pattern library:

VAT_PATTERNS = [
    # Moms 25%: 2 500,00 (Underlag 10 000,00)
    r"Moms\s*(\d+)\s*%\s*:?\s*([\d\s,\.]+)(?:.*?[Uu]nderlag\s*([\d\s,\.]+))?",
    # Varav moms 25% 2 500,00
    r"Varav\s+moms\s+(\d+)\s*%\s*([\d\s,\.]+)",
    # 25% moms: 2 500,00
    r"(\d+)\s*%\s*moms\s*:?\s*([\d\s,\.]+)",
    # Summa exkl. moms: 10 000,00
    r"[Ss]umma\s+exkl\.?\s+moms\s*:?\s*([\d\s,\.]+)",
    # Summa moms: 2 500,00
    r"[Ss]umma\s+moms\s*:?\s*([\d\s,\.]+)",
]

Amount parser (handles Swedish formats):
- 1 234,56 → 1234.56
- 1.234,56 → 1234.56
Unit tests
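
The two conversions listed above can be sketched as follows. This is a minimal illustration, not the project's AmountParser: `parse_amount` is a hypothetical name, and the rule "a comma, if present, is the decimal separator and dots are thousands separators" is an assumption that fits the two formats shown. A value such as `1.234` with no comma stays ambiguous and is read here as a decimal number.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(text: str) -> Optional[Decimal]:
    """Parse '1 234,56' or '1.234,56' into Decimal('1234.56'); None if invalid."""
    # Remove every whitespace character, including non-breaking spaces.
    cleaned = "".join(text.split())
    if "," in cleaned:
        # Assumption: comma is the decimal separator, dots group thousands.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

`Decimal` is used instead of `float` so that later tolerance checks of ±0.01 are not disturbed by binary rounding.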

Critical Files:

New: src/vat/vat_extractor.py
New: src/vat/amount_parser.py
New: tests/vat/test_vat_extractor.py

Phase 4: Cross-Validation Engine

Goal: Cross-validate across multiple sources to ensure 99%+ accuracy

Status: COMPLETED

Completed:

Created VATValidator class with TDD approach
15 unit tests passing, 90% coverage
Mathematical verification (base × rate / 100 = vat, with rate given as a percentage)
Total amount check (excl + vat = incl)
Line items comparison
Amount consistency check with existing YOLO extraction
Configurable tolerance
Confidence score calculation
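
The line items comparison can be pictured as grouping line amounts by VAT rate and comparing each group total with the base amount from the VAT summary. The sketch below assumes that line amounts are net (excluding VAT) and that every line carries a rate, which the plan still lists as an open question; `sum_by_rate` and `matches_summary` are hypothetical helpers, not the VATValidator API.

```python
from collections import defaultdict
from decimal import Decimal


def sum_by_rate(items):
    """items: iterable of (vat_rate, net_amount) pairs, both Decimal."""
    totals = defaultdict(Decimal)  # Decimal() is Decimal('0')
    for rate, amount in items:
        totals[rate] += amount
    return dict(totals)


def matches_summary(items, summary_bases, tolerance=Decimal("0.01")):
    """True when per-rate line totals equal the summary bases within tolerance."""
    totals = sum_by_rate(items)
    if set(totals) != set(summary_bases):
        return False  # a rate appears on one side only
    return all(abs(totals[r] - summary_bases[r]) <= tolerance for r in totals)
```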

Files Created:

packages/backend/backend/validation/vat_validator.py
tests/validation/__init__.py
tests/validation/test_vat_validator.py

Data Structures:

from dataclasses import dataclass

@dataclass
class VATValidationResult:
    """VAT cross-validation result."""
    is_valid: bool
    confidence_score: float          # 0.0 - 1.0

    # Mathematical checks
    math_checks: list[dict]          # per rate: base × rate / 100 = vat?
    total_check: bool                # incl = excl + total_vat?

    # Source comparison
    line_items_vs_summary: bool | None   # line items totals = VAT summary section?
    amount_consistency: bool | None       # total_incl_vat = existing amount field?

    # Manual review required
    needs_review: bool
    review_reasons: list[str]

Validation Logic:

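1. Mathematical verification (most reliable)
   - Per rate: base_amount × rate / 100 = vat_amount (±0.01 tolerance)
   - Total: sum(base) + sum(vat) = total_incl_vat

2. Cross-source verification
   - Line items aggregated by rate vs. the breakdown in the VAT summary section
   - total_incl_vat vs. the existing amount field (extracted by YOLO)

3. Confidence scoring
   - All checks pass: 0.95+
   - Math checks pass but sources disagree: 0.7-0.9
   - Math checks fail: 0.3-0.5, needs_review=True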
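
The rules above can be sketched in a few functions. This is an illustration under stated assumptions, not the VATValidator implementation: the function names are hypothetical, `rate` is a percentage such as 25, and the concrete scores 0.95, 0.8, and 0.4 are arbitrary picks from inside the bands listed above.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.01")


def check_breakdown(base, rate, vat, tolerance=TOLERANCE):
    """True when base × rate / 100 matches vat within the tolerance."""
    return abs(base * rate / Decimal(100) - vat) <= tolerance


def check_total(bases, vats, total_incl, tolerance=TOLERANCE):
    """True when sum(base) + sum(vat) matches total_incl_vat within tolerance."""
    return abs(sum(bases) + sum(vats) - total_incl) <= tolerance


def score(math_ok: bool, sources_ok: bool):
    """Return (confidence, needs_review); values are illustrative band picks."""
    if not math_ok:
        return 0.4, True
    if not sources_ok:
        return 0.8, False
    return 0.95, False
```

For example, a base of 10 000,00 at 25% with VAT 2 500,00 passes `check_breakdown`, while a VAT of 2 400,00 fails it and leads to `needs_review=True`.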

Tasks:

Extend the existing CrossValidationResult:
- Add vat_validation: VATValidationResult
Implement src/validation/vat_validator.py
Integrate into InferencePipeline._merge_fields()
Unit tests + integration tests

Critical Files:

Modify: packages/backend/backend/pipeline/pipeline.py
New: src/validation/vat_validator.py
New: tests/validation/test_vat_validator.py

Phase 5: Pipeline Integration

Goal: Integrate all components into the existing inference pipeline

Tasks:

Extend InferenceResult:

@dataclass
class InferenceResult:
    # ... existing fields ...
    line_items: list[LineItem] | None = None
    vat_summary: VATSummary | None = None
    vat_validation: VATValidationResult | None = None

Modify InferencePipeline.process_pdf():

def process_pdf(self, pdf_path, document_id=None, extract_line_items=False):
    # 1. Existing YOLO + OCR flow
    # 2. if extract_line_items:
    #        PP-StructureV3 table extraction
    #        VAT regex extraction
    #        Cross-validation

Update the API schema:
- POST /api/v1/infer gains an extract_line_items parameter
- Response gains line_items and vat_summary fields
Frontend display (optional, later iteration)
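
The opt-in flow described above might be wired roughly as in the sketch below. Everything here is hypothetical scaffolding: `InferenceResultSketch` and `process_pdf_sketch` are stand-ins for the real classes, and the four stages are passed in as plain callables so the example stays self-contained. The point it illustrates is that the new stages run only when `extract_line_items` is set, so existing callers see unchanged results.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class InferenceResultSketch:
    fields: dict                             # stands in for the existing fields
    line_items: Optional[list] = None
    vat_summary: Optional[Any] = None
    vat_validation: Optional[Any] = None


def process_pdf_sketch(
    pdf_path: str,
    extract_fields: Callable[[str], dict],   # existing YOLO + OCR flow
    extract_items: Callable[[str], list],    # table / line item stage
    extract_vat: Callable[[str], Any],       # VAT regex stage
    validate: Callable[[dict, list, Any], Any],
    extract_line_items: bool = False,
) -> InferenceResultSketch:
    result = InferenceResultSketch(fields=extract_fields(pdf_path))
    if extract_line_items:
        # New stages are skipped entirely unless the caller opts in.
        result.line_items = extract_items(pdf_path)
        result.vat_summary = extract_vat(pdf_path)
        result.vat_validation = validate(
            result.fields, result.line_items, result.vat_summary
        )
    return result
```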

Critical Files:

Modify: packages/backend/backend/pipeline/pipeline.py
Modify: packages/backend/backend/web/app.py
Modify: frontend/src/api/types.ts

Phase 6: Testing & Validation

Goal: Ensure 80%+ test coverage and verify accuracy on real invoices

Tasks:

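Unit tests:
- test_line_items_extractor.py
- test_vat_extractor.py
- test_vat_validator.py
Integration tests:
- Full pipeline test
- Multi-rate invoice test cases
Real invoice validation:
- Select 50+ invoices that contain line items
- Manually annotate ground truth
- Compute accuracy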
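
For the accuracy step, one simple metric is the fraction of annotated fields that the pipeline reproduces exactly, computed per field name. The sketch below assumes predictions and ground truth are both stored as `{document_id: {field: value}}` dictionaries with already-normalized values; that layout and the `field_accuracy` helper are assumptions for illustration.

```python
def field_accuracy(predictions: dict, ground_truth: dict) -> dict:
    """Per-field share of ground-truth values that were predicted exactly."""
    hits: dict = {}
    totals: dict = {}
    for doc_id, truth in ground_truth.items():
        predicted = predictions.get(doc_id, {})  # missing document counts as wrong
        for field, expected in truth.items():
            totals[field] = totals.get(field, 0) + 1
            if predicted.get(field) == expected:
                hits[field] = hits.get(field, 0) + 1
    return {field: hits.get(field, 0) / totals[field] for field in totals}
```

The resulting ratios can be compared directly with the thresholds under Success Criteria.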

Verification Commands:

# Run tests
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
  conda activate invoice-py311 && \
  cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && \
  pytest tests/table tests/vat tests/validation --cov=src -v"

# Test with real invoices
wsl bash -c "... && python -m src.cli.infer \
  --model runs/train/invoice_fields/weights/best.pt \
  --input test_invoices/ \
  --extract-line-items \
  --output results.json"

Open Questions (to be verified)

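Question	Status	How to verify
Does each line item row state its VAT rate?	To be verified	Check 10+ real invoice samples
PP-StructureV3 table detection accuracy on Swedish invoices?	To be verified	Phase 1 POC
Can PaddlePaddle 3.0 coexist with PyTorch?	To be verified	Phase 1 environment test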

Risk Mitigation

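Risk	Impact	Mitigation
PP-StructureV3 table detection performs poorly	High	Fall back to YOLO for table region detection + PP-Structure for parsing
PaddlePaddle upgrade breaks existing functionality	High	Develop on a separate branch; merge only after thorough testing
Multi-rate regex coverage is incomplete	Medium	Collect more invoice samples and iterate on the regex patterns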

Success Criteria

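PP-StructureV3 table detection accuracy > 95%
Line items extraction accuracy > 90%
VAT information extraction accuracy > 95%
Cross-validation covers all mathematical relationships
Test coverage > 80%
Existing basic field functionality is unaffected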

Timeline Estimate

Phase	Work
Phase 1	Environment upgrade + POC validation
Phase 2	Line items extraction
Phase 3	VAT information extraction
Phase 4	Cross-validation engine
Phase 5	Pipeline integration
Phase 6	Testing + validation
