Update paddle, and support invoice line item

Yaojia Wang
2026-02-03 21:28:06 +01:00
parent c4e3773df1
commit 35988b1ebf
41 changed files with 6832 additions and 48 deletions


@@ -39,27 +39,50 @@ PDF/Image
**Goal**: Verify on a separate branch whether PP-StructureV3 can correctly detect tables in Swedish invoices
**Tasks**:
1. Create the `feature/business-invoice` branch
2. Upgrade dependencies:
- `paddlepaddle>=3.0.0`
- `paddleocr>=3.0.0`
3. Create a PP-StructureV3 wrapper:
- `src/table/structure_detector.py`
4. Test table detection on 5-10 real invoices
5. Verify compatibility with the existing YOLO pipeline
**Status**: COMPLETED
**Critical Files**:
- [requirements.txt](../../requirements.txt)
- [pyproject.toml](../../pyproject.toml)
- New: `src/table/structure_detector.py`
**Completed**:
- [x] Created `TableDetector` wrapper class with TDD approach
- [x] 29 unit tests passing, 84% coverage
- [x] Supports wired and wireless table detection
- [x] Lazy initialization pattern for PP-StructureV3
- [x] PaddleX 3.x API support (LayoutParsingResultV2)
- [x] Used existing `invoice-sm120` conda environment (PaddlePaddle 3.3, PaddleOCR 3.3.1)
- [x] Tested with real Swedish invoices - 10 tables detected across 5 PDFs
- [x] HTML table structure extraction working (pred_html)
- [x] Cell-level OCR text extraction working (table_ocr_pred)
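
The lazy initialization item above can be illustrated with a minimal sketch. It is not the repository's `TableDetector`: the class name `LazyTableDetector` and the injectable `pipeline_factory` parameter are hypothetical, introduced only for illustration. The sketch assumes the PaddleOCR 3.x entry point `PPStructureV3().predict(...)` and defers the import until the pipeline is first needed.

```python
from typing import Any, Callable, Optional


class LazyTableDetector:
    """Hypothetical sketch of a lazily initialised PP-StructureV3 wrapper."""

    def __init__(self, pipeline_factory: Optional[Callable[[], Any]] = None) -> None:
        # The factory is injectable (an assumption of this sketch) so unit
        # tests can pass a stub instead of loading the real, heavy models.
        self._factory = pipeline_factory or self._default_factory
        self._pipeline: Any = None

    @staticmethod
    def _default_factory() -> Any:
        # Imported here so that importing this module does not require
        # PaddleOCR to be installed.
        from paddleocr import PPStructureV3

        return PPStructureV3()

    @property
    def pipeline(self) -> Any:
        # Build the pipeline on first access and reuse it afterwards.
        if self._pipeline is None:
            self._pipeline = self._factory()
        return self._pipeline

    def detect(self, image: Any) -> list:
        # Materialise the results so callers always get a plain list.
        return list(self.pipeline.predict(image))
```

Constructing the object stays cheap; the model load cost is paid once, on the first `detect` call.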
**Verification**:
**Files Created**:
- `packages/backend/backend/table/__init__.py`
- `packages/backend/backend/table/structure_detector.py`
- `tests/table/__init__.py`
- `tests/table/test_structure_detector.py`
- `scripts/ppstructure_poc.py` (POC test script)
**POC Results**:
```
Total PDFs tested: 5
Total tables detected: 10
12d321cb-4a3a-47c6-90aa-890cecd13d91.pdf: 4 tables (14, 20, 10, 12 cells)
3c8d2673-42f7-4474-82ff-4480d6aee632.pdf: 1 table (25 cells)
52bb76c4-5a43-4c5a-81e0-d9a04002fcb1.pdf: 0 tables (letter, not invoice)
7d18a79e-7b1e-4daf-8560-f10ab04f265d.pdf: 4 tables (14, 20, 10, 12 cells)
87b95d60-d980-4037-b1b5-ba2b5d14ecc8.pdf: 1 table (25 cells)
```
**Verification Commands**:
```bash
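# Test in the WSL environment: check that PP-StructureV3 can be imported
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
conda activate invoice-py311 && \
python -c 'from paddleocr import PPStructureV3; print(\"OK\")'"
# Run tests
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
conda activate invoice-py311 && \
cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && \
pytest tests/table/ -v"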
# Run POC with real invoices
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
conda activate invoice-sm120 && \
cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && \
python scripts/ppstructure_poc.py"
```
---
@@ -68,6 +91,22 @@ wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
**Goal**: Extract structured line-item data from the detected table regions
**Status**: COMPLETED
**Completed**:
- [x] Created `LineItemsExtractor` class with TDD approach
- [x] 19 unit tests passing, 93% coverage
- [x] Supports reversed tables (header at bottom - PP-StructureV3 quirk)
- [x] Swedish column name mapping (Beskrivning, Antal, Belopp, etc.)
- [x] HTMLTableParser for table structure parsing
- [x] Automatic header detection from row content
- [x] Tested with real Swedish invoices
**Files Created**:
- `packages/backend/backend/table/line_items_extractor.py`
- `tests/table/test_line_items_extractor.py`
- `scripts/ppstructure_line_items_poc.py` (POC test script)
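
As a rough illustration of the header detection, Swedish column mapping, and reversed-table handling listed above, here is a minimal standard-library sketch. It is not the repository's `LineItemsExtractor` or `HTMLTableParser`. The names `COLUMN_MAP`, `_RowCollector`, and `extract_line_items` are hypothetical, the mapping covers only the three column names mentioned above, and it assumes that a table with its header in the last row also has its data rows in reverse order.

```python
from html.parser import HTMLParser

# Illustrative mapping only; the real extractor supports more column names.
COLUMN_MAP = {"beskrivning": "description", "antal": "quantity", "belopp": "amount"}


class _RowCollector(HTMLParser):
    """Collects the cell text of every <tr> into a list of rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def _is_header(row: list[str]) -> bool:
    # A row counts as a header if any cell is a known column name.
    return any(cell.strip().lower() in COLUMN_MAP for cell in row)


def extract_line_items(html: str) -> list[dict[str, str]]:
    parser = _RowCollector()
    parser.feed(html)
    rows = [r for r in parser.rows if r]
    header_idx = next((i for i, r in enumerate(rows) if _is_header(r)), None)
    if header_idx is None:
        return []
    if header_idx == len(rows) - 1 and header_idx > 0:
        # Header at the bottom: assume the whole table is reversed, so the
        # rows above it are the data rows in reverse order.
        data_rows = rows[:header_idx][::-1]
    else:
        data_rows = rows[header_idx + 1:]
    keys = [COLUMN_MAP.get(c.strip().lower()) for c in rows[header_idx]]
    # Cells under unrecognised header columns are dropped.
    return [{k: cell.strip() for k, cell in zip(keys, row) if k} for row in data_rows]
```

Amounts are left as raw strings here; converting them to numbers is a separate step.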
**Data Structures**:
```python
@dataclass
@@ -122,6 +161,23 @@ class LineItemsResult:
**Goal**: Extract multi-rate VAT information from the full OCR text
**Status**: COMPLETED
**Completed**:
- [x] Created `VATExtractor` class with TDD approach
- [x] 21 unit tests passing, 96% coverage
- [x] `AmountParser` for Swedish/European number formats
- [x] Multiple VAT rate extraction (25%, 12%, 6%, 0%)
- [x] Multiple regex patterns for different Swedish formats
- [x] Confidence score calculation based on extracted data
- [x] Mathematical consistency verification
**Files Created**:
- `packages/backend/backend/vat/__init__.py`
- `packages/backend/backend/vat/vat_extractor.py`
- `tests/vat/__init__.py`
- `tests/vat/test_vat_extractor.py`
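
To make the number-format and multi-rate items above concrete, here is a minimal sketch of Swedish/European amount parsing and one possible VAT line pattern. It is not the repository's `AmountParser` or `VATExtractor`: the names `parse_amount`, `VAT_LINE`, and `extract_vat_amounts` are hypothetical, and the single regex assumes lines shaped like `Moms 25% 250,00`, whereas the real extractor uses several patterns.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Dict, Optional


def parse_amount(text: str) -> Optional[Decimal]:
    """Parse amounts such as '1 234,56', '1.234,56' or '1234.56 kr'."""
    # Drop all whitespace, including the non-breaking spaces that are
    # commonly used as thousands separators.
    cleaned = re.sub(r"\s", "", text)
    # Drop a trailing currency marker (assumed set: kr, SEK, ':-').
    cleaned = re.sub(r"(?i)(kr|sek|:-)$", "", cleaned)
    if "," in cleaned:
        # A comma is the decimal separator, so any dots are thousands marks.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    # Without a comma, a dot is read as the decimal point ('1.234' is ambiguous).
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


# Hypothetical pattern: 'Moms <rate>% <amount>' with a comma decimal separator.
VAT_LINE = re.compile(
    r"moms\s+(\d{1,2})\s*%\s*:?\s*(\d[\d .\u00a0]*,\d{2})", re.IGNORECASE
)


def extract_vat_amounts(ocr_text: str) -> Dict[int, Decimal]:
    """Return a mapping of VAT rate (percent) to VAT amount found in the text."""
    result: Dict[int, Decimal] = {}
    for rate, amount in VAT_LINE.findall(ocr_text):
        parsed = parse_amount(amount)
        if parsed is not None:
            result[int(rate)] = parsed
    return result
```

`Decimal` is used instead of `float` so that later consistency checks are not affected by binary rounding.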
**Data Structures**:
```python
@dataclass
@@ -177,6 +233,23 @@ class VATSummary:
**Goal**: Cross-validate across multiple sources to ensure 99%+ accuracy
**Status**: COMPLETED
**Completed**:
- [x] Created `VATValidator` class with TDD approach
- [x] 15 unit tests passing, 90% coverage
- [x] Mathematical verification (base × rate = vat)
- [x] Total amount check (excl + vat = incl)
- [x] Line items comparison
- [x] Amount consistency check with existing YOLO extraction
- [x] Configurable tolerance
- [x] Confidence score calculation
**Files Created**:
- `packages/backend/backend/validation/vat_validator.py`
- `tests/validation/__init__.py`
- `tests/validation/test_vat_validator.py`
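
The two arithmetic checks listed above (base × rate = vat and excl + vat = incl) can be sketched as follows. It is not the repository's `VATValidator`: the names `RateLine`, `check_vat_lines`, and `check_totals` are hypothetical, and the default tolerance of 0.50 is an assumed value chosen only for this example.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable

# Assumed default; the real validator's tolerance is configurable.
DEFAULT_TOLERANCE = Decimal("0.50")


@dataclass(frozen=True)
class RateLine:
    """One VAT rate with its taxable base and VAT amount (hypothetical type)."""

    rate: Decimal  # percentage, e.g. Decimal("25")
    base: Decimal  # amount excluding VAT
    vat: Decimal   # VAT amount stated on the invoice


def check_vat_lines(
    lines: Iterable[RateLine], tolerance: Decimal = DEFAULT_TOLERANCE
) -> bool:
    # base × rate must equal the stated VAT amount, within the tolerance.
    return all(abs(l.base * l.rate / 100 - l.vat) <= tolerance for l in lines)


def check_totals(
    total_excl: Decimal,
    total_vat: Decimal,
    total_incl: Decimal,
    tolerance: Decimal = DEFAULT_TOLERANCE,
) -> bool:
    # excl + vat must equal incl, within the tolerance.
    return abs(total_excl + total_vat - total_incl) <= tolerance
```

A small non-zero tolerance absorbs rounding differences between per-line and invoice-level amounts.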
**Data Structures**:
```python
@dataclass