Update paddle, and support invoice line item
@@ -39,27 +39,50 @@ PDF/Image

**Goal**: Verify on a separate branch whether PP-StructureV3 can correctly detect tables in Swedish invoices

**Tasks**:
1. Create the `feature/business-invoice` branch
2. Upgrade dependencies:
   - `paddlepaddle>=3.0.0`
   - `paddleocr>=3.0.0`
3. Create a PP-StructureV3 wrapper:
   - `src/table/structure_detector.py`
4. Test table detection on 5-10 real invoices
5. Verify compatibility with the existing YOLO pipeline

**Status**: COMPLETED

**Critical Files**:
- [requirements.txt](../../requirements.txt)
- [pyproject.toml](../../pyproject.toml)
- New: `src/table/structure_detector.py`

**Completed**:
- [x] Created `TableDetector` wrapper class with TDD approach
- [x] 29 unit tests passing, 84% coverage
- [x] Supports wired and wireless table detection
- [x] Lazy initialization pattern for PP-StructureV3
- [x] PaddleX 3.x API support (`LayoutParsingResultV2`)
- [x] Used existing `invoice-sm120` conda environment (PaddlePaddle 3.3, PaddleOCR 3.3.1)
- [x] Tested with real Swedish invoices: 10 tables detected across 5 PDFs
- [x] HTML table structure extraction working (`pred_html`)
- [x] Cell-level OCR text extraction working (`table_ocr_pred`)
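
The lazy initialization pattern listed above could look roughly like the following. This is a minimal sketch, not the project's actual `TableDetector`: the class name `LazyTableDetector`, the `pipeline_factory` parameter, and the `detect` method are hypothetical, and the only external call assumed is `PPStructureV3().predict(input=...)` from PaddleOCR 3.x. The factory is injectable so the wrapper can be tested without loading any model.

```python
from typing import Any, Callable, List, Optional


def _default_pipeline_factory() -> Any:
    # Imported inside the function so that importing this module
    # does not load PaddleOCR or any model weights.
    from paddleocr import PPStructureV3

    return PPStructureV3()


class LazyTableDetector:
    """Hypothetical wrapper that builds the pipeline on first use only."""

    def __init__(
        self, pipeline_factory: Callable[[], Any] = _default_pipeline_factory
    ) -> None:
        self._pipeline_factory = pipeline_factory
        self._pipeline: Optional[Any] = None

    @property
    def pipeline(self) -> Any:
        # Build the expensive pipeline once, then reuse the same instance.
        if self._pipeline is None:
            self._pipeline = self._pipeline_factory()
        return self._pipeline

    def detect(self, image_path: str) -> List[Any]:
        # Return the raw pipeline results; interpreting fields such as
        # pred_html is left to the caller in this sketch.
        return list(self.pipeline.predict(input=image_path))
```

Constructing `LazyTableDetector()` is cheap under this design; the model is only loaded when `detect` is first called, which keeps unit tests and module imports fast.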

**Verification**:

**Files Created**:
- `packages/backend/backend/table/__init__.py`
- `packages/backend/backend/table/structure_detector.py`
- `tests/table/__init__.py`
- `tests/table/test_structure_detector.py`
- `scripts/ppstructure_poc.py` (POC test script)

**POC Results**:
```
Total PDFs tested: 5
Total tables detected: 10
12d321cb-4a3a-47c6-90aa-890cecd13d91.pdf: 4 tables (14, 20, 10, 12 cells)
3c8d2673-42f7-4474-82ff-4480d6aee632.pdf: 1 table (25 cells)
52bb76c4-5a43-4c5a-81e0-d9a04002fcb1.pdf: 0 tables (letter, not invoice)
7d18a79e-7b1e-4daf-8560-f10ab04f265d.pdf: 4 tables (14, 20, 10, 12 cells)
87b95d60-d980-4037-b1b5-ba2b5d14ecc8.pdf: 1 table (25 cells)
```

**Verification Commands**:
```bash
# Test in the WSL environment
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
conda activate invoice-py311 && \
python -c 'from paddleocr import PPStructureV3; print(\"OK\")'"

# Run tests
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && \
pytest tests/table/ -v"

# Run POC with real invoices
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \
conda activate invoice-sm120 && \
cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && \
python scripts/ppstructure_poc.py"
```

---
@@ -68,6 +91,22 @@ wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && \

**Goal**: Extract structured line-item data from the detected table regions

**Status**: COMPLETED

**Completed**:
- [x] Created `LineItemsExtractor` class with TDD approach
- [x] 19 unit tests passing, 93% coverage
- [x] Supports reversed tables (header at the bottom, a PP-StructureV3 quirk)
- [x] Swedish column name mapping (Beskrivning, Antal, Belopp, etc.)
- [x] `HTMLTableParser` for table structure parsing
- [x] Automatic header detection from row content
- [x] Tested with real Swedish invoices
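
To illustrate how header detection, Swedish column mapping, and reversed-table handling can fit together, here is a minimal sketch using only the standard library. It is not the project's `LineItemsExtractor` or `HTMLTableParser`: the names `SimpleTableParser`, `find_header_index`, `extract_line_items`, and the `COLUMN_ALIASES` mapping (including its English keys) are assumptions made for this example.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Illustrative mapping only; the real extractor's mapping may differ.
COLUMN_ALIASES: Dict[str, str] = {
    "beskrivning": "description",
    "antal": "quantity",
    "belopp": "amount",
}


class SimpleTableParser(HTMLParser):
    """Collect the cell text of every <tr> in an HTML table."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def find_header_index(rows: List[List[str]]) -> Optional[int]:
    # Treat the first row with at least two known column names as the header.
    for i, row in enumerate(rows):
        hits = sum(1 for cell in row if cell.strip().lower() in COLUMN_ALIASES)
        if hits >= 2:
            return i
    return None


def extract_line_items(html: str) -> List[Dict[str, str]]:
    parser = SimpleTableParser()
    parser.feed(html)
    rows = parser.rows
    header_idx = find_header_index(rows)
    if header_idx is None:
        return []
    # A header in the last row means the table came out reversed,
    # so flip the row order before reading the data rows.
    if header_idx == len(rows) - 1 and len(rows) > 1:
        rows = rows[::-1]
        header_idx = 0
    keys = [COLUMN_ALIASES.get(c.strip().lower(), c.strip()) for c in rows[header_idx]]
    return [dict(zip(keys, (c.strip() for c in row))) for row in rows[header_idx + 1:]]
```

Detecting the header from row content rather than from position is what lets the same code path handle both normal and reversed tables.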

**Files Created**:
- `packages/backend/backend/table/line_items_extractor.py`
- `tests/table/test_line_items_extractor.py`
- `scripts/ppstructure_line_items_poc.py` (POC test script)

**Data Structures**:
```python
@dataclass
@@ -122,6 +161,23 @@ class LineItemsResult:

**Goal**: Extract multi-rate VAT information from the full OCR text

**Status**: COMPLETED

**Completed**:
- [x] Created `VATExtractor` class with TDD approach
- [x] 21 unit tests passing, 96% coverage
- [x] `AmountParser` for Swedish/European number formats
- [x] Multiple VAT rate extraction (25%, 12%, 6%, 0%)
- [x] Multiple regex patterns for different Swedish formats
- [x] Confidence score calculation based on extracted data
- [x] Mathematical consistency verification
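
As an illustration of Swedish/European amount parsing and per-rate VAT extraction, the following sketch shows one possible approach. It is not the project's `AmountParser` or `VATExtractor`: the function names `parse_amount` and `extract_vat_by_rate`, and the single `VAT_LINE` pattern (which assumes lines shaped like `Moms 25% 250,00`), are hypothetical. The real extractor uses several patterns that are not shown in this diff.

```python
import re
from decimal import Decimal
from typing import Dict


def parse_amount(text: str) -> Decimal:
    """Parse amounts such as '1 234,56', '1.234,56' or '1234.56'."""
    # Remove all whitespace, including non-breaking spaces used as
    # thousands separators.
    s = re.sub(r"\s", "", text)
    if "," in s:
        # With a decimal comma present, any dots are thousands separators.
        s = s.replace(".", "").replace(",", ".")
    # Note: a bare '1.234' stays ambiguous and is read as a decimal point.
    return Decimal(s)


# Assumed line shape: "Moms <rate>% <amount with decimal comma>".
VAT_LINE = re.compile(
    r"moms\s*(\d{1,2})\s*%[^\d\n]*?(\d[\d \u00a0.]*,\d{2})",
    re.IGNORECASE,
)


def extract_vat_by_rate(text: str) -> Dict[int, Decimal]:
    # Map each VAT rate (in percent) to the amount found on the same line.
    return {int(rate): parse_amount(amount) for rate, amount in VAT_LINE.findall(text)}
```

Using `Decimal` rather than `float` avoids rounding surprises when the extracted amounts are later compared against each other.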

**Files Created**:
- `packages/backend/backend/vat/__init__.py`
- `packages/backend/backend/vat/vat_extractor.py`
- `tests/vat/__init__.py`
- `tests/vat/test_vat_extractor.py`

**Data Structures**:
```python
@dataclass
@@ -177,6 +233,23 @@ class VATSummary:

**Goal**: Cross-validate across multiple sources to ensure 99%+ accuracy

**Status**: COMPLETED

**Completed**:
- [x] Created `VATValidator` class with TDD approach
- [x] 15 unit tests passing, 90% coverage
- [x] Mathematical verification (base × rate = vat)
- [x] Total amount check (excl + vat = incl)
- [x] Line items comparison
- [x] Amount consistency check with existing YOLO extraction
- [x] Configurable tolerance
- [x] Confidence score calculation
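
The two arithmetic checks above (base × rate = vat, and excl + vat = incl) with a configurable tolerance can be sketched as follows. This is an illustration only, not the project's `VATValidator`: the `RateLine` dataclass, the `validate_vat` function, and the default tolerance of 0.50 are assumptions made for this example.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass
class RateLine:
    """One VAT rate line (hypothetical structure for this sketch)."""

    rate_percent: Decimal
    base_amount: Decimal
    vat_amount: Decimal


def validate_vat(
    lines: List[RateLine],
    total_excl: Decimal,
    total_vat: Decimal,
    total_incl: Decimal,
    tolerance: Decimal = Decimal("0.50"),  # assumed default, not from the source
) -> List[str]:
    """Return a list of human-readable issues; an empty list means consistent."""
    issues: List[str] = []
    for line in lines:
        # Check 1: base × rate should equal the stated VAT within tolerance.
        expected = line.base_amount * line.rate_percent / Decimal(100)
        if abs(expected - line.vat_amount) > tolerance:
            issues.append(
                f"rate {line.rate_percent}%: expected VAT {expected}, got {line.vat_amount}"
            )
    # Check 2: amount excluding VAT plus VAT should equal the total.
    if abs(total_excl + total_vat - total_incl) > tolerance:
        issues.append("excl + vat does not equal incl")
    return issues
```

A tolerance is needed because invoices typically round each VAT line to whole öre or whole kronor, so exact equality would reject many valid invoices.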

**Files Created**:
- `packages/backend/backend/validation/vat_validator.py`
- `tests/validation/__init__.py`
- `tests/validation/test_vat_validator.py`

**Data Structures**:
```python
@dataclass