Add claude config

2026-01-25 16:17:39 +01:00
parent d5101e3604
commit e83a0cae36
6 changed files with 695 additions and 38 deletions
--- a/README.md
+++ b/README.md
@@ -54,8 +54,12 @@
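 - **Database storage**: Annotation results are stored in PostgreSQL, with support for incremental processing and resuming from checkpoints
 - **YOLO detection**: Detects invoice field regions with YOLOv11
 - **OCR recognition**: Extracts text from the detected regions with PaddleOCR v5
+- **Unified parsers**: payment_line and customer_number are handled by standalone parser modules
+- **Cross-validation**: payment_line data is cross-validated against the separately detected fields, with the payment_line value taking precedence
+- **Document type identification**: Automatically distinguishes invoice (has payment_line) from letter (no payment_line)
 - **Web application**: Provides a REST API and a visual interface
 - **Incremental training**: Supports continued training from an already trained model
+- **Memory optimization**: Supports a low-memory training mode (--low-memory)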

 ## Supported fields

@@ -69,6 +73,8 @@
 | 5 | plusgiro | Plusgiro number |
 | 6 | amount | Amount |
 | 7 | supplier_organisation_number | Supplier organisation number |
+| 8 | payment_line | Payment line (machine-readable format) |
+| 9 | customer_number | Customer number |

 ## Installation

@@ -132,8 +138,24 @@ python -m src.cli.train \
    --model yolo11n.pt \
    --epochs 100 \
    --batch 16 \
-    --name invoice_yolo11n_full \
+    --name invoice_fields \
    --dpi 150
+
+# Low-memory mode (for memory-constrained environments)
+python -m src.cli.train \
+    --model yolo11n.pt \
+    --epochs 100 \
+    --name invoice_fields \
+    --low-memory \
+    --workers 4 \
+    --no-cache
+
+# Resume training from a checkpoint (after an interruption)
+python -m src.cli.train \
+    --model runs/train/invoice_fields/weights/last.pt \
+    --epochs 100 \
+    --name invoice_fields \
+    --resume
 ```

 ### 4. Incremental training
@@ -164,26 +186,46 @@ python -m src.cli.train \
 ```bash
 # Command-line inference
 python -m src.cli.infer \
-    --model runs/train/invoice_yolo11n_full/weights/best.pt \
+    --model runs/train/invoice_fields/weights/best.pt \
    --input path/to/invoice.pdf \
    --output result.json \
    --gpu
+
+# Batch inference
+python -m src.cli.infer \
+    --model runs/train/invoice_fields/weights/best.pt \
+    --input invoices/*.pdf \
+    --output results/ \
+    --gpu
 ```

+**Inference results include**:
+- `fields`: extracted field values (InvoiceNumber, Amount, payment_line, customer_number, etc.)
+- `confidence`: confidence score for each field
+- `document_type`: document type ("invoice" or "letter")
+- `cross_validation`: payment_line cross-validation result (if present)
+
 ### 6. Web application

+**Starting in a WSL environment**:
+
 ```bash
-# Start the web server
+# Method 1: start from Windows PowerShell (recommended)
+wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-py311 && cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && python run_server.py --port 8000"
+
+# Method 2: start inside WSL
+conda activate invoice-py311
+cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2
 python run_server.py --port 8000

-# Development mode (auto-reload)
-python run_server.py --debug --reload
-
-# Disable GPU
-python run_server.py --no-gpu
+# Method 3: use the startup script
+./start_web.sh
 ```

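-Open **http://localhost:8000** to use the web interface.
+**After the server starts**:
+- Open **http://localhost:8000** to use the web interface
+- The server automatically loads the model `runs/train/invoice_fields/weights/best.pt`
+- GPU is enabled by default; the confidence threshold is 0.5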

 #### Web API endpoints

@@ -194,6 +236,33 @@ python run_server.py --no-gpu
 | POST | `/api/v1/infer` | Upload a file and run inference |
 | GET | `/api/v1/results/{filename}` | Get the visualization image |

+#### API response format
+
+```json
+{
+  "status": "success",
+  "result": {
+    "document_id": "abc123",
+    "document_type": "invoice",
+    "fields": {
+      "InvoiceNumber": "12345",
+      "Amount": "1234.56",
+      "payment_line": "# 94228110015950070 # > 48666036#14#",
+      "customer_number": "UMJ 436-R"
+    },
+    "confidence": {
+      "InvoiceNumber": 0.95,
+      "Amount": 0.92
+    },
+    "cross_validation": {
+      "is_valid": true,
+      "ocr_match": true,
+      "amount_match": true
+    }
+  }
+}
+```
+
 ## Training configuration

 ### YOLO training parameters
@@ -210,6 +279,10 @@ Options:
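  --name             Training run name
  --limit            Limit the number of documents (for testing)
  --device           Device (0=GPU, cpu)
+  --resume           Resume training from a checkpoint
+  --low-memory       Enable low-memory mode (batch=8, workers=4, no-cache)
+  --workers          Number of data-loading workers (default: 8)
+  --cache            Cache images in memory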
 ```

 ### Training best practices
@@ -236,14 +309,28 @@ Options:

 ### Example training results

-Results after 100 epochs with roughly 10,000 training images:
+**Latest training results** (100 epochs, 2026-01-22):

 | Metric | Value |
 |------|-----|
-| **mAP@0.5** | 98.7% |
-| **mAP@0.5-0.95** | 87.4% |
-| **Precision** | 97.5% |
-| **Recall** | 95.5% |
+| **mAP@0.5** | 93.5% |
+| **mAP@0.5-0.95** | 83.0% |
+| **Training set** | ~10,000 annotated images |
+| **Field types** | 10 fields (payment_line and customer_number newly added) |
+| **Model location** | `runs/train/invoice_fields/weights/best.pt` |
+
+**Per-field detection performance**:
+- Basic invoice information (InvoiceNumber, InvoiceDate, InvoiceDueDate): >95% mAP
+- Payment information (OCR, Bankgiro, Plusgiro, Amount): >90% mAP
+- Organisation information (supplier_org_number, customer_number): >85% mAP
+- Payment line (payment_line): >80% mAP
+
+**Model files**:
+```
+runs/train/invoice_fields/weights/
+├── best.pt          # Best model (highest mAP@0.5) ⭐ recommended for production
+└── last.pt          # Last checkpoint (for continuing training)
+```

 > Note: more data is still being annotated; the final training set is expected to contain 25,000+ annotated images.

@@ -262,15 +349,18 @@ invoice-master-poc-v2/
 │   │   ├── renderer.py   # Image rendering
 │   │   └── detector.py   # Type detection
 │   ├── ocr/              # PaddleOCR wrapper
+│   │   └── machine_code_parser.py  # Machine-readable payment line parser
 │   ├── normalize/        # Field normalization
 │   ├── matcher/          # Field matching
 │   ├── yolo/             # YOLO-related code
 │   │   ├── annotation_generator.py
 │   │   └── db_dataset.py
 │   ├── inference/        # Inference pipeline
-│   │   ├── pipeline.py
-│   │   ├── yolo_detector.py
-│   │   └── field_extractor.py
+│   │   ├── pipeline.py               # Main inference flow
+│   │   ├── yolo_detector.py          # YOLO detection
+│   │   ├── field_extractor.py        # Field extraction
+│   │   ├── payment_line_parser.py    # Payment line parser
+│   │   └── customer_number_parser.py # Customer number parser
 │   ├── processing/       # Multi-pool processing architecture
 │   │   ├── worker_pool.py
 │   │   ├── cpu_pool.py
@@ -278,20 +368,33 @@ invoice-master-poc-v2/
 │   │   ├── task_dispatcher.py
 │   │   └── dual_pool_coordinator.py
 │   ├── web/              # Web application
-│   │   ├── app.py        # FastAPI application
+│   │   ├── app.py        # FastAPI application entry point
 │   │   ├── routes.py     # API routes
 │   │   ├── services.py   # Business logic
-│   │   ├── schemas.py    # Data models
-│   │   └── config.py     # Configuration
+│   │   └── schemas.py    # Data models
+│   ├── utils/            # Utility modules
+│   │   ├── text_cleaner.py      # Text cleaning
+│   │   ├── validators.py        # Field validation
+│   │   ├── fuzzy_matcher.py     # Fuzzy matching
+│   │   └── ocr_corrections.py   # OCR error correction
 │   └── data/             # Data processing
+├── tests/                # Tests
+│   ├── ocr/              # OCR module tests
+│   │   └── test_machine_code_parser.py
+│   ├── inference/        # Inference module tests
+│   ├── normalize/        # Normalization module tests
+│   └── utils/            # Utility module tests
+├── docs/                 # Documentation
+│   ├── REFACTORING_SUMMARY.md
+│   └── TEST_COVERAGE_IMPROVEMENT.md
 ├── config.py             # Configuration file
 ├── run_server.py         # Web server startup script
 ├── runs/                 # Training output
 │   └── train/
-│       └── invoice_yolo11n_full/
+│       └── invoice_fields/
 │           └── weights/
-│               ├── best.pt
-│               └── last.pt
+│               ├── best.pt      # Best model
+│               └── last.pt      # Last checkpoint
 └── requirements.txt
 ```

@@ -410,14 +513,15 @@ Options:
 ## Python API

 ```python
-from src.inference import InferencePipeline
+from src.inference.pipeline import InferencePipeline

 # Initialize
 pipeline = InferencePipeline(
-    model_path='runs/train/invoice_yolo11n_full/weights/best.pt',
-    confidence_threshold=0.3,
+    model_path='runs/train/invoice_fields/weights/best.pt',
+    confidence_threshold=0.25,
    use_gpu=True,
-    dpi=150
+    dpi=150,
+    enable_fallback=True
 )

 # Process a PDF
@@ -427,26 +531,194 @@ result = pipeline.process_pdf('invoice.pdf')
 result = pipeline.process_image('invoice.png')

 # Get the results
-print(result.fields)       # {'InvoiceNumber': '12345', 'Amount': '1234.56', ...}
+print(result.fields)
+# {
+#   'InvoiceNumber': '12345',
+#   'Amount': '1234.56',
+#   'payment_line': '# 94228110015950070 # > 48666036#14#',
+#   'customer_number': 'UMJ 436-R',
+#   ...
+# }
+
 print(result.confidence)   # {'InvoiceNumber': 0.95, 'Amount': 0.92, ...}
 print(result.to_json())    # JSON output
+
+# Access the cross-validation result
+if result.cross_validation:
+    print(f"OCR match: {result.cross_validation.ocr_match}")
+    print(f"Amount match: {result.cross_validation.amount_match}")
+    print(f"Details: {result.cross_validation.details}")
+```
+
+### Using the unified parsers
+
+```python
+from src.inference.payment_line_parser import PaymentLineParser
+from src.inference.customer_number_parser import CustomerNumberParser
+
+# Parse a payment line
+parser = PaymentLineParser()
+result = parser.parse("# 94228110015950070 # 15658 00 8 > 48666036#14#")
+print(f"OCR: {result.ocr_number}")
+print(f"Amount: {result.amount}")
+print(f"Account: {result.account_number}")
+
+# Parse a customer number
+parser = CustomerNumberParser()
+result = parser.parse("Said, Shakar Umj 436-R Billo")
+print(f"Customer Number: {result}")  # "UMJ 436-R"
+```
+
+## Testing
+
+### Test statistics
+
+| Metric | Value |
+|------|------|
+| **Total tests** | 688 |
+| **Pass rate** | 100% |
+| **Overall coverage** | 37% |
+
+### Coverage of key modules
+
+| Module | Coverage | Tests |
+|------|--------|--------|
+| `machine_code_parser.py` | 65% | 79 |
+| `payment_line_parser.py` | 85% | 45 |
+| `customer_number_parser.py` | 90% | 32 |
+
+### Running the tests
+
+```bash
+# Run all tests
+wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-py311 && cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && pytest"
+
+# Run with a coverage report
+wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-py311 && cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && pytest --cov=src --cov-report=term-missing"
+
+# Run the tests for a specific module
+wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-py311 && cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && pytest tests/ocr/test_machine_code_parser.py -v"
+```
+
+### Test layout
+
+```
+tests/
+├── ocr/
+│   ├── test_machine_code_parser.py   # Payment line parsing (79 tests)
+│   └── test_ocr_engine.py            # OCR engine tests
+├── inference/
+│   ├── test_payment_line_parser.py   # Payment line parser
+│   └── test_customer_number_parser.py # Customer number parser
+├── normalize/
+│   └── test_normalizers.py           # Field normalization
+└── utils/
+    └── test_validators.py            # Field validation
 ```

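 ## Development status

+**Completed**:
 - [x] Automatic annotation of text-layer PDFs
 - [x] Automatic annotation of scanned images via OCR
 - [x] Multi-strategy field matching (exact/substring/normalized)
 - [x] PostgreSQL database storage (resumable processing)
 - [x] Signal handling and timeout protection
-- [x] YOLO training (98.7% mAP@0.5)
+- [x] YOLO training (93.5% mAP@0.5, 10 fields)
 - [x] Inference pipeline
 - [x] Field normalization and validation
-- [x] Web application (FastAPI + frontend UI)
+- [x] Web application (FastAPI + REST API)
 - [x] Incremental training support
+- [x] Memory-optimized training (--low-memory, --resume)
+- [x] Payment Line parser (unified module)
+- [x] Customer Number parser (unified module)
+- [x] Payment Line cross-validation (OCR, Amount, Account)
+- [x] Document type identification (invoice/letter)
+- [x] Unit test coverage (688 tests, 37% coverage)
+
+**In progress**:
 - [ ] Complete annotation of all 25,000+ documents
-- [ ] Table items handling
-- [ ] Quantized model deployment
+- [ ] Multi-source fusion enhancement
+- [ ] OCR error correction integration
+- [ ] Raise test coverage to 60%+
+
+**Planned**:
+- [ ] Table items extraction
+- [ ] Quantized model deployment (ONNX/TensorRT)
+- [ ] Multi-language support expansion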
+
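+## Key technical features
+
+### 1. Payment Line cross-validation
+
+The payment_line on a Swedish invoice carries the complete payment information: OCR reference number, amount, and account number. We implemented a cross-validation mechanism: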
+
+```
+Payment Line: # 94228110015950070 # 15658 00 8 > 48666036#14#
+             ↓                     ↓            ↓
+           OCR Number            Amount     Bankgiro Account
+```
+
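+**Validation flow**:
+1. Extract OCR, Amount, and Account from the payment_line
+2. Compare them against the separately detected fields
+3. **payment_line takes precedence**: on a mismatch, the payment_line value is used
+4. Return the validation result and details
+
+**Benefits**:
+- Improves data accuracy (payment_line is a machine-readable format and therefore more reliable)
+- Catches OCR or detection errors
+- Provides a confidence indicator for data quality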
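
Editor's sketch (not part of this commit): the validation flow above could look roughly like the following. The `CrossValidation` container, the `cross_validate` function, and the `OCR`/`Amount` field keys are assumptions made for illustration; the project's real pipeline may be structured differently.

```python
from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation
from typing import Dict, List, Optional, Tuple


@dataclass
class CrossValidation:
    """Hypothetical result container; the project's real class may differ."""
    is_valid: bool
    ocr_match: Optional[bool] = None      # None means nothing was detected to compare
    amount_match: Optional[bool] = None
    details: List[str] = field(default_factory=list)


def _same_digits(a: str, b: str) -> bool:
    """Compare two reference numbers, ignoring spaces and punctuation."""
    only_digits = lambda s: "".join(ch for ch in s if ch.isdigit())
    return only_digits(a) == only_digits(b)


def _same_amount(a: str, b: str) -> bool:
    """Compare two amounts numerically, accepting ',' or '.' as the decimal mark."""
    try:
        to_decimal = lambda s: Decimal(s.replace(" ", "").replace(",", "."))
        return to_decimal(a) == to_decimal(b)
    except InvalidOperation:
        return False


def cross_validate(fields: Dict[str, str], pl_ocr: str, pl_amount: str
                   ) -> Tuple[Dict[str, str], CrossValidation]:
    """Check detected fields against payment_line values; payment_line wins."""
    merged = dict(fields)
    result = CrossValidation(is_valid=True)
    checks = (("OCR", pl_ocr, _same_digits, "ocr_match"),
              ("Amount", pl_amount, _same_amount, "amount_match"))
    for name, trusted, same, attr in checks:
        detected = fields.get(name)
        if detected is not None:
            match = same(detected, trusted)
            setattr(result, attr, match)
            if not match:
                result.is_valid = False
                result.details.append(f"{name}: {detected!r} replaced by {trusted!r}")
        merged[name] = trusted  # the payment_line value always takes precedence
    return merged, result
```

Returning both the merged fields and the validation record keeps the "payment_line wins" rule separate from the reporting of mismatches, so a caller can log `details` without re-deriving them.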
+
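+### 2. Unified parser architecture
+
+Complex fields are handled by standalone parser modules:
+
+**PaymentLineParser**:
+- Parses the standard Swedish payment line format
+- Extracts OCR, Amount (Kronor + Öre), and Account + Check digits
+- Supports several format variants
+
+**CustomerNumberParser**:
+- Supports several Swedish customer number formats (`UMJ 436-R`, `JTY 576-3`, `FFL 019N`)
+- Extracts the number from mixed text (e.g. a customer number embedded in an address line)
+- Case-insensitive; output is normalized to upper case
+
+**Benefits**:
+- Modular, testable code
+- Easy to extend with new formats
+- One shared parsing logic, with less duplicated code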
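
Editor's sketch (not part of this commit): a regex-based approximation of what the two parsers do, covering only the sample strings quoted in this README. The patterns, the function names `parse_payment_line` and `parse_customer_number`, and the reading of the single digit after the öre as a check digit are all assumptions; the real `PaymentLineParser` and `CustomerNumberParser` support more variants.

```python
import re
from decimal import Decimal
from typing import Dict, Optional

# Assumed layout: "# <ocr> # <kronor> <ore> <check> > <account>#<trailing digits>#"
PAYMENT_LINE_RE = re.compile(
    r"#\s*(?P<ocr>\d+)\s*#\s*(?P<kronor>\d+)\s+(?P<ore>\d{2})\s+(?P<check>\d)"
    r"\s*>\s*(?P<account>\d+)\s*#\s*(?P<trailing>\d+)\s*#"
)

# Assumed shape: three letters, three digits, then "-X" or a single trailing letter.
CUSTOMER_NUMBER_RE = re.compile(
    r"\b([A-Za-z]{3})\s+(\d{3}(?:-[A-Za-z0-9]|[A-Za-z]))\b"
)


def parse_payment_line(text: str) -> Optional[Dict[str, object]]:
    """Extract OCR number, amount and account from a payment line, or None."""
    m = PAYMENT_LINE_RE.search(text)
    if m is None:
        return None
    return {
        "ocr_number": m["ocr"],
        "amount": Decimal(f"{m['kronor']}.{m['ore']}"),  # kronor + öre
        "account_number": m["account"],
    }


def parse_customer_number(text: str) -> Optional[str]:
    """Find a customer number inside mixed text and return it in upper case."""
    m = CUSTOMER_NUMBER_RE.search(text)
    if m is None:
        return None
    return f"{m.group(1)} {m.group(2)}".upper()
```

Requiring a suffix after the three digits keeps the customer-number pattern from matching ordinary text such as a post box number, at the cost of rejecting formats not listed above.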
+
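+### 3. Automatic document type identification
+
+The document type is determined automatically from the payment_line field:
+
+- **invoice**: an invoice document that contains a payment_line
+- **letter**: a letter document without a payment_line
+
+This helps downstream systems route documents to the right processing flow.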
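
Editor's sketch (not part of this commit): the rule above reduces to a presence check on one field. The function name `classify_document` and the choice to treat a whitespace-only value as absent are assumptions for illustration.

```python
from typing import Mapping, Optional


def classify_document(fields: Mapping[str, Optional[str]]) -> str:
    """Return "invoice" when a non-empty payment_line exists, else "letter"."""
    payment_line = fields.get("payment_line")
    # Assumption: a value made only of whitespace counts as "no payment_line".
    if payment_line and payment_line.strip():
        return "invoice"
    return "letter"
```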
+
+### 4. Low-memory training mode
+
+Training is supported in memory-constrained environments:
+
+```bash
+python -m src.cli.train --low-memory
+```
+
+Automatic adjustments:
+- batch size: 16 → 8
+- workers: 8 → 4
+- cache: disabled
+- Recommended when GPU memory is < 8GB or system memory is < 16GB
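
Editor's sketch (not part of this commit): one way a `--low-memory` flag could apply the adjustments listed above, using `argparse`. The defaults are taken from this README; the function names and the decision to cap values with `min` (so an explicitly smaller `--batch` is kept) are assumptions, and the real `src.cli.train` may behave differently.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical subset of the training CLI options."""
    parser = argparse.ArgumentParser(prog="train")
    parser.add_argument("--batch", type=int, default=16)
    parser.add_argument("--workers", type=int, default=8)
    parser.add_argument("--cache", action="store_true")
    parser.add_argument("--low-memory", action="store_true")
    return parser


def effective_settings(argv: List[str]) -> Dict[str, object]:
    """Parse argv and apply the low-memory overrides when requested."""
    args = build_parser().parse_args(argv)
    if args.low_memory:
        args.batch = min(args.batch, 8)      # 16 -> 8, smaller explicit values kept
        args.workers = min(args.workers, 4)  # 8 -> 4
        args.cache = False                   # never cache images in this mode
    return {"batch": args.batch, "workers": args.workers, "cache": args.cache}
```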
+
+### 5. Resuming training from a checkpoint
+
+After an interruption, training can resume from a checkpoint:
+
+```bash
+python -m src.cli.train --resume --model runs/train/invoice_fields/weights/last.pt
+```

 ## Tech stack

@@ -457,7 +729,33 @@ print(result.to_json())    # JSON output
 | **PDF processing** | PyMuPDF (fitz) |
 | **Database** | PostgreSQL + psycopg2 |
 | **Web framework** | FastAPI + Uvicorn |
-| **Deep learning** | PyTorch + CUDA |
+| **Deep learning** | PyTorch + CUDA 12.x |
+
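+## FAQ
+
+**Q: Why does it have to run in a WSL environment?**
+
+A: PaddleOCR and some dependencies have compatibility problems on native Windows. WSL provides a full Linux environment in which all dependencies work correctly.
+
+**Q: What should I do about OOM (out-of-memory) errors during training?**
+
+A: Use `--low-memory` mode, or adjust the `--batch` and `--workers` parameters manually.
+
+**Q: What happens when payment_line and the separately detected fields disagree?**
+
+A: By default the system prefers the payment_line value, because payment_line is a machine-readable format and is usually more accurate. The validation result is recorded in the `cross_validation` field.
+
+**Q: How do I add a new field type?**
+
+A:
+1. Add the field definition in `src/inference/constants.py`
+2. Add a normalization method in `field_extractor.py`
+3. Regenerate the annotation data
+4. Retrain the model from scratch
+
+**Q: Can I train on CPU?**
+
+A: Yes, but it will be very slow (10-50x slower). Training on a GPU is strongly recommended.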

 ## License