Initial commit: Invoice field extraction system using YOLO + OCR

Features:
- Auto-labeling pipeline: CSV values -> PDF search -> YOLO annotations
- Flexible date matching: year-month match, nearby date tolerance
- PDF text extraction with PyMuPDF
- OCR support for scanned documents (PaddleOCR)
- YOLO training and inference pipeline
- 7 field types: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 17:44:14 +01:00
commit 8938661850
35 changed files with 5020 additions and 0 deletions
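
The commit message lists flexible date matching (year-month match, nearby date tolerance) without giving details. A rough sketch of what such a matcher might look like follows; the function `match_date`, the three-day default tolerance, and the returned labels are all assumptions made for illustration, not the repository's implementation.

```python
from datetime import date
from typing import Optional


def match_date(expected: date, candidate: date, tolerance_days: int = 3) -> Optional[str]:
    """Classify how closely a date found in a PDF matches the CSV value.

    Hypothetical helper: the tolerance and the labels are illustrative only.
    """
    if candidate == expected:
        return "exact"
    # Nearby-date tolerance: accept dates within a few days of the expected one.
    if abs((candidate - expected).days) <= tolerance_days:
        return "nearby"
    # Year-month match: same year and month, the day may differ.
    if (candidate.year, candidate.month) == (expected.year, expected.month):
        return "year_month"
    return None


if __name__ == "__main__":
    print(match_date(date(2025, 12, 13), date(2025, 12, 15)))
```

Checking the tolerance before the year-month rule means a date two days away across a month boundary still counts as a nearby match rather than being rejected.
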
--- a/README.md
+++ b/README.md
@@ -0,0 +1,226 @@
+# Invoice Master POC v2
+
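+Automated invoice information extraction system: uses YOLO + OCR to extract structured data from PDF invoices.
+
+## Runtime Environment
+
+> **Important**: This project must be run under **WSL (Windows Subsystem for Linux)**.
+
+### System Requirements
+
+- WSL 2 (Ubuntu 22.04 recommended)
+- Python 3.10+
+- **NVIDIA GPU + CUDA 12.x (strongly recommended)** - GPU training is 10-50x faster than CPU training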
+
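+## Features
+
+- **Dual-mode PDF processing**: handles both text-layer PDFs and scanned (image-only) PDFs
+- **Auto-labeling**: generates YOLO training data automatically from existing structured CSV data
+- **Field detection**: uses YOLOv8 to detect invoice field regions
+- **OCR recognition**: uses PaddleOCR to extract text from the detected regions
+- **Smart matching**: supports normalization of multiple value formats, boosted by context keywords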
+
+## Supported Fields
+
+| Field | Description |
+|------|------|
+| InvoiceNumber | Invoice number |
+| InvoiceDate | Invoice date |
+| InvoiceDueDate | Due date |
+| OCR | OCR reference number (Sweden) |
+| Bankgiro | Bankgiro number |
+| Plusgiro | Plusgiro number |
+| Amount | Amount |
+
+## Installation (WSL)
+
+### 1. Enter the WSL Environment
+
+```bash
+# Enter WSL from a Windows terminal
+wsl
+
+# Change to the project directory (Windows drives are mounted under /mnt/)
+cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2
+```
+
+### 2. Install System Dependencies
+
+```bash
+# Update the system
+sudo apt update && sudo apt upgrade -y
+
+# Install Python and required tools
+sudo apt install -y python3.10 python3.10-venv python3-pip
+
+# Install OpenCV dependencies
+sudo apt install -y libgl1-mesa-glx libglib2.0-0 libsm6 libxrender1 libxext6
+```
+
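+### 3. Create a Virtual Environment and Install Dependencies
+
+```bash
+# Create and activate a virtual environment
+python3 -m venv venv
+source venv/bin/activate
+
+# Upgrade pip
+pip install --upgrade pip
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Or install the project in editable (development) mode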
+pip install -e .
+```
+
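+### GPU Support (Optional)
+
+```bash
+# Make sure CUDA is configured in WSL
+nvidia-smi  # Check that the GPU is visible
+
+# Install the GPU build of PaddlePaddle
+pip install paddlepaddle-gpu
+
+# Or pin a build for a specific CUDA version (post118 targets CUDA 11.8)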
+pip install paddlepaddle-gpu==2.5.2.post118 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
+```
+
+## Quick Start
+
+### 1. Prepare the Data
+
+```
+data/
+├── raw_pdfs/
+│   ├── {DocumentId}.pdf
+│   └── ...
+└── structured_data/
+    └── invoices.csv
+```
+
+CSV format:
+```csv
+DocumentId,InvoiceDate,InvoiceNumber,InvoiceDueDate,OCR,Bankgiro,Plusgiro,Amount
+3be53fd7-...,2025-12-13,100017500321,2026-01-03,100017500321,53939484,,114
+```
+
+### 2. Auto-Labeling
+
+```bash
+python -m src.cli.autolabel \
+    --csv data/structured_data/invoices.csv \
+    --pdf-dir data/raw_pdfs \
+    --output data/dataset \
+    --report reports/autolabel_report.jsonl
+```
+
+### 3. Train the Model
+
+> **Important**: Always train on a GPU. CPU training is very slow.
+
+```bash
+# GPU training (strongly recommended)
+python -m src.cli.train \
+    --data data/dataset/dataset.yaml \
+    --model yolo11n.pt \
+    --epochs 100 \
+    --batch 16 \
+    --device 0  # use GPU 0
+
+# Verify that the GPU is available
+python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else None}')"
+```
+
+GPU vs. CPU training time (100 epochs, 77 training images):
+- **GPU (RTX 5080)**: ~2 minutes
+- **CPU**: 30+ minutes
+
+### 4. Inference
+
+```bash
+python -m src.cli.infer \
+    --model runs/train/invoice_fields/weights/best.pt \
+    --input path/to/invoice.pdf \
+    --output result.json
+```
+
+## Example Output
+
+```json
+{
+  "DocumentId": "3be53fd7-d5ea-458c-a229-8d360b8ba6a9",
+  "InvoiceNumber": "100017500321",
+  "InvoiceDate": "2025-12-13",
+  "InvoiceDueDate": "2026-01-03",
+  "OCR": "100017500321",
+  "Bankgiro": "5393-9484",
+  "Plusgiro": null,
+  "Amount": "114.00",
+  "confidence": {
+    "InvoiceNumber": 0.96,
+    "InvoiceDate": 0.92,
+    "Amount": 0.93
+  }
+}
+```
+
+## Project Structure
+
+```
+invoice-master-poc-v2/
+├── src/
+│   ├── pdf/           # PDF processing
+│   ├── ocr/           # OCR extraction
+│   ├── normalize/     # Field normalization
+│   ├── matcher/       # Field matching
+│   ├── yolo/          # YOLO annotation generation
+│   ├── inference/     # Inference pipeline
+│   ├── data/          # Data loading
+│   └── cli/           # Command-line tools
+├── configs/           # Configuration files
+├── data/              # Data directory
+└── requirements.txt
+```
+
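+## Development Priorities
+
+1. ✅ Auto-labeling for text-layer PDFs
+2. ✅ OCR-based auto-labeling for scanned PDFs
+3. 🔄 Stabilize the three fields Amount / OCR / Bankgiro
+4. ⏳ Extend to dates and Plusgiro
+5. ⏳ Handle table line items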
+
+## Configuration
+
+Edit `configs/default.yaml` to customize:
+- PDF rendering DPI
+- OCR language
+- Matching confidence threshold
+- Context keywords
+- Data augmentation parameters
+
+## API Usage
+
+```python
+from src.inference import InferencePipeline
+
+# Initialize the pipeline
+pipeline = InferencePipeline(
+    model_path='models/best.pt',
+    confidence_threshold=0.5,
+    ocr_lang='en'
+)
+
+# Process a PDF
+result = pipeline.process_pdf('invoice.pdf')
+
+# Read the extracted fields and their confidence scores
+print(result.fields)
+print(result.confidence)
+```
+
+## License
+
+MIT License
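
The sample CSV row and the example output in the README differ in formatting: Bankgiro `53939484` appears as `5393-9484`, and Amount `114` appears as `114.00`. A minimal sketch of that kind of normalization follows, assuming the usual Swedish convention that a Bankgiro number has 7 or 8 digits with a dash before the last four. The function names `format_bankgiro` and `format_amount` are hypothetical and are not taken from the repository's `src/normalize` module.

```python
import re
from decimal import Decimal, ROUND_HALF_UP


def format_bankgiro(raw: str) -> str:
    """Format a 7- or 8-digit Bankgiro number as NNN-NNNN or NNNN-NNNN."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, and other separators
    if len(digits) not in (7, 8):
        raise ValueError(f"expected 7 or 8 digits, got {len(digits)}: {raw!r}")
    return f"{digits[:-4]}-{digits[-4:]}"


def format_amount(raw: str) -> str:
    """Render an amount with two decimals.

    Assumption: spaces are the only thousands separators, and either
    ',' or '.' is the decimal mark (so '1,234.50' is NOT supported).
    """
    cleaned = raw.replace(" ", "").replace("\u00a0", "").replace(",", ".")
    value = Decimal(cleaned).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return str(value)


if __name__ == "__main__":
    print(format_bankgiro("53939484"))
    print(format_amount("114"))
```

Using `Decimal` rather than `float` keeps monetary values exact, which matters when the normalized string is later compared against the text found in the PDF.
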