kai/invoice-master-poc-v2

Yaojia Wang 4126196dea Add report

2026-02-01 01:49:50 +01:00

Invoice Master POC v2

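An automated invoice field extraction system that uses YOLOv11 + PaddleOCR to extract structured data from Swedish PDF invoices.

Project Overview

This project implements a complete pipeline for automated invoice field extraction:

Auto-labeling: uses existing structured CSV data plus OCR to generate YOLO training annotations automatically
Model training: trains a field detection model with YOLOv11, with data augmentation support
Inference: detect field regions -> extract text with OCR -> normalize fields
Web management: React frontend + FastAPI backend for document management, dataset building, model training, and version management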

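Architecture

The project is a monorepo split into three packages, so training and inference can be deployed independently:

packages/
├── shared/      # Shared library (PDF, OCR, normalization, matching, storage, training)
├── training/    # Training service (GPU, started on demand)
└── inference/   # Inference service (always running)
frontend/        # React frontend (Vite + TypeScript + TailwindCSS)

Service	Deployment target	GPU	Lifecycle
Frontend	Vercel / Nginx	No	Always on
Inference	Azure App Service / AWS	Optional	Always on, 24/7
Training	Azure ACI / AWS ECS	Required	Started and destroyed on demand

The inference and training services communicate through a shared PostgreSQL database. The inference service triggers training tasks through its API, and the training service picks up tasks from the database and runs them.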

Current Progress

Metric	Value
Labeled documents	9,738 (9,709 successful)
Overall field match rate	94.8% (82,604/87,121)
Tests	1,601 passed
Test coverage	28%
Model mAP@0.5	93.5%

Per-field match rates:

Field	Match rate	Description
supplier_accounts(Bankgiro)	100.0%	Supplier Bankgiro
supplier_accounts(Plusgiro)	100.0%	Supplier Plusgiro
Plusgiro	99.4%	Payment Plusgiro
OCR	99.1%	OCR reference number
Bankgiro	99.0%	Payment Bankgiro
InvoiceNumber	98.9%	Invoice number
InvoiceDueDate	95.9%	Due date
InvoiceDate	95.5%	Invoice date
Amount	91.3%	Amount
supplier_organisation_number	78.2%	Supplier organisation number (CSV data quality issues)

Runtime Environment

This project must be run in a WSL + Conda environment.

System Requirements

Component	Requirement
WSL	WSL 2 + Ubuntu 22.04
Conda	Miniconda or Anaconda
Python	3.11+ (managed through Conda)
GPU	NVIDIA GPU + CUDA 12.x (strongly recommended)
Database	PostgreSQL (stores labeling results)

Installation

# 1. Enter WSL
wsl -d Ubuntu-22.04

# 2. Create the Conda environment
conda create -n invoice-py311 python=3.11 -y
conda activate invoice-py311

# 3. Change to the project directory
cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2

# 4. Install the three packages (editable mode)
pip install -e packages/shared
pip install -e packages/training
pip install -e packages/inference

Project Structure

invoice-master-poc-v2/
├── packages/
│   ├── shared/                     # Shared library
│   │   ├── setup.py
│   │   └── shared/
│   │       ├── pdf/                # PDF processing (extraction, rendering, detection)
│   │       ├── ocr/                # PaddleOCR wrapper + machine-readable code parsing
│   │       ├── normalize/          # Field normalization (10 normalizers)
│   │       ├── matcher/            # Field matching (exact / substring / fuzzy)
│   │       ├── storage/            # Storage abstraction layer (Local/Azure/S3)
│   │       ├── training/           # Shared training components (YOLOTrainer)
│   │       ├── augmentation/       # Data augmentation (DatasetAugmenter)
│   │       ├── utils/              # Utilities (validation, cleaning, fuzzy matching)
│   │       ├── data/               # DocumentDB, CSVLoader
│   │       ├── config.py           # Global configuration (database, paths, DPI)
│   │       └── exceptions.py       # Exception definitions
│   │
│   ├── training/                   # Training service (GPU, on demand)
│   │   ├── setup.py
│   │   ├── Dockerfile
│   │   ├── run_training.py         # Entry point (--task-id or --poll)
│   │   └── training/
│   │       ├── cli/                # train, autolabel, analyze_*, validate
│   │       ├── yolo/               # db_dataset, annotation_generator
│   │       ├── processing/         # CPU/GPU worker pool, task dispatcher
│   │       └── data/               # training_db, autolabel_report
│   │
│   └── inference/                  # Inference service (always running)
│       ├── setup.py
│       ├── Dockerfile
│       ├── run_server.py           # Web server entry point
│       └── inference/
│           ├── cli/                # infer, serve
│           ├── pipeline/           # YOLO detection, field extraction, parsers
│           ├── web/                # FastAPI application
│           │   ├── api/v1/         # REST API (admin, public, batch)
│           │   ├── schemas/        # Pydantic data models
│           │   ├── services/       # Business logic
│           │   ├── core/           # Authentication, scheduler, rate limiting
│           │   └── workers/        # Background task queue
│           ├── validation/         # LLM validator
│           ├── data/               # AdminDB, AsyncRequestDB, Models
│           └── azure/              # ACI training trigger
│
├── frontend/                       # React frontend (Vite + TypeScript + TailwindCSS)
│   ├── src/
│   │   ├── api/                    # API client (axios + react-query)
│   │   ├── components/             # UI components
│   │   │   ├── Dashboard.tsx       # Document management panel
│   │   │   ├── Training.tsx        # Training management (datasets/tasks)
│   │   │   ├── Models.tsx          # Model version management
│   │   │   ├── DatasetDetail.tsx   # Dataset details
│   │   │   └── InferenceDemo.tsx   # Inference demo
│   │   └── hooks/                  # React Query hooks
│   └── package.json
│
├── migrations/                     # Database migrations (SQL)
│   ├── 003_training_tasks.sql
│   ├── 004_training_datasets.sql
│   ├── 005_add_group_key.sql
│   ├── 006_model_versions.sql
│   ├── 007_training_tasks_extra_columns.sql
│   ├── 008_fix_model_versions_fk.sql
│   ├── 009_add_document_category.sql
│   └── 010_add_dataset_training_status.sql
│
├── tests/                          # Tests (1,601 tests)
├── docker-compose.yml              # Local development (postgres + inference + training)
├── run_server.py                   # Convenience launch script
└── runs/train/                     # Training output (weights, curves)

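Supported Fields

Class ID	Field name	Description
0	invoice_number	Invoice number
1	invoice_date	Invoice date
2	invoice_due_date	Due date
3	ocr_number	OCR reference number (Swedish payment system)
4	bankgiro	Bankgiro number
5	plusgiro	Plusgiro number
6	amount	Amount
7	supplier_organisation_number	Supplier organisation number
8	payment_line	Payment line (machine-readable format)
9	customer_number	Customer number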
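
The class IDs in this table are the label indices a YOLO-format dataset would carry. A minimal sketch of that mapping in Python follows; `CLASS_NAMES` and `class_id_for` are illustrative names and are not taken from the project code.

```python
# Class ID -> field name, copied from the table of supported fields.
# CLASS_NAMES and class_id_for are hypothetical names used for illustration.
CLASS_NAMES = {
    0: "invoice_number",
    1: "invoice_date",
    2: "invoice_due_date",
    3: "ocr_number",
    4: "bankgiro",
    5: "plusgiro",
    6: "amount",
    7: "supplier_organisation_number",
    8: "payment_line",
    9: "customer_number",
}


def class_id_for(field_name: str) -> int:
    """Return the class ID for a field name, raising KeyError if it is unknown."""
    for class_id, name in CLASS_NAMES.items():
        if name == field_name:
            return class_id
    raise KeyError(field_name)


print(class_id_for("amount"))
```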

Quick Start

1. Auto-labeling

# Dual-pool mode (CPU + GPU)
python -m training.cli.autolabel \
    --dual-pool \
    --cpu-workers 3 \
    --gpu-workers 1

# Single-pool mode (4 workers)
python -m training.cli.autolabel --workers 4

2. Train the model

# Train starting from a pretrained model
python -m training.cli.train \
    --model yolo11n.pt \
    --epochs 100 \
    --batch 16 \
    --name invoice_fields \
    --dpi 150

# Low-memory mode
python -m training.cli.train \
    --model yolo11n.pt \
    --epochs 100 \
    --name invoice_fields \
    --low-memory

# Resume training from a checkpoint
python -m training.cli.train \
    --model runs/train/invoice_fields/weights/last.pt \
    --epochs 100 \
    --name invoice_fields \
    --resume

3. Inference

# Command-line inference
python -m inference.cli.infer \
    --model runs/train/invoice_fields/weights/best.pt \
    --input path/to/invoice.pdf \
    --output result.json \
    --gpu

4. Web application

# Start the backend from Windows PowerShell
wsl bash -c "source ~/miniconda3/etc/profile.d/conda.sh && conda activate invoice-py311 && cd /mnt/c/Users/yaoji/git/ColaCoder/invoice-master-poc-v2 && python run_server.py --port 8000"

# Start the frontend
cd frontend && npm install && npm run dev
# Open http://localhost:5173

5. Local development with Docker

docker-compose up
# inference: http://localhost:8000
# training: runs in polling mode and picks up tasks automatically

Training Trigger Flow

The inference service triggers training through its API, and the training itself runs on a separate GPU instance:

Inference API                    PostgreSQL              Training (ACI)
    |                                |                        |
    POST /admin/training/trigger     |                        |
    |-> INSERT training_tasks ------>| status=pending         |
    |-> Azure SDK: create ACI --------------------------------> start
    |                                |                        |
    |                                |<-- SELECT pending -----+
    |                                |--- UPDATE running -----+
    |                                |                   run training...
    |                                |<-- UPDATE completed ---+
    |                                |    + model_path        |
    |                                |    + metrics      auto shutdown
    |                                |                        |
    GET /admin/training/{id}         |                        |
    |-> SELECT training_tasks ------>|                        |
    +-- return status + metrics      |                        |
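
The database hand-off in this flow can be sketched in a few lines. The example below is a simplified illustration, not the project's implementation: SQLite stands in for PostgreSQL so it runs anywhere, and every function name and every column other than `status` and `model_path` is an assumption.

```python
import sqlite3

# SQLite replaces PostgreSQL here purely so the sketch is runnable.
# Only the table name and the status values come from the README;
# the schema and function names are assumptions.


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE training_tasks ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL, model_path TEXT)"
    )


def enqueue_task(conn: sqlite3.Connection) -> int:
    """Inference side: insert a pending task and return its id."""
    cur = conn.execute("INSERT INTO training_tasks (status) VALUES ('pending')")
    conn.commit()
    return cur.lastrowid


def claim_next_task(conn: sqlite3.Connection):
    """Training side: pick the oldest pending task and mark it running."""
    row = conn.execute(
        "SELECT id FROM training_tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE training_tasks SET status = 'running' WHERE id = ?", (row[0],))
    conn.commit()
    return row[0]


def complete_task(conn: sqlite3.Connection, task_id: int, model_path: str) -> None:
    """Training side: record the result once training has finished."""
    conn.execute(
        "UPDATE training_tasks SET status = 'completed', model_path = ? WHERE id = ?",
        (model_path, task_id),
    )
    conn.commit()


def get_status(conn: sqlite3.Connection, task_id: int):
    """Inference side: read the current status of a task."""
    row = conn.execute(
        "SELECT status FROM training_tasks WHERE id = ?", (task_id,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
create_schema(conn)
task_id = enqueue_task(conn)
claimed = claim_next_task(conn)
complete_task(conn, claimed, "runs/train/invoice_fields/weights/best.pt")
print(get_status(conn, task_id))
```

With PostgreSQL and more than one training worker, the select and the update would need to happen atomically, for example with `SELECT ... FOR UPDATE SKIP LOCKED` inside a transaction, so that two workers cannot claim the same task.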

Web API Endpoints

Public API:

Method	Endpoint	Description
GET	`/api/v1/health`	Health check
POST	`/api/v1/infer`	Upload a file and run inference
GET	`/api/v1/results/{filename}`	Get the visualization image
POST	`/api/v1/async/infer`	Asynchronous inference
GET	`/api/v1/async/status/{task_id}`	Query the status of an asynchronous task

Admin API (requires the X-Admin-Token header):

Method	Endpoint	Description
POST	`/api/v1/admin/auth/login`	Admin login
GET	`/api/v1/admin/documents`	List documents
POST	`/api/v1/admin/documents/upload`	Upload a PDF
GET	`/api/v1/admin/documents/{id}`	Document details
PATCH	`/api/v1/admin/documents/{id}/status`	Update document status
PATCH	`/api/v1/admin/documents/{id}/category`	Update document category
GET	`/api/v1/admin/documents/categories`	List categories
POST	`/api/v1/admin/documents/{id}/annotations`	Create annotations

Training API:

Method	Endpoint	Description
POST	`/api/v1/admin/training/datasets`	Create a dataset
GET	`/api/v1/admin/training/datasets`	List datasets
GET	`/api/v1/admin/training/datasets/{id}`	Dataset details
DELETE	`/api/v1/admin/training/datasets/{id}`	Delete a dataset
POST	`/api/v1/admin/training/tasks`	Create a training task
GET	`/api/v1/admin/training/tasks`	List tasks
GET	`/api/v1/admin/training/tasks/{id}`	Task details
GET	`/api/v1/admin/training/tasks/{id}/logs`	Training logs

Model Versions API:

Method	Endpoint	Description
GET	`/api/v1/admin/models`	List model versions
GET	`/api/v1/admin/models/{id}`	Model details
POST	`/api/v1/admin/models/{id}/activate`	Activate a model
POST	`/api/v1/admin/models/{id}/archive`	Archive a model
DELETE	`/api/v1/admin/models/{id}`	Delete a model
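
As a rough illustration of calling an admin endpoint, the sketch below builds a request that carries the `X-Admin-Token` header using only the Python standard library. The helper name `build_admin_request`, the base URL, and the token value are placeholders; the request is constructed but not sent.

```python
from urllib.request import Request


def build_admin_request(
    base_url: str, path: str, token: str, method: str = "GET"
) -> Request:
    """Build (but do not send) a request to an admin endpoint.

    build_admin_request is a hypothetical helper, not part of the project.
    """
    url = base_url.rstrip("/") + path
    return Request(url, headers={"X-Admin-Token": token}, method=method)


# Placeholder base URL and token for a locally running inference service.
req = build_admin_request(
    "http://localhost:8000", "/api/v1/admin/documents", "my-token"
)
print(req.get_method(), req.full_url)
```

Passing the resulting object to `urllib.request.urlopen(req)` would send it, assuming the inference service is running at that address.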

Python API

from inference.pipeline import InferencePipeline

# Initialize the pipeline
pipeline = InferencePipeline(
    model_path='runs/train/invoice_fields/weights/best.pt',
    confidence_threshold=0.25,
    use_gpu=True,
    dpi=150,
    enable_fallback=True
)

# Process a PDF
result = pipeline.process_pdf('invoice.pdf')

print(result.fields)
# {'InvoiceNumber': '12345', 'Amount': '1234.56', ...}

print(result.confidence)
# {'InvoiceNumber': 0.95, 'Amount': 0.92, ...}

# Cross-validation
if result.cross_validation:
    print(f"OCR match: {result.cross_validation.ocr_match}")

from inference.pipeline.payment_line_parser import PaymentLineParser
from inference.pipeline.customer_number_parser import CustomerNumberParser

# Parse a payment line
parser = PaymentLineParser()
result = parser.parse("# 94228110015950070 # 15658 00 8 > 48666036#14#")
print(f"OCR: {result.ocr_number}, Amount: {result.amount}")

# Parse a customer number
parser = CustomerNumberParser()
result = parser.parse("Said, Shakar Umj 436-R Billo")
print(f"Customer Number: {result}")  # "UMJ 436-R"
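
To give a feel for what parsing a payment line involves, here is an independent regex sketch. It is not the project's `PaymentLineParser`; it assumes the layout `# <OCR> # <kronor> <öre> <check digit> > <account>#<code>#`, which fits the sample string in the README but may not cover every variant found on real invoices.

```python
import re
from decimal import Decimal
from typing import NamedTuple, Optional


class PaymentLine(NamedTuple):
    """Hypothetical result type for this sketch."""
    ocr_number: str
    amount: Decimal
    account_number: str


# Assumed layout: "# <OCR> # <kronor> <ore> <check digit> > <account>#<code>#"
_PAYMENT_LINE_RE = re.compile(
    r"#\s*(?P<ocr>\d+)\s*#\s*"
    r"(?P<kronor>\d+)\s+(?P<ore>\d{2})\s+(?P<check>\d)\s*>\s*"
    r"(?P<account>\d+)\s*#\s*\d+\s*#"
)


def parse_payment_line(text: str) -> Optional[PaymentLine]:
    """Extract OCR number, amount, and account number, or return None."""
    match = _PAYMENT_LINE_RE.search(text)
    if match is None:
        return None
    # Kronor and ore are printed as separate groups; join them into one amount.
    amount = Decimal(f"{match['kronor']}.{match['ore']}")
    return PaymentLine(match["ocr"], amount, match["account"])


line = parse_payment_line("# 94228110015950070 # 15658 00 8 > 48666036#14#")
print(line)
```

Using `Decimal` rather than `float` keeps the amount exact, which matters when the value is later compared against a separately extracted amount field.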

DPI Configuration

All components of the system use 150 DPI. The DPI must be the same for training and inference.

Component	Where it is configured
Global constant	`packages/shared/shared/config.py` -> `DEFAULT_DPI = 150`
Web inference	`packages/inference/inference/web/config.py` -> `ModelConfig.dpi`
CLI inference	`python -m inference.cli.infer --dpi 150`
Auto-labeling	`packages/shared/shared/config.py` -> `AUTOLABEL['dpi']`
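
One way to see why the DPI has to match is to look at how it sets the pixel size of a rendered page. PDF lengths are measured in points (72 per inch), so the same page yields different image dimensions at different DPI values. The helper below is an illustrative sketch; `points_to_pixels` is not a function from the project.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def points_to_pixels(points: float, dpi: int = 150) -> int:
    """Convert a length in PDF points to pixels at the given DPI."""
    return round(points * dpi / POINTS_PER_INCH)


# An A4 page is roughly 595 x 842 points.
a4_at_150 = (points_to_pixels(595), points_to_pixels(842))
a4_at_300 = (points_to_pixels(595, dpi=300), points_to_pixels(842, dpi=300))
print(a4_at_150, a4_at_300)
```

A field that occupies a given number of pixels at 150 DPI is about twice as wide and tall at 300 DPI, so a model trained at one resolution would be looking at differently scaled text at the other.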

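Database Schema

Database	Purpose	Contents
PostgreSQL	Primary database	Documents, annotations, training tasks, datasets, model versions

Main Tables

Table	Description
`admin_documents`	Document management (PDF metadata, status, category)
`admin_annotations`	Annotation data (bounding boxes in YOLO format)
`training_tasks`	Training tasks (status, configuration, metrics)
`training_datasets`	Datasets (train/val/test split)
`dataset_documents`	Dataset-to-document associations
`model_versions`	Model version management (activate/archive)
`admin_tokens`	Admin authentication tokens
`async_requests`	Asynchronous inference requests

Dataset Status

Status	Description
`building`	The dataset is being built
`ready`	The dataset is ready and training can start
`trained`	Training has completed
`failed`	The build failed
`archived`	Archived

Training Status

Status	Description
`pending`	Waiting to run
`scheduled`	Scheduled
`running`	Training in progress
`completed`	Training completed
`failed`	Training failed
`cancelled`	Cancelled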
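
The training statuses could be modeled as a string enum so that invalid values are rejected early. This is a sketch only: the class name, the helper, and the choice of which statuses count as terminal are assumptions, not taken from the project.

```python
from enum import Enum


class TrainingStatus(str, Enum):
    """Training task statuses, with values matching the status table."""
    PENDING = "pending"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Assumption: a task in one of these statuses will not change again.
TERMINAL_STATUSES = frozenset(
    {TrainingStatus.COMPLETED, TrainingStatus.FAILED, TrainingStatus.CANCELLED}
)


def is_terminal(status: str) -> bool:
    """Return True if the status is final; raises ValueError for unknown values."""
    return TrainingStatus(status) in TERMINAL_STATUSES


print(is_terminal("running"), is_terminal("completed"))
```

A client polling a task-status endpoint could stop polling once `is_terminal` returns True.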

Tests

# Run all tests
DB_PASSWORD=xxx pytest tests/ -q

# Run with a coverage report
DB_PASSWORD=xxx pytest tests/ --cov=packages --cov-report=term-missing

Metric	Value
Total tests	1,601
Pass rate	100%
Coverage	28%

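Storage Abstraction Layer

A unified file storage interface that supports switching between multiple backends:

Backend	Use case	Installation
Local	Local development and testing	Default
Azure Blob	Azure cloud deployment	`pip install -e "packages/shared[azure]"`
AWS S3	AWS cloud deployment	`pip install -e "packages/shared[s3]"`

Configuration file (storage.yaml)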

backend: ${STORAGE_BACKEND:-local}
presigned_url_expiry: 3600

local:
  base_path: ${STORAGE_BASE_PATH:-./data/storage}

azure:
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: ${AZURE_STORAGE_CONTAINER:-documents}

s3:
  bucket_name: ${AWS_S3_BUCKET}
  region_name: ${AWS_REGION:-us-east-1}
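
The `${VAR:-default}` placeholders in this file follow shell parameter-expansion syntax. How the project resolves them is not shown in the README; the sketch below is one possible approach using a regular expression, and `expand_placeholders` is a hypothetical name.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} and ${NAME:-default}.
_PLACEHOLDER_RE = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
)


def expand_placeholders(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${VAR} and ${VAR:-default} with values from env (os.environ by default)."""
    env = os.environ if env is None else env

    def replace(match: "re.Match[str]") -> str:
        value = env.get(match["name"])
        if value:
            return value
        # As in the shell, ":-" falls back when the variable is unset or empty.
        default = match["default"]
        return default if default is not None else ""

    return _PLACEHOLDER_RE.sub(replace, text)


print(expand_placeholders("backend: ${STORAGE_BACKEND:-local}", env={}))
```

Expanding placeholders on the raw text before handing it to a YAML parser keeps the substitution logic independent of the YAML library in use.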

Usage Example

from pathlib import Path

from shared.storage import get_storage_backend

# Load the backend from the configuration file
storage = get_storage_backend("storage.yaml")

# Upload a file
storage.upload(Path("local.pdf"), "documents/invoice.pdf")

# Get a presigned URL (for frontend access)
url = storage.get_presigned_url("documents/invoice.pdf", expires_in_seconds=3600)

Environment Variables

Variable	Backend	Description
`STORAGE_BACKEND`	All	`local`, `azure_blob`, `s3`
`STORAGE_BASE_PATH`	Local	Local storage path
`AZURE_STORAGE_CONNECTION_STRING`	Azure	Connection string
`AZURE_STORAGE_CONTAINER`	Azure	Container name
`AWS_S3_BUCKET`	S3	Bucket name
`AWS_REGION`	S3	Region (default: us-east-1)

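Data Augmentation

Training supports several data augmentation strategies:

Augmentation	Description
`perspective_warp`	Perspective transform (simulates scanning angle)
`wrinkle`	Wrinkle effect
`edge_damage`	Edge damage
`stain`	Stain effect
`lighting_variation`	Lighting variation
`shadow`	Shadow effect
`gaussian_blur`	Gaussian blur
`motion_blur`	Motion blur
`gaussian_noise`	Gaussian noise
`salt_pepper`	Salt-and-pepper noise
`paper_texture`	Paper texture
`scanner_artifacts`	Scanner artifacts

Example augmentation configuration: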

{
  "augmentation": {
    "gaussian_blur": { "enabled": true, "kernel_size": 5 },
    "perspective_warp": { "enabled": true, "intensity": 0.1 }
  },
  "augmentation_multiplier": 2
}
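
A configuration of this shape can be read with the standard `json` module. The sketch below lists the enabled augmentations and their parameters; `enabled_augmentations` is an illustrative helper, and how the project itself consumes the configuration (including `augmentation_multiplier`) is not covered here.

```python
import json

# The example configuration from this section, embedded as a string.
CONFIG_JSON = """
{
  "augmentation": {
    "gaussian_blur": { "enabled": true, "kernel_size": 5 },
    "perspective_warp": { "enabled": true, "intensity": 0.1 }
  },
  "augmentation_multiplier": 2
}
"""


def enabled_augmentations(config: dict) -> dict:
    """Return {name: parameters} for every augmentation with enabled == true."""
    result = {}
    for name, options in config.get("augmentation", {}).items():
        if options.get("enabled", False):
            # Drop the on/off flag and keep only the tuning parameters.
            result[name] = {k: v for k, v in options.items() if k != "enabled"}
    return result


config = json.loads(CONFIG_JSON)
print(enabled_augmentations(config))
```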

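Frontend Features

The React frontend provides the following modules:

Module	Features
Dashboard	Document list, upload, labeling status management, category filtering
Training	Dataset creation and management, training task configuration, augmentation settings
Models	Model version management, activate/archive, metrics view
Inference Demo	Live inference demo, result visualization

Start the frontend

cd frontend
npm install
npm run dev
# Open http://localhost:5173

Tech Stack

Component	Technology
Object detection	YOLOv11 (Ultralytics)
OCR engine	PaddleOCR v5 (PP-OCRv5)
PDF processing	PyMuPDF (fitz)
Database	PostgreSQL + SQLModel
Web framework	FastAPI + Uvicorn
Frontend	React + TypeScript + Vite + TailwindCSS
State management	React Query (TanStack Query)
Deep learning	PyTorch + CUDA 12.x
Deployment	Docker + Azure/AWS (training) / App Service (inference)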

Environment Variables

Variable	Required	Description
`DB_PASSWORD`	Yes	PostgreSQL password
`DB_HOST`	No	Database host (default: localhost)
`DB_PORT`	No	Database port (default: 5432)
`DB_NAME`	No	Database name (default: docmaster)
`DB_USER`	No	Database user (default: docmaster)
`STORAGE_BASE_PATH`	No	Storage path (default: ~/invoice-data/data)
`MODEL_PATH`	No	Model path
`CONFIDENCE_THRESHOLD`	No	Confidence threshold (default: 0.5)
`SERVER_HOST`	No	Server host (default: 0.0.0.0)
`SERVER_PORT`	No	Server port (default: 8000)
License

MIT License