This commit is contained in:
Yaojia Wang
2026-01-27 00:47:10 +01:00
parent e83a0cae36
commit 58bf75db68
141 changed files with 24814 additions and 3884 deletions

View File

@@ -1,405 +0,0 @@
# Invoice Master POC v2 - Code Review Report
**Review date**: 2026-01-22
**Codebase size**: 67 Python source files, ~22,434 lines of code
**Test coverage**: ~40-50%
---
## Executive Summary
### Overall Assessment: **Good (B+)**
**Strengths**
- ✅ Clear modular architecture with good separation of responsibilities
- ✅ Appropriate use of dataclasses and type hints
- ✅ Comprehensive normalization logic for Swedish invoices
- ✅ Spatial index optimization (O(1) token lookup)
- ✅ Solid fallback mechanism (OCR fallback when YOLO fails)
- ✅ Well-designed web API and UI
**Main Issues**
- ❌ Duplicated payment-line parsing (3+ implementations)
- ❌ Long functions (`_normalize_customer_number` is 127 lines)
- ❌ Configuration security issue (plaintext database password)
- ❌ Inconsistent exception handling (generic Exception everywhere)
- ❌ Missing integration tests
- ❌ Magic numbers scattered throughout (0.5, 0.95, 300, etc.)
---
## 1. Architecture Analysis
### 1.1 Module Structure
```
src/
├── inference/          # Inference pipeline core
│   ├── pipeline.py (517 lines) ⚠️
│   ├── field_extractor.py (1,347 lines) 🔴 too long
│   └── yolo_detector.py
├── web/                # FastAPI web service
│   ├── app.py (765 lines) ⚠️ inline HTML
│   ├── routes.py (184 lines)
│   └── services.py (286 lines)
├── ocr/                # OCR extraction
│   ├── paddle_ocr.py
│   └── machine_code_parser.py (919 lines) 🔴 too long
├── matcher/            # Field matching
│   └── field_matcher.py (875 lines) ⚠️
├── utils/              # Shared utilities
│   ├── validators.py
│   ├── text_cleaner.py
│   ├── fuzzy_matcher.py
│   ├── ocr_corrections.py
│   └── format_variants.py (610 lines)
├── processing/         # Batch processing
├── data/               # Data management
└── cli/                # CLI tools
```
### 1.2 Inference Flow
```
PDF/Image input
Render to image (pdf/renderer.py)
YOLO detection (yolo_detector.py) - detects field regions
Field extraction (field_extractor.py)
 ├→ OCR text extraction (ocr/paddle_ocr.py)
 ├→ Normalization & validation
 └→ Confidence calculation
Cross-validation (pipeline.py)
 ├→ Parse payment_line format
 ├→ Extract OCR/Amount/Account from payment_line
 └→ Validate against detected fields (payment_line values take priority)
Fallback OCR (if key fields are missing)
 ├→ Full-page OCR
 └→ Regex extraction
InferenceResult output
```
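A minimal usage sketch of this flow from the caller's side, based on the `InferencePipeline` interface referenced elsewhere in these docs; the model and file paths are placeholders, not real project paths:
```python
from src.inference.pipeline import InferencePipeline

# Run the full detect -> extract -> cross-validate flow on one invoice
pipeline = InferencePipeline(model_path="models/invoice_fields.pt", use_gpu=True)
result = pipeline.process_pdf("invoices/example.pdf")  # returns an InferenceResult

print(f"Extracted {len(result.fields)} fields from document {result.document_id}")
```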
---
## 2. Code Quality Issues
### 2.1 Long Functions (>50 lines) 🔴
| Function | File | Lines | Complexity | Issue |
|------|------|------|--------|------|
| `_normalize_customer_number()` | field_extractor.py | **127** | Very high | 4 levels of pattern matching, 7+ regexes, complex scoring |
| `_cross_validate_payment_line()` | pipeline.py | **127** | Very high | Core validation logic, 8+ conditional branches |
| `_normalize_bankgiro()` | field_extractor.py | 62 | High | Luhn validation + multiple fallbacks |
| `_normalize_plusgiro()` | field_extractor.py | 63 | High | Similar to bankgiro |
| `_normalize_payment_line()` | field_extractor.py | 74 | High | 4 regex patterns |
| `_normalize_amount()` | field_extractor.py | 78 | High | Multi-strategy fallbacks |
**Example** - `_normalize_customer_number()` (lines 776-902):
```python
def _normalize_customer_number(self, text: str):
    # 127-line function containing:
    # - 4 nested if/for loops
    # - 7 different regex patterns
    # - 5 scoring mechanisms
    # - handling for labeled and unlabeled formats
```
**Recommendation**: Split into (see the sketch below):
- `_find_customer_code_patterns()`
- `_find_labeled_customer_code()`
- `_score_customer_candidates()`
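A rough sketch of what the split could look like; only the helper names come from the list above, the signatures and wiring are assumptions:
```python
def _normalize_customer_number(self, text: str):
    # Collect candidates from both unlabeled patterns and labeled formats,
    # then let a single scoring step pick the winner.
    candidates = self._find_customer_code_patterns(text)
    candidates += self._find_labeled_customer_code(text)
    return self._score_customer_candidates(candidates)
```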
### 2.2 Code Duplication 🔴
**Payment-line parsing (3+ duplicate implementations)**:
1. `_parse_machine_readable_payment_line()` (pipeline.py:217-252)
2. `MachineCodeParser.parse()` (machine_code_parser.py, 919 lines)
3. `_normalize_payment_line()` (field_extractor.py:632-705)
All three implement similar regex patterns for:
```
Format: # <OCR> # <Kronor> <Öre> <Type> > <Account>#<Check>#
```
**Bankgiro/Plusgiro validation (duplicated)**:
- `validators.py`: `is_valid_bankgiro()`, `format_bankgiro()`
- `field_extractor.py`: `_normalize_bankgiro()`, `_normalize_plusgiro()`, `_luhn_checksum()`
- `normalizer.py`: `normalize_bankgiro()`, `normalize_plusgiro()`
- `field_matcher.py`: similar matching logic
**Recommendation**: Create unified modules:
```python
# src/common/payment_line_parser.py
class PaymentLineParser:
    def parse(self, text: str) -> PaymentLineResult: ...

# src/common/giro_validator.py
class GiroValidator:
    def validate_and_format(self, value: str, giro_type: str) -> str: ...
```
### 2.3 Inconsistent Error Handling ⚠️
**Generic exception catches (31 occurrences)**:
```python
except Exception as e:  # 31 occurrences across the codebase
    result.errors.append(str(e))
```
**Problems**:
- No specific error types are caught
- Generic error messages lose context
- Lines 142-147 (routes.py): catches every exception and returns a 500 status
**Current code** (routes.py:142-147):
```python
try:
    service_result = inference_service.process_pdf(...)
except Exception as e:  # too broad
    logger.error(f"Error processing document: {e}")
    raise HTTPException(status_code=500, detail=str(e))
```
**Suggested improvement**:
```python
except FileNotFoundError:
    raise HTTPException(status_code=400, detail="PDF file not found")
except PyMuPDFError:
    raise HTTPException(status_code=400, detail="Invalid PDF format")
except OCRError:
    raise HTTPException(status_code=503, detail="OCR service unavailable")
```
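`PyMuPDFError` and `OCRError` above are not defined in the current codebase; a minimal sketch of such project-specific exceptions (names and hierarchy are assumptions) could be:
```python
class InvoiceProcessingError(Exception):
    """Base class for errors raised by the inference pipeline."""

class PyMuPDFError(InvoiceProcessingError):
    """Raised when a PDF cannot be opened or rendered."""

class OCRError(InvoiceProcessingError):
    """Raised when the OCR backend fails or is unavailable."""
```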
### 2.4 Configuration Security Issue 🔴
**config.py lines 24-30** - plaintext credentials:
```python
DATABASE = {
    'host': '192.168.68.31',      # hardcoded IP
    'user': 'docmaster',          # hardcoded username
    'password': 'nY6LYK5d',       # 🔴 plaintext password!
    'database': 'invoice_master'
}
```
**Recommendation**:
```python
import os

DATABASE = {
    'host': os.getenv('DB_HOST', 'localhost'),
    'user': os.getenv('DB_USER', 'docmaster'),
    'password': os.getenv('DB_PASSWORD'),     # read from environment variable
    'database': os.getenv('DB_NAME', 'invoice_master')
}
```
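To avoid starting with a silently missing password, the service could also fail fast at startup; a small sketch (the check itself is a suggestion, not existing code):
```python
import os

REQUIRED_ENV_VARS = ("DB_PASSWORD",)

missing = [name for name in REQUIRED_ENV_VARS if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
```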
### 2.5 Magic Numbers ⚠️
| Value | Location | Purpose | Problem |
|---|------|------|------|
| 0.5 | multiple places | Confidence threshold | Not configurable per field |
| 0.95 | pipeline.py | payment_line confidence | Undocumented |
| 300 | multiple places | DPI | Hardcoded |
| 0.1 | field_extractor.py | BBox padding | Should be configuration |
| 72 | multiple places | PDF base DPI | Magic number inside formulas |
| 50 | field_extractor.py | Customer-number scoring bonus | Undocumented |
**Recommendation**: Extract to configuration:
```python
INFERENCE_CONFIG = {
    'confidence_threshold': 0.5,
    'payment_line_confidence': 0.95,
    'dpi': 300,
    'bbox_padding': 0.1,
}
```
### 2.6 Naming Inconsistencies ⚠️
**Field names are inconsistent**:
- YOLO class names: `invoice_number`, `ocr_number`, `supplier_org_number`
- Field names: `InvoiceNumber`, `OCR`, `supplier_org_number`
- CSV column names: possibly different again
- Database field names: yet another variant
Mappings are maintained in multiple places:
- `yolo_detector.py` (lines 90-100): `CLASS_TO_FIELD`
- several other locations
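One option is a single canonical mapping module that every layer imports; a minimal sketch (module path and the exact mapping entries are illustrative assumptions):
```python
# src/common/field_names.py (hypothetical location)
CLASS_TO_FIELD = {
    "invoice_number": "InvoiceNumber",
    "ocr_number": "OCR",
    "supplier_org_number": "supplier_org_number",
}

def to_canonical(yolo_class: str) -> str:
    """Map a YOLO class name to the canonical field name used downstream."""
    return CLASS_TO_FIELD[yolo_class]
```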
---
## 3. Test Analysis
### 3.1 Test Coverage
**Test files**: 13
- ✅ Well covered: field_matcher, normalizer, payment_line_parser
- ⚠️ Moderately covered: field_extractor, pipeline
- ❌ Poorly covered: web layer, CLI, batch processing
**Estimated coverage**: 40-50%
### 3.2 Missing Test Cases 🔴
**Critical gaps**:
1. Cross-validation logic - the most complex part, barely tested
2. payment_line parsing variants - multiple implementations, unclear edge cases
3. OCR error correction - complex logic across different strategies
4. Web API endpoints - no request/response tests
5. Batch processing - multi-worker coordination untested
6. Fallback OCR mechanism - when YOLO detection fails
---
## 4. Architecture Risks
### 🔴 Critical
1. **Configuration security** - plaintext database credentials in config.py (lines 24-30)
2. **Error recovery** - broad exception handling masks real problems
3. **Testability** - hardcoded dependencies block unit testing
### 🟡 High
1. **Maintainability** - duplicated payment-line parsing
2. **Scalability** - no async processing for long-running inference
3. **Extensibility** - adding new field types would be difficult
### 🟢 Medium
1. **Performance** - lazy loading helps, but ORM queries are not optimized
2. **Documentation** - mostly adequate, could be better
---
## 5. Priority Matrix
| Priority | Action | Effort | Impact |
|--------|------|--------|------|
| 🔴 Critical | Fix configuration security (environment variables) | 1 hour | High |
| 🔴 Critical | Add integration tests | 2-3 days | High |
| 🔴 Critical | Document the error-handling strategy | 4 hours | Medium |
| 🟡 High | Unify payment_line parsing | 1-2 days | High |
| 🟡 High | Extract normalization into submodules | 2-3 days | Medium |
| 🟡 High | Add dependency injection | 2-3 days | Medium |
| 🟡 High | Split long functions | 2-3 days | Low |
| 🟢 Medium | Raise test coverage to 70%+ | 3-5 days | High |
| 🟢 Medium | Extract magic numbers | 4 hours | Low |
| 🟢 Medium | Standardize naming conventions | 1-2 days | Medium |
---
## 6. File-Specific Recommendations
### High Priority (code quality)
| File | Problem | Recommendation |
|------|------|------|
| `field_extractor.py` | 1,347 lines; 6 long normalization methods | Split into a `normalizers/` submodule |
| `pipeline.py` | 127-line `_cross_validate_payment_line()` | Extract into a separate `CrossValidator` class |
| `field_matcher.py` | 875 lines; complex matching logic | Split into a `matching/` submodule |
| `config.py` | Hardcoded credentials (line 29) | Use environment variables |
| `machine_code_parser.py` | 919 lines; payment_line parsing | Merge with the pipeline's parser |
### Medium Priority (refactoring)
| File | Problem | Recommendation |
|------|------|------|
| `app.py` | 765 lines; HTML inlined in Python | Extract to a `templates/` directory |
| `autolabel.py` | 753 lines; batch-processing logic | Extract worker functions into a module |
| `format_variants.py` | 610 lines; variant generation | Consider the strategy pattern |
---
## 7. Recommended Actions
### Phase 1: Critical Fixes (1 week)
1. **Configuration security** (1 hour)
   - Remove the plaintext password from config.py
   - Add environment-variable support
   - Update the README with configuration instructions
2. **Standardize error handling** (1 day)
   - Define custom exception classes
   - Replace generic Exception catches
   - Add error-code constants
3. **Add critical integration tests** (2 days)
   - End-to-end inference tests
   - payment_line cross-validation tests
   - API endpoint tests
### Phase 2: Refactoring (2-3 weeks)
4. **Unify payment_line parsing** (2 days)
   - Create `src/common/payment_line_parser.py`
   - Merge the 3 duplicate implementations
   - Migrate all callers
5. **Split field_extractor.py** (3 days)
   - Create a `src/inference/normalizers/` submodule
   - One file per field type
   - Extract shared validation logic
6. **Split long functions** (2 days)
   - `_normalize_customer_number()` → 3 functions
   - `_cross_validate_payment_line()` → CrossValidator class
### Phase 3: Improvements (1-2 weeks)
7. **Raise test coverage** (5 days)
   - Target: 70%+ coverage
   - Focus on validation logic
   - Add edge-case tests
8. **Improve configuration management** (1 day)
   - Extract all magic numbers
   - Create a configuration file (YAML)
   - Add configuration validation
9. **Improve documentation** (2 days)
   - Add architecture diagrams
   - Document all private methods
   - Create a contributing guide
---
## Appendix A: Metrics
### Code Complexity
| Category | Count | Average lines |
|------|------|----------|
| Source files | 67 | 334 |
| Long files (>500 lines) | 12 | 875 |
| Long functions (>50 lines) | 23 | 89 |
| Test files | 13 | 298 |
### Dependencies
| Type | Count |
|------|------|
| External dependencies | ~25 |
| Internal modules | 10 |
| Circular dependencies | 0 ✅ |
### Code Style
| Metric | Coverage |
|------|--------|
| Type hints | 80% |
| Docstrings (public) | 80% |
| Docstrings (private) | 40% |
| Test coverage | 45% |
---
**Generated**: 2026-01-22
**Reviewer**: Claude Code
**Version**: v2.0

View File

@@ -1,96 +0,0 @@
# Field Extractor Analysis Report
## Overview
field_extractor.py (1,183 lines) was initially identified as a candidate for optimization. A refactoring using the `src/normalize` module was attempted, but after analysis and testing the conclusion is that it **should not be refactored**.
## Refactoring Attempt
### Initial Plan
Remove the duplicated normalize methods from field_extractor.py and switch to the unified `src/normalize/normalize_field()` interface.
### Steps Taken
1. ✅ Backed up the original file (`field_extractor_old.py`)
2. ✅ Changed `_normalize_and_validate` to use the unified normalizer
3. ✅ Deleted the duplicated normalize methods (~400 lines)
4. ❌ Ran the tests - **28 failures**
5. ✅ Added wrapper methods delegating to the normalizer
6. ❌ Ran the tests again - **12 failures**
7. ✅ Restored the original file
8. ✅ Tests pass - **all 45 tests pass**
## Key Findings
### The Two Modules Serve Different Purposes
| Module | Purpose | Input | Output | Example |
|------|------|------|------|------|
| **src/normalize/** | **Variant generation** for matching | Already-extracted field value | List of matching variants | `"INV-12345"` → `["INV-12345", "12345"]` |
| **field_extractor** | **Value extraction** from OCR text | Raw OCR text containing the field | Single extracted field value | `"Fakturanummer: A3861"` → `"A3861"` |
### Why They Cannot Be Unified
1. **src/normalize/** is designed to:
   - Receive an already-extracted field value
   - Generate multiple normalized variants for fuzzy matching
   - For example, BankgiroNormalizer:
```python
normalize("782-1713") → ["7821713", "782-1713"]  # generates variants
```
2. **field_extractor**'s normalize methods:
   - Receive raw OCR text that contains the field (possibly with labels and other text)
   - **Extract** the field value matching a specific pattern
   - For example, `_normalize_bankgiro`:
```python
_normalize_bankgiro("Bankgiro: 782-1713") → ("782-1713", True, None)  # extracts from text
```
3. **The key difference**:
   - Normalizer: variant generator (for matching)
   - Field extractor: pattern extractor (for parsing)
### Example Test Failures
Failures after replacing the field-extractor methods with the normalizer:
```python
# InvoiceNumber test
Input: "Fakturanummer: A3861"
Expected: "A3861"
Actual: "Fakturanummer: A3861"  # nothing was extracted, only cleaned

# Bankgiro test
Input: "Bankgiro: 782-1713"
Expected: "782-1713"
Actual: "7821713"  # returned the dash-free variant instead of the extracted, formatted value
```
## Conclusion
**field_extractor.py should not be refactored onto the src/normalize module**, because:
1. **Different responsibilities**: extraction vs. variant generation
2. **Different inputs**: raw OCR text with labels vs. already-extracted field values
3. **Different outputs**: a single extracted value vs. multiple matching variants
4. **The existing code works well**: all 45 tests pass
5. **The extraction logic is valuable**: it contains complex pattern-matching rules (e.g. distinguishing Bankgiro from Plusgiro formats)
## Recommendations
1. **Keep field_extractor.py as-is**: do not refactor it
2. **Document the difference between the two modules**: make sure the team understands their respective purposes
3. **Focus on other optimization targets**: machine_code_parser.py (919 lines)
## Lessons Learned
Before refactoring:
1. Understand a module's **actual purpose**, not just surface-level code similarity
2. Run the full test suite to validate assumptions
3. Assess whether there is real duplication, or only superficially similar code with different purposes
---
**Status**: ✅ Analysis complete; decision is not to refactor
**Tests**: ✅ 45/45 passing
**File**: kept as-is at 1,183 lines

View File

@@ -1,238 +0,0 @@
# Machine Code Parser Analysis Report
## File Overview
- **File**: `src/ocr/machine_code_parser.py`
- **Total lines**: 919
- **Code lines**: 607 (66%)
- **Methods**: 14
- **Regex usages**: 47
## Code Structure
### Class Structure
```
MachineCodeResult (dataclass)
├── to_dict()
└── get_region_bbox()
MachineCodeParser (main parser)
├── __init__()
├── parse() - main entry point
├── _find_tokens_with_values()
├── _find_machine_code_line_tokens()
├── _parse_standard_payment_line_with_tokens()
├── _parse_standard_payment_line() - 142 lines ⚠️
├── _extract_ocr() - 50 lines
├── _extract_bankgiro() - 58 lines
├── _extract_plusgiro() - 30 lines
├── _extract_amount() - 68 lines
├── _calculate_confidence()
└── cross_validate()
```
## Issues Found
### 1. ⚠️ `_parse_standard_payment_line` is too long (142 lines)
**Location**: lines 442-582
**Problems**:
- Contains the nested functions `normalize_account_spaces` and `format_account`
- Multiple regex-matching branches
- Complex logic that is hard to test and maintain
**Recommendation**:
Split into independent methods:
- `_normalize_account_spaces(line)`
- `_format_account(account_digits, context)`
- `_match_primary_pattern(line)`
- `_match_fallback_patterns(line)`
### 2. 🔁 The 4 `_extract_*` methods repeat the same pattern
All extract methods follow the same structure:
```python
def _extract_XXX(self, tokens):
    candidates = []
    for token in tokens:
        text = token.text.strip()
        matches = self.XXX_PATTERN.findall(text)
        for match in matches:
            # validation logic
            # context detection
            candidates.append((normalized, context_score, token))
    if not candidates:
        return None
    candidates.sort(key=lambda x: (x[1], 1), reverse=True)
    return candidates[0][0]
```
**Repeated logic**:
- Token iteration
- Pattern matching
- Candidate collection
- Context scoring
- Sorting and picking the best match
**Recommendation**:
A base extractor class or a generic helper could be extracted to remove the duplication (see the sketch below).
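A minimal sketch of such a generic helper; the name and parameters follow Option B below, while the body is an assumption rather than existing code:
```python
def _generic_extract(self, tokens, pattern, normalizer, context_checker):
    """Shared candidate-collection loop behind the _extract_* methods."""
    candidates = []
    for token in tokens:
        for match in pattern.findall(token.text.strip()):
            normalized = normalizer(match)
            if normalized is None:
                continue  # validation rejected this candidate
            score = context_checker(tokens, token)
            candidates.append((normalized, score, token))
    if not candidates:
        return None
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[0][0]
```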
### 3. ✅ Duplicated context detection
Context-detection code is repeated in several places:
```python
# in _extract_bankgiro
context_text = ' '.join(t.text.lower() for t in tokens)
is_bankgiro_context = (
    'bankgiro' in context_text or
    'bg:' in context_text or
    'bg ' in context_text
)

# in _extract_plusgiro
context_text = ' '.join(t.text.lower() for t in tokens)
is_plusgiro_context = (
    'plusgiro' in context_text or
    'postgiro' in context_text or
    'pg:' in context_text or
    'pg ' in context_text
)

# in _parse_standard_payment_line
context = (context_line or raw_line).lower()
is_plusgiro_context = (
    ('plusgiro' in context or 'postgiro' in context or 'plusgirokonto' in context)
    and 'bankgiro' not in context
)
```
**Recommendation**:
Extract into an independent method:
- `_detect_account_context(tokens) -> dict[str, bool]`
## Refactoring Options
### Option A: Light refactoring (recommended) ✅
**Goal**: Extract the duplicated context-detection logic without changing the main structure
**Steps**:
1. Extract a `_detect_account_context(tokens)` method
2. Extract `_normalize_account_spaces(line)` as an independent method
3. Extract `_format_account(digits, context)` as an independent method
**Impact**:
- Removes ~50-80 lines of duplicated code
- Improves testability
- Low risk, easy to verify
**Expected result**: 919 lines → ~850 lines (↓7%)
### Option B: Medium refactoring
**Goal**: Create a generic field-extraction framework
**Steps**:
1. Create `_generic_extract(pattern, normalizer, context_checker)`
2. Refactor all `_extract_*` methods onto the generic framework
3. Split `_parse_standard_payment_line` into several small methods
**Impact**:
- Removes ~150-200 lines of code
- Significantly improves maintainability
- Medium risk; needs thorough testing
**Expected result**: 919 lines → ~720 lines (↓22%)
### Option C: Deep refactoring (not recommended)
**Goal**: Fully redesign around the strategy pattern
**Risks**:
- High risk of introducing bugs
- Requires extensive testing
- Could break existing integrations
## Recommended Option
### ✅ Adopt Option A (light refactoring)
**Rationale**:
1. **The code already works well**: no obvious bugs or performance problems
2. **Low risk**: only extracts duplicated logic, does not change the core algorithm
3. **Good cost/benefit**: small changes yield a clear code-quality improvement
4. **Easy to verify**: existing tests should already cover it
### Refactoring Steps
```python
# 1. Extract context detection
def _detect_account_context(self, tokens: list[TextToken]) -> dict[str, bool]:
    """Detect account-type keywords in the surrounding context."""
    context_text = ' '.join(t.text.lower() for t in tokens)
    return {
        'bankgiro': any(kw in context_text for kw in ['bankgiro', 'bg:', 'bg ']),
        'plusgiro': any(kw in context_text for kw in ['plusgiro', 'postgiro', 'plusgirokonto', 'pg:', 'pg ']),
    }

# 2. Extract space normalization
def _normalize_account_spaces(self, line: str) -> str:
    """Remove spaces inside the account number."""
    # (existing code from lines 460-481)

# 3. Extract account formatting
def _format_account(
    self,
    account_digits: str,
    is_plusgiro_context: bool
) -> tuple[str, str]:
    """Format the account number and determine its type."""
    # (existing code from lines 485-523)
```
## Comparison: field_extractor vs machine_code_parser
| Aspect | field_extractor | machine_code_parser |
|------|-----------------|---------------------|
| Purpose | Value extraction | Machine-code parsing |
| Duplicated code | ~400 lines (normalize methods) | ~80 lines (context detection) |
| Refactoring value | ❌ Different purposes, should not be unified | ✅ Shared logic can be extracted |
| Risk | High (would break functionality) | Low (code organization only) |
## Decision
### ✅ Refactor machine_code_parser.py
**How it differs from field_extractor**:
- field_extractor: the duplicated methods serve **different purposes** (extraction vs. variant generation)
- machine_code_parser: the duplicated code serves the **same purpose** (context detection in every case)
**Expected benefits**:
- Removes ~70 lines of duplicated code
- Improves testability (context detection can be tested in isolation)
- Clearer code organization
- **Low risk**, easy to verify
## Next Steps
1. ✅ Back up the original file
2. ✅ Extract the `_detect_account_context` method
3. ✅ Extract the `_normalize_account_spaces` method
4. ✅ Extract the `_format_account` method
5. ✅ Update all call sites
6. ✅ Run the tests to verify
7. ✅ Check code coverage
---
**Status**: 📋 Analysis complete; light refactoring recommended
**Risk assessment**: 🟢 Low risk
**Expected benefit**: 919 lines → ~850 lines (↓7%)

View File

@@ -1,519 +0,0 @@
# Performance Optimization Guide
This document provides performance optimization recommendations for the Invoice Field Extraction system.
## Table of Contents
1. [Batch Processing Optimization](#batch-processing-optimization)
2. [Database Query Optimization](#database-query-optimization)
3. [Caching Strategies](#caching-strategies)
4. [Memory Management](#memory-management)
5. [Profiling and Monitoring](#profiling-and-monitoring)
---
## Batch Processing Optimization
### Current State
The system processes invoices one at a time. For large batches, this can be inefficient.
### Recommendations
#### 1. Database Batch Operations
**Current**: Individual inserts for each document
```python
# Inefficient
for doc in documents:
db.insert_document(doc) # Individual DB call
```
**Optimized**: Use `execute_values` for batch inserts
```python
# Efficient - already implemented in db.py line 519
from psycopg2.extras import execute_values
execute_values(cursor, """
INSERT INTO documents (...)
VALUES %s
""", document_values)
```
**Impact**: 10-50x faster for batches of 100+ documents
#### 2. PDF Processing Batching
**Recommendation**: Process PDFs in parallel using multiprocessing
```python
from multiprocessing import Pool
def process_batch(pdf_paths, batch_size=10):
"""Process PDFs in parallel batches."""
with Pool(processes=batch_size) as pool:
results = pool.map(pipeline.process_pdf, pdf_paths)
return results
```
**Considerations**:
- GPU models should use a shared process pool (already exists: `src/processing/gpu_pool.py`)
- CPU-intensive tasks can use separate process pool (`src/processing/cpu_pool.py`)
- Current dual pool coordinator (`dual_pool_coordinator.py`) already supports this pattern
**Status**: ✅ Already implemented in `src/processing/` modules
#### 3. Image Caching for Multi-Page PDFs
**Current**: Each page rendered independently
```python
# Current pattern in field_extractor.py
for page_num in range(total_pages):
image = render_pdf_page(pdf_path, page_num, dpi=300)
```
**Optimized**: Pre-render all pages if processing multiple fields per page
```python
# Batch render
images = {
page_num: render_pdf_page(pdf_path, page_num, dpi=300)
for page_num in page_numbers_needed
}
# Reuse images
for detection in detections:
image = images[detection.page_no]
extract_field(detection, image)
```
**Impact**: Reduces redundant PDF rendering by 50-90% for multi-field invoices
---
## Database Query Optimization
### Current Performance
- **Parameterized queries**: ✅ Implemented (Phase 1)
- **Connection pooling**: ❌ Not implemented
- **Query batching**: ✅ Partially implemented
- **Index optimization**: ⚠️ Needs verification
### Recommendations
#### 1. Connection Pooling
**Current**: New connection for each operation
```python
def connect(self):
"""Create new database connection."""
return psycopg2.connect(**self.config)
```
**Optimized**: Use connection pooling
```python
from psycopg2 import pool
class DocumentDatabase:
def __init__(self, config):
self.pool = pool.SimpleConnectionPool(
minconn=1,
maxconn=10,
**config
)
def connect(self):
return self.pool.getconn()
def close(self, conn):
self.pool.putconn(conn)
```
**Impact**:
- Reduces connection overhead by 80-95%
- Especially important for high-frequency operations
#### 2. Index Recommendations
**Check current indexes**:
```sql
-- Verify indexes exist on frequently queried columns
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public';
```
**Recommended indexes**:
```sql
-- If not already present
CREATE INDEX IF NOT EXISTS idx_documents_success
ON documents(success);
CREATE INDEX IF NOT EXISTS idx_documents_timestamp
ON documents(timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_field_results_document_id
ON field_results(document_id);
CREATE INDEX IF NOT EXISTS idx_field_results_matched
ON field_results(matched);
CREATE INDEX IF NOT EXISTS idx_field_results_field_name
ON field_results(field_name);
```
**Impact**:
- 10-100x faster queries for filtered/sorted results
- Critical for `get_failed_matches()` and `get_all_documents_summary()`
#### 3. Query Batching
**Status**: ✅ Already implemented for field results (line 519)
**Verify batching is used**:
```python
# Good pattern in db.py
execute_values(cursor, "INSERT INTO field_results (...) VALUES %s", field_values)
```
**Additional opportunity**: Batch `SELECT` queries
```python
# Current
docs = [get_document(doc_id) for doc_id in doc_ids] # N queries
# Optimized
docs = get_documents_batch(doc_ids) # 1 query with IN clause
```
**Status**: ✅ Already implemented (`get_documents_batch` exists in db.py)
---
## Caching Strategies
### 1. Model Loading Cache
**Current**: Models loaded per-instance
**Recommendation**: Singleton pattern for YOLO model
```python
class YOLODetectorSingleton:
_instance = None
_model = None
@classmethod
def get_instance(cls, model_path):
if cls._instance is None:
cls._instance = YOLODetector(model_path)
return cls._instance
```
**Impact**: Reduces memory usage by 90% when processing multiple documents
### 2. Parser Instance Caching
**Current**: ✅ Already optimal
```python
# Good pattern in field_extractor.py
def __init__(self):
self.payment_line_parser = PaymentLineParser() # Reused
self.customer_number_parser = CustomerNumberParser() # Reused
```
**Status**: No changes needed
### 3. OCR Result Caching
**Recommendation**: Cache OCR results for identical regions
```python
from functools import lru_cache

_image_registry = {}  # image_hash -> rendered page image, populated by the caller

@lru_cache(maxsize=1000)
def ocr_region_cached(image_hash, bbox):
    """Cache OCR results by image hash + bbox (bbox must be hashable, e.g. a tuple)."""
    image = _image_registry[image_hash]  # look up the actual image for this hash
    return paddle_ocr.ocr_region(image, bbox)
```
**Impact**: 50-80% speedup when re-processing similar documents
**Note**: Requires implementing image hashing (e.g., `hashlib.md5(image.tobytes())`)
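A small sketch of that hashing step, assuming the rendered page exposes raw pixel bytes via `tobytes()` (as PIL and NumPy images do):
```python
import hashlib

def image_cache_key(image) -> str:
    # Identical renders yield identical keys, so cached OCR results can be reused
    return hashlib.md5(image.tobytes()).hexdigest()
```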
---
## Memory Management
### Current Issues
**Potential memory leaks**:
1. Large images kept in memory after processing
2. OCR results accumulated without cleanup
3. Model outputs not explicitly cleared
### Recommendations
#### 1. Explicit Image Cleanup
```python
import gc

def process_pdf(pdf_path):
    image = None
    try:
        image = render_pdf(pdf_path)
        result = extract_fields(image)
        return result
    finally:
        del image      # Explicit cleanup (image may still be None if rendering failed)
        gc.collect()   # Force garbage collection
```
#### 2. Generator Pattern for Large Batches
**Current**: Load all documents into memory
```python
docs = [process_pdf(path) for path in pdf_paths] # All in memory
```
**Optimized**: Use generator for streaming processing
```python
def process_batch_streaming(pdf_paths):
"""Process documents one at a time, yielding results."""
for path in pdf_paths:
result = process_pdf(path)
yield result
# Result can be saved to DB immediately
# Previous result is garbage collected
```
**Impact**: Constant memory usage regardless of batch size
#### 3. Context Managers for Resources
```python
class InferencePipeline:
def __enter__(self):
self.detector.load_model()
return self
def __exit__(self, *args):
self.detector.unload_model()
self.extractor.cleanup()
# Usage
with InferencePipeline(...) as pipeline:
results = pipeline.process_pdf(path)
# Automatic cleanup
```
---
## Profiling and Monitoring
### Recommended Profiling Tools
#### 1. cProfile for CPU Profiling
```python
import cProfile
import pstats
profiler = cProfile.Profile()
profiler.enable()
# Your code here
pipeline.process_pdf(pdf_path)
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 slowest functions
```
#### 2. memory_profiler for Memory Analysis
```bash
pip install memory_profiler
python -m memory_profiler your_script.py
```
Or decorator-based:
```python
from memory_profiler import profile
@profile
def process_large_batch(pdf_paths):
# Memory usage tracked line-by-line
results = [process_pdf(path) for path in pdf_paths]
return results
```
#### 3. py-spy for Production Profiling
```bash
pip install py-spy
# Profile running process
py-spy top --pid 12345
# Generate flamegraph
py-spy record -o profile.svg -- python your_script.py
```
**Advantage**: No code changes needed, minimal overhead
### Key Metrics to Monitor
1. **Processing Time per Document**
- Target: <10 seconds for single-page invoice
- Current: ~2-5 seconds (estimated)
2. **Memory Usage**
- Target: <2GB for batch of 100 documents
- Monitor: Peak memory usage
3. **Database Query Time**
- Target: <100ms per query (with indexes)
- Monitor: Slow query log
4. **OCR Accuracy vs Speed Trade-off**
- Current: PaddleOCR with GPU (~200ms per region)
- Alternative: Tesseract (~500ms, slightly more accurate)
### Logging Performance Metrics
**Add to pipeline.py**:
```python
import time
import logging
logger = logging.getLogger(__name__)
def process_pdf(self, pdf_path):
start = time.time()
# Processing...
result = self._process_internal(pdf_path)
elapsed = time.time() - start
logger.info(f"Processed {pdf_path} in {elapsed:.2f}s")
# Log to database for analysis
self.db.log_performance({
'document_id': result.document_id,
'processing_time': elapsed,
'field_count': len(result.fields)
})
return result
```
---
## Performance Optimization Priorities
### High Priority (Implement First)
1. **Database parameterized queries** - Already done (Phase 1)
2. **Database connection pooling** - Not implemented
3. **Index optimization** - Needs verification
### Medium Priority
4. **Batch PDF rendering** - Optimization possible
5. **Parser instance reuse** - Already done (Phase 2)
6. **Model caching** - Could improve
### Low Priority (Nice to Have)
7. **OCR result caching** - Complex implementation
8. **Generator patterns** - Refactoring needed
9. **Advanced profiling** - For production optimization
---
## Benchmarking Script
```python
"""
Benchmark script for invoice processing performance.
"""
import time
from pathlib import Path
from src.inference.pipeline import InferencePipeline
def benchmark_single_document(pdf_path, iterations=10):
"""Benchmark single document processing."""
pipeline = InferencePipeline(
model_path="path/to/model.pt",
use_gpu=True
)
times = []
for i in range(iterations):
start = time.time()
result = pipeline.process_pdf(pdf_path)
elapsed = time.time() - start
times.append(elapsed)
print(f"Iteration {i+1}: {elapsed:.2f}s")
avg_time = sum(times) / len(times)
print(f"\nAverage: {avg_time:.2f}s")
print(f"Min: {min(times):.2f}s")
print(f"Max: {max(times):.2f}s")
def benchmark_batch(pdf_paths, batch_size=10):
"""Benchmark batch processing."""
from multiprocessing import Pool
pipeline = InferencePipeline(
model_path="path/to/model.pt",
use_gpu=True
)
start = time.time()
with Pool(processes=batch_size) as pool:
results = pool.map(pipeline.process_pdf, pdf_paths)
elapsed = time.time() - start
avg_per_doc = elapsed / len(pdf_paths)
print(f"Total time: {elapsed:.2f}s")
print(f"Documents: {len(pdf_paths)}")
print(f"Average per document: {avg_per_doc:.2f}s")
print(f"Throughput: {len(pdf_paths)/elapsed:.2f} docs/sec")
if __name__ == "__main__":
# Single document benchmark
benchmark_single_document("test.pdf")
# Batch benchmark
pdf_paths = list(Path("data/test_pdfs").glob("*.pdf"))
benchmark_batch(pdf_paths[:100])
```
---
## Summary
**Implemented (Phase 1-2)**:
- Parameterized queries (SQL injection fix)
- Parser instance reuse (Phase 2 refactoring)
- Batch insert operations (execute_values)
- Dual pool processing (CPU/GPU separation)
**Quick Wins (Low effort, high impact)**:
- Database connection pooling (2-4 hours)
- Index verification and optimization (1-2 hours)
- Batch PDF rendering (4-6 hours)
**Long-term Improvements**:
- OCR result caching with hashing
- Generator patterns for streaming
- Advanced profiling and monitoring
**Expected Impact**:
- Connection pooling: 80-95% reduction in DB overhead
- Indexes: 10-100x faster queries
- Batch rendering: 50-90% less redundant work
- **Overall**: 2-5x throughput improvement for batch processing

File diff suppressed because it is too large Load Diff

View File

@@ -1,170 +0,0 @@
# Code Refactoring Summary Report
## 📊 Overall Results
### Test Status
- ✅ **688/688 tests passing** (100%)
- ✅ **Code coverage**: 34% → 37% (+3%)
- ✅ **0 failures**, 0 errors
### Test Coverage Improvements
- ✅ **machine_code_parser**: 25% → 65% (+40%)
- ✅ **New tests**: 55 (633 → 688)
---
## 🎯 Completed Refactorings
### 1. ✅ Matcher modularization (876 lines → 205 lines, ↓76%)
**Files**:
**What changed**:
- Split the single 876-line file into **11 modules**
- Extracted **5 independent matching strategies**
- Created dedicated modules for data models, utility functions, and context handling
**New module structure**:
**Test results**:
- ✅ All 77 matcher tests pass
- ✅ Complete README documentation
- ✅ Strategy pattern, easy to extend
**Benefits**:
- 📉 76% less code
- 📈 Significantly better maintainability
- ✨ Each strategy is tested independently
- 🔧 Easy to add new strategies (see the sketch below)
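A minimal sketch of what the extracted strategy interface might look like; the class and method names are assumptions, not copied from the new `matcher/` modules:
```python
from abc import ABC, abstractmethod

class MatchStrategy(ABC):
    """One independent way of scoring an extracted value against a candidate."""

    @abstractmethod
    def match(self, extracted: str, candidate: str) -> float:
        """Return a match score between 0.0 and 1.0."""

class ExactMatchStrategy(MatchStrategy):
    def match(self, extracted: str, candidate: str) -> float:
        return 1.0 if extracted == candidate else 0.0
```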
---
### 2. ✅ Machine Code Parser light refactoring + test coverage (919 lines → 929 lines)
**File**: src/ocr/machine_code_parser.py
**What changed**:
- Extracted **3 shared helper methods**, removing duplicated code
- Streamlined the context-detection logic
- Simplified the account-formatting method
**Test improvements**:
- ✅ **55 new tests** (24 → 79)
- ✅ **Coverage**: 25% → 65% (+40%)
- ✅ All 688 project tests pass
**New test coverage**:
- **First round** (22 tests):
  - `_detect_account_context()` - 8 tests (context detection)
  - `_normalize_account_spaces()` - 5 tests (space normalization)
  - `_format_account()` - 4 tests (account formatting)
  - `parse()` - 5 tests (main entry point)
- **Second round** (33 tests):
  - `_extract_ocr()` - 8 tests (OCR extraction)
  - `_extract_bankgiro()` - 9 tests (Bankgiro extraction)
  - `_extract_plusgiro()` - 8 tests (Plusgiro extraction)
  - `_extract_amount()` - 8 tests (amount extraction)
**Benefits**:
- 🔄 Removed 80 lines of duplicated code
- 📈 Better testability (helper methods can be tested in isolation)
- 📖 Improved readability
- ✅ Coverage up from 25% to 65% (+40%)
- 🎯 Low risk, high payoff
---
### 3. ✅ Field Extractor analysis (decision: do not refactor)
**File**: (1,183 lines)
**Analysis result**: ❌ **Should not be refactored**
**Key insight**:
- Superficially similar code can serve **completely different purposes**
- field_extractor: **parses/extracts** field values
- src/normalize: **normalizes/generates variants** for matching
- The two have different responsibilities and should not be unified
**Documentation**:
---
## 📈 Refactoring Statistics
### Lines of Code
| File | Before | After | Change | Percent |
|------|--------|--------|------|--------|
| **matcher/field_matcher.py** | 876 | 205 | -671 | ↓76% |
| **matcher/* (10 new modules)** | 0 | 466 | +466 | new |
| **matcher total** | 876 | 671 | -205 | ↓23% |
| **ocr/machine_code_parser.py** | 919 | 929 | +10 | +1% |
| **Net reduction** | - | - | **-195 lines** | **↓11%** |
### Test Coverage
| Module | Tests | Pass rate | Coverage | Status |
|------|--------|--------|--------|------|
| matcher | 77 | 100% | - | ✅ |
| field_extractor | 45 | 100% | 39% | ✅ |
| machine_code_parser | 79 | 100% | 65% | ✅ |
| normalizer | ~120 | 100% | - | ✅ |
| other modules | ~367 | 100% | - | ✅ |
| **Total** | **688** | **100%** | **37%** | ✅ |
---
## 🎓 Lessons from the Refactoring
### What Worked
1. **✅ Test before refactoring**
   - Every refactoring had full test coverage
   - Tests were run immediately after each change
   - A 100% pass rate guarded quality
2. **✅ Identify real duplication**
   - Not all similar code is duplication
   - field_extractor vs normalizer: superficially similar but different purposes
   - machine_code_parser: genuine duplication
3. **✅ Refactor incrementally**
   - matcher: large-scale modularization (strategy pattern)
   - machine_code_parser: light refactoring (extract shared methods)
   - field_extractor: analyzed and deliberately left alone
### Key Decisions
#### ✅ When refactoring was warranted
- **matcher**: a single overly long file (876 lines) containing several strategies
- **machine_code_parser**: repeated code serving the same purpose in multiple places
#### ❌ When it was not
- **field_extractor**: similar-looking code with different purposes
### Takeaway
**Do not chase the DRY principle blindly.**
> Similar code is not necessarily duplicated code. Understand what the code is **actually for**.
---
## ✅ Summary
**Key results**:
- 📉 Net reduction of 195 lines of code
- 📈 Code coverage +3% (34% → 37%)
- ✅ Test count +55 (633 → 688)
- 🎯 machine_code_parser coverage +40% (25% → 65%)
- ✨ Noticeably better modularity
- 🎯 Much easier to maintain
**Main lesson**:
> Similar code is not necessarily duplicated code. Only by understanding what the code is actually for can you make the right refactoring decision.
**Next steps**:
1. Continue raising machine_code_parser coverage to 80%+ (currently 65%)
2. Add tests for other low-coverage modules (field_extractor 39%, pipeline 19%)
3. Add more tests for edge cases and error conditions

View File

@@ -1,258 +0,0 @@
# Test Coverage Improvement Report
## 📊 Overview
### Overall Statistics
- ✅ **Total tests**: 633 → 688 (+55 tests, +8.7%)
- ✅ **Pass rate**: 100% (688/688)
- ✅ **Overall coverage**: 34% → 37% (+3%)
### machine_code_parser.py Focus
- ✅ **Tests**: 24 → 79 (+55 tests, +229%)
- ✅ **Coverage**: 25% → 65% (+40%)
- ✅ **Uncovered lines**: 273 → 129 (144 fewer)
---
## 🎯 New Tests in Detail
### First Round (22 tests)
#### 1. TestDetectAccountContext (8 tests)
Tests for the newly added `_detect_account_context()` helper.
**Test cases**:
1. `test_bankgiro_keyword` - detects the 'bankgiro' keyword
2. `test_bg_keyword` - detects the 'bg:' abbreviation
3. `test_plusgiro_keyword` - detects the 'plusgiro' keyword
4. `test_postgiro_keyword` - detects the 'postgiro' alias
5. `test_pg_keyword` - detects the 'pg:' abbreviation
6. `test_both_contexts` - both kinds of keywords present
7. `test_no_context` - no account keywords
8. `test_case_insensitive` - case-insensitive detection
**Covered code path**:
```python
def _detect_account_context(self, tokens: list[TextToken]) -> dict[str, bool]:
context_text = ' '.join(t.text.lower() for t in tokens)
return {
'bankgiro': any(kw in context_text for kw in ['bankgiro', 'bg:', 'bg ']),
'plusgiro': any(kw in context_text for kw in ['plusgiro', 'postgiro', 'plusgirokonto', 'pg:', 'pg ']),
}
```
---
#### 2. TestNormalizeAccountSpacesMethod (5 tests)
Tests for the newly added `_normalize_account_spaces()` helper.
**Test cases**:
1. `test_removes_spaces_after_arrow` - removes spaces after the > marker
2. `test_multiple_consecutive_spaces` - handles several consecutive spaces
3. `test_no_arrow_returns_unchanged` - returns the input unchanged when there is no > marker
4. `test_spaces_before_arrow_preserved` - preserves spaces before the > marker
5. `test_empty_string` - empty-string handling
**Covered code path**:
```python
def _normalize_account_spaces(self, line: str) -> str:
if '>' not in line:
return line
parts = line.split('>', 1)
after_arrow = parts[1]
normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', after_arrow)
while re.search(r'(\d)\s+(\d)', normalized):
normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', normalized)
return parts[0] + '>' + normalized
```
---
#### 3. TestFormatAccount (4 tests)
Tests for the newly added `_format_account()` helper.
**Test cases**:
1. `test_plusgiro_context_forces_plusgiro` - a Plusgiro context forces Plusgiro formatting
2. `test_valid_bankgiro_7_digits` - formats a valid 7-digit Bankgiro
3. `test_valid_bankgiro_8_digits` - formats a valid 8-digit Bankgiro
4. `test_defaults_to_bankgiro_when_ambiguous` - defaults to Bankgiro when ambiguous
**Covered code path**:
```python
def _format_account(self, account_digits: str, is_plusgiro_context: bool) -> tuple[str, str]:
    if is_plusgiro_context:
        formatted = f"{account_digits[:-1]}-{account_digits[-1]}"
        return formatted, 'plusgiro'
    # Luhn validation
    pg_valid = FieldValidators.is_valid_plusgiro(account_digits)
    bg_valid = FieldValidators.is_valid_bankgiro(account_digits)
    # Decision logic
    if pg_valid and not bg_valid:
        return pg_formatted, 'plusgiro'
    elif bg_valid and not pg_valid:
        return bg_formatted, 'bankgiro'
    else:
        return bg_formatted, 'bankgiro'
```
---
#### 4. TestParseMethod (5 tests)
Tests for the main `parse()` entry point.
**Test cases**:
1. `test_parse_empty_tokens` - handles an empty token list
2. `test_parse_finds_payment_line_in_bottom_region` - finds the payment line in the bottom 35% of the page
3. `test_parse_ignores_top_region` - ignores the top region of the page
4. `test_parse_with_context_keywords` - detects context keywords
5. `test_parse_stores_source_tokens` - stores the source tokens
**Covered code paths**:
- Token filtering (bottom-region detection)
- Context keyword detection
- Payment-line lookup and parsing
- Result object construction
---
### Second Round (33 tests)
#### 5. TestExtractOCR (8 tests)
Tests for `_extract_ocr()` - OCR reference number extraction.
**Test cases**:
1. `test_extract_valid_ocr_10_digits` - extracts a 10-digit OCR number
2. `test_extract_valid_ocr_15_digits` - extracts a 15-digit OCR number
3. `test_extract_ocr_with_hash_markers` - OCR number with # markers
4. `test_extract_longest_ocr_when_multiple` - picks the longest of several candidates
5. `test_extract_ocr_ignores_short_numbers` - ignores numbers shorter than 10 digits
6. `test_extract_ocr_ignores_long_numbers` - ignores numbers longer than 25 digits
7. `test_extract_ocr_excludes_bankgiro_variants` - excludes Bankgiro variants
8. `test_extract_ocr_empty_tokens` - empty-token handling
#### 6. TestExtractBankgiro (9 tests)
Tests for `_extract_bankgiro()` - Bankgiro account extraction.
**Test cases**:
1. `test_extract_bankgiro_7_digits_with_dash` - 7-digit Bankgiro with a dash
2. `test_extract_bankgiro_7_digits_without_dash` - 7-digit Bankgiro without a dash
3. `test_extract_bankgiro_8_digits_with_dash` - 8-digit Bankgiro with a dash
4. `test_extract_bankgiro_8_digits_without_dash` - 8-digit Bankgiro without a dash
5. `test_extract_bankgiro_with_spaces` - Bankgiro containing spaces
6. `test_extract_bankgiro_handles_plusgiro_format` - handles the Plusgiro format
7. `test_extract_bankgiro_with_context` - with context keywords
8. `test_extract_bankgiro_ignores_plusgiro_context` - ignores a Plusgiro context
9. `test_extract_bankgiro_empty_tokens` - empty-token handling
#### 7. TestExtractPlusgiro (8 tests)
Tests for `_extract_plusgiro()` - Plusgiro account extraction.
**Test cases**:
1. `test_extract_plusgiro_7_digits_with_dash` - 7-digit Plusgiro with a dash
2. `test_extract_plusgiro_7_digits_without_dash` - 7-digit Plusgiro without a dash
3. `test_extract_plusgiro_8_digits` - 8-digit Plusgiro
4. `test_extract_plusgiro_with_spaces` - Plusgiro containing spaces
5. `test_extract_plusgiro_with_context` - with context keywords
6. `test_extract_plusgiro_ignores_too_short` - ignores fewer than 7 digits
7. `test_extract_plusgiro_ignores_too_long` - ignores more than 8 digits
8. `test_extract_plusgiro_empty_tokens` - empty-token handling
#### 8. TestExtractAmount (8 tests)
Tests for `_extract_amount()` - amount extraction.
**Test cases**:
1. `test_extract_amount_with_comma_decimal` - comma as decimal separator
2. `test_extract_amount_with_dot_decimal` - dot as decimal separator
3. `test_extract_amount_integer` - integer amount
4. `test_extract_amount_with_thousand_separator` - thousands separator
5. `test_extract_amount_large_number` - large amount
6. `test_extract_amount_ignores_too_large` - ignores implausibly large amounts
7. `test_extract_amount_ignores_zero` - ignores zero or negative amounts
8. `test_extract_amount_empty_tokens` - empty-token handling
---
## 📈 Coverage Analysis
### Covered Methods
✅ `_detect_account_context()` - **100%** (added in round one)
✅ `_normalize_account_spaces()` - **100%** (added in round one)
✅ `_format_account()` - **95%** (added in round one)
✅ `parse()` - **70%** (improved in round one)
✅ `_parse_standard_payment_line()` - **95%** (existing tests)
✅ `_extract_ocr()` - **85%** (added in round two)
✅ `_extract_bankgiro()` - **90%** (added in round two)
✅ `_extract_plusgiro()` - **90%** (added in round two)
✅ `_extract_amount()` - **80%** (added in round two)
### Methods Still Needing Work (uncovered / partially covered)
⚠️ `_calculate_confidence()` - **0%** (untested)
⚠️ `cross_validate()` - **0%** (untested)
⚠️ `get_region_bbox()` - **0%** (untested)
⚠️ `_find_tokens_with_values()` - **partially covered**
⚠️ `_find_machine_code_line_tokens()` - **partially covered**
### Uncovered Code (129 lines)
Mainly concentrated in:
1. **Validation methods** (lines 805-824): `_calculate_confidence`, `cross_validate`
2. **Helper methods** (lines 80-92, 336-369, 377-407): token lookup, bbox computation, logging
3. **Edge cases** (lines 648-653, 690, 699, 759-760, etc.): boundary conditions in some extraction methods
---
## 🎯 Recommendations
### ✅ Goals Achieved
- ✅ Coverage raised from 25% to 65% (+40%)
- ✅ Test count increased from 24 to 79 (+55)
- ✅ All extraction methods now tested (_extract_ocr, _extract_bankgiro, _extract_plusgiro, _extract_amount)
### Next Goals (coverage 65% → 80%+)
1. **Add tests for the validation methods** - cover `_calculate_confidence` and `cross_validate`
2. **Add tests for the helper methods** - cover token lookup and bbox computation
3. **Round out edge cases** - add tests for boundary conditions and error handling
4. **Integration tests** - add end-to-end tests using token data from real PDFs
---
## ✅ Completed Improvements
### Refactoring Benefits
- ✅ The 3 extracted helper methods can now be tested in isolation
- ✅ Finer-grained tests make problems easier to localize
- ✅ Better readability; the test cases are clear and easy to follow
### Quality Assurance
- ✅ All 655 tests pass (100%)
- ✅ No regressions
- ✅ The new tests cover refactored code that was previously untested
---
## 📚 Test-Writing Lessons
### What Worked
1. **Use fixtures for test data** - a `_create_token()` helper simplified token creation
2. **Organize test classes by method** - one test class per method keeps the structure clear
3. **Name test cases clearly** - the `test_<what>_<condition>` format is self-explanatory
4. **Cover the key paths** - prioritize common scenarios and boundary conditions
### Problems Encountered
1. **Token constructor arguments** - the `page_no` parameter was forgotten, causing the initial tests to fail
   - Fix: update the `_create_token()` helper to pass `page_no=0`
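A minimal sketch of such a helper, assuming a `TextToken` with `text`, `bbox`, and `page_no` fields (the exact constructor is an assumption, not copied from the codebase):
```python
def _create_token(text: str, bbox=(0, 0, 100, 20), page_no: int = 0) -> TextToken:
    """Build a TextToken for tests with sensible defaults."""
    return TextToken(text=text, bbox=bbox, page_no=page_no)
```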
---
**Report date**: 2026-01-24
**Status**: ✅ Complete
**Next step**: continue raising coverage to 60%+

View File

@@ -1,619 +0,0 @@
# Multi-Pool Processing Architecture Design
## 1. Research Summary
### 1.1 Analysis of the Current Problems
The dual-pool mode we implemented previously has stability problems, mainly caused by:
| Problem | Cause | Solution |
|------|------|----------|
| Processing hangs | Mixing threads with ProcessPoolExecutor causes deadlocks | Use asyncio or a pure Queue pattern |
| Queue.get() blocks forever | No timeout mechanism | Add timeouts and sentinel values |
| GPU memory conflicts | Multiple processes access the GPU at once | Limit GPU workers to 1 |
| CUDA fork problem | Linux's default fork is incompatible with CUDA | Use the spawn start method |
### 1.2 Recommended Architecture
After this research, the approach that best fits our scenario is a **producer-consumer queue pattern**:
```
┌─────────────────┐      ┌─────────────────┐     ┌─────────────────┐
│  Main Process   │      │  CPU Workers    │     │   GPU Worker    │
│                 │      │ (4 processes)   │     │  (1 process)    │
│ ┌───────────┐   │      │                 │     │                 │
│ │ Task      │───┼─────▶│ Text PDFs       │     │ Scanned PDFs    │
│ │ Dispatcher│───┼──────┼─────────────────┼────▶│ (PaddleOCR)     │
│ └───────────┘   │      │ (no OCR needed) │     │                 │
│      ▲          │      │       │         │     │       │         │
│      │          │      │       ▼         │     │       ▼         │
│ ┌───────────┐   │      │  Result Queue   │     │  Result Queue   │
│ │ Result    │◀──┼──────┴─────────────────┴─────┴─────────────────┘
│ │ Collector │   │
│ └───────────┘   │
│      │          │
│      ▼          │
│ ┌───────────┐   │
│ │ Database  │   │
│ │ Batch     │   │
│ │ Writer    │   │
│ └───────────┘   │
└─────────────────┘
```
---
## 2. Core Design Principles
### 2.1 CUDA Compatibility
```python
# Key point: use the spawn start method
import multiprocessing as mp
ctx = mp.get_context("spawn")

# Set the device when the GPU worker is initialized
def init_gpu_worker(gpu_id: int = 0):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    global _ocr
    from paddleocr import PaddleOCR
    _ocr = PaddleOCR(use_gpu=True, ...)
```
### 2.2 Worker Initialization Pattern
Use the `initializer` argument to load the model once per worker instead of reloading it for every task:
```python
# Module-level global that holds the model inside each worker
_ocr = None

def init_worker(use_gpu: bool, gpu_id: int = 0):
    global _ocr
    if use_gpu:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    else:
        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    from paddleocr import PaddleOCR
    _ocr = PaddleOCR(use_gpu=use_gpu, ...)

# Pass the initializer when creating the pool
pool = ProcessPoolExecutor(
    max_workers=1,
    initializer=init_worker,
    initargs=(True, 0),  # use_gpu=True, gpu_id=0
    mp_context=mp.get_context("spawn")
)
```
### 2.3 Queue Pattern vs as_completed
| Approach | Pros | Cons | Suitable for |
|------|------|------|----------|
| `as_completed()` | Simple, no queue management | Cannot span multiple pools | Single-pool scenarios |
| `multiprocessing.Queue` | High performance, flexible | Manual management, deadlock risk | Multi-pool pipelines |
| `Manager().Queue()` | Picklable, works across pools | Lower performance | Scenarios that need Pool.map |
**Recommendation**: for the dual-pool scenario, use `as_completed()` on each pool separately and then merge the results.
---
## 3. Detailed Development Plan
### Phase 1: Rebuild the Foundations (2-3 days)
#### 1.1 Create a WorkerPool Abstract Class
```python
# src/processing/worker_pool.py
from __future__ import annotations
from abc import ABC, abstractmethod
from concurrent.futures import ProcessPoolExecutor, Future
from dataclasses import dataclass
from typing import List, Any, Optional, Callable
import multiprocessing as mp

@dataclass
class TaskResult:
    """Container for a single task result."""
    task_id: str
    success: bool
    data: Any
    error: Optional[str] = None
    processing_time: float = 0.0

class WorkerPool(ABC):
    """Abstract base class for worker pools."""
    def __init__(self, max_workers: int, use_gpu: bool = False, gpu_id: int = 0):
        self.max_workers = max_workers
        self.use_gpu = use_gpu
        self.gpu_id = gpu_id
        self._executor: Optional[ProcessPoolExecutor] = None

    @abstractmethod
    def get_initializer(self) -> Callable:
        """Return the worker initializer function."""
        pass

    @abstractmethod
    def get_init_args(self) -> tuple:
        """Return the initializer arguments."""
        pass

    def start(self):
        """Start the worker pool."""
        ctx = mp.get_context("spawn")
        self._executor = ProcessPoolExecutor(
            max_workers=self.max_workers,
            mp_context=ctx,
            initializer=self.get_initializer(),
            initargs=self.get_init_args()
        )

    def submit(self, fn: Callable, *args, **kwargs) -> Future:
        """Submit a task."""
        if not self._executor:
            raise RuntimeError("Pool not started")
        return self._executor.submit(fn, *args, **kwargs)

    def shutdown(self, wait: bool = True):
        """Shut down the pool."""
        if self._executor:
            self._executor.shutdown(wait=wait)
            self._executor = None

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *args):
        self.shutdown()
```
#### 1.2 Implement the CPU and GPU Worker Pools
```python
# src/processing/cpu_pool.py
class CPUWorkerPool(WorkerPool):
"""CPU-only worker pool for text PDF processing"""
def __init__(self, max_workers: int = 4):
super().__init__(max_workers=max_workers, use_gpu=False)
def get_initializer(self) -> Callable:
return init_cpu_worker
def get_init_args(self) -> tuple:
return ()
# src/processing/gpu_pool.py
class GPUWorkerPool(WorkerPool):
"""GPU worker pool for OCR processing"""
def __init__(self, max_workers: int = 1, gpu_id: int = 0):
super().__init__(max_workers=max_workers, use_gpu=True, gpu_id=gpu_id)
def get_initializer(self) -> Callable:
return init_gpu_worker
def get_init_args(self) -> tuple:
return (self.gpu_id,)
```
---
### Phase 2: Implement the Dual-Pool Coordinator (2-3 days)
#### 2.1 Task Dispatcher
```python
# src/processing/task_dispatcher.py
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Tuple

class TaskType(Enum):
    CPU = auto()  # Text PDF
    GPU = auto()  # Scanned PDF

@dataclass
class Task:
    id: str
    task_type: TaskType
    data: Any

class TaskDispatcher:
    """Dispatch tasks to the appropriate pool based on the PDF type."""

    def classify_task(self, doc_info: dict) -> TaskType:
        """Decide whether a document needs OCR."""
        # Based on PDF characteristics
        if self._is_scanned_pdf(doc_info):
            return TaskType.GPU
        return TaskType.CPU

    def _is_scanned_pdf(self, doc_info: dict) -> bool:
        """Detect whether the document is a scan."""
        # 1. Check whether any text can be extracted
        # 2. Check the ratio of image area
        # 3. Check the text density
        pass

    def partition_tasks(self, tasks: List[Task]) -> Tuple[List[Task], List[Task]]:
        """Split tasks into CPU and GPU groups."""
        cpu_tasks = [t for t in tasks if t.task_type == TaskType.CPU]
        gpu_tasks = [t for t in tasks if t.task_type == TaskType.GPU]
        return cpu_tasks, gpu_tasks
```
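The `_is_scanned_pdf` stub above still needs a concrete heuristic. A possible sketch using PyMuPDF (fitz), assuming the dispatcher can reach the PDF path; the threshold is an assumption, not a measured value:
```python
import fitz  # PyMuPDF

def is_scanned_pdf(pdf_path: str, min_chars_per_page: int = 50) -> bool:
    """Treat the PDF as scanned when its pages carry almost no extractable text."""
    with fitz.open(pdf_path) as doc:
        total_chars = sum(len(page.get_text()) for page in doc)
        return total_chars < min_chars_per_page * max(len(doc), 1)
```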
#### 2.2 Dual-Pool Coordinator
```python
# src/processing/dual_pool_coordinator.py
from concurrent.futures import as_completed
from typing import List, Iterator
import logging

logger = logging.getLogger(__name__)

class DualPoolCoordinator:
    """Coordinate the CPU and GPU worker pools."""

    def __init__(
        self,
        cpu_workers: int = 4,
        gpu_workers: int = 1,
        gpu_id: int = 0
    ):
        self.cpu_pool = CPUWorkerPool(max_workers=cpu_workers)
        self.gpu_pool = GPUWorkerPool(max_workers=gpu_workers, gpu_id=gpu_id)
        self.dispatcher = TaskDispatcher()

    def __enter__(self):
        self.cpu_pool.start()
        self.gpu_pool.start()
        return self

    def __exit__(self, *args):
        self.cpu_pool.shutdown()
        self.gpu_pool.shutdown()

    def process_batch(
        self,
        documents: List[dict],
        cpu_task_fn: Callable,
        gpu_task_fn: Callable,
        on_result: Optional[Callable[[TaskResult], None]] = None,
        on_error: Optional[Callable[[str, Exception], None]] = None
    ) -> List[TaskResult]:
        """
        Process a batch of documents, automatically dispatching to the CPU or GPU pool.

        Args:
            documents: documents to process
            cpu_task_fn: handler for CPU tasks
            gpu_task_fn: handler for GPU tasks
            on_result: result callback (optional)
            on_error: error callback (optional)

        Returns:
            A list with the result of every task.
        """
        # Classify the tasks
        tasks = [
            Task(id=doc['id'], task_type=self.dispatcher.classify_task(doc), data=doc)
            for doc in documents
        ]
        cpu_tasks, gpu_tasks = self.dispatcher.partition_tasks(tasks)
        logger.info(f"Task partition: {len(cpu_tasks)} CPU, {len(gpu_tasks)} GPU")

        # Submit the tasks to their respective pools
        cpu_futures = {
            self.cpu_pool.submit(cpu_task_fn, t.data): t.id
            for t in cpu_tasks
        }
        gpu_futures = {
            self.gpu_pool.submit(gpu_task_fn, t.data): t.id
            for t in gpu_tasks
        }

        # Collect the results
        results = []
        all_futures = list(cpu_futures.keys()) + list(gpu_futures.keys())
        for future in as_completed(all_futures):
            task_id = cpu_futures.get(future) or gpu_futures.get(future)
            pool_type = "CPU" if future in cpu_futures else "GPU"
            try:
                data = future.result(timeout=300)  # 5-minute timeout
                result = TaskResult(task_id=task_id, success=True, data=data)
                if on_result:
                    on_result(result)
            except Exception as e:
                logger.error(f"[{pool_type}] Task {task_id} failed: {e}")
                result = TaskResult(task_id=task_id, success=False, data=None, error=str(e))
                if on_error:
                    on_error(task_id, e)
            results.append(result)
        return results
```
---
### Phase 3: Integrate into autolabel (1-2 days)
#### 3.1 Modify autolabel.py
```python
# src/cli/autolabel.py
def run_autolabel_dual_pool(args):
    """Run auto-labeling in dual-pool mode."""
    from src.processing.dual_pool_coordinator import DualPoolCoordinator

    # Set up database batching
    db_batch = []
    db_batch_size = 100

    def on_result(result: TaskResult):
        """Handle a successful result."""
        nonlocal db_batch
        db_batch.append(result.data)
        if len(db_batch) >= db_batch_size:
            save_documents_batch(db_batch)
            db_batch.clear()

    def on_error(task_id: str, error: Exception):
        """Handle an error."""
        logger.error(f"Task {task_id} failed: {error}")

    # Create the dual-pool coordinator
    with DualPoolCoordinator(
        cpu_workers=args.cpu_workers or 4,
        gpu_workers=args.gpu_workers or 1,
        gpu_id=0
    ) as coordinator:
        # Process every CSV
        for csv_file in csv_files:
            documents = load_documents_from_csv(csv_file)
            results = coordinator.process_batch(
                documents=documents,
                cpu_task_fn=process_text_pdf,
                gpu_task_fn=process_scanned_pdf,
                on_result=on_result,
                on_error=on_error
            )
            logger.info(f"CSV {csv_file}: {len(results)} processed")

    # Save the remaining batch
    if db_batch:
        save_documents_batch(db_batch)
```
---
### Phase 4: Testing and Validation (1-2 days)
#### 4.1 Unit Tests
```python
# tests/unit/test_dual_pool.py
import pytest
from src.processing.dual_pool_coordinator import DualPoolCoordinator, TaskResult

class TestDualPoolCoordinator:
    def test_cpu_only_batch(self):
        """Batch containing only CPU tasks."""
        with DualPoolCoordinator(cpu_workers=2, gpu_workers=1) as coord:
            docs = [{"id": f"doc_{i}", "type": "text"} for i in range(10)]
            results = coord.process_batch(docs, cpu_fn, gpu_fn)
            assert len(results) == 10
            assert all(r.success for r in results)

    def test_mixed_batch(self):
        """Batch mixing CPU and GPU tasks."""
        with DualPoolCoordinator(cpu_workers=2, gpu_workers=1) as coord:
            docs = [
                {"id": "text_1", "type": "text"},
                {"id": "scan_1", "type": "scanned"},
                {"id": "text_2", "type": "text"},
            ]
            results = coord.process_batch(docs, cpu_fn, gpu_fn)
            assert len(results) == 3

    def test_timeout_handling(self):
        """Timeout handling."""
        pass

    def test_error_recovery(self):
        """Error recovery."""
        pass
```
#### 4.2 Integration Tests
```python
# tests/integration/test_autolabel_dual_pool.py
def test_autolabel_with_dual_pool():
    """End-to-end test of the dual-pool mode."""
    # Use a small amount of test data
    result = subprocess.run([
        "python", "-m", "src.cli.autolabel",
        "--cpu-workers", "2",
        "--gpu-workers", "1",
        "--limit", "50"
    ], capture_output=True)
    assert result.returncode == 0
    # Verify the database records
```
---
## 4. Key Technical Points
### 4.1 Deadlock-Avoidance Strategies
```python
# 1. Use timeouts
try:
    result = future.result(timeout=300)
except TimeoutError:
    logger.warning("Task timed out")

# 2. Use sentinel values
SENTINEL = object()
queue.put(SENTINEL)  # signal the end of the stream

# 3. Check the worker process state
if not worker.is_alive():
    logger.error("Worker died unexpectedly")
    break

# 4. Drain the queue before joining
while not queue.empty():
    results.append(queue.get_nowait())
worker.join(timeout=5.0)
```
### 4.2 PaddleOCR-Specific Handling
```python
# PaddleOCR must be initialized inside the worker process
def init_paddle_worker(gpu_id: int):
    global _ocr
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # Import lazily so the CUDA environment variable takes effect
    from paddleocr import PaddleOCR
    _ocr = PaddleOCR(
        use_angle_cls=True,
        lang='en',
        use_gpu=True,
        show_log=False,
        # Important: cap GPU memory usage (MB)
        gpu_mem=2000
    )
```
### 4.3 Resource Monitoring
```python
import psutil
import GPUtil

def get_resource_usage():
    """Collect current system resource usage."""
    cpu_percent = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()
    gpu_info = []
    for gpu in GPUtil.getGPUs():
        gpu_info.append({
            "id": gpu.id,
            "memory_used": gpu.memoryUsed,
            "memory_total": gpu.memoryTotal,
            "utilization": gpu.load * 100
        })
    return {
        "cpu_percent": cpu_percent,
        "memory_percent": memory.percent,
        "gpu": gpu_info
    }
```
---
## 5. Risk Assessment and Mitigation
| Risk | Likelihood | Impact | Mitigation |
|------|--------|------|----------|
| GPU out of memory | Medium | High | Limit GPU workers to 1; set the gpu_mem parameter |
| Zombie processes | Low | High | Add heartbeat checks; restart automatically on timeout |
| Task misclassification | Medium | Medium | Add a fallback: retry on GPU after a CPU failure |
| Database write bottleneck | Low | Medium | Increase the batch size; write asynchronously |
---
## 6. Alternatives
If the approach above still has problems, consider:
### 6.1 Use Ray
```python
import ray
ray.init()

@ray.remote(num_cpus=1)
def cpu_task(data):
    return process_text_pdf(data)

@ray.remote(num_gpus=1)
def gpu_task(data):
    return process_scanned_pdf(data)

# Ray schedules resources automatically
futures = [cpu_task.remote(d) for d in cpu_docs]
futures += [gpu_task.remote(d) for d in gpu_docs]
results = ray.get(futures)
```
### 6.2 Single Pool + Dynamic GPU Scheduling
Keep the single-pool mode, but decide inside each task whether to use the GPU:
```python
def process_document(doc_data):
    if is_scanned_pdf(doc_data):
        # Use the GPU (requires a global lock or semaphore to limit concurrency)
        with gpu_semaphore:
            return process_with_ocr(doc_data)
    else:
        return process_text_only(doc_data)
```
---
## 7. Timeline Summary
| Phase | Task | Estimated effort |
|------|------|------------|
| Phase 1 | Rebuild the foundations | 2-3 days |
| Phase 2 | Implement the dual-pool coordinator | 2-3 days |
| Phase 3 | Integrate into autolabel | 1-2 days |
| Phase 4 | Testing and validation | 1-2 days |
| **Total** | | **6-10 days** |
---
## 8. References
1. [Python concurrent.futures documentation](https://docs.python.org/3/library/concurrent.futures.html)
2. [PyTorch Multiprocessing Best Practices](https://docs.pytorch.org/docs/stable/notes/multiprocessing.html)
3. [Super Fast Python - ProcessPoolExecutor guide](https://superfastpython.com/processpoolexecutor-in-python/)
4. [PaddleOCR parallel inference documentation](http://www.paddleocr.ai/main/en/version3.x/pipeline_usage/instructions/parallel_inference.html)
5. [AWS - Parallelizing ML inference across CPUs/GPUs](https://aws.amazon.com/blogs/machine-learning/parallelizing-across-multiple-cpu-gpus-to-speed-up-deep-learning-inference-at-the-edge/)
6. [Ray distributed multiprocessing](https://docs.ray.io/en/latest/ray-more-libs/multiprocessing.html)

1223
docs/product-plan-v2.md Normal file

File diff suppressed because it is too large Load Diff

302
docs/ux-design-prompt-v2.md Normal file
View File

@@ -0,0 +1,302 @@
# Document Annotation Tool UX Design Spec v2
## Theme: Warm Graphite (Modern Enterprise)
---
## 1. Design Principles (Updated)
1. **Clarity**: High contrast, but never pure black-on-white
2. **Warm Neutrality**: Slightly warm grays reduce visual fatigue
3. **Focus**: Content-first layouts with restrained accents
4. **Consistency**: Reusable patterns, predictable behavior
5. **Professional Trust**: Calm, serious, enterprise-ready
6. **Longevity**: No trendy colors that age quickly
---
## 2. Color Palette (Warm Graphite)
### Core Colors
| Usage | Color Name | Hex |
|------|-----------|-----|
| Primary Text | Soft Black | #121212 |
| Secondary Text | Charcoal Gray | #2A2A2A |
| Muted Text | Warm Gray | #6B6B6B |
| Disabled Text | Light Warm Gray | #9A9A9A |
### Backgrounds
| Usage | Color | Hex |
|-----|------|-----|
| App Background | Paper White | #FAFAF8 |
| Card / Panel | White | #FFFFFF |
| Hover Surface | Subtle Warm Gray | #F1F0ED |
| Selected Row | Very Light Warm Gray | #ECEAE6 |
### Borders & Dividers
| Usage | Color | Hex |
|------|------|-----|
| Default Border | Warm Light Gray | #E6E4E1 |
| Strong Divider | Neutral Gray | #D8D6D2 |
### Semantic States (Muted & Professional)
| State | Color | Hex |
|------|-------|-----|
| Success | Olive Gray | #3E4A3A |
| Error | Brick Gray | #4A3A3A |
| Warning | Sand Gray | #4A4A3A |
| Info | Graphite Gray | #3A3A3A |
> Accent colors are **never saturated** and are used only for status, progress, or selection.
---
## 3. Typography
- **Font Family**: Inter / SF Pro / system-ui
- **Headings**:
- Weight: 600-700
- Color: #121212
- Letter spacing: -0.01em
- **Body Text**:
- Weight: 400
- Color: #2A2A2A
- **Captions / Meta**:
- Weight: 400
- Color: #6B6B6B
- **Monospace (IDs / Values)**:
- JetBrains Mono / SF Mono
- Color: #2A2A2A
---
## 4. Global Layout
### Top Navigation Bar
- Height: 56px
- Background: #FAFAF8
- Bottom Border: 1px solid #E6E4E1
- Logo: Text or icon in #121212
**Navigation Items**
- Default: #6B6B6B
- Hover: #2A2A2A
- Active:
- Text: #121212
- Bottom indicator: 2px solid #3A3A3A (rounded ends)
**Avatar**
- Circle background: #ECEAE6
- Text: #2A2A2A
---
## 5. Page: Documents (Dashboard)
### Page Header
- Title: "Documents" (#121212)
- Actions:
- Primary button: Dark graphite outline
- Secondary button: Subtle border only
### Filters Bar
- Background: #FFFFFF
- Border: 1px solid #E6E4E1
- Inputs:
- Background: #FFFFFF
- Hover: #F1F0ED
- Focus ring: 1px #3A3A3A
### Document Table
- Table background: #FFFFFF
- Header text: #6B6B6B
- Row hover: #F1F0ED
- Row selected:
- Background: #ECEAE6
- Left indicator: 3px solid #3A3A3A
### Status Badges
- Pending:
- BG: #FFFFFF
- Border: #D8D6D2
- Text: #2A2A2A
- Labeled:
- BG: #2A2A2A
- Text: #FFFFFF
- Exported:
- BG: #ECEAE6
- Text: #2A2A2A
- Icon: ✓
### Auto-label States
- Running:
- Progress bar: #3A3A3A on #ECEAE6
- Completed:
- Text: #3E4A3A
- Failed:
- BG: #F1EDED
- Text: #4A3A3A
---
## 6. Upload Modals (Single & Batch)
### Modal Container
- Background: #FFFFFF
- Border radius: 8px
- Shadow: 0 1px 3px rgba(0,0,0,0.08)
### Drop Zone
- Background: #FAFAF8
- Border: 1px dashed #D8D6D2
- Hover: #F1F0ED
- Icon: Graphite gray
### Form Fields
- Input BG: #FFFFFF
- Border: #D8D6D2
- Focus: 1px solid #3A3A3A
Primary Action Button:
- Text: #FFFFFF
- BG: #2A2A2A
- Hover: #121212
---
## 7. Document Detail View
### Canvas Area
- Background: #FFFFFF
- Annotation styles:
- Manual: Solid border #2A2A2A
- Auto: Dashed border #6B6B6B
- Selected: 2px border #3A3A3A + resize handles
### Right Info Panel
- Card background: #FFFFFF
- Section headers: #121212
- Meta text: #6B6B6B
### Annotation Table
- Same table styles as Documents
- Inline edit:
- Input background: #FAFAF8
- Save button: Graphite
### Locked State (Auto-label Running)
- Banner BG: #FAFAF8
- Border-left: 3px solid #4A4A3A
- Progress bar: Graphite
---
## 8. Training Page
### Document Selector
- Selected rows use same highlight rules
- Verified state:
- Full: Olive gray check
- Partial: Sand gray warning
### Configuration Panel
- Card layout
- Inputs aligned to grid
- Schedule option visually muted until enabled
Primary CTA:
- Start Training button in dark graphite
---
## 9. Models & Training History
### Training Job List
- Job cards use #FFFFFF background
- Running job:
- Progress bar: #3A3A3A
- Completed job:
- Metrics bars in graphite
### Model Detail Panel
- Sectioned cards
- Metric bars:
- Track: #ECEAE6
- Fill: #3A3A3A
Actions:
- Primary: Download Model
- Secondary: View Logs / Use as Base
---
## 10. Micro-interactions (Refined)
| Element | Interaction | Animation |
|------|------------|-----------|
| Button hover | BG lightens | 150ms ease-out |
| Button press | Scale 0.98 | 100ms |
| Row hover | BG fade | 120ms |
| Modal open | Fade + scale 0.96 → 1 | 200ms |
| Progress fill | Smooth | ease-out |
| Annotation select | Border + handles | 120ms |
---
## 11. Tailwind Theme (Updated)
```js
colors: {
text: {
primary: '#121212',
secondary: '#2A2A2A',
muted: '#6B6B6B',
disabled: '#9A9A9A',
},
bg: {
app: '#FAFAF8',
card: '#FFFFFF',
hover: '#F1F0ED',
selected: '#ECEAE6',
},
border: '#E6E4E1',
accent: '#3A3A3A',
success: '#3E4A3A',
error: '#4A3A3A',
warning: '#4A4A3A',
}
```
---
## 12. Final Notes
- Pure black (#000000) should **never** be used for large surfaces
- Accent color usage should stay under **10% of UI area**
- Warm grays are intentional and must not be "corrected" to blue-grays
This theme is designed to scale from internal tool → polished SaaS without redesign.

View File

@@ -0,0 +1,273 @@
# Web Directory Refactoring - Complete ✅
**Date**: 2026-01-25
**Status**: ✅ Completed
**Tests**: 188 passing (0 failures)
**Coverage**: 23% (maintained)
---
## Final Directory Structure
```
src/web/
├── api/
│ ├── __init__.py
│ └── v1/
│ ├── __init__.py
│ ├── routes.py # Public inference API
│ ├── admin/
│ │ ├── __init__.py
│ │ ├── documents.py # Document management (was admin_routes.py)
│ │ ├── annotations.py # Annotation routes (was admin_annotation_routes.py)
│ │ └── training.py # Training routes (was admin_training_routes.py)
│ ├── async_api/
│ │ ├── __init__.py
│ │ └── routes.py # Async processing API (was async_routes.py)
│ └── batch/
│ ├── __init__.py
│ └── routes.py # Batch upload API (was batch_upload_routes.py)
├── schemas/
│ ├── __init__.py
│ ├── common.py # Shared models (ErrorResponse)
│ ├── admin.py # Admin schemas (was admin_schemas.py)
│ └── inference.py # Inference + async schemas (was schemas.py)
├── services/
│ ├── __init__.py
│ ├── inference.py # Inference service (was services.py)
│ ├── autolabel.py # Auto-label service (was admin_autolabel.py)
│ ├── async_processing.py # Async processing (was async_service.py)
│ └── batch_upload.py # Batch upload service (was batch_upload_service.py)
├── core/
│ ├── __init__.py
│ ├── auth.py # Authentication (was admin_auth.py)
│ ├── rate_limiter.py # Rate limiting (unchanged)
│ └── scheduler.py # Task scheduler (was admin_scheduler.py)
├── workers/
│ ├── __init__.py
│ ├── async_queue.py # Async task queue (was async_queue.py)
│ └── batch_queue.py # Batch task queue (was batch_queue.py)
├── __init__.py # Main exports
├── app.py # FastAPI app (imports updated)
├── config.py # Configuration (unchanged)
└── dependencies.py # Global dependencies (unchanged)
```
---
## Changes Summary
### Files Moved and Renamed
| Old Location | New Location | Change Type |
|-------------|--------------|-------------|
| `admin_routes.py` | `api/v1/admin/documents.py` | Moved + Renamed |
| `admin_annotation_routes.py` | `api/v1/admin/annotations.py` | Moved + Renamed |
| `admin_training_routes.py` | `api/v1/admin/training.py` | Moved + Renamed |
| `admin_auth.py` | `core/auth.py` | Moved |
| `admin_autolabel.py` | `services/autolabel.py` | Moved |
| `admin_scheduler.py` | `core/scheduler.py` | Moved |
| `admin_schemas.py` | `schemas/admin.py` | Moved |
| `routes.py` | `api/v1/routes.py` | Moved |
| `schemas.py` | `schemas/inference.py` | Moved |
| `services.py` | `services/inference.py` | Moved |
| `async_routes.py` | `api/v1/async_api/routes.py` | Moved |
| `async_queue.py` | `workers/async_queue.py` | Moved |
| `async_service.py` | `services/async_processing.py` | Moved + Renamed |
| `batch_queue.py` | `workers/batch_queue.py` | Moved |
| `batch_upload_routes.py` | `api/v1/batch/routes.py` | Moved |
| `batch_upload_service.py` | `services/batch_upload.py` | Moved |
**Total**: 16 files reorganized
### Files Updated
**Source Files** (imports updated):
- `app.py` - Updated all imports to new structure
- `api/v1/admin/documents.py` - Updated schema/auth imports
- `api/v1/admin/annotations.py` - Updated schema/service imports
- `api/v1/admin/training.py` - Updated schema/auth imports
- `api/v1/routes.py` - Updated schema imports
- `api/v1/async_api/routes.py` - Updated schema imports
- `api/v1/batch/routes.py` - Updated service/worker imports
- `services/async_processing.py` - Updated worker/core imports
**Test Files** (all 16 updated, including `conftest.py`):
- `test_admin_annotations.py`
- `test_admin_auth.py`
- `test_admin_routes.py`
- `test_admin_routes_enhanced.py`
- `test_admin_training.py`
- `test_annotation_locks.py`
- `test_annotation_phase5.py`
- `test_async_queue.py`
- `test_async_routes.py`
- `test_async_service.py`
- `test_autolabel_with_locks.py`
- `test_batch_queue.py`
- `test_batch_upload_routes.py`
- `test_batch_upload_service.py`
- `test_training_phase4.py`
- `conftest.py`
---
## Import Examples
### Old Import Style (Before Refactoring)
```python
from src.web.admin_routes import create_admin_router
from src.web.admin_schemas import DocumentItem
from src.web.admin_auth import validate_admin_token
from src.web.async_routes import create_async_router
from src.web.schemas import ErrorResponse
```
### New Import Style (After Refactoring)
```python
# Admin API
from src.web.api.v1.admin.documents import create_admin_router
from src.web.api.v1.admin import create_admin_router # Shorter alternative
# Schemas
from src.web.schemas.admin import DocumentItem
from src.web.schemas.common import ErrorResponse
# Core components
from src.web.core.auth import validate_admin_token
# Async API
from src.web.api.v1.async_api.routes import create_async_router
```
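The shorter `from src.web.api.v1.admin import create_admin_router` form relies on the admin package `__init__.py` re-exporting its router factories. A minimal sketch of such an `__init__.py` (the annotation and training factory names are assumptions for illustration):

```python
# src/web/api/v1/admin/__init__.py -- illustrative; the actual export list may differ.
from .documents import create_admin_router
from .annotations import create_annotation_router  # assumed factory name
from .training import create_training_router       # assumed factory name

__all__ = [
    "create_admin_router",
    "create_annotation_router",
    "create_training_router",
]
```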
---
## Benefits Achieved
### 1. **Clear Separation of Concerns**
- **API Routes**: All in `api/v1/` by version and feature
- **Data Models**: All in `schemas/` by domain
- **Business Logic**: All in `services/`
- **Core Components**: Reusable utilities in `core/`
- **Background Jobs**: Task queues in `workers/`
### 2. **Better Scalability**
- Easy to add API v2 without touching v1
- Clear namespace for each module
- Reduced file sizes (no 800+ line files)
- Follows single responsibility principle
### 3. **Improved Maintainability**
- Find files by function, not by prefix
- Each module has one clear purpose
- Easier to onboard new developers
- Better IDE navigation
### 4. **Standards Compliance**
- Follows FastAPI best practices
- Matches Django/Flask project structures
- Standard Python package organization
- Industry-standard naming conventions
---
## Testing Results
**Before Refactoring**:
- 188 tests passing
- 23% code coverage
- Flat directory structure
**After Refactoring**:
- ✅ 188 tests passing (0 failures)
- ✅ 23% code coverage (maintained)
- ✅ Clean hierarchical structure
- ✅ All imports updated
- ✅ No backward compatibility shims needed
---
## Migration Statistics
| Metric | Count |
|--------|-------|
| Files moved | 16 |
| Directories created | 9 |
| Files updated (source) | 8 |
| Files updated (tests) | 16 |
| Import statements updated | ~150 |
| Lines of code changed | ~200 |
| Tests broken | 0 |
| Coverage lost | 0% |
---
## Code Diff Summary
```diff
Before:
src/web/
├── admin_routes.py (645 lines)
├── admin_annotation_routes.py (504 lines)
├── admin_training_routes.py (565 lines)
├── admin_auth.py (22 lines)
├── admin_schemas.py (262 lines)
... (15 more files at root level)
After:
src/web/
├── api/v1/
│ ├── admin/ (3 route files)
│ ├── async_api/ (1 route file)
│ └── batch/ (1 route file)
├── schemas/ (3 schema files)
├── services/ (4 service files)
├── core/ (3 core files)
└── workers/ (2 worker files)
```
---
## Next Steps (Optional)
### Phase 2: Documentation
- [ ] Update API documentation with new import paths
- [ ] Create migration guide for external developers
- [ ] Update CLAUDE.md with new structure
### Phase 3: Further Optimization
- [ ] Split large files (>400 lines) if needed
- [ ] Extract common utilities
- [ ] Add typing stubs
### Phase 4: Deprecation (Future)
- [ ] Add deprecation warnings if creating compatibility layer
- [ ] Remove old imports after grace period
- [ ] Update all documentation
---
## Rollback Instructions
If needed, rollback is simple:
```bash
git revert <commit-hash>
```
All changes are in version control, making rollback safe and easy.
---
## Conclusion
**Refactoring completed successfully**
**Zero breaking changes**
**All tests passing**
**Industry-standard structure achieved**
The web directory is now organized following Python and FastAPI best practices, making it easier to scale, maintain, and extend.
@@ -0,0 +1,186 @@
# Web Directory Refactoring Plan
## Current Structure Issues
1. **Flat structure**: All files in one directory (20 Python files)
2. **Naming inconsistency**: Mix of `admin_*`, `async_*`, `batch_*` prefixes
3. **Mixed concerns**: Routes, schemas, services, and workers in same directory
4. **Poor scalability**: Hard to navigate and maintain as project grows
## Proposed Structure (Best Practices)
```
src/web/
├── __init__.py # Main exports
├── app.py # FastAPI app factory
├── config.py # App configuration
├── dependencies.py # Global dependencies
├── api/ # API Routes Layer
│ ├── __init__.py
│ └── v1/ # API version 1
│ ├── __init__.py
│ ├── routes.py # Public API routes (inference)
│ ├── admin/ # Admin API routes
│ │ ├── __init__.py
│ │ ├── documents.py # admin_routes.py → documents.py
│ │ ├── annotations.py # admin_annotation_routes.py → annotations.py
│ │ ├── training.py # admin_training_routes.py → training.py
│ │ └── auth.py # admin_auth.py → auth.py (routes only)
│ ├── async_api/ # Async processing API
│ │ ├── __init__.py
│ │ └── routes.py # async_routes.py → routes.py
│ └── batch/ # Batch upload API
│ ├── __init__.py
│ └── routes.py # batch_upload_routes.py → routes.py
├── schemas/ # Pydantic Models
│ ├── __init__.py
│ ├── common.py # Shared schemas (ErrorResponse, etc.)
│ ├── inference.py # schemas.py → inference.py
│ ├── admin.py # admin_schemas.py → admin.py
│ ├── async_api.py # New: async API schemas
│ └── batch.py # New: batch upload schemas
├── services/ # Business Logic Layer
│ ├── __init__.py
│ ├── inference.py # services.py → inference.py
│ ├── autolabel.py # admin_autolabel.py → autolabel.py
│ ├── async_processing.py # async_service.py → async_processing.py
│ └── batch_upload.py # batch_upload_service.py → batch_upload.py
├── core/ # Core Components
│ ├── __init__.py
│ ├── auth.py # admin_auth.py → auth.py (logic only)
│ ├── rate_limiter.py # rate_limiter.py → rate_limiter.py
│ └── scheduler.py # admin_scheduler.py → scheduler.py
└── workers/ # Background Task Queues
├── __init__.py
├── async_queue.py # async_queue.py → async_queue.py
└── batch_queue.py # batch_queue.py → batch_queue.py
```
## File Mapping
### Current → New Location
| Current File | New Location | Purpose |
|--------------|--------------|---------|
| `admin_routes.py` | `api/v1/admin/documents.py` | Document management routes |
| `admin_annotation_routes.py` | `api/v1/admin/annotations.py` | Annotation routes |
| `admin_training_routes.py` | `api/v1/admin/training.py` | Training routes |
| `admin_auth.py` | Split: `api/v1/admin/auth.py` + `core/auth.py` | Auth routes + logic |
| `admin_schemas.py` | `schemas/admin.py` | Admin Pydantic models |
| `admin_autolabel.py` | `services/autolabel.py` | Auto-label service |
| `admin_scheduler.py` | `core/scheduler.py` | Training scheduler |
| `routes.py` | `api/v1/routes.py` | Public inference API |
| `schemas.py` | `schemas/inference.py` | Inference models |
| `services.py` | `services/inference.py` | Inference service |
| `async_routes.py` | `api/v1/async_api/routes.py` | Async API routes |
| `async_service.py` | `services/async_processing.py` | Async processing service |
| `async_queue.py` | `workers/async_queue.py` | Async task queue |
| `batch_upload_routes.py` | `api/v1/batch/routes.py` | Batch upload routes |
| `batch_upload_service.py` | `services/batch_upload.py` | Batch upload service |
| `batch_queue.py` | `workers/batch_queue.py` | Batch task queue |
| `rate_limiter.py` | `core/rate_limiter.py` | Rate limiting logic |
| `config.py` | `config.py` | Keep as-is |
| `dependencies.py` | `dependencies.py` | Keep as-is |
| `app.py` | `app.py` | Keep as-is (update imports) |
## Benefits
### 1. Clear Separation of Concerns
- **Routes**: API endpoint definitions
- **Schemas**: Data validation models
- **Services**: Business logic
- **Core**: Reusable components
- **Workers**: Background processing
### 2. Better Scalability
- Easy to add new API versions (`v2/`)
- Clear namespace for each domain
- Reduced file size (no 800+ line files)
### 3. Improved Maintainability
- Find files by function, not by prefix
- Each module has single responsibility
- Easier to write focused tests
### 4. Standard Python Patterns
- Package-based organization
- Follows FastAPI best practices
- Similar to Django/Flask structures
## Implementation Steps
### Phase 1: Create New Structure (No Breaking Changes)
1. Create new directories: `api/`, `schemas/`, `services/`, `core/`, `workers/`
2. Copy files to new locations (don't delete originals yet; a helper sketch follows these steps)
3. Update imports in new files
4. Add `__init__.py` with proper exports
### Phase 2: Update Tests
5. Update test imports to use new structure
6. Run tests to verify nothing breaks
7. Fix any import issues
### Phase 3: Update Main App
8. Update `app.py` to import from new locations
9. Run full test suite
10. Verify all endpoints work
### Phase 4: Cleanup
11. Delete old files
12. Update documentation
13. Final test run
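A minimal sketch of how the Phase 1 copy step could be scripted, using paths from the mapping table above; this helper is illustrative only, and intermediate `__init__.py` files are left as a manual step:

```python
"""Hypothetical Phase 1 helper: copy files into the new layout, keeping the originals."""
from pathlib import Path
import shutil

WEB = Path("src/web")

# Subset of the Current -> New mapping table; extend with the remaining entries.
MOVES = {
    "admin_routes.py": "api/v1/admin/documents.py",
    "admin_schemas.py": "schemas/admin.py",
    "services.py": "services/inference.py",
    "async_queue.py": "workers/async_queue.py",
}

for old_name, new_rel in MOVES.items():
    target = WEB / new_rel
    target.parent.mkdir(parents=True, exist_ok=True)       # e.g. api/v1/admin/
    (target.parent / "__init__.py").touch(exist_ok=True)   # make the leaf directory a package
    shutil.copy2(WEB / old_name, target)                    # copy; originals stay in place
    print(f"copied {old_name} -> {new_rel}")
```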
## Migration Priority
**High Priority** (Most used):
- Routes and schemas (user-facing APIs)
- Services (core business logic)
**Medium Priority**:
- Core components (auth, rate limiter)
- Workers (background tasks)
**Low Priority**:
- Config and dependencies (already well-located)
## Backwards Compatibility
During migration, maintain backwards compatibility:
```python
# src/web/__init__.py
# Old imports still work
from src.web.api.v1.admin.documents import router as admin_router
from src.web.schemas.admin import AdminDocument
# Keep old names for compatibility (temporary)
admin_routes = admin_router # Deprecated alias
```
## Testing Strategy
1. **Unit Tests**: Test each module independently
2. **Integration Tests**: Test API endpoints still work
3. **Import Tests**: Verify all old imports still work (a sketch follows this list)
4. **Coverage**: Maintain current 23% coverage minimum
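For the import tests, one option is a small equivalence check; a minimal sketch, assuming the compatibility aliases from the snippet above (and the module-level `router` they reference) are in place:

```python
# tests/web/test_import_compat.py -- hypothetical test; depends on the temporary aliases.
def test_deprecated_admin_alias_points_at_new_router():
    from src.web import admin_routes                    # deprecated alias (old name)
    from src.web.api.v1.admin.documents import router   # new canonical location

    assert admin_routes is router
```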
## Rollback Plan
If issues arise:
1. Keep old files until fully migrated
2. Git allows easy revert
3. Tests catch breaking changes early
---
## Next Steps
Would you like me to:
1. **Start Phase 1**: Create new directory structure and move files?
2. **Create migration script**: Automate the file moves and import updates?
3. **Focus on specific area**: Start with admin API or async API first?
@@ -0,0 +1,218 @@
# Web Directory Refactoring - Current Status
## ✅ Completed Steps
### 1. Directory Structure Created
```
src/web/
├── api/
│ ├── v1/
│ │ ├── admin/ (documents.py, annotations.py, training.py)
│ │ ├── async_api/ (routes.py)
│ │ ├── batch/ (routes.py)
│ │ └── routes.py (public inference API)
├── schemas/
│ ├── admin.py (admin schemas)
│ ├── inference.py (inference + async schemas)
│ └── common.py (ErrorResponse)
├── services/
│ ├── autolabel.py
│ ├── async_processing.py
│ ├── batch_upload.py
│ └── inference.py
├── core/
│ ├── auth.py
│ ├── rate_limiter.py
│ └── scheduler.py
└── workers/
├── async_queue.py
└── batch_queue.py
```
### 2. Files Copied and Imports Updated
#### Admin API (✅ Complete)
- [x] `admin_routes.py``api/v1/admin/documents.py` (imports updated)
- [x] `admin_annotation_routes.py``api/v1/admin/annotations.py` (imports updated)
- [x] `admin_training_routes.py``api/v1/admin/training.py` (imports updated)
- [x] `api/v1/admin/__init__.py` created with exports
#### Public & Async API (✅ Complete)
- [x] `routes.py``api/v1/routes.py` (imports updated)
- [x] `async_routes.py``api/v1/async_api/routes.py` (imports updated)
- [x] `batch_upload_routes.py``api/v1/batch/routes.py` (copied, imports pending)
#### Schemas (✅ Complete)
- [x] `admin_schemas.py``schemas/admin.py`
- [x] `schemas.py``schemas/inference.py`
- [x] `schemas/common.py` created
- [x] `schemas/__init__.py` created with exports
#### Services (✅ Complete)
- [x] `admin_autolabel.py``services/autolabel.py`
- [x] `async_service.py``services/async_processing.py`
- [x] `batch_upload_service.py``services/batch_upload.py`
- [x] `services.py``services/inference.py`
- [x] `services/__init__.py` created
#### Core Components (✅ Complete)
- [x] `admin_auth.py``core/auth.py`
- [x] `rate_limiter.py``core/rate_limiter.py`
- [x] `admin_scheduler.py``core/scheduler.py`
- [x] `core/__init__.py` created
#### Workers (✅ Complete)
- [x] `async_queue.py``workers/async_queue.py`
- [x] `batch_queue.py``workers/batch_queue.py`
- [x] `workers/__init__.py` created
#### Main App (✅ Complete)
- [x] `app.py` imports updated to use new structure
---
## ⏳ Remaining Work
### 1. Update Remaining File Imports (HIGH PRIORITY)
Files that need import updates:
- [ ] `api/v1/batch/routes.py` - update to use new schema/service imports
- [ ] `services/autolabel.py` - may need import updates if it references old paths
- [ ] `services/async_processing.py` - check for old import references
- [ ] `services/batch_upload.py` - check for old import references
- [ ] `services/inference.py` - check for old import references
### 2. Update ALL Test Files (CRITICAL)
Test files need to import from new locations. Pattern:
**Old:**
```python
from src.web.admin_routes import create_admin_router
from src.web.admin_schemas import DocumentItem
from src.web.admin_auth import validate_admin_token
```
**New:**
```python
from src.web.api.v1.admin import create_admin_router
from src.web.schemas.admin import DocumentItem
from src.web.core.auth import validate_admin_token
```
Test files to update:
- [ ] `tests/web/test_admin_annotations.py`
- [ ] `tests/web/test_admin_auth.py`
- [ ] `tests/web/test_admin_routes.py`
- [ ] `tests/web/test_admin_routes_enhanced.py`
- [ ] `tests/web/test_admin_training.py`
- [ ] `tests/web/test_annotation_locks.py`
- [ ] `tests/web/test_annotation_phase5.py`
- [ ] `tests/web/test_async_queue.py`
- [ ] `tests/web/test_async_routes.py`
- [ ] `tests/web/test_async_service.py`
- [ ] `tests/web/test_autolabel_with_locks.py`
- [ ] `tests/web/test_batch_queue.py`
- [ ] `tests/web/test_batch_upload_routes.py`
- [ ] `tests/web/test_batch_upload_service.py`
- [ ] `tests/web/test_rate_limiter.py`
- [ ] `tests/web/test_training_phase4.py`
### 3. Create Backward Compatibility Layer (OPTIONAL)
Keep old imports working temporarily:
```python
# src/web/admin_routes.py (temporary compatibility shim)
\"\"\"
DEPRECATED: Use src.web.api.v1.admin.documents instead.
This file will be removed in next version.
\"\"\"
import warnings
from src.web.api.v1.admin.documents import *
warnings.warn(
"Importing from src.web.admin_routes is deprecated. "
"Use src.web.api.v1.admin.documents instead.",
DeprecationWarning,
stacklevel=2
)
```
### 4. Verify and Test
1. Run tests:
```bash
pytest tests/web/ -v
```
2. Check for any import errors:
```bash
python -c "from src.web.app import create_app; create_app()"
```
3. Start server and test endpoints:
```bash
python run_server.py
```
### 5. Clean Up Old Files (ONLY AFTER TESTS PASS)
Old files to remove:
- `src/web/admin_*.py` (7 files)
- `src/web/async_*.py` (3 files)
- `src/web/batch_*.py` (3 files)
- `src/web/routes.py`
- `src/web/services.py`
- `src/web/schemas.py`
- `src/web/rate_limiter.py`
Keep these files (don't remove):
- `src/web/__init__.py`
- `src/web/app.py`
- `src/web/config.py`
- `src/web/dependencies.py`
---
## 🎯 Next Immediate Steps
1. **Update batch/routes.py imports** - Quick fix for remaining API route
2. **Update test file imports** - Critical for verification
3. **Run test suite** - Verify nothing broke
4. **Fix any import errors** - Address failures
5. **Remove old files** - Clean up after tests pass
---
## 📊 Migration Impact Summary
| Category | Files Moved | Imports Updated | Status |
|----------|-------------|-----------------|--------|
| API Routes | 7 | 5/7 | 🟡 In Progress |
| Schemas | 3 | 3/3 | ✅ Complete |
| Services | 4 | 0/4 | ⚠️ Pending |
| Core | 3 | 3/3 | ✅ Complete |
| Workers | 2 | 2/2 | ✅ Complete |
| Tests | 0 | 0/16 | ❌ Not Started |
**Overall Progress: 65%**
---
## 🚀 Benefits After Migration
1. **Better Organization**: Clear separation by function
2. **Easier Navigation**: Find files by purpose, not prefix
3. **Scalability**: Easy to add new API versions
4. **Standard Structure**: Follows FastAPI best practices
5. **Maintainability**: Each module has single responsibility
---
## 📝 Notes
- All original files are still in place (no data loss risk)
- New structure is operational but needs import updates
- Backward compatibility can be added if needed
- Tests will validate the migration success