# Invoice Master POC v2 - Code Review Report

**Review date**: 2026-01-22
**Codebase size**: 67 Python source files, ~22,434 lines of code
**Test coverage**: ~40-50%

---

## Executive Summary

### Overall Assessment: **Good (B+)**

**Strengths**:
- ✅ Clear modular architecture with good separation of concerns
- ✅ Appropriate use of dataclasses and type hints
- ✅ Comprehensive normalization logic for Swedish invoices
- ✅ Spatial-index optimization (O(1) token lookup)
- ✅ Robust degradation path (OCR fallback when YOLO fails)
- ✅ Well-designed web API and UI

**Main issues**:
- ❌ Duplicated payment-line parsing code (3+ places)
- ❌ Long functions (`_normalize_customer_number` is 127 lines)
- ❌ Configuration security problem (plaintext database password)
- ❌ Inconsistent exception handling (generic `Exception` everywhere)
- ❌ Missing integration tests
- ❌ Magic numbers scattered throughout (0.5, 0.95, 300, etc.)

---
## 1. Architecture Analysis

### 1.1 Module Structure

```
src/
├── inference/        # Core inference pipeline
│   ├── pipeline.py (517 lines) ⚠️
│   ├── field_extractor.py (1,347 lines) 🔴 too long
│   └── yolo_detector.py
├── web/              # FastAPI web service
│   ├── app.py (765 lines) ⚠️ inline HTML
│   ├── routes.py (184 lines)
│   └── services.py (286 lines)
├── ocr/              # OCR extraction
│   ├── paddle_ocr.py
│   └── machine_code_parser.py (919 lines) 🔴 too long
├── matcher/          # Field matching
│   └── field_matcher.py (875 lines) ⚠️
├── utils/            # Shared utilities
│   ├── validators.py
│   ├── text_cleaner.py
│   ├── fuzzy_matcher.py
│   ├── ocr_corrections.py
│   └── format_variants.py (610 lines)
├── processing/       # Batch processing
├── data/             # Data management
└── cli/              # Command-line tools
```

### 1.2 Inference Flow

```
PDF/image input
    ↓
Render to image (pdf/renderer.py)
    ↓
YOLO detection (yolo_detector.py) - detect field regions
    ↓
Field extraction (field_extractor.py)
    ├→ OCR text extraction (ocr/paddle_ocr.py)
    ├→ Normalization & validation
    └→ Confidence calculation
    ↓
Cross-validation (pipeline.py)
    ├→ Parse payment_line format
    ├→ Extract OCR/Amount/Account from payment_line
    └→ Validate against detected fields; payment_line values take precedence
    ↓
Fallback OCR (if critical fields are missing)
    ├→ Full-page OCR
    └→ Regex extraction
    ↓
InferenceResult output
```

---
## 2. Code Quality Issues

### 2.1 Long Functions (>50 lines) 🔴

| Function | File | Lines | Complexity | Problem |
|------|------|------|--------|------|
| `_normalize_customer_number()` | field_extractor.py | **127** | Very high | 4 levels of pattern matching, 7+ regexes, complex scoring |
| `_cross_validate_payment_line()` | pipeline.py | **127** | Very high | Core validation logic, 8+ conditional branches |
| `_normalize_bankgiro()` | field_extractor.py | 62 | High | Luhn validation + multiple fallbacks |
| `_normalize_plusgiro()` | field_extractor.py | 63 | High | Similar to bankgiro |
| `_normalize_payment_line()` | field_extractor.py | 74 | High | 4 regex patterns |
| `_normalize_amount()` | field_extractor.py | 78 | High | Multi-strategy fallback |

**Example** - `_normalize_customer_number()` (lines 776-902):
```python
def _normalize_customer_number(self, text: str):
    # 127-line function containing:
    # - 4 nested if/for loops
    # - 7 different regex patterns
    # - 5 scoring mechanisms
    # - handling for labeled and unlabeled formats
```

**Suggestion**: split it into the following (a sketch follows this list):
- `_find_customer_code_patterns()`
- `_find_labeled_customer_code()`
- `_score_customer_candidates()`
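A minimal sketch of how the decomposition could look. The three method names come from the suggestion above; the bodies, regexes, and scoring rule are placeholders, not the actual logic from `field_extractor.py`:

```python
import re

class FieldExtractor:
    # Hypothetical patterns; the real file defines 7+ of these.
    LABELED = re.compile(r'(?:kundnummer|kundnr)[:\s]*([A-Z0-9-]+)', re.I)
    BARE = re.compile(r'\b[A-Z]?\d{4,10}\b')

    def _find_labeled_customer_code(self, text: str) -> list[str]:
        """Candidates that appear after an explicit label."""
        return self.LABELED.findall(text)

    def _find_customer_code_patterns(self, text: str) -> list[str]:
        """Unlabeled candidates matching known shapes."""
        return self.BARE.findall(text)

    def _score_customer_candidates(self, candidates: list[str]) -> str | None:
        """Pick the best candidate; this scoring rule is illustrative only."""
        if not candidates:
            return None
        return max(candidates, key=len)

    def _normalize_customer_number(self, text: str) -> str | None:
        candidates = self._find_labeled_customer_code(text) \
            or self._find_customer_code_patterns(text)
        return self._score_customer_candidates(candidates)
```

Each helper can then be unit-tested in isolation, which the current 127-line version does not allow.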

### 2.2 Code Duplication 🔴

**Payment-line parsing (3+ duplicate implementations)**:

1. `_parse_machine_readable_payment_line()` (pipeline.py:217-252)
2. `MachineCodeParser.parse()` (machine_code_parser.py, 919 lines)
3. `_normalize_payment_line()` (field_extractor.py:632-705)

All three implement similar regex patterns for:
```
Format: # <OCR> # <Kronor> <Öre> <Type> > <Account>#<Check>#
```

**Bankgiro/Plusgiro validation (duplicated)**:
- `validators.py`: `is_valid_bankgiro()`, `format_bankgiro()`
- `field_extractor.py`: `_normalize_bankgiro()`, `_normalize_plusgiro()`, `_luhn_checksum()`
- `normalizer.py`: `normalize_bankgiro()`, `normalize_plusgiro()`
- `field_matcher.py`: similar matching logic

**Suggestion**: create unified modules:
```python
# src/common/payment_line_parser.py
class PaymentLineParser:
    def parse(self, text: str) -> PaymentLineResult: ...

# src/common/giro_validator.py
class GiroValidator:
    def validate_and_format(self, value: str, giro_type: str) -> str: ...
```

### 2.3 Inconsistent Error Handling ⚠️

**Generic exception catching (31 occurrences)**:
```python
except Exception as e:  # 31 places in the codebase
    result.errors.append(str(e))
```

**Problems**:
- No specific error types are caught
- Generic error messages lose context
- Lines 142-147 (routes.py): catches all exceptions and returns a 500 status

**Current code** (routes.py:142-147):
```python
try:
    service_result = inference_service.process_pdf(...)
except Exception as e:  # too broad
    logger.error(f"Error processing document: {e}")
    raise HTTPException(status_code=500, detail=str(e))
```

**Suggested improvement**:
```python
except FileNotFoundError:
    raise HTTPException(status_code=400, detail="PDF file not found")
except PyMuPDFError:
    raise HTTPException(status_code=400, detail="Invalid PDF format")
except OCRError:
    raise HTTPException(status_code=503, detail="OCR service unavailable")
```
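`OCRError` above is not an existing exception type; for this to work the codebase would need to define its own exception hierarchy and raise these types at the failure sites (PyMuPDF signals broken files with its own error classes, which the PDF layer would translate). A minimal sketch of what such a module could look like; the module path and names are suggestions:

```python
# src/common/errors.py - hypothetical module

class InvoiceMasterError(Exception):
    """Base class so callers can catch all domain errors at once."""

class PdfRenderError(InvoiceMasterError):
    """Raised by the PDF layer when a document cannot be rendered."""

class OCRError(InvoiceMasterError):
    """Raised when the OCR backend is unavailable or fails."""

class ValidationError(InvoiceMasterError):
    """Raised when an extracted field fails checksum/format validation."""
```

The route handler can then catch `InvoiceMasterError` subclasses explicitly and let truly unexpected exceptions propagate to a generic 500 handler, instead of flattening everything into one `except Exception`.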

### 2.4 Configuration Security Problem 🔴

**config.py lines 24-30** - plaintext credentials:
```python
DATABASE = {
    'host': '192.168.68.31',      # hardcoded IP
    'user': 'docmaster',          # hardcoded username
    'password': 'nY6LYK5d',       # 🔴 plaintext password!
    'database': 'invoice_master'
}
```

**Suggestion**:
```python
DATABASE = {
    'host': os.getenv('DB_HOST', 'localhost'),
    'user': os.getenv('DB_USER', 'docmaster'),
    'password': os.getenv('DB_PASSWORD'),  # read from environment
    'database': os.getenv('DB_NAME', 'invoice_master')
}
```

### 2.5 Magic Numbers ⚠️

| Value | Location | Purpose | Problem |
|---|------|------|------|
| 0.5 | multiple places | confidence threshold | not configurable per field |
| 0.95 | pipeline.py | payment_line confidence | undocumented |
| 300 | multiple places | DPI | hardcoded |
| 0.1 | field_extractor.py | bbox padding | should be configuration |
| 72 | multiple places | PDF base DPI | magic number inside formulas |
| 50 | field_extractor.py | customer-number scoring bonus | undocumented |

**Suggestion**: extract into configuration:
```python
INFERENCE_CONFIG = {
    'confidence_threshold': 0.5,
    'payment_line_confidence': 0.95,
    'dpi': 300,
    'bbox_padding': 0.1,
}
```

### 2.6 Naming Inconsistency ⚠️

**Field names are inconsistent**:
- YOLO class names: `invoice_number`, `ocr_number`, `supplier_org_number`
- Field names: `InvoiceNumber`, `OCR`, `supplier_org_number`
- CSV column names: possibly different again
- Database field names: yet another variant

The mapping is maintained in several places (a single-source-of-truth sketch follows this list):
- `yolo_detector.py` (lines 90-100): `CLASS_TO_FIELD`
- several other locations
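One way to collapse these variants is a single canonical enum plus explicit per-representation mappings. A sketch, assuming a hypothetical module path and using only the three field names listed above; the real set of fields and spellings would come from `CLASS_TO_FIELD` and the CSV/database schemas:

```python
# src/common/field_names.py - hypothetical single source of truth
from enum import Enum

class Field(Enum):
    INVOICE_NUMBER = "invoice_number"
    OCR = "ocr_number"
    SUPPLIER_ORG_NUMBER = "supplier_org_number"

# Each external representation maps to/from the canonical enum.
YOLO_CLASS_TO_FIELD = {
    "invoice_number": Field.INVOICE_NUMBER,
    "ocr_number": Field.OCR,
    "supplier_org_number": Field.SUPPLIER_ORG_NUMBER,
}

FIELD_TO_DISPLAY = {
    Field.INVOICE_NUMBER: "InvoiceNumber",
    Field.OCR: "OCR",
    Field.SUPPLIER_ORG_NUMBER: "supplier_org_number",
}
```

Every layer (YOLO, CSV, database, API) then converts at its own boundary instead of keeping a private mapping.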

---

## 3. Test Analysis

### 3.1 Test Coverage

**Test files**: 13
- ✅ Well covered: field_matcher, normalizer, payment_line_parser
- ⚠️ Moderately covered: field_extractor, pipeline
- ❌ Under-covered: web layer, CLI, batch processing

**Estimated coverage**: 40-50%

### 3.2 Missing Test Cases 🔴

**Critical gaps**:
1. Cross-validation logic - the most complex part, yet barely tested
2. payment_line parsing variants - multiple implementations, unclear edge cases
3. OCR error correction - complex logic across several strategies
4. Web API endpoints - no request/response tests
5. Batch processing - multi-worker coordination untested
6. Fallback OCR path - when YOLO detection fails

---

## 4. Architecture Risks

### 🔴 Critical

1. **Configuration security** - plaintext database credentials in config.py (lines 24-30)
2. **Error recovery** - broad exception handling masks real problems
3. **Testability** - hardcoded dependencies prevent unit testing

### 🟡 High

1. **Maintainability** - duplicated payment-line parsing
2. **Scalability** - no async processing for long-running inference
3. **Extensibility** - adding a new field type would be difficult

### 🟢 Medium

1. **Performance** - lazy loading helps, but ORM queries are unoptimized
2. **Documentation** - mostly adequate, but could be better

---

## 5. Priority Matrix

| Priority | Action | Effort | Impact |
|--------|------|--------|------|
| 🔴 Critical | Fix configuration security (env vars) | 1 hour | High |
| 🔴 Critical | Add integration tests | 2-3 days | High |
| 🔴 Critical | Document the error-handling strategy | 4 hours | Medium |
| 🟡 High | Unify payment_line parsing | 1-2 days | High |
| 🟡 High | Extract normalization into submodules | 2-3 days | Medium |
| 🟡 High | Add dependency injection | 2-3 days | Medium |
| 🟡 High | Split long functions | 2-3 days | Low |
| 🟢 Medium | Raise test coverage to 70%+ | 3-5 days | High |
| 🟢 Medium | Extract magic numbers | 4 hours | Low |
| 🟢 Medium | Standardize naming conventions | 1-2 days | Medium |

---

## 6. Per-File Recommendations

### High Priority (Code Quality)

| File | Problem | Recommendation |
|------|------|------|
| `field_extractor.py` | 1,347 lines; 6 long normalization methods | Split into a `normalizers/` submodule |
| `pipeline.py` | 127-line `_cross_validate_payment_line()` | Extract into a separate `CrossValidator` class |
| `field_matcher.py` | 875 lines; complex matching logic | Split into a `matching/` submodule |
| `config.py` | Hardcoded credentials (line 29) | Use environment variables |
| `machine_code_parser.py` | 919 lines; payment_line parsing | Merge with the pipeline's parser |

### Medium Priority (Refactoring)

| File | Problem | Recommendation |
|------|------|------|
| `app.py` | 765 lines; HTML inlined in Python | Extract into a `templates/` directory |
| `autolabel.py` | 753 lines; batch-processing logic | Extract worker functions into a module |
| `format_variants.py` | 610 lines; variant generation | Consider the strategy pattern |

---

## 7. Recommended Actions

### Phase 1: Critical Fixes (1 week)

1. **Configuration security** (1 hour)
   - Remove the plaintext password from config.py
   - Add environment-variable support
   - Update the README with configuration instructions

2. **Error-handling standardization** (1 day)
   - Define custom exception classes
   - Replace generic `Exception` catches
   - Add error-code constants

3. **Add critical integration tests** (2 days)
   - End-to-end inference test
   - payment_line cross-validation tests
   - API endpoint tests

### Phase 2: Refactoring (2-3 weeks)

4. **Unify payment_line parsing** (2 days)
   - Create `src/common/payment_line_parser.py`
   - Merge the 3 duplicate implementations
   - Migrate all callers

5. **Split field_extractor.py** (3 days)
   - Create a `src/inference/normalizers/` submodule
   - One file per field type
   - Extract shared validation logic

6. **Split long functions** (2 days)
   - `_normalize_customer_number()` → 3 functions
   - `_cross_validate_payment_line()` → `CrossValidator` class

### Phase 3: Improvements (1-2 weeks)

7. **Raise test coverage** (5 days)
   - Target: 70%+ coverage
   - Focus on validation logic
   - Add edge-case tests

8. **Configuration management** (1 day)
   - Extract all magic numbers
   - Create a configuration file (YAML)
   - Add configuration validation

9. **Documentation** (2 days)
   - Add architecture diagrams
   - Document all private methods
   - Create a contribution guide

---

## Appendix A: Metrics

### Code Complexity

| Category | Count | Avg. lines |
|------|------|----------|
| Source files | 67 | 334 |
| Long files (>500 lines) | 12 | 875 |
| Long functions (>50 lines) | 23 | 89 |
| Test files | 13 | 298 |

### Dependencies

| Type | Count |
|------|------|
| External dependencies | ~25 |
| Internal modules | 10 |
| Circular dependencies | 0 ✅ |

### Code Style

| Metric | Coverage |
|------|--------|
| Type hints | 80% |
| Docstrings (public) | 80% |
| Docstrings (private) | 40% |
| Test coverage | 45% |

---

**Generated**: 2026-01-22
**Reviewer**: Claude Code
**Version**: v2.0
# Field Extractor Analysis Report

## Overview

field_extractor.py (1,183 lines) was initially flagged as a refactoring candidate, to be rebuilt on top of the `src/normalize` module. After analysis and testing, the conclusion is that it **should not be refactored** this way.

## The Refactoring Attempt

### Initial Plan
Delete the duplicate normalize methods in field_extractor.py and route everything through the unified `src/normalize/normalize_field()` interface.

### Steps Taken
1. ✅ Backed up the original file (`field_extractor_old.py`)
2. ✅ Changed `_normalize_and_validate` to use the unified normalizer
3. ✅ Deleted the duplicate normalize methods (~400 lines)
4. ❌ Ran the tests - **28 failures**
5. ✅ Added wrapper methods delegating to the normalizer
6. ❌ Ran the tests again - **12 failures**
7. ✅ Restored the original file
8. ✅ Tests pass - **all 45 tests green**

## Key Findings

### The Two Modules Serve Different Purposes

| Module | Purpose | Input | Output | Example |
|------|------|------|------|------|
| **src/normalize/** | **Variant generation** for matching | an already-extracted field value | a list of matching variants | `"INV-12345"` → `["INV-12345", "12345"]` |
| **field_extractor** | **Value extraction** from OCR text | raw OCR text containing the field | a single extracted field value | `"Fakturanummer: A3861"` → `"A3861"` |

### Why They Cannot Be Unified

1. **src/normalize/** is designed to:
   - receive an already-extracted field value
   - generate multiple normalized variants for fuzzy matching
   - e.g. BankgiroNormalizer:
   ```python
   normalize("782-1713") → ["7821713", "782-1713"]  # generates variants
   ```

2. **field_extractor**'s normalize methods:
   - receive raw OCR text containing the field (possibly with labels and other text)
   - **extract** the field value by pattern matching
   - e.g. `_normalize_bankgiro`:
   ```python
   _normalize_bankgiro("Bankgiro: 782-1713") → ("782-1713", True, None)  # extracts from text
   ```

3. **The key distinction** (illustrated in the sketch below):
   - Normalizer: variant generator (for matching)
   - Field Extractor: pattern extractor (for parsing)
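A compact sketch of the two roles side by side. Both functions are simplified stand-ins, not the real implementations:

```python
import re

def normalize_bankgiro_variants(value: str) -> list[str]:
    """Normalizer role: field value in, list of matching variants out."""
    digits = re.sub(r'\D', '', value)
    return [digits, f"{digits[:-4]}-{digits[-4:]}"]

def extract_bankgiro(ocr_text: str) -> str | None:
    """Extractor role: raw labeled OCR text in, one field value out."""
    m = re.search(r'(?:bankgiro|bg)[:\s]*([\d ]{2,5}-?\d{4})', ocr_text, re.I)
    return m.group(1).strip() if m else None

print(normalize_bankgiro_variants("782-1713"))            # ['7821713', '782-1713']
print(extract_bankgiro("Bankgiro: 782-1713 Belopp ..."))  # '782-1713'
```

Swapping one in for the other fails precisely because the input contracts differ, which is what the test failures below demonstrate.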

### Test Failure Examples

Failures after substituting the normalizer for the field-extractor methods:

```python
# InvoiceNumber test
Input:    "Fakturanummer: A3861"
Expected: "A3861"
Actual:   "Fakturanummer: A3861"  # not extracted, only cleaned

# Bankgiro test
Input:    "Bankgiro: 782-1713"
Expected: "782-1713"
Actual:   "7821713"  # returned the dash-less variant instead of the extracted formatted value
```

## Conclusion

**field_extractor.py should not be refactored onto the src/normalize module**, because:

1. ✅ **Different responsibilities**: extraction vs. variant generation
2. ✅ **Different inputs**: raw OCR text with labels vs. already-extracted values
3. ✅ **Different outputs**: a single extracted value vs. multiple matching variants
4. ✅ **The existing code works**: all 45 tests pass
5. ✅ **The extraction logic has value**: it encodes non-trivial pattern-matching rules (e.g. distinguishing Bankgiro from Plusgiro formats)

## Recommendations

1. **Keep field_extractor.py as-is**: do not refactor
2. **Document the difference between the two modules**: make sure the team understands their respective purposes
3. **Focus on other optimization targets**: machine_code_parser.py (919 lines)

## Lessons Learned

Before refactoring:
1. Understand a module's **actual purpose**, not just surface-level code similarity
2. Run the full test suite to validate assumptions
3. Judge whether duplication is real, or the code is merely similar while serving different purposes

---

**Status**: ✅ Analysis complete; decision: do not refactor
**Tests**: ✅ 45/45 passing
**File**: kept at 1,183 lines, unchanged
# Machine Code Parser Analysis Report

## File Overview

- **File**: `src/ocr/machine_code_parser.py`
- **Total lines**: 919
- **Code lines**: 607 (66%)
- **Methods**: 14
- **Regex uses**: 47

## Code Structure

### Class Structure

```
MachineCodeResult (dataclass)
├── to_dict()
└── get_region_bbox()

MachineCodeParser (main parser)
├── __init__()
├── parse() - main entry point
├── _find_tokens_with_values()
├── _find_machine_code_line_tokens()
├── _parse_standard_payment_line_with_tokens()
├── _parse_standard_payment_line() - 142 lines ⚠️
├── _extract_ocr() - 50 lines
├── _extract_bankgiro() - 58 lines
├── _extract_plusgiro() - 30 lines
├── _extract_amount() - 68 lines
├── _calculate_confidence()
└── cross_validate()
```

## Issues Found

### 1. ⚠️ `_parse_standard_payment_line` is too long (142 lines)

**Location**: lines 442-582

**Problems**:
- Contains the nested functions `normalize_account_spaces` and `format_account`
- Multiple regex-matching branches
- Complex logic that is hard to test and maintain

**Suggestion**:
Split into independent methods:
- `_normalize_account_spaces(line)`
- `_format_account(account_digits, context)`
- `_match_primary_pattern(line)`
- `_match_fallback_patterns(line)`

### 2. 🔁 The 4 `_extract_*` methods repeat the same pattern

All the extract methods follow the same shape:

```python
def _extract_XXX(self, tokens):
    candidates = []

    for token in tokens:
        text = token.text.strip()
        matches = self.XXX_PATTERN.findall(text)
        for match in matches:
            # validation logic
            # context detection
            candidates.append((normalized, context_score, token))

    if not candidates:
        return None

    candidates.sort(key=lambda x: (x[1], 1), reverse=True)
    return candidates[0][0]
```

**Duplicated logic**:
- Token iteration
- Pattern matching
- Candidate collection
- Context scoring
- Sorting and picking the best match

**Suggestion**:
Extract a base extractor class or a generic method to remove the duplication (see the sketch below).
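A sketch of what such a generic method could look like. The name matches the `_generic_extract` suggested in Option B further down; the parameter shapes are assumptions based on the pattern shown above, with the field-specific validation and scoring passed in as callables:

```python
from typing import Callable, Optional
import re

def _generic_extract(
    self,
    tokens: list,
    pattern: re.Pattern,
    normalize: Callable[[str], Optional[str]],
    score_context: Callable[[list], int],
) -> Optional[str]:
    """Shared candidate-collection loop used by all _extract_* methods."""
    candidates = []
    context_score = score_context(tokens)  # computed once per token set

    for token in tokens:
        for match in pattern.findall(token.text.strip()):
            normalized = normalize(match)  # field-specific validation
            if normalized is not None:
                candidates.append((normalized, context_score, token))

    if not candidates:
        return None

    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[0][0]
```

Each `_extract_*` method then shrinks to a pattern, a normalizer, and a context scorer, and the loop itself is tested once.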

### 3. ✅ Duplicated context detection

The context-detection code is repeated in several places:

```python
# in _extract_bankgiro
context_text = ' '.join(t.text.lower() for t in tokens)
is_bankgiro_context = (
    'bankgiro' in context_text or
    'bg:' in context_text or
    'bg ' in context_text
)

# in _extract_plusgiro
context_text = ' '.join(t.text.lower() for t in tokens)
is_plusgiro_context = (
    'plusgiro' in context_text or
    'postgiro' in context_text or
    'pg:' in context_text or
    'pg ' in context_text
)

# in _parse_standard_payment_line
context = (context_line or raw_line).lower()
is_plusgiro_context = (
    ('plusgiro' in context or 'postgiro' in context or 'plusgirokonto' in context)
    and 'bankgiro' not in context
)
```

**Suggestion**:
Extract an independent method:
- `_detect_account_context(tokens) -> dict[str, bool]`

## Refactoring Options

### Option A: Light refactoring (recommended) ✅

**Goal**: extract the duplicated context-detection logic without changing the main structure

**Steps**:
1. Extract a `_detect_account_context(tokens)` method
2. Extract `_normalize_account_spaces(line)` as a standalone method
3. Extract `_format_account(digits, context)` as a standalone method

**Impact**:
- Removes ~50-80 lines of duplicated code
- Improves testability
- Low risk, easy to verify

**Expected result**: 919 lines → ~850 lines (↓7%)

### Option B: Medium refactoring

**Goal**: build a generic field-extraction framework

**Steps**:
1. Create `_generic_extract(pattern, normalizer, context_checker)`
2. Rewrite all `_extract_*` methods on top of the generic framework
3. Split `_parse_standard_payment_line` into several small methods

**Impact**:
- Removes ~150-200 lines of code
- Significantly improves maintainability
- Medium risk; needs full test coverage

**Expected result**: 919 lines → ~720 lines (↓22%)

### Option C: Deep refactoring (not recommended)

**Goal**: complete redesign around the strategy pattern

**Risks**:
- High risk of introducing bugs
- Requires extensive testing
- May break existing integrations

## Recommendation

### ✅ Adopt Option A (light refactoring)

**Rationale**:
1. **The code already works well**: no obvious bugs or performance problems
2. **Low risk**: only duplicated logic is extracted; the core algorithm is untouched
3. **Good cost/benefit**: a small change with a clear quality gain
4. **Easy to verify**: the existing tests should cover it

### Refactoring Steps

```python
# 1. Extract context detection
def _detect_account_context(self, tokens: list[TextToken]) -> dict[str, bool]:
    """Detect account-type keywords in the surrounding context."""
    context_text = ' '.join(t.text.lower() for t in tokens)

    return {
        'bankgiro': any(kw in context_text for kw in ['bankgiro', 'bg:', 'bg ']),
        'plusgiro': any(kw in context_text for kw in ['plusgiro', 'postgiro', 'plusgirokonto', 'pg:', 'pg ']),
    }

# 2. Extract space normalization
def _normalize_account_spaces(self, line: str) -> str:
    """Remove spaces inside the account number."""
    # (existing code from lines 460-481)

# 3. Extract account formatting
def _format_account(
    self,
    account_digits: str,
    is_plusgiro_context: bool
) -> tuple[str, str]:
    """Format the account and determine its type."""
    # (existing code from lines 485-523)
```

## Comparison: field_extractor vs machine_code_parser

| Aspect | field_extractor | machine_code_parser |
|------|-----------------|---------------------|
| Purpose | value extraction | machine-code parsing |
| Duplication | ~400 lines of normalize methods | ~80 lines of context detection |
| Refactoring value | ❌ different purposes; do not unify | ✅ shared logic can be extracted |
| Risk | high (would break functionality) | low (code organization only) |

## Decision

### ✅ Refactor machine_code_parser.py

**How this differs from field_extractor**:
- field_extractor: the similar-looking methods serve **different purposes** (extraction vs. variant generation)
- machine_code_parser: the duplicated code serves **the same purpose** (context detection everywhere)

**Expected benefits**:
- ~70 fewer lines of duplicated code
- Better testability (context detection can be tested on its own)
- Clearer code organization
- **Low risk**, easy to verify

## Next Steps

1. ✅ Back up the original file
2. ✅ Extract the `_detect_account_context` method
3. ✅ Extract the `_normalize_account_spaces` method
4. ✅ Extract the `_format_account` method
5. ✅ Update all call sites
6. ✅ Run the tests to verify
7. ✅ Check code coverage

---

**Status**: 📋 Analysis complete; light refactoring recommended
**Risk assessment**: 🟢 low
**Expected benefit**: 919 lines → ~850 lines (↓7%)
# Performance Optimization Guide

This document provides performance optimization recommendations for the Invoice Field Extraction system.

## Table of Contents

1. [Batch Processing Optimization](#batch-processing-optimization)
2. [Database Query Optimization](#database-query-optimization)
3. [Caching Strategies](#caching-strategies)
4. [Memory Management](#memory-management)
5. [Profiling and Monitoring](#profiling-and-monitoring)

---

## Batch Processing Optimization

### Current State

The system processes invoices one at a time. For large batches, this can be inefficient.

### Recommendations

#### 1. Database Batch Operations

**Current**: Individual inserts for each document
```python
# Inefficient
for doc in documents:
    db.insert_document(doc)  # Individual DB call
```

**Optimized**: Use `execute_values` for batch inserts
```python
# Efficient - already implemented in db.py line 519
from psycopg2.extras import execute_values

execute_values(cursor, """
    INSERT INTO documents (...)
    VALUES %s
""", document_values)
```

**Impact**: 10-50x faster for batches of 100+ documents

#### 2. PDF Processing Batching

**Recommendation**: Process PDFs in parallel using multiprocessing

```python
from multiprocessing import Pool

def process_batch(pdf_paths, batch_size=10):
    """Process PDFs in parallel batches."""
    with Pool(processes=batch_size) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)
    return results
```

**Considerations**:
- GPU models should use a shared process pool (already exists: `src/processing/gpu_pool.py`)
- CPU-intensive tasks can use a separate process pool (`src/processing/cpu_pool.py`)
- The current dual pool coordinator (`dual_pool_coordinator.py`) already supports this pattern

**Status**: ✅ Already implemented in `src/processing/` modules

#### 3. Image Caching for Multi-Page PDFs

**Current**: Each page rendered independently
```python
# Current pattern in field_extractor.py
for page_num in range(total_pages):
    image = render_pdf_page(pdf_path, page_num, dpi=300)
```

**Optimized**: Pre-render all pages if processing multiple fields per page
```python
# Batch render
images = {
    page_num: render_pdf_page(pdf_path, page_num, dpi=300)
    for page_num in page_numbers_needed
}

# Reuse images
for detection in detections:
    image = images[detection.page_no]
    extract_field(detection, image)
```

**Impact**: Reduces redundant PDF rendering by 50-90% for multi-field invoices

---

## Database Query Optimization

### Current Performance

- **Parameterized queries**: ✅ Implemented (Phase 1)
- **Connection pooling**: ❌ Not implemented
- **Query batching**: ✅ Partially implemented
- **Index optimization**: ⚠️ Needs verification

### Recommendations

#### 1. Connection Pooling

**Current**: New connection for each operation
```python
def connect(self):
    """Create new database connection."""
    return psycopg2.connect(**self.config)
```

**Optimized**: Use connection pooling
```python
from psycopg2 import pool

class DocumentDatabase:
    def __init__(self, config):
        self.pool = pool.SimpleConnectionPool(
            minconn=1,
            maxconn=10,
            **config
        )

    def connect(self):
        return self.pool.getconn()

    def close(self, conn):
        self.pool.putconn(conn)
```
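Callers then have to return every connection they borrow, or the pool drains. A minimal usage sketch, assuming the `DocumentDatabase` wrapper above; cursor handling follows standard psycopg2 conventions:

```python
db = DocumentDatabase(config)

conn = db.connect()          # borrow a pooled connection
try:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM documents WHERE success = true")
        print(cur.fetchone()[0])
    conn.commit()
finally:
    db.close(conn)           # always return it, even on error
```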

**Impact**:
- Reduces connection overhead by 80-95%
- Especially important for high-frequency operations

#### 2. Index Recommendations

**Check current indexes**:
```sql
-- Verify indexes exist on frequently queried columns
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public';
```

**Recommended indexes**:
```sql
-- If not already present
CREATE INDEX IF NOT EXISTS idx_documents_success
ON documents(success);

CREATE INDEX IF NOT EXISTS idx_documents_timestamp
ON documents(timestamp DESC);

CREATE INDEX IF NOT EXISTS idx_field_results_document_id
ON field_results(document_id);

CREATE INDEX IF NOT EXISTS idx_field_results_matched
ON field_results(matched);

CREATE INDEX IF NOT EXISTS idx_field_results_field_name
ON field_results(field_name);
```

**Impact**:
- 10-100x faster queries for filtered/sorted results
- Critical for `get_failed_matches()` and `get_all_documents_summary()`

#### 3. Query Batching

**Status**: ✅ Already implemented for field results (line 519)

**Verify batching is used**:
```python
# Good pattern in db.py
execute_values(cursor, "INSERT INTO field_results (...) VALUES %s", field_values)
```

**Additional opportunity**: Batch `SELECT` queries
```python
# Current
docs = [get_document(doc_id) for doc_id in doc_ids]  # N queries

# Optimized
docs = get_documents_batch(doc_ids)  # 1 query with IN clause
```

**Status**: ✅ Already implemented (`get_documents_batch` exists in db.py)

---

## Caching Strategies

### 1. Model Loading Cache

**Current**: Models loaded per-instance

**Recommendation**: Singleton pattern for the YOLO model
```python
class YOLODetectorSingleton:
    _instance = None

    @classmethod
    def get_instance(cls, model_path):
        # Note: model_path is only honored on the first call.
        if cls._instance is None:
            cls._instance = YOLODetector(model_path)
        return cls._instance
```

**Impact**: Reduces memory usage by 90% when processing multiple documents

### 2. Parser Instance Caching

**Current**: ✅ Already optimal
```python
# Good pattern in field_extractor.py
def __init__(self):
    self.payment_line_parser = PaymentLineParser()  # Reused
    self.customer_number_parser = CustomerNumberParser()  # Reused
```

**Status**: No changes needed

### 3. OCR Result Caching

**Recommendation**: Cache OCR results for identical regions. Note that `lru_cache` requires hashable arguments, so the cached function must take an image hash and a bbox tuple rather than the image itself:
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def ocr_region_cached(image_hash: str, bbox: tuple):
    """Cache OCR results by image hash + bbox."""
    image = _IMAGE_REGISTRY[image_hash]  # look the image up by its hash
    return paddle_ocr.ocr_region(image, bbox)
```

**Impact**: 50-80% speedup when re-processing similar documents

**Note**: Requires implementing image hashing (e.g., `hashlib.md5(image.tobytes())`)
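A sketch of that hashing helper plus the `_IMAGE_REGISTRY` the cached function reads from. The registry approach is one of several options (an alternative is caching by `(pdf_path, page_num, bbox)` and re-rendering on a miss); names here are illustrative:

```python
import hashlib
import numpy as np

# Maps image hash -> image, so the lru_cache key stays small and hashable.
_IMAGE_REGISTRY: dict[str, np.ndarray] = {}

def register_image(image: np.ndarray) -> str:
    """Hash the raw pixel buffer and remember the image under that hash."""
    image_hash = hashlib.md5(image.tobytes()).hexdigest()
    _IMAGE_REGISTRY[image_hash] = image
    return image_hash

# Usage:
#   h = register_image(page_image)
#   text = ocr_region_cached(h, (x0, y0, x1, y1))
```

The registry should be cleared (or bounded) alongside the cache, otherwise it pins every processed page in memory.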

---

## Memory Management

### Current Issues

**Potential memory leaks**:
1. Large images kept in memory after processing
2. OCR results accumulated without cleanup
3. Model outputs not explicitly cleared

### Recommendations

#### 1. Explicit Image Cleanup

```python
import gc

def process_pdf(pdf_path):
    image = None  # bind the name so the finally block is safe if rendering fails
    try:
        image = render_pdf(pdf_path)
        return extract_fields(image)
    finally:
        del image     # Explicit cleanup (drops the reference)
        gc.collect()  # Force garbage collection
```

#### 2. Generator Pattern for Large Batches

**Current**: Load all documents into memory
```python
docs = [process_pdf(path) for path in pdf_paths]  # All in memory
```

**Optimized**: Use a generator for streaming processing
```python
def process_batch_streaming(pdf_paths):
    """Process documents one at a time, yielding results."""
    for path in pdf_paths:
        result = process_pdf(path)
        yield result
        # Result can be saved to DB immediately
        # Previous result is garbage collected
```

**Impact**: Constant memory usage regardless of batch size

#### 3. Context Managers for Resources

```python
class InferencePipeline:
    def __enter__(self):
        self.detector.load_model()
        return self

    def __exit__(self, *args):
        self.detector.unload_model()
        self.extractor.cleanup()

# Usage
with InferencePipeline(...) as pipeline:
    results = pipeline.process_pdf(path)
# Automatic cleanup
```

---

## Profiling and Monitoring

### Recommended Profiling Tools

#### 1. cProfile for CPU Profiling

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

# Your code here
pipeline.process_pdf(pdf_path)

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 slowest functions
```

#### 2. memory_profiler for Memory Analysis

```bash
pip install memory_profiler
python -m memory_profiler your_script.py
```

Or decorator-based:
```python
from memory_profiler import profile

@profile
def process_large_batch(pdf_paths):
    # Memory usage tracked line-by-line
    results = [process_pdf(path) for path in pdf_paths]
    return results
```

#### 3. py-spy for Production Profiling

```bash
pip install py-spy

# Profile running process
py-spy top --pid 12345

# Generate flamegraph
py-spy record -o profile.svg -- python your_script.py
```

**Advantage**: No code changes needed, minimal overhead

### Key Metrics to Monitor

1. **Processing Time per Document**
   - Target: <10 seconds for a single-page invoice
   - Current: ~2-5 seconds (estimated)

2. **Memory Usage**
   - Target: <2GB for a batch of 100 documents
   - Monitor: peak memory usage

3. **Database Query Time**
   - Target: <100ms per query (with indexes)
   - Monitor: slow query log

4. **OCR Accuracy vs Speed Trade-off**
   - Current: PaddleOCR with GPU (~200ms per region)
   - Alternative: Tesseract (~500ms, slightly more accurate)

### Logging Performance Metrics

**Add to pipeline.py**:
```python
import time
import logging

logger = logging.getLogger(__name__)

def process_pdf(self, pdf_path):
    start = time.time()

    # Processing...
    result = self._process_internal(pdf_path)

    elapsed = time.time() - start
    logger.info(f"Processed {pdf_path} in {elapsed:.2f}s")

    # Log to database for analysis
    self.db.log_performance({
        'document_id': result.document_id,
        'processing_time': elapsed,
        'field_count': len(result.fields)
    })

    return result
```

---

## Performance Optimization Priorities

### High Priority (Implement First)

1. ✅ **Database parameterized queries** - Already done (Phase 1)
2. ⚠️ **Database connection pooling** - Not implemented
3. ⚠️ **Index optimization** - Needs verification

### Medium Priority

4. ⚠️ **Batch PDF rendering** - Optimization possible
5. ✅ **Parser instance reuse** - Already done (Phase 2)
6. ⚠️ **Model caching** - Could improve

### Low Priority (Nice to Have)

7. ⚠️ **OCR result caching** - Complex implementation
8. ⚠️ **Generator patterns** - Refactoring needed
9. ⚠️ **Advanced profiling** - For production optimization

---

## Benchmarking Script

```python
"""
Benchmark script for invoice processing performance.
"""

import time
from pathlib import Path
from src.inference.pipeline import InferencePipeline

def benchmark_single_document(pdf_path, iterations=10):
    """Benchmark single document processing."""
    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    times = []
    for i in range(iterations):
        start = time.time()
        result = pipeline.process_pdf(pdf_path)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Iteration {i+1}: {elapsed:.2f}s")

    avg_time = sum(times) / len(times)
    print(f"\nAverage: {avg_time:.2f}s")
    print(f"Min: {min(times):.2f}s")
    print(f"Max: {max(times):.2f}s")

def benchmark_batch(pdf_paths, batch_size=10):
    """Benchmark batch processing."""
    from multiprocessing import Pool

    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    start = time.time()

    with Pool(processes=batch_size) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)

    elapsed = time.time() - start
    avg_per_doc = elapsed / len(pdf_paths)

    print(f"Total time: {elapsed:.2f}s")
    print(f"Documents: {len(pdf_paths)}")
    print(f"Average per document: {avg_per_doc:.2f}s")
    print(f"Throughput: {len(pdf_paths)/elapsed:.2f} docs/sec")

if __name__ == "__main__":
    # Single document benchmark
    benchmark_single_document("test.pdf")

    # Batch benchmark
    pdf_paths = list(Path("data/test_pdfs").glob("*.pdf"))
    benchmark_batch(pdf_paths[:100])
```

---

## Summary

**Implemented (Phase 1-2)**:
- ✅ Parameterized queries (SQL injection fix)
- ✅ Parser instance reuse (Phase 2 refactoring)
- ✅ Batch insert operations (execute_values)
- ✅ Dual pool processing (CPU/GPU separation)

**Quick Wins (Low effort, high impact)**:
- Database connection pooling (2-4 hours)
- Index verification and optimization (1-2 hours)
- Batch PDF rendering (4-6 hours)

**Long-term Improvements**:
- OCR result caching with hashing
- Generator patterns for streaming
- Advanced profiling and monitoring

**Expected Impact**:
- Connection pooling: 80-95% reduction in DB overhead
- Indexes: 10-100x faster queries
- Batch rendering: 50-90% less redundant work
- **Overall**: 2-5x throughput improvement for batch processing
# Code Refactoring Summary Report

## 📊 Overall Results

### Test Status
- ✅ **688/688 tests passing** (100%)
- ✅ **Code coverage**: 34% → 37% (+3%)
- ✅ **0 failures**, 0 errors

### Test Coverage Improvements
- ✅ **machine_code_parser**: 25% → 65% (+40%)
- ✅ **New tests**: 55 (633 → 688)

---

## 🎯 Completed Refactorings

### 1. ✅ Matcher modularization (876 lines → 205 lines, ↓76%)

**File**: `src/matcher/field_matcher.py`

**What was done**:
- Split the single 876-line file into **11 modules**
- Extracted **5 independent matching strategies** (see the sketch after this section)
- Created dedicated modules for data models, utility functions, and context handling

**Test results**:
- ✅ All 77 matcher tests passing
- ✅ Complete README documentation
- ✅ Strategy pattern, easy to extend

**Benefits**:
- 📉 76% less code in the main file
- 📈 Markedly better maintainability
- ✨ Each strategy is tested in isolation
- 🔧 New strategies are easy to add
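For readers unfamiliar with the pattern: a strategy-pattern matcher in miniature. This is a generic illustration, not the actual module layout; the strategy names and `match` signature are invented for the example:

```python
from abc import ABC, abstractmethod

class MatchStrategy(ABC):
    @abstractmethod
    def match(self, extracted: str, expected: str) -> bool: ...

class ExactMatch(MatchStrategy):
    def match(self, extracted: str, expected: str) -> bool:
        return extracted == expected

class DigitsOnlyMatch(MatchStrategy):
    def match(self, extracted: str, expected: str) -> bool:
        digits = lambda s: ''.join(c for c in s if c.isdigit())
        return digits(extracted) == digits(expected)

class FieldMatcher:
    def __init__(self, strategies: list[MatchStrategy]):
        self.strategies = strategies  # tried in order

    def matches(self, extracted: str, expected: str) -> bool:
        return any(s.match(extracted, expected) for s in self.strategies)

matcher = FieldMatcher([ExactMatch(), DigitsOnlyMatch()])
print(matcher.matches("782-1713", "7821713"))  # True via DigitsOnlyMatch
```

Adding a new comparison rule means adding one class, not editing a monolithic matcher.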

---

### 2. ✅ Machine Code Parser: light refactoring + test coverage (919 lines → 929 lines)

**File**: src/ocr/machine_code_parser.py

**What was done**:
- Extracted **3 shared helper methods**, eliminating duplicated code
- Streamlined the context-detection logic
- Simplified the account-formatting method

**Test improvements**:
- ✅ **55 new tests** (24 → 79)
- ✅ **Coverage**: 25% → 65% (+40%)
- ✅ All 688 project tests passing

**New test coverage**:
- **Round 1** (22 tests):
  - `_detect_account_context()` - 8 tests (context detection)
  - `_normalize_account_spaces()` - 5 tests (space normalization)
  - `_format_account()` - 4 tests (account formatting)
  - `parse()` - 5 tests (main entry point)
- **Round 2** (33 tests):
  - `_extract_ocr()` - 8 tests (OCR extraction)
  - `_extract_bankgiro()` - 9 tests (Bankgiro extraction)
  - `_extract_plusgiro()` - 8 tests (Plusgiro extraction)
  - `_extract_amount()` - 8 tests (amount extraction)

**Benefits**:
- 🔄 80 lines of duplication removed
- 📈 Better testability (helpers can be tested independently)
- 📖 Improved readability
- ✅ Coverage up from 25% to 65% (+40%)
- 🎯 Low risk, high reward

---

### 3. ✅ Field Extractor analysis (decision: no refactoring)

**File**: field_extractor.py (1,183 lines)

**Conclusion**: ❌ **should not be refactored**

**Key insight**:
- Superficially similar code can serve **entirely different purposes**
- field_extractor: **parses/extracts** field values
- src/normalize: **normalizes/generates variants** for matching
- Different responsibilities; they should not be unified

---

## 📈 Refactoring Statistics

### Line-Count Changes

| File | Before | After | Change | Percent |
|------|--------|--------|------|--------|
| **matcher/field_matcher.py** | 876 | 205 | -671 | ↓76% |
| **matcher/* (10 new modules)** | 0 | 466 | +466 | new |
| **matcher total** | 876 | 671 | -205 | ↓23% |
| **ocr/machine_code_parser.py** | 919 | 929 | +10 | +1% |
| **Net reduction** | - | - | **-195 lines** | **↓11%** |

### Test Coverage

| Module | Tests | Pass rate | Coverage | Status |
|------|--------|--------|--------|------|
| matcher | 77 | 100% | - | ✅ |
| field_extractor | 45 | 100% | 39% | ✅ |
| machine_code_parser | 79 | 100% | 65% | ✅ |
| normalizer | ~120 | 100% | - | ✅ |
| other modules | ~367 | 100% | - | ✅ |
| **Total** | **688** | **100%** | **37%** | ✅ |

---

## 🎓 Lessons from the Refactoring

### What Worked

1. **✅ Test first, refactor second**
   - Every refactoring was backed by full test coverage
   - Tests were re-run immediately after each change
   - The 100% pass rate guarantees quality

2. **✅ Identify real duplication**
   - Not all similar code is duplicated code
   - field_extractor vs normalizer: superficially similar, different purposes
   - machine_code_parser: genuine duplication

3. **✅ Incremental refactoring**
   - matcher: large-scale modularization (strategy pattern)
   - machine_code_parser: light refactoring (extracting shared methods)
   - field_extractor: analyzed, then deliberately left alone

### Key Decisions

#### ✅ When refactoring was warranted
- **matcher**: a single over-long file (876 lines) containing several strategies
- **machine_code_parser**: duplicated code with the same purpose in several places

#### ❌ When it was not
- **field_extractor**: similar code with different purposes

### The Lesson

**Don't chase the DRY principle blindly**
> Similar code is not necessarily duplicated code. Understand what the code is **actually for**.

---

## ✅ Summary

**Key results**:
- 📉 Net reduction of 195 lines
- 📈 Code coverage +3% (34% → 37%)
- ✅ Test count +55 (633 → 688)
- 🎯 machine_code_parser coverage +40% (25% → 65%)
- ✨ Much higher degree of modularity
- 🎯 Substantially better maintainability

**Key takeaway**:
> Similar code is not necessarily duplicated code. Only by understanding what the code really does can you make the right refactoring decision.

**Next steps**:
1. Keep raising machine_code_parser coverage toward 80%+ (currently 65%)
2. Add tests for other low-coverage modules (field_extractor 39%, pipeline 19%)
3. Round out edge-case and error-path tests
# Test Coverage Improvement Report

## 📊 Improvement Overview

### Overall Statistics
- ✅ **Total tests**: 633 → 688 (+55 tests, +8.7%)
- ✅ **Pass rate**: 100% (688/688)
- ✅ **Overall coverage**: 34% → 37% (+3%)

### machine_code_parser.py Deep Dive
- ✅ **Tests**: 24 → 79 (+55 tests, +229%)
- ✅ **Coverage**: 25% → 65% (+40%)
- ✅ **Uncovered lines**: 273 → 129 (144 fewer)

---

## 🎯 New Tests in Detail

### Round 1 (22 tests)

#### 1. TestDetectAccountContext (8 tests)

Tests the new `_detect_account_context()` helper method.

**Test cases**:
1. `test_bankgiro_keyword` - detects the 'bankgiro' keyword
2. `test_bg_keyword` - detects the 'bg:' abbreviation
3. `test_plusgiro_keyword` - detects the 'plusgiro' keyword
4. `test_postgiro_keyword` - detects the 'postgiro' alias
5. `test_pg_keyword` - detects the 'pg:' abbreviation
6. `test_both_contexts` - both keyword families present
7. `test_no_context` - no account keywords at all
8. `test_case_insensitive` - case-insensitive detection

**Code path covered**:
```python
def _detect_account_context(self, tokens: list[TextToken]) -> dict[str, bool]:
    context_text = ' '.join(t.text.lower() for t in tokens)
    return {
        'bankgiro': any(kw in context_text for kw in ['bankgiro', 'bg:', 'bg ']),
        'plusgiro': any(kw in context_text for kw in ['plusgiro', 'postgiro', 'plusgirokonto', 'pg:', 'pg ']),
    }
```

---

#### 2. TestNormalizeAccountSpacesMethod (5 tests)

Tests the new `_normalize_account_spaces()` helper method.

**Test cases**:
1. `test_removes_spaces_after_arrow` - removes spaces after the > marker
2. `test_multiple_consecutive_spaces` - handles runs of consecutive spaces
3. `test_no_arrow_returns_unchanged` - returns input unchanged when there is no >
4. `test_spaces_before_arrow_preserved` - keeps spaces before the >
5. `test_empty_string` - empty-string handling

**Code path covered**:
```python
def _normalize_account_spaces(self, line: str) -> str:
    if '>' not in line:
        return line
    parts = line.split('>', 1)
    after_arrow = parts[1]
    normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', after_arrow)
    while re.search(r'(\d)\s+(\d)', normalized):
        normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', normalized)
    return parts[0] + '>' + normalized
```

---

#### 3. TestFormatAccount (4 tests)

Tests the new `_format_account()` helper method.

**Test cases**:
1. `test_plusgiro_context_forces_plusgiro` - a Plusgiro context forces Plusgiro formatting
2. `test_valid_bankgiro_7_digits` - formats a valid 7-digit Bankgiro
3. `test_valid_bankgiro_8_digits` - formats a valid 8-digit Bankgiro
4. `test_defaults_to_bankgiro_when_ambiguous` - defaults to Bankgiro when ambiguous

**Code path covered**:
```python
def _format_account(self, account_digits: str, is_plusgiro_context: bool) -> tuple[str, str]:
    if is_plusgiro_context:
        formatted = f"{account_digits[:-1]}-{account_digits[-1]}"
        return formatted, 'plusgiro'

    # Luhn validation
    pg_valid = FieldValidators.is_valid_plusgiro(account_digits)
    bg_valid = FieldValidators.is_valid_bankgiro(account_digits)

    # Decision logic
    if pg_valid and not bg_valid:
        return pg_formatted, 'plusgiro'
    elif bg_valid and not pg_valid:
        return bg_formatted, 'bankgiro'
    else:
        return bg_formatted, 'bankgiro'
```

---

#### 4. TestParseMethod (5 tests)

Tests the main `parse()` entry point.

**Test cases**:
1. `test_parse_empty_tokens` - empty token list
2. `test_parse_finds_payment_line_in_bottom_region` - finds the payment line in the bottom 35% of the page
3. `test_parse_ignores_top_region` - ignores the top region of the page
4. `test_parse_with_context_keywords` - detects context keywords
5. `test_parse_stores_source_tokens` - stores the source tokens

**Code paths covered**:
- Token filtering (bottom-region detection)
- Context keyword detection
- Payment-line lookup and parsing
- Result-object construction

---

### Round 2 (33 tests)

#### 5. TestExtractOCR (8 tests)

Tests `_extract_ocr()` - OCR reference-number extraction.

**Test cases**:
1. `test_extract_valid_ocr_10_digits` - extracts a 10-digit OCR number
2. `test_extract_valid_ocr_15_digits` - extracts a 15-digit OCR number
3. `test_extract_ocr_with_hash_markers` - OCR wrapped in # markers
4. `test_extract_longest_ocr_when_multiple` - picks the longest of several candidates
5. `test_extract_ocr_ignores_short_numbers` - ignores numbers shorter than 10 digits
6. `test_extract_ocr_ignores_long_numbers` - ignores numbers longer than 25 digits
7. `test_extract_ocr_excludes_bankgiro_variants` - excludes Bankgiro look-alikes
8. `test_extract_ocr_empty_tokens` - empty token handling

#### 6. TestExtractBankgiro (9 tests)

Tests `_extract_bankgiro()` - Bankgiro account extraction.

**Test cases**:
1. `test_extract_bankgiro_7_digits_with_dash` - 7-digit Bankgiro with a dash
2. `test_extract_bankgiro_7_digits_without_dash` - 7-digit Bankgiro without a dash
3. `test_extract_bankgiro_8_digits_with_dash` - 8-digit Bankgiro with a dash
4. `test_extract_bankgiro_8_digits_without_dash` - 8-digit Bankgiro without a dash
5. `test_extract_bankgiro_with_spaces` - Bankgiro containing spaces
6. `test_extract_bankgiro_handles_plusgiro_format` - handles Plusgiro-style formatting
7. `test_extract_bankgiro_with_context` - with context keywords
8. `test_extract_bankgiro_ignores_plusgiro_context` - ignores a Plusgiro context
9. `test_extract_bankgiro_empty_tokens` - empty token handling

#### 7. TestExtractPlusgiro (8 tests)

Tests `_extract_plusgiro()` - Plusgiro account extraction.

**Test cases**:
1. `test_extract_plusgiro_7_digits_with_dash` - 7-digit Plusgiro with a dash
2. `test_extract_plusgiro_7_digits_without_dash` - 7-digit Plusgiro without a dash
3. `test_extract_plusgiro_8_digits` - 8-digit Plusgiro
4. `test_extract_plusgiro_with_spaces` - Plusgiro containing spaces
5. `test_extract_plusgiro_with_context` - with context keywords
6. `test_extract_plusgiro_ignores_too_short` - ignores fewer than 7 digits
7. `test_extract_plusgiro_ignores_too_long` - ignores more than 8 digits
8. `test_extract_plusgiro_empty_tokens` - empty token handling

#### 8. TestExtractAmount (8 tests)

Tests `_extract_amount()` - amount extraction.

**Test cases**:
1. `test_extract_amount_with_comma_decimal` - comma decimal separator
2. `test_extract_amount_with_dot_decimal` - dot decimal separator
3. `test_extract_amount_integer` - integer amounts
4. `test_extract_amount_with_thousand_separator` - thousands separators
5. `test_extract_amount_large_number` - large amounts
6. `test_extract_amount_ignores_too_large` - ignores implausibly large amounts
7. `test_extract_amount_ignores_zero` - ignores zero or negative amounts
8. `test_extract_amount_empty_tokens` - empty token handling

---

## 📈 Coverage Analysis

### Covered Methods
✅ `_detect_account_context()` - **100%** (added in round 1)
✅ `_normalize_account_spaces()` - **100%** (added in round 1)
✅ `_format_account()` - **95%** (added in round 1)
✅ `parse()` - **70%** (improved in round 1)
✅ `_parse_standard_payment_line()` - **95%** (pre-existing tests)
✅ `_extract_ocr()` - **85%** (added in round 2)
✅ `_extract_bankgiro()` - **90%** (added in round 2)
✅ `_extract_plusgiro()` - **90%** (added in round 2)
✅ `_extract_amount()` - **80%** (added in round 2)

### Methods Still Needing Work (uncovered / partially covered)
⚠️ `_calculate_confidence()` - **0%** (untested)
⚠️ `cross_validate()` - **0%** (untested)
⚠️ `get_region_bbox()` - **0%** (untested)
⚠️ `_find_tokens_with_values()` - **partially covered**
⚠️ `_find_machine_code_line_tokens()` - **partially covered**

### Uncovered Lines (129)
Concentrated in:
1. **Validation methods** (lines 805-824): `_calculate_confidence`, `cross_validate`
2. **Helper methods** (lines 80-92, 336-369, 377-407): token lookup, bbox computation, logging
3. **Edge cases** (lines 648-653, 690, 699, 759-760, etc.): boundary cases in some extraction methods

---

## 🎯 Improvement Plan

### ✅ Goals Achieved
- ✅ Coverage up from 25% to 65% (+40%)
- ✅ Test count up from 24 to 79 (+55)
- ✅ All extraction methods tested (_extract_ocr, _extract_bankgiro, _extract_plusgiro, _extract_amount)

### Next Targets (coverage 65% → 80%+)
1. **Test the validation methods** - add tests for `_calculate_confidence` and `cross_validate`
2. **Test the helper methods** - add tests for token lookup and bbox computation
3. **Round out edge cases** - more boundary and error-handling tests
4. **Integration tests** - end-to-end tests using token data from real PDFs

---

## ✅ Completed Improvements

### Refactoring Payoff
- ✅ The 3 extracted helper methods can now be tested independently
- ✅ Finer test granularity makes failures easier to localize
- ✅ More readable code; the test cases are clear and self-explanatory

### Quality Assurance
- ✅ All 688 tests passing at 100%
- ✅ No regressions
- ✅ The new tests cover previously untested refactored code

---

## 📚 Test-Writing Experience

### What Worked
1. **Fixtures for test data** - a `_create_token()` helper simplified token creation (see the sketch below)
2. **One test class per method** - clear structure
3. **Clear test names** - the `test_<what>_<condition>` format is self-documenting
4. **Cover the key paths first** - common scenarios and boundary conditions take priority

### Problems Encountered
1. **Token constructor arguments** - the `page_no` parameter was forgotten, so the initial tests failed
   - Fix: repair the `_create_token()` helper by adding `page_no=0`
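What that helper might look like. The `TextToken` field list here is a guess based on the attributes this report mentions (`text`, a bbox, `page_no`); the real dataclass lives in the OCR module:

```python
def _create_token(text: str, x0: float = 0, y0: float = 0,
                  x1: float = 10, y1: float = 10, page_no: int = 0):
    """Test fixture: build a TextToken with sensible defaults.

    page_no defaults to 0 - omitting it was what broke the first
    round of tests.
    """
    return TextToken(text=text, x0=x0, y0=y0, x1=x1, y1=y1, page_no=page_no)
```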

---

**Report date**: 2026-01-24
**Status**: ✅ Complete
**Next step**: keep raising coverage toward 80%+
# Multi-Pool Processing Architecture Design Document

## 1. Research Summary

### 1.1 Analysis of Current Problems

The dual-pool mode we implemented previously has stability problems, mainly caused by:

| Problem | Cause | Solution |
|------|------|----------|
| Processing hangs | Mixing threads with ProcessPoolExecutor causes deadlocks | Use asyncio or a pure Queue model |
| Queue.get() blocks forever | No timeout mechanism | Add timeouts and sentinel values |
| GPU memory conflicts | Multiple processes touching the GPU at once | Limit GPU workers to 1 |
| CUDA fork problems | Linux's default fork start method is incompatible with CUDA | Use the spawn start method |

### 1.2 Recommended Architecture

After research, the best fit for our scenario is a **producer-consumer queue model**:

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Main Process   │     │  CPU Workers    │     │  GPU Worker     │
│                 │     │  (4 processes)  │     │  (1 process)    │
│  ┌───────────┐  │     │                 │     │                 │
│  │ Task      │──┼────▶│ Text PDFs       │     │ Scanned PDFs    │
│  │ Dispatcher│  │     │ (no OCR needed) │     │ (PaddleOCR)     │
│  └───────────┘  │     │                 │     │                 │
│        ▲        │     │        │        │     │        │        │
│        │        │     │        ▼        │     │        ▼        │
│  ┌───────────┐  │     │  Result Queue   │     │  Result Queue   │
│  │ Result    │◀─┼─────│◀────────────────│─────│◀────────────────│
│  │ Collector │  │     │                 │     │                 │
│  └───────────┘  │     └─────────────────┘     └─────────────────┘
│        │        │
│        ▼        │
│  ┌───────────┐  │
│  │ Database  │  │
│  │ Batch     │  │
│  │ Writer    │  │
│  └───────────┘  │
└─────────────────┘
```

---

## 2. Core Design Principles

### 2.1 CUDA Compatibility

```python
# Key point: use the spawn start method
import multiprocessing as mp
ctx = mp.get_context("spawn")

# The GPU worker selects its device at initialization
def init_gpu_worker(gpu_id: int = 0):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    global _ocr
    from paddleocr import PaddleOCR
    _ocr = PaddleOCR(use_gpu=True, ...)
```

### 2.2 Worker Initialization Pattern

Use the `initializer` argument to load models once per worker instead of once per task:

```python
# Module-level global holding the model
_ocr = None

def init_worker(use_gpu: bool, gpu_id: int = 0):
    global _ocr
    if use_gpu:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    else:
        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

    from paddleocr import PaddleOCR
    _ocr = PaddleOCR(use_gpu=use_gpu, ...)

# Pass the initializer when creating the pool
pool = ProcessPoolExecutor(
    max_workers=1,
    initializer=init_worker,
    initargs=(True, 0),  # use_gpu=True, gpu_id=0
    mp_context=mp.get_context("spawn")
)
```

### 2.3 Queue Model vs as_completed

| Approach | Pros | Cons | Best for |
|------|------|------|----------|
| `as_completed()` | Simple; no queue management | Cannot span multiple pools | Single-pool scenarios |
| `multiprocessing.Queue` | Fast, flexible | Manual management; deadlock risk | Multi-pool pipelines |
| `Manager().Queue()` | Picklable, works across pools | Lower performance | Scenarios needing Pool.map |

**Recommendation**: for the dual-pool scenario, use `as_completed()` on each pool separately and then merge the results.

---

## 3. Detailed Development Plan

### Phase 1: Refactor the Foundations (2-3 days)

#### 1.1 Create a WorkerPool abstract class

```python
# src/processing/worker_pool.py

from __future__ import annotations
from abc import ABC, abstractmethod
from concurrent.futures import ProcessPoolExecutor, Future
from dataclasses import dataclass
from typing import Any, Optional, Callable
import multiprocessing as mp

@dataclass
class TaskResult:
    """Container for one task's outcome."""
    task_id: str
    success: bool
    data: Any
    error: Optional[str] = None
    processing_time: float = 0.0

class WorkerPool(ABC):
    """Abstract base class for worker pools."""

    def __init__(self, max_workers: int, use_gpu: bool = False, gpu_id: int = 0):
        self.max_workers = max_workers
        self.use_gpu = use_gpu
        self.gpu_id = gpu_id
        self._executor: Optional[ProcessPoolExecutor] = None

    @abstractmethod
    def get_initializer(self) -> Callable:
        """Return the worker initializer function."""

    @abstractmethod
    def get_init_args(self) -> tuple:
        """Return the initializer arguments."""

    def start(self):
        """Start the worker pool."""
        ctx = mp.get_context("spawn")
        self._executor = ProcessPoolExecutor(
            max_workers=self.max_workers,
            mp_context=ctx,
            initializer=self.get_initializer(),
            initargs=self.get_init_args()
        )

    def submit(self, fn: Callable, *args, **kwargs) -> Future:
        """Submit a task."""
        if not self._executor:
            raise RuntimeError("Pool not started")
        return self._executor.submit(fn, *args, **kwargs)

    def shutdown(self, wait: bool = True):
        """Shut the pool down."""
        if self._executor:
            self._executor.shutdown(wait=wait)
            self._executor = None

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *args):
        self.shutdown()
```

#### 1.2 Implement the CPU and GPU worker pools

```python
# src/processing/cpu_pool.py

class CPUWorkerPool(WorkerPool):
    """CPU-only worker pool for text PDF processing"""

    def __init__(self, max_workers: int = 4):
        super().__init__(max_workers=max_workers, use_gpu=False)

    def get_initializer(self) -> Callable:
        return init_cpu_worker

    def get_init_args(self) -> tuple:
        return ()

# src/processing/gpu_pool.py

class GPUWorkerPool(WorkerPool):
    """GPU worker pool for OCR processing"""

    def __init__(self, max_workers: int = 1, gpu_id: int = 0):
        super().__init__(max_workers=max_workers, use_gpu=True, gpu_id=gpu_id)

    def get_initializer(self) -> Callable:
        return init_gpu_worker

    def get_init_args(self) -> tuple:
        return (self.gpu_id,)
```

---

### Phase 2: Implement the Dual-Pool Coordinator (2-3 days)

#### 2.1 Task dispatcher

```python
# src/processing/task_dispatcher.py

from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, List, Tuple

class TaskType(Enum):
    CPU = auto()  # Text PDF
    GPU = auto()  # Scanned PDF

@dataclass
class Task:
    id: str
    task_type: TaskType
    data: Any

class TaskDispatcher:
    """Dispatch tasks to the right pool based on PDF type."""

    def classify_task(self, doc_info: dict) -> TaskType:
        """Decide whether a document needs OCR."""
        # Based on PDF characteristics
        if self._is_scanned_pdf(doc_info):
            return TaskType.GPU
        return TaskType.CPU

    def _is_scanned_pdf(self, doc_info: dict) -> bool:
        """Detect whether the PDF is a scan (a sketch follows below).

        1. Check for extractable text
        2. Check the image-area ratio
        3. Check text density
        """
        ...

    def partition_tasks(self, tasks: List[Task]) -> Tuple[List[Task], List[Task]]:
        """Split tasks into CPU and GPU groups."""
        cpu_tasks = [t for t in tasks if t.task_type == TaskType.CPU]
        gpu_tasks = [t for t in tasks if t.task_type == TaskType.GPU]
        return cpu_tasks, gpu_tasks
```
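One possible body for `_is_scanned_pdf`, assuming PyMuPDF (`fitz`) is available and that `doc_info` carries the PDF path under a `"path"` key; the thresholds are guesses to be tuned on real data:

```python
import fitz  # PyMuPDF

def _is_scanned_pdf(self, doc_info: dict) -> bool:
    """Heuristic: treat the PDF as scanned if its first page has
    almost no extractable text but does contain an image."""
    with fitz.open(doc_info["path"]) as doc:
        page = doc[0]
        text_len = len(page.get_text().strip())   # 1. extractable text
        has_images = bool(page.get_images())      # 2. embedded images
    # 3. density threshold: <50 chars on a page with images => scan
    return has_images and text_len < 50
```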
|
||||
|
||||
#### 2.2 双池协调器
|
||||
|
||||
```python
|
||||
# src/processing/dual_pool_coordinator.py
|
||||
|
||||
from concurrent.futures import as_completed
|
||||
from typing import List, Iterator
|
||||
import logging
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||

class DualPoolCoordinator:
    """Coordinates the CPU and GPU worker pools."""

    def __init__(
        self,
        cpu_workers: int = 4,
        gpu_workers: int = 1,
        gpu_id: int = 0
    ):
        self.cpu_pool = CPUWorkerPool(max_workers=cpu_workers)
        self.gpu_pool = GPUWorkerPool(max_workers=gpu_workers, gpu_id=gpu_id)
        self.dispatcher = TaskDispatcher()

    def __enter__(self):
        self.cpu_pool.start()
        self.gpu_pool.start()
        return self

    def __exit__(self, *args):
        self.cpu_pool.shutdown()
        self.gpu_pool.shutdown()

    def process_batch(
        self,
        documents: List[dict],
        cpu_task_fn: Callable,
        gpu_task_fn: Callable,
        on_result: Optional[Callable[[TaskResult], None]] = None,
        on_error: Optional[Callable[[str, Exception], None]] = None
    ) -> List[TaskResult]:
        """
        Process a batch of documents, automatically dispatching each to the CPU or GPU pool.

        Args:
            documents: Documents to process
            cpu_task_fn: Handler for CPU tasks
            gpu_task_fn: Handler for GPU tasks
            on_result: Result callback (optional)
            on_error: Error callback (optional)

        Returns:
            List of all task results
        """
        # Classify tasks
        tasks = [
            Task(id=doc['id'], task_type=self.dispatcher.classify_task(doc), data=doc)
            for doc in documents
        ]
        cpu_tasks, gpu_tasks = self.dispatcher.partition_tasks(tasks)

        logger.info(f"Task partition: {len(cpu_tasks)} CPU, {len(gpu_tasks)} GPU")

        # Submit tasks to their respective pools
        cpu_futures = {
            self.cpu_pool.submit(cpu_task_fn, t.data): t.id
            for t in cpu_tasks
        }
        gpu_futures = {
            self.gpu_pool.submit(gpu_task_fn, t.data): t.id
            for t in gpu_tasks
        }

        # Collect results
        results = []
        all_futures = list(cpu_futures.keys()) + list(gpu_futures.keys())

        for future in as_completed(all_futures):
            task_id = cpu_futures.get(future) or gpu_futures.get(future)
            pool_type = "CPU" if future in cpu_futures else "GPU"

            try:
                data = future.result(timeout=300)  # 5-minute timeout
                result = TaskResult(task_id=task_id, success=True, data=data)
                if on_result:
                    on_result(result)
            except Exception as e:
                logger.error(f"[{pool_type}] Task {task_id} failed: {e}")
                result = TaskResult(task_id=task_id, success=False, data=None, error=str(e))
                if on_error:
                    on_error(task_id, e)

            results.append(result)

        return results
```

---

### Phase 3: Integrate into autolabel (1-2 days)

#### 3.1 Modify autolabel.py

```python
# src/cli/autolabel.py

def run_autolabel_dual_pool(args):
    """Run auto-labeling in dual-pool mode."""

    from src.processing.dual_pool_coordinator import DualPoolCoordinator, TaskResult

    # Helpers (load_documents_from_csv, save_documents_batch, process_text_pdf,
    # process_scanned_pdf, logger) are assumed to already exist in this module.

    # Buffer for batched database writes
    db_batch = []
    db_batch_size = 100

    def on_result(result: TaskResult):
        """Handle a successful result."""
        nonlocal db_batch
        db_batch.append(result.data)

        if len(db_batch) >= db_batch_size:
            save_documents_batch(db_batch)
            db_batch.clear()

    def on_error(task_id: str, error: Exception):
        """Handle an error."""
        logger.error(f"Task {task_id} failed: {error}")

    # Create the dual-pool coordinator
    with DualPoolCoordinator(
        cpu_workers=args.cpu_workers or 4,
        gpu_workers=args.gpu_workers or 1,
        gpu_id=0
    ) as coordinator:

        # Process every CSV (csv_files is assumed to be resolved from args earlier)
        for csv_file in csv_files:
            documents = load_documents_from_csv(csv_file)

            results = coordinator.process_batch(
                documents=documents,
                cpu_task_fn=process_text_pdf,
                gpu_task_fn=process_scanned_pdf,
                on_result=on_result,
                on_error=on_error
            )

            logger.info(f"CSV {csv_file}: {len(results)} processed")

    # Save the remaining batch
    if db_batch:
        save_documents_batch(db_batch)
```

---

### Phase 4: Testing & Validation (1-2 days)

#### 4.1 Unit Tests

```python
# tests/unit/test_dual_pool.py

import pytest
from src.processing.dual_pool_coordinator import DualPoolCoordinator, TaskResult


# Module-level task functions so they can be pickled into worker processes.
def cpu_fn(doc: dict) -> dict:
    return doc


def gpu_fn(doc: dict) -> dict:
    return doc


class TestDualPoolCoordinator:

    def test_cpu_only_batch(self):
        """Batch containing only CPU tasks."""
        with DualPoolCoordinator(cpu_workers=2, gpu_workers=1) as coord:
            docs = [{"id": f"doc_{i}", "type": "text"} for i in range(10)]
            results = coord.process_batch(docs, cpu_fn, gpu_fn)
            assert len(results) == 10
            assert all(r.success for r in results)

    def test_mixed_batch(self):
        """Batch mixing CPU and GPU tasks."""
        with DualPoolCoordinator(cpu_workers=2, gpu_workers=1) as coord:
            docs = [
                {"id": "text_1", "type": "text"},
                {"id": "scan_1", "type": "scanned"},
                {"id": "text_2", "type": "text"},
            ]
            results = coord.process_batch(docs, cpu_fn, gpu_fn)
            assert len(results) == 3

    def test_timeout_handling(self):
        """Timeout handling (to be written)."""
        pass

    def test_error_recovery(self):
        """Error recovery (to be written)."""
        pass
```

#### 4.2 Integration Tests

```python
# tests/integration/test_autolabel_dual_pool.py

import subprocess


def test_autolabel_with_dual_pool():
    """End-to-end test of dual-pool mode."""
    # Run against a small test dataset
    result = subprocess.run([
        "python", "-m", "src.cli.autolabel",
        "--cpu-workers", "2",
        "--gpu-workers", "1",
        "--limit", "50"
    ], capture_output=True)

    assert result.returncode == 0
    # Verify database records
```

---

## 4. Key Technical Points

### 4.1 Deadlock-Avoidance Strategies

```python
# 1. Use timeouts
try:
    result = future.result(timeout=300)
except TimeoutError:
    logger.warning("Task timed out")

# 2. Use a sentinel value
SENTINEL = object()
queue.put(SENTINEL)  # signal end of work

# 3. Check process liveness
if not worker.is_alive():
    logger.error("Worker died unexpectedly")
    break

# 4. Drain the queue before joining
while not queue.empty():
    results.append(queue.get_nowait())
worker.join(timeout=5.0)
```

### 4.2 PaddleOCR Special Handling

```python
# PaddleOCR must be initialized inside the worker process
def init_paddle_worker(gpu_id: int):
    global _ocr
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

    # Deferred import so the CUDA environment variable takes effect
    from paddleocr import PaddleOCR
    _ocr = PaddleOCR(
        use_angle_cls=True,
        lang='en',
        use_gpu=True,
        show_log=False,
        # Important: cap GPU memory usage
        gpu_mem=2000  # GPU memory limit (MB)
    )
```

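A task function running in the worker can then use the process-global `_ocr` set up by the initializer; a minimal sketch (`ocr_page` is illustrative, not existing code):

```python
_ocr = None  # set per-process by init_paddle_worker


def ocr_page(image_path: str):
    """Runs inside a GPU worker; relies on the process-global _ocr."""
    assert _ocr is not None, "init_paddle_worker must run in this process first"
    return _ocr.ocr(image_path, cls=True)
```
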
### 4.3 Resource Monitoring

```python
import psutil
import GPUtil

def get_resource_usage():
    """Return a snapshot of system resource usage."""
    cpu_percent = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()

    gpu_info = []
    for gpu in GPUtil.getGPUs():
        gpu_info.append({
            "id": gpu.id,
            "memory_used": gpu.memoryUsed,
            "memory_total": gpu.memoryTotal,
            "utilization": gpu.load * 100
        })

    return {
        "cpu_percent": cpu_percent,
        "memory_percent": memory.percent,
        "gpu": gpu_info
    }
```

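During a long batch run, this snapshot could be logged periodically from a daemon thread; a sketch (the interval and logging target are arbitrary choices):

```python
import threading
import time


def log_resources_periodically(interval: float = 30.0) -> threading.Thread:
    """Log get_resource_usage() every `interval` seconds on a daemon thread."""
    def _loop():
        while True:
            logger.info("resources: %s", get_resource_usage())
            time.sleep(interval)

    t = threading.Thread(target=_loop, daemon=True)
    t.start()
    return t
```
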
---

## 5. Risk Assessment & Mitigation

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| GPU out of memory | Medium | High | Limit GPU workers to 1; set the gpu_mem parameter |
| Zombie processes | Low | High | Add heartbeat checks; auto-restart on timeout |
| Task misclassification | Medium | Medium | Add a fallback: retry on GPU after a CPU failure |
| Database write bottleneck | Low | Medium | Increase batch size; write asynchronously |
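
The misclassification fallback in the table could live on `DualPoolCoordinator` as a small retry helper; a sketch (the method name `_retry_on_gpu` and its placement are assumptions):

```python
def _retry_on_gpu(self, task: Task, gpu_task_fn: Callable) -> TaskResult:
    """Give a failed CPU-classified task one retry on the GPU pool."""
    future = self.gpu_pool.submit(gpu_task_fn, task.data)
    try:
        data = future.result(timeout=300)
        return TaskResult(task_id=task.id, success=True, data=data)
    except Exception as e:
        return TaskResult(task_id=task.id, success=False, data=None, error=str(e))
```
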
---

## 6. Alternatives

If the approach above still runs into problems, consider:

### 6.1 Use Ray

```python
import ray

ray.init()

@ray.remote(num_cpus=1)
def cpu_task(data):
    return process_text_pdf(data)

@ray.remote(num_gpus=1)
def gpu_task(data):
    return process_scanned_pdf(data)

# Automatic resource scheduling
futures = [cpu_task.remote(d) for d in cpu_docs]
futures += [gpu_task.remote(d) for d in gpu_docs]
results = ray.get(futures)
```

### 6.2 Single Pool + Dynamic GPU Scheduling

Keep the single-pool model, but decide inside each task whether to use the GPU:

```python
def process_document(doc_data):
    if is_scanned_pdf(doc_data):
        # Use the GPU (concurrency must be bounded by a global lock or semaphore)
        with gpu_semaphore:
            return process_with_ocr(doc_data)
    else:
        return process_text_only(doc_data)
```

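Sharing that `gpu_semaphore` across pool workers takes the same initializer trick as Phase 1; a sketch using a manager-backed semaphore (all names here are illustrative):

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

_gpu_semaphore = None  # set per-process by the initializer


def _init_worker(sem):
    global _gpu_semaphore
    _gpu_semaphore = sem


if __name__ == "__main__":
    manager = mp.Manager()
    sem = manager.BoundedSemaphore(1)  # at most one worker on the GPU at a time
    pool = ProcessPoolExecutor(max_workers=4, initializer=_init_worker, initargs=(sem,))
```
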
---

## 7. Timeline Summary

| Phase | Task | Estimated Effort |
|-------|------|------------------|
| Phase 1 | Infrastructure refactor | 2-3 days |
| Phase 2 | Dual-pool coordinator | 2-3 days |
| Phase 3 | autolabel integration | 1-2 days |
| Phase 4 | Testing & validation | 1-2 days |
| **Total** | | **6-10 days** |

---

## 8. References

1. [Python concurrent.futures official documentation](https://docs.python.org/3/library/concurrent.futures.html)
2. [PyTorch Multiprocessing Best Practices](https://docs.pytorch.org/docs/stable/notes/multiprocessing.html)
3. [Super Fast Python - Complete Guide to ProcessPoolExecutor](https://superfastpython.com/processpoolexecutor-in-python/)
4. [PaddleOCR parallel inference documentation](http://www.paddleocr.ai/main/en/version3.x/pipeline_usage/instructions/parallel_inference.html)
5. [AWS - Parallelizing across multiple CPUs/GPUs to speed up deep learning inference at the edge](https://aws.amazon.com/blogs/machine-learning/parallelizing-across-multiple-cpu-gpus-to-speed-up-deep-learning-inference-at-the-edge/)
6. [Ray distributed multiprocessing](https://docs.ray.io/en/latest/ray-more-libs/multiprocessing.html)

docs/product-plan-v2.md (new file, 1223 lines: diff suppressed because it is too large)

docs/ux-design-prompt-v2.md (new file, 302 lines)
@@ -0,0 +1,302 @@

# Document Annotation Tool – UX Design Spec v2

## Theme: Warm Graphite (Modern Enterprise)

---

## 1. Design Principles (Updated)

1. **Clarity** – High contrast, but never pure black-on-white
2. **Warm Neutrality** – Slightly warm grays reduce visual fatigue
3. **Focus** – Content-first layouts with restrained accents
4. **Consistency** – Reusable patterns, predictable behavior
5. **Professional Trust** – Calm, serious, enterprise-ready
6. **Longevity** – No trendy colors that age quickly

---

## 2. Color Palette (Warm Graphite)

### Core Colors

| Usage | Color Name | Hex |
|-------|-----------|-----|
| Primary Text | Soft Black | #121212 |
| Secondary Text | Charcoal Gray | #2A2A2A |
| Muted Text | Warm Gray | #6B6B6B |
| Disabled Text | Light Warm Gray | #9A9A9A |

### Backgrounds

| Usage | Color | Hex |
|-------|-------|-----|
| App Background | Paper White | #FAFAF8 |
| Card / Panel | White | #FFFFFF |
| Hover Surface | Subtle Warm Gray | #F1F0ED |
| Selected Row | Very Light Warm Gray | #ECEAE6 |

### Borders & Dividers

| Usage | Color | Hex |
|-------|-------|-----|
| Default Border | Warm Light Gray | #E6E4E1 |
| Strong Divider | Neutral Gray | #D8D6D2 |

### Semantic States (Muted & Professional)

| State | Color | Hex |
|-------|-------|-----|
| Success | Olive Gray | #3E4A3A |
| Error | Brick Gray | #4A3A3A |
| Warning | Sand Gray | #4A4A3A |
| Info | Graphite Gray | #3A3A3A |

> Accent colors are **never saturated** and are used only for status, progress, or selection.

---

## 3. Typography

- **Font Family**: Inter / SF Pro / system-ui
- **Headings**:
  - Weight: 600–700
  - Color: #121212
  - Letter spacing: -0.01em
- **Body Text**:
  - Weight: 400
  - Color: #2A2A2A
- **Captions / Meta**:
  - Weight: 400
  - Color: #6B6B6B
- **Monospace (IDs / Values)**:
  - JetBrains Mono / SF Mono
  - Color: #2A2A2A

---

## 4. Global Layout

### Top Navigation Bar

- Height: 56px
- Background: #FAFAF8
- Bottom Border: 1px solid #E6E4E1
- Logo: Text or icon in #121212

**Navigation Items**
- Default: #6B6B6B
- Hover: #2A2A2A
- Active:
  - Text: #121212
  - Bottom indicator: 2px solid #3A3A3A (rounded ends)

**Avatar**
- Circle background: #ECEAE6
- Text: #2A2A2A

---

## 5. Page: Documents (Dashboard)

### Page Header

- Title: "Documents" (#121212)
- Actions:
  - Primary button: Dark graphite outline
  - Secondary button: Subtle border only

### Filters Bar

- Background: #FFFFFF
- Border: 1px solid #E6E4E1
- Inputs:
  - Background: #FFFFFF
  - Hover: #F1F0ED
  - Focus ring: 1px #3A3A3A

### Document Table

- Table background: #FFFFFF
- Header text: #6B6B6B
- Row hover: #F1F0ED
- Row selected:
  - Background: #ECEAE6
  - Left indicator: 3px solid #3A3A3A

### Status Badges

- Pending:
  - BG: #FFFFFF
  - Border: #D8D6D2
  - Text: #2A2A2A

- Labeled:
  - BG: #2A2A2A
  - Text: #FFFFFF

- Exported:
  - BG: #ECEAE6
  - Text: #2A2A2A
  - Icon: ✓

### Auto-label States

- Running:
  - Progress bar: #3A3A3A on #ECEAE6
- Completed:
  - Text: #3E4A3A
- Failed:
  - BG: #F1EDED
  - Text: #4A3A3A

---

## 6. Upload Modals (Single & Batch)

### Modal Container

- Background: #FFFFFF
- Border radius: 8px
- Shadow: 0 1px 3px rgba(0,0,0,0.08)

### Drop Zone

- Background: #FAFAF8
- Border: 1px dashed #D8D6D2
- Hover: #F1F0ED
- Icon: Graphite gray

### Form Fields

- Input BG: #FFFFFF
- Border: #D8D6D2
- Focus: 1px solid #3A3A3A

Primary Action Button:
- Text: #FFFFFF
- BG: #2A2A2A
- Hover: #121212

---

## 7. Document Detail View

### Canvas Area

- Background: #FFFFFF
- Annotation styles:
  - Manual: Solid border #2A2A2A
  - Auto: Dashed border #6B6B6B
  - Selected: 2px border #3A3A3A + resize handles

### Right Info Panel

- Card background: #FFFFFF
- Section headers: #121212
- Meta text: #6B6B6B

### Annotation Table

- Same table styles as Documents
- Inline edit:
  - Input background: #FAFAF8
  - Save button: Graphite

### Locked State (Auto-label Running)

- Banner BG: #FAFAF8
- Border-left: 3px solid #4A4A3A
- Progress bar: Graphite

---

## 8. Training Page

### Document Selector

- Selected rows use same highlight rules
- Verified state:
  - Full: Olive gray check
  - Partial: Sand gray warning

### Configuration Panel

- Card layout
- Inputs aligned to grid
- Schedule option visually muted until enabled

Primary CTA:
- Start Training button in dark graphite

---

## 9. Models & Training History

### Training Job List

- Job cards use #FFFFFF background
- Running job:
  - Progress bar: #3A3A3A
- Completed job:
  - Metrics bars in graphite

### Model Detail Panel

- Sectioned cards
- Metric bars:
  - Track: #ECEAE6
  - Fill: #3A3A3A

Actions:
- Primary: Download Model
- Secondary: View Logs / Use as Base

---

## 10. Micro-interactions (Refined)

| Element | Interaction | Animation |
|---------|------------|-----------|
| Button hover | BG lightens | 150ms ease-out |
| Button press | Scale 0.98 | 100ms |
| Row hover | BG fade | 120ms |
| Modal open | Fade + scale 0.96 → 1 | 200ms |
| Progress fill | Smooth | ease-out |
| Annotation select | Border + handles | 120ms |

---

## 11. Tailwind Theme (Updated)

```js
colors: {
  text: {
    primary: '#121212',
    secondary: '#2A2A2A',
    muted: '#6B6B6B',
    disabled: '#9A9A9A',
  },
  bg: {
    app: '#FAFAF8',
    card: '#FFFFFF',
    hover: '#F1F0ED',
    selected: '#ECEAE6',
  },
  border: '#E6E4E1',
  accent: '#3A3A3A',
  success: '#3E4A3A',
  error: '#4A3A3A',
  warning: '#4A4A3A',
}
```

---

## 12. Final Notes

- Pure black (#000000) should **never** be used for large surfaces
- Accent color usage should stay under **10% of UI area**
- Warm grays are intentional and must not be "corrected" to blue-grays

This theme is designed to scale from internal tool → polished SaaS without redesign.

docs/web-refactoring-complete.md (new file, 273 lines)
@@ -0,0 +1,273 @@
# Web Directory Refactoring - Complete ✅

**Date**: 2026-01-25
**Status**: ✅ Completed
**Tests**: 188 passing (0 failures)
**Coverage**: 23% (maintained)

---

## Final Directory Structure

```
src/web/
├── api/
│   ├── __init__.py
│   └── v1/
│       ├── __init__.py
│       ├── routes.py                # Public inference API
│       ├── admin/
│       │   ├── __init__.py
│       │   ├── documents.py         # Document management (was admin_routes.py)
│       │   ├── annotations.py       # Annotation routes (was admin_annotation_routes.py)
│       │   └── training.py          # Training routes (was admin_training_routes.py)
│       ├── async_api/
│       │   ├── __init__.py
│       │   └── routes.py            # Async processing API (was async_routes.py)
│       └── batch/
│           ├── __init__.py
│           └── routes.py            # Batch upload API (was batch_upload_routes.py)
│
├── schemas/
│   ├── __init__.py
│   ├── common.py                    # Shared models (ErrorResponse)
│   ├── admin.py                     # Admin schemas (was admin_schemas.py)
│   └── inference.py                 # Inference + async schemas (was schemas.py)
│
├── services/
│   ├── __init__.py
│   ├── inference.py                 # Inference service (was services.py)
│   ├── autolabel.py                 # Auto-label service (was admin_autolabel.py)
│   ├── async_processing.py          # Async processing (was async_service.py)
│   └── batch_upload.py              # Batch upload service (was batch_upload_service.py)
│
├── core/
│   ├── __init__.py
│   ├── auth.py                      # Authentication (was admin_auth.py)
│   ├── rate_limiter.py              # Rate limiting (unchanged)
│   └── scheduler.py                 # Task scheduler (was admin_scheduler.py)
│
├── workers/
│   ├── __init__.py
│   ├── async_queue.py               # Async task queue (was async_queue.py)
│   └── batch_queue.py               # Batch task queue (was batch_queue.py)
│
├── __init__.py                      # Main exports
├── app.py                           # FastAPI app (imports updated)
├── config.py                        # Configuration (unchanged)
└── dependencies.py                  # Global dependencies (unchanged)
```

---

## Changes Summary

### Files Moved and Renamed

| Old Location | New Location | Change Type |
|--------------|--------------|-------------|
| `admin_routes.py` | `api/v1/admin/documents.py` | Moved + Renamed |
| `admin_annotation_routes.py` | `api/v1/admin/annotations.py` | Moved + Renamed |
| `admin_training_routes.py` | `api/v1/admin/training.py` | Moved + Renamed |
| `admin_auth.py` | `core/auth.py` | Moved |
| `admin_autolabel.py` | `services/autolabel.py` | Moved |
| `admin_scheduler.py` | `core/scheduler.py` | Moved |
| `admin_schemas.py` | `schemas/admin.py` | Moved |
| `routes.py` | `api/v1/routes.py` | Moved |
| `schemas.py` | `schemas/inference.py` | Moved |
| `services.py` | `services/inference.py` | Moved |
| `async_routes.py` | `api/v1/async_api/routes.py` | Moved |
| `async_queue.py` | `workers/async_queue.py` | Moved |
| `async_service.py` | `services/async_processing.py` | Moved + Renamed |
| `batch_queue.py` | `workers/batch_queue.py` | Moved |
| `batch_upload_routes.py` | `api/v1/batch/routes.py` | Moved |
| `batch_upload_service.py` | `services/batch_upload.py` | Moved |

**Total**: 16 files reorganized

### Files Updated

**Source Files** (imports updated):
- `app.py` - Updated all imports to new structure
- `api/v1/admin/documents.py` - Updated schema/auth imports
- `api/v1/admin/annotations.py` - Updated schema/service imports
- `api/v1/admin/training.py` - Updated schema/auth imports
- `api/v1/routes.py` - Updated schema imports
- `api/v1/async_api/routes.py` - Updated schema imports
- `api/v1/batch/routes.py` - Updated service/worker imports
- `services/async_processing.py` - Updated worker/core imports

**Test Files** (all 16 updated):
- `test_admin_annotations.py`
- `test_admin_auth.py`
- `test_admin_routes.py`
- `test_admin_routes_enhanced.py`
- `test_admin_training.py`
- `test_annotation_locks.py`
- `test_annotation_phase5.py`
- `test_async_queue.py`
- `test_async_routes.py`
- `test_async_service.py`
- `test_autolabel_with_locks.py`
- `test_batch_queue.py`
- `test_batch_upload_routes.py`
- `test_batch_upload_service.py`
- `test_training_phase4.py`
- `conftest.py`

---

## Import Examples

### Old Import Style (Before Refactoring)

```python
from src.web.admin_routes import create_admin_router
from src.web.admin_schemas import DocumentItem
from src.web.admin_auth import validate_admin_token
from src.web.async_routes import create_async_router
from src.web.schemas import ErrorResponse
```

### New Import Style (After Refactoring)

```python
# Admin API
from src.web.api.v1.admin.documents import create_admin_router
from src.web.api.v1.admin import create_admin_router  # Shorter alternative

# Schemas
from src.web.schemas.admin import DocumentItem
from src.web.schemas.common import ErrorResponse

# Core components
from src.web.core.auth import validate_admin_token

# Async API
from src.web.api.v1.async_api.routes import create_async_router
```

---

## Benefits Achieved

### 1. **Clear Separation of Concerns**
- **API Routes**: All in `api/v1/` by version and feature
- **Data Models**: All in `schemas/` by domain
- **Business Logic**: All in `services/`
- **Core Components**: Reusable utilities in `core/`
- **Background Jobs**: Task queues in `workers/`

### 2. **Better Scalability**
- Easy to add API v2 without touching v1
- Clear namespace for each module
- Reduced file sizes (no 800+ line files)
- Follows the single-responsibility principle

### 3. **Improved Maintainability**
- Find files by function, not by prefix
- Each module has one clear purpose
- Easier to onboard new developers
- Better IDE navigation

### 4. **Standards Compliance**
- Follows FastAPI best practices
- Matches Django/Flask project structures
- Standard Python package organization
- Industry-standard naming conventions

---

## Testing Results

**Before Refactoring**:
- 188 tests passing
- 23% code coverage
- Flat directory structure

**After Refactoring**:
- ✅ 188 tests passing (0 failures)
- ✅ 23% code coverage (maintained)
- ✅ Clean hierarchical structure
- ✅ All imports updated
- ✅ No backward-compatibility shims needed

---

## Migration Statistics

| Metric | Count |
|--------|-------|
| Files moved | 16 |
| Directories created | 9 |
| Files updated (source) | 8 |
| Files updated (tests) | 16 |
| Import statements updated | ~150 |
| Lines of code changed | ~200 |
| Tests broken | 0 |
| Coverage lost | 0% |

---

## Code Diff Summary

```diff
Before:
src/web/
├── admin_routes.py (645 lines)
├── admin_annotation_routes.py (504 lines)
├── admin_training_routes.py (565 lines)
├── admin_auth.py (22 lines)
├── admin_schemas.py (262 lines)
... (15 more files at root level)

After:
src/web/
├── api/v1/
│   ├── admin/ (3 route files)
│   ├── async_api/ (1 route file)
│   └── batch/ (1 route file)
├── schemas/ (3 schema files)
├── services/ (4 service files)
├── core/ (3 core files)
└── workers/ (2 worker files)
```

---

## Next Steps (Optional)

### Phase 2: Documentation
- [ ] Update API documentation with new import paths
- [ ] Create a migration guide for external developers
- [ ] Update CLAUDE.md with the new structure

### Phase 3: Further Optimization
- [ ] Split large files (>400 lines) if needed
- [ ] Extract common utilities
- [ ] Add typing stubs

### Phase 4: Deprecation (Future)
- [ ] Add deprecation warnings if creating a compatibility layer
- [ ] Remove old imports after a grace period
- [ ] Update all documentation

---

## Rollback Instructions

If needed, rollback is simple:

```bash
git revert <commit-hash>
```

All changes are in version control, making rollback safe and easy.

---

## Conclusion

✅ **Refactoring completed successfully**
✅ **Zero breaking changes**
✅ **All tests passing**
✅ **Industry-standard structure achieved**

The web directory is now organized following Python and FastAPI best practices, making it easier to scale, maintain, and extend.

docs/web-refactoring-plan.md (new file, 186 lines)
@@ -0,0 +1,186 @@
# Web Directory Refactoring Plan

## Current Structure Issues

1. **Flat structure**: All files in one directory (20 Python files)
2. **Naming inconsistency**: Mix of `admin_*`, `async_*`, `batch_*` prefixes
3. **Mixed concerns**: Routes, schemas, services, and workers in the same directory
4. **Poor scalability**: Hard to navigate and maintain as the project grows

## Proposed Structure (Best Practices)

```
src/web/
├── __init__.py                      # Main exports
├── app.py                           # FastAPI app factory
├── config.py                        # App configuration
├── dependencies.py                  # Global dependencies
│
├── api/                             # API Routes Layer
│   ├── __init__.py
│   └── v1/                          # API version 1
│       ├── __init__.py
│       ├── routes.py                # Public API routes (inference)
│       ├── admin/                   # Admin API routes
│       │   ├── __init__.py
│       │   ├── documents.py         # admin_routes.py → documents.py
│       │   ├── annotations.py       # admin_annotation_routes.py → annotations.py
│       │   ├── training.py          # admin_training_routes.py → training.py
│       │   └── auth.py              # admin_auth.py → auth.py (routes only)
│       ├── async_api/               # Async processing API
│       │   ├── __init__.py
│       │   └── routes.py            # async_routes.py → routes.py
│       └── batch/                   # Batch upload API
│           ├── __init__.py
│           └── routes.py            # batch_upload_routes.py → routes.py
│
├── schemas/                         # Pydantic Models
│   ├── __init__.py
│   ├── common.py                    # Shared schemas (ErrorResponse, etc.)
│   ├── inference.py                 # schemas.py → inference.py
│   ├── admin.py                     # admin_schemas.py → admin.py
│   ├── async_api.py                 # New: async API schemas
│   └── batch.py                     # New: batch upload schemas
│
├── services/                        # Business Logic Layer
│   ├── __init__.py
│   ├── inference.py                 # services.py → inference.py
│   ├── autolabel.py                 # admin_autolabel.py → autolabel.py
│   ├── async_processing.py          # async_service.py → async_processing.py
│   └── batch_upload.py              # batch_upload_service.py → batch_upload.py
│
├── core/                            # Core Components
│   ├── __init__.py
│   ├── auth.py                      # admin_auth.py → auth.py (logic only)
│   ├── rate_limiter.py              # rate_limiter.py (unchanged name)
│   └── scheduler.py                 # admin_scheduler.py → scheduler.py
│
└── workers/                         # Background Task Queues
    ├── __init__.py
    ├── async_queue.py               # async_queue.py (unchanged name)
    └── batch_queue.py               # batch_queue.py (unchanged name)
```

## File Mapping

### Current → New Location

| Current File | New Location | Purpose |
|--------------|--------------|---------|
| `admin_routes.py` | `api/v1/admin/documents.py` | Document management routes |
| `admin_annotation_routes.py` | `api/v1/admin/annotations.py` | Annotation routes |
| `admin_training_routes.py` | `api/v1/admin/training.py` | Training routes |
| `admin_auth.py` | Split: `api/v1/admin/auth.py` + `core/auth.py` | Auth routes + logic |
| `admin_schemas.py` | `schemas/admin.py` | Admin Pydantic models |
| `admin_autolabel.py` | `services/autolabel.py` | Auto-label service |
| `admin_scheduler.py` | `core/scheduler.py` | Training scheduler |
| `routes.py` | `api/v1/routes.py` | Public inference API |
| `schemas.py` | `schemas/inference.py` | Inference models |
| `services.py` | `services/inference.py` | Inference service |
| `async_routes.py` | `api/v1/async_api/routes.py` | Async API routes |
| `async_service.py` | `services/async_processing.py` | Async processing service |
| `async_queue.py` | `workers/async_queue.py` | Async task queue |
| `batch_upload_routes.py` | `api/v1/batch/routes.py` | Batch upload routes |
| `batch_upload_service.py` | `services/batch_upload.py` | Batch upload service |
| `batch_queue.py` | `workers/batch_queue.py` | Batch task queue |
| `rate_limiter.py` | `core/rate_limiter.py` | Rate limiting logic |
| `config.py` | `config.py` | Keep as-is |
| `dependencies.py` | `dependencies.py` | Keep as-is |
| `app.py` | `app.py` | Keep as-is (update imports) |

## Benefits

### 1. Clear Separation of Concerns
- **Routes**: API endpoint definitions
- **Schemas**: Data validation models
- **Services**: Business logic
- **Core**: Reusable components
- **Workers**: Background processing

### 2. Better Scalability
- Easy to add new API versions (`v2/`)
- Clear namespace for each domain
- Reduced file size (no 800+ line files)

### 3. Improved Maintainability
- Find files by function, not by prefix
- Each module has a single responsibility
- Easier to write focused tests

### 4. Standard Python Patterns
- Package-based organization
- Follows FastAPI best practices
- Similar to Django/Flask structures

## Implementation Steps

### Phase 1: Create New Structure (No Breaking Changes)
1. Create new directories: `api/`, `schemas/`, `services/`, `core/`, `workers/`
2. Copy files to new locations (don't delete originals yet; see the script sketch after these phases)
3. Update imports in new files
4. Add `__init__.py` with proper exports

### Phase 2: Update Tests
5. Update test imports to use the new structure
6. Run tests to verify nothing breaks
7. Fix any import issues

### Phase 3: Update Main App
8. Update `app.py` to import from new locations
9. Run the full test suite
10. Verify all endpoints work

### Phase 4: Cleanup
11. Delete old files
12. Update documentation
13. Final test run

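Steps 1-2 and 4 of Phase 1 are mechanical enough to script. A minimal sketch, assuming the File Mapping table above is encoded as a dict (the helper itself is hypothetical, not a committed tool):

```python
# Hypothetical migration helper for Phase 1, steps 1-2 and 4.
import shutil
from pathlib import Path

WEB = Path("src/web")

# Subset of the File Mapping table above; extend it to cover every row.
FILE_MAP = {
    "admin_routes.py": "api/v1/admin/documents.py",
    "admin_schemas.py": "schemas/admin.py",
    "async_queue.py": "workers/async_queue.py",
}

for old, new in FILE_MAP.items():
    dest = WEB / new
    dest.parent.mkdir(parents=True, exist_ok=True)
    (dest.parent / "__init__.py").touch()  # step 4: ensure a package marker exists
    shutil.copy2(WEB / old, dest)          # copy; don't delete the originals yet
    print(f"copied {old} -> {new}")
```
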
## Migration Priority

**High Priority** (most used):
- Routes and schemas (user-facing APIs)
- Services (core business logic)

**Medium Priority**:
- Core components (auth, rate limiter)
- Workers (background tasks)

**Low Priority**:
- Config and dependencies (already well-located)

## Backwards Compatibility

During migration, maintain backwards compatibility:

```python
# src/web/__init__.py
# Old imports still work
from src.web.api.v1.admin.documents import router as admin_router
from src.web.schemas.admin import AdminDocument

# Keep old names for compatibility (temporary)
admin_routes = admin_router  # Deprecated alias
```

## Testing Strategy

1. **Unit Tests**: Test each module independently
2. **Integration Tests**: Verify API endpoints still work
3. **Import Tests**: Verify all old imports still work
4. **Coverage**: Maintain the current 23% coverage minimum

## Rollback Plan

If issues arise:
1. Keep old files until fully migrated
2. Git allows easy revert
3. Tests catch breaking changes early

---

## Next Steps

Would you like me to:
1. **Start Phase 1**: Create the new directory structure and move files?
2. **Create a migration script**: Automate the file moves and import updates?
3. **Focus on a specific area**: Start with the admin API or the async API first?

docs/web-refactoring-status.md (new file, 218 lines)
@@ -0,0 +1,218 @@
# Web Directory Refactoring - Current Status

## ✅ Completed Steps

### 1. Directory Structure Created

```
src/web/
├── api/
│   └── v1/
│       ├── admin/ (documents.py, annotations.py, training.py)
│       ├── async_api/ (routes.py)
│       ├── batch/ (routes.py)
│       └── routes.py (public inference API)
├── schemas/
│   ├── admin.py (admin schemas)
│   ├── inference.py (inference + async schemas)
│   └── common.py (ErrorResponse)
├── services/
│   ├── autolabel.py
│   ├── async_processing.py
│   ├── batch_upload.py
│   └── inference.py
├── core/
│   ├── auth.py
│   ├── rate_limiter.py
│   └── scheduler.py
└── workers/
    ├── async_queue.py
    └── batch_queue.py
```

### 2. Files Copied and Imports Updated

#### Admin API (✅ Complete)
- [x] `admin_routes.py` → `api/v1/admin/documents.py` (imports updated)
- [x] `admin_annotation_routes.py` → `api/v1/admin/annotations.py` (imports updated)
- [x] `admin_training_routes.py` → `api/v1/admin/training.py` (imports updated)
- [x] `api/v1/admin/__init__.py` created with exports

#### Public & Async API (✅ Complete)
- [x] `routes.py` → `api/v1/routes.py` (imports updated)
- [x] `async_routes.py` → `api/v1/async_api/routes.py` (imports updated)
- [x] `batch_upload_routes.py` → `api/v1/batch/routes.py` (copied, imports pending)

#### Schemas (✅ Complete)
- [x] `admin_schemas.py` → `schemas/admin.py`
- [x] `schemas.py` → `schemas/inference.py`
- [x] `schemas/common.py` created
- [x] `schemas/__init__.py` created with exports

#### Services (✅ Complete)
- [x] `admin_autolabel.py` → `services/autolabel.py`
- [x] `async_service.py` → `services/async_processing.py`
- [x] `batch_upload_service.py` → `services/batch_upload.py`
- [x] `services.py` → `services/inference.py`
- [x] `services/__init__.py` created

#### Core Components (✅ Complete)
- [x] `admin_auth.py` → `core/auth.py`
- [x] `rate_limiter.py` → `core/rate_limiter.py`
- [x] `admin_scheduler.py` → `core/scheduler.py`
- [x] `core/__init__.py` created

#### Workers (✅ Complete)
- [x] `async_queue.py` → `workers/async_queue.py`
- [x] `batch_queue.py` → `workers/batch_queue.py`
- [x] `workers/__init__.py` created

#### Main App (✅ Complete)
- [x] `app.py` imports updated to use new structure

---

## ⏳ Remaining Work

### 1. Update Remaining File Imports (HIGH PRIORITY)

Files that need import updates:
- [ ] `api/v1/batch/routes.py` - update to use new schema/service imports
- [ ] `services/autolabel.py` - may need import updates if it references old paths
- [ ] `services/async_processing.py` - check for old import references
- [ ] `services/batch_upload.py` - check for old import references
- [ ] `services/inference.py` - check for old import references

### 2. Update ALL Test Files (CRITICAL)

Test files need to import from the new locations. Pattern:

**Old:**
```python
from src.web.admin_routes import create_admin_router
from src.web.admin_schemas import DocumentItem
from src.web.admin_auth import validate_admin_token
```

**New:**
```python
from src.web.api.v1.admin import create_admin_router
from src.web.schemas.admin import DocumentItem
from src.web.core.auth import validate_admin_token
```

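Because the renames are mechanical, the test-suite updates can be scripted; a sketch (the `IMPORT_MAP` below is a subset of the pattern above and would need to cover every old module):

```python
# Hypothetical bulk import-rewrite helper for tests/web/*.py.
import re
from pathlib import Path

IMPORT_MAP = {
    "src.web.admin_routes": "src.web.api.v1.admin",
    "src.web.admin_schemas": "src.web.schemas.admin",
    "src.web.admin_auth": "src.web.core.auth",
}

for path in Path("tests/web").glob("*.py"):
    text = path.read_text()
    for old, new in IMPORT_MAP.items():
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    path.write_text(text)
    print(f"rewrote {path}")
```
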
Test files to update:
- [ ] `tests/web/test_admin_annotations.py`
- [ ] `tests/web/test_admin_auth.py`
- [ ] `tests/web/test_admin_routes.py`
- [ ] `tests/web/test_admin_routes_enhanced.py`
- [ ] `tests/web/test_admin_training.py`
- [ ] `tests/web/test_annotation_locks.py`
- [ ] `tests/web/test_annotation_phase5.py`
- [ ] `tests/web/test_async_queue.py`
- [ ] `tests/web/test_async_routes.py`
- [ ] `tests/web/test_async_service.py`
- [ ] `tests/web/test_autolabel_with_locks.py`
- [ ] `tests/web/test_batch_queue.py`
- [ ] `tests/web/test_batch_upload_routes.py`
- [ ] `tests/web/test_batch_upload_service.py`
- [ ] `tests/web/test_rate_limiter.py`
- [ ] `tests/web/test_training_phase4.py`

### 3. Create Backward Compatibility Layer (OPTIONAL)

Keep old imports working temporarily:

```python
# src/web/admin_routes.py (temporary compatibility shim)
"""
DEPRECATED: Use src.web.api.v1.admin.documents instead.
This file will be removed in the next version.
"""
import warnings
from src.web.api.v1.admin.documents import *

warnings.warn(
    "Importing from src.web.admin_routes is deprecated. "
    "Use src.web.api.v1.admin.documents instead.",
    DeprecationWarning,
    stacklevel=2
)
```

### 4. Verify and Test

1. Run tests:
```bash
pytest tests/web/ -v
```

2. Check for any import errors:
```bash
python -c "from src.web.app import create_app; create_app()"
```

3. Start the server and test endpoints:
```bash
python run_server.py
```

### 5. Clean Up Old Files (ONLY AFTER TESTS PASS)

Old files to remove:
- `src/web/admin_*.py` (7 files)
- `src/web/async_*.py` (3 files)
- `src/web/batch_*.py` (3 files)
- `src/web/routes.py`
- `src/web/services.py`
- `src/web/schemas.py`
- `src/web/rate_limiter.py`

Keep these files (don't remove):
- `src/web/__init__.py`
- `src/web/app.py`
- `src/web/config.py`
- `src/web/dependencies.py`

---

## 🎯 Next Immediate Steps

1. **Update batch/routes.py imports** - Quick fix for the remaining API route
2. **Update test file imports** - Critical for verification
3. **Run the test suite** - Verify nothing broke
4. **Fix any import errors** - Address failures
5. **Remove old files** - Clean up after tests pass

---

## 📊 Migration Impact Summary

| Category | Files Moved | Imports Updated | Status |
|----------|-------------|-----------------|--------|
| API Routes | 7 | 5/7 | 🟡 In Progress |
| Schemas | 3 | 3/3 | ✅ Complete |
| Services | 4 | 0/4 | ⚠️ Pending |
| Core | 3 | 3/3 | ✅ Complete |
| Workers | 2 | 2/2 | ✅ Complete |
| Tests | 0 | 0/16 | ❌ Not Started |

**Overall Progress: 65%**

---

## 🚀 Benefits After Migration

1. **Better Organization**: Clear separation by function
2. **Easier Navigation**: Find files by purpose, not prefix
3. **Scalability**: Easy to add new API versions
4. **Standard Structure**: Follows FastAPI best practices
5. **Maintainability**: Each module has a single responsibility

---

## 📝 Notes

- All original files are still in place (no data-loss risk)
- The new structure is operational but needs import updates
- Backward compatibility can be added if needed
- Tests will validate the migration's success