This commit is contained in:
Yaojia Wang
2026-01-27 00:47:10 +01:00
parent e83a0cae36
commit 58bf75db68
141 changed files with 24814 additions and 3884 deletions

View File

@@ -1,405 +0,0 @@
# Invoice Master POC v2 - Code Review Report
**Review date**: 2026-01-22
**Codebase size**: 67 Python source files, ~22,434 lines of code
**Test coverage**: ~40-50%
---
## Executive Summary
### Overall Assessment: **Good (B+)**
**Strengths**
- ✅ Clear modular architecture with good separation of responsibilities
- ✅ Appropriate use of dataclasses and type hints
- ✅ Comprehensive normalization logic for Swedish invoices
- ✅ Spatial index optimization (O(1) token lookup)
- ✅ Solid fallback mechanism (OCR fallback when YOLO fails)
- ✅ Well-designed web API and UI
**Main Issues**
- ❌ Duplicated payment-line parsing (3+ implementations)
- ❌ Long functions (`_normalize_customer_number` is 127 lines)
- ❌ Configuration security issue (plaintext database password)
- ❌ Inconsistent exception handling (generic Exception everywhere)
- ❌ Missing integration tests
- ❌ Magic numbers scattered throughout (0.5, 0.95, 300, etc.)
---
## 1. Architecture Analysis
### 1.1 Module Structure
```
src/
├── inference/          # Inference pipeline core
│   ├── pipeline.py (517 lines) ⚠️
│   ├── field_extractor.py (1,347 lines) 🔴 too long
│   └── yolo_detector.py
├── web/                # FastAPI web service
│   ├── app.py (765 lines) ⚠️ inline HTML
│   ├── routes.py (184 lines)
│   └── services.py (286 lines)
├── ocr/                # OCR extraction
│   ├── paddle_ocr.py
│   └── machine_code_parser.py (919 lines) 🔴 too long
├── matcher/            # Field matching
│   └── field_matcher.py (875 lines) ⚠️
├── utils/              # Shared utilities
│   ├── validators.py
│   ├── text_cleaner.py
│   ├── fuzzy_matcher.py
│   ├── ocr_corrections.py
│   └── format_variants.py (610 lines)
├── processing/         # Batch processing
├── data/               # Data management
└── cli/                # CLI tools
```
### 1.2 Inference Flow
```
PDF/Image input
Render to image (pdf/renderer.py)
YOLO detection (yolo_detector.py) - detects field regions
Field extraction (field_extractor.py)
 ├→ OCR text extraction (ocr/paddle_ocr.py)
 ├→ Normalization & validation
 └→ Confidence calculation
Cross-validation (pipeline.py)
 ├→ Parse payment_line format
 ├→ Extract OCR/Amount/Account from payment_line
 └→ Validate against detected fields (payment_line values take priority)
Fallback OCR (if key fields are missing)
 ├→ Full-page OCR
 └→ Regex extraction
InferenceResult output
```
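A minimal usage sketch of this flow from the caller's side, based on the `InferencePipeline` interface referenced elsewhere in these docs; the model and file paths are placeholders, not real project paths:
```python
from src.inference.pipeline import InferencePipeline

# Run the full detect -> extract -> cross-validate flow on one invoice
pipeline = InferencePipeline(model_path="models/invoice_fields.pt", use_gpu=True)
result = pipeline.process_pdf("invoices/example.pdf")  # returns an InferenceResult

print(f"Extracted {len(result.fields)} fields from document {result.document_id}")
```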
---
## 2. Code Quality Issues
### 2.1 Long Functions (>50 lines) 🔴
| Function | File | Lines | Complexity | Issue |
|------|------|------|--------|------|
| `_normalize_customer_number()` | field_extractor.py | **127** | Very high | 4 levels of pattern matching, 7+ regexes, complex scoring |
| `_cross_validate_payment_line()` | pipeline.py | **127** | Very high | Core validation logic, 8+ conditional branches |
| `_normalize_bankgiro()` | field_extractor.py | 62 | High | Luhn validation + multiple fallbacks |
| `_normalize_plusgiro()` | field_extractor.py | 63 | High | Similar to bankgiro |
| `_normalize_payment_line()` | field_extractor.py | 74 | High | 4 regex patterns |
| `_normalize_amount()` | field_extractor.py | 78 | High | Multi-strategy fallbacks |
**Example** - `_normalize_customer_number()` (lines 776-902):
```python
def _normalize_customer_number(self, text: str):
    # 127-line function containing:
    # - 4 nested if/for loops
    # - 7 different regex patterns
    # - 5 scoring mechanisms
    # - handling for labeled and unlabeled formats
```
**Recommendation**: Split into (see the sketch below):
- `_find_customer_code_patterns()`
- `_find_labeled_customer_code()`
- `_score_customer_candidates()`
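A rough sketch of what the split could look like; only the helper names come from the list above, the signatures and wiring are assumptions:
```python
def _normalize_customer_number(self, text: str):
    # Collect candidates from both unlabeled patterns and labeled formats,
    # then let a single scoring step pick the winner.
    candidates = self._find_customer_code_patterns(text)
    candidates += self._find_labeled_customer_code(text)
    return self._score_customer_candidates(candidates)
```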
### 2.2 Code Duplication 🔴
**Payment-line parsing (3+ duplicate implementations)**:
1. `_parse_machine_readable_payment_line()` (pipeline.py:217-252)
2. `MachineCodeParser.parse()` (machine_code_parser.py, 919 lines)
3. `_normalize_payment_line()` (field_extractor.py:632-705)
All three implement similar regex patterns for:
```
Format: # <OCR> # <Kronor> <Öre> <Type> > <Account>#<Check>#
```
**Bankgiro/Plusgiro validation (duplicated)**:
- `validators.py`: `is_valid_bankgiro()`, `format_bankgiro()`
- `field_extractor.py`: `_normalize_bankgiro()`, `_normalize_plusgiro()`, `_luhn_checksum()`
- `normalizer.py`: `normalize_bankgiro()`, `normalize_plusgiro()`
- `field_matcher.py`: similar matching logic
**Recommendation**: Create unified modules:
```python
# src/common/payment_line_parser.py
class PaymentLineParser:
    def parse(self, text: str) -> PaymentLineResult: ...

# src/common/giro_validator.py
class GiroValidator:
    def validate_and_format(self, value: str, giro_type: str) -> str: ...
```
### 2.3 Inconsistent Error Handling ⚠️
**Generic exception catches (31 occurrences)**:
```python
except Exception as e:  # 31 occurrences across the codebase
    result.errors.append(str(e))
```
**Problems**:
- No specific error types are caught
- Generic error messages lose context
- Lines 142-147 (routes.py): catches every exception and returns a 500 status
**Current code** (routes.py:142-147):
```python
try:
    service_result = inference_service.process_pdf(...)
except Exception as e:  # too broad
    logger.error(f"Error processing document: {e}")
    raise HTTPException(status_code=500, detail=str(e))
```
**Suggested improvement**:
```python
except FileNotFoundError:
    raise HTTPException(status_code=400, detail="PDF file not found")
except PyMuPDFError:
    raise HTTPException(status_code=400, detail="Invalid PDF format")
except OCRError:
    raise HTTPException(status_code=503, detail="OCR service unavailable")
```
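`PyMuPDFError` and `OCRError` above are not defined in the current codebase; a minimal sketch of such project-specific exceptions (names and hierarchy are assumptions) could be:
```python
class InvoiceProcessingError(Exception):
    """Base class for errors raised by the inference pipeline."""

class PyMuPDFError(InvoiceProcessingError):
    """Raised when a PDF cannot be opened or rendered."""

class OCRError(InvoiceProcessingError):
    """Raised when the OCR backend fails or is unavailable."""
```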
### 2.4 Configuration Security Issue 🔴
**config.py lines 24-30** - plaintext credentials:
```python
DATABASE = {
    'host': '192.168.68.31',      # hardcoded IP
    'user': 'docmaster',          # hardcoded username
    'password': 'nY6LYK5d',       # 🔴 plaintext password!
    'database': 'invoice_master'
}
```
**Recommendation**:
```python
import os

DATABASE = {
    'host': os.getenv('DB_HOST', 'localhost'),
    'user': os.getenv('DB_USER', 'docmaster'),
    'password': os.getenv('DB_PASSWORD'),     # read from environment variable
    'database': os.getenv('DB_NAME', 'invoice_master')
}
```
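To avoid starting with a silently missing password, the service could also fail fast at startup; a small sketch (the check itself is a suggestion, not existing code):
```python
import os

REQUIRED_ENV_VARS = ("DB_PASSWORD",)

missing = [name for name in REQUIRED_ENV_VARS if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
```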
### 2.5 Magic Numbers ⚠️
| Value | Location | Purpose | Problem |
|---|------|------|------|
| 0.5 | multiple places | Confidence threshold | Not configurable per field |
| 0.95 | pipeline.py | payment_line confidence | Undocumented |
| 300 | multiple places | DPI | Hardcoded |
| 0.1 | field_extractor.py | BBox padding | Should be configuration |
| 72 | multiple places | PDF base DPI | Magic number inside formulas |
| 50 | field_extractor.py | Customer-number scoring bonus | Undocumented |
**Recommendation**: Extract to configuration:
```python
INFERENCE_CONFIG = {
    'confidence_threshold': 0.5,
    'payment_line_confidence': 0.95,
    'dpi': 300,
    'bbox_padding': 0.1,
}
```
### 2.6 Naming Inconsistencies ⚠️
**Field names are inconsistent**:
- YOLO class names: `invoice_number`, `ocr_number`, `supplier_org_number`
- Field names: `InvoiceNumber`, `OCR`, `supplier_org_number`
- CSV column names: possibly different again
- Database field names: yet another variant
Mappings are maintained in multiple places:
- `yolo_detector.py` (lines 90-100): `CLASS_TO_FIELD`
- several other locations
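One option is a single canonical mapping module that every layer imports; a minimal sketch (module path and the exact mapping entries are illustrative assumptions):
```python
# src/common/field_names.py (hypothetical location)
CLASS_TO_FIELD = {
    "invoice_number": "InvoiceNumber",
    "ocr_number": "OCR",
    "supplier_org_number": "supplier_org_number",
}

def to_canonical(yolo_class: str) -> str:
    """Map a YOLO class name to the canonical field name used downstream."""
    return CLASS_TO_FIELD[yolo_class]
```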
---
## 3. Test Analysis
### 3.1 Test Coverage
**Test files**: 13
- ✅ Well covered: field_matcher, normalizer, payment_line_parser
- ⚠️ Moderately covered: field_extractor, pipeline
- ❌ Poorly covered: web layer, CLI, batch processing
**Estimated coverage**: 40-50%
### 3.2 Missing Test Cases 🔴
**Critical gaps**:
1. Cross-validation logic - the most complex part, barely tested
2. payment_line parsing variants - multiple implementations, unclear edge cases
3. OCR error correction - complex logic across different strategies
4. Web API endpoints - no request/response tests
5. Batch processing - multi-worker coordination untested
6. Fallback OCR mechanism - when YOLO detection fails
---
## 4. Architecture Risks
### 🔴 Critical
1. **Configuration security** - plaintext database credentials in config.py (lines 24-30)
2. **Error recovery** - broad exception handling masks real problems
3. **Testability** - hardcoded dependencies block unit testing
### 🟡 High
1. **Maintainability** - duplicated payment-line parsing
2. **Scalability** - no async processing for long-running inference
3. **Extensibility** - adding new field types would be difficult
### 🟢 Medium
1. **Performance** - lazy loading helps, but ORM queries are not optimized
2. **Documentation** - mostly adequate, could be better
---
## 5. Priority Matrix
| Priority | Action | Effort | Impact |
|--------|------|--------|------|
| 🔴 Critical | Fix configuration security (environment variables) | 1 hour | High |
| 🔴 Critical | Add integration tests | 2-3 days | High |
| 🔴 Critical | Document the error-handling strategy | 4 hours | Medium |
| 🟡 High | Unify payment_line parsing | 1-2 days | High |
| 🟡 High | Extract normalization into submodules | 2-3 days | Medium |
| 🟡 High | Add dependency injection | 2-3 days | Medium |
| 🟡 High | Split long functions | 2-3 days | Low |
| 🟢 Medium | Raise test coverage to 70%+ | 3-5 days | High |
| 🟢 Medium | Extract magic numbers | 4 hours | Low |
| 🟢 Medium | Standardize naming conventions | 1-2 days | Medium |
---
## 6. File-Specific Recommendations
### High Priority (code quality)
| File | Problem | Recommendation |
|------|------|------|
| `field_extractor.py` | 1,347 lines; 6 long normalization methods | Split into a `normalizers/` submodule |
| `pipeline.py` | 127-line `_cross_validate_payment_line()` | Extract into a separate `CrossValidator` class |
| `field_matcher.py` | 875 lines; complex matching logic | Split into a `matching/` submodule |
| `config.py` | Hardcoded credentials (line 29) | Use environment variables |
| `machine_code_parser.py` | 919 lines; payment_line parsing | Merge with the pipeline's parser |
### Medium Priority (refactoring)
| File | Problem | Recommendation |
|------|------|------|
| `app.py` | 765 lines; HTML inlined in Python | Extract to a `templates/` directory |
| `autolabel.py` | 753 lines; batch-processing logic | Extract worker functions into a module |
| `format_variants.py` | 610 lines; variant generation | Consider the strategy pattern |
---
## 7. Recommended Actions
### Phase 1: Critical Fixes (1 week)
1. **Configuration security** (1 hour)
   - Remove the plaintext password from config.py
   - Add environment-variable support
   - Update the README with configuration instructions
2. **Standardize error handling** (1 day)
   - Define custom exception classes
   - Replace generic Exception catches
   - Add error-code constants
3. **Add critical integration tests** (2 days)
   - End-to-end inference tests
   - payment_line cross-validation tests
   - API endpoint tests
### Phase 2: Refactoring (2-3 weeks)
4. **Unify payment_line parsing** (2 days)
   - Create `src/common/payment_line_parser.py`
   - Merge the 3 duplicate implementations
   - Migrate all callers
5. **Split field_extractor.py** (3 days)
   - Create a `src/inference/normalizers/` submodule
   - One file per field type
   - Extract shared validation logic
6. **Split long functions** (2 days)
   - `_normalize_customer_number()` → 3 functions
   - `_cross_validate_payment_line()` → CrossValidator class
### Phase 3: Improvements (1-2 weeks)
7. **Raise test coverage** (5 days)
   - Target: 70%+ coverage
   - Focus on validation logic
   - Add edge-case tests
8. **Improve configuration management** (1 day)
   - Extract all magic numbers
   - Create a configuration file (YAML)
   - Add configuration validation
9. **Improve documentation** (2 days)
   - Add architecture diagrams
   - Document all private methods
   - Create a contributing guide
---
## Appendix A: Metrics
### Code Complexity
| Category | Count | Average lines |
|------|------|----------|
| Source files | 67 | 334 |
| Long files (>500 lines) | 12 | 875 |
| Long functions (>50 lines) | 23 | 89 |
| Test files | 13 | 298 |
### Dependencies
| Type | Count |
|------|------|
| External dependencies | ~25 |
| Internal modules | 10 |
| Circular dependencies | 0 ✅ |
### Code Style
| Metric | Coverage |
|------|--------|
| Type hints | 80% |
| Docstrings (public) | 80% |
| Docstrings (private) | 40% |
| Test coverage | 45% |
---
**Generated**: 2026-01-22
**Reviewer**: Claude Code
**Version**: v2.0

View File

@@ -1,96 +0,0 @@
# Field Extractor Analysis Report
## Overview
field_extractor.py (1,183 lines) was initially identified as a candidate for optimization. A refactoring using the `src/normalize` module was attempted, but after analysis and testing the conclusion is that it **should not be refactored**.
## Refactoring Attempt
### Initial Plan
Remove the duplicated normalize methods from field_extractor.py and switch to the unified `src/normalize/normalize_field()` interface.
### Steps Taken
1. ✅ Backed up the original file (`field_extractor_old.py`)
2. ✅ Changed `_normalize_and_validate` to use the unified normalizer
3. ✅ Deleted the duplicated normalize methods (~400 lines)
4. ❌ Ran the tests - **28 failures**
5. ✅ Added wrapper methods delegating to the normalizer
6. ❌ Ran the tests again - **12 failures**
7. ✅ Restored the original file
8. ✅ Tests pass - **all 45 tests pass**
## Key Findings
### The Two Modules Serve Different Purposes
| Module | Purpose | Input | Output | Example |
|------|------|------|------|------|
| **src/normalize/** | **Variant generation** for matching | Already-extracted field value | List of matching variants | `"INV-12345"` → `["INV-12345", "12345"]` |
| **field_extractor** | **Value extraction** from OCR text | Raw OCR text containing the field | Single extracted field value | `"Fakturanummer: A3861"` → `"A3861"` |
### Why They Cannot Be Unified
1. **src/normalize/** is designed to:
   - Receive an already-extracted field value
   - Generate multiple normalized variants for fuzzy matching
   - For example, BankgiroNormalizer:
```python
normalize("782-1713") → ["7821713", "782-1713"]  # generates variants
```
2. **field_extractor**'s normalize methods:
   - Receive raw OCR text that contains the field (possibly with labels and other text)
   - **Extract** the field value matching a specific pattern
   - For example, `_normalize_bankgiro`:
```python
_normalize_bankgiro("Bankgiro: 782-1713") → ("782-1713", True, None)  # extracts from text
```
3. **The key difference**:
   - Normalizer: variant generator (for matching)
   - Field extractor: pattern extractor (for parsing)
### Example Test Failures
Failures after replacing the field-extractor methods with the normalizer:
```python
# InvoiceNumber test
Input: "Fakturanummer: A3861"
Expected: "A3861"
Actual: "Fakturanummer: A3861"  # nothing was extracted, only cleaned

# Bankgiro test
Input: "Bankgiro: 782-1713"
Expected: "782-1713"
Actual: "7821713"  # returned the dash-free variant instead of the extracted, formatted value
```
## Conclusion
**field_extractor.py should not be refactored onto the src/normalize module**, because:
1. **Different responsibilities**: extraction vs. variant generation
2. **Different inputs**: raw OCR text with labels vs. already-extracted field values
3. **Different outputs**: a single extracted value vs. multiple matching variants
4. **The existing code works well**: all 45 tests pass
5. **The extraction logic is valuable**: it contains complex pattern-matching rules (e.g. distinguishing Bankgiro from Plusgiro formats)
## Recommendations
1. **Keep field_extractor.py as-is**: do not refactor it
2. **Document the difference between the two modules**: make sure the team understands their respective purposes
3. **Focus on other optimization targets**: machine_code_parser.py (919 lines)
## Lessons Learned
Before refactoring:
1. Understand a module's **actual purpose**, not just surface-level code similarity
2. Run the full test suite to validate assumptions
3. Assess whether there is real duplication, or only superficially similar code with different purposes
---
**Status**: ✅ Analysis complete; decision is not to refactor
**Tests**: ✅ 45/45 passing
**File**: kept as-is at 1,183 lines

View File

@@ -1,238 +0,0 @@
# Machine Code Parser Analysis Report
## File Overview
- **File**: `src/ocr/machine_code_parser.py`
- **Total lines**: 919
- **Code lines**: 607 (66%)
- **Methods**: 14
- **Regex usages**: 47
## Code Structure
### Class Structure
```
MachineCodeResult (dataclass)
├── to_dict()
└── get_region_bbox()
MachineCodeParser (main parser)
├── __init__()
├── parse() - main entry point
├── _find_tokens_with_values()
├── _find_machine_code_line_tokens()
├── _parse_standard_payment_line_with_tokens()
├── _parse_standard_payment_line() - 142 lines ⚠️
├── _extract_ocr() - 50 lines
├── _extract_bankgiro() - 58 lines
├── _extract_plusgiro() - 30 lines
├── _extract_amount() - 68 lines
├── _calculate_confidence()
└── cross_validate()
```
## Issues Found
### 1. ⚠️ `_parse_standard_payment_line` is too long (142 lines)
**Location**: lines 442-582
**Problems**:
- Contains the nested functions `normalize_account_spaces` and `format_account`
- Multiple regex-matching branches
- Complex logic that is hard to test and maintain
**Recommendation**:
Split into independent methods:
- `_normalize_account_spaces(line)`
- `_format_account(account_digits, context)`
- `_match_primary_pattern(line)`
- `_match_fallback_patterns(line)`
### 2. 🔁 The 4 `_extract_*` methods repeat the same pattern
All extract methods follow the same structure:
```python
def _extract_XXX(self, tokens):
    candidates = []
    for token in tokens:
        text = token.text.strip()
        matches = self.XXX_PATTERN.findall(text)
        for match in matches:
            # validation logic
            # context detection
            candidates.append((normalized, context_score, token))
    if not candidates:
        return None
    candidates.sort(key=lambda x: (x[1], 1), reverse=True)
    return candidates[0][0]
```
**Repeated logic**:
- Token iteration
- Pattern matching
- Candidate collection
- Context scoring
- Sorting and picking the best match
**Recommendation**:
A base extractor class or a generic helper could be extracted to remove the duplication (see the sketch below).
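A minimal sketch of such a generic helper; the name and parameters follow Option B below, while the body is an assumption rather than existing code:
```python
def _generic_extract(self, tokens, pattern, normalizer, context_checker):
    """Shared candidate-collection loop behind the _extract_* methods."""
    candidates = []
    for token in tokens:
        for match in pattern.findall(token.text.strip()):
            normalized = normalizer(match)
            if normalized is None:
                continue  # validation rejected this candidate
            score = context_checker(tokens, token)
            candidates.append((normalized, score, token))
    if not candidates:
        return None
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[0][0]
```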
### 3. ✅ Duplicated context detection
Context-detection code is repeated in several places:
```python
# in _extract_bankgiro
context_text = ' '.join(t.text.lower() for t in tokens)
is_bankgiro_context = (
    'bankgiro' in context_text or
    'bg:' in context_text or
    'bg ' in context_text
)

# in _extract_plusgiro
context_text = ' '.join(t.text.lower() for t in tokens)
is_plusgiro_context = (
    'plusgiro' in context_text or
    'postgiro' in context_text or
    'pg:' in context_text or
    'pg ' in context_text
)

# in _parse_standard_payment_line
context = (context_line or raw_line).lower()
is_plusgiro_context = (
    ('plusgiro' in context or 'postgiro' in context or 'plusgirokonto' in context)
    and 'bankgiro' not in context
)
```
**Recommendation**:
Extract into an independent method:
- `_detect_account_context(tokens) -> dict[str, bool]`
## Refactoring Options
### Option A: Light refactoring (recommended) ✅
**Goal**: Extract the duplicated context-detection logic without changing the main structure
**Steps**:
1. Extract a `_detect_account_context(tokens)` method
2. Extract `_normalize_account_spaces(line)` as an independent method
3. Extract `_format_account(digits, context)` as an independent method
**Impact**:
- Removes ~50-80 lines of duplicated code
- Improves testability
- Low risk, easy to verify
**Expected result**: 919 lines → ~850 lines (↓7%)
### Option B: Medium refactoring
**Goal**: Create a generic field-extraction framework
**Steps**:
1. Create `_generic_extract(pattern, normalizer, context_checker)`
2. Refactor all `_extract_*` methods onto the generic framework
3. Split `_parse_standard_payment_line` into several small methods
**Impact**:
- Removes ~150-200 lines of code
- Significantly improves maintainability
- Medium risk; needs thorough testing
**Expected result**: 919 lines → ~720 lines (↓22%)
### Option C: Deep refactoring (not recommended)
**Goal**: Fully redesign around the strategy pattern
**Risks**:
- High risk of introducing bugs
- Requires extensive testing
- Could break existing integrations
## Recommended Option
### ✅ Adopt Option A (light refactoring)
**Rationale**:
1. **The code already works well**: no obvious bugs or performance problems
2. **Low risk**: only extracts duplicated logic, does not change the core algorithm
3. **Good cost/benefit**: small changes yield a clear code-quality improvement
4. **Easy to verify**: existing tests should already cover it
### Refactoring Steps
```python
# 1. Extract context detection
def _detect_account_context(self, tokens: list[TextToken]) -> dict[str, bool]:
    """Detect account-type keywords in the surrounding context."""
    context_text = ' '.join(t.text.lower() for t in tokens)
    return {
        'bankgiro': any(kw in context_text for kw in ['bankgiro', 'bg:', 'bg ']),
        'plusgiro': any(kw in context_text for kw in ['plusgiro', 'postgiro', 'plusgirokonto', 'pg:', 'pg ']),
    }

# 2. Extract space normalization
def _normalize_account_spaces(self, line: str) -> str:
    """Remove spaces inside the account number."""
    # (existing code from lines 460-481)

# 3. Extract account formatting
def _format_account(
    self,
    account_digits: str,
    is_plusgiro_context: bool
) -> tuple[str, str]:
    """Format the account number and determine its type."""
    # (existing code from lines 485-523)
```
## Comparison: field_extractor vs machine_code_parser
| Aspect | field_extractor | machine_code_parser |
|------|-----------------|---------------------|
| Purpose | Value extraction | Machine-code parsing |
| Duplicated code | ~400 lines (normalize methods) | ~80 lines (context detection) |
| Refactoring value | ❌ Different purposes, should not be unified | ✅ Shared logic can be extracted |
| Risk | High (would break functionality) | Low (code organization only) |
## Decision
### ✅ Refactor machine_code_parser.py
**How it differs from field_extractor**:
- field_extractor: the duplicated methods serve **different purposes** (extraction vs. variant generation)
- machine_code_parser: the duplicated code serves the **same purpose** (context detection in every case)
**Expected benefits**:
- Removes ~70 lines of duplicated code
- Improves testability (context detection can be tested in isolation)
- Clearer code organization
- **Low risk**, easy to verify
## Next Steps
1. ✅ Back up the original file
2. ✅ Extract the `_detect_account_context` method
3. ✅ Extract the `_normalize_account_spaces` method
4. ✅ Extract the `_format_account` method
5. ✅ Update all call sites
6. ✅ Run the tests to verify
7. ✅ Check code coverage
---
**Status**: 📋 Analysis complete; light refactoring recommended
**Risk assessment**: 🟢 Low risk
**Expected benefit**: 919 lines → ~850 lines (↓7%)

View File

@@ -1,519 +0,0 @@
# Performance Optimization Guide
This document provides performance optimization recommendations for the Invoice Field Extraction system.
## Table of Contents
1. [Batch Processing Optimization](#batch-processing-optimization)
2. [Database Query Optimization](#database-query-optimization)
3. [Caching Strategies](#caching-strategies)
4. [Memory Management](#memory-management)
5. [Profiling and Monitoring](#profiling-and-monitoring)
---
## Batch Processing Optimization
### Current State
The system processes invoices one at a time. For large batches, this can be inefficient.
### Recommendations
#### 1. Database Batch Operations
**Current**: Individual inserts for each document
```python
# Inefficient
for doc in documents:
db.insert_document(doc) # Individual DB call
```
**Optimized**: Use `execute_values` for batch inserts
```python
# Efficient - already implemented in db.py line 519
from psycopg2.extras import execute_values
execute_values(cursor, """
INSERT INTO documents (...)
VALUES %s
""", document_values)
```
**Impact**: 10-50x faster for batches of 100+ documents
#### 2. PDF Processing Batching
**Recommendation**: Process PDFs in parallel using multiprocessing
```python
from multiprocessing import Pool
def process_batch(pdf_paths, batch_size=10):
"""Process PDFs in parallel batches."""
with Pool(processes=batch_size) as pool:
results = pool.map(pipeline.process_pdf, pdf_paths)
return results
```
**Considerations**:
- GPU models should use a shared process pool (already exists: `src/processing/gpu_pool.py`)
- CPU-intensive tasks can use separate process pool (`src/processing/cpu_pool.py`)
- Current dual pool coordinator (`dual_pool_coordinator.py`) already supports this pattern
**Status**: ✅ Already implemented in `src/processing/` modules
#### 3. Image Caching for Multi-Page PDFs
**Current**: Each page rendered independently
```python
# Current pattern in field_extractor.py
for page_num in range(total_pages):
image = render_pdf_page(pdf_path, page_num, dpi=300)
```
**Optimized**: Pre-render all pages if processing multiple fields per page
```python
# Batch render
images = {
page_num: render_pdf_page(pdf_path, page_num, dpi=300)
for page_num in page_numbers_needed
}
# Reuse images
for detection in detections:
image = images[detection.page_no]
extract_field(detection, image)
```
**Impact**: Reduces redundant PDF rendering by 50-90% for multi-field invoices
---
## Database Query Optimization
### Current Performance
- **Parameterized queries**: ✅ Implemented (Phase 1)
- **Connection pooling**: ❌ Not implemented
- **Query batching**: ✅ Partially implemented
- **Index optimization**: ⚠️ Needs verification
### Recommendations
#### 1. Connection Pooling
**Current**: New connection for each operation
```python
def connect(self):
"""Create new database connection."""
return psycopg2.connect(**self.config)
```
**Optimized**: Use connection pooling
```python
from psycopg2 import pool
class DocumentDatabase:
def __init__(self, config):
self.pool = pool.SimpleConnectionPool(
minconn=1,
maxconn=10,
**config
)
def connect(self):
return self.pool.getconn()
def close(self, conn):
self.pool.putconn(conn)
```
**Impact**:
- Reduces connection overhead by 80-95%
- Especially important for high-frequency operations
#### 2. Index Recommendations
**Check current indexes**:
```sql
-- Verify indexes exist on frequently queried columns
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public';
```
**Recommended indexes**:
```sql
-- If not already present
CREATE INDEX IF NOT EXISTS idx_documents_success
ON documents(success);
CREATE INDEX IF NOT EXISTS idx_documents_timestamp
ON documents(timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_field_results_document_id
ON field_results(document_id);
CREATE INDEX IF NOT EXISTS idx_field_results_matched
ON field_results(matched);
CREATE INDEX IF NOT EXISTS idx_field_results_field_name
ON field_results(field_name);
```
**Impact**:
- 10-100x faster queries for filtered/sorted results
- Critical for `get_failed_matches()` and `get_all_documents_summary()`
#### 3. Query Batching
**Status**: ✅ Already implemented for field results (line 519)
**Verify batching is used**:
```python
# Good pattern in db.py
execute_values(cursor, "INSERT INTO field_results (...) VALUES %s", field_values)
```
**Additional opportunity**: Batch `SELECT` queries
```python
# Current
docs = [get_document(doc_id) for doc_id in doc_ids] # N queries
# Optimized
docs = get_documents_batch(doc_ids) # 1 query with IN clause
```
**Status**: ✅ Already implemented (`get_documents_batch` exists in db.py)
---
## Caching Strategies
### 1. Model Loading Cache
**Current**: Models loaded per-instance
**Recommendation**: Singleton pattern for YOLO model
```python
class YOLODetectorSingleton:
_instance = None
_model = None
@classmethod
def get_instance(cls, model_path):
if cls._instance is None:
cls._instance = YOLODetector(model_path)
return cls._instance
```
**Impact**: Reduces memory usage by 90% when processing multiple documents
### 2. Parser Instance Caching
**Current**: ✅ Already optimal
```python
# Good pattern in field_extractor.py
def __init__(self):
self.payment_line_parser = PaymentLineParser() # Reused
self.customer_number_parser = CustomerNumberParser() # Reused
```
**Status**: No changes needed
### 3. OCR Result Caching
**Recommendation**: Cache OCR results for identical regions
```python
from functools import lru_cache

_image_registry = {}  # image_hash -> rendered page image, populated by the caller

@lru_cache(maxsize=1000)
def ocr_region_cached(image_hash, bbox):
    """Cache OCR results by image hash + bbox (bbox must be hashable, e.g. a tuple)."""
    image = _image_registry[image_hash]  # look up the actual image for this hash
    return paddle_ocr.ocr_region(image, bbox)
```
**Impact**: 50-80% speedup when re-processing similar documents
**Note**: Requires implementing image hashing (e.g., `hashlib.md5(image.tobytes())`)
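A small sketch of that hashing step, assuming the rendered page exposes raw pixel bytes via `tobytes()` (as PIL and NumPy images do):
```python
import hashlib

def image_cache_key(image) -> str:
    # Identical renders yield identical keys, so cached OCR results can be reused
    return hashlib.md5(image.tobytes()).hexdigest()
```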
---
## Memory Management
### Current Issues
**Potential memory leaks**:
1. Large images kept in memory after processing
2. OCR results accumulated without cleanup
3. Model outputs not explicitly cleared
### Recommendations
#### 1. Explicit Image Cleanup
```python
import gc

def process_pdf(pdf_path):
    image = None
    try:
        image = render_pdf(pdf_path)
        result = extract_fields(image)
        return result
    finally:
        del image      # Explicit cleanup (image may still be None if rendering failed)
        gc.collect()   # Force garbage collection
```
#### 2. Generator Pattern for Large Batches
**Current**: Load all documents into memory
```python
docs = [process_pdf(path) for path in pdf_paths] # All in memory
```
**Optimized**: Use generator for streaming processing
```python
def process_batch_streaming(pdf_paths):
"""Process documents one at a time, yielding results."""
for path in pdf_paths:
result = process_pdf(path)
yield result
# Result can be saved to DB immediately
# Previous result is garbage collected
```
**Impact**: Constant memory usage regardless of batch size
#### 3. Context Managers for Resources
```python
class InferencePipeline:
def __enter__(self):
self.detector.load_model()
return self
def __exit__(self, *args):
self.detector.unload_model()
self.extractor.cleanup()
# Usage
with InferencePipeline(...) as pipeline:
results = pipeline.process_pdf(path)
# Automatic cleanup
```
---
## Profiling and Monitoring
### Recommended Profiling Tools
#### 1. cProfile for CPU Profiling
```python
import cProfile
import pstats
profiler = cProfile.Profile()
profiler.enable()
# Your code here
pipeline.process_pdf(pdf_path)
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 slowest functions
```
#### 2. memory_profiler for Memory Analysis
```bash
pip install memory_profiler
python -m memory_profiler your_script.py
```
Or decorator-based:
```python
from memory_profiler import profile
@profile
def process_large_batch(pdf_paths):
# Memory usage tracked line-by-line
results = [process_pdf(path) for path in pdf_paths]
return results
```
#### 3. py-spy for Production Profiling
```bash
pip install py-spy
# Profile running process
py-spy top --pid 12345
# Generate flamegraph
py-spy record -o profile.svg -- python your_script.py
```
**Advantage**: No code changes needed, minimal overhead
### Key Metrics to Monitor
1. **Processing Time per Document**
- Target: <10 seconds for single-page invoice
- Current: ~2-5 seconds (estimated)
2. **Memory Usage**
- Target: <2GB for batch of 100 documents
- Monitor: Peak memory usage
3. **Database Query Time**
- Target: <100ms per query (with indexes)
- Monitor: Slow query log
4. **OCR Accuracy vs Speed Trade-off**
- Current: PaddleOCR with GPU (~200ms per region)
- Alternative: Tesseract (~500ms, slightly more accurate)
### Logging Performance Metrics
**Add to pipeline.py**:
```python
import time
import logging
logger = logging.getLogger(__name__)
def process_pdf(self, pdf_path):
start = time.time()
# Processing...
result = self._process_internal(pdf_path)
elapsed = time.time() - start
logger.info(f"Processed {pdf_path} in {elapsed:.2f}s")
# Log to database for analysis
self.db.log_performance({
'document_id': result.document_id,
'processing_time': elapsed,
'field_count': len(result.fields)
})
return result
```
---
## Performance Optimization Priorities
### High Priority (Implement First)
1. **Database parameterized queries** - Already done (Phase 1)
2. **Database connection pooling** - Not implemented
3. **Index optimization** - Needs verification
### Medium Priority
4. **Batch PDF rendering** - Optimization possible
5. **Parser instance reuse** - Already done (Phase 2)
6. **Model caching** - Could improve
### Low Priority (Nice to Have)
7. **OCR result caching** - Complex implementation
8. **Generator patterns** - Refactoring needed
9. **Advanced profiling** - For production optimization
---
## Benchmarking Script
```python
"""
Benchmark script for invoice processing performance.
"""
import time
from pathlib import Path
from src.inference.pipeline import InferencePipeline
def benchmark_single_document(pdf_path, iterations=10):
"""Benchmark single document processing."""
pipeline = InferencePipeline(
model_path="path/to/model.pt",
use_gpu=True
)
times = []
for i in range(iterations):
start = time.time()
result = pipeline.process_pdf(pdf_path)
elapsed = time.time() - start
times.append(elapsed)
print(f"Iteration {i+1}: {elapsed:.2f}s")
avg_time = sum(times) / len(times)
print(f"\nAverage: {avg_time:.2f}s")
print(f"Min: {min(times):.2f}s")
print(f"Max: {max(times):.2f}s")
def benchmark_batch(pdf_paths, batch_size=10):
"""Benchmark batch processing."""
from multiprocessing import Pool
pipeline = InferencePipeline(
model_path="path/to/model.pt",
use_gpu=True
)
start = time.time()
with Pool(processes=batch_size) as pool:
results = pool.map(pipeline.process_pdf, pdf_paths)
elapsed = time.time() - start
avg_per_doc = elapsed / len(pdf_paths)
print(f"Total time: {elapsed:.2f}s")
print(f"Documents: {len(pdf_paths)}")
print(f"Average per document: {avg_per_doc:.2f}s")
print(f"Throughput: {len(pdf_paths)/elapsed:.2f} docs/sec")
if __name__ == "__main__":
# Single document benchmark
benchmark_single_document("test.pdf")
# Batch benchmark
pdf_paths = list(Path("data/test_pdfs").glob("*.pdf"))
benchmark_batch(pdf_paths[:100])
```
---
## Summary
**Implemented (Phase 1-2)**:
- Parameterized queries (SQL injection fix)
- Parser instance reuse (Phase 2 refactoring)
- Batch insert operations (execute_values)
- Dual pool processing (CPU/GPU separation)
**Quick Wins (Low effort, high impact)**:
- Database connection pooling (2-4 hours)
- Index verification and optimization (1-2 hours)
- Batch PDF rendering (4-6 hours)
**Long-term Improvements**:
- OCR result caching with hashing
- Generator patterns for streaming
- Advanced profiling and monitoring
**Expected Impact**:
- Connection pooling: 80-95% reduction in DB overhead
- Indexes: 10-100x faster queries
- Batch rendering: 50-90% less redundant work
- **Overall**: 2-5x throughput improvement for batch processing

File diff suppressed because it is too large Load Diff

View File

@@ -1,170 +0,0 @@
# Code Refactoring Summary Report
## 📊 Overall Results
### Test Status
- ✅ **688/688 tests passing** (100%)
- ✅ **Code coverage**: 34% → 37% (+3%)
- ✅ **0 failures**, 0 errors
### Test Coverage Improvements
- ✅ **machine_code_parser**: 25% → 65% (+40%)
- ✅ **New tests**: 55 (633 → 688)
---
## 🎯 Completed Refactorings
### 1. ✅ Matcher modularization (876 lines → 205 lines, ↓76%)
**Files**:
**What changed**:
- Split the single 876-line file into **11 modules**
- Extracted **5 independent matching strategies**
- Created dedicated modules for data models, utility functions, and context handling
**New module structure**:
**Test results**:
- ✅ All 77 matcher tests pass
- ✅ Complete README documentation
- ✅ Strategy pattern, easy to extend
**Benefits**:
- 📉 76% less code
- 📈 Significantly better maintainability
- ✨ Each strategy is tested independently
- 🔧 Easy to add new strategies (see the sketch below)
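A minimal sketch of what the extracted strategy interface might look like; the class and method names are assumptions, not copied from the new `matcher/` modules:
```python
from abc import ABC, abstractmethod

class MatchStrategy(ABC):
    """One independent way of scoring an extracted value against a candidate."""

    @abstractmethod
    def match(self, extracted: str, candidate: str) -> float:
        """Return a match score between 0.0 and 1.0."""

class ExactMatchStrategy(MatchStrategy):
    def match(self, extracted: str, candidate: str) -> float:
        return 1.0 if extracted == candidate else 0.0
```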
---
### 2. ✅ Machine Code Parser light refactoring + test coverage (919 lines → 929 lines)
**File**: src/ocr/machine_code_parser.py
**What changed**:
- Extracted **3 shared helper methods**, removing duplicated code
- Streamlined the context-detection logic
- Simplified the account-formatting method
**Test improvements**:
- ✅ **55 new tests** (24 → 79)
- ✅ **Coverage**: 25% → 65% (+40%)
- ✅ All 688 project tests pass
**New test coverage**:
- **First round** (22 tests):
  - `_detect_account_context()` - 8 tests (context detection)
  - `_normalize_account_spaces()` - 5 tests (space normalization)
  - `_format_account()` - 4 tests (account formatting)
  - `parse()` - 5 tests (main entry point)
- **Second round** (33 tests):
  - `_extract_ocr()` - 8 tests (OCR extraction)
  - `_extract_bankgiro()` - 9 tests (Bankgiro extraction)
  - `_extract_plusgiro()` - 8 tests (Plusgiro extraction)
  - `_extract_amount()` - 8 tests (amount extraction)
**Benefits**:
- 🔄 Removed 80 lines of duplicated code
- 📈 Better testability (helper methods can be tested in isolation)
- 📖 Improved readability
- ✅ Coverage up from 25% to 65% (+40%)
- 🎯 Low risk, high payoff
---
### 3. ✅ Field Extractor analysis (decision: do not refactor)
**File**: (1,183 lines)
**Analysis result**: ❌ **Should not be refactored**
**Key insight**:
- Superficially similar code can serve **completely different purposes**
- field_extractor: **parses/extracts** field values
- src/normalize: **normalizes/generates variants** for matching
- The two have different responsibilities and should not be unified
**Documentation**:
---
## 📈 Refactoring Statistics
### Lines of Code
| File | Before | After | Change | Percent |
|------|--------|--------|------|--------|
| **matcher/field_matcher.py** | 876 | 205 | -671 | ↓76% |
| **matcher/* (10 new modules)** | 0 | 466 | +466 | new |
| **matcher total** | 876 | 671 | -205 | ↓23% |
| **ocr/machine_code_parser.py** | 919 | 929 | +10 | +1% |
| **Net reduction** | - | - | **-195 lines** | **↓11%** |
### Test Coverage
| Module | Tests | Pass rate | Coverage | Status |
|------|--------|--------|--------|------|
| matcher | 77 | 100% | - | ✅ |
| field_extractor | 45 | 100% | 39% | ✅ |
| machine_code_parser | 79 | 100% | 65% | ✅ |
| normalizer | ~120 | 100% | - | ✅ |
| other modules | ~367 | 100% | - | ✅ |
| **Total** | **688** | **100%** | **37%** | ✅ |
---
## 🎓 Lessons from the Refactoring
### What Worked
1. **✅ Test before refactoring**
   - Every refactoring had full test coverage
   - Tests were run immediately after each change
   - A 100% pass rate guarded quality
2. **✅ Identify real duplication**
   - Not all similar code is duplication
   - field_extractor vs normalizer: superficially similar but different purposes
   - machine_code_parser: genuine duplication
3. **✅ Refactor incrementally**
   - matcher: large-scale modularization (strategy pattern)
   - machine_code_parser: light refactoring (extract shared methods)
   - field_extractor: analyzed and deliberately left alone
### Key Decisions
#### ✅ When refactoring was warranted
- **matcher**: a single overly long file (876 lines) containing several strategies
- **machine_code_parser**: repeated code serving the same purpose in multiple places
#### ❌ When it was not
- **field_extractor**: similar-looking code with different purposes
### Takeaway
**Do not chase the DRY principle blindly.**
> Similar code is not necessarily duplicated code. Understand what the code is **actually for**.
---
## ✅ Summary
**Key results**:
- 📉 Net reduction of 195 lines of code
- 📈 Code coverage +3% (34% → 37%)
- ✅ Test count +55 (633 → 688)
- 🎯 machine_code_parser coverage +40% (25% → 65%)
- ✨ Noticeably better modularity
- 🎯 Much easier to maintain
**Main lesson**:
> Similar code is not necessarily duplicated code. Only by understanding what the code is actually for can you make the right refactoring decision.
**Next steps**:
1. Continue raising machine_code_parser coverage to 80%+ (currently 65%)
2. Add tests for other low-coverage modules (field_extractor 39%, pipeline 19%)
3. Add more tests for edge cases and error conditions

View File

@@ -1,258 +0,0 @@
# Test Coverage Improvement Report
## 📊 Overview
### Overall Statistics
- ✅ **Total tests**: 633 → 688 (+55 tests, +8.7%)
- ✅ **Pass rate**: 100% (688/688)
- ✅ **Overall coverage**: 34% → 37% (+3%)
### machine_code_parser.py Focus
- ✅ **Tests**: 24 → 79 (+55 tests, +229%)
- ✅ **Coverage**: 25% → 65% (+40%)
- ✅ **Uncovered lines**: 273 → 129 (144 fewer)
---
## 🎯 New Tests in Detail
### First Round (22 tests)
#### 1. TestDetectAccountContext (8 tests)
Tests for the newly added `_detect_account_context()` helper.
**Test cases**:
1. `test_bankgiro_keyword` - detects the 'bankgiro' keyword
2. `test_bg_keyword` - detects the 'bg:' abbreviation
3. `test_plusgiro_keyword` - detects the 'plusgiro' keyword
4. `test_postgiro_keyword` - detects the 'postgiro' alias
5. `test_pg_keyword` - detects the 'pg:' abbreviation
6. `test_both_contexts` - both kinds of keywords present
7. `test_no_context` - no account keywords
8. `test_case_insensitive` - case-insensitive detection
**Covered code path**:
```python
def _detect_account_context(self, tokens: list[TextToken]) -> dict[str, bool]:
context_text = ' '.join(t.text.lower() for t in tokens)
return {
'bankgiro': any(kw in context_text for kw in ['bankgiro', 'bg:', 'bg ']),
'plusgiro': any(kw in context_text for kw in ['plusgiro', 'postgiro', 'plusgirokonto', 'pg:', 'pg ']),
}
```
---
#### 2. TestNormalizeAccountSpacesMethod (5 tests)
Tests for the newly added `_normalize_account_spaces()` helper.
**Test cases**:
1. `test_removes_spaces_after_arrow` - removes spaces after the > marker
2. `test_multiple_consecutive_spaces` - handles several consecutive spaces
3. `test_no_arrow_returns_unchanged` - returns the input unchanged when there is no > marker
4. `test_spaces_before_arrow_preserved` - preserves spaces before the > marker
5. `test_empty_string` - empty-string handling
**Covered code path**:
```python
def _normalize_account_spaces(self, line: str) -> str:
if '>' not in line:
return line
parts = line.split('>', 1)
after_arrow = parts[1]
normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', after_arrow)
while re.search(r'(\d)\s+(\d)', normalized):
normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', normalized)
return parts[0] + '>' + normalized
```
---
#### 3. TestFormatAccount (4 tests)
Tests for the newly added `_format_account()` helper.
**Test cases**:
1. `test_plusgiro_context_forces_plusgiro` - a Plusgiro context forces Plusgiro formatting
2. `test_valid_bankgiro_7_digits` - formats a valid 7-digit Bankgiro
3. `test_valid_bankgiro_8_digits` - formats a valid 8-digit Bankgiro
4. `test_defaults_to_bankgiro_when_ambiguous` - defaults to Bankgiro when ambiguous
**Covered code path**:
```python
def _format_account(self, account_digits: str, is_plusgiro_context: bool) -> tuple[str, str]:
    if is_plusgiro_context:
        formatted = f"{account_digits[:-1]}-{account_digits[-1]}"
        return formatted, 'plusgiro'
    # Luhn validation
    pg_valid = FieldValidators.is_valid_plusgiro(account_digits)
    bg_valid = FieldValidators.is_valid_bankgiro(account_digits)
    # Decision logic
    if pg_valid and not bg_valid:
        return pg_formatted, 'plusgiro'
    elif bg_valid and not pg_valid:
        return bg_formatted, 'bankgiro'
    else:
        return bg_formatted, 'bankgiro'
```
---
#### 4. TestParseMethod (5 tests)
Tests for the main `parse()` entry point.
**Test cases**:
1. `test_parse_empty_tokens` - handles an empty token list
2. `test_parse_finds_payment_line_in_bottom_region` - finds the payment line in the bottom 35% of the page
3. `test_parse_ignores_top_region` - ignores the top region of the page
4. `test_parse_with_context_keywords` - detects context keywords
5. `test_parse_stores_source_tokens` - stores the source tokens
**Covered code paths**:
- Token filtering (bottom-region detection)
- Context keyword detection
- Payment-line lookup and parsing
- Result object construction
---
### Second Round (33 tests)
#### 5. TestExtractOCR (8 tests)
Tests for `_extract_ocr()` - OCR reference number extraction.
**Test cases**:
1. `test_extract_valid_ocr_10_digits` - extracts a 10-digit OCR number
2. `test_extract_valid_ocr_15_digits` - extracts a 15-digit OCR number
3. `test_extract_ocr_with_hash_markers` - OCR number with # markers
4. `test_extract_longest_ocr_when_multiple` - picks the longest of several candidates
5. `test_extract_ocr_ignores_short_numbers` - ignores numbers shorter than 10 digits
6. `test_extract_ocr_ignores_long_numbers` - ignores numbers longer than 25 digits
7. `test_extract_ocr_excludes_bankgiro_variants` - excludes Bankgiro variants
8. `test_extract_ocr_empty_tokens` - empty-token handling
#### 6. TestExtractBankgiro (9 tests)
Tests for `_extract_bankgiro()` - Bankgiro account extraction.
**Test cases**:
1. `test_extract_bankgiro_7_digits_with_dash` - 7-digit Bankgiro with a dash
2. `test_extract_bankgiro_7_digits_without_dash` - 7-digit Bankgiro without a dash
3. `test_extract_bankgiro_8_digits_with_dash` - 8-digit Bankgiro with a dash
4. `test_extract_bankgiro_8_digits_without_dash` - 8-digit Bankgiro without a dash
5. `test_extract_bankgiro_with_spaces` - Bankgiro containing spaces
6. `test_extract_bankgiro_handles_plusgiro_format` - handles the Plusgiro format
7. `test_extract_bankgiro_with_context` - with context keywords
8. `test_extract_bankgiro_ignores_plusgiro_context` - ignores a Plusgiro context
9. `test_extract_bankgiro_empty_tokens` - empty-token handling
#### 7. TestExtractPlusgiro (8 tests)
Tests for `_extract_plusgiro()` - Plusgiro account extraction.
**Test cases**:
1. `test_extract_plusgiro_7_digits_with_dash` - 7-digit Plusgiro with a dash
2. `test_extract_plusgiro_7_digits_without_dash` - 7-digit Plusgiro without a dash
3. `test_extract_plusgiro_8_digits` - 8-digit Plusgiro
4. `test_extract_plusgiro_with_spaces` - Plusgiro containing spaces
5. `test_extract_plusgiro_with_context` - with context keywords
6. `test_extract_plusgiro_ignores_too_short` - ignores fewer than 7 digits
7. `test_extract_plusgiro_ignores_too_long` - ignores more than 8 digits
8. `test_extract_plusgiro_empty_tokens` - empty-token handling
#### 8. TestExtractAmount (8 tests)
Tests for `_extract_amount()` - amount extraction.
**Test cases**:
1. `test_extract_amount_with_comma_decimal` - comma as decimal separator
2. `test_extract_amount_with_dot_decimal` - dot as decimal separator
3. `test_extract_amount_integer` - integer amount
4. `test_extract_amount_with_thousand_separator` - thousands separator
5. `test_extract_amount_large_number` - large amount
6. `test_extract_amount_ignores_too_large` - ignores implausibly large amounts
7. `test_extract_amount_ignores_zero` - ignores zero or negative amounts
8. `test_extract_amount_empty_tokens` - empty-token handling
---
## 📈 Coverage Analysis
### Covered Methods
✅ `_detect_account_context()` - **100%** (added in round one)
✅ `_normalize_account_spaces()` - **100%** (added in round one)
✅ `_format_account()` - **95%** (added in round one)
✅ `parse()` - **70%** (improved in round one)
✅ `_parse_standard_payment_line()` - **95%** (existing tests)
✅ `_extract_ocr()` - **85%** (added in round two)
✅ `_extract_bankgiro()` - **90%** (added in round two)
✅ `_extract_plusgiro()` - **90%** (added in round two)
✅ `_extract_amount()` - **80%** (added in round two)
### Methods Still Needing Work (uncovered / partially covered)
⚠️ `_calculate_confidence()` - **0%** (untested)
⚠️ `cross_validate()` - **0%** (untested)
⚠️ `get_region_bbox()` - **0%** (untested)
⚠️ `_find_tokens_with_values()` - **partially covered**
⚠️ `_find_machine_code_line_tokens()` - **partially covered**
### Uncovered Code (129 lines)
Mainly concentrated in:
1. **Validation methods** (lines 805-824): `_calculate_confidence`, `cross_validate`
2. **Helper methods** (lines 80-92, 336-369, 377-407): token lookup, bbox computation, logging
3. **Edge cases** (lines 648-653, 690, 699, 759-760, etc.): boundary conditions in some extraction methods
---
## 🎯 Recommendations
### ✅ Goals Achieved
- ✅ Coverage raised from 25% to 65% (+40%)
- ✅ Test count increased from 24 to 79 (+55)
- ✅ All extraction methods now tested (_extract_ocr, _extract_bankgiro, _extract_plusgiro, _extract_amount)
### Next Goals (coverage 65% → 80%+)
1. **Add tests for the validation methods** - cover `_calculate_confidence` and `cross_validate`
2. **Add tests for the helper methods** - cover token lookup and bbox computation
3. **Round out edge cases** - add tests for boundary conditions and error handling
4. **Integration tests** - add end-to-end tests using token data from real PDFs
---
## ✅ Completed Improvements
### Refactoring Benefits
- ✅ The 3 extracted helper methods can now be tested in isolation
- ✅ Finer-grained tests make problems easier to localize
- ✅ Better readability; the test cases are clear and easy to follow
### Quality Assurance
- ✅ All 655 tests pass (100%)
- ✅ No regressions
- ✅ The new tests cover refactored code that was previously untested
---
## 📚 Test-Writing Lessons
### What Worked
1. **Use fixtures for test data** - a `_create_token()` helper simplified token creation
2. **Organize test classes by method** - one test class per method keeps the structure clear
3. **Name test cases clearly** - the `test_<what>_<condition>` format is self-explanatory
4. **Cover the key paths** - prioritize common scenarios and boundary conditions
### Problems Encountered
1. **Token constructor arguments** - the `page_no` parameter was forgotten, causing the initial tests to fail
   - Fix: update the `_create_token()` helper to pass `page_no=0`
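A minimal sketch of such a helper, assuming a `TextToken` with `text`, `bbox`, and `page_no` fields (the exact constructor is an assumption, not copied from the codebase):
```python
def _create_token(text: str, bbox=(0, 0, 100, 20), page_no: int = 0) -> TextToken:
    """Build a TextToken for tests with sensible defaults."""
    return TextToken(text=text, bbox=bbox, page_no=page_no)
```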
---
**Report date**: 2026-01-24
**Status**: ✅ Complete
**Next step**: continue raising coverage to 60%+

View File

@@ -1,619 +0,0 @@
# Multi-Pool Processing Architecture Design
## 1. Research Summary
### 1.1 Analysis of the Current Problems
The dual-pool mode we implemented previously has stability problems, mainly caused by:
| Problem | Cause | Solution |
|------|------|----------|
| Processing hangs | Mixing threads with ProcessPoolExecutor causes deadlocks | Use asyncio or a pure Queue pattern |
| Queue.get() blocks forever | No timeout mechanism | Add timeouts and sentinel values |
| GPU memory conflicts | Multiple processes access the GPU at once | Limit GPU workers to 1 |
| CUDA fork problem | Linux's default fork is incompatible with CUDA | Use the spawn start method |
### 1.2 Recommended Architecture
After this research, the approach that best fits our scenario is a **producer-consumer queue pattern**:
```
┌─────────────────┐      ┌─────────────────┐     ┌─────────────────┐
│  Main Process   │      │  CPU Workers    │     │   GPU Worker    │
│                 │      │ (4 processes)   │     │  (1 process)    │
│ ┌───────────┐   │      │                 │     │                 │
│ │ Task      │───┼─────▶│ Text PDFs       │     │ Scanned PDFs    │
│ │ Dispatcher│───┼──────┼─────────────────┼────▶│ (PaddleOCR)     │
│ └───────────┘   │      │ (no OCR needed) │     │                 │
│      ▲          │      │       │         │     │       │         │
│      │          │      │       ▼         │     │       ▼         │
│ ┌───────────┐   │      │  Result Queue   │     │  Result Queue   │
│ │ Result    │◀──┼──────┴─────────────────┴─────┴─────────────────┘
│ │ Collector │   │
│ └───────────┘   │
│      │          │
│      ▼          │
│ ┌───────────┐   │
│ │ Database  │   │
│ │ Batch     │   │
│ │ Writer    │   │
│ └───────────┘   │
└─────────────────┘
```
---
## 2. Core Design Principles
### 2.1 CUDA Compatibility
```python
# Key point: use the spawn start method
import multiprocessing as mp
ctx = mp.get_context("spawn")

# Set the device when the GPU worker is initialized
def init_gpu_worker(gpu_id: int = 0):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    global _ocr
    from paddleocr import PaddleOCR
    _ocr = PaddleOCR(use_gpu=True, ...)
```
### 2.2 Worker Initialization Pattern
Use the `initializer` argument to load the model once per worker instead of reloading it for every task:
```python
# Module-level global that holds the model inside each worker
_ocr = None

def init_worker(use_gpu: bool, gpu_id: int = 0):
    global _ocr
    if use_gpu:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    else:
        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    from paddleocr import PaddleOCR
    _ocr = PaddleOCR(use_gpu=use_gpu, ...)

# Pass the initializer when creating the pool
pool = ProcessPoolExecutor(
    max_workers=1,
    initializer=init_worker,
    initargs=(True, 0),  # use_gpu=True, gpu_id=0
    mp_context=mp.get_context("spawn")
)
```
### 2.3 Queue Pattern vs as_completed
| Approach | Pros | Cons | Suitable for |
|------|------|------|----------|
| `as_completed()` | Simple, no queue management | Cannot span multiple pools | Single-pool scenarios |
| `multiprocessing.Queue` | High performance, flexible | Manual management, deadlock risk | Multi-pool pipelines |
| `Manager().Queue()` | Picklable, works across pools | Lower performance | Scenarios that need Pool.map |
**Recommendation**: for the dual-pool scenario, use `as_completed()` on each pool separately and then merge the results.
---
## 3. Detailed Development Plan
### Phase 1: Rebuild the Foundations (2-3 days)
#### 1.1 Create a WorkerPool Abstract Class
```python
# src/processing/worker_pool.py
from __future__ import annotations
from abc import ABC, abstractmethod
from concurrent.futures import ProcessPoolExecutor, Future
from dataclasses import dataclass
from typing import List, Any, Optional, Callable
import multiprocessing as mp

@dataclass
class TaskResult:
    """Container for a single task result."""
    task_id: str
    success: bool
    data: Any
    error: Optional[str] = None
    processing_time: float = 0.0

class WorkerPool(ABC):
    """Abstract base class for worker pools."""
    def __init__(self, max_workers: int, use_gpu: bool = False, gpu_id: int = 0):
        self.max_workers = max_workers
        self.use_gpu = use_gpu
        self.gpu_id = gpu_id
        self._executor: Optional[ProcessPoolExecutor] = None

    @abstractmethod
    def get_initializer(self) -> Callable:
        """Return the worker initializer function."""
        pass

    @abstractmethod
    def get_init_args(self) -> tuple:
        """Return the initializer arguments."""
        pass

    def start(self):
        """Start the worker pool."""
        ctx = mp.get_context("spawn")
        self._executor = ProcessPoolExecutor(
            max_workers=self.max_workers,
            mp_context=ctx,
            initializer=self.get_initializer(),
            initargs=self.get_init_args()
        )

    def submit(self, fn: Callable, *args, **kwargs) -> Future:
        """Submit a task."""
        if not self._executor:
            raise RuntimeError("Pool not started")
        return self._executor.submit(fn, *args, **kwargs)

    def shutdown(self, wait: bool = True):
        """Shut down the pool."""
        if self._executor:
            self._executor.shutdown(wait=wait)
            self._executor = None

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *args):
        self.shutdown()
```
#### 1.2 Implement the CPU and GPU Worker Pools
```python
# src/processing/cpu_pool.py
class CPUWorkerPool(WorkerPool):
"""CPU-only worker pool for text PDF processing"""
def __init__(self, max_workers: int = 4):
super().__init__(max_workers=max_workers, use_gpu=False)
def get_initializer(self) -> Callable:
return init_cpu_worker
def get_init_args(self) -> tuple:
return ()
# src/processing/gpu_pool.py
class GPUWorkerPool(WorkerPool):
"""GPU worker pool for OCR processing"""
def __init__(self, max_workers: int = 1, gpu_id: int = 0):
super().__init__(max_workers=max_workers, use_gpu=True, gpu_id=gpu_id)
def get_initializer(self) -> Callable:
return init_gpu_worker
def get_init_args(self) -> tuple:
return (self.gpu_id,)
```
---
### Phase 2: Implement the Dual-Pool Coordinator (2-3 days)
#### 2.1 Task Dispatcher
```python
# src/processing/task_dispatcher.py
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Tuple

class TaskType(Enum):
    CPU = auto()  # Text PDF
    GPU = auto()  # Scanned PDF

@dataclass
class Task:
    id: str
    task_type: TaskType
    data: Any

class TaskDispatcher:
    """Dispatch tasks to the appropriate pool based on the PDF type."""

    def classify_task(self, doc_info: dict) -> TaskType:
        """Decide whether a document needs OCR."""
        # Based on PDF characteristics
        if self._is_scanned_pdf(doc_info):
            return TaskType.GPU
        return TaskType.CPU

    def _is_scanned_pdf(self, doc_info: dict) -> bool:
        """Detect whether the document is a scan."""
        # 1. Check whether any text can be extracted
        # 2. Check the ratio of image area
        # 3. Check the text density
        pass

    def partition_tasks(self, tasks: List[Task]) -> Tuple[List[Task], List[Task]]:
        """Split tasks into CPU and GPU groups."""
        cpu_tasks = [t for t in tasks if t.task_type == TaskType.CPU]
        gpu_tasks = [t for t in tasks if t.task_type == TaskType.GPU]
        return cpu_tasks, gpu_tasks
```
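The `_is_scanned_pdf` stub above still needs a concrete heuristic. A possible sketch using PyMuPDF (fitz), assuming the dispatcher can reach the PDF path; the threshold is an assumption, not a measured value:
```python
import fitz  # PyMuPDF

def is_scanned_pdf(pdf_path: str, min_chars_per_page: int = 50) -> bool:
    """Treat the PDF as scanned when its pages carry almost no extractable text."""
    with fitz.open(pdf_path) as doc:
        total_chars = sum(len(page.get_text()) for page in doc)
        return total_chars < min_chars_per_page * max(len(doc), 1)
```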
#### 2.2 Dual-Pool Coordinator
```python
# src/processing/dual_pool_coordinator.py
from concurrent.futures import as_completed
from typing import List, Iterator
import logging

logger = logging.getLogger(__name__)

class DualPoolCoordinator:
    """Coordinate the CPU and GPU worker pools."""

    def __init__(
        self,
        cpu_workers: int = 4,
        gpu_workers: int = 1,
        gpu_id: int = 0
    ):
        self.cpu_pool = CPUWorkerPool(max_workers=cpu_workers)
        self.gpu_pool = GPUWorkerPool(max_workers=gpu_workers, gpu_id=gpu_id)
        self.dispatcher = TaskDispatcher()

    def __enter__(self):
        self.cpu_pool.start()
        self.gpu_pool.start()
        return self

    def __exit__(self, *args):
        self.cpu_pool.shutdown()
        self.gpu_pool.shutdown()

    def process_batch(
        self,
        documents: List[dict],
        cpu_task_fn: Callable,
        gpu_task_fn: Callable,
        on_result: Optional[Callable[[TaskResult], None]] = None,
        on_error: Optional[Callable[[str, Exception], None]] = None
    ) -> List[TaskResult]:
        """
        Process a batch of documents, automatically dispatching to the CPU or GPU pool.

        Args:
            documents: documents to process
            cpu_task_fn: handler for CPU tasks
            gpu_task_fn: handler for GPU tasks
            on_result: result callback (optional)
            on_error: error callback (optional)

        Returns:
            A list with the result of every task.
        """
        # Classify the tasks
        tasks = [
            Task(id=doc['id'], task_type=self.dispatcher.classify_task(doc), data=doc)
            for doc in documents
        ]
        cpu_tasks, gpu_tasks = self.dispatcher.partition_tasks(tasks)
        logger.info(f"Task partition: {len(cpu_tasks)} CPU, {len(gpu_tasks)} GPU")

        # Submit the tasks to their respective pools
        cpu_futures = {
            self.cpu_pool.submit(cpu_task_fn, t.data): t.id
            for t in cpu_tasks
        }
        gpu_futures = {
            self.gpu_pool.submit(gpu_task_fn, t.data): t.id
            for t in gpu_tasks
        }

        # Collect the results
        results = []
        all_futures = list(cpu_futures.keys()) + list(gpu_futures.keys())
        for future in as_completed(all_futures):
            task_id = cpu_futures.get(future) or gpu_futures.get(future)
            pool_type = "CPU" if future in cpu_futures else "GPU"
            try:
                data = future.result(timeout=300)  # 5-minute timeout
                result = TaskResult(task_id=task_id, success=True, data=data)
                if on_result:
                    on_result(result)
            except Exception as e:
                logger.error(f"[{pool_type}] Task {task_id} failed: {e}")
                result = TaskResult(task_id=task_id, success=False, data=None, error=str(e))
                if on_error:
                    on_error(task_id, e)
            results.append(result)
        return results
```
---
### Phase 3: Integrate into autolabel (1-2 days)
#### 3.1 Modify autolabel.py
```python
# src/cli/autolabel.py
def run_autolabel_dual_pool(args):
    """Run auto-labeling in dual-pool mode."""
    from src.processing.dual_pool_coordinator import DualPoolCoordinator

    # Set up database batching
    db_batch = []
    db_batch_size = 100

    def on_result(result: TaskResult):
        """Handle a successful result."""
        nonlocal db_batch
        db_batch.append(result.data)
        if len(db_batch) >= db_batch_size:
            save_documents_batch(db_batch)
            db_batch.clear()

    def on_error(task_id: str, error: Exception):
        """Handle an error."""
        logger.error(f"Task {task_id} failed: {error}")

    # Create the dual-pool coordinator
    with DualPoolCoordinator(
        cpu_workers=args.cpu_workers or 4,
        gpu_workers=args.gpu_workers or 1,
        gpu_id=0
    ) as coordinator:
        # Process every CSV
        for csv_file in csv_files:
            documents = load_documents_from_csv(csv_file)
            results = coordinator.process_batch(
                documents=documents,
                cpu_task_fn=process_text_pdf,
                gpu_task_fn=process_scanned_pdf,
                on_result=on_result,
                on_error=on_error
            )
            logger.info(f"CSV {csv_file}: {len(results)} processed")

    # Save the remaining batch
    if db_batch:
        save_documents_batch(db_batch)
```
---
### Phase 4: Testing and Validation (1-2 days)
#### 4.1 Unit Tests
```python
# tests/unit/test_dual_pool.py
import pytest
from src.processing.dual_pool_coordinator import DualPoolCoordinator, TaskResult

class TestDualPoolCoordinator:
    def test_cpu_only_batch(self):
        """Batch containing only CPU tasks."""
        with DualPoolCoordinator(cpu_workers=2, gpu_workers=1) as coord:
            docs = [{"id": f"doc_{i}", "type": "text"} for i in range(10)]
            results = coord.process_batch(docs, cpu_fn, gpu_fn)
            assert len(results) == 10
            assert all(r.success for r in results)

    def test_mixed_batch(self):
        """Batch mixing CPU and GPU tasks."""
        with DualPoolCoordinator(cpu_workers=2, gpu_workers=1) as coord:
            docs = [
                {"id": "text_1", "type": "text"},
                {"id": "scan_1", "type": "scanned"},
                {"id": "text_2", "type": "text"},
            ]
            results = coord.process_batch(docs, cpu_fn, gpu_fn)
            assert len(results) == 3

    def test_timeout_handling(self):
        """Timeout handling."""
        pass

    def test_error_recovery(self):
        """Error recovery."""
        pass
```
#### 4.2 Integration Tests
```python
# tests/integration/test_autolabel_dual_pool.py
def test_autolabel_with_dual_pool():
    """End-to-end test of the dual-pool mode."""
    # Use a small amount of test data
    result = subprocess.run([
        "python", "-m", "src.cli.autolabel",
        "--cpu-workers", "2",
        "--gpu-workers", "1",
        "--limit", "50"
    ], capture_output=True)
    assert result.returncode == 0
    # Verify the database records
```
---
## 4. Key Technical Points
### 4.1 Deadlock-Avoidance Strategies
```python
# 1. Use timeouts
try:
    result = future.result(timeout=300)
except TimeoutError:
    logger.warning("Task timed out")

# 2. Use sentinel values
SENTINEL = object()
queue.put(SENTINEL)  # signal the end of the stream

# 3. Check the worker process state
if not worker.is_alive():
    logger.error("Worker died unexpectedly")
    break

# 4. Drain the queue before joining
while not queue.empty():
    results.append(queue.get_nowait())
worker.join(timeout=5.0)
```
### 4.2 PaddleOCR-Specific Handling
```python
# PaddleOCR must be initialized inside the worker process
def init_paddle_worker(gpu_id: int):
    global _ocr
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # Import lazily so the CUDA environment variable takes effect
    from paddleocr import PaddleOCR
    _ocr = PaddleOCR(
        use_angle_cls=True,
        lang='en',
        use_gpu=True,
        show_log=False,
        # Important: cap GPU memory usage (MB)
        gpu_mem=2000
    )
```
### 4.3 Resource Monitoring
```python
import psutil
import GPUtil

def get_resource_usage():
    """Collect current system resource usage."""
    cpu_percent = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()
    gpu_info = []
    for gpu in GPUtil.getGPUs():
        gpu_info.append({
            "id": gpu.id,
            "memory_used": gpu.memoryUsed,
            "memory_total": gpu.memoryTotal,
            "utilization": gpu.load * 100
        })
    return {
        "cpu_percent": cpu_percent,
        "memory_percent": memory.percent,
        "gpu": gpu_info
    }
```
---
## 5. Risk Assessment and Mitigation
| Risk | Likelihood | Impact | Mitigation |
|------|--------|------|----------|
| GPU out of memory | Medium | High | Limit GPU workers to 1; set the gpu_mem parameter |
| Zombie processes | Low | High | Add heartbeat checks; restart automatically on timeout |
| Task misclassification | Medium | Medium | Add a fallback: retry on GPU after a CPU failure |
| Database write bottleneck | Low | Medium | Increase the batch size; write asynchronously |
---
## 6. Alternatives
If the approach above still has problems, consider:
### 6.1 Use Ray
```python
import ray
ray.init()

@ray.remote(num_cpus=1)
def cpu_task(data):
    return process_text_pdf(data)

@ray.remote(num_gpus=1)
def gpu_task(data):
    return process_scanned_pdf(data)

# Ray schedules resources automatically
futures = [cpu_task.remote(d) for d in cpu_docs]
futures += [gpu_task.remote(d) for d in gpu_docs]
results = ray.get(futures)
```
### 6.2 Single Pool + Dynamic GPU Scheduling
Keep the single-pool mode, but decide inside each task whether to use the GPU:
```python
def process_document(doc_data):
    if is_scanned_pdf(doc_data):
        # Use the GPU (requires a global lock or semaphore to limit concurrency)
        with gpu_semaphore:
            return process_with_ocr(doc_data)
    else:
        return process_text_only(doc_data)
```
---
## 7. Timeline Summary
| Phase | Task | Estimated effort |
|------|------|------------|
| Phase 1 | Rebuild the foundations | 2-3 days |
| Phase 2 | Implement the dual-pool coordinator | 2-3 days |
| Phase 3 | Integrate into autolabel | 1-2 days |
| Phase 4 | Testing and validation | 1-2 days |
| **Total** | | **6-10 days** |
---
## 8. References
1. [Python concurrent.futures documentation](https://docs.python.org/3/library/concurrent.futures.html)
2. [PyTorch Multiprocessing Best Practices](https://docs.pytorch.org/docs/stable/notes/multiprocessing.html)
3. [Super Fast Python - ProcessPoolExecutor guide](https://superfastpython.com/processpoolexecutor-in-python/)
4. [PaddleOCR parallel inference documentation](http://www.paddleocr.ai/main/en/version3.x/pipeline_usage/instructions/parallel_inference.html)
5. [AWS - Parallelizing ML inference across CPUs/GPUs](https://aws.amazon.com/blogs/machine-learning/parallelizing-across-multiple-cpu-gpus-to-speed-up-deep-learning-inference-at-the-edge/)
6. [Ray distributed multiprocessing](https://docs.ray.io/en/latest/ray-more-libs/multiprocessing.html)

1223
docs/product-plan-v2.md Normal file

File diff suppressed because it is too large Load Diff

302
docs/ux-design-prompt-v2.md Normal file
View File

@@ -0,0 +1,302 @@
# Document Annotation Tool UX Design Spec v2
## Theme: Warm Graphite (Modern Enterprise)
---
## 1. Design Principles (Updated)
1. **Clarity**: High contrast, but never pure black-on-white
2. **Warm Neutrality**: Slightly warm grays reduce visual fatigue
3. **Focus**: Content-first layouts with restrained accents
4. **Consistency**: Reusable patterns, predictable behavior
5. **Professional Trust**: Calm, serious, enterprise-ready
6. **Longevity**: No trendy colors that age quickly
---
## 2. Color Palette (Warm Graphite)
### Core Colors
| Usage | Color Name | Hex |
|------|-----------|-----|
| Primary Text | Soft Black | #121212 |
| Secondary Text | Charcoal Gray | #2A2A2A |
| Muted Text | Warm Gray | #6B6B6B |
| Disabled Text | Light Warm Gray | #9A9A9A |
### Backgrounds
| Usage | Color | Hex |
|-----|------|-----|
| App Background | Paper White | #FAFAF8 |
| Card / Panel | White | #FFFFFF |
| Hover Surface | Subtle Warm Gray | #F1F0ED |
| Selected Row | Very Light Warm Gray | #ECEAE6 |
### Borders & Dividers
| Usage | Color | Hex |
|------|------|-----|
| Default Border | Warm Light Gray | #E6E4E1 |
| Strong Divider | Neutral Gray | #D8D6D2 |
### Semantic States (Muted & Professional)
| State | Color | Hex |
|------|-------|-----|
| Success | Olive Gray | #3E4A3A |
| Error | Brick Gray | #4A3A3A |
| Warning | Sand Gray | #4A4A3A |
| Info | Graphite Gray | #3A3A3A |
> Accent colors are **never saturated** and are used only for status, progress, or selection.
---
## 3. Typography
- **Font Family**: Inter / SF Pro / system-ui
- **Headings**:
- Weight: 600-700
- Color: #121212
- Letter spacing: -0.01em
- **Body Text**:
- Weight: 400
- Color: #2A2A2A
- **Captions / Meta**:
- Weight: 400
- Color: #6B6B6B
- **Monospace (IDs / Values)**:
- JetBrains Mono / SF Mono
- Color: #2A2A2A
---
## 4. Global Layout
### Top Navigation Bar
- Height: 56px
- Background: #FAFAF8
- Bottom Border: 1px solid #E6E4E1
- Logo: Text or icon in #121212
**Navigation Items**
- Default: #6B6B6B
- Hover: #2A2A2A
- Active:
- Text: #121212
- Bottom indicator: 2px solid #3A3A3A (rounded ends)
**Avatar**
- Circle background: #ECEAE6
- Text: #2A2A2A
---
## 5. Page: Documents (Dashboard)
### Page Header
- Title: "Documents" (#121212)
- Actions:
- Primary button: Dark graphite outline
- Secondary button: Subtle border only
### Filters Bar
- Background: #FFFFFF
- Border: 1px solid #E6E4E1
- Inputs:
- Background: #FFFFFF
- Hover: #F1F0ED
- Focus ring: 1px #3A3A3A
### Document Table
- Table background: #FFFFFF
- Header text: #6B6B6B
- Row hover: #F1F0ED
- Row selected:
- Background: #ECEAE6
- Left indicator: 3px solid #3A3A3A
### Status Badges
- Pending:
- BG: #FFFFFF
- Border: #D8D6D2
- Text: #2A2A2A
- Labeled:
- BG: #2A2A2A
- Text: #FFFFFF
- Exported:
- BG: #ECEAE6
- Text: #2A2A2A
- Icon: ✓
### Auto-label States
- Running:
- Progress bar: #3A3A3A on #ECEAE6
- Completed:
- Text: #3E4A3A
- Failed:
- BG: #F1EDED
- Text: #4A3A3A
---
## 6. Upload Modals (Single & Batch)
### Modal Container
- Background: #FFFFFF
- Border radius: 8px
- Shadow: 0 1px 3px rgba(0,0,0,0.08)
### Drop Zone
- Background: #FAFAF8
- Border: 1px dashed #D8D6D2
- Hover: #F1F0ED
- Icon: Graphite gray
### Form Fields
- Input BG: #FFFFFF
- Border: #D8D6D2
- Focus: 1px solid #3A3A3A
Primary Action Button:
- Text: #FFFFFF
- BG: #2A2A2A
- Hover: #121212
---
## 7. Document Detail View
### Canvas Area
- Background: #FFFFFF
- Annotation styles:
- Manual: Solid border #2A2A2A
- Auto: Dashed border #6B6B6B
- Selected: 2px border #3A3A3A + resize handles
### Right Info Panel
- Card background: #FFFFFF
- Section headers: #121212
- Meta text: #6B6B6B
### Annotation Table
- Same table styles as Documents
- Inline edit:
- Input background: #FAFAF8
- Save button: Graphite
### Locked State (Auto-label Running)
- Banner BG: #FAFAF8
- Border-left: 3px solid #4A4A3A
- Progress bar: Graphite
---
## 8. Training Page
### Document Selector
- Selected rows use same highlight rules
- Verified state:
- Full: Olive gray check
- Partial: Sand gray warning
### Configuration Panel
- Card layout
- Inputs aligned to grid
- Schedule option visually muted until enabled
Primary CTA:
- Start Training button in dark graphite
---
## 9. Models & Training History
### Training Job List
- Job cards use #FFFFFF background
- Running job:
- Progress bar: #3A3A3A
- Completed job:
- Metrics bars in graphite
### Model Detail Panel
- Sectioned cards
- Metric bars:
- Track: #ECEAE6
- Fill: #3A3A3A
Actions:
- Primary: Download Model
- Secondary: View Logs / Use as Base
---
## 10. Micro-interactions (Refined)
| Element | Interaction | Animation |
|------|------------|-----------|
| Button hover | BG lightens | 150ms ease-out |
| Button press | Scale 0.98 | 100ms |
| Row hover | BG fade | 120ms |
| Modal open | Fade + scale 0.96 → 1 | 200ms |
| Progress fill | Smooth | ease-out |
| Annotation select | Border + handles | 120ms |
---
## 11. Tailwind Theme (Updated)
```js
colors: {
text: {
primary: '#121212',
secondary: '#2A2A2A',
muted: '#6B6B6B',
disabled: '#9A9A9A',
},
bg: {
app: '#FAFAF8',
card: '#FFFFFF',
hover: '#F1F0ED',
selected: '#ECEAE6',
},
border: '#E6E4E1',
accent: '#3A3A3A',
success: '#3E4A3A',
error: '#4A3A3A',
warning: '#4A4A3A',
}
```
---
## 12. Final Notes
- Pure black (#000000) should **never** be used for large surfaces
- Accent color usage should stay under **10% of UI area**
- Warm grays are intentional and must not be "corrected" to blue-grays
This theme is designed to scale from internal tool → polished SaaS without redesign.

View File

@@ -0,0 +1,273 @@
# Web Directory Refactoring - Complete ✅
**Date**: 2026-01-25
**Status**: ✅ Completed
**Tests**: 188 passing (0 failures)
**Coverage**: 23% (maintained)
---
## Final Directory Structure
```
src/web/
├── api/
│ ├── __init__.py
│ └── v1/
│ ├── __init__.py
│ ├── routes.py # Public inference API
│ ├── admin/
│ │ ├── __init__.py
│ │ ├── documents.py # Document management (was admin_routes.py)
│ │ ├── annotations.py # Annotation routes (was admin_annotation_routes.py)
│ │ └── training.py # Training routes (was admin_training_routes.py)
│ ├── async_api/
│ │ ├── __init__.py
│ │ └── routes.py # Async processing API (was async_routes.py)
│ └── batch/
│ ├── __init__.py
│ └── routes.py # Batch upload API (was batch_upload_routes.py)
├── schemas/
│ ├── __init__.py
│ ├── common.py # Shared models (ErrorResponse)
│ ├── admin.py # Admin schemas (was admin_schemas.py)
│ └── inference.py # Inference + async schemas (was schemas.py)
├── services/
│ ├── __init__.py
│ ├── inference.py # Inference service (was services.py)
│ ├── autolabel.py # Auto-label service (was admin_autolabel.py)
│ ├── async_processing.py # Async processing (was async_service.py)
│ └── batch_upload.py # Batch upload service (was batch_upload_service.py)
├── core/
│ ├── __init__.py
│ ├── auth.py # Authentication (was admin_auth.py)
│ ├── rate_limiter.py # Rate limiting (unchanged)
│ └── scheduler.py # Task scheduler (was admin_scheduler.py)
├── workers/
│ ├── __init__.py
│ ├── async_queue.py # Async task queue (was async_queue.py)
│ └── batch_queue.py # Batch task queue (was batch_queue.py)
├── __init__.py # Main exports
├── app.py # FastAPI app (imports updated)
├── config.py # Configuration (unchanged)
└── dependencies.py # Global dependencies (unchanged)
```
---
## Changes Summary
### Files Moved and Renamed
| Old Location | New Location | Change Type |
|-------------|--------------|-------------|
| `admin_routes.py` | `api/v1/admin/documents.py` | Moved + Renamed |
| `admin_annotation_routes.py` | `api/v1/admin/annotations.py` | Moved + Renamed |
| `admin_training_routes.py` | `api/v1/admin/training.py` | Moved + Renamed |
| `admin_auth.py` | `core/auth.py` | Moved |
| `admin_autolabel.py` | `services/autolabel.py` | Moved |
| `admin_scheduler.py` | `core/scheduler.py` | Moved |
| `admin_schemas.py` | `schemas/admin.py` | Moved |
| `routes.py` | `api/v1/routes.py` | Moved |
| `schemas.py` | `schemas/inference.py` | Moved |
| `services.py` | `services/inference.py` | Moved |
| `async_routes.py` | `api/v1/async_api/routes.py` | Moved |
| `async_queue.py` | `workers/async_queue.py` | Moved |
| `async_service.py` | `services/async_processing.py` | Moved + Renamed |
| `batch_queue.py` | `workers/batch_queue.py` | Moved |
| `batch_upload_routes.py` | `api/v1/batch/routes.py` | Moved |
| `batch_upload_service.py` | `services/batch_upload.py` | Moved |
**Total**: 16 files reorganized
### Files Updated
**Source Files** (imports updated):
- `app.py` - Updated all imports to new structure
- `api/v1/admin/documents.py` - Updated schema/auth imports
- `api/v1/admin/annotations.py` - Updated schema/service imports
- `api/v1/admin/training.py` - Updated schema/auth imports
- `api/v1/routes.py` - Updated schema imports
- `api/v1/async_api/routes.py` - Updated schema imports
- `api/v1/batch/routes.py` - Updated service/worker imports
- `services/async_processing.py` - Updated worker/core imports
**Test Files** (all 16 updated, including `conftest.py`):
- `test_admin_annotations.py`
- `test_admin_auth.py`
- `test_admin_routes.py`
- `test_admin_routes_enhanced.py`
- `test_admin_training.py`
- `test_annotation_locks.py`
- `test_annotation_phase5.py`
- `test_async_queue.py`
- `test_async_routes.py`
- `test_async_service.py`
- `test_autolabel_with_locks.py`
- `test_batch_queue.py`
- `test_batch_upload_routes.py`
- `test_batch_upload_service.py`
- `test_training_phase4.py`
- `conftest.py`
---
## Import Examples
### Old Import Style (Before Refactoring)
```python
from src.web.admin_routes import create_admin_router
from src.web.admin_schemas import DocumentItem
from src.web.admin_auth import validate_admin_token
from src.web.async_routes import create_async_router
from src.web.schemas import ErrorResponse
```
### New Import Style (After Refactoring)
```python
# Admin API
from src.web.api.v1.admin.documents import create_admin_router
from src.web.api.v1.admin import create_admin_router # Shorter alternative
# Schemas
from src.web.schemas.admin import DocumentItem
from src.web.schemas.common import ErrorResponse
# Core components
from src.web.core.auth import validate_admin_token
# Async API
from src.web.api.v1.async_api.routes import create_async_router
```
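The shorter `from src.web.api.v1.admin import create_admin_router` form relies on the admin package `__init__.py` re-exporting its router factories. A minimal sketch of such an `__init__.py` (the annotation and training factory names are assumptions for illustration):

```python
# src/web/api/v1/admin/__init__.py -- illustrative; the actual export list may differ.
from .documents import create_admin_router
from .annotations import create_annotation_router  # assumed factory name
from .training import create_training_router       # assumed factory name

__all__ = [
    "create_admin_router",
    "create_annotation_router",
    "create_training_router",
]
```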
---
## Benefits Achieved
### 1. **Clear Separation of Concerns**
- **API Routes**: All in `api/v1/` by version and feature
- **Data Models**: All in `schemas/` by domain
- **Business Logic**: All in `services/`
- **Core Components**: Reusable utilities in `core/`
- **Background Jobs**: Task queues in `workers/`
### 2. **Better Scalability**
- Easy to add API v2 without touching v1
- Clear namespace for each module
- Reduced file sizes (no 800+ line files)
- Follows single responsibility principle
### 3. **Improved Maintainability**
- Find files by function, not by prefix
- Each module has one clear purpose
- Easier to onboard new developers
- Better IDE navigation
### 4. **Standards Compliance**
- Follows FastAPI best practices
- Matches Django/Flask project structures
- Standard Python package organization
- Industry-standard naming conventions
---
## Testing Results
**Before Refactoring**:
- 188 tests passing
- 23% code coverage
- Flat directory structure
**After Refactoring**:
- ✅ 188 tests passing (0 failures)
- ✅ 23% code coverage (maintained)
- ✅ Clean hierarchical structure
- ✅ All imports updated
- ✅ No backward compatibility shims needed
---
## Migration Statistics
| Metric | Count |
|--------|-------|
| Files moved | 16 |
| Directories created | 9 |
| Files updated (source) | 8 |
| Files updated (tests) | 16 |
| Import statements updated | ~150 |
| Lines of code changed | ~200 |
| Tests broken | 0 |
| Coverage lost | 0% |
---
## Code Diff Summary
```diff
Before:
src/web/
├── admin_routes.py (645 lines)
├── admin_annotation_routes.py (504 lines)
├── admin_training_routes.py (565 lines)
├── admin_auth.py (22 lines)
├── admin_schemas.py (262 lines)
... (15 more files at root level)
After:
src/web/
├── api/v1/
│ ├── admin/ (3 route files)
│ ├── async_api/ (1 route file)
│ └── batch/ (1 route file)
├── schemas/ (3 schema files)
├── services/ (4 service files)
├── core/ (3 core files)
└── workers/ (2 worker files)
```
---
## Next Steps (Optional)
### Phase 2: Documentation
- [ ] Update API documentation with new import paths
- [ ] Create migration guide for external developers
- [ ] Update CLAUDE.md with new structure
### Phase 3: Further Optimization
- [ ] Split large files (>400 lines) if needed
- [ ] Extract common utilities
- [ ] Add typing stubs
### Phase 4: Deprecation (Future)
- [ ] Add deprecation warnings if creating compatibility layer
- [ ] Remove old imports after grace period
- [ ] Update all documentation
---
## Rollback Instructions
If needed, rollback is simple:
```bash
git revert <commit-hash>
```
All changes are in version control, making rollback safe and easy.
---
## Conclusion
**Refactoring completed successfully**
**Zero breaking changes**
**All tests passing**
**Industry-standard structure achieved**
The web directory is now organized following Python and FastAPI best practices, making it easier to scale, maintain, and extend.
@@ -0,0 +1,186 @@
# Web Directory Refactoring Plan
## Current Structure Issues
1. **Flat structure**: All files in one directory (20 Python files)
2. **Naming inconsistency**: Mix of `admin_*`, `async_*`, `batch_*` prefixes
3. **Mixed concerns**: Routes, schemas, services, and workers in same directory
4. **Poor scalability**: Hard to navigate and maintain as project grows
## Proposed Structure (Best Practices)
```
src/web/
├── __init__.py # Main exports
├── app.py # FastAPI app factory
├── config.py # App configuration
├── dependencies.py # Global dependencies
├── api/ # API Routes Layer
│ ├── __init__.py
│ └── v1/ # API version 1
│ ├── __init__.py
│ ├── routes.py # Public API routes (inference)
│ ├── admin/ # Admin API routes
│ │ ├── __init__.py
│ │ ├── documents.py # admin_routes.py → documents.py
│ │ ├── annotations.py # admin_annotation_routes.py → annotations.py
│ │ ├── training.py # admin_training_routes.py → training.py
│ │ └── auth.py # admin_auth.py → auth.py (routes only)
│ ├── async_api/ # Async processing API
│ │ ├── __init__.py
│ │ └── routes.py # async_routes.py → routes.py
│ └── batch/ # Batch upload API
│ ├── __init__.py
│ └── routes.py # batch_upload_routes.py → routes.py
├── schemas/ # Pydantic Models
│ ├── __init__.py
│ ├── common.py # Shared schemas (ErrorResponse, etc.)
│ ├── inference.py # schemas.py → inference.py
│ ├── admin.py # admin_schemas.py → admin.py
│ ├── async_api.py # New: async API schemas
│ └── batch.py # New: batch upload schemas
├── services/ # Business Logic Layer
│ ├── __init__.py
│ ├── inference.py # services.py → inference.py
│ ├── autolabel.py # admin_autolabel.py → autolabel.py
│ ├── async_processing.py # async_service.py → async_processing.py
│ └── batch_upload.py # batch_upload_service.py → batch_upload.py
├── core/ # Core Components
│ ├── __init__.py
│ ├── auth.py # admin_auth.py → auth.py (logic only)
│ ├── rate_limiter.py # rate_limiter.py → rate_limiter.py
│ └── scheduler.py # admin_scheduler.py → scheduler.py
└── workers/ # Background Task Queues
├── __init__.py
├── async_queue.py # async_queue.py → async_queue.py
└── batch_queue.py # batch_queue.py → batch_queue.py
```
## File Mapping
### Current → New Location
| Current File | New Location | Purpose |
|--------------|--------------|---------|
| `admin_routes.py` | `api/v1/admin/documents.py` | Document management routes |
| `admin_annotation_routes.py` | `api/v1/admin/annotations.py` | Annotation routes |
| `admin_training_routes.py` | `api/v1/admin/training.py` | Training routes |
| `admin_auth.py` | Split: `api/v1/admin/auth.py` + `core/auth.py` | Auth routes + logic |
| `admin_schemas.py` | `schemas/admin.py` | Admin Pydantic models |
| `admin_autolabel.py` | `services/autolabel.py` | Auto-label service |
| `admin_scheduler.py` | `core/scheduler.py` | Training scheduler |
| `routes.py` | `api/v1/routes.py` | Public inference API |
| `schemas.py` | `schemas/inference.py` | Inference models |
| `services.py` | `services/inference.py` | Inference service |
| `async_routes.py` | `api/v1/async_api/routes.py` | Async API routes |
| `async_service.py` | `services/async_processing.py` | Async processing service |
| `async_queue.py` | `workers/async_queue.py` | Async task queue |
| `batch_upload_routes.py` | `api/v1/batch/routes.py` | Batch upload routes |
| `batch_upload_service.py` | `services/batch_upload.py` | Batch upload service |
| `batch_queue.py` | `workers/batch_queue.py` | Batch task queue |
| `rate_limiter.py` | `core/rate_limiter.py` | Rate limiting logic |
| `config.py` | `config.py` | Keep as-is |
| `dependencies.py` | `dependencies.py` | Keep as-is |
| `app.py` | `app.py` | Keep as-is (update imports) |
## Benefits
### 1. Clear Separation of Concerns
- **Routes**: API endpoint definitions
- **Schemas**: Data validation models
- **Services**: Business logic
- **Core**: Reusable components
- **Workers**: Background processing
### 2. Better Scalability
- Easy to add new API versions (`v2/`)
- Clear namespace for each domain
- Reduced file size (no 800+ line files)
### 3. Improved Maintainability
- Find files by function, not by prefix
- Each module has single responsibility
- Easier to write focused tests
### 4. Standard Python Patterns
- Package-based organization
- Follows FastAPI best practices
- Similar to Django/Flask structures
## Implementation Steps
### Phase 1: Create New Structure (No Breaking Changes)
1. Create new directories: `api/`, `schemas/`, `services/`, `core/`, `workers/`
2. Copy files to new locations (don't delete originals yet; a helper sketch follows these steps)
3. Update imports in new files
4. Add `__init__.py` with proper exports
### Phase 2: Update Tests
5. Update test imports to use new structure
6. Run tests to verify nothing breaks
7. Fix any import issues
### Phase 3: Update Main App
8. Update `app.py` to import from new locations
9. Run full test suite
10. Verify all endpoints work
### Phase 4: Cleanup
11. Delete old files
12. Update documentation
13. Final test run
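A minimal sketch of how the Phase 1 copy step could be scripted, using paths from the mapping table above; this helper is illustrative only, and intermediate `__init__.py` files are left as a manual step:

```python
"""Hypothetical Phase 1 helper: copy files into the new layout, keeping the originals."""
from pathlib import Path
import shutil

WEB = Path("src/web")

# Subset of the Current -> New mapping table; extend with the remaining entries.
MOVES = {
    "admin_routes.py": "api/v1/admin/documents.py",
    "admin_schemas.py": "schemas/admin.py",
    "services.py": "services/inference.py",
    "async_queue.py": "workers/async_queue.py",
}

for old_name, new_rel in MOVES.items():
    target = WEB / new_rel
    target.parent.mkdir(parents=True, exist_ok=True)       # e.g. api/v1/admin/
    (target.parent / "__init__.py").touch(exist_ok=True)   # make the leaf directory a package
    shutil.copy2(WEB / old_name, target)                    # copy; originals stay in place
    print(f"copied {old_name} -> {new_rel}")
```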
## Migration Priority
**High Priority** (Most used):
- Routes and schemas (user-facing APIs)
- Services (core business logic)
**Medium Priority**:
- Core components (auth, rate limiter)
- Workers (background tasks)
**Low Priority**:
- Config and dependencies (already well-located)
## Backwards Compatibility
During migration, maintain backwards compatibility:
```python
# src/web/__init__.py
# Old imports still work
from src.web.api.v1.admin.documents import router as admin_router
from src.web.schemas.admin import AdminDocument
# Keep old names for compatibility (temporary)
admin_routes = admin_router # Deprecated alias
```
## Testing Strategy
1. **Unit Tests**: Test each module independently
2. **Integration Tests**: Test API endpoints still work
3. **Import Tests**: Verify all old imports still work (a sketch follows this list)
4. **Coverage**: Maintain current 23% coverage minimum
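For the import tests, one option is a small equivalence check; a minimal sketch, assuming the compatibility aliases from the snippet above (and the module-level `router` they reference) are in place:

```python
# tests/web/test_import_compat.py -- hypothetical test; depends on the temporary aliases.
def test_deprecated_admin_alias_points_at_new_router():
    from src.web import admin_routes                    # deprecated alias (old name)
    from src.web.api.v1.admin.documents import router   # new canonical location

    assert admin_routes is router
```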
## Rollback Plan
If issues arise:
1. Keep old files until fully migrated
2. Git allows easy revert
3. Tests catch breaking changes early
---
## Next Steps
Would you like me to:
1. **Start Phase 1**: Create new directory structure and move files?
2. **Create migration script**: Automate the file moves and import updates?
3. **Focus on specific area**: Start with admin API or async API first?
@@ -0,0 +1,218 @@
# Web Directory Refactoring - Current Status
## ✅ Completed Steps
### 1. Directory Structure Created
```
src/web/
├── api/
│ ├── v1/
│ │ ├── admin/ (documents.py, annotations.py, training.py)
│ │ ├── async_api/ (routes.py)
│ │ ├── batch/ (routes.py)
│ │ └── routes.py (public inference API)
├── schemas/
│ ├── admin.py (admin schemas)
│ ├── inference.py (inference + async schemas)
│ └── common.py (ErrorResponse)
├── services/
│ ├── autolabel.py
│ ├── async_processing.py
│ ├── batch_upload.py
│ └── inference.py
├── core/
│ ├── auth.py
│ ├── rate_limiter.py
│ └── scheduler.py
└── workers/
├── async_queue.py
└── batch_queue.py
```
### 2. Files Copied and Imports Updated
#### Admin API (✅ Complete)
- [x] `admin_routes.py``api/v1/admin/documents.py` (imports updated)
- [x] `admin_annotation_routes.py``api/v1/admin/annotations.py` (imports updated)
- [x] `admin_training_routes.py``api/v1/admin/training.py` (imports updated)
- [x] `api/v1/admin/__init__.py` created with exports
#### Public & Async API (✅ Complete)
- [x] `routes.py``api/v1/routes.py` (imports updated)
- [x] `async_routes.py``api/v1/async_api/routes.py` (imports updated)
- [x] `batch_upload_routes.py``api/v1/batch/routes.py` (copied, imports pending)
#### Schemas (✅ Complete)
- [x] `admin_schemas.py``schemas/admin.py`
- [x] `schemas.py``schemas/inference.py`
- [x] `schemas/common.py` created
- [x] `schemas/__init__.py` created with exports
#### Services (✅ Complete)
- [x] `admin_autolabel.py``services/autolabel.py`
- [x] `async_service.py``services/async_processing.py`
- [x] `batch_upload_service.py``services/batch_upload.py`
- [x] `services.py``services/inference.py`
- [x] `services/__init__.py` created
#### Core Components (✅ Complete)
- [x] `admin_auth.py``core/auth.py`
- [x] `rate_limiter.py``core/rate_limiter.py`
- [x] `admin_scheduler.py``core/scheduler.py`
- [x] `core/__init__.py` created
#### Workers (✅ Complete)
- [x] `async_queue.py``workers/async_queue.py`
- [x] `batch_queue.py``workers/batch_queue.py`
- [x] `workers/__init__.py` created
#### Main App (✅ Complete)
- [x] `app.py` imports updated to use new structure
---
## ⏳ Remaining Work
### 1. Update Remaining File Imports (HIGH PRIORITY)
Files that need import updates:
- [ ] `api/v1/batch/routes.py` - update to use new schema/service imports
- [ ] `services/autolabel.py` - may need import updates if it references old paths
- [ ] `services/async_processing.py` - check for old import references
- [ ] `services/batch_upload.py` - check for old import references
- [ ] `services/inference.py` - check for old import references
### 2. Update ALL Test Files (CRITICAL)
Test files need to import from new locations. Pattern:
**Old:**
```python
from src.web.admin_routes import create_admin_router
from src.web.admin_schemas import DocumentItem
from src.web.admin_auth import validate_admin_token
```
**New:**
```python
from src.web.api.v1.admin import create_admin_router
from src.web.schemas.admin import DocumentItem
from src.web.core.auth import validate_admin_token
```
Test files to update:
- [ ] `tests/web/test_admin_annotations.py`
- [ ] `tests/web/test_admin_auth.py`
- [ ] `tests/web/test_admin_routes.py`
- [ ] `tests/web/test_admin_routes_enhanced.py`
- [ ] `tests/web/test_admin_training.py`
- [ ] `tests/web/test_annotation_locks.py`
- [ ] `tests/web/test_annotation_phase5.py`
- [ ] `tests/web/test_async_queue.py`
- [ ] `tests/web/test_async_routes.py`
- [ ] `tests/web/test_async_service.py`
- [ ] `tests/web/test_autolabel_with_locks.py`
- [ ] `tests/web/test_batch_queue.py`
- [ ] `tests/web/test_batch_upload_routes.py`
- [ ] `tests/web/test_batch_upload_service.py`
- [ ] `tests/web/test_rate_limiter.py`
- [ ] `tests/web/test_training_phase4.py`
### 3. Create Backward Compatibility Layer (OPTIONAL)
Keep old imports working temporarily:
```python
# src/web/admin_routes.py (temporary compatibility shim)
\"\"\"
DEPRECATED: Use src.web.api.v1.admin.documents instead.
This file will be removed in next version.
\"\"\"
import warnings
from src.web.api.v1.admin.documents import *
warnings.warn(
"Importing from src.web.admin_routes is deprecated. "
"Use src.web.api.v1.admin.documents instead.",
DeprecationWarning,
stacklevel=2
)
```
### 4. Verify and Test
1. Run tests:
```bash
pytest tests/web/ -v
```
2. Check for any import errors:
```bash
python -c "from src.web.app import create_app; create_app()"
```
3. Start server and test endpoints:
```bash
python run_server.py
```
### 5. Clean Up Old Files (ONLY AFTER TESTS PASS)
Old files to remove:
- `src/web/admin_*.py` (7 files)
- `src/web/async_*.py` (3 files)
- `src/web/batch_*.py` (3 files)
- `src/web/routes.py`
- `src/web/services.py`
- `src/web/schemas.py`
- `src/web/rate_limiter.py`
Keep these files (don't remove):
- `src/web/__init__.py`
- `src/web/app.py`
- `src/web/config.py`
- `src/web/dependencies.py`
---
## 🎯 Next Immediate Steps
1. **Update batch/routes.py imports** - Quick fix for remaining API route
2. **Update test file imports** - Critical for verification
3. **Run test suite** - Verify nothing broke
4. **Fix any import errors** - Address failures
5. **Remove old files** - Clean up after tests pass
---
## 📊 Migration Impact Summary
| Category | Files Moved | Imports Updated | Status |
|----------|-------------|-----------------|--------|
| API Routes | 7 | 5/7 | 🟡 In Progress |
| Schemas | 3 | 3/3 | ✅ Complete |
| Services | 4 | 0/4 | ⚠️ Pending |
| Core | 3 | 3/3 | ✅ Complete |
| Workers | 2 | 2/2 | ✅ Complete |
| Tests | 0 | 0/16 | ❌ Not Started |
**Overall Progress: 65%**
---
## 🚀 Benefits After Migration
1. **Better Organization**: Clear separation by function
2. **Easier Navigation**: Find files by purpose, not prefix
3. **Scalability**: Easy to add new API versions
4. **Standard Structure**: Follows FastAPI best practices
5. **Maintainability**: Each module has single responsibility
---
## 📝 Notes
- All original files are still in place (no data loss risk)
- New structure is operational but needs import updates
- Backward compatibility can be added if needed
- Tests will validate the migration success