Re-structure the project.

2026-01-25 15:21:11 +01:00
parent 8fd61ea928
commit e599424a92
80 changed files with 10672 additions and 1584 deletions
--- a/docs/MACHINE_CODE_PARSER_ANALYSIS.md
+++ b/docs/MACHINE_CODE_PARSER_ANALYSIS.md
@@ -0,0 +1,238 @@
+# Machine Code Parser Analysis Report
+
+## File Overview
+
+- **File**: `src/ocr/machine_code_parser.py`
+- **Total lines**: 919
+- **Code lines**: 607 (66%)
+- **Methods**: 14
+- **Regex usages**: 47
+
+## Code Structure
+
+### Class Structure
+
+```
+MachineCodeResult (dataclass)
+├── to_dict()
+└── get_region_bbox()
+
+MachineCodeParser (main parser)
+├── __init__()
+├── parse() - main entry point
+├── _find_tokens_with_values()
+├── _find_machine_code_line_tokens()
+├── _parse_standard_payment_line_with_tokens()
+├── _parse_standard_payment_line() - 142 lines ⚠️
+├── _extract_ocr() - 50 lines
+├── _extract_bankgiro() - 58 lines
+├── _extract_plusgiro() - 30 lines
+├── _extract_amount() - 68 lines
+├── _calculate_confidence()
+└── cross_validate()
+```
+
+## Issues Found
+
+### 1. ⚠️ `_parse_standard_payment_line` is too long (142 lines)
+
+**Location**: lines 442-582
+
+**Problems**:
+- Contains the nested functions `normalize_account_spaces` and `format_account`
+- Multiple regex-matching branches
+- Complex logic that is hard to test and maintain
+
+**Suggestion**:
+It could be split into separate methods:
+- `_normalize_account_spaces(line)`
+- `_format_account(account_digits, context)`
+- `_match_primary_pattern(line)`
+- `_match_fallback_patterns(line)`
+
+### 2. 🔁 The four `_extract_*` methods share a repeated pattern
+
+All of the extract methods follow the same pattern:
+
+```python
+def _extract_XXX(self, tokens):
+    candidates = []
+
+    for token in tokens:
+        text = token.text.strip()
+        matches = self.XXX_PATTERN.findall(text)
+        for match in matches:
+            # validation logic
+            # context detection
+            candidates.append((normalized, context_score, token))
+
+    if not candidates:
+        return None
+
+    candidates.sort(key=lambda x: (x[1], 1), reverse=True)
+    return candidates[0][0]
+```
+
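+**Repeated logic**:
+- Token iteration
+- Pattern matching
+- Candidate collection
+- Context scoring
+- Sorting and selecting the best match
+
+**Suggestion**:
+A base extractor class or a generic method could be extracted to reduce the duplication.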
+
+### 3. ✅ Duplicated context detection
+
+The context-detection code is repeated in several places:
+
+```python
+# in _extract_bankgiro
+context_text = ' '.join(t.text.lower() for t in tokens)
+is_bankgiro_context = (
+    'bankgiro' in context_text or
+    'bg:' in context_text or
+    'bg ' in context_text
+)
+
+# in _extract_plusgiro
+context_text = ' '.join(t.text.lower() for t in tokens)
+is_plusgiro_context = (
+    'plusgiro' in context_text or
+    'postgiro' in context_text or
+    'pg:' in context_text or
+    'pg ' in context_text
+)
+
+# in _parse_standard_payment_line
+context = (context_line or raw_line).lower()
+is_plusgiro_context = (
+    ('plusgiro' in context or 'postgiro' in context or 'plusgirokonto' in context)
+    and 'bankgiro' not in context
+)
+```
+
+**Suggestion**:
+Extract it into a standalone method:
+- `_detect_account_context(tokens) -> dict[str, bool]`
+
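+## Refactoring Options
+
+### Option A: Light refactoring (recommended) ✅
+
+**Goal**: Extract the duplicated context-detection logic without changing the main structure
+
+**Steps**:
+1. Extract a `_detect_account_context(tokens)` method
+2. Extract `_normalize_account_spaces(line)` into a standalone method
+3. Extract `_format_account(digits, context)` into a standalone method
+
+**Impact**:
+- Removes ~50-80 lines of duplicated code
+- Improves testability
+- Low risk, easy to verify
+
+**Expected result**: 919 lines → ~850 lines (↓7%)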
+
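+### Option B: Moderate refactoring
+
+**Goal**: Build a generic field-extraction framework
+
+**Steps**:
+1. Create `_generic_extract(pattern, normalizer, context_checker)`
+2. Refactor all `_extract_*` methods to use the generic framework
+3. Split `_parse_standard_payment_line` into several smaller methods
+
+**Impact**:
+- Removes ~150-200 lines of code
+- Significantly improves maintainability
+- Medium risk, requires thorough testing
+
+**Expected result**: 919 lines → ~720 lines (↓22%)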
+
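+### Option C: Deep refactoring (not recommended)
+
+**Goal**: Redesign the parser entirely around the strategy pattern
+
+**Risks**:
+- High risk; may introduce bugs
+- Requires extensive testing
+- May break existing integrations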
+
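+## Recommended Option
+
+### ✅ Adopt Option A (light refactoring)
+
+**Rationale**:
+1. **The code already works well**: no obvious bugs or performance problems
+2. **Low risk**: only duplicated logic is extracted; the core algorithm is unchanged
+3. **Good cost-benefit ratio**: small changes yield a clear improvement in code quality
+4. **Easy to verify**: the existing tests should cover it
+
+### Refactoring Steps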
+
+```python
+# 1. Extract context detection
+def _detect_account_context(self, tokens: list[TextToken]) -> dict[str, bool]:
+    """Detect account-type keywords in the context."""
+    context_text = ' '.join(t.text.lower() for t in tokens)
+
+    return {
+        'bankgiro': any(kw in context_text for kw in ['bankgiro', 'bg:', 'bg ']),
+        'plusgiro': any(kw in context_text for kw in ['plusgiro', 'postgiro', 'plusgirokonto', 'pg:', 'pg ']),
+    }
+
+# 2. Extract whitespace normalization
+def _normalize_account_spaces(self, line: str) -> str:
+    """Remove spaces within account numbers."""
+    # (existing code from lines 460-481)
+
+# 3. Extract account formatting
+def _format_account(
+    self,
+    account_digits: str,
+    is_plusgiro_context: bool
+) -> tuple[str, str]:
+    """Format the account and determine its type."""
+    # (existing code from lines 485-523)
+```
+
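+## Comparison: field_extractor vs machine_code_parser
+
+| Aspect | field_extractor | machine_code_parser |
+|------|-----------------|---------------------|
+| Purpose | Value extraction | Machine-code parsing |
+| Duplicated code | ~400 lines of normalize methods | ~80 lines of context detection |
+| Refactoring value | ❌ Different purposes; should not be unified | ✅ Shared logic can be extracted |
+| Risk | High (would break functionality) | Low (code organization only) |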
+
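+## Decision
+
+### ✅ Refactoring machine_code_parser.py is recommended
+
+**How it differs from field_extractor**:
+- field_extractor: the duplicated methods serve **different purposes** (extraction vs. variant generation)
+- machine_code_parser: the duplicated code serves **the same purpose** (all of it is context detection)
+
+**Expected benefits**:
+- Removes ~70 lines of duplicated code
+- Improves testability (context detection can be tested in isolation)
+- Clearer code organization
+- **Low risk**, easy to verify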
+
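+## Next Steps
+
+1. ✅ Back up the original file
+2. ✅ Extract the `_detect_account_context` method
+3. ✅ Extract the `_normalize_account_spaces` method
+4. ✅ Extract the `_format_account` method
+5. ✅ Update all call sites
+6. ✅ Run the tests to verify
+7. ✅ Check code coverage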
+
+---
+
+**Status**: 📋 Analysis complete; light refactoring recommended
+**Risk assessment**: 🟢 Low risk
+**Expected benefit**: 919 lines → ~850 lines (↓7%)
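
Reviewer sketch for step 1 of the analysis above (extracting the context detection). This is a minimal standalone version of the proposed `_detect_account_context`, not the project's actual code. The `TextToken` dataclass is a hypothetical stand-in that carries only `.text`, and the module-level keyword tuples and the `is_plusgiro_only` helper are illustrative names. The keyword lists themselves are taken from the analysis.

```python
from dataclasses import dataclass

# Keyword lists copied from the analysis; every entry is matched as a plain
# substring of the lower-cased, space-joined token text.
BANKGIRO_KEYWORDS = ("bankgiro", "bg:", "bg ")
PLUSGIRO_KEYWORDS = ("plusgiro", "postgiro", "plusgirokonto", "pg:", "pg ")


@dataclass
class TextToken:
    """Hypothetical stand-in for the project's token type; only .text is used."""
    text: str


def detect_account_context(tokens: list[TextToken]) -> dict[str, bool]:
    """Report which account-type keywords occur anywhere in the token texts."""
    context_text = " ".join(t.text.lower() for t in tokens)
    return {
        "bankgiro": any(kw in context_text for kw in BANKGIRO_KEYWORDS),
        "plusgiro": any(kw in context_text for kw in PLUSGIRO_KEYWORDS),
    }


def is_plusgiro_only(context: dict[str, bool]) -> bool:
    """Assumed helper: Plusgiro keywords present and no Bankgiro keyword."""
    return context["plusgiro"] and not context["bankgiro"]


if __name__ == "__main__":
    tokens = [TextToken("Bankgiro"), TextToken("5050-1055")]
    print(detect_account_context(tokens))
```

One point to check before the call sites are updated: the existing condition in `_parse_standard_payment_line` treats a line as Plusgiro only when `'bankgiro'` is absent, so a caller replacing that condition would likely need something like `is_plusgiro_only` rather than the raw `plusgiro` flag. The helper here only approximates that rule, because it works on the merged keyword lists, which are slightly broader than the original condition.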