Re-structure the project.

2026-01-25 15:21:11 +01:00
parent 8fd61ea928
commit e599424a92
80 changed files with 10672 additions and 1584 deletions
--- a/docs/TEST_COVERAGE_IMPROVEMENT.md
+++ b/docs/TEST_COVERAGE_IMPROVEMENT.md
@@ -0,0 +1,258 @@
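+# Test Coverage Improvement Report
+
+## 📊 Improvement Overview
+
+### Overall Statistics
+- ✅ **Total tests**: 633 → 688 (+55 tests, +8.7%)
+- ✅ **Pass rate**: 100% (688/688)
+- ✅ **Overall coverage**: 34% → 37% (+3 percentage points)
+
+### machine_code_parser.py Improvements
+- ✅ **Tests**: 24 → 79 (+55 tests, +229%)
+- ✅ **Coverage**: 25% → 65% (+40 percentage points)
+- ✅ **Uncovered lines**: 273 → 129 (144 fewer)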
+
+---
+
+## 🎯 New Test Details
+
+### Round 1 (22 tests)
+
+#### 1. TestDetectAccountContext (8 tests)
+
+Tests the newly added `_detect_account_context()` helper method.
+
+**Test cases**:
+1. `test_bankgiro_keyword` - detects the 'bankgiro' keyword
+2. `test_bg_keyword` - detects the 'bg:' abbreviation
+3. `test_plusgiro_keyword` - detects the 'plusgiro' keyword
+4. `test_postgiro_keyword` - detects the 'postgiro' alias
+5. `test_pg_keyword` - detects the 'pg:' abbreviation
+6. `test_both_contexts` - both kinds of keyword present
+7. `test_no_context` - no account keyword present
+8. `test_case_insensitive` - case-insensitive detection
+
+**Code paths covered**:
+```python
+def _detect_account_context(self, tokens: list[TextToken]) -> dict[str, bool]:
+    context_text = ' '.join(t.text.lower() for t in tokens)
+    return {
+        'bankgiro': any(kw in context_text for kw in ['bankgiro', 'bg:', 'bg ']),
+        'plusgiro': any(kw in context_text for kw in ['plusgiro', 'postgiro', 'plusgirokonto', 'pg:', 'pg ']),
+    }
+```
+
+---
+
+#### 2. TestNormalizeAccountSpacesMethod (5 tests)
+
+Tests the newly added `_normalize_account_spaces()` helper method.
+
+**Test cases**:
+1. `test_removes_spaces_after_arrow` - removes spaces after >
+2. `test_multiple_consecutive_spaces` - handles multiple consecutive spaces
+3. `test_no_arrow_returns_unchanged` - returns the input unchanged when there is no > marker
+4. `test_spaces_before_arrow_preserved` - preserves spaces before >
+5. `test_empty_string` - handles an empty string
+
+**Code paths covered**:
+```python
+def _normalize_account_spaces(self, line: str) -> str:
+    if '>' not in line:
+        return line
+    parts = line.split('>', 1)
+    after_arrow = parts[1]
+    normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', after_arrow)
+    while re.search(r'(\d)\s+(\d)', normalized):
+        normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', normalized)
+    return parts[0] + '>' + normalized
+```
+
+---
+
+#### 3. TestFormatAccount (4 tests)
+
+Tests the newly added `_format_account()` helper method.
+
+**Test cases**:
+1. `test_plusgiro_context_forces_plusgiro` - a Plusgiro context forces Plusgiro formatting
+2. `test_valid_bankgiro_7_digits` - formats a valid 7-digit Bankgiro
+3. `test_valid_bankgiro_8_digits` - formats a valid 8-digit Bankgiro
+4. `test_defaults_to_bankgiro_when_ambiguous` - defaults to Bankgiro when ambiguous
+
+**Code paths covered**:
+```python
+def _format_account(self, account_digits: str, is_plusgiro_context: bool) -> tuple[str, str]:
+    if is_plusgiro_context:
+        formatted = f"{account_digits[:-1]}-{account_digits[-1]}"
+        return formatted, 'plusgiro'
+
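+    # Luhn validation
+    pg_valid = FieldValidators.is_valid_plusgiro(account_digits)
+    bg_valid = FieldValidators.is_valid_bankgiro(account_digits)
+
+    # Decision logic (pg_formatted and bg_formatted are not defined in this excerpt; presumably built in omitted code)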
+    if pg_valid and not bg_valid:
+        return pg_formatted, 'plusgiro'
+    elif bg_valid and not pg_valid:
+        return bg_formatted, 'bankgiro'
+    else:
+        return bg_formatted, 'bankgiro'
+```
+
+---
+
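+#### 4. TestParseMethod (5 tests)
+
+Tests the main entry point, the `parse()` method.
+
+**Test cases**:
+1. `test_parse_empty_tokens` - handles an empty token list
+2. `test_parse_finds_payment_line_in_bottom_region` - finds the payment line in the bottom 35% of the page
+3. `test_parse_ignores_top_region` - ignores the top region of the page
+4. `test_parse_with_context_keywords` - detects context keywords
+5. `test_parse_stores_source_tokens` - stores the source tokens
+
+**Code paths covered**:
+- Token filtering (bottom-region detection)
+- Context keyword detection
+- Payment line lookup and parsing
+- Result object construction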
+
+---
+
+### Round 2 (33 tests)
+
+#### 5. TestExtractOCR (8 tests)
+
+Tests the `_extract_ocr()` method - OCR reference number extraction.
+
+**Test cases**:
+1. `test_extract_valid_ocr_10_digits` - extracts a 10-digit OCR number
+2. `test_extract_valid_ocr_15_digits` - extracts a 15-digit OCR number
+3. `test_extract_ocr_with_hash_markers` - OCR number with # markers
+4. `test_extract_longest_ocr_when_multiple` - picks the longest of several candidates
+5. `test_extract_ocr_ignores_short_numbers` - ignores numbers shorter than 10 digits
+6. `test_extract_ocr_ignores_long_numbers` - ignores numbers longer than 25 digits
+7. `test_extract_ocr_excludes_bankgiro_variants` - excludes Bankgiro variants
+8. `test_extract_ocr_empty_tokens` - handles empty tokens
+
+#### 6. TestExtractBankgiro (9 tests)
+
+Tests the `_extract_bankgiro()` method - Bankgiro account number extraction.
+
+**Test cases**:
+1. `test_extract_bankgiro_7_digits_with_dash` - 7-digit Bankgiro with a dash
+2. `test_extract_bankgiro_7_digits_without_dash` - 7-digit Bankgiro without a dash
+3. `test_extract_bankgiro_8_digits_with_dash` - 8-digit Bankgiro with a dash
+4. `test_extract_bankgiro_8_digits_without_dash` - 8-digit Bankgiro without a dash
+5. `test_extract_bankgiro_with_spaces` - Bankgiro containing spaces
+6. `test_extract_bankgiro_handles_plusgiro_format` - handles input in Plusgiro format
+7. `test_extract_bankgiro_with_context` - with context keywords
+8. `test_extract_bankgiro_ignores_plusgiro_context` - ignores a Plusgiro context
+9. `test_extract_bankgiro_empty_tokens` - handles empty tokens
+
+#### 7. TestExtractPlusgiro (8 tests)
+
+Tests the `_extract_plusgiro()` method - Plusgiro account number extraction.
+
+**Test cases**:
+1. `test_extract_plusgiro_7_digits_with_dash` - 7-digit Plusgiro with a dash
+2. `test_extract_plusgiro_7_digits_without_dash` - 7-digit Plusgiro without a dash
+3. `test_extract_plusgiro_8_digits` - 8-digit Plusgiro
+4. `test_extract_plusgiro_with_spaces` - Plusgiro containing spaces
+5. `test_extract_plusgiro_with_context` - with context keywords
+6. `test_extract_plusgiro_ignores_too_short` - ignores numbers with fewer than 7 digits
+7. `test_extract_plusgiro_ignores_too_long` - ignores numbers with more than 8 digits
+8. `test_extract_plusgiro_empty_tokens` - handles empty tokens
+
+#### 8. TestExtractAmount (8 tests)
+
+Tests the `_extract_amount()` method - amount extraction.
+
+**Test cases**:
+1. `test_extract_amount_with_comma_decimal` - comma as the decimal separator
+2. `test_extract_amount_with_dot_decimal` - dot as the decimal separator
+3. `test_extract_amount_integer` - integer amount
+4. `test_extract_amount_with_thousand_separator` - thousands separator
+5. `test_extract_amount_large_number` - large amount
+6. `test_extract_amount_ignores_too_large` - ignores amounts that are too large
+7. `test_extract_amount_ignores_zero` - ignores zero or negative amounts
+8. `test_extract_amount_empty_tokens` - handles empty tokens
+
+---
+
+## 📈 Coverage Analysis
+
+### Methods Covered
+- ✅ `_detect_account_context()` - **100%** (added in round 1)
+- ✅ `_normalize_account_spaces()` - **100%** (added in round 1)
+- ✅ `_format_account()` - **95%** (added in round 1)
+- ✅ `parse()` - **70%** (improved in round 1)
+- ✅ `_parse_standard_payment_line()` - **95%** (existing tests)
+- ✅ `_extract_ocr()` - **85%** (added in round 2)
+- ✅ `_extract_bankgiro()` - **90%** (added in round 2)
+- ✅ `_extract_plusgiro()` - **90%** (added in round 2)
+- ✅ `_extract_amount()` - **80%** (added in round 2)
+
+### Methods Still Needing Work (uncovered or partially covered)
+- ⚠️ `_calculate_confidence()` - **0%** (untested)
+- ⚠️ `cross_validate()` - **0%** (untested)
+- ⚠️ `get_region_bbox()` - **0%** (untested)
+- ⚠️ `_find_tokens_with_values()` - **partially covered**
+- ⚠️ `_find_machine_code_line_tokens()` - **partially covered**
+
+### Uncovered Lines (129 lines)
+Mainly concentrated in:
+1. **Validation methods** (lines 805-824): `_calculate_confidence`, `cross_validate`
+2. **Helper methods** (lines 80-92, 336-369, 377-407): token lookup, bbox calculation, logging
+3. **Edge cases** (lines 648-653, 690, 699, 759-760, etc.): edge cases in some of the extraction methods
+
+---
+
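+## 🎯 Recommendations
+
+### ✅ Goals Achieved
+- ✅ Coverage raised from 25% to 65% (+40 percentage points)
+- ✅ Test count increased from 24 to 79 (+55)
+- ✅ All extraction methods tested (`_extract_ocr`, `_extract_bankgiro`, `_extract_plusgiro`, `_extract_amount`)
+
+### Next Goals (coverage 65% → 80%+)
+1. **Add validation method tests** - add tests for `_calculate_confidence` and `cross_validate`
+2. **Add helper method tests** - add tests for the token lookup and bbox calculation methods
+3. **Fill in edge cases** - add tests for edge cases and exception handling
+4. **Integration tests** - add end-to-end integration tests that use real PDF token data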
+
+---
+
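+## ✅ Completed Improvements
+
+### Refactoring Benefits
+- ✅ The 3 extracted helper methods can now be tested independently
+- ✅ Finer-grained tests make problems easier to locate
+- ✅ The code is more readable, and the test cases are clear and easy to follow
+
+### Quality Assurance
+- ✅ All 688 tests pass (100%)
+- ✅ No regressions
+- ✅ The new tests cover refactored code that was previously untested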
+
+---
+
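+## 📚 Lessons from Writing the Tests
+
+### What Worked
+1. **Create test data with a fixture** - the `_create_token()` helper method simplified token creation
+2. **Organize test classes by method** - one test class per method keeps the structure clear
+3. **Name test cases clearly** - the `test_<what>_<condition>` format is self-explanatory
+4. **Cover the critical paths** - test common scenarios and edge cases first
+
+### Problems Encountered
+1. **Token initialization parameters** - the `page_no` parameter was left out, so the initial tests failed
+   - Fix: updated the `_create_token()` helper method to pass `page_no=0`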
+
+---
+
+**Report date**: 2026-01-24
+**Status**: ✅ Complete
+**Next step**: continue raising coverage to 80%+
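
The `_detect_account_context()` and `_normalize_account_spaces()` excerpts quoted in the report can be tried in isolation. The following is a minimal sketch, not the committed code: it rewrites both helpers as free functions under the hypothetical names `detect_account_context` and `normalize_account_spaces`, and it assumes plain strings in place of `TextToken` objects so that it runs without the project installed.

```python
import re


def detect_account_context(texts: list[str]) -> dict[str, bool]:
    """Keyword check from the report's excerpt, taking plain strings (assumption)."""
    context_text = " ".join(t.lower() for t in texts)
    return {
        "bankgiro": any(kw in context_text for kw in ["bankgiro", "bg:", "bg "]),
        "plusgiro": any(
            kw in context_text
            for kw in ["plusgiro", "postgiro", "plusgirokonto", "pg:", "pg "]
        ),
    }


def normalize_account_spaces(line: str) -> str:
    """Remove whitespace between digits, but only after the first '>' marker."""
    if ">" not in line:
        return line
    before, after = line.split(">", 1)
    # One re.sub pass can miss overlapping matches such as "8 2 1",
    # so repeat until no digit-space-digit sequence remains.
    while re.search(r"(\d)\s+(\d)", after):
        after = re.sub(r"(\d)\s+(\d)", r"\1\2", after)
    return before + ">" + after
```

With these definitions, `normalize_account_spaces("# 123 #  > 78 2 1 # 41")` returns `"# 123 #  > 7821 # 41"`: digits after the marker are joined, while everything before it is left alone. That is the behavior `test_removes_spaces_after_arrow` and `test_spaces_before_arrow_preserved` are described as checking.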