Yaojia Wang e599424a92 Re-structure the project.

2026-01-25 15:21:11 +01:00

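Test Coverage Improvement Report

📊 Improvement Overview

Overall Statistics

✅ Total tests: 633 → 688 (+55 tests, +8.7%)
✅ Pass rate: 100% (688/688)
✅ Overall coverage: 34% → 37% (+3 percentage points)

machine_code_parser.py Focused Improvements

✅ Tests: 24 → 79 (+55 tests, +229%)
✅ Coverage: 25% → 65% (+40 percentage points)
✅ Uncovered lines: 273 → 129 (144 fewer)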

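🎯 New Test Details

Round 1 (22 tests)

1. TestDetectAccountContext (8 tests)

Tests the newly added _detect_account_context() helper method.

Test cases:

test_bankgiro_keyword - detects the 'bankgiro' keyword
test_bg_keyword - detects the 'bg:' abbreviation
test_plusgiro_keyword - detects the 'plusgiro' keyword
test_postgiro_keyword - detects the 'postgiro' alias
test_pg_keyword - detects the 'pg:' abbreviation
test_both_contexts - both kinds of keyword are present
test_no_context - no account keyword is present
test_case_insensitive - detection is case-insensitive

Code paths covered: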

def _detect_account_context(self, tokens: list[TextToken]) -> dict[str, bool]:
    context_text = ' '.join(t.text.lower() for t in tokens)
    return {
        'bankgiro': any(kw in context_text for kw in ['bankgiro', 'bg:', 'bg ']),
        'plusgiro': any(kw in context_text for kw in ['plusgiro', 'postgiro', 'plusgirokonto', 'pg:', 'pg ']),
    }
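
A few of the test cases listed above could look roughly like this. This is a minimal sketch, not the project's actual test file: the TextToken dataclass here is a hypothetical stand-in with only a text field, and detect_account_context is a free-function copy of the excerpt so that the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class TextToken:
    """Hypothetical stand-in; the real TextToken carries more fields."""
    text: str


def detect_account_context(tokens: list[TextToken]) -> dict[str, bool]:
    # Free-function copy of the excerpt above, so the sketch is self-contained.
    context_text = ' '.join(t.text.lower() for t in tokens)
    return {
        'bankgiro': any(kw in context_text for kw in ['bankgiro', 'bg:', 'bg ']),
        'plusgiro': any(kw in context_text for kw in ['plusgiro', 'postgiro', 'plusgirokonto', 'pg:', 'pg ']),
    }


def test_bankgiro_keyword():
    result = detect_account_context([TextToken('Bankgiro'), TextToken('123-4567')])
    assert result == {'bankgiro': True, 'plusgiro': False}


def test_both_contexts():
    # Upper-case input also exercises the case-insensitive path.
    result = detect_account_context([TextToken('BG:'), TextToken('PG:')])
    assert result == {'bankgiro': True, 'plusgiro': True}


def test_no_context():
    result = detect_account_context([TextToken('Faktura')])
    assert result == {'bankgiro': False, 'plusgiro': False}
```

Because the keywords are matched as plain substrings of the joined text, 'bg ' and 'pg ' only match when another token follows the abbreviation.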

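2. TestNormalizeAccountSpacesMethod (5 tests)

Tests the newly added _normalize_account_spaces() helper method.

Test cases:

test_removes_spaces_after_arrow - removes spaces after >
test_multiple_consecutive_spaces - handles multiple consecutive spaces
test_no_arrow_returns_unchanged - returns the input unchanged when there is no > marker
test_spaces_before_arrow_preserved - preserves spaces before >
test_empty_string - handles the empty string

Code paths covered: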

def _normalize_account_spaces(self, line: str) -> str:
    if '>' not in line:
        return line
    parts = line.split('>', 1)
    after_arrow = parts[1]
    normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', after_arrow)
    while re.search(r'(\d)\s+(\d)', normalized):
        normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', normalized)
    return parts[0] + '>' + normalized
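
The behaviour these tests pin down can be reproduced with a standalone copy of the method. The sketch below assumes nothing beyond the excerpt; the function name without the leading underscore and the sample lines are illustrative only. It also shows why the loop is needed: a single re.sub pass consumes the digit on the right of each gap, so adjacent gaps are only closed on the next pass.

```python
import re


def normalize_account_spaces(line: str) -> str:
    # Free-function copy of the excerpt above (the redundant first pass is folded into the loop).
    if '>' not in line:
        return line
    parts = line.split('>', 1)
    normalized = parts[1]
    # One pass turns "1 2 3" into "12 3"; repeat until no digit-space-digit gap is left.
    while re.search(r'(\d)\s+(\d)', normalized):
        normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', normalized)
    return parts[0] + '>' + normalized


# Spaces after the arrow are removed, spaces before it are preserved.
print(normalize_account_spaces('12 34 > 56   78'))
# A line without an arrow is returned unchanged.
print(normalize_account_spaces('12 34'))
```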

3. TestFormatAccount (4 tests)

Tests the newly added _format_account() helper method.

Test cases:

test_plusgiro_context_forces_plusgiro - a Plusgiro context forces Plusgiro formatting
test_valid_bankgiro_7_digits - formats a valid 7-digit Bankgiro
test_valid_bankgiro_8_digits - formats a valid 8-digit Bankgiro
test_defaults_to_bankgiro_when_ambiguous - defaults to Bankgiro when the result is ambiguous

Code paths covered:

def _format_account(self, account_digits: str, is_plusgiro_context: bool) -> tuple[str, str]:
    if is_plusgiro_context:
        formatted = f"{account_digits[:-1]}-{account_digits[-1]}"
        return formatted, 'plusgiro'

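    # Luhn validation (pg_formatted / bg_formatted are built in code not shown in this excerpt)
    pg_valid = FieldValidators.is_valid_plusgiro(account_digits)
    bg_valid = FieldValidators.is_valid_bankgiro(account_digits)

    # Decision logic: prefer the type that validates; fall back to Bankgiro when ambiguous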
    if pg_valid and not bg_valid:
        return pg_formatted, 'plusgiro'
    elif bg_valid and not pg_valid:
        return bg_formatted, 'bankgiro'
    else:
        return bg_formatted, 'bankgiro'
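
The excerpt's comment says the validators rely on a Luhn check. For readers unfamiliar with it, here is a minimal sketch of the standard Luhn (mod 10) algorithm. Whether FieldValidators implements it exactly this way is an assumption; luhn_is_valid is a hypothetical name that does not appear in the project.

```python
def luhn_is_valid(digits: str) -> bool:
    """Standard Luhn (mod 10) check over a string of decimal digits."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk right to left; double every second digit, starting with the one
    # immediately left of the check digit, and subtract 9 from two-digit results.
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


# The widely used textbook example number passes; changing its last digit fails.
print(luhn_is_valid('79927398713'), luhn_is_valid('79927398710'))
```

A 7- or 8-digit string can pass the checksum under both account types, which is why the decision logic above needs an explicit default for the ambiguous case.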

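4. TestParseMethod (5 tests)

Tests the main entry point, the parse() method.

Test cases:

test_parse_empty_tokens - handles an empty token list
test_parse_finds_payment_line_in_bottom_region - finds the payment line in the bottom 35% of the page
test_parse_ignores_top_region - ignores the top region of the page
test_parse_with_context_keywords - detects context keywords
test_parse_stores_source_tokens - stores the source tokens

Code paths covered:

Token filtering (bottom-region detection)
Context keyword detection
Payment line lookup and parsing
Result object construction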
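
The bottom-region filter mentioned above can be illustrated as follows. This is a sketch under stated assumptions: the Token dataclass, its y_top field, the page_height parameter, and a coordinate system in which y grows downward are all hypothetical; only the 35% figure comes from the test name.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical token with only the fields this sketch needs."""
    text: str
    y_top: float  # distance from the top edge of the page


def bottom_region_tokens(tokens: list[Token], page_height: float, fraction: float = 0.35) -> list[Token]:
    # Keep tokens whose top edge lies in the bottom `fraction` of the page.
    threshold = page_height * (1.0 - fraction)
    return [t for t in tokens if t.y_top >= threshold]


tokens = [Token('Faktura', 100.0), Token('# 1234567890 #', 700.0)]
print([t.text for t in bottom_region_tokens(tokens, page_height=1000.0)])
```

With this layout, test_parse_ignores_top_region corresponds to the first token being dropped and test_parse_finds_payment_line_in_bottom_region to the second being kept.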

Round 2 (33 tests)

5. TestExtractOCR (8 tests)

Tests the _extract_ocr() method, which extracts the OCR reference number.

Test cases:

test_extract_valid_ocr_10_digits - extracts a 10-digit OCR number
test_extract_valid_ocr_15_digits - extracts a 15-digit OCR number
test_extract_ocr_with_hash_markers - OCR number surrounded by # markers
test_extract_longest_ocr_when_multiple - picks the longest candidate when there are several
test_extract_ocr_ignores_short_numbers - ignores numbers shorter than 10 digits
test_extract_ocr_ignores_long_numbers - ignores numbers longer than 25 digits
test_extract_ocr_excludes_bankgiro_variants - excludes Bankgiro variants
test_extract_ocr_empty_tokens - handles an empty token list
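
The length rules and the longest-candidate rule from these test names can be sketched with a single regular expression. This is an illustration, not the body of _extract_ocr(): pick_ocr and OCR_CANDIDATE are hypothetical names, the function works on plain text rather than tokens, and the Bankgiro-variant exclusion is left out.

```python
import re
from typing import Optional

# A run of 10 to 25 digits that is not part of a longer digit run.
OCR_CANDIDATE = re.compile(r'(?<!\d)\d{10,25}(?!\d)')


def pick_ocr(text: str) -> Optional[str]:
    candidates = OCR_CANDIDATE.findall(text)
    # Prefer the longest candidate; return None when nothing qualifies.
    return max(candidates, key=len) if candidates else None


print(pick_ocr('# 1234567890 # 123456789012345 #'))
```

The lookarounds matter: without them, a 26-digit run would still yield a 25-digit partial match instead of being rejected.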

6. TestExtractBankgiro (9 tests)

Tests the _extract_bankgiro() method, which extracts the Bankgiro account number.

Test cases:

test_extract_bankgiro_7_digits_with_dash - 7-digit Bankgiro with a dash
test_extract_bankgiro_7_digits_without_dash - 7-digit Bankgiro without a dash
test_extract_bankgiro_8_digits_with_dash - 8-digit Bankgiro with a dash
test_extract_bankgiro_8_digits_without_dash - 8-digit Bankgiro without a dash
test_extract_bankgiro_with_spaces - Bankgiro containing spaces
test_extract_bankgiro_handles_plusgiro_format - handles input in Plusgiro format
test_extract_bankgiro_with_context - with a context keyword present
test_extract_bankgiro_ignores_plusgiro_context - ignores a Plusgiro context
test_extract_bankgiro_empty_tokens - handles an empty token list
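
The 7- and 8-digit cases above map onto the usual Bankgiro display forms NNN-NNNN and NNNN-NNNN, where the dash always precedes the last four digits. A minimal formatting sketch follows; format_bankgiro is a hypothetical helper, not a function from the project, and it performs no checksum validation.

```python
def format_bankgiro(digits: str) -> str:
    """Format 7 or 8 bare digits as NNN-NNNN or NNNN-NNNN."""
    if not (digits.isascii() and digits.isdigit()) or len(digits) not in (7, 8):
        raise ValueError(f"expected 7 or 8 digits, got {digits!r}")
    # The dash always sits before the last four digits.
    return f"{digits[:-4]}-{digits[-4:]}"


print(format_bankgiro('1234567'), format_bankgiro('12345678'))
```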

7. TestExtractPlusgiro (8 tests)

Tests the _extract_plusgiro() method, which extracts the Plusgiro account number.

Test cases:

test_extract_plusgiro_7_digits_with_dash - 7-digit Plusgiro with a dash
test_extract_plusgiro_7_digits_without_dash - 7-digit Plusgiro without a dash
test_extract_plusgiro_8_digits - 8-digit Plusgiro
test_extract_plusgiro_with_spaces - Plusgiro containing spaces
test_extract_plusgiro_with_context - with a context keyword present
test_extract_plusgiro_ignores_too_short - ignores numbers with fewer than 7 digits
test_extract_plusgiro_ignores_too_long - ignores numbers with more than 8 digits
test_extract_plusgiro_empty_tokens - handles an empty token list
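
The length bounds and separator handling these tests describe can be sketched as a small normalizer. normalize_plusgiro is a hypothetical name; the 7-to-8-digit window comes from the test names above, and the output form (dash before the final digit) follows the formatting shown in the _format_account() excerpt.

```python
import re
from typing import Optional


def normalize_plusgiro(raw: str) -> Optional[str]:
    # Strip the separators the tests mention: spaces and dashes.
    digits = re.sub(r'[\s-]', '', raw)
    if not (digits.isascii() and digits.isdigit()) or not 7 <= len(digits) <= 8:
        return None
    # Same shape as the excerpt: everything but the last digit, a dash, the last digit.
    return f"{digits[:-1]}-{digits[-1]}"


print(normalize_plusgiro('123456-7'), normalize_plusgiro('1234 5678'), normalize_plusgiro('123456'))
```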

8. TestExtractAmount (8 tests)

Tests the _extract_amount() method, which extracts the payment amount.

Test cases:

test_extract_amount_with_comma_decimal - comma as the decimal separator
test_extract_amount_with_dot_decimal - dot as the decimal separator
test_extract_amount_integer - integer amount
test_extract_amount_with_thousand_separator - thousands separator
test_extract_amount_large_number - large amount
test_extract_amount_ignores_too_large - ignores amounts that are too large
test_extract_amount_ignores_zero - ignores zero or negative amounts
test_extract_amount_empty_tokens - handles an empty token list
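
The separator and range rules in these test names can be illustrated with a small parser. This is a sketch, not the body of _extract_amount(): parse_amount is a hypothetical name, MAX_AMOUNT is an assumed bound (the report does not state the real limit), and only whitespace is treated as a thousands separator.

```python
import re
from decimal import Decimal
from typing import Optional

MAX_AMOUNT = Decimal('10000000')  # assumed upper bound, not taken from the project


def parse_amount(raw: str) -> Optional[Decimal]:
    # Drop whitespace used as a thousands separator ("1 234,56").
    compact = re.sub(r'\s', '', raw)
    # Accept an integer part with an optional 1-2 digit fraction after ',' or '.'.
    if not re.fullmatch(r'\d+([.,]\d{1,2})?', compact):
        return None
    amount = Decimal(compact.replace(',', '.'))
    # Reject zero and anything above the assumed bound; a leading minus sign
    # already fails the pattern above.
    if amount <= 0 or amount > MAX_AMOUNT:
        return None
    return amount


print(parse_amount('1 234,56'), parse_amount('1234.56'), parse_amount('0,00'))
```

Decimal is used instead of float so that amounts such as 1234.56 compare exactly in assertions.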

📈 Coverage Analysis

Methods Covered

✅ _detect_account_context() - 100% (added in round 1)
✅ _normalize_account_spaces() - 100% (added in round 1)
✅ _format_account() - 95% (added in round 1)
✅ parse() - 70% (improved in round 1)
✅ _parse_standard_payment_line() - 95% (existing tests)
✅ _extract_ocr() - 85% (added in round 2)
✅ _extract_bankgiro() - 90% (added in round 2)
✅ _extract_plusgiro() - 90% (added in round 2)
✅ _extract_amount() - 80% (added in round 2)

Methods Still Needing Work (uncovered or partially covered)

⚠️ _calculate_confidence() - 0% (untested)
⚠️ cross_validate() - 0% (untested)
⚠️ get_region_bbox() - 0% (untested)
⚠️ _find_tokens_with_values() - partially covered
⚠️ _find_machine_code_line_tokens() - partially covered

Uncovered Lines (129 lines)

Mainly concentrated in:

Validation methods (lines 805-824): _calculate_confidence, cross_validate
Helper methods (lines 80-92, 336-369, 377-407): token lookup, bbox calculation, logging
Edge cases (lines 648-653, 690, 699, 759-760, etc.): edge cases in some of the extraction methods

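🎯 Recommendations

✅ Goals Achieved

✅ Coverage raised from 25% to 65% (+40 percentage points)
✅ Test count increased from 24 to 79 (+55 tests)
✅ All extraction methods are tested (_extract_ocr, _extract_bankgiro, _extract_plusgiro, _extract_amount)

Next Goals (coverage 65% → 80%+)

Add validation-method tests - add tests for _calculate_confidence and cross_validate
Add helper-method tests - add tests for the token lookup and bbox calculation methods
Fill in edge cases - add tests for edge cases and exception handling
Integration tests - add end-to-end integration tests that use real PDF token data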

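✅ Completed Improvements

Refactoring Benefits

✅ The 3 extracted helper methods can now be tested independently
✅ Tests are finer-grained, which makes problems easier to locate
✅ Code readability has improved, and the test cases are clear and easy to follow

Quality Assurance

✅ All 688 tests pass (100%)
✅ No regressions
✅ The new tests cover refactored code that was previously untested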

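📚 Lessons from Writing the Tests

What Worked

Use a fixture to create test data - the _create_token() helper simplifies token creation
Organize test classes by method - one test class per method keeps the structure clear
Name test cases clearly - the test_<what>_<condition> format is readable at a glance
Cover the critical paths - test common scenarios and edge cases first

Problems Encountered

Token initialization parameters - the page_no parameter was forgotten, which made the initial tests fail
- Fix: update the _create_token() helper to pass page_no=0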
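
A token factory of the kind described here might look like the sketch below. The field names on TextToken (text, bbox, page_no) and the bounding-box defaults are assumptions made for illustration; only the helper's purpose and the page_no=0 default come from the report.

```python
from dataclasses import dataclass


@dataclass
class TextToken:
    """Hypothetical stand-in; the real constructor may take other fields."""
    text: str
    bbox: tuple[float, float, float, float]
    page_no: int


def create_token(text: str, x0: float = 0.0, y0: float = 0.0,
                 x1: float = 10.0, y1: float = 10.0, page_no: int = 0) -> TextToken:
    # Defaulting page_no here means individual tests cannot forget it.
    return TextToken(text=text, bbox=(x0, y0, x1, y1), page_no=page_no)


token = create_token('BG:')
print(token.page_no, token.bbox)
```

Keeping every required constructor argument behind a single helper means a future signature change only has to be fixed in one place.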

Report date: 2026-01-24
Status: ✅ Complete
Next step: continue raising coverage to 60%+
