Files

Yaojia Wang 0990239e9c feat: add field-specific bbox expansion strategies for YOLO training

Implement center-point based bbox scaling with directional compensation
to capture field labels that typically appear above or to the left of
field values. This improves YOLO training data quality by including
contextual information around field values.

Key changes:
- Add shared.bbox module with ScaleStrategy dataclass and expand_bbox function
- Define field-specific strategies (ocr_number, bankgiro, invoice_date, etc.)
- Support manual_mode for minimal padding (no scaling)
- Integrate expand_bbox into AnnotationGenerator
- Add FIELD_TO_CLASS mapping for field_name to class_name lookup
- Comprehensive tests with 100% coverage (45 tests)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-04 22:56:52 +01:00

data

re-structure

2026-02-01 22:55:31 +01:00

domain

WIP

2026-02-03 22:29:53 +01:00

inference

Update paddle, and support invoice line item

2026-02-03 21:28:06 +01:00

integration

Update paddle, and support invoice line item

2026-02-03 21:28:06 +01:00

matcher

restructure project

2026-01-27 23:58:17 +01:00

normalize

restructure project

2026-01-27 23:58:17 +01:00

ocr

restructure project

2026-01-27 23:58:17 +01:00

pdf

restructure project

2026-01-27 23:58:17 +01:00

shared

feat: add field-specific bbox expansion strategies for YOLO training

2026-02-04 22:56:52 +01:00

table

refactor: split line_items_extractor into smaller modules with comprehensive tests

2026-02-03 23:02:00 +01:00

training

feat: add field-specific bbox expansion strategies for YOLO training

2026-02-04 22:56:52 +01:00

utils

restructure project

2026-01-27 23:58:17 +01:00

validation

Update paddle, and support invoice line item

2026-02-03 21:28:06 +01:00

vat

Update paddle, and support invoice line item

2026-02-03 21:28:06 +01:00

web

WIP

2026-02-03 22:29:53 +01:00

__init__.py

Re-structure the project.

2026-01-25 15:21:11 +01:00

README.md

Re-structure the project.

2026-01-25 15:21:11 +01:00

test_config.py

restructure project

2026-01-27 23:58:17 +01:00

test_customer_number_parser.py

re-structure

2026-02-01 22:55:31 +01:00

test_db_security.py

restructure project

2026-01-27 23:58:17 +01:00

test_exceptions.py

restructure project

2026-01-27 23:58:17 +01:00

test_imports.py

re-structure

2026-02-01 22:55:31 +01:00

test_payment_line_parser.py

re-structure

2026-02-01 22:55:31 +01:00

README.md

Tests

A complete test suite, organized according to pytest best practices.

📁 Test Directory Structure

tests/
├── __init__.py
├── README.md                               # This file
│
├── data/                                   # Data module tests
│   ├── __init__.py
│   └── test_csv_loader.py                  # CSV loader tests
│
├── inference/                              # Inference module tests
│   ├── __init__.py
│   ├── test_field_extractor.py             # Field extractor tests
│   └── test_pipeline.py                    # Inference pipeline tests
│
├── matcher/                                # Matcher module tests
│   ├── __init__.py
│   └── test_field_matcher.py               # Field matcher tests
│
├── normalize/                              # Normalization module tests
│   ├── __init__.py
│   ├── test_normalizer.py                  # FieldNormalizer tests (85 tests)
│   └── normalizers/                        # Individual normalizer tests
│       ├── __init__.py
│       ├── test_invoice_number_normalizer.py    # 12 tests
│       ├── test_ocr_normalizer.py               # 9 tests
│       ├── test_bankgiro_normalizer.py          # 11 tests
│       ├── test_plusgiro_normalizer.py          # 10 tests
│       ├── test_amount_normalizer.py            # 15 tests
│       ├── test_date_normalizer.py              # 19 tests
│       ├── test_organisation_number_normalizer.py  # 11 tests
│       ├── test_supplier_accounts_normalizer.py    # 13 tests
│       ├── test_customer_number_normalizer.py      # 12 tests
│       └── README.md                        # Normalizer test documentation
│
├── ocr/                                    # OCR module tests
│   ├── __init__.py
│   └── test_machine_code_parser.py         # Machine code parser tests
│
├── pdf/                                    # PDF module tests
│   ├── __init__.py
│   ├── test_detector.py                    # PDF type detector tests
│   └── test_extractor.py                   # PDF extractor tests
│
├── utils/                                  # Utility module tests
│   ├── __init__.py
│   ├── test_utils.py                       # Basic utility tests
│   └── test_advanced_utils.py              # Advanced utility tests
│
├── test_config.py                          # Configuration tests
├── test_customer_number_parser.py          # Customer number parser tests
├── test_db_security.py                     # Database security tests
├── test_exceptions.py                      # Exception tests
└── test_payment_line_parser.py             # Payment line parser tests

📊 Test Statistics

Total tests: 628
Status: ✅ all passing
Execution time: ~7.7 seconds
Code coverage: 37% (overall)

By module

Module	Test files	Tests	Coverage
normalize	10	197	~98%
- normalizers/	9	112	100%
- test_normalizer.py	1	85	71%
utils	2	~149	73-93%
pdf	2	~282	94-97%
matcher	1	~402	-
ocr	1	~146	25%
inference	2	~408	-
data	1	~282	-
Other	4	~110	-

🚀 Running Tests

Run all tests

# In a WSL environment
conda activate invoice-py311
pytest tests/ -v

Run the tests for a specific module

# Normalizer tests
pytest tests/normalize/ -v

# Individual normalizer tests
pytest tests/normalize/normalizers/ -v

# PDF tests
pytest tests/pdf/ -v

# Utils tests
pytest tests/utils/ -v

# Inference tests
pytest tests/inference/ -v

Run a single test file

pytest tests/normalize/normalizers/test_amount_normalizer.py -v
pytest tests/pdf/test_extractor.py -v
pytest tests/utils/test_utils.py -v

View test coverage

# Generate an HTML coverage report
pytest tests/ --cov=src --cov-report=html

# Show coverage for a single module only
pytest tests/normalize/ --cov=src/normalize --cov-report=term-missing

Run specific tests

# Run by test class
pytest tests/normalize/normalizers/test_amount_normalizer.py::TestAmountNormalizer -v

# Run by test method
pytest tests/normalize/normalizers/test_amount_normalizer.py::TestAmountNormalizer::test_integer_amount -v

# Run by keyword
pytest tests/ -k "normalizer" -v
pytest tests/ -k "amount" -v

🎯 Testing Best Practices

1. The directory structure mirrors the source code

src/normalize/normalizers/amount_normalizer.py
tests/normalize/normalizers/test_amount_normalizer.py
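
The mirroring rule can be expressed mechanically. The sketch below is illustrative only: `mirrored_test_path` is a hypothetical helper, not part of this repository, and it assumes every source file lives under `src/` and that paths use POSIX-style separators.

```python
from pathlib import PurePosixPath


def mirrored_test_path(src_path: str) -> str:
    """Map a source file under src/ to its mirrored test file under tests/.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    path = PurePosixPath(src_path)
    if not path.parts or path.parts[0] != "src":
        raise ValueError(f"expected a path under src/, got {src_path!r}")
    # Drop the leading "src", keep the package directories,
    # and prefix the file name with "test_".
    package_dirs = path.parts[1:-1]
    return str(PurePosixPath("tests", *package_dirs, f"test_{path.name}"))


if __name__ == "__main__":
    print(mirrored_test_path("src/normalize/normalizers/amount_normalizer.py"))
    # -> tests/normalize/normalizers/test_amount_normalizer.py
```

A helper like this could be used in a lint step to flag source modules that have no matching test file, if the project wants to enforce the convention.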

2. Test file naming

Test files start with test_
Test classes start with Test
Test methods start with test_

3. Use pytest fixtures

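import pytest
# Assumed import path, mirroring src/normalize/normalizers/amount_normalizer.py
from src.normalize.normalizers.amount_normalizer import AmountNormalizer

@pytest.fixture
def normalizer():
    """Create normalizer instance for testing"""
    return AmountNormalizer()

def test_something(normalizer):
    # pytest injects the fixture by matching the argument name
    result = normalizer.normalize('test')
    assert 'expected' in result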

4. Clear test descriptions

def test_with_comma_decimal(self, normalizer):
    """Amount with comma decimal should generate dot variant"""
    result = normalizer.normalize('114,00')
    assert '114.00' in result

5. Arrange-Act-Assert pattern

def test_example(self):
    # Arrange
    input_data = 'test-input'
    expected = 'expected-output'

    # Act
    result = process(input_data)

    # Assert
    assert expected in result

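📝 Adding New Tests

Adding tests for a new feature

Create a test file in the corresponding tests/ subdirectory
Follow the naming convention: test_<module_name>.py
Create the test classes and methods
Run the tests to verify

Example: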

# tests/new_module/test_new_feature.py
import pytest
from src.new_module.new_feature import NewFeature


class TestNewFeature:
    """Test NewFeature functionality"""

    @pytest.fixture
    def feature(self):
        """Create feature instance for testing"""
        return NewFeature()

    def test_basic_functionality(self, feature):
        """Test basic functionality"""
        result = feature.process('input')
        assert result == 'expected'

    def test_edge_case(self, feature):
        """Test edge case handling"""
        result = feature.process('')
        assert result == []

🔧 pytest Configuration

The project's pytest configuration lives in pyproject.toml:

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
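
As a rough illustration of what these patterns select, the sketch below applies the same glob patterns with the standard library's `fnmatchcase`. It only approximates pytest's collection rules and is not pytest's actual implementation; `is_collected_name` is a hypothetical helper introduced here.

```python
from fnmatch import fnmatchcase

# Patterns copied from the [tool.pytest.ini_options] table above.
PYTHON_FILES = ["test_*.py"]
PYTHON_CLASSES = ["Test*"]
PYTHON_FUNCTIONS = ["test_*"]


def is_collected_name(name: str, patterns: list[str]) -> bool:
    """Return True if `name` matches any of the glob patterns.

    Approximation only: pytest applies further rules during collection
    (for example, test classes must not define __init__).
    """
    return any(fnmatchcase(name, pattern) for pattern in patterns)


if __name__ == "__main__":
    print(is_collected_name("test_utils.py", PYTHON_FILES))            # True
    print(is_collected_name("utils_test.py", PYTHON_FILES))            # False
    print(is_collected_name("TestAmountNormalizer", PYTHON_CLASSES))   # True
    print(is_collected_name("test_integer_amount", PYTHON_FUNCTIONS))  # True
```

A file named `utils_test.py` would be skipped under this configuration, which is why the naming convention matters.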

📈 Continuous Integration

The tests can be integrated into a CI/CD pipeline:

# .github/workflows/test.yml
- name: Run Tests
  run: |
    conda activate invoice-py311
    pytest tests/ -v --cov=src --cov-report=xml

- name: Upload Coverage
  uses: codecov/codecov-action@v3
  with:
    file: ./coverage.xml

🎨 Test Coverage Targets

Module	Current coverage	Target
normalize/	98%	✅ Met
utils/	73-93%	🎯 Raise to 90%
pdf/	94-97%	✅ Met
inference/	Not yet measured	🎯 80%
matcher/	Not yet measured	🎯 80%
ocr/	25%	🎯 Raise to 70%

📚 Related Documentation

Normalizer Tests - detailed documentation for the individual normalizer tests
pytest Documentation - official pytest documentation
Code Coverage - coverage tool documentation

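✅ Testing Checklist

When adding a new feature, make sure to:

Create the corresponding test file
Test normal behavior
Test boundary conditions (missing values, None, empty strings)
Test error handling
Keep test coverage above 80%
Confirm that all tests pass
Update the related documentation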
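
The boundary-condition and error-handling items on the checklist can often be covered compactly with `pytest.mark.parametrize` and `pytest.raises`. The sketch below is a minimal example under stated assumptions: `normalize_amount` is a hypothetical stand-in defined inline so the block runs on its own, and its behavior (returning a list of variants and rejecting None) is assumed for illustration, not taken from the project's AmountNormalizer.

```python
import pytest


def normalize_amount(value):
    """Hypothetical stand-in: return string variants of an amount."""
    if value is None:
        raise TypeError("value must be a string, not None")
    text = value.strip()
    if not text:
        # Empty or whitespace-only input yields no variants.
        return []
    variants = [text]
    if "," in text:
        # Add a dot-decimal variant for comma-decimal input.
        variants.append(text.replace(",", "."))
    return variants


@pytest.mark.parametrize(
    "raw, expected",
    [
        ("114,00", ["114,00", "114.00"]),  # normal case
        ("114", ["114"]),                  # no decimal separator
        ("", []),                          # empty string
        ("   ", []),                       # whitespace only
    ],
)
def test_normalize_amount(raw, expected):
    assert normalize_amount(raw) == expected


def test_none_is_rejected():
    # Error handling: None should raise rather than pass silently.
    with pytest.raises(TypeError):
        normalize_amount(None)
```

Each tuple in the parameter list is reported as a separate test, so a failing edge case is identified by its input in the pytest output.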

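🎉 Summary

✅ All 628 tests pass
✅ Clear directory structure that mirrors the source code
✅ Follows pytest best practices
✅ Thorough documentation
✅ Easy to maintain and extend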