# Refactoring Plan
**Project**: Invoice Field Extraction System
**Date**: 2026-01-22
**Based on**: CODE_REVIEW_REPORT.md
**Goal**: Improve code maintainability, testability, and security
---
## 📋 Table of Contents
1. [Refactoring Goals](#refactoring-goals)
2. [Overall Strategy](#overall-strategy)
3. [Three-Phase Plan](#three-phase-plan)
4. [Detailed Refactoring Steps](#detailed-refactoring-steps)
5. [Testing Strategy](#testing-strategy)
6. [Risk Management](#risk-management)
7. [Success Metrics](#success-metrics)
---
## 🎯 Refactoring Goals
### Primary Goals
1. **Security**: eliminate plaintext passwords, SQL injection, and other security risks
2. **Maintainability**: reduce code duplication and lower function complexity
3. **Testability**: raise test coverage to 70%+ and add integration tests
4. **Readability**: unify code style and add necessary documentation
5. **Performance**: optimize batch and concurrent processing
### Quantitative Targets
- Test coverage: 45% → 70%+
- Average function length: 80 lines → under 50 lines
- Code duplication: 15% → under 5%
- Cyclomatic complexity: max 15+ → max 10
- Docstring coverage of key functions: 30% → 80%+
---
## 📐 Overall Strategy
### Principles
1. **Incremental refactoring**: small steps; keep the system runnable after every change
2. **Tests first**: add tests before refactoring to pin down existing behavior
3. **Backward compatibility**: keep API interfaces compatible; deprecate old interfaces gradually
4. **Docs in sync**: update documentation alongside code changes
### Workflow
```
1. Add tests for the module to be refactored (cover existing behavior)
2. Refactor (Extract Method, Extract Class, etc.)
3. Run the full test suite (verify behavior is unchanged)
4. Update documentation
5. Code review
6. Merge to main
```
---
## 🗓️ Three-Phase Plan
### Phase 1: Urgent Fixes (1 week)
**Goal**: fix security vulnerabilities and critical bugs
| Task | Priority | Estimate | Module |
|------|----------|----------|--------|
| Fix plaintext password | P0 | 1 h | `src/db/config.py` |
| Set up environment-variable config | P0 | 2 h | `.env` in repo root |
| Fix SQL injection risk | P0 | 3 h | `src/db/operations.py` |
| Add input validation | P1 | 4 h | `src/web/routes.py` |
| Standardize exception handling | P1 | 1 day | global |
### Phase 2: Core Refactoring (2-3 weeks)
**Goal**: reduce code complexity and remove duplication
| Task | Priority | Estimate | Module |
|------|----------|----------|--------|
| Split `_normalize_customer_number` | P0 | 1 day | `field_extractor.py` |
| Unify payment_line parsing | P0 | 2 days | extract into its own module |
| Refactor `process_document` | P1 | 2 days | `pipeline.py` |
| Extract Method: split long functions | P1 | 3 days | global |
| Add integration tests | P0 | 3 days | `tests/integration/` |
| Raise unit-test coverage | P1 | 2 days | all modules |
### Phase 3: Optimization (1-2 weeks)
**Goal**: performance tuning and documentation
| Task | Priority | Estimate | Module |
|------|----------|----------|--------|
| Optimize batch-processing concurrency | P1 | 2 days | `batch_processor.py` |
| Complete API documentation | P2 | 1 day | `docs/API.md` |
| Extract config into constants | P2 | 1 day | `src/config/constants.py` |
| Improve the logging system | P2 | 1 day | `src/utils/logging.py` |
| Profiling and optimization | P2 | 2 days | global |
## 🔧 Detailed Refactoring Steps
### Step 1: Fix Plaintext Password (P0, 1 h)
**Current problem**:
```python
# src/db/config.py:29
DATABASE_CONFIG = {
    "host": "localhost",
    "port": 3306,
    "user": "root",
    "password": "your_password",  # ❌ plaintext password
    "database": "invoice_extraction",
}
```
**Refactoring steps**:
1. Create a `.env.example` template:
```bash
# Database Configuration
DB_HOST=localhost
DB_PORT=3306
DB_USER=root
DB_PASSWORD=your_password_here
DB_NAME=invoice_extraction
```
2. Create a `.env` file (add it to `.gitignore`):
```bash
DB_PASSWORD=actual_secure_password
```
3. Update `src/db/config.py`:
```python
import os

from dotenv import load_dotenv

load_dotenv()

DATABASE_CONFIG = {
    "host": os.getenv("DB_HOST", "localhost"),
    "port": int(os.getenv("DB_PORT", "3306")),
    "user": os.getenv("DB_USER", "root"),
    "password": os.getenv("DB_PASSWORD"),  # ✅ read from environment
    "database": os.getenv("DB_NAME", "invoice_extraction"),
}

# Validate at startup
if not DATABASE_CONFIG["password"]:
    raise ValueError("DB_PASSWORD environment variable not set")
```
4. Install the dependency:
```bash
pip install python-dotenv
```
5. Update `requirements.txt`:
```
python-dotenv>=1.0.0
```
**Tests**:
- Verify environment variables are read correctly
- Confirm an exception is raised when the variable is missing
- Test the database connection
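The startup validation can be smoke-tested without a live database. A minimal sketch, assuming the config-building logic is pulled into a helper function (`load_db_config` is a hypothetical name, not existing project code):

```python
import os


def load_db_config(env: dict = os.environ) -> dict:
    """Build the DB config from an environment mapping, failing fast
    when the password is missing (mirrors the startup check above)."""
    password = env.get("DB_PASSWORD")
    if not password:
        raise ValueError("DB_PASSWORD environment variable not set")
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "3306")),
        "user": env.get("DB_USER", "root"),
        "password": password,
        "database": env.get("DB_NAME", "invoice_extraction"),
    }
```

Taking the environment as a parameter makes the failure path trivially testable: pass `{}` and assert that `ValueError` is raised.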
---
### Step 2: Fix SQL Injection (P0, 3 h)
**Current problem**:
```python
# src/db/operations.py:156
query = f"SELECT * FROM documents WHERE id = {doc_id}"  # ❌ SQL injection risk
cursor.execute(query)
```
**Refactoring steps**:
1. Audit all SQL queries for string interpolation:
```bash
grep -n "f\".*SELECT" src/db/operations.py
grep -n "f\".*INSERT" src/db/operations.py
grep -n "f\".*UPDATE" src/db/operations.py
grep -n "f\".*DELETE" src/db/operations.py
```
2. Replace with parameterized queries:
```python
# Before
query = f"SELECT * FROM documents WHERE id = {doc_id}"
cursor.execute(query)

# After ✅
query = "SELECT * FROM documents WHERE id = %s"
cursor.execute(query, (doc_id,))
```
3. Common patterns:
```python
# INSERT
query = "INSERT INTO documents (filename, status) VALUES (%s, %s)"
cursor.execute(query, (filename, status))

# UPDATE
query = "UPDATE documents SET status = %s WHERE id = %s"
cursor.execute(query, (new_status, doc_id))

# IN clause
placeholders = ','.join(['%s'] * len(ids))
query = f"SELECT * FROM documents WHERE id IN ({placeholders})"
cursor.execute(query, ids)
```
4. Add a query-builder helper:
```python
# src/db/query_builder.py
def build_select(
    table: str,
    columns: list[str] | None = None,
    where: dict | None = None,
) -> tuple[str, tuple]:
    """Build a SELECT query with parameterized values.

    Note: `table` and the column/key names are interpolated directly,
    so they must come from trusted code, never from user input.
    """
    cols = ', '.join(columns) if columns else '*'
    query = f"SELECT {cols} FROM {table}"
    params = []
    if where:
        conditions = []
        for key, value in where.items():
            conditions.append(f"{key} = %s")
            params.append(value)
        query += " WHERE " + " AND ".join(conditions)
    return query, tuple(params)
```
**Tests**:
- Unit-test every modified query function
- SQL injection tests: pass malicious input such as `"1 OR 1=1"`
- Integration tests to verify behavior is unchanged
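To see why parameterization neutralizes a payload like `"1 OR 1=1"`, here is a self-contained sketch using Python's built-in `sqlite3` (which uses `?` placeholders) in place of MySQL's `%s`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER, filename TEXT)")
conn.execute("INSERT INTO documents VALUES (1, 'a.pdf'), (2, 'b.pdf')")

malicious = "1 OR 1=1"

# Unsafe interpolation: the payload becomes part of the SQL,
# turning the WHERE clause into a tautology that matches every row.
unsafe = conn.execute(
    f"SELECT * FROM documents WHERE id = {malicious}"
).fetchall()

# Parameterized: the payload is bound as a single opaque value,
# compared against `id` as data, so it matches nothing.
safe = conn.execute(
    "SELECT * FROM documents WHERE id = ?", (malicious,)
).fetchall()
```

Here `unsafe` returns both rows while `safe` returns none; the same reasoning applies to MySQL's `%s` placeholders used in the snippets above.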
---
### Step 3: Unify payment_line Parsing (P0, 2 days)
**Current problem**: payment_line parsing logic is implemented in three places
- `src/inference/field_extractor.py:632-705` (normalization)
- `src/inference/pipeline.py:217-252` (parsing for cross-validation)
- `src/inference/test_field_extractor.py:269-344` (test cases)
**Refactoring steps**:
1. Create a standalone module `src/inference/payment_line_parser.py`:
```python
"""
Swedish Payment Line Parser

Handles parsing and validation of Swedish machine-readable payment lines.
Format: # <OCR> # <Kronor> <Öre> <Type> > <Account>#<Check>#
"""
import logging
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaymentLineData:
    """Parsed payment line data."""
    ocr_number: str
    amount: str           # Format: "KRONOR.ÖRE"
    account_number: str   # Bankgiro or Plusgiro
    record_type: str      # Usually "5" or "9"
    check_digits: str
    raw_text: str
    is_valid: bool
    error: Optional[str] = None


class PaymentLineParser:
    """Parser for Swedish payment lines with OCR error handling."""

    # Pattern with OCR error tolerance
    FULL_PATTERN = re.compile(
        r'#\s*(\d[\d\s]*)\s*#\s*([\d\s]+?)\s+(\d{2})\s+(\d)\s*>?\s*([\d\s]+)\s*#\s*(\d+)\s*#'
    )
    # Pattern without amount (fallback)
    PARTIAL_PATTERN = re.compile(
        r'#\s*(\d[\d\s]*)\s*#.*?(\d)\s*>?\s*([\d\s]+)\s*#\s*(\d+)\s*#'
    )

    def __init__(self):
        self.logger = logging.getLogger(__name__)

    def parse(self, text: str) -> PaymentLineData:
        """
        Parse payment line text.

        Handles common OCR errors:
        - Spaces in numbers: "12 0 0" → "1200"
        - Missing symbols: missing ">"
        - Spaces in check digits: "#41 #" → "#41#"

        Args:
            text: Raw payment line text

        Returns:
            PaymentLineData with parsed fields
        """
        text = text.strip()
        # Try full pattern with amount
        match = self.FULL_PATTERN.search(text)
        if match:
            return self._parse_full_match(match, text)
        # Try partial pattern without amount
        match = self.PARTIAL_PATTERN.search(text)
        if match:
            return self._parse_partial_match(match, text)
        # No match
        return PaymentLineData(
            ocr_number="",
            amount="",
            account_number="",
            record_type="",
            check_digits="",
            raw_text=text,
            is_valid=False,
            error="Invalid payment line format",
        )

    def _parse_full_match(self, match: re.Match, raw_text: str) -> PaymentLineData:
        """Parse full pattern match (with amount)."""
        ocr = self._clean_digits(match.group(1))
        kronor = self._clean_digits(match.group(2))
        ore = match.group(3)
        record_type = match.group(4)
        account = self._clean_digits(match.group(5))
        check_digits = match.group(6)
        amount = f"{kronor}.{ore}"
        return PaymentLineData(
            ocr_number=ocr,
            amount=amount,
            account_number=account,
            record_type=record_type,
            check_digits=check_digits,
            raw_text=raw_text,
            is_valid=True,
        )

    def _parse_partial_match(self, match: re.Match, raw_text: str) -> PaymentLineData:
        """Parse partial pattern match (without amount)."""
        ocr = self._clean_digits(match.group(1))
        record_type = match.group(2)
        account = self._clean_digits(match.group(3))
        check_digits = match.group(4)
        return PaymentLineData(
            ocr_number=ocr,
            amount="",  # No amount in partial format
            account_number=account,
            record_type=record_type,
            check_digits=check_digits,
            raw_text=raw_text,
            is_valid=True,
        )

    def _clean_digits(self, text: str) -> str:
        """Remove spaces from digit string."""
        return text.replace(' ', '')

    def format_machine_readable(self, data: PaymentLineData) -> str:
        """
        Format parsed data back to machine-readable format.

        Returns:
            Formatted string: "# OCR # KRONOR ÖRE TYPE > ACCOUNT#CHECK#"
        """
        if not data.is_valid:
            return data.raw_text
        if data.amount:
            kronor, ore = data.amount.split('.')
            return (
                f"# {data.ocr_number} # {kronor} {ore} {data.record_type} > "
                f"{data.account_number}#{data.check_digits}#"
            )
        else:
            return (
                f"# {data.ocr_number} # ... {data.record_type} > "
                f"{data.account_number}#{data.check_digits}#"
            )
```
2. Refactor `field_extractor.py` to use the new parser:
```python
# src/inference/field_extractor.py
from .payment_line_parser import PaymentLineParser


class FieldExtractor:
    def __init__(self):
        self.payment_parser = PaymentLineParser()
        # ...

    def _normalize_payment_line(self, text: str) -> tuple[str | None, bool, str | None]:
        """Normalize payment line using dedicated parser."""
        data = self.payment_parser.parse(text)
        if not data.is_valid:
            return None, False, data.error
        formatted = self.payment_parser.format_machine_readable(data)
        return formatted, True, None
```
3. Refactor `pipeline.py` to use the new parser:
```python
# src/inference/pipeline.py
from .payment_line_parser import PaymentLineParser


class InferencePipeline:
    def __init__(self):
        self.payment_parser = PaymentLineParser()
        # ...

    def _parse_machine_readable_payment_line(
        self, payment_line: str
    ) -> tuple[str | None, str | None, str | None]:
        """Parse payment line for cross-validation."""
        data = self.payment_parser.parse(payment_line)
        if not data.is_valid:
            return None, None, None
        return data.ocr_number, data.amount, data.account_number
```
4. Update the tests to use the new parser:
```python
# tests/unit/test_payment_line_parser.py
from src.inference.payment_line_parser import PaymentLineParser


class TestPaymentLineParser:
    def test_full_format_with_spaces(self):
        """Test parsing with OCR-induced spaces."""
        parser = PaymentLineParser()
        text = "# 6026726908 # 736 00 9 > 5692041 #41 #"
        data = parser.parse(text)
        assert data.is_valid
        assert data.ocr_number == "6026726908"
        assert data.amount == "736.00"
        assert data.account_number == "5692041"
        assert data.check_digits == "41"

    def test_format_without_amount(self):
        """Test parsing without amount."""
        parser = PaymentLineParser()
        text = "# 11000770600242 # ... 5 > 3082963#41#"
        data = parser.parse(text)
        assert data.is_valid
        assert data.ocr_number == "11000770600242"
        assert data.amount == ""
        assert data.account_number == "3082963"

    def test_machine_readable_format(self):
        """Test formatting back to machine-readable."""
        parser = PaymentLineParser()
        text = "# 6026726908 # 736 00 9 > 5692041 #41 #"
        data = parser.parse(text)
        formatted = parser.format_machine_readable(data)
        assert "# 6026726908 #" in formatted
        assert "736 00" in formatted
        assert "5692041#41#" in formatted
```
**Migration steps**:
1. Create `payment_line_parser.py` along with its tests
2. Run the tests to confirm the new implementation is correct
3. Migrate one file at a time to the new parser
4. Run the full test suite after each migration
5. Delete the old implementations
6. Update the documentation
**Tests**:
- Unit tests covering all parsing scenarios
- Integration tests for end-to-end behavior
- Regression tests to confirm behavior is unchanged
---
### Step 4: Split `_normalize_customer_number` (P0, 1 day)
**Current problem**:
- Function length: 127 lines
- Cyclomatic complexity: 15+
- Too many responsibilities: pattern matching, formatting, and validation are mixed together
**Refactoring strategy**: Extract Method + Strategy pattern
**Refactoring steps**:
1. Create `src/inference/customer_number_parser.py`:
```python
"""
Customer Number Parser

Handles extraction and normalization of Swedish customer numbers.
"""
import logging
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class CustomerNumberMatch:
    """Customer number match result."""
    value: str
    pattern_name: str
    confidence: float
    raw_text: str


class CustomerNumberPattern(ABC):
    """Abstract base for customer number patterns."""

    @abstractmethod
    def match(self, text: str) -> Optional[CustomerNumberMatch]:
        """Try to match pattern in text."""

    @abstractmethod
    def format(self, match: re.Match) -> str:
        """Format matched groups to standard format."""


class DashFormatPattern(CustomerNumberPattern):
    """Pattern: ABC 123-X"""

    PATTERN = re.compile(r'\b([A-Za-z]{2,4})\s+(\d{1,4})-([A-Za-z0-9])\b')

    def match(self, text: str) -> Optional[CustomerNumberMatch]:
        match = self.PATTERN.search(text)
        if not match:
            return None
        formatted = self.format(match)
        return CustomerNumberMatch(
            value=formatted,
            pattern_name="DashFormat",
            confidence=0.95,
            raw_text=match.group(0),
        )

    def format(self, match: re.Match) -> str:
        prefix = match.group(1).upper()
        number = match.group(2)
        suffix = match.group(3).upper()
        return f"{prefix} {number}-{suffix}"


class NoDashFormatPattern(CustomerNumberPattern):
    """Pattern: ABC 123X (no dash)"""

    PATTERN = re.compile(r'\b([A-Za-z]{2,4})\s+(\d{2,4})([A-Za-z])\b')

    def match(self, text: str) -> Optional[CustomerNumberMatch]:
        match = self.PATTERN.search(text)
        if not match:
            return None
        # Exclude postal codes
        full_text = match.group(0)
        if self._is_postal_code(full_text):
            return None
        formatted = self.format(match)
        return CustomerNumberMatch(
            value=formatted,
            pattern_name="NoDashFormat",
            confidence=0.90,
            raw_text=full_text,
        )

    def format(self, match: re.Match) -> str:
        prefix = match.group(1).upper()
        number = match.group(2)
        suffix = match.group(3).upper()
        return f"{prefix} {number}-{suffix}"

    def _is_postal_code(self, text: str) -> bool:
        """Check if text looks like a Swedish postal code."""
        # SE 106 43, SE 10643, etc.
        return bool(re.match(r'^SE\s*\d{3}\s*\d{2}', text, re.IGNORECASE))


class CustomerNumberParser:
    """Parser for Swedish customer numbers."""

    def __init__(self):
        # Patterns ordered by specificity (most specific first)
        self.patterns: list[CustomerNumberPattern] = [
            DashFormatPattern(),
            NoDashFormatPattern(),
            # Add more patterns as needed
        ]
        self.logger = logging.getLogger(__name__)

    def parse(self, text: str) -> tuple[Optional[str], bool, Optional[str]]:
        """
        Parse customer number from text.

        Returns:
            (customer_number, is_valid, error)
        """
        text = text.strip()
        # Try each pattern
        matches: list[CustomerNumberMatch] = []
        for pattern in self.patterns:
            match = pattern.match(text)
            if match:
                matches.append(match)
        # No matches
        if not matches:
            return None, False, "No customer number found"
        # Return the highest-confidence match
        best_match = max(matches, key=lambda m: m.confidence)
        return best_match.value, True, None

    def parse_all(self, text: str) -> list[CustomerNumberMatch]:
        """
        Find all customer numbers in text.

        Useful for cases with multiple potential matches.
        """
        matches: list[CustomerNumberMatch] = []
        for pattern in self.patterns:
            match = pattern.match(text)
            if match:
                matches.append(match)
        return sorted(matches, key=lambda m: m.confidence, reverse=True)
```
2. Refactor `field_extractor.py`:
```python
# src/inference/field_extractor.py
from .customer_number_parser import CustomerNumberParser


class FieldExtractor:
    def __init__(self):
        self.customer_parser = CustomerNumberParser()
        # ...

    def _normalize_customer_number(
        self, text: str
    ) -> tuple[str | None, bool, str | None]:
        """Normalize customer number using dedicated parser."""
        return self.customer_parser.parse(text)
```
3. Add tests:
```python
# tests/unit/test_customer_number_parser.py
from src.inference.customer_number_parser import (
    CustomerNumberParser,
    DashFormatPattern,
    NoDashFormatPattern,
)


class TestDashFormatPattern:
    def test_standard_format(self):
        pattern = DashFormatPattern()
        match = pattern.match("Customer: JTY 576-3")
        assert match is not None
        assert match.value == "JTY 576-3"
        assert match.confidence == 0.95


class TestNoDashFormatPattern:
    def test_no_dash_format(self):
        pattern = NoDashFormatPattern()
        match = pattern.match("Dwq 211X")
        assert match is not None
        assert match.value == "DWQ 211-X"
        assert match.confidence == 0.90

    def test_exclude_postal_code(self):
        pattern = NoDashFormatPattern()
        match = pattern.match("SE 106 43")
        assert match is None  # Should be filtered out


class TestCustomerNumberParser:
    def test_parse_with_dash(self):
        parser = CustomerNumberParser()
        result, is_valid, error = parser.parse("Customer: JTY 576-3")
        assert is_valid
        assert result == "JTY 576-3"
        assert error is None

    def test_parse_without_dash(self):
        parser = CustomerNumberParser()
        result, is_valid, error = parser.parse("Dwq 211X Billo")
        assert is_valid
        assert result == "DWQ 211-X"

    def test_parse_all_finds_multiple(self):
        parser = CustomerNumberParser()
        text = "JTY 576-3 and DWQ 211X"
        matches = parser.parse_all(text)
        assert len(matches) >= 1  # At least one match
        assert matches[0].confidence >= 0.90
```
**Migration plan**:
1. Day 1 morning: create the new parser and its tests
2. Day 1 afternoon: migrate `field_extractor.py` and run the tests
3. Regression-test that all document processing still works
---
### Step 5: Refactor `process_document` (P1, 2 days)
**Current problem**: `pipeline.py:100-250` (150 lines) has too many responsibilities
**Refactoring strategy**: Extract Method + separation of concerns
**Target structure**:
```python
def process_document(self, image_path: Path, document_id: str) -> DocumentResult:
    """Main orchestration - keep under 30 lines."""
    # 1. Run detection
    detections = self._run_yolo_detection(image_path)
    # 2. Extract fields
    fields = self._extract_fields_from_detections(detections, image_path)
    # 3. Apply cross-validation
    fields = self._apply_cross_validation(fields)
    # 4. Multi-source fusion
    fields = self._apply_multi_source_fusion(fields)
    # 5. Build result
    return self._build_document_result(document_id, fields, detections)
```
See `docs/CODE_REVIEW_REPORT.md` Section 5.3 for the detailed steps.
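One of the extracted helpers could look like the following sketch. The dict-based field structure and key names (`payment_line_parsed`, `amount`, `ocr_number`) are illustrative assumptions, not the project's actual types; it only demonstrates the override rule used in cross-validation, where values parsed from the payment line win over detected ones:

```python
def apply_cross_validation(fields: dict) -> dict:
    """Let payment-line values override detected field values.

    `fields` is assumed to hold detected values plus an optional
    "payment_line_parsed" dict carrying "amount" / "ocr_number" keys.
    """
    parsed = fields.get("payment_line_parsed") or {}
    for key in ("amount", "ocr_number"):
        value = parsed.get(key)
        if value:  # payment line is authoritative when present
            fields[key] = value
    return fields
```

For example, a detection of `20736.00` with `736.00` in the parsed payment line would be corrected to `736.00`, matching the Billo310 case used in the integration tests below.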
---
### Step 6: Add Integration Tests (P0, 3 days)
**Current state**: no end-to-end integration tests
**Goal**: build a complete integration test suite
**Test scenarios**:
1. PDF → inference → result verification (end to end)
2. Batch processing of multiple documents
3. API endpoint tests
4. Database integration tests
5. Error-path tests
**Implementation steps**:
1. Create the test dataset:
```
tests/
├── fixtures/
│   ├── sample_invoices/
│   │   ├── billo_363.pdf
│   │   ├── billo_308.pdf
│   │   └── billo_310.pdf
│   └── expected_results/
│       ├── billo_363.json
│       ├── billo_308.json
│       └── billo_310.json
```
2. Create `tests/integration/test_end_to_end.py`:
```python
import json
from pathlib import Path

import pytest

from src.inference.pipeline import InferencePipeline
from src.inference.field_extractor import FieldExtractor


@pytest.fixture
def pipeline():
    """Create inference pipeline."""
    extractor = FieldExtractor()
    return InferencePipeline(
        model_path="runs/train/invoice_fields/weights/best.pt",
        confidence_threshold=0.5,
        dpi=150,
        field_extractor=extractor,
    )


@pytest.fixture
def sample_invoices():
    """Load sample invoices and expected results."""
    fixtures_dir = Path(__file__).parent.parent / "fixtures"
    samples = []
    for pdf_path in (fixtures_dir / "sample_invoices").glob("*.pdf"):
        json_path = fixtures_dir / "expected_results" / f"{pdf_path.stem}.json"
        with open(json_path) as f:
            expected = json.load(f)
        samples.append({
            "pdf_path": pdf_path,
            "expected": expected,
        })
    return samples


class TestEndToEnd:
    """End-to-end integration tests."""

    def test_single_document_processing(self, pipeline, sample_invoices):
        """Test processing a single invoice from PDF to extracted fields."""
        sample = sample_invoices[0]
        # Process PDF
        result = pipeline.process_pdf(
            sample["pdf_path"],
            document_id="test_001",
        )
        # Verify success
        assert result.success
        # Verify extracted fields match expected
        expected = sample["expected"]
        assert result.fields["amount"] == expected["amount"]
        assert result.fields["ocr_number"] == expected["ocr_number"]
        assert result.fields["customer_number"] == expected["customer_number"]

    def test_batch_processing(self, pipeline, sample_invoices):
        """Test batch processing multiple invoices."""
        pdf_paths = [s["pdf_path"] for s in sample_invoices]
        # Process batch
        results = pipeline.process_batch(pdf_paths)
        # Verify all processed
        assert len(results) == len(pdf_paths)
        # Verify success rate
        success_count = sum(1 for r in results if r.success)
        assert success_count >= len(pdf_paths) * 0.9  # At least 90% success

    def test_cross_validation_overrides(self, pipeline):
        """Test that payment_line values override detected values."""
        # Use sample with known discrepancy (Billo310)
        pdf_path = Path("tests/fixtures/sample_invoices/billo_310.pdf")
        result = pipeline.process_pdf(pdf_path, document_id="test_cross_val")
        # Verify payment_line was parsed
        assert "payment_line" in result.fields
        # Verify Amount was corrected from payment_line
        # (Billo310: detected 20736.00, payment_line has 736.00)
        assert result.fields["amount"] == "736.00"

    def test_error_handling_invalid_pdf(self, pipeline):
        """Test graceful error handling for invalid PDF."""
        invalid_pdf = Path("tests/fixtures/invalid.pdf")
        result = pipeline.process_pdf(invalid_pdf, document_id="test_error")
        # Should return result with success=False
        assert not result.success
        assert result.errors
        assert len(result.errors) > 0


class TestAPIIntegration:
    """API endpoint integration tests."""

    @pytest.fixture
    def client(self):
        """Create test client."""
        from fastapi.testclient import TestClient
        from src.web.app import create_app
        from src.web.config import AppConfig

        config = AppConfig.from_defaults()
        app = create_app(config)
        return TestClient(app)

    def test_health_endpoint(self, client):
        """Test /api/v1/health endpoint."""
        response = client.get("/api/v1/health")
        assert response.status_code == 200
        data = response.json()
        assert data["status"] == "healthy"
        assert "model_loaded" in data

    def test_infer_endpoint_with_pdf(self, client, sample_invoices):
        """Test /api/v1/infer with PDF upload."""
        sample = sample_invoices[0]
        with open(sample["pdf_path"], "rb") as f:
            response = client.post(
                "/api/v1/infer",
                files={"file": ("test.pdf", f, "application/pdf")},
            )
        assert response.status_code == 200
        data = response.json()
        assert data["status"] == "success"
        assert "result" in data
        assert "fields" in data["result"]

    def test_infer_endpoint_invalid_file(self, client):
        """Test /api/v1/infer rejects invalid file."""
        response = client.post(
            "/api/v1/infer",
            files={"file": ("test.txt", b"invalid", "text/plain")},
        )
        assert response.status_code == 400
        assert "Unsupported file type" in response.json()["detail"]


class TestDatabaseIntegration:
    """Database integration tests."""

    @pytest.fixture
    def db_connection(self):
        """Create test database connection."""
        from src.db.connection import DatabaseConnection

        # Use test database
        conn = DatabaseConnection(database="invoice_extraction_test")
        yield conn
        conn.close()

    def test_save_and_retrieve_result(self, db_connection, pipeline, sample_invoices):
        """Test saving an inference result to the database and retrieving it."""
        sample = sample_invoices[0]
        # Process document
        result = pipeline.process_pdf(sample["pdf_path"], document_id="test_db_001")
        # Save to database
        db_connection.save_inference_result(result)
        # Retrieve from database
        retrieved = db_connection.get_inference_result("test_db_001")
        # Verify
        assert retrieved is not None
        assert retrieved["document_id"] == "test_db_001"
        assert retrieved["fields"]["amount"] == result.fields["amount"]
```
3. Configure pytest to separate integration tests:
```ini
# pytest.ini
[pytest]
markers =
    unit: Unit tests (fast, no external dependencies)
    integration: Integration tests (slower, may use database/files)
    slow: Slow tests

# Run unit tests by default
addopts = -v -m "not integration"

# Run all tests including integration:
#   pytest -m ""
# Run only integration tests:
#   pytest -m integration
```
4. CI/CD integration:
```yaml
# .github/workflows/test.yml
name: Tests
on: [push, pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov
      - name: Run unit tests
        run: pytest -m "not integration" --cov=src --cov-report=xml
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml

  integration-tests:
    runs-on: ubuntu-latest
    services:
      mysql:
        image: mysql:8.0
        env:
          MYSQL_ROOT_PASSWORD: test_password
          MYSQL_DATABASE: invoice_extraction_test
        ports:
          - 3306:3306
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest
      - name: Run integration tests
        env:
          DB_HOST: localhost
          DB_PORT: 3306
          DB_USER: root
          DB_PASSWORD: test_password
          DB_NAME: invoice_extraction_test
        run: pytest -m integration
```
**Time allocation**:
- Day 1: prepare test data, set up the test framework
- Day 2: write end-to-end and API tests
- Day 3: database integration tests, CI/CD configuration
---
### Step 7: Standardize Exception Handling (P1, 1 day)
**Current problem**: 31 occurrences of overly broad `except Exception`
**Goal**: create an exception hierarchy and catch exceptions precisely
**Implementation steps**:
1. Create `src/exceptions.py`:
```python
"""Application-specific exceptions."""


class InvoiceExtractionError(Exception):
    """Base exception for invoice extraction errors."""


class PDFProcessingError(InvoiceExtractionError):
    """Error during PDF processing."""


class OCRError(InvoiceExtractionError):
    """Error during OCR."""


class ModelInferenceError(InvoiceExtractionError):
    """Error during model inference."""


class FieldValidationError(InvoiceExtractionError):
    """Error during field validation."""


class DatabaseError(InvoiceExtractionError):
    """Error during database operations."""


class ConfigurationError(InvoiceExtractionError):
    """Error in configuration."""
```
2. Replace broad exception catches:
```python
# Before ❌
try:
    result = process_pdf(path)
except Exception as e:
    logger.error(f"Error: {e}")
    return None

# After ✅
try:
    result = process_pdf(path)
except PDFProcessingError as e:
    logger.error(f"PDF processing failed: {e}")
    return None
except OCRError as e:
    logger.warning(f"OCR failed, trying fallback: {e}")
    result = fallback_ocr(path)
except ModelInferenceError as e:
    logger.error(f"Model inference failed: {e}")
    raise  # Re-raise for the caller
```
3. Raise specific exceptions in each module:
```python
# src/inference/pdf_processor.py
from src.exceptions import PDFProcessingError


def convert_pdf_to_image(pdf_path: Path, dpi: int) -> list[np.ndarray]:
    try:
        images = pdf2image.convert_from_path(pdf_path, dpi=dpi)
    except Exception as e:
        raise PDFProcessingError(f"Failed to convert PDF: {e}") from e
    if not images:
        raise PDFProcessingError("PDF conversion returned no images")
    return images
```
4. Create an error-handling decorator:
```python
# src/utils/error_handling.py
import functools
import logging
from typing import Callable, Type

from src.exceptions import OCRError, PDFProcessingError


def handle_errors(
    *exception_types: Type[Exception],
    default_return=None,
    log_error: bool = True,
):
    """Decorator for standardized error handling."""
    def decorator(func: Callable):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exception_types as e:
                if log_error:
                    logger = logging.getLogger(func.__module__)
                    logger.error(
                        f"Error in {func.__name__}: {e}",
                        exc_info=True,
                    )
                return default_return
        return wrapper
    return decorator


# Usage
@handle_errors(PDFProcessingError, OCRError, default_return=None)
def safe_process_document(doc_path: Path):
    return process_document(doc_path)
```
---
### Steps 8-12: Remaining Refactoring Tasks
See `CODE_REVIEW_REPORT.md` Section 6 (Action Plan) for the detailed steps.
---
## 🧪 Testing Strategy
### Test Pyramid
```
        /\
       /  \         E2E Tests (10%)
      /----\        - Full pipeline tests
     /      \       - API integration tests
    /--------\
   /          \     Integration Tests (30%)
  /------------\    - Module integration
 /              \   - Database tests
------------------
    Unit Tests (60%)
    - Function-level tests
    - High coverage
```
### Coverage Targets
| Module | Current coverage | Target coverage |
|--------|------------------|-----------------|
| `field_extractor.py` | 40% | 80% |
| `pipeline.py` | 50% | 75% |
| `payment_line_parser.py` | 0% (new) | 90% |
| `customer_number_parser.py` | 0% (new) | 90% |
| `web/routes.py` | 30% | 70% |
| `db/operations.py` | 20% | 60% |
| **Overall** | **45%** | **70%+** |
### Regression Tests
Run after every refactoring step:
```bash
# 1. Unit tests
pytest tests/unit/ -v

# 2. Integration tests
pytest tests/integration/ -v

# 3. End-to-end tests with real PDFs
pytest tests/e2e/ -v

# 4. Performance tests (guard against regressions)
pytest tests/performance/ -v --benchmark

# 5. Coverage check
pytest --cov=src --cov-report=html --cov-fail-under=70
```
---
## ⚠️ Risk Management
### Identified Risks
| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Refactoring breaks existing features | High | Medium | 1. Add tests first<br>2. Small iterations<br>3. Regression tests |
| Performance regression | Medium | Low | 1. Performance benchmarks<br>2. Continuous monitoring<br>3. Profile and optimize |
| API changes break clients | High | Low | 1. Semantic versioning<br>2. Deprecation notice period<br>3. Backward compatibility |
| Database migration failure | High | Low | 1. Back up data<br>2. Staged migration<br>3. Rollback plan |
| Schedule overrun | Medium | Medium | 1. Prioritization<br>2. Weekly progress review<br>3. Adjust scope if needed |
### Rollback Plan
Every refactoring step should have a clear rollback strategy:
1. **Code rollback**: isolate changes on Git branches
```bash
# Create a feature branch per refactoring task
git checkout -b refactor/payment-line-parser
# To roll back
git checkout main
git branch -D refactor/payment-line-parser
```
2. **Database rollback**: use a database migration tool
```bash
# Apply migrations
alembic upgrade head
# Roll back one migration
alembic downgrade -1
```
3. **Config rollback**: keep old config keys working
```python
# Support both old and new config formats
password = config.get("db_password") or config.get("password")
```
---
## 📊 Success Metrics
### Quantitative Metrics
| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| Test coverage | 45% | 70%+ | `pytest --cov` |
| Average function length | 80 lines | <50 lines | `radon raw` |
| Cyclomatic complexity | max 15+ | <10 | `radon cc` |
| Code duplication | ~15% | <5% | `pylint` duplicate-code check |
| Security issues | 2 (plaintext password, SQL injection) | 0 | manual review + `bandit` |
| Documentation coverage | 30% | 80%+ | manual review |
| Average processing time | ~2 s/document | <2 s/document | performance tests |
### Quality Gates
All changes must satisfy:
- ✅ Test coverage ≥ 70%
- ✅ All tests pass (unit + integration + E2E)
- ✅ No high-severity security issues
- ✅ Code review approved
- ✅ No performance regression (within ±5%)
- ✅ Documentation updated
---
## 📅 Timeline
### Phase 1: Urgent Fixes (Week 1)
| Date | Task | Owner | Status |
|------|------|-------|--------|
| Day 1 | Fix plaintext password + env-var config | | ⏳ |
| Day 2-3 | Fix SQL injection + parameterized queries | | ⏳ |
| Day 4-5 | Standardize exception handling | | ⏳ |
### Phase 2: Core Refactoring (Weeks 2-4)
| Week | Task | Status |
|------|------|--------|
| Week 2 | Unify payment_line parsing + split customer_number | ⏳ |
| Week 3 | Refactor pipeline + Extract Method | ⏳ |
| Week 4 | Add integration tests + raise unit-test coverage | ⏳ |
### Phase 3: Optimization (Weeks 5-6)
| Week | Task | Status |
|------|------|--------|
| Week 5 | Batch-processing optimization + config extraction | ⏳ |
| Week 6 | Documentation + logging improvements + performance tuning | ⏳ |
---
## 🔄 Continuous Improvement
### Code Review Checklist
Check before every commit:
- [ ] All tests pass
- [ ] Coverage target met
- [ ] No new security issues
- [ ] Code follows the style guide
- [ ] Function length < 50 lines
- [ ] Cyclomatic complexity < 10
- [ ] Documentation updated
- [ ] Changelog entry added
### Automation
Configure pre-commit hooks:
```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
        language_version: python3.11
  - repo: https://github.com/PyCQA/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
        args: [--max-line-length=88, --extend-ignore=E203]
  - repo: https://github.com/PyCQA/bandit
    rev: 1.7.5
    hooks:
      - id: bandit
        args: [-c, pyproject.toml]
  - repo: local
    hooks:
      - id: pytest-check
        name: pytest-check
        entry: pytest
        language: system
        pass_filenames: false
        always_run: true
        args: [-m, "not integration", --tb=short]
```
---
## 📚 References
### Refactoring Books
- *Refactoring: Improving the Design of Existing Code* - Martin Fowler
- *Clean Code* - Robert C. Martin
- *Working Effectively with Legacy Code* - Michael Feathers
### Design Patterns
- Strategy pattern (customer_number patterns)
- Factory pattern (parser creation)
- Template Method (field normalization)
### Python Best Practices
- PEP 8: Style Guide
- PEP 257: Docstring Conventions
- Google Python Style Guide
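A minimal sketch of the Template Method idea applied to field normalization (class names, hooks, and the amount-cleaning rules here are illustrative assumptions, not the project's actual code): the base class fixes the `normalize()` skeleton, and each field subclass fills in the parsing and formatting hooks.

```python
from abc import ABC, abstractmethod
from typing import Optional


class FieldNormalizer(ABC):
    """Template Method: fixed normalize() skeleton, field-specific hooks."""

    def normalize(self, text: str) -> tuple[Optional[str], bool, Optional[str]]:
        """Shared skeleton: clean, parse, then format."""
        text = text.strip()
        parsed = self.parse(text)
        if parsed is None:
            return None, False, f"{self.field_name()} not found"
        return self.format(parsed), True, None

    @abstractmethod
    def field_name(self) -> str: ...

    @abstractmethod
    def parse(self, text: str) -> Optional[str]: ...

    def format(self, value: str) -> str:
        """Default hook: return the parsed value unchanged."""
        return value


class AmountNormalizer(FieldNormalizer):
    """Example subclass: keep digits and separators, normalize decimal comma."""

    def field_name(self) -> str:
        return "amount"

    def parse(self, text: str) -> Optional[str]:
        cleaned = "".join(c for c in text if c.isdigit() or c in ",.")
        return cleaned or None

    def format(self, value: str) -> str:
        return value.replace(",", ".")
```

The `(value, is_valid, error)` return shape matches the tuples the parsers above already use, so a subclass per field could slot into `FieldExtractor` without changing its callers.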
---
## ✅ Acceptance Criteria
Definition of done for this refactoring:
1. ✅ All P0 and P1 tasks completed
2. ✅ Test coverage ≥ 70%
3. ✅ All security issues fixed
4. ✅ Code duplication < 5%
5. ✅ All long functions (>100 lines) split
6. ✅ API documentation complete
7. ✅ No performance regression
8. ✅ Successful production deployment
---
**End of document**
Next step: start Phase 1, Day 1 - fix the plaintext password.