Add payment line parser and fix OCR override from payment_line

- Add MachineCodeParser for Swedish invoice payment line parsing - Fix OCR Reference extraction by normalizing account number spaces - Add cross-validation tests for pipeline and field_extractor - Update UI layout for compact upload and full-width results Key changes: - machine_code_parser.py: Handle spaces in Bankgiro numbers (e.g. "78 2 1 713") - pipeline.py: OCR and Amount override from payment_line, BG/PG comparison only - field_extractor.py: Improved invoice number normalization - app.py: Responsive UI layout changes Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 21:47:02 +01:00
parent e9460e9f34
commit 4ea4bc96d4
33 changed files with 7530 additions and 562 deletions
--- a/README.md
+++ b/README.md
@@ -1,6 +1,36 @@
 # Invoice Master POC v2

-自动账单信息提取系统 - 使用 YOLOv11 + PaddleOCR 从瑞典 PDF 发票中提取结构化数据。
+自动发票字段提取系统 - 使用 YOLOv11 + PaddleOCR 从瑞典 PDF 发票中提取结构化数据。
+
+## 项目概述
+
+本项目实现了一个完整的发票字段自动提取流程：
+
+1. **自动标注**: 利用已有 CSV 结构化数据 + OCR 自动生成 YOLO 训练标注
+2. **模型训练**: 使用 YOLOv11 训练字段检测模型
+3. **推理提取**: 检测字段区域 → OCR 提取文本 → 字段规范化
+
+### 当前进度
+
+| 指标 | 数值 |
+|------|------|
+| **已标注文档** | 9,738 (9,709 成功) |
+| **总体字段匹配率** | 94.8% (82,604/87,121) |
+
+**各字段匹配率:**
+
+| 字段 | 匹配率 | 说明 |
+|------|--------|------|
+| supplier_accounts(Bankgiro) | 100.0% | 供应商 Bankgiro |
+| supplier_accounts(Plusgiro) | 100.0% | 供应商 Plusgiro |
+| Plusgiro | 99.4% | 支付 Plusgiro |
+| OCR | 99.1% | OCR 参考号 |
+| Bankgiro | 99.0% | 支付 Bankgiro |
+| InvoiceNumber | 98.9% | 发票号码 |
+| InvoiceDueDate | 95.9% | 到期日期 |
+| InvoiceDate | 95.5% | 发票日期 |
+| Amount | 91.3% | 金额 |
+| supplier_organisation_number | 78.2% | 供应商组织号 (CSV 数据质量问题) |

 ## 运行环境

@@ -20,10 +50,10 @@

 - **双模式 PDF 处理**: 支持文本层 PDF 和扫描图 PDF
 - **自动标注**: 利用已有 CSV 结构化数据自动生成 YOLO 训练数据
- **多池处理架构**: CPU 池处理文本 PDF，GPU 池处理扫描 PDF
- **数据库存储**: 标注结果存储在 PostgreSQL，支持增量处理
+- **多策略字段匹配**: 精确匹配、子串匹配、规范化匹配
+- **数据库存储**: 标注结果存储在 PostgreSQL，支持增量处理和断点续传
 - **YOLO 检测**: 使用 YOLOv11 检测发票字段区域
- **OCR 识别**: 使用 PaddleOCR 3.x 提取检测区域的文本
+- **OCR 识别**: 使用 PaddleOCR v5 提取检测区域的文本
 - **Web 应用**: 提供 REST API 和可视化界面
 - **增量训练**: 支持在已训练模型基础上继续训练

@@ -38,6 +68,7 @@
 | 4 | bankgiro | Bankgiro 号码 |
 | 5 | plusgiro | Plusgiro 号码 |
 | 6 | amount | 金额 |
+| 7 | supplier_organisation_number | 供应商组织号 |

 ## 安装

@@ -205,7 +236,7 @@ Options:

 ### 训练结果示例

-使用 15,571 张训练图片，100 epochs 后的结果：
+使用约 10,000 张训练图片，100 epochs 后的结果：

 | 指标 | 值 |
 |------|-----|
@@ -214,6 +245,8 @@ Options:
 | **Precision** | 97.5% |
 | **Recall** | 95.5% |

+> 注：目前仍在持续标注更多数据，预计最终将有 25,000+ 张标注图片用于训练。
+
 ## 项目结构

 ```
@@ -403,16 +436,29 @@ print(result.to_json())    # JSON 格式输出

 - [x] 文本层 PDF 自动标注
 - [x] 扫描图 OCR 自动标注
- [x] 多池处理架构 (CPU + GPU)
- [x] PostgreSQL 数据库存储
+- [x] 多策略字段匹配 (精确/子串/规范化)
+- [x] PostgreSQL 数据库存储 (断点续传)
+- [x] 信号处理和超时保护
 - [x] YOLO 训练 (98.7% mAP@0.5)
 - [x] 推理管道
 - [x] 字段规范化和验证
 - [x] Web 应用 (FastAPI + 前端 UI)
 - [x] 增量训练支持
+- [ ] 完成全部 25,000+ 文档标注
 - [ ] 表格 items 处理
 - [ ] 模型量化部署

+## 技术栈
+
+| 组件 | 技术 |
+|------|------|
+| **目标检测** | YOLOv11 (Ultralytics) |
+| **OCR 引擎** | PaddleOCR v5 (PP-OCRv5) |
+| **PDF 处理** | PyMuPDF (fitz) |
+| **数据库** | PostgreSQL + psycopg2 |
+| **Web 框架** | FastAPI + Uvicorn |
+| **深度学习** | PyTorch + CUDA |
+
 ## 许可证

 MIT License
--- a/claude.md
+++ b/claude.md
@@ -1,216 +0,0 @@
-# Claude Code Instructions - Invoice Master POC v2
-
-## Environment Requirements
-
-> **IMPORTANT**: This project MUST run in **WSL + Conda** environment.
-
-| Requirement | Details |
-|-------------|---------|
-| **WSL** | WSL 2 with Ubuntu 22.04+ |
-| **Conda** | Miniconda or Anaconda |
-| **Python** | 3.10+ (managed by Conda) |
-| **GPU** | NVIDIA drivers on Windows + CUDA in WSL |
-
-```bash
-# Verify environment before running any commands
-uname -a              # Should show "Linux"
-conda --version       # Should show conda version
-conda activate <env>  # Activate project environment
-which python          # Should point to conda environment
-```
-
-**All commands must be executed in WSL terminal with Conda environment activated.**
-
---
-
-## Project Overview
-
-**Automated invoice field extraction system** for Swedish PDF invoices:
- **YOLO Object Detection** (YOLOv8/v11) for field region detection
- **PaddleOCR** for text extraction
- **Multi-strategy matching** for field validation
-
-**Stack**: Python 3.10+ | PyTorch | Ultralytics | PaddleOCR | PyMuPDF
-
-**Target Fields**: InvoiceNumber, InvoiceDate, InvoiceDueDate, OCR, Bankgiro, Plusgiro, Amount
-
---
-
-## Architecture Principles
-
-### SOLID
- **Single Responsibility**: Each module handles one concern
- **Open/Closed**: Extend via new strategies, not modifying existing code
- **Liskov Substitution**: Use Protocol/ABC for interchangeable components
- **Interface Segregation**: Small, focused interfaces
- **Dependency Inversion**: Depend on abstractions, inject dependencies
-
-### Project Structure
-```
-src/
-├── cli/           # Entry points only, no business logic
-├── pdf/           # PDF processing (extraction, rendering, detection)
-├── ocr/           # OCR engines (PaddleOCR wrapper)
-├── normalize/     # Field normalization and validation
-├── matcher/       # Multi-strategy field matching
-├── yolo/          # YOLO annotation and dataset building
-├── inference/     # Inference pipeline
-└── data/          # Data loading and reporting
-```
-
-### Configuration
- `configs/default.yaml` — All tunable parameters
- `config.py` — Sensitive data (credentials, use environment variables)
- Never hardcode magic numbers
-
---
-
-## Python Standards
-
-### Required
- **Type hints** on all public functions (PEP 484/585)
- **Docstrings** in Google style (PEP 257)
- **Dataclasses** for data structures (`frozen=True, slots=True` when immutable)
- **Protocol** for interfaces (PEP 544)
- **Enum** for constants
- **pathlib.Path** instead of string paths
-
-### Naming Conventions
-| Type | Convention | Example |
-|------|------------|---------|
-| Functions/Variables | snake_case | `extract_tokens`, `page_count` |
-| Classes | PascalCase | `FieldMatcher`, `AutoLabelReport` |
-| Constants | UPPER_SNAKE | `DEFAULT_DPI`, `FIELD_TYPES` |
-| Private | _prefix | `_parse_date`, `_cache` |
-
-### Import Order (isort)
-1. `from __future__ import annotations`
-2. Standard library
-3. Third-party
-4. Local modules
-5. `if TYPE_CHECKING:` block
-
-### Code Quality Tools
-| Tool | Purpose | Config |
-|------|---------|--------|
-| Black | Formatting | line-length=100 |
-| Ruff | Linting | E, F, W, I, N, D, UP, B, C4, SIM, ARG, PTH |
-| MyPy | Type checking | strict=true |
-| Pytest | Testing | tests/ directory |
-
---
-
-## Error Handling
-
- Use **custom exception hierarchy** (base: `InvoiceMasterError`)
- Use **logging** instead of print (`logger = logging.getLogger(__name__)`)
- Implement **graceful degradation** with fallback strategies
- Use **context managers** for resource cleanup
-
---
-
-## Machine Learning Standards
-
-### Data Management
- **Immutable raw data**: Never modify `data/raw/`
- **Version datasets**: Track with checksum and metadata
- **Reproducible splits**: Use fixed random seed (42)
- **Split ratios**: 80% train / 10% val / 10% test
-
-### YOLO Training
- **Disable flips** for text detection (`fliplr=0.0, flipud=0.0`)
- **Use early stopping** (`patience=20`)
- **Enable AMP** for faster training (`amp=true`)
- **Save checkpoints** periodically (`save_period=10`)
-
-### Reproducibility
- Set random seeds: `random`, `numpy`, `torch`
- Enable deterministic mode: `torch.backends.cudnn.deterministic = True`
- Track experiment config: model, epochs, batch_size, learning_rate, dataset_version, git_commit
-
-### Evaluation Metrics
- Precision, Recall, F1 Score
- mAP@0.5, mAP@0.5:0.95
- Per-class AP
-
---
-
-## Testing Standards
-
-### Structure
-```
-tests/
-├── unit/           # Isolated, fast tests
-├── integration/    # Multi-module tests
-├── e2e/            # End-to-end workflow tests
-├── fixtures/       # Test data
-└── conftest.py     # Shared fixtures
-```
-
-### Practices
- Follow **AAA pattern**: Arrange, Act, Assert
- Use **parametrized tests** for multiple inputs
- Use **fixtures** for shared setup
- Use **mocking** for external dependencies
- Mark slow tests with `@pytest.mark.slow`
-
---
-
-## Performance
-
- **Parallel processing**: Use `ProcessPoolExecutor` with progress tracking
- **Lazy loading**: Use `@cached_property` for expensive resources
- **Generators**: Use for large datasets to save memory
- **Batch processing**: Process items in batches when possible
-
---
-
-## Security
-
- **Never commit**: credentials, API keys, `.env` files
- **Use environment variables** for sensitive config
- **Validate paths**: Prevent path traversal attacks
- **Validate inputs**: At system boundaries
-
---
-
-## Commands
-
-| Task | Command |
-|------|---------|
-| Run autolabel | `python run_autolabel.py` |
-| Train YOLO | `python -m src.cli.train --config configs/training.yaml` |
-| Run inference | `python -m src.cli.infer --model models/best.pt` |
-| Run tests | `pytest tests/ -v` |
-| Coverage | `pytest tests/ --cov=src --cov-report=html` |
-| Format | `black src/ tests/` |
-| Lint | `ruff check src/ tests/ --fix` |
-| Type check | `mypy src/` |
-
---
-
-## DO NOT
-
- Hardcode file paths or magic numbers
- Use `print()` for logging
- Skip type hints on public APIs
- Write functions longer than 50 lines
- Mix business logic with I/O
- Commit credentials or `.env` files
- Use `# type: ignore` without explanation
- Use mutable default arguments
- Catch bare `except:`
- Use flip augmentation for text detection
-
-## DO
-
- Use type hints everywhere
- Write descriptive docstrings
- Log with appropriate levels
- Use dataclasses for data structures
- Use enums for constants
- Use Protocol for interfaces
- Set random seeds for reproducibility
- Track experiment configurations
- Use context managers for resources
- Validate inputs at boundaries
--- a/src/cli/autolabel.py
+++ b/src/cli/autolabel.py
@@ -112,11 +112,10 @@ def process_single_document(args_tuple):
    row_dict, pdf_path_str, output_dir_str, dpi, min_confidence, skip_ocr = args_tuple

    # Import inside worker to avoid pickling issues
-    from ..data import AutoLabelReport, FieldMatchResult
+    from ..data import AutoLabelReport
    from ..pdf import PDFDocument
-    from ..matcher import FieldMatcher
-    from ..normalize import normalize_field
-    from ..yolo.annotation_generator import AnnotationGenerator, FIELD_CLASSES
+    from ..yolo.annotation_generator import FIELD_CLASSES
+    from ..processing.document_processor import process_page, record_unmatched_fields

    start_time = time.time()
    pdf_path = Path(pdf_path_str)
@@ -165,9 +164,6 @@ def process_single_document(args_tuple):
            if use_ocr:
                ocr_engine = _get_ocr_engine()

-            generator = AnnotationGenerator(min_confidence=min_confidence)
-            matcher = FieldMatcher()
-
            # Process each page
            page_annotations = []
            matched_fields = set()
@@ -202,119 +198,39 @@ def process_single_document(args_tuple):
                    # Use cached document for text extraction
                    tokens = list(pdf_doc.extract_text_tokens(page_no))

-                # Match fields
+                # Get page dimensions
+                page = pdf_doc.doc[page_no]
+                page_height = page.rect.height
+                page_width = page.rect.width
+
+                # Use shared processing logic
                matches = {}
-                for field_name in FIELD_CLASSES.keys():
-                    value = row_dict.get(field_name)
-                    if not value:
-                        continue
-
-                    normalized = normalize_field(field_name, str(value))
-                    field_matches = matcher.find_matches(tokens, field_name, normalized, page_no)
-
-                    # Record result
-                    if field_matches:
-                        best = field_matches[0]
-                        matches[field_name] = field_matches
-                        matched_fields.add(field_name)
-                        report.add_field_result(FieldMatchResult(
-                            field_name=field_name,
-                            csv_value=str(value),
-                            matched=True,
-                            score=best.score,
-                            matched_text=best.matched_text,
-                            candidate_used=best.value,
-                            bbox=best.bbox,
-                            page_no=page_no,
-                            context_keywords=best.context_keywords
-                        ))
-
-                # Match supplier_accounts and map to Bankgiro/Plusgiro
-                supplier_accounts_value = row_dict.get('supplier_accounts')
-                if supplier_accounts_value:
-                    # Parse accounts: "BG:xxx | PG:yyy" format
-                    accounts = [acc.strip() for acc in str(supplier_accounts_value).split('|')]
-                    for account in accounts:
-                        account = account.strip()
-                        if not account:
-                            continue
-
-                        # Determine account type (BG or PG) and extract account number
-                        account_type = None
-                        account_number = account  # Default to full value
-
-                        if account.upper().startswith('BG:'):
-                            account_type = 'Bankgiro'
-                            account_number = account[3:].strip()  # Remove "BG:" prefix
-                        elif account.upper().startswith('BG '):
-                            account_type = 'Bankgiro'
-                            account_number = account[2:].strip()  # Remove "BG" prefix
-                        elif account.upper().startswith('PG:'):
-                            account_type = 'Plusgiro'
-                            account_number = account[3:].strip()  # Remove "PG:" prefix
-                        elif account.upper().startswith('PG '):
-                            account_type = 'Plusgiro'
-                            account_number = account[2:].strip()  # Remove "PG" prefix
-                        else:
-                            # Try to guess from format - Plusgiro often has format XXXXXXX-X
-                            digits = ''.join(c for c in account if c.isdigit())
-                            if len(digits) == 8 and '-' in account:
-                                account_type = 'Plusgiro'
-                            elif len(digits) in (7, 8):
-                                account_type = 'Bankgiro'  # Default to Bankgiro
-
-                        if not account_type:
-                            continue
-
-                        # Normalize and match using the account number (without prefix)
-                        normalized = normalize_field('supplier_accounts', account_number)
-                        field_matches = matcher.find_matches(tokens, account_type, normalized, page_no)
-
-                        if field_matches:
-                            best = field_matches[0]
-                            # Add to matches under the target class (Bankgiro/Plusgiro)
-                            if account_type not in matches:
-                                matches[account_type] = []
-                            matches[account_type].extend(field_matches)
-                            matched_fields.add('supplier_accounts')
-
-                            report.add_field_result(FieldMatchResult(
-                                field_name=f'supplier_accounts({account_type})',
-                                csv_value=account_number,  # Store without prefix
-                                matched=True,
-                                score=best.score,
-                                matched_text=best.matched_text,
-                                candidate_used=best.value,
-                                bbox=best.bbox,
-                                page_no=page_no,
-                                context_keywords=best.context_keywords
-                            ))
-
-                # Count annotations
-                annotations = generator.generate_from_matches(matches, img_width, img_height, dpi=dpi)
+                annotations, ann_count = process_page(
+                    tokens=tokens,
+                    row_dict=row_dict,
+                    page_no=page_no,
+                    page_height=page_height,
+                    page_width=page_width,
+                    img_width=img_width,
+                    img_height=img_height,
+                    dpi=dpi,
+                    min_confidence=min_confidence,
+                    matches=matches,
+                    matched_fields=matched_fields,
+                    report=report,
+                    result_stats=result['stats'],
+                )

                if annotations:
                    page_annotations.append({
                        'image_path': str(image_path),
                        'page_no': page_no,
-                        'count': len(annotations)
+                        'count': ann_count
                    })
+                    report.annotations_generated += ann_count

-                    report.annotations_generated += len(annotations)
-                    for ann in annotations:
-                        class_name = list(FIELD_CLASSES.keys())[ann.class_id]
-                        result['stats'][class_name] += 1
-
-            # Record unmatched fields
-            for field_name in FIELD_CLASSES.keys():
-                value = row_dict.get(field_name)
-                if value and field_name not in matched_fields:
-                    report.add_field_result(FieldMatchResult(
-                        field_name=field_name,
-                        csv_value=str(value),
-                        matched=False,
-                        page_no=-1
-                    ))
+            # Record unmatched fields using shared logic
+            record_unmatched_fields(row_dict, matched_fields, report)

            if page_annotations:
                result['pages'] = page_annotations
--- a/src/cli/infer.py
+++ b/src/cli/infer.py
@@ -38,8 +38,8 @@ def main():
    parser.add_argument(
        '--dpi',
        type=int,
-        default=300,
-        help='DPI for PDF rendering (default: 300)'
+        default=150,
+        help='DPI for PDF rendering (default: 150, must match training)'
    )
    parser.add_argument(
        '--no-fallback',
--- a/src/cli/serve.py
+++ b/src/cli/serve.py
@@ -51,14 +51,14 @@ def parse_args() -> argparse.Namespace:
        "--model",
        "-m",
        type=Path,
-        default=Path("runs/train/invoice_yolo11n_full/weights/best.pt"),
+        default=Path("runs/train/invoice_fields/weights/best.pt"),
        help="Path to YOLO model weights",
    )

    parser.add_argument(
        "--confidence",
        type=float,
-        default=0.3,
+        default=0.5,
        help="Detection confidence threshold",
    )

--- a/src/cli/train.py
+++ b/src/cli/train.py
@@ -86,8 +86,8 @@ def main():
    parser.add_argument(
        '--dpi',
        type=int,
-        default=300,
-        help='DPI used for rendering (default: 300)'
+        default=150,
+        help='DPI used for rendering (default: 150, must match autolabel rendering)'
    )
    parser.add_argument(
        '--export-only',
--- a/src/cli/validate.py
+++ b/src/cli/validate.py
@@ -0,0 +1,337 @@
+#!/usr/bin/env python3
+"""
+CLI for cross-validation of invoice field extraction using LLM.
+
+Validates documents with failed field matches by sending them to an LLM
+and comparing the extraction results.
+"""
+
+import argparse
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent.parent.parent))
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description='Cross-validate invoice field extraction using LLM'
+    )
+
+    subparsers = parser.add_subparsers(dest='command', help='Commands')
+
+    # Stats command
+    stats_parser = subparsers.add_parser('stats', help='Show failed match statistics')
+
+    # Validate command
+    validate_parser = subparsers.add_parser('validate', help='Validate documents with failed matches')
+    validate_parser.add_argument(
+        '--limit', '-l',
+        type=int,
+        default=10,
+        help='Maximum number of documents to validate (default: 10)'
+    )
+    validate_parser.add_argument(
+        '--provider', '-p',
+        choices=['openai', 'anthropic'],
+        default='openai',
+        help='LLM provider to use (default: openai)'
+    )
+    validate_parser.add_argument(
+        '--model', '-m',
+        help='Model to use (default: gpt-4o for OpenAI, claude-sonnet-4-20250514 for Anthropic)'
+    )
+    validate_parser.add_argument(
+        '--single', '-s',
+        help='Validate a single document ID'
+    )
+
+    # Compare command
+    compare_parser = subparsers.add_parser('compare', help='Compare validation results')
+    compare_parser.add_argument(
+        'document_id',
+        nargs='?',
+        help='Document ID to compare (or omit to show all)'
+    )
+    compare_parser.add_argument(
+        '--limit', '-l',
+        type=int,
+        default=20,
+        help='Maximum number of results to show (default: 20)'
+    )
+
+    # Report command
+    report_parser = subparsers.add_parser('report', help='Generate validation report')
+    report_parser.add_argument(
+        '--output', '-o',
+        default='reports/llm_validation_report.json',
+        help='Output file path (default: reports/llm_validation_report.json)'
+    )
+
+    args = parser.parse_args()
+
+    if not args.command:
+        parser.print_help()
+        return
+
+    from src.validation import LLMValidator
+
+    validator = LLMValidator()
+    validator.connect()
+    validator.create_validation_table()
+
+    if args.command == 'stats':
+        show_stats(validator)
+
+    elif args.command == 'validate':
+        if args.single:
+            validate_single(validator, args.single, args.provider, args.model)
+        else:
+            validate_batch(validator, args.limit, args.provider, args.model)
+
+    elif args.command == 'compare':
+        if args.document_id:
+            compare_single(validator, args.document_id)
+        else:
+            compare_all(validator, args.limit)
+
+    elif args.command == 'report':
+        generate_report(validator, args.output)
+
+    validator.close()
+
+
+def show_stats(validator):
+    """Show statistics about failed matches."""
+    stats = validator.get_failed_match_stats()
+
+    print("\n" + "=" * 50)
+    print("Failed Match Statistics")
+    print("=" * 50)
+    print(f"\nDocuments with failures: {stats['documents_with_failures']}")
+    print(f"Already validated:       {stats['already_validated']}")
+    print(f"Remaining to validate:   {stats['remaining']}")
+    print("\nFailures by field:")
+    for field, count in sorted(stats['failures_by_field'].items(), key=lambda x: -x[1]):
+        print(f"  {field}: {count}")
+
+
+def validate_single(validator, doc_id: str, provider: str, model: str):
+    """Validate a single document."""
+    print(f"\nValidating document: {doc_id}")
+    print(f"Provider: {provider}, Model: {model or 'default'}")
+    print()
+
+    result = validator.validate_document(doc_id, provider, model)
+
+    if result.error:
+        print(f"ERROR: {result.error}")
+        return
+
+    print(f"Processing time: {result.processing_time_ms:.0f}ms")
+    print(f"Model used: {result.model_used}")
+    print("\nExtracted fields:")
+    print(f"  Invoice Number: {result.invoice_number}")
+    print(f"  Invoice Date:   {result.invoice_date}")
+    print(f"  Due Date:       {result.invoice_due_date}")
+    print(f"  OCR:            {result.ocr_number}")
+    print(f"  Bankgiro:       {result.bankgiro}")
+    print(f"  Plusgiro:       {result.plusgiro}")
+    print(f"  Amount:         {result.amount}")
+    print(f"  Org Number:     {result.supplier_organisation_number}")
+
+    # Show comparison
+    print("\n" + "-" * 50)
+    print("Comparison with autolabel:")
+    comparison = validator.compare_results(doc_id)
+    for field, data in comparison.items():
+        if data.get('csv_value'):
+            status = "✓" if data['agreement'] else "✗"
+            auto_status = "matched" if data['autolabel_matched'] else "FAILED"
+            print(f"  {status} {field}:")
+            print(f"      CSV:       {data['csv_value']}")
+            print(f"      Autolabel: {data['autolabel_text']} ({auto_status})")
+            print(f"      LLM:       {data['llm_value']}")
+
+
+def validate_batch(validator, limit: int, provider: str, model: str):
+    """Validate a batch of documents."""
+    print(f"\nValidating up to {limit} documents with failed matches")
+    print(f"Provider: {provider}, Model: {model or 'default'}")
+    print()
+
+    results = validator.validate_batch(
+        limit=limit,
+        provider=provider,
+        model=model,
+        verbose=True
+    )
+
+    # Summary
+    success = sum(1 for r in results if not r.error)
+    failed = len(results) - success
+    total_time = sum(r.processing_time_ms or 0 for r in results)
+
+    print("\n" + "=" * 50)
+    print("Validation Complete")
+    print("=" * 50)
+    print(f"Total:     {len(results)}")
+    print(f"Success:   {success}")
+    print(f"Failed:    {failed}")
+    print(f"Total time: {total_time/1000:.1f}s")
+    if success > 0:
+        print(f"Avg time:   {total_time/success:.0f}ms per document")
+
+
+def compare_single(validator, doc_id: str):
+    """Compare results for a single document."""
+    comparison = validator.compare_results(doc_id)
+
+    if 'error' in comparison:
+        print(f"Error: {comparison['error']}")
+        return
+
+    print(f"\nComparison for document: {doc_id}")
+    print("=" * 60)
+
+    for field, data in comparison.items():
+        if data.get('csv_value') is None:
+            continue
+
+        status = "✓" if data['agreement'] else "✗"
+        auto_status = "matched" if data['autolabel_matched'] else "FAILED"
+
+        print(f"\n{status} {field}:")
+        print(f"    CSV value:      {data['csv_value']}")
+        print(f"    Autolabel:      {data['autolabel_text']} ({auto_status})")
+        print(f"    LLM extracted:  {data['llm_value']}")
+
+
+def compare_all(validator, limit: int):
+    """Show comparison summary for all validated documents."""
+    conn = validator.connect()
+    with conn.cursor() as cursor:
+        cursor.execute("""
+            SELECT document_id FROM llm_validations
+            WHERE error IS NULL
+            ORDER BY created_at DESC
+            LIMIT %s
+        """, (limit,))
+
+        doc_ids = [row[0] for row in cursor.fetchall()]
+
+    if not doc_ids:
+        print("No validated documents found.")
+        return
+
+    print(f"\nComparison Summary ({len(doc_ids)} documents)")
+    print("=" * 80)
+
+    # Aggregate stats
+    field_stats = {}
+
+    for doc_id in doc_ids:
+        comparison = validator.compare_results(doc_id)
+        if 'error' in comparison:
+            continue
+
+        for field, data in comparison.items():
+            if data.get('csv_value') is None:
+                continue
+
+            if field not in field_stats:
+                field_stats[field] = {
+                    'total': 0,
+                    'autolabel_matched': 0,
+                    'llm_agrees': 0,
+                    'llm_correct_auto_wrong': 0,
+                }
+
+            stats = field_stats[field]
+            stats['total'] += 1
+
+            if data['autolabel_matched']:
+                stats['autolabel_matched'] += 1
+
+            if data['agreement']:
+                stats['llm_agrees'] += 1
+
+            # LLM found correct value when autolabel failed
+            if not data['autolabel_matched'] and data['agreement']:
+                stats['llm_correct_auto_wrong'] += 1
+
+    print(f"\n{'Field':<30} {'Total':>6} {'Auto OK':>8} {'LLM Agrees':>10} {'LLM Found':>10}")
+    print("-" * 80)
+
+    for field, stats in sorted(field_stats.items()):
+        print(f"{field:<30} {stats['total']:>6} {stats['autolabel_matched']:>8} "
+              f"{stats['llm_agrees']:>10} {stats['llm_correct_auto_wrong']:>10}")
+
+
+def generate_report(validator, output_path: str):
+    """Generate a detailed validation report."""
+    import json
+    from datetime import datetime
+
+    conn = validator.connect()
+    with conn.cursor() as cursor:
+        # Get all validated documents
+        cursor.execute("""
+            SELECT document_id, invoice_number, invoice_date, invoice_due_date,
+                   ocr_number, bankgiro, plusgiro, amount,
+                   supplier_organisation_number, model_used, processing_time_ms,
+                   error, created_at
+            FROM llm_validations
+            ORDER BY created_at DESC
+        """)
+
+        validations = []
+        for row in cursor.fetchall():
+            doc_id = row[0]
+            comparison = validator.compare_results(doc_id) if not row[11] else {}
+
+            validations.append({
+                'document_id': doc_id,
+                'llm_extraction': {
+                    'invoice_number': row[1],
+                    'invoice_date': row[2],
+                    'invoice_due_date': row[3],
+                    'ocr_number': row[4],
+                    'bankgiro': row[5],
+                    'plusgiro': row[6],
+                    'amount': row[7],
+                    'supplier_organisation_number': row[8],
+                },
+                'model_used': row[9],
+                'processing_time_ms': row[10],
+                'error': row[11],
+                'created_at': str(row[12]) if row[12] else None,
+                'comparison': comparison,
+            })
+
+    # Calculate summary stats
+    stats = validator.get_failed_match_stats()
+
+    report = {
+        'generated_at': datetime.now().isoformat(),
+        'summary': {
+            'total_documents_with_failures': stats['documents_with_failures'],
+            'documents_validated': stats['already_validated'],
+            'failures_by_field': stats['failures_by_field'],
+        },
+        'validations': validations,
+    }
+
+    # Write report
+    output_path = Path(output_path)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+
+    with open(output_path, 'w', encoding='utf-8') as f:
+        json.dump(report, f, indent=2, ensure_ascii=False)
+
+    print(f"\nReport generated: {output_path}")
+    print(f"Total validations: {len(validations)}")
+
+
+if __name__ == '__main__':
+    main()
--- a/src/data/csv_loader.py
+++ b/src/data/csv_loader.py
@@ -289,8 +289,11 @@ class CSVLoader:
        # Try default naming patterns
        patterns = [
            f"{doc_id}.pdf",
+            f"{doc_id}.PDF",
            f"{doc_id.lower()}.pdf",
+            f"{doc_id.lower()}.PDF",
            f"{doc_id.upper()}.pdf",
+            f"{doc_id.upper()}.PDF",
        ]

        for pattern in patterns:
@@ -298,9 +301,11 @@ class CSVLoader:
            if pdf_path.exists():
                return pdf_path

-        # Try glob patterns for partial matches
+        # Try glob patterns for partial matches (both cases)
        for pdf_file in self.pdf_dir.glob(f"*{doc_id}*.pdf"):
            return pdf_file
+        for pdf_file in self.pdf_dir.glob(f"*{doc_id}*.PDF"):
+            return pdf_file

        return None

--- a/src/data/test_csv_loader.py
+++ b/src/data/test_csv_loader.py
@@ -0,0 +1,534 @@
+"""
+Tests for the CSV Data Loader Module.
+
+Tests cover all loader functions in src/data/csv_loader.py
+
+Usage:
+    pytest src/data/test_csv_loader.py -v -o 'addopts='
+"""
+
+import pytest
+import tempfile
+from pathlib import Path
+from datetime import date
+from decimal import Decimal
+from src.data.csv_loader import (
+    InvoiceRow,
+    CSVLoader,
+    load_invoice_csv,
+)
+
+
+class TestInvoiceRow:
+    """Tests for InvoiceRow dataclass."""
+
+    def test_creation_minimal(self):
+        """Should create InvoiceRow with only required field."""
+        row = InvoiceRow(DocumentId="DOC001")
+        assert row.DocumentId == "DOC001"
+        assert row.InvoiceDate is None
+        assert row.Amount is None
+
+    def test_creation_full(self):
+        """Should create InvoiceRow with all fields."""
+        row = InvoiceRow(
+            DocumentId="DOC001",
+            InvoiceDate=date(2025, 1, 15),
+            InvoiceNumber="INV-001",
+            InvoiceDueDate=date(2025, 2, 15),
+            OCR="1234567890",
+            Message="Test message",
+            Bankgiro="5393-9484",
+            Plusgiro="123456-7",
+            Amount=Decimal("1234.56"),
+            split="train",
+            customer_number="CUST001",
+            supplier_name="Test Supplier",
+            supplier_organisation_number="556123-4567",
+            supplier_accounts="BG:5393-9484",
+        )
+        assert row.DocumentId == "DOC001"
+        assert row.InvoiceDate == date(2025, 1, 15)
+        assert row.Amount == Decimal("1234.56")
+
+    def test_to_dict(self):
+        """Should convert to dictionary correctly."""
+        row = InvoiceRow(
+            DocumentId="DOC001",
+            InvoiceDate=date(2025, 1, 15),
+            Amount=Decimal("100.50"),
+        )
+        d = row.to_dict()
+
+        assert d["DocumentId"] == "DOC001"
+        assert d["InvoiceDate"] == "2025-01-15"
+        assert d["Amount"] == "100.50"
+
+    def test_to_dict_none_values(self):
+        """Should handle None values in to_dict."""
+        row = InvoiceRow(DocumentId="DOC001")
+        d = row.to_dict()
+
+        assert d["DocumentId"] == "DOC001"
+        assert d["InvoiceDate"] is None
+        assert d["Amount"] is None
+
+    def test_get_field_value_date(self):
+        """Should get date field as ISO string."""
+        row = InvoiceRow(
+            DocumentId="DOC001",
+            InvoiceDate=date(2025, 1, 15),
+        )
+        assert row.get_field_value("InvoiceDate") == "2025-01-15"
+
+    def test_get_field_value_decimal(self):
+        """Should get Decimal field as string."""
+        row = InvoiceRow(
+            DocumentId="DOC001",
+            Amount=Decimal("1234.56"),
+        )
+        assert row.get_field_value("Amount") == "1234.56"
+
+    def test_get_field_value_string(self):
+        """Should get string field as-is."""
+        row = InvoiceRow(
+            DocumentId="DOC001",
+            InvoiceNumber="INV-001",
+        )
+        assert row.get_field_value("InvoiceNumber") == "INV-001"
+
+    def test_get_field_value_none(self):
+        """Should return None for missing field."""
+        row = InvoiceRow(DocumentId="DOC001")
+        assert row.get_field_value("InvoiceNumber") is None
+
+    def test_get_field_value_unknown_field(self):
+        """Should return None for unknown field."""
+        row = InvoiceRow(DocumentId="DOC001")
+        assert row.get_field_value("UnknownField") is None
+
+
+class TestCSVLoaderParseDate:
+    """Tests for CSVLoader._parse_date method."""
+
+    def test_parse_iso_format(self):
+        """Should parse ISO date format."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_date("2025-01-15") == date(2025, 1, 15)
+
+    def test_parse_iso_with_time(self):
+        """Should parse ISO format with time."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_date("2025-01-15 12:30:45") == date(2025, 1, 15)
+
+    def test_parse_iso_with_microseconds(self):
+        """Should parse ISO format with microseconds."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_date("2025-01-15 12:30:45.123456") == date(2025, 1, 15)
+
+    def test_parse_european_slash(self):
+        """Should parse DD/MM/YYYY format."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_date("15/01/2025") == date(2025, 1, 15)
+
+    def test_parse_european_dot(self):
+        """Should parse DD.MM.YYYY format."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_date("15.01.2025") == date(2025, 1, 15)
+
+    def test_parse_european_dash(self):
+        """Should parse DD-MM-YYYY format."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_date("15-01-2025") == date(2025, 1, 15)
+
+    def test_parse_compact(self):
+        """Should parse YYYYMMDD format."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_date("20250115") == date(2025, 1, 15)
+
+    def test_parse_empty(self):
+        """Should return None for empty string."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_date("") is None
+        assert loader._parse_date("   ") is None
+
+    def test_parse_none(self):
+        """Should return None for None input."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_date(None) is None
+
+    def test_parse_invalid(self):
+        """Should return None for invalid date."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_date("not-a-date") is None
+
+
+class TestCSVLoaderParseAmount:
+    """Tests for CSVLoader._parse_amount method."""
+
+    def test_parse_simple_integer(self):
+        """Should parse simple integer."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_amount("100") == Decimal("100")
+
+    def test_parse_decimal_dot(self):
+        """Should parse decimal with dot."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_amount("100.50") == Decimal("100.50")
+
+    def test_parse_decimal_comma(self):
+        """Should parse European format with comma."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_amount("100,50") == Decimal("100.50")
+
+    def test_parse_with_thousand_separator_space(self):
+        """Should handle space as thousand separator."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_amount("1 234,56") == Decimal("1234.56")
+
+    def test_parse_with_thousand_separator_comma(self):
+        """Should handle comma as thousand separator when dot is decimal."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_amount("1,234.56") == Decimal("1234.56")
+
+    def test_parse_with_currency_sek(self):
+        """Should remove SEK suffix."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_amount("100 SEK") == Decimal("100")
+
+    def test_parse_with_currency_kr(self):
+        """Should remove kr suffix."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_amount("100 kr") == Decimal("100")
+
+    def test_parse_with_colon_dash(self):
+        """Should remove :- suffix."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_amount("100:-") == Decimal("100")
+
+    def test_parse_empty(self):
+        """Should return None for empty string."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_amount("") is None
+        assert loader._parse_amount("   ") is None
+
+    def test_parse_none(self):
+        """Should return None for None input."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_amount(None) is None
+
+    def test_parse_invalid(self):
+        """Should return None for invalid amount."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_amount("not-an-amount") is None
+
+
+class TestCSVLoaderParseString:
+    """Tests for CSVLoader._parse_string method."""
+
+    def test_parse_normal_string(self):
+        """Should return stripped string."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_string("  hello  ") == "hello"
+
+    def test_parse_empty_string(self):
+        """Should return None for empty string."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_string("") is None
+        assert loader._parse_string("   ") is None
+
+    def test_parse_none(self):
+        """Should return None for None input."""
+        loader = CSVLoader.__new__(CSVLoader)
+        assert loader._parse_string(None) is None
+
+
+class TestCSVLoaderWithFile:
+    """Tests for CSVLoader with actual CSV files."""
+
+    @pytest.fixture
+    def sample_csv(self, tmp_path):
+        """Create a sample CSV file for testing."""
+        csv_content = """DocumentId,InvoiceDate,InvoiceNumber,Amount,Bankgiro
+DOC001,2025-01-15,INV-001,100.50,5393-9484
+DOC002,2025-01-16,INV-002,200.00,1234-5678
+DOC003,2025-01-17,INV-003,300.75,
+"""
+        csv_file = tmp_path / "test.csv"
+        csv_file.write_text(csv_content, encoding="utf-8")
+        return csv_file
+
+    @pytest.fixture
+    def sample_csv_with_bom(self, tmp_path):
+        """Create a CSV file with BOM."""
+        csv_content = """DocumentId,InvoiceDate,Amount
+DOC001,2025-01-15,100.50
+"""
+        csv_file = tmp_path / "test_bom.csv"
+        csv_file.write_text(csv_content, encoding="utf-8-sig")
+        return csv_file
+
+    def test_load_all(self, sample_csv):
+        """Should load all rows from CSV."""
+        loader = CSVLoader(sample_csv)
+        rows = loader.load_all()
+
+        assert len(rows) == 3
+        assert rows[0].DocumentId == "DOC001"
+        assert rows[1].DocumentId == "DOC002"
+        assert rows[2].DocumentId == "DOC003"
+
+    def test_iter_rows(self, sample_csv):
+        """Should iterate over rows."""
+        loader = CSVLoader(sample_csv)
+        rows = list(loader.iter_rows())
+
+        assert len(rows) == 3
+
+    def test_parse_fields_correctly(self, sample_csv):
+        """Should parse all fields correctly."""
+        loader = CSVLoader(sample_csv)
+        rows = loader.load_all()
+
+        row = rows[0]
+        assert row.InvoiceDate == date(2025, 1, 15)
+        assert row.InvoiceNumber == "INV-001"
+        assert row.Amount == Decimal("100.50")
+        assert row.Bankgiro == "5393-9484"
+
+    def test_handles_empty_fields(self, sample_csv):
+        """Should handle empty fields as None."""
+        loader = CSVLoader(sample_csv)
+        rows = loader.load_all()
+
+        row = rows[2]  # Last row has empty Bankgiro
+        assert row.Bankgiro is None
+
+    def test_handles_bom(self, sample_csv_with_bom):
+        """Should handle files with BOM correctly."""
+        loader = CSVLoader(sample_csv_with_bom)
+        rows = loader.load_all()
+
+        assert len(rows) == 1
+        assert rows[0].DocumentId == "DOC001"
+
+    def test_get_row_by_id(self, sample_csv):
+        """Should get specific row by DocumentId."""
+        loader = CSVLoader(sample_csv)
+
+        row = loader.get_row_by_id("DOC002")
+        assert row is not None
+        assert row.InvoiceNumber == "INV-002"
+
+    def test_get_row_by_id_not_found(self, sample_csv):
+        """Should return None for non-existent DocumentId."""
+        loader = CSVLoader(sample_csv)
+
+        row = loader.get_row_by_id("NONEXISTENT")
+        assert row is None
+
+
+class TestCSVLoaderMultipleFiles:
+    """Tests for CSVLoader with multiple CSV files."""
+
+    @pytest.fixture
+    def multiple_csvs(self, tmp_path):
+        """Create multiple CSV files for testing."""
+        csv1 = tmp_path / "file1.csv"
+        csv1.write_text("""DocumentId,InvoiceNumber
+DOC001,INV-001
+DOC002,INV-002
+""", encoding="utf-8")
+
+        csv2 = tmp_path / "file2.csv"
+        csv2.write_text("""DocumentId,InvoiceNumber
+DOC003,INV-003
+DOC004,INV-004
+""", encoding="utf-8")
+
+        return [csv1, csv2]
+
+    def test_load_from_list(self, multiple_csvs):
+        """Should load from list of CSV paths."""
+        loader = CSVLoader(multiple_csvs)
+        rows = loader.load_all()
+
+        assert len(rows) == 4
+        doc_ids = [r.DocumentId for r in rows]
+        assert "DOC001" in doc_ids
+        assert "DOC004" in doc_ids
+
+    def test_load_from_glob(self, multiple_csvs, tmp_path):
+        """Should load from glob pattern."""
+        loader = CSVLoader(tmp_path / "*.csv")
+        rows = loader.load_all()
+
+        assert len(rows) == 4
+
+    def test_deduplicates_by_doc_id(self, tmp_path):
+        """Should deduplicate rows by DocumentId across files."""
+        csv1 = tmp_path / "file1.csv"
+        csv1.write_text("""DocumentId,InvoiceNumber
+DOC001,INV-001
+""", encoding="utf-8")
+
+        csv2 = tmp_path / "file2.csv"
+        csv2.write_text("""DocumentId,InvoiceNumber
+DOC001,INV-001-DUPLICATE
+""", encoding="utf-8")
+
+        loader = CSVLoader([csv1, csv2])
+        rows = loader.load_all()
+
+        assert len(rows) == 1
+        assert rows[0].InvoiceNumber == "INV-001"  # First one wins
+
+
+class TestCSVLoaderPDFPath:
+    """Tests for CSVLoader.get_pdf_path method."""
+
+    @pytest.fixture
+    def setup_pdf_dir(self, tmp_path):
+        """Create PDF directory with some files."""
+        pdf_dir = tmp_path / "pdfs"
+        pdf_dir.mkdir()
+
+        # Create some dummy PDF files
+        (pdf_dir / "DOC001.pdf").touch()
+        (pdf_dir / "doc002.pdf").touch()
+        (pdf_dir / "INVOICE_DOC003.pdf").touch()
+
+        csv_file = tmp_path / "test.csv"
+        csv_file.write_text("""DocumentId,InvoiceNumber
+DOC001,INV-001
+DOC002,INV-002
+DOC003,INV-003
+DOC004,INV-004
+""", encoding="utf-8")
+
+        return csv_file, pdf_dir
+
+    def test_find_exact_match(self, setup_pdf_dir):
+        """Should find PDF with exact name match."""
+        csv_file, pdf_dir = setup_pdf_dir
+        loader = CSVLoader(csv_file, pdf_dir)
+        rows = loader.load_all()
+
+        pdf_path = loader.get_pdf_path(rows[0])  # DOC001
+        assert pdf_path is not None
+        assert pdf_path.name == "DOC001.pdf"
+
+    def test_find_lowercase_match(self, setup_pdf_dir):
+        """Should find PDF with lowercase name."""
+        csv_file, pdf_dir = setup_pdf_dir
+        loader = CSVLoader(csv_file, pdf_dir)
+        rows = loader.load_all()
+
+        pdf_path = loader.get_pdf_path(rows[1])  # DOC002 -> doc002.pdf
+        assert pdf_path is not None
+        assert pdf_path.name == "doc002.pdf"
+
+    def test_find_glob_match(self, setup_pdf_dir):
+        """Should find PDF using glob pattern."""
+        csv_file, pdf_dir = setup_pdf_dir
+        loader = CSVLoader(csv_file, pdf_dir)
+        rows = loader.load_all()
+
+        pdf_path = loader.get_pdf_path(rows[2])  # DOC003 -> INVOICE_DOC003.pdf
+        assert pdf_path is not None
+        assert "DOC003" in pdf_path.name
+
+    def test_not_found(self, setup_pdf_dir):
+        """Should return None when PDF not found."""
+        csv_file, pdf_dir = setup_pdf_dir
+        loader = CSVLoader(csv_file, pdf_dir)
+        rows = loader.load_all()
+
+        pdf_path = loader.get_pdf_path(rows[3])  # DOC004 - no PDF
+        assert pdf_path is None
+
+
+class TestCSVLoaderValidate:
+    """Tests for CSVLoader.validate method."""
+
+    def test_validate_missing_pdf(self, tmp_path):
+        """Should report missing PDF files."""
+        csv_file = tmp_path / "test.csv"
+        csv_file.write_text("""DocumentId,InvoiceNumber
+DOC001,INV-001
+""", encoding="utf-8")
+
+        loader = CSVLoader(csv_file, tmp_path)
+        issues = loader.validate()
+
+        assert len(issues) >= 1
+        pdf_issues = [i for i in issues if i.get("field") == "PDF"]
+        assert len(pdf_issues) == 1
+
+    def test_validate_no_matchable_fields(self, tmp_path):
+        """Should report rows with no matchable fields."""
+        csv_file = tmp_path / "test.csv"
+        csv_file.write_text("""DocumentId,Message
+DOC001,Just a message
+""", encoding="utf-8")
+
+        # Create a PDF so we only get the matchable fields issue
+        pdf_dir = tmp_path / "pdfs"
+        pdf_dir.mkdir()
+        (pdf_dir / "DOC001.pdf").touch()
+
+        loader = CSVLoader(csv_file, pdf_dir)
+        issues = loader.validate()
+
+        field_issues = [i for i in issues if i.get("field") == "All"]
+        assert len(field_issues) == 1
+
+
+class TestCSVLoaderAlternateFieldNames:
+    """Tests for alternate field name support."""
+
+    def test_lowercase_field_names(self, tmp_path):
+        """Should accept lowercase field names."""
+        csv_file = tmp_path / "test.csv"
+        csv_file.write_text("""document_id,invoice_date,invoice_number,amount
+DOC001,2025-01-15,INV-001,100.50
+""", encoding="utf-8")
+
+        loader = CSVLoader(csv_file)
+        rows = loader.load_all()
+
+        assert len(rows) == 1
+        assert rows[0].DocumentId == "DOC001"
+        assert rows[0].InvoiceDate == date(2025, 1, 15)
+
+    def test_alternate_amount_field(self, tmp_path):
+        """Should accept invoice_data_amount as Amount field."""
+        csv_file = tmp_path / "test.csv"
+        csv_file.write_text("""DocumentId,invoice_data_amount
+DOC001,100.50
+""", encoding="utf-8")
+
+        loader = CSVLoader(csv_file)
+        rows = loader.load_all()
+
+        assert rows[0].Amount == Decimal("100.50")
+
+
+class TestLoadInvoiceCSV:
+    """Tests for load_invoice_csv convenience function."""
+
+    def test_load_single_file(self, tmp_path):
+        """Should load from single CSV file."""
+        csv_file = tmp_path / "test.csv"
+        csv_file.write_text("""DocumentId,InvoiceNumber
+DOC001,INV-001
+""", encoding="utf-8")
+
+        rows = load_invoice_csv(csv_file)
+
+        assert len(rows) == 1
+        assert rows[0].DocumentId == "DOC001"
+
+
+if __name__ == "__main__":
+    pytest.main([__file__, "-v"])
--- a/src/inference/field_extractor.py
+++ b/src/inference/field_extractor.py
@@ -238,18 +238,77 @@ class FieldExtractor:
        elif field_name in ('InvoiceDate', 'InvoiceDueDate'):
            return self._normalize_date(text)

+        elif field_name == 'payment_line':
+            return self._normalize_payment_line(text)
+
+        elif field_name == 'supplier_org_number':
+            return self._normalize_supplier_org_number(text)
+
+        elif field_name == 'customer_number':
+            return self._normalize_customer_number(text)
+
        else:
            return text, True, None

    def _normalize_invoice_number(self, text: str) -> tuple[str | None, bool, str | None]:
-        """Normalize invoice number."""
-        # Extract digits only
+        """
+        Normalize invoice number.
+
+        Invoice numbers can be:
+        - Pure digits: 12345678
+        - Alphanumeric: A3861, INV-2024-001, F12345
+        - With separators: 2024/001, 2024-001
+
+        Strategy:
+        1. Look for common invoice number patterns
+        2. Prefer shorter, more specific matches over long digit sequences
+        """
+        # Pattern 1: Alphanumeric invoice number (letter + digits or digits + letter)
+        # Examples: A3861, F12345, INV001
+        alpha_patterns = [
+            r'\b([A-Z]{1,3}\d{3,10})\b',  # A3861, INV12345
+            r'\b(\d{3,10}[A-Z]{1,3})\b',  # 12345A
+            r'\b([A-Z]{2,5}[-/]?\d{3,10})\b',  # INV-12345, FAK12345
+        ]
+
+        for pattern in alpha_patterns:
+            match = re.search(pattern, text, re.IGNORECASE)
+            if match:
+                return match.group(1).upper(), True, None
+
+        # Pattern 2: Invoice number with year prefix (2024-001, 2024/12345)
+        year_pattern = r'\b(20\d{2}[-/]\d{3,8})\b'
+        match = re.search(year_pattern, text)
+        if match:
+            return match.group(1), True, None
+
+        # Pattern 3: Short digit sequence (3-10 digits) - prefer shorter sequences
+        # This avoids capturing long OCR numbers
+        digit_sequences = re.findall(r'\b(\d{3,10})\b', text)
+        if digit_sequences:
+            # Prefer shorter sequences (more likely to be invoice number)
+            # Also filter out sequences that look like dates (8 digits starting with 20)
+            valid_sequences = []
+            for seq in digit_sequences:
+                # Skip if it looks like a date (YYYYMMDD)
+                if len(seq) == 8 and seq.startswith('20'):
+                    continue
+                # Skip if too long (likely OCR number)
+                if len(seq) > 10:
+                    continue
+                valid_sequences.append(seq)
+
+            if valid_sequences:
+                # Return shortest valid sequence
+                return min(valid_sequences, key=len), True, None
+
+        # Fallback: extract all digits if nothing else works
        digits = re.sub(r'\D', '', text)
+        if len(digits) >= 3:
+            # Limit to first 15 digits to avoid very long sequences
+            return digits[:15], True, "Fallback extraction"

-        if len(digits) < 3:
-            return None, False, f"Too few digits: {len(digits)}"
-
-        return digits, True, None
+        return None, False, f"Cannot extract invoice number from: {text[:50]}"

    def _normalize_ocr_number(self, text: str) -> tuple[str | None, bool, str | None]:
        """Normalize OCR number."""
@@ -260,33 +319,174 @@ class FieldExtractor:

        return digits, True, None

-    def _normalize_bankgiro(self, text: str) -> tuple[str | None, bool, str | None]:
-        """Normalize Bankgiro number."""
-        digits = re.sub(r'\D', '', text)
+    def _luhn_checksum(self, digits: str) -> bool:
+        """
+        Validate using Luhn (Mod10) algorithm.
+        Used for Bankgiro, Plusgiro, and OCR number validation.

-        if len(digits) == 8:
-            # Format as XXXX-XXXX
-            formatted = f"{digits[:4]}-{digits[4:]}"
-            return formatted, True, None
-        elif len(digits) == 7:
-            # Format as XXX-XXXX
-            formatted = f"{digits[:3]}-{digits[3:]}"
-            return formatted, True, None
-        elif 6 <= len(digits) <= 9:
-            return digits, True, None
-        else:
-            return None, False, f"Invalid Bankgiro length: {len(digits)}"
+        The checksum is valid if the total modulo 10 equals 0.
+        """
+        if not digits.isdigit():
+            return False
+
+        total = 0
+        for i, char in enumerate(reversed(digits)):
+            digit = int(char)
+            if i % 2 == 1:  # Double every second digit from right
+                digit *= 2
+                if digit > 9:
+                    digit -= 9
+            total += digit
+
+        return total % 10 == 0
+
+    def _detect_giro_type(self, text: str) -> str | None:
+        """
+        Detect if text matches BG or PG display format pattern.
+
+        BG typical format: ^\d{3,4}-\d{4}$  (e.g., 123-4567, 1234-5678)
+        PG typical format: ^\d{1,7}-\d$     (e.g., 1-8, 12345-6, 1234567-8)
+
+        Returns: 'BG', 'PG', or None if cannot determine
+        """
+        text = text.strip()
+
+        # BG pattern: 3-4 digits, dash, 4 digits (total 7-8 digits)
+        if re.match(r'^\d{3,4}-\d{4}$', text):
+            return 'BG'
+
+        # PG pattern: 1-7 digits, dash, 1 digit (total 2-8 digits)
+        if re.match(r'^\d{1,7}-\d$', text):
+            return 'PG'
+
+        return None
+
+    def _normalize_bankgiro(self, text: str) -> tuple[str | None, bool, str | None]:
+        """
+        Normalize Bankgiro number.
+
+        Bankgiro rules:
+        - 7 or 8 digits only
+        - Last digit is Luhn (Mod10) check digit
+        - Display format: XXX-XXXX (7 digits) or XXXX-XXXX (8 digits)
+
+        Display pattern: ^\d{3,4}-\d{4}$
+        Normalized pattern: ^\d{7,8}$
+
+        Note: Text may contain both BG and PG numbers. We specifically look for
+        BG display format (XXX-XXXX or XXXX-XXXX) to extract the correct one.
+        """
+        # Look for BG display format pattern: 3-4 digits, dash, 4 digits
+        # This distinguishes BG from PG which uses X-X format (digits-single digit)
+        bg_matches = re.findall(r'(\d{3,4})-(\d{4})', text)
+
+        if bg_matches:
+            # Try each match and find one with valid Luhn
+            for match in bg_matches:
+                digits = match[0] + match[1]
+                if len(digits) in (7, 8) and self._luhn_checksum(digits):
+                    # Valid BG found
+                    if len(digits) == 8:
+                        formatted = f"{digits[:4]}-{digits[4:]}"
+                    else:
+                        formatted = f"{digits[:3]}-{digits[3:]}"
+                    return formatted, True, None
+
+            # No valid Luhn, use first match
+            digits = bg_matches[0][0] + bg_matches[0][1]
+            if len(digits) in (7, 8):
+                if len(digits) == 8:
+                    formatted = f"{digits[:4]}-{digits[4:]}"
+                else:
+                    formatted = f"{digits[:3]}-{digits[3:]}"
+                return formatted, True, f"Luhn checksum failed (possible OCR error)"
+
+        # Fallback: try to find 7-8 consecutive digits
+        # But first check if text contains PG format (XXXXXXX-X), if so don't use fallback
+        # to avoid misinterpreting PG as BG
+        pg_format_present = re.search(r'(?<![0-9])\d{1,7}-\d(?!\d)', text)
+        if pg_format_present:
+            return None, False, f"No valid Bankgiro found in text"
+
+        digit_match = re.search(r'\b(\d{7,8})\b', text)
+        if digit_match:
+            digits = digit_match.group(1)
+            if len(digits) in (7, 8):
+                luhn_ok = self._luhn_checksum(digits)
+                if len(digits) == 8:
+                    formatted = f"{digits[:4]}-{digits[4:]}"
+                else:
+                    formatted = f"{digits[:3]}-{digits[3:]}"
+                if luhn_ok:
+                    return formatted, True, None
+                else:
+                    return formatted, True, f"Luhn checksum failed (possible OCR error)"
+
+        return None, False, f"No valid Bankgiro found in text"

    def _normalize_plusgiro(self, text: str) -> tuple[str | None, bool, str | None]:
-        """Normalize Plusgiro number."""
-        digits = re.sub(r'\D', '', text)
+        """
+        Normalize Plusgiro number.

-        if len(digits) >= 6:
-            # Format as XXXXXXX-X
+        Plusgiro rules:
+        - 2 to 8 digits
+        - Last digit is Luhn (Mod10) check digit
+        - Display format: XXXXXXX-X (all digits except last, dash, last digit)
+
+        Display pattern: ^\d{1,7}-\d$
+        Normalized pattern: ^\d{2,8}$
+
+        Note: Text may contain both BG and PG numbers. We specifically look for
+        PG display format (X-X, XX-X, ..., XXXXXXX-X) to extract the correct one.
+        """
+        # First look for PG display format: 1-7 digits (possibly with spaces), dash, 1 digit
+        # This is distinct from BG format which has 4 digits after the dash
+        # Pattern allows spaces within the number like "486 98 63-6"
+        # (?<![0-9]) ensures we don't start from within another number (like BG)
+        pg_matches = re.findall(r'(?<![0-9])(\d[\d\s]{0,10})-(\d)(?!\d)', text)
+
+        if pg_matches:
+            # Try each match and find one with valid Luhn
+            for match in pg_matches:
+                # Remove spaces from the first part
+                digits = re.sub(r'\s', '', match[0]) + match[1]
+                if 2 <= len(digits) <= 8 and self._luhn_checksum(digits):
+                    # Valid PG found
+                    formatted = f"{digits[:-1]}-{digits[-1]}"
+                    return formatted, True, None
+
+            # No valid Luhn, use first match with most digits
+            best_match = max(pg_matches, key=lambda m: len(re.sub(r'\s', '', m[0])))
+            digits = re.sub(r'\s', '', best_match[0]) + best_match[1]
+            if 2 <= len(digits) <= 8:
+                formatted = f"{digits[:-1]}-{digits[-1]}"
+                return formatted, True, f"Luhn checksum failed (possible OCR error)"
+
+        # If no PG format found, extract all digits and format as PG
+        # This handles cases where the number might be in BG format or raw digits
+        all_digits = re.sub(r'\D', '', text)
+
+        # Try to find a valid 2-8 digit sequence
+        if 2 <= len(all_digits) <= 8:
+            luhn_ok = self._luhn_checksum(all_digits)
+            formatted = f"{all_digits[:-1]}-{all_digits[-1]}"
+            if luhn_ok:
+                return formatted, True, None
+            else:
+                return formatted, True, f"Luhn checksum failed (possible OCR error)"
+
+        # Try to find any 2-8 digit sequence in text
+        digit_match = re.search(r'\b(\d{2,8})\b', text)
+        if digit_match:
+            digits = digit_match.group(1)
+            luhn_ok = self._luhn_checksum(digits)
            formatted = f"{digits[:-1]}-{digits[-1]}"
-            return formatted, True, None
-        else:
-            return None, False, f"Invalid Plusgiro length: {len(digits)}"
+            if luhn_ok:
+                return formatted, True, None
+            else:
+                return formatted, True, f"Luhn checksum failed (possible OCR error)"
+
+        return None, False, f"No valid Plusgiro found in text"

    def _normalize_amount(self, text: str) -> tuple[str | None, bool, str | None]:
        """Normalize monetary amount."""
@@ -366,6 +566,169 @@ class FieldExtractor:

        return None, False, f"Cannot parse date: {text}"

+    def _normalize_payment_line(self, text: str) -> tuple[str | None, bool, str | None]:
+        """
+        Normalize payment line region text.
+
+        Extracts OCR, Amount, and Bankgiro from the payment line using MachineCodeParser.
+        """
+        from ..ocr.machine_code_parser import MachineCodeParser
+
+        # Create a simple token-like structure for the parser
+        # (The parser expects tokens, but for inference we have raw text)
+        parser = MachineCodeParser()
+
+        # Try to parse the standard payment line format
+        result = parser._parse_standard_payment_line(text)
+
+        if result:
+            # Format as structured output
+            parts = []
+            if result.get('ocr'):
+                parts.append(f"OCR:{result['ocr']}")
+            if result.get('amount'):
+                parts.append(f"Amount:{result['amount']}")
+            if result.get('bankgiro'):
+                parts.append(f"BG:{result['bankgiro']}")
+
+            if parts:
+                return ' '.join(parts), True, None
+
+        # Fallback: return raw text if no structured parsing possible
+        return text, True, None
+
+    def _normalize_supplier_org_number(self, text: str) -> tuple[str | None, bool, str | None]:
+        """
+        Normalize Swedish supplier organization number.
+
+        Extracts organization number in format: NNNNNN-NNNN (10 digits)
+        Also handles VAT numbers: SE + 10 digits + 01
+
+        Examples:
+            'org.nr. 516406-1102, Filialregistret...' -> '516406-1102'
+            'Momsreg.nr SE556123456701' -> '556123-4567'
+        """
+        # Pattern 1: Standard org number format: NNNNNN-NNNN
+        org_pattern = r'\b(\d{6})-?(\d{4})\b'
+        match = re.search(org_pattern, text)
+        if match:
+            org_num = f"{match.group(1)}-{match.group(2)}"
+            return org_num, True, None
+
+        # Pattern 2: VAT number format: SE + 10 digits + 01
+        vat_pattern = r'SE\s*(\d{10})01'
+        match = re.search(vat_pattern, text, re.IGNORECASE)
+        if match:
+            digits = match.group(1)
+            org_num = f"{digits[:6]}-{digits[6:]}"
+            return org_num, True, None
+
+        # Pattern 3: Just 10 consecutive digits
+        digits_pattern = r'\b(\d{10})\b'
+        match = re.search(digits_pattern, text)
+        if match:
+            digits = match.group(1)
+            # Validate: first digit should be 1-9 for Swedish org numbers
+            if digits[0] in '123456789':
+                org_num = f"{digits[:6]}-{digits[6:]}"
+                return org_num, True, None
+
+        return None, False, f"Cannot extract org number from: {text[:100]}"
+
+    def _normalize_customer_number(self, text: str) -> tuple[str | None, bool, str | None]:
+        """
+        Normalize customer number extracted from OCR.
+
+        Customer numbers can have various formats:
+        - With separators: 'JTY 576-3', 'EMM 256-6', 'FFL 019N'
+        - Compact (no separators): 'JTY5763', 'EMM2566', 'FFL019N'
+        - Mixed with names: 'VIKSTRÖM, ELIAS CH FFL 01' -> extract 'FFL 01'
+
+        Note: Spaces and dashes may be removed from invoice display,
+        so we need to match both 'JTY 576-3' and 'JTY5763' formats.
+        """
+        from ..normalize.normalizer import FieldNormalizer
+
+        # Clean the text using the same logic as matcher
+        text = FieldNormalizer.clean_text(text)
+
+        if not text:
+            return None, False, "Empty text"
+
+        # Customer number patterns - ordered by specificity
+        # Match both spaced/dashed versions and compact versions
+        customer_code_patterns = [
+            # Pattern: Letters + space/dash + digits + dash + digit (EMM 256-6, JTY 576-3)
+            r'\b([A-Z]{2,4}[\s\-]?\d{1,4}[\s\-]\d{1,2}[A-Z]?)\b',
+            # Pattern: Letters + space/dash + digits + optional letter (FFL 019N, ABC 123X)
+            r'\b([A-Z]{2,4}[\s\-]\d{2,4}[A-Z]?)\b',
+            # Pattern: Compact format - letters immediately followed by digits + optional letter (JTY5763, FFL019N)
+            r'\b([A-Z]{2,4}\d{3,6}[A-Z]?)\b',
+            # Pattern: Single letter + digits (A12345)
+            r'\b([A-Z]\d{4,6}[A-Z]?)\b',
+            # Pattern: Digits + dash/space + digits (123-456)
+            r'\b(\d{3,6}[\s\-]\d{1,4})\b',
+        ]
+
+        all_matches = []
+        for pattern in customer_code_patterns:
+            matches = re.findall(pattern, text, re.IGNORECASE)
+            all_matches.extend(matches)
+
+        if all_matches:
+            # Prefer longer matches and those appearing later in text (after names)
+            # Sort by position in text (later = better) and length (longer = better)
+            scored_matches = []
+            for match in all_matches:
+                pos = text.upper().rfind(match.upper())
+                # Score: position * 0.1 + length (prefer later and longer)
+                score = pos * 0.1 + len(match)
+                scored_matches.append((score, match))
+
+            best_match = max(scored_matches, key=lambda x: x[0])[1]
+            return best_match.strip().upper(), True, None
+
+        # Pattern 2: Look for explicit labels
+        labeled_patterns = [
+            r'(?:kund(?:nr|nummer|id)?|ert?\s*(?:kund)?(?:nr|nummer)?|customer\s*(?:no|number|id)?)\s*[:\.]?\s*([A-Za-z0-9][\w\s\-]{1,20}?)(?:\s{2,}|\n|$)',
+        ]
+
+        for pattern in labeled_patterns:
+            match = re.search(pattern, text, re.IGNORECASE)
+            if match:
+                extracted = match.group(1).strip()
+                extracted = re.sub(r'[\s\.\,\:]+$', '', extracted)
+                if extracted and len(extracted) >= 2:
+                    return extracted.upper(), True, None
+
+        # Pattern 3: If text contains comma (likely "NAME, NAME CODE"), extract after last comma
+        if ',' in text:
+            after_comma = text.split(',')[-1].strip()
+            # Look for alphanumeric code in the part after comma
+            for pattern in customer_code_patterns[:3]:  # Use first 3 patterns
+                code_match = re.search(pattern, after_comma, re.IGNORECASE)
+                if code_match:
+                    return code_match.group(1).strip().upper(), True, None
+
+        # Pattern 4: Short text - filter out name-like words
+        if len(text) <= 20:
+            words = text.split()
+            code_parts = []
+            for word in words:
+                # Keep if: contains digits, or is short uppercase (likely abbreviation)
+                if re.search(r'\d', word) or (len(word) <= 4 and word.isupper()):
+                    code_parts.append(word)
+            if code_parts:
+                result = ' '.join(code_parts).upper()
+                if len(result) >= 3:
+                    return result, True, None
+
+        # Fallback: return cleaned text if reasonable
+        if text and 3 <= len(text) <= 15:
+            return text.upper(), True, None
+
+        return None, False, f"Cannot extract customer number from: {text[:50]}"
+
    def extract_all_fields(
        self,
        detections: list[Detection],
--- a/src/inference/pipeline.py
+++ b/src/inference/pipeline.py
@@ -14,6 +14,21 @@ from .yolo_detector import YOLODetector, Detection, CLASS_TO_FIELD
 from .field_extractor import FieldExtractor, ExtractedField


+@dataclass
+class CrossValidationResult:
+    """Result of cross-validation between payment_line and other fields."""
+    is_valid: bool = False
+    ocr_match: bool | None = None  # None if not comparable
+    amount_match: bool | None = None
+    bankgiro_match: bool | None = None
+    plusgiro_match: bool | None = None
+    payment_line_ocr: str | None = None
+    payment_line_amount: str | None = None
+    payment_line_account: str | None = None
+    payment_line_account_type: str | None = None  # 'bankgiro' or 'plusgiro'
+    details: list[str] = field(default_factory=list)
+
+
@dataclass
 class InferenceResult:
    """Result of invoice processing."""
@@ -21,15 +36,17 @@ class InferenceResult:
    success: bool = False
    fields: dict[str, Any] = field(default_factory=dict)
    confidence: dict[str, float] = field(default_factory=dict)
+    bboxes: dict[str, tuple[float, float, float, float]] = field(default_factory=dict)  # Field bboxes in pixels
    raw_detections: list[Detection] = field(default_factory=list)
    extracted_fields: list[ExtractedField] = field(default_factory=list)
    processing_time_ms: float = 0.0
    errors: list[str] = field(default_factory=list)
    fallback_used: bool = False
+    cross_validation: CrossValidationResult | None = None

    def to_json(self) -> dict:
        """Convert to JSON-serializable dictionary."""
-        return {
+        result = {
            'DocumentId': self.document_id,
            'InvoiceNumber': self.fields.get('InvoiceNumber'),
            'InvoiceDate': self.fields.get('InvoiceDate'),
@@ -38,10 +55,31 @@ class InferenceResult:
            'Bankgiro': self.fields.get('Bankgiro'),
            'Plusgiro': self.fields.get('Plusgiro'),
            'Amount': self.fields.get('Amount'),
+            'supplier_org_number': self.fields.get('supplier_org_number'),
+            'customer_number': self.fields.get('customer_number'),
+            'payment_line': self.fields.get('payment_line'),
            'confidence': self.confidence,
            'success': self.success,
            'fallback_used': self.fallback_used
        }
+        # Add bboxes if present
+        if self.bboxes:
+            result['bboxes'] = {k: list(v) for k, v in self.bboxes.items()}
+        # Add cross-validation results if present
+        if self.cross_validation:
+            result['cross_validation'] = {
+                'is_valid': self.cross_validation.is_valid,
+                'ocr_match': self.cross_validation.ocr_match,
+                'amount_match': self.cross_validation.amount_match,
+                'bankgiro_match': self.cross_validation.bankgiro_match,
+                'plusgiro_match': self.cross_validation.plusgiro_match,
+                'payment_line_ocr': self.cross_validation.payment_line_ocr,
+                'payment_line_amount': self.cross_validation.payment_line_amount,
+                'payment_line_account': self.cross_validation.payment_line_account,
+                'payment_line_account_type': self.cross_validation.payment_line_account_type,
+                'details': self.cross_validation.details,
+            }
+        return result

    def get_field(self, field_name: str) -> tuple[Any, float]:
        """Get field value and confidence."""
@@ -170,6 +208,148 @@ class InferencePipeline:
            best = max(candidates, key=lambda x: x.confidence)
            result.fields[field_name] = best.normalized_value
            result.confidence[field_name] = best.confidence
+            # Store bbox for each field (useful for payment_line and other fields)
+            result.bboxes[field_name] = best.bbox
+
+        # Perform cross-validation if payment_line is detected
+        self._cross_validate_payment_line(result)
+
+    def _cross_validate_payment_line(self, result: InferenceResult) -> None:
+        """
+        Cross-validate payment_line data against other detected fields.
+        Payment line values take PRIORITY over individually detected fields.
+
+        Swedish payment line (Betalningsrad) contains:
+        - OCR reference number
+        - Amount (kronor and öre)
+        - Bankgiro or Plusgiro account number
+
+        This method:
+        1. Parses payment_line to extract OCR, Amount, Account
+        2. Compares with separately detected fields for validation
+        3. OVERWRITES detected fields with payment_line values (payment_line is authoritative)
+        """
+        payment_line = result.fields.get('payment_line')
+        if not payment_line:
+            return
+
+        cv = CrossValidationResult()
+        cv.details = []
+
+        # Parse payment_line format: "OCR:12345 Amount:100,00 BG:123-4567"
+        pl_parts = {}
+        for part in str(payment_line).split():
+            if ':' in part:
+                key, value = part.split(':', 1)
+                pl_parts[key.upper()] = value
+
+        cv.payment_line_ocr = pl_parts.get('OCR')
+        cv.payment_line_amount = pl_parts.get('AMOUNT')
+
+        # Determine account type from payment_line
+        if pl_parts.get('BG'):
+            cv.payment_line_account = pl_parts['BG']
+            cv.payment_line_account_type = 'bankgiro'
+        elif pl_parts.get('PG'):
+            cv.payment_line_account = pl_parts['PG']
+            cv.payment_line_account_type = 'plusgiro'
+
+        # Cross-validate and OVERRIDE with payment_line values
+
+        # OCR: payment_line takes priority
+        detected_ocr = result.fields.get('OCR')
+        if cv.payment_line_ocr:
+            pl_ocr_digits = re.sub(r'\D', '', cv.payment_line_ocr)
+            if detected_ocr:
+                detected_ocr_digits = re.sub(r'\D', '', str(detected_ocr))
+                cv.ocr_match = pl_ocr_digits == detected_ocr_digits
+                if cv.ocr_match:
+                    cv.details.append(f"OCR match: {cv.payment_line_ocr}")
+                else:
+                    cv.details.append(f"OCR: payment_line={cv.payment_line_ocr} (override detected={detected_ocr})")
+            else:
+                cv.details.append(f"OCR: {cv.payment_line_ocr} (from payment_line)")
+            # OVERRIDE: use payment_line OCR
+            result.fields['OCR'] = cv.payment_line_ocr
+            result.confidence['OCR'] = 0.95  # High confidence for payment_line
+
+        # Amount: payment_line takes priority
+        detected_amount = result.fields.get('Amount')
+        if cv.payment_line_amount:
+            if detected_amount:
+                pl_amount = self._normalize_amount_for_compare(cv.payment_line_amount)
+                det_amount = self._normalize_amount_for_compare(str(detected_amount))
+                cv.amount_match = pl_amount == det_amount
+                if cv.amount_match:
+                    cv.details.append(f"Amount match: {cv.payment_line_amount}")
+                else:
+                    cv.details.append(f"Amount: payment_line={cv.payment_line_amount} (override detected={detected_amount})")
+            else:
+                cv.details.append(f"Amount: {cv.payment_line_amount} (from payment_line)")
+            # OVERRIDE: use payment_line Amount
+            result.fields['Amount'] = cv.payment_line_amount
+            result.confidence['Amount'] = 0.95
+
+        # Bankgiro: compare only, do NOT override (payment_line account detection is unreliable)
+        detected_bankgiro = result.fields.get('Bankgiro')
+        if cv.payment_line_account_type == 'bankgiro' and cv.payment_line_account:
+            pl_bg_digits = re.sub(r'\D', '', cv.payment_line_account)
+            if detected_bankgiro:
+                det_bg_digits = re.sub(r'\D', '', str(detected_bankgiro))
+                cv.bankgiro_match = pl_bg_digits == det_bg_digits
+                if cv.bankgiro_match:
+                    cv.details.append(f"Bankgiro match confirmed: {detected_bankgiro}")
+                else:
+                    cv.details.append(f"Bankgiro mismatch: detected={detected_bankgiro}, payment_line={cv.payment_line_account}")
+            # Do NOT override - keep detected value
+
+        # Plusgiro: compare only, do NOT override (payment_line account detection is unreliable)
+        detected_plusgiro = result.fields.get('Plusgiro')
+        if cv.payment_line_account_type == 'plusgiro' and cv.payment_line_account:
+            pl_pg_digits = re.sub(r'\D', '', cv.payment_line_account)
+            if detected_plusgiro:
+                det_pg_digits = re.sub(r'\D', '', str(detected_plusgiro))
+                cv.plusgiro_match = pl_pg_digits == det_pg_digits
+                if cv.plusgiro_match:
+                    cv.details.append(f"Plusgiro match confirmed: {detected_plusgiro}")
+                else:
+                    cv.details.append(f"Plusgiro mismatch: detected={detected_plusgiro}, payment_line={cv.payment_line_account}")
+            # Do NOT override - keep detected value
+
+        # Determine overall validity
+        # Note: payment_line only contains ONE account (either BG or PG), so when invoice
+        # has both accounts, the other one cannot be matched - this is expected and OK.
+        # Only count the account type that payment_line actually has.
+        matches = [cv.ocr_match, cv.amount_match]
+
+        # Only include account match if payment_line has that account type
+        if cv.payment_line_account_type == 'bankgiro' and cv.bankgiro_match is not None:
+            matches.append(cv.bankgiro_match)
+        elif cv.payment_line_account_type == 'plusgiro' and cv.plusgiro_match is not None:
+            matches.append(cv.plusgiro_match)
+
+        valid_matches = [m for m in matches if m is not None]
+        if valid_matches:
+            match_count = sum(1 for m in valid_matches if m)
+            cv.is_valid = match_count >= min(2, len(valid_matches))
+            cv.details.append(f"Validation: {match_count}/{len(valid_matches)} fields match")
+        else:
+            # No comparison possible
+            cv.is_valid = True
+            cv.details.append("No comparison available from payment_line")
+
+        result.cross_validation = cv
+
+    def _normalize_amount_for_compare(self, amount: str) -> float | None:
+        """Normalize amount string to float for comparison."""
+        try:
+            # Remove spaces, convert comma to dot
+            cleaned = amount.replace(' ', '').replace(',', '.')
+            # Handle Swedish format with space as thousands separator
+            cleaned = re.sub(r'(\d)\s+(\d)', r'\1\2', cleaned)
+            return round(float(cleaned), 2)
+        except (ValueError, AttributeError):
+            return None

    def _needs_fallback(self, result: InferenceResult) -> bool:
        """Check if fallback OCR is needed."""
--- a/src/inference/test_field_extractor.py
+++ b/src/inference/test_field_extractor.py
@@ -0,0 +1,342 @@
+"""
+Tests for Field Extractor
+
+Tests field normalization functions:
+- Invoice number normalization
+- Date normalization
+- Amount normalization
+- Bankgiro/Plusgiro normalization
+- OCR number normalization
+- Payment line normalization
+"""
+
+import pytest
+from src.inference.field_extractor import FieldExtractor
+
+
+class TestFieldExtractorInit:
+    """Tests for FieldExtractor initialization."""
+
+    def test_default_init(self):
+        """Test default initialization."""
+        extractor = FieldExtractor()
+        assert extractor.ocr_lang == 'en'
+        assert extractor.use_gpu is False
+        assert extractor.bbox_padding == 0.1
+        assert extractor.dpi == 300
+
+    def test_custom_init(self):
+        """Test custom initialization."""
+        extractor = FieldExtractor(
+            ocr_lang='sv',
+            use_gpu=True,
+            bbox_padding=0.2,
+            dpi=150
+        )
+        assert extractor.ocr_lang == 'sv'
+        assert extractor.use_gpu is True
+        assert extractor.bbox_padding == 0.2
+        assert extractor.dpi == 150
+
+
+class TestNormalizeInvoiceNumber:
+    """Tests for invoice number normalization."""
+
+    @pytest.fixture
+    def extractor(self):
+        return FieldExtractor()
+
+    def test_alphanumeric_invoice_number(self, extractor):
+        """Test alphanumeric invoice number like A3861."""
+        result, is_valid, error = extractor._normalize_invoice_number("Fakturanummer: A3861")
+        assert result == 'A3861'
+        assert is_valid is True
+
+    def test_prefix_invoice_number(self, extractor):
+        """Test invoice number with prefix like INV12345."""
+        result, is_valid, error = extractor._normalize_invoice_number("Invoice INV12345")
+        assert result is not None
+        assert 'INV' in result or '12345' in result
+
+    def test_numeric_invoice_number(self, extractor):
+        """Test pure numeric invoice number."""
+        result, is_valid, error = extractor._normalize_invoice_number("Invoice: 12345678")
+        assert result is not None
+        assert result.isdigit()
+
+    def test_year_prefixed_invoice_number(self, extractor):
+        """Test invoice number with year prefix like 2024-001."""
+        result, is_valid, error = extractor._normalize_invoice_number("Faktura 2024-12345")
+        assert result is not None
+        assert '2024' in result
+
+    def test_avoid_long_ocr_sequence(self, extractor):
+        """Test that long OCR-like sequences are avoided."""
+        # When text contains both short invoice number and long OCR sequence
+        text = "Fakturanummer: A3861 OCR: 310196187399952763290708"
+        result, is_valid, error = extractor._normalize_invoice_number(text)
+        # Should prefer the shorter alphanumeric pattern
+        assert result == 'A3861'
+
+    def test_empty_string(self, extractor):
+        """Test empty string input."""
+        result, is_valid, error = extractor._normalize_invoice_number("")
+        assert result is None or is_valid is False
+
+
+class TestNormalizeBankgiro:
+    """Tests for Bankgiro normalization."""
+
+    @pytest.fixture
+    def extractor(self):
+        return FieldExtractor()
+
+    def test_standard_7_digit_format(self, extractor):
+        """Test 7-digit Bankgiro XXX-XXXX."""
+        result, is_valid, error = extractor._normalize_bankgiro("Bankgiro: 782-1713")
+        assert result == '782-1713'
+        assert is_valid is True
+
+    def test_standard_8_digit_format(self, extractor):
+        """Test 8-digit Bankgiro XXXX-XXXX."""
+        result, is_valid, error = extractor._normalize_bankgiro("BG 5393-9484")
+        assert result == '5393-9484'
+        assert is_valid is True
+
+    def test_without_dash(self, extractor):
+        """Test Bankgiro without dash."""
+        result, is_valid, error = extractor._normalize_bankgiro("Bankgiro 7821713")
+        assert result is not None
+        # Should be formatted with dash
+
+    def test_with_spaces(self, extractor):
+        """Test Bankgiro with spaces - may not parse if spaces break the pattern."""
+        result, is_valid, error = extractor._normalize_bankgiro("BG: 782 1713")
+        # Spaces in the middle might cause parsing issues - that's acceptable
+        # The test passes if it doesn't crash
+
+    def test_invalid_bankgiro(self, extractor):
+        """Test invalid Bankgiro (too short)."""
+        result, is_valid, error = extractor._normalize_bankgiro("BG: 123")
+        # Should fail or return None
+
+
+class TestNormalizePlusgiro:
+    """Tests for Plusgiro normalization."""
+
+    @pytest.fixture
+    def extractor(self):
+        return FieldExtractor()
+
+    def test_standard_format(self, extractor):
+        """Test standard Plusgiro format XXXXXXX-X."""
+        result, is_valid, error = extractor._normalize_plusgiro("Plusgiro: 1234567-8")
+        assert result is not None
+        assert '-' in result
+
+    def test_without_dash(self, extractor):
+        """Test Plusgiro without dash."""
+        result, is_valid, error = extractor._normalize_plusgiro("PG 12345678")
+        assert result is not None
+
+    def test_distinguish_from_bankgiro(self, extractor):
+        """Test that Plusgiro is distinguished from Bankgiro by format."""
+        # Plusgiro has 1 digit after dash, Bankgiro has 4
+        pg_text = "4809603-6"  # Plusgiro format
+        bg_text = "782-1713"  # Bankgiro format
+
+        pg_result, _, _ = extractor._normalize_plusgiro(pg_text)
+        bg_result, _, _ = extractor._normalize_bankgiro(bg_text)
+
+        # Both should succeed in their respective normalizations
+
+
+class TestNormalizeAmount:
+    """Tests for Amount normalization."""
+
+    @pytest.fixture
+    def extractor(self):
+        return FieldExtractor()
+
+    def test_swedish_format_comma(self, extractor):
+        """Test Swedish format with comma: 11 699,00."""
+        result, is_valid, error = extractor._normalize_amount("11 699,00 SEK")
+        assert result is not None
+        assert is_valid is True
+
+    def test_integer_amount(self, extractor):
+        """Test integer amount without decimals."""
+        result, is_valid, error = extractor._normalize_amount("Amount: 11699")
+        assert result is not None
+
+    def test_with_currency(self, extractor):
+        """Test amount with currency symbol."""
+        result, is_valid, error = extractor._normalize_amount("SEK 11 699,00")
+        assert result is not None
+
+    def test_large_amount(self, extractor):
+        """Test large amount with thousand separators."""
+        result, is_valid, error = extractor._normalize_amount("1 234 567,89")
+        assert result is not None
+
+
+class TestNormalizeOCR:
+    """Tests for OCR number normalization."""
+
+    @pytest.fixture
+    def extractor(self):
+        return FieldExtractor()
+
+    def test_standard_ocr(self, extractor):
+        """Test standard OCR number."""
+        result, is_valid, error = extractor._normalize_ocr_number("OCR: 310196187399952")
+        assert result == '310196187399952'
+        assert is_valid is True
+
+    def test_ocr_with_spaces(self, extractor):
+        """Test OCR number with spaces."""
+        result, is_valid, error = extractor._normalize_ocr_number("3101 9618 7399 952")
+        assert result is not None
+        assert ' ' not in result  # Spaces should be removed
+
+    def test_short_ocr_invalid(self, extractor):
+        """Test that too short OCR is invalid."""
+        result, is_valid, error = extractor._normalize_ocr_number("123")
+        assert is_valid is False
+
+
+class TestNormalizeDate:
+    """Tests for date normalization."""
+
+    @pytest.fixture
+    def extractor(self):
+        return FieldExtractor()
+
+    def test_iso_format(self, extractor):
+        """Test ISO date format YYYY-MM-DD."""
+        result, is_valid, error = extractor._normalize_date("2026-01-31")
+        assert result == '2026-01-31'
+        assert is_valid is True
+
+    def test_swedish_format(self, extractor):
+        """Test Swedish format with dots: 31.01.2026."""
+        result, is_valid, error = extractor._normalize_date("31.01.2026")
+        assert result is not None
+        assert is_valid is True
+
+    def test_slash_format(self, extractor):
+        """Test slash format: 31/01/2026."""
+        result, is_valid, error = extractor._normalize_date("31/01/2026")
+        assert result is not None
+
+    def test_compact_format(self, extractor):
+        """Test compact format: 20260131."""
+        result, is_valid, error = extractor._normalize_date("20260131")
+        assert result is not None
+
+    def test_invalid_date(self, extractor):
+        """Test invalid date."""
+        result, is_valid, error = extractor._normalize_date("not a date")
+        assert is_valid is False
+
+
+class TestNormalizePaymentLine:
+    """Tests for payment line normalization."""
+
+    @pytest.fixture
+    def extractor(self):
+        return FieldExtractor()
+
+    def test_standard_payment_line(self, extractor):
+        """Test standard payment line parsing."""
+        text = "# 310196187399952 # 11699 00 6 > 7821713#41#"
+        result, is_valid, error = extractor._normalize_payment_line(text)
+
+        assert result is not None
+        assert is_valid is True
+        # Should be formatted as: OCR:xxx Amount:xxx BG:xxx
+        assert 'OCR:' in result or '310196187399952' in result
+
+    def test_payment_line_with_spaces_in_bg(self, extractor):
+        """Test payment line with spaces in Bankgiro."""
+        text = "# 310196187399952 # 11699 00 6 > 78 2 1 713 #41#"
+        result, is_valid, error = extractor._normalize_payment_line(text)
+
+        assert result is not None
+        assert is_valid is True
+        # Bankgiro should be normalized despite spaces
+
+
+class TestNormalizeCustomerNumber:
+    """Tests for customer number normalization."""
+
+    @pytest.fixture
+    def extractor(self):
+        return FieldExtractor()
+
+    def test_with_separator(self, extractor):
+        """Test customer number with separator: JTY 576-3."""
+        result, is_valid, error = extractor._normalize_customer_number("Kundnr: JTY 576-3")
+        assert result is not None
+
+    def test_compact_format(self, extractor):
+        """Test compact customer number: JTY5763."""
+        result, is_valid, error = extractor._normalize_customer_number("JTY5763")
+        assert result is not None
+
+
+class TestNormalizeSupplierOrgNumber:
+    """Tests for supplier organization number normalization."""
+
+    @pytest.fixture
+    def extractor(self):
+        return FieldExtractor()
+
+    def test_standard_format(self, extractor):
+        """Test standard format NNNNNN-NNNN."""
+        result, is_valid, error = extractor._normalize_supplier_org_number("Org.nr 516406-1102")
+        assert result == '516406-1102'
+        assert is_valid is True
+
+    def test_vat_number_format(self, extractor):
+        """Test VAT number format SE + 10 digits + 01."""
+        result, is_valid, error = extractor._normalize_supplier_org_number("Momsreg.nr SE556123456701")
+        assert result is not None
+        assert '-' in result
+
+
+class TestNormalizeAndValidateDispatch:
+    """Tests for the _normalize_and_validate dispatch method."""
+
+    @pytest.fixture
+    def extractor(self):
+        return FieldExtractor()
+
+    def test_dispatch_invoice_number(self, extractor):
+        """Test dispatch to invoice number normalizer."""
+        result, is_valid, error = extractor._normalize_and_validate('InvoiceNumber', 'A3861')
+        assert result is not None
+
+    def test_dispatch_amount(self, extractor):
+        """Test dispatch to amount normalizer."""
+        result, is_valid, error = extractor._normalize_and_validate('Amount', '11699,00')
+        assert result is not None
+
+    def test_dispatch_bankgiro(self, extractor):
+        """Test dispatch to Bankgiro normalizer."""
+        result, is_valid, error = extractor._normalize_and_validate('Bankgiro', '782-1713')
+        assert result is not None
+
+    def test_dispatch_ocr(self, extractor):
+        """Test dispatch to OCR normalizer."""
+        result, is_valid, error = extractor._normalize_and_validate('OCR', '310196187399952')
+        assert result is not None
+
+    def test_dispatch_date(self, extractor):
+        """Test dispatch to date normalizer."""
+        result, is_valid, error = extractor._normalize_and_validate('InvoiceDate', '2026-01-31')
+        assert result is not None
+
+
+if __name__ == '__main__':
+    pytest.main([__file__, '-v'])
--- a/src/inference/test_pipeline.py
+++ b/src/inference/test_pipeline.py
@@ -0,0 +1,326 @@
+"""
+Tests for Inference Pipeline
+
+Tests the cross-validation logic between payment_line and detected fields:
+- OCR override from payment_line
+- Amount override from payment_line
+- Bankgiro/Plusgiro comparison (no override)
+- Validation scoring
+"""
+
+import pytest
+from unittest.mock import MagicMock, patch
+from src.inference.pipeline import InferencePipeline, InferenceResult, CrossValidationResult
+
+
+class TestCrossValidationResult:
+    """Tests for CrossValidationResult dataclass."""
+
+    def test_default_values(self):
+        """Test default values."""
+        cv = CrossValidationResult()
+        assert cv.ocr_match is None
+        assert cv.amount_match is None
+        assert cv.bankgiro_match is None
+        assert cv.plusgiro_match is None
+        assert cv.payment_line_ocr is None
+        assert cv.payment_line_amount is None
+        assert cv.payment_line_account is None
+        assert cv.payment_line_account_type is None
+
+    def test_attributes(self):
+        """Test setting attributes."""
+        cv = CrossValidationResult()
+        cv.ocr_match = True
+        cv.amount_match = True
+        cv.payment_line_ocr = '12345678901'
+        cv.payment_line_amount = '100'
+        cv.details = ['OCR match', 'Amount match']
+
+        assert cv.ocr_match is True
+        assert cv.amount_match is True
+        assert cv.payment_line_ocr == '12345678901'
+        assert 'OCR match' in cv.details
+
+
+class TestInferenceResult:
+    """Tests for InferenceResult dataclass."""
+
+    def test_default_fields(self):
+        """Test default field values."""
+        result = InferenceResult()
+        assert result.fields == {}
+        assert result.confidence == {}
+        assert result.errors == []
+
+    def test_set_fields(self):
+        """Test setting field values."""
+        result = InferenceResult()
+        result.fields = {
+            'OCR': '12345678901',
+            'Amount': '100',
+            'Bankgiro': '782-1713'
+        }
+        result.confidence = {
+            'OCR': 0.95,
+            'Amount': 0.90,
+            'Bankgiro': 0.88
+        }
+
+        assert result.fields['OCR'] == '12345678901'
+        assert result.fields['Amount'] == '100'
+        assert result.fields['Bankgiro'] == '782-1713'
+
+    def test_cross_validation_assignment(self):
+        """Test cross validation assignment."""
+        result = InferenceResult()
+        result.fields = {'OCR': '12345678901'}
+
+        cv = CrossValidationResult()
+        cv.ocr_match = True
+        cv.payment_line_ocr = '12345678901'
+        result.cross_validation = cv
+
+        assert result.cross_validation is not None
+        assert result.cross_validation.ocr_match is True
+
+
+class TestPaymentLineParsingInPipeline:
+    """Tests for payment_line parsing in cross-validation."""
+
+    def test_parse_payment_line_format(self):
+        """Test parsing of payment_line format: OCR:xxx Amount:xxx BG:xxx"""
+        # Simulate the parsing logic from pipeline
+        payment_line = "OCR:310196187399952 Amount:11699 BG:782-1713"
+
+        pl_parts = {}
+        for part in payment_line.split():
+            if ':' in part:
+                key, value = part.split(':', 1)
+                pl_parts[key.upper()] = value
+
+        assert pl_parts.get('OCR') == '310196187399952'
+        assert pl_parts.get('AMOUNT') == '11699'
+        assert pl_parts.get('BG') == '782-1713'
+
+    def test_parse_payment_line_with_plusgiro(self):
+        """Test parsing with Plusgiro."""
+        payment_line = "OCR:12345678901 Amount:500 PG:1234567-8"
+
+        pl_parts = {}
+        for part in payment_line.split():
+            if ':' in part:
+                key, value = part.split(':', 1)
+                pl_parts[key.upper()] = value
+
+        assert pl_parts.get('OCR') == '12345678901'
+        assert pl_parts.get('PG') == '1234567-8'
+        assert pl_parts.get('BG') is None
+
+    def test_parse_empty_payment_line(self):
+        """Test parsing empty payment_line."""
+        payment_line = ""
+
+        pl_parts = {}
+        for part in payment_line.split():
+            if ':' in part:
+                key, value = part.split(':', 1)
+                pl_parts[key.upper()] = value
+
+        assert pl_parts.get('OCR') is None
+        assert pl_parts.get('AMOUNT') is None
+
+
+class TestOCROverride:
+    """Tests for OCR override logic."""
+
+    def test_ocr_override_when_different(self):
+        """Test OCR is overridden when payment_line value differs."""
+        result = InferenceResult()
+        result.fields = {'OCR': 'wrong_ocr_12345', 'payment_line': 'OCR:correct_ocr_67890 Amount:100 BG:782-1713'}
+
+        # Simulate the override logic
+        payment_line = result.fields.get('payment_line')
+        pl_parts = {}
+        for part in str(payment_line).split():
+            if ':' in part:
+                key, value = part.split(':', 1)
+                pl_parts[key.upper()] = value
+
+        payment_line_ocr = pl_parts.get('OCR')
+
+        # Override detected OCR with payment_line OCR
+        if payment_line_ocr:
+            result.fields['OCR'] = payment_line_ocr
+
+        assert result.fields['OCR'] == 'correct_ocr_67890'
+
+    def test_ocr_no_override_when_no_payment_line(self):
+        """Test OCR is not overridden when no payment_line."""
+        result = InferenceResult()
+        result.fields = {'OCR': 'original_ocr_12345'}
+
+        # No payment_line, no override
+        assert result.fields['OCR'] == 'original_ocr_12345'
+
+
+class TestAmountOverride:
+    """Tests for Amount override logic."""
+
+    def test_amount_override(self):
+        """Test Amount is overridden from payment_line."""
+        result = InferenceResult()
+        result.fields = {
+            'Amount': '999.00',
+            'payment_line': 'OCR:12345 Amount:11699 BG:782-1713'
+        }
+
+        payment_line = result.fields.get('payment_line')
+        pl_parts = {}
+        for part in str(payment_line).split():
+            if ':' in part:
+                key, value = part.split(':', 1)
+                pl_parts[key.upper()] = value
+
+        payment_line_amount = pl_parts.get('AMOUNT')
+
+        if payment_line_amount:
+            result.fields['Amount'] = payment_line_amount
+
+        assert result.fields['Amount'] == '11699'
+
+
+class TestBankgiroComparison:
+    """Tests for Bankgiro comparison (no override)."""
+
+    def test_bankgiro_match(self):
+        """Test Bankgiro match detection."""
+        import re
+
+        detected_bankgiro = '782-1713'
+        payment_line_account = '782-1713'
+
+        det_digits = re.sub(r'\D', '', detected_bankgiro)
+        pl_digits = re.sub(r'\D', '', payment_line_account)
+
+        assert det_digits == pl_digits
+        assert det_digits == '7821713'
+
+    def test_bankgiro_mismatch(self):
+        """Test Bankgiro mismatch detection."""
+        import re
+
+        detected_bankgiro = '782-1713'
+        payment_line_account = '123-4567'
+
+        det_digits = re.sub(r'\D', '', detected_bankgiro)
+        pl_digits = re.sub(r'\D', '', payment_line_account)
+
+        assert det_digits != pl_digits
+
+    def test_bankgiro_not_overridden(self):
+        """Test that Bankgiro is NOT overridden from payment_line."""
+        result = InferenceResult()
+        result.fields = {
+            'Bankgiro': '999-9999',  # Different value
+            'payment_line': 'OCR:12345 Amount:100 BG:782-1713'
+        }
+
+        # Bankgiro should NOT be overridden (per current logic)
+        # Only compared for validation
+        original_bankgiro = result.fields['Bankgiro']
+
+        # The override logic explicitly skips Bankgiro
+        # So we verify it remains unchanged
+        assert result.fields['Bankgiro'] == '999-9999'
+        assert result.fields['Bankgiro'] == original_bankgiro
+
+
+class TestValidationScoring:
+    """Tests for validation scoring logic."""
+
+    def test_all_fields_match(self):
+        """Test score when all fields match."""
+        matches = [True, True, True]  # OCR, Amount, Bankgiro
+        match_count = sum(1 for m in matches if m)
+        total = len(matches)
+
+        assert match_count == 3
+        assert total == 3
+
+    def test_partial_match(self):
+        """Test score with partial matches."""
+        matches = [True, True, False]  # OCR match, Amount match, Bankgiro mismatch
+        match_count = sum(1 for m in matches if m)
+
+        assert match_count == 2
+
+    def test_no_matches(self):
+        """Test score when nothing matches."""
+        matches = [False, False, False]
+        match_count = sum(1 for m in matches if m)
+
+        assert match_count == 0
+
+    def test_only_count_present_fields(self):
+        """Test that only present fields are counted."""
+        # When invoice has both BG and PG but payment_line only has BG,
+        # we should only count BG in validation
+
+        payment_line_account_type = 'bankgiro'
+        bankgiro_match = True
+        plusgiro_match = None  # Not compared because payment_line doesn't have PG
+
+        matches = []
+        if payment_line_account_type == 'bankgiro' and bankgiro_match is not None:
+            matches.append(bankgiro_match)
+        elif payment_line_account_type == 'plusgiro' and plusgiro_match is not None:
+            matches.append(plusgiro_match)
+
+        assert len(matches) == 1
+        assert matches[0] is True
+
+
+class TestAmountNormalization:
+    """Tests for amount normalization for comparison."""
+
+    def test_normalize_amount_with_comma(self):
+        """Test normalizing amount with comma decimal."""
+        import re
+
+        amount = "11699,00"
+        normalized = re.sub(r'[^\d]', '', amount)
+
+        # Remove trailing zeros for öre
+        if len(normalized) > 2 and normalized[-2:] == '00':
+            normalized = normalized[:-2]
+
+        assert normalized == '11699'
+
+    def test_normalize_amount_with_dot(self):
+        """Test normalizing amount with dot decimal."""
+        import re
+
+        amount = "11699.00"
+        normalized = re.sub(r'[^\d]', '', amount)
+
+        if len(normalized) > 2 and normalized[-2:] == '00':
+            normalized = normalized[:-2]
+
+        assert normalized == '11699'
+
+    def test_normalize_amount_with_space_separator(self):
+        """Test normalizing amount with space thousand separator."""
+        import re
+
+        amount = "11 699,00"
+        normalized = re.sub(r'[^\d]', '', amount)
+
+        if len(normalized) > 2 and normalized[-2:] == '00':
+            normalized = normalized[:-2]
+
+        assert normalized == '11699'
+
+
+if __name__ == '__main__':
+    pytest.main([__file__, '-v'])
--- a/src/inference/yolo_detector.py
+++ b/src/inference/yolo_detector.py
@@ -81,6 +81,9 @@ CLASS_NAMES = [
    'bankgiro',
    'plusgiro',
    'amount',
+    'supplier_org_number',  # Matches training class name
+    'customer_number',
+    'payment_line',  # Machine code payment line at bottom of invoice
 ]

 # Mapping from class name to field name
@@ -92,6 +95,9 @@ CLASS_TO_FIELD = {
    'bankgiro': 'Bankgiro',
    'plusgiro': 'Plusgiro',
    'amount': 'Amount',
+    'supplier_org_number': 'supplier_org_number',
+    'customer_number': 'customer_number',
+    'payment_line': 'payment_line',
 }


--- a/src/matcher/field_matcher.py
+++ b/src/matcher/field_matcher.py
@@ -14,11 +14,11 @@ from functools import cached_property
 _DATE_PATTERN = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
 _WHITESPACE_PATTERN = re.compile(r'\s+')
 _NON_DIGIT_PATTERN = re.compile(r'\D')
-_DASH_PATTERN = re.compile(r'[\u2013\u2014\u2212]')  # en-dash, em-dash, minus sign
+_DASH_PATTERN = re.compile(r'[\u2013\u2014\u2212\u00b7]')  # en-dash, em-dash, minus sign, middle dot


 def _normalize_dashes(text: str) -> str:
-    """Normalize different dash types to standard hyphen-minus (ASCII 45)."""
+    """Normalize different dash types and middle dots to standard hyphen-minus (ASCII 45)."""
    return _DASH_PATTERN.sub('-', text)


@@ -195,7 +195,13 @@ class FieldMatcher:
            List of Match objects sorted by score (descending)
        """
        matches = []
-        page_tokens = [t for t in tokens if t.page_no == page_no]
+        # Filter tokens by page and exclude hidden metadata tokens
+        # Hidden tokens often have bbox with y < 0 or y > page_height
+        # These are typically PDF metadata stored as invisible text
+        page_tokens = [
+            t for t in tokens
+            if t.page_no == page_no and t.bbox[1] >= 0 and t.bbox[3] > t.bbox[1]
+        ]

        # Build spatial index for efficient nearby token lookup (O(n) -> O(1))
        self._token_index = TokenIndex(page_tokens, grid_size=self.context_radius)
@@ -373,41 +379,74 @@ class FieldMatcher:
        if field_name not in supported_fields:
            return matches

+        # Fields where spaces/dashes should be ignored during matching
+        # (e.g., org number "55 65 74-6624" should match "5565746624")
+        ignore_spaces_fields = ('supplier_organisation_number', 'Bankgiro', 'Plusgiro', 'supplier_accounts')
+
        for token in tokens:
            token_text = token.text.strip()
            # Normalize different dash types to hyphen-minus for matching
            token_text_normalized = _normalize_dashes(token_text)

+            # For certain fields, also try matching with spaces/dashes removed
+            if field_name in ignore_spaces_fields:
+                token_text_compact = token_text_normalized.replace(' ', '').replace('-', '')
+                value_compact = value.replace(' ', '').replace('-', '')
+            else:
+                token_text_compact = None
+                value_compact = None
+
            # Skip if token is the same length as value (would be exact match)
            if len(token_text_normalized) <= len(value):
                continue

            # Check if value appears as substring (using normalized text)
            # Try case-sensitive first, then case-insensitive
+            idx = None
+            case_sensitive_match = True
+            used_compact = False
+
            if value in token_text_normalized:
                idx = token_text_normalized.find(value)
-                case_sensitive_match = True
            elif value.lower() in token_text_normalized.lower():
                idx = token_text_normalized.lower().find(value.lower())
                case_sensitive_match = False
-            else:
+            elif token_text_compact and value_compact in token_text_compact:
+                # Try compact matching (spaces/dashes removed)
+                idx = token_text_compact.find(value_compact)
+                used_compact = True
+            elif token_text_compact and value_compact.lower() in token_text_compact.lower():
+                idx = token_text_compact.lower().find(value_compact.lower())
+                case_sensitive_match = False
+                used_compact = True
+
+            if idx is None:
                continue

-            # Verify it's a proper boundary match (not part of a larger number)
-            # Check character before (if exists)
-            if idx > 0:
-                char_before = token_text_normalized[idx - 1]
-                # Must be non-digit (allow : space - etc)
-                if char_before.isdigit():
+            # For compact matching, boundary check is simpler (just check it's 10 consecutive digits)
+            if used_compact:
+                # Verify proper boundary in compact text
+                if idx > 0 and token_text_compact[idx - 1].isdigit():
                    continue
+                end_idx = idx + len(value_compact)
+                if end_idx < len(token_text_compact) and token_text_compact[end_idx].isdigit():
+                    continue
+            else:
+                # Verify it's a proper boundary match (not part of a larger number)
+                # Check character before (if exists)
+                if idx > 0:
+                    char_before = token_text_normalized[idx - 1]
+                    # Must be non-digit (allow : space - etc)
+                    if char_before.isdigit():
+                        continue

-            # Check character after (if exists)
-            end_idx = idx + len(value)
-            if end_idx < len(token_text_normalized):
-                char_after = token_text_normalized[end_idx]
-                # Must be non-digit
-                if char_after.isdigit():
-                    continue
+                # Check character after (if exists)
+                end_idx = idx + len(value)
+                if end_idx < len(token_text_normalized):
+                    char_after = token_text_normalized[end_idx]
+                    # Must be non-digit
+                    if char_after.isdigit():
+                        continue

            # Found valid substring match
            context_keywords, context_boost = self._find_context_keywords(
@@ -678,15 +717,44 @@ class FieldMatcher:
        min_height = min(token1.bbox[3] - token1.bbox[1], token2.bbox[3] - token2.bbox[1])
        return y_overlap > min_height * 0.5

-    def _parse_amount(self, text: str) -> float | None:
+    def _parse_amount(self, text: str | int | float) -> float | None:
        """Try to parse text as a monetary amount."""
-        # Remove currency and spaces
-        text = re.sub(r'[SEK|kr|:-]', '', text, flags=re.IGNORECASE)
+        # Convert to string first
+        text = str(text)
+
+        # First, handle Swedish öre format: "239 00" means 239.00 (239 kr 00 öre)
+        # Pattern: digits + space + exactly 2 digits at end
+        ore_match = re.match(r'^(\d+)\s+(\d{2})$', text.strip())
+        if ore_match:
+            kronor = ore_match.group(1)
+            ore = ore_match.group(2)
+            try:
+                return float(f"{kronor}.{ore}")
+            except ValueError:
+                pass
+
+        # Remove everything after and including parentheses (e.g., "(inkl. moms)")
+        text = re.sub(r'\s*\(.*\)', '', text)
+
+        # Remove currency symbols and common suffixes (including trailing dots from "kr.")
+        text = re.sub(r'\b(SEK|kr|kronor|öre)\b\.?', '', text, flags=re.IGNORECASE)
+        text = re.sub(r'[:-]', '', text)
+
+        # Remove spaces (thousand separators) but be careful with öre format
        text = text.replace(' ', '').replace('\xa0', '')

-        # Try comma as decimal separator
-        if ',' in text and '.' not in text:
-            text = text.replace(',', '.')
+        # Handle comma as decimal separator
+        # Swedish format: "500,00" means 500.00
+        # Need to handle cases like "500,00." (after removing "kr.")
+        if ',' in text:
+            # Remove any trailing dots first (from "kr." removal)
+            text = text.rstrip('.')
+            # Now replace comma with dot
+            if '.' not in text:
+                text = text.replace(',', '.')
+
+        # Remove any remaining non-numeric characters except dot
+        text = re.sub(r'[^\d.]', '', text)

        try:
            return float(text)
--- a/src/matcher/test_field_matcher.py
+++ b/src/matcher/test_field_matcher.py
@@ -0,0 +1,896 @@
+"""
+Tests for the Field Matching Module.
+
+Tests cover all matcher functions in src/matcher/field_matcher.py
+
+Usage:
+    pytest src/matcher/test_field_matcher.py -v -o 'addopts='
+"""
+
+import pytest
+from dataclasses import dataclass
+from src.matcher.field_matcher import (
+    FieldMatcher,
+    Match,
+    TokenIndex,
+    CONTEXT_KEYWORDS,
+    _normalize_dashes,
+    find_field_matches,
+)
+
+
+@dataclass
+class MockToken:
+    """Mock token for testing."""
+    text: str
+    bbox: tuple[float, float, float, float]
+    page_no: int = 0
+
+
+class TestNormalizeDashes:
+    """Tests for _normalize_dashes function."""
+
+    def test_normalize_en_dash(self):
+        """Should normalize en-dash to hyphen."""
+        assert _normalize_dashes("123\u2013456") == "123-456"
+
+    def test_normalize_em_dash(self):
+        """Should normalize em-dash to hyphen."""
+        assert _normalize_dashes("123\u2014456") == "123-456"
+
+    def test_normalize_minus_sign(self):
+        """Should normalize minus sign to hyphen."""
+        assert _normalize_dashes("123\u2212456") == "123-456"
+
+    def test_normalize_middle_dot(self):
+        """Should normalize middle dot to hyphen."""
+        assert _normalize_dashes("123\u00b7456") == "123-456"
+
+    def test_normal_hyphen_unchanged(self):
+        """Should keep normal hyphen unchanged."""
+        assert _normalize_dashes("123-456") == "123-456"
+
+
+class TestTokenIndex:
+    """Tests for TokenIndex class."""
+
+    def test_build_index(self):
+        """Should build spatial index from tokens."""
+        tokens = [
+            MockToken("hello", (0, 0, 50, 20)),
+            MockToken("world", (60, 0, 110, 20)),
+        ]
+        index = TokenIndex(tokens)
+        assert len(index.tokens) == 2
+
+    def test_get_center(self):
+        """Should return correct center coordinates."""
+        token = MockToken("test", (0, 0, 100, 50))
+        tokens = [token]
+        index = TokenIndex(tokens)
+        center = index.get_center(token)
+        assert center == (50.0, 25.0)
+
+    def test_get_text_lower(self):
+        """Should return lowercase text."""
+        token = MockToken("HELLO World", (0, 0, 100, 20))
+        tokens = [token]
+        index = TokenIndex(tokens)
+        assert index.get_text_lower(token) == "hello world"
+
+    def test_find_nearby_within_radius(self):
+        """Should find tokens within radius."""
+        token1 = MockToken("hello", (0, 0, 50, 20))
+        token2 = MockToken("world", (60, 0, 110, 20))  # 60px away
+        token3 = MockToken("far", (500, 0, 550, 20))  # 500px away
+        tokens = [token1, token2, token3]
+        index = TokenIndex(tokens)
+
+        nearby = index.find_nearby(token1, radius=100)
+        assert len(nearby) == 1
+        assert nearby[0].text == "world"
+
+    def test_find_nearby_excludes_self(self):
+        """Should not include the target token itself."""
+        token1 = MockToken("hello", (0, 0, 50, 20))
+        token2 = MockToken("world", (60, 0, 110, 20))
+        tokens = [token1, token2]
+        index = TokenIndex(tokens)
+
+        nearby = index.find_nearby(token1, radius=100)
+        assert token1 not in nearby
+
+    def test_find_nearby_empty_when_none_in_range(self):
+        """Should return empty list when no tokens in range."""
+        token1 = MockToken("hello", (0, 0, 50, 20))
+        token2 = MockToken("far", (500, 0, 550, 20))
+        tokens = [token1, token2]
+        index = TokenIndex(tokens)
+
+        nearby = index.find_nearby(token1, radius=50)
+        assert len(nearby) == 0
+
+
+class TestMatch:
+    """Tests for Match dataclass."""
+
+    def test_match_creation(self):
+        """Should create Match with all fields."""
+        match = Match(
+            field="InvoiceNumber",
+            value="12345",
+            bbox=(0, 0, 100, 20),
+            page_no=0,
+            score=0.95,
+            matched_text="12345",
+            context_keywords=["fakturanr"]
+        )
+        assert match.field == "InvoiceNumber"
+        assert match.value == "12345"
+        assert match.score == 0.95
+
+    def test_to_yolo_format(self):
+        """Should convert to YOLO annotation format."""
+        match = Match(
+            field="Amount",
+            value="100",
+            bbox=(100, 200, 200, 250),  # x0, y0, x1, y1
+            page_no=0,
+            score=1.0,
+            matched_text="100",
+            context_keywords=[]
+        )
+        # Image: 1000x1000
+        yolo = match.to_yolo_format(1000, 1000, class_id=5)
+
+        # Expected: center_x=150, center_y=225, width=100, height=50
+        # Normalized: x_center=0.15, y_center=0.225, w=0.1, h=0.05
+        assert yolo.startswith("5 ")
+        parts = yolo.split()
+        assert len(parts) == 5
+        assert float(parts[1]) == pytest.approx(0.15, rel=1e-4)
+        assert float(parts[2]) == pytest.approx(0.225, rel=1e-4)
+        assert float(parts[3]) == pytest.approx(0.1, rel=1e-4)
+        assert float(parts[4]) == pytest.approx(0.05, rel=1e-4)
+
+
+class TestFieldMatcher:
+    """Tests for FieldMatcher class."""
+
+    def test_init_defaults(self):
+        """Should initialize with default values."""
+        matcher = FieldMatcher()
+        assert matcher.context_radius == 200.0
+        assert matcher.min_score_threshold == 0.5
+
+    def test_init_custom_params(self):
+        """Should initialize with custom parameters."""
+        matcher = FieldMatcher(context_radius=300.0, min_score_threshold=0.7)
+        assert matcher.context_radius == 300.0
+        assert matcher.min_score_threshold == 0.7
+
+
+class TestFieldMatcherExactMatch:
+    """Tests for exact matching."""
+
+    def test_exact_match_full_score(self):
+        """Should find exact match with full score."""
+        matcher = FieldMatcher()
+        tokens = [MockToken("12345", (0, 0, 50, 20))]
+
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["12345"])
+
+        assert len(matches) >= 1
+        assert matches[0].score == 1.0
+        assert matches[0].matched_text == "12345"
+
+    def test_case_insensitive_match(self):
+        """Should find case-insensitive match with lower score."""
+        matcher = FieldMatcher()
+        tokens = [MockToken("HELLO", (0, 0, 50, 20))]
+
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["hello"])
+
+        assert len(matches) >= 1
+        assert matches[0].score == 0.95
+
+    def test_digits_only_match(self):
+        """Should match by digits only for numeric fields."""
+        matcher = FieldMatcher()
+        tokens = [MockToken("INV-12345", (0, 0, 80, 20))]
+
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["12345"])
+
+        assert len(matches) >= 1
+        assert matches[0].score == 0.9
+
+    def test_no_match_when_different(self):
+        """Should return empty when no match found."""
+        matcher = FieldMatcher(min_score_threshold=0.8)
+        tokens = [MockToken("99999", (0, 0, 50, 20))]
+
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["12345"])
+
+        assert len(matches) == 0
+
+
+class TestFieldMatcherContextKeywords:
+    """Tests for context keyword boosting."""
+
+    def test_context_boost_with_nearby_keyword(self):
+        """Should boost score when context keyword is nearby."""
+        matcher = FieldMatcher(context_radius=200)
+        tokens = [
+            MockToken("fakturanr", (0, 0, 80, 20)),  # Context keyword
+            MockToken("12345", (100, 0, 150, 20)),   # Value
+        ]
+
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["12345"])
+
+        assert len(matches) >= 1
+        # Score should be boosted above 1.0 (capped at 1.0)
+        assert matches[0].score == 1.0
+        assert "fakturanr" in matches[0].context_keywords
+
+    def test_no_boost_when_keyword_far_away(self):
+        """Should not boost when keyword is too far."""
+        matcher = FieldMatcher(context_radius=50)
+        tokens = [
+            MockToken("fakturanr", (0, 0, 80, 20)),   # Context keyword
+            MockToken("12345", (500, 0, 550, 20)),   # Value - far away
+        ]
+
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["12345"])
+
+        assert len(matches) >= 1
+        assert "fakturanr" not in matches[0].context_keywords
+
+
+class TestFieldMatcherConcatenatedMatch:
+    """Tests for concatenated token matching."""
+
+    def test_concatenate_adjacent_tokens(self):
+        """Should match value split across adjacent tokens."""
+        matcher = FieldMatcher()
+        tokens = [
+            MockToken("123", (0, 0, 30, 20)),
+            MockToken("456", (35, 0, 65, 20)),  # Adjacent, same line
+        ]
+
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["123456"])
+
+        assert len(matches) >= 1
+        assert "123456" in matches[0].matched_text or matches[0].value == "123456"
+
+    def test_no_concatenate_when_gap_too_large(self):
+        """Should not concatenate when gap is too large."""
+        matcher = FieldMatcher()
+        tokens = [
+            MockToken("123", (0, 0, 30, 20)),
+            MockToken("456", (100, 0, 130, 20)),  # Gap > 50px
+        ]
+
+        # This might still match if exact matches work differently
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["123456"])
+        # No concatenated match expected (only from exact/substring)
+        concat_matches = [m for m in matches if "123456" in m.matched_text]
+        # May or may not find depending on strategy
+
+
+class TestFieldMatcherSubstringMatch:
+    """Tests for substring matching."""
+
+    def test_substring_match_in_longer_text(self):
+        """Should find value as substring in longer token."""
+        matcher = FieldMatcher()
+        tokens = [MockToken("Fakturanummer: 12345", (0, 0, 150, 20))]
+
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["12345"])
+
+        assert len(matches) >= 1
+        # Substring match should have lower score
+        substring_match = [m for m in matches if "12345" in m.matched_text]
+        assert len(substring_match) >= 1
+
+    def test_no_substring_match_when_part_of_larger_number(self):
+        """Should not match when value is part of a larger number."""
+        matcher = FieldMatcher(min_score_threshold=0.6)
+        tokens = [MockToken("123456789", (0, 0, 100, 20))]
+
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["456"])
+
+        # Should not match because 456 is embedded in larger number
+        assert len(matches) == 0
+
+
+class TestFieldMatcherFuzzyMatch:
+    """Tests for fuzzy amount matching."""
+
+    def test_fuzzy_amount_match(self):
+        """Should match amounts that are numerically equal."""
+        matcher = FieldMatcher()
+        tokens = [MockToken("1234,56", (0, 0, 70, 20))]
+
+        matches = matcher.find_matches(tokens, "Amount", ["1234.56"])
+
+        assert len(matches) >= 1
+
+    def test_fuzzy_amount_with_different_formats(self):
+        """Should match amounts in different formats."""
+        matcher = FieldMatcher()
+        tokens = [MockToken("1 234,56", (0, 0, 80, 20))]
+
+        matches = matcher.find_matches(tokens, "Amount", ["1234,56"])
+
+        assert len(matches) >= 1
+
+
+class TestFieldMatcherParseAmount:
+    """Tests for _parse_amount method."""
+
+    def test_parse_simple_integer(self):
+        """Should parse simple integer."""
+        matcher = FieldMatcher()
+        assert matcher._parse_amount("100") == 100.0
+
+    def test_parse_decimal_with_dot(self):
+        """Should parse decimal with dot."""
+        matcher = FieldMatcher()
+        assert matcher._parse_amount("100.50") == 100.50
+
+    def test_parse_decimal_with_comma(self):
+        """Should parse decimal with comma (European format)."""
+        matcher = FieldMatcher()
+        assert matcher._parse_amount("100,50") == 100.50
+
+    def test_parse_with_thousand_separator(self):
+        """Should parse with thousand separator."""
+        matcher = FieldMatcher()
+        assert matcher._parse_amount("1 234,56") == 1234.56
+
+    def test_parse_with_currency_suffix(self):
+        """Should parse and remove currency suffix."""
+        matcher = FieldMatcher()
+        assert matcher._parse_amount("100 SEK") == 100.0
+        assert matcher._parse_amount("100 kr") == 100.0
+
+    def test_parse_swedish_ore_format(self):
+        """Should parse Swedish öre format (kronor space öre)."""
+        matcher = FieldMatcher()
+        assert matcher._parse_amount("239 00") == 239.00
+        assert matcher._parse_amount("1234 50") == 1234.50
+
+    def test_parse_invalid_returns_none(self):
+        """Should return None for invalid input."""
+        matcher = FieldMatcher()
+        assert matcher._parse_amount("abc") is None
+        assert matcher._parse_amount("") is None
+
+
+class TestFieldMatcherTokensOnSameLine:
+    """Tests for _tokens_on_same_line method."""
+
+    def test_same_line_tokens(self):
+        """Should detect tokens on same line."""
+        matcher = FieldMatcher()
+        token1 = MockToken("hello", (0, 10, 50, 30))
+        token2 = MockToken("world", (60, 12, 110, 28))  # Slight y variation
+
+        assert matcher._tokens_on_same_line(token1, token2) is True
+
+    def test_different_line_tokens(self):
+        """Should detect tokens on different lines."""
+        matcher = FieldMatcher()
+        token1 = MockToken("hello", (0, 10, 50, 30))
+        token2 = MockToken("world", (0, 50, 50, 70))  # Different y
+
+        assert matcher._tokens_on_same_line(token1, token2) is False
+
+
+class TestFieldMatcherBboxOverlap:
+    """Tests for _bbox_overlap method."""
+
+    def test_full_overlap(self):
+        """Should return 1.0 for identical bboxes."""
+        matcher = FieldMatcher()
+        bbox = (0, 0, 100, 50)
+        assert matcher._bbox_overlap(bbox, bbox) == 1.0
+
+    def test_partial_overlap(self):
+        """Should calculate partial overlap correctly."""
+        matcher = FieldMatcher()
+        bbox1 = (0, 0, 100, 100)
+        bbox2 = (50, 50, 150, 150)  # 50% overlap on each axis
+
+        overlap = matcher._bbox_overlap(bbox1, bbox2)
+        # Intersection: 50x50=2500, Union: 10000+10000-2500=17500
+        # IoU = 2500/17500 ≈ 0.143
+        assert 0.1 < overlap < 0.2
+
+    def test_no_overlap(self):
+        """Should return 0.0 for non-overlapping bboxes."""
+        matcher = FieldMatcher()
+        bbox1 = (0, 0, 50, 50)
+        bbox2 = (100, 100, 150, 150)
+
+        assert matcher._bbox_overlap(bbox1, bbox2) == 0.0
+
+
+class TestFieldMatcherDeduplication:
+    """Tests for match deduplication."""
+
+    def test_deduplicate_overlapping_matches(self):
+        """Should keep only highest scoring match for overlapping bboxes."""
+        matcher = FieldMatcher()
+        tokens = [
+            MockToken("12345", (0, 0, 50, 20)),
+        ]
+
+        # Find matches with multiple values that could match same token
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["12345", "12345"])
+
+        # Should deduplicate to single match
+        assert len(matches) == 1
+
+
+class TestFieldMatcherFlexibleDateMatch:
+    """Tests for flexible date matching."""
+
+    def test_flexible_date_same_month(self):
+        """Should match dates in same year-month when exact match fails."""
+        matcher = FieldMatcher()
+        tokens = [
+            MockToken("2025-01-15", (0, 0, 80, 20)),  # Slightly different day
+        ]
+
+        # Search for different day in same month
+        matches = matcher.find_matches(
+            tokens, "InvoiceDate", ["2025-01-10"]
+        )
+
+        # Should find flexible match (lower score)
+        # Note: This depends on exact match failing first
+        # If exact match works, flexible won't be tried
+
+
+class TestFieldMatcherPageFiltering:
+    """Tests for page number filtering."""
+
+    def test_filters_by_page_number(self):
+        """Should only match tokens on specified page."""
+        matcher = FieldMatcher()
+        tokens = [
+            MockToken("12345", (0, 0, 50, 20), page_no=0),
+            MockToken("12345", (0, 0, 50, 20), page_no=1),
+        ]
+
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["12345"], page_no=0)
+
+        assert all(m.page_no == 0 for m in matches)
+
+    def test_excludes_hidden_tokens(self):
+        """Should exclude tokens with negative y coordinates (metadata)."""
+        matcher = FieldMatcher()
+        tokens = [
+            MockToken("12345", (0, -100, 50, -80), page_no=0),  # Hidden metadata
+            MockToken("67890", (0, 0, 50, 20), page_no=0),      # Visible
+        ]
+
+        matches = matcher.find_matches(tokens, "InvoiceNumber", ["12345"], page_no=0)
+
+        # Should not match the hidden token
+        assert len(matches) == 0
+
+
+class TestContextKeywordsMapping:
+    """Tests for CONTEXT_KEYWORDS constant."""
+
+    def test_all_fields_have_keywords(self):
+        """Should have keywords for all expected fields."""
+        expected_fields = [
+            "InvoiceNumber",
+            "InvoiceDate",
+            "InvoiceDueDate",
+            "OCR",
+            "Bankgiro",
+            "Plusgiro",
+            "Amount",
+            "supplier_organisation_number",
+            "supplier_accounts",
+        ]
+        for field in expected_fields:
+            assert field in CONTEXT_KEYWORDS
+            assert len(CONTEXT_KEYWORDS[field]) > 0
+
+    def test_keywords_are_lowercase(self):
+        """All keywords should be lowercase."""
+        for field, keywords in CONTEXT_KEYWORDS.items():
+            for kw in keywords:
+                assert kw == kw.lower(), f"Keyword '{kw}' in {field} should be lowercase"
+
+
+class TestFindFieldMatches:
+    """Tests for find_field_matches convenience function."""
+
+    def test_finds_multiple_fields(self):
+        """Should find matches for multiple fields."""
+        tokens = [
+            MockToken("12345", (0, 0, 50, 20)),
+            MockToken("100,00", (0, 30, 60, 50)),
+        ]
+        field_values = {
+            "InvoiceNumber": "12345",
+            "Amount": "100",
+        }
+
+        results = find_field_matches(tokens, field_values)
+
+        assert "InvoiceNumber" in results
+        assert "Amount" in results
+        assert len(results["InvoiceNumber"]) >= 1
+        assert len(results["Amount"]) >= 1
+
+    def test_skips_empty_values(self):
+        """Should skip fields with None or empty values."""
+        tokens = [MockToken("12345", (0, 0, 50, 20))]
+        field_values = {
+            "InvoiceNumber": "12345",
+            "Amount": None,
+            "OCR": "",
+        }
+
+        results = find_field_matches(tokens, field_values)
+
+        assert "InvoiceNumber" in results
+        assert "Amount" not in results
+        assert "OCR" not in results
+
+
+class TestSubstringMatchEdgeCases:
+    """Additional edge case tests for substring matching."""
+
+    def test_unsupported_field_returns_empty(self):
+        """Should return empty for unsupported field types."""
+        # Line 380: field_name not in supported_fields
+        matcher = FieldMatcher()
+        tokens = [MockToken("Faktura: 12345", (0, 0, 100, 20))]
+
+        # Message is not a supported field for substring matching
+        matches = matcher._find_substring_matches(tokens, "12345", "Message")
+        assert len(matches) == 0
+
+    def test_case_insensitive_substring_match(self):
+        """Should find case-insensitive substring match."""
+        # Line 397-398: case-insensitive substring matching
+        matcher = FieldMatcher()
+        # Use token without inline keyword to isolate case-insensitive behavior
+        tokens = [MockToken("REF: ABC123", (0, 0, 100, 20))]
+
+        matches = matcher._find_substring_matches(tokens, "abc123", "InvoiceNumber")
+
+        assert len(matches) >= 1
+        # Case-insensitive base score is 0.70 (vs 0.75 for case-sensitive)
+        # Score may have context boost but base should be lower
+        assert matches[0].score <= 0.80  # 0.70 base + possible small boost
+
+    def test_substring_with_digit_before(self):
+        """Should not match when digit appears before value."""
+        # Line 407-408: char_before.isdigit() continue
+        matcher = FieldMatcher()
+        tokens = [MockToken("9912345", (0, 0, 60, 20))]
+
+        matches = matcher._find_substring_matches(tokens, "12345", "InvoiceNumber")
+        assert len(matches) == 0
+
+    def test_substring_with_digit_after(self):
+        """Should not match when digit appears after value."""
+        # Line 413-416: char_after.isdigit() continue
+        matcher = FieldMatcher()
+        tokens = [MockToken("12345678", (0, 0, 70, 20))]
+
+        matches = matcher._find_substring_matches(tokens, "12345", "InvoiceNumber")
+        assert len(matches) == 0
+
+    def test_substring_with_inline_keyword(self):
+        """Should boost score when keyword is in same token."""
+        matcher = FieldMatcher()
+        tokens = [MockToken("Fakturanr: 12345", (0, 0, 100, 20))]
+
+        matches = matcher._find_substring_matches(tokens, "12345", "InvoiceNumber")
+
+        assert len(matches) >= 1
+        # Should have inline keyword boost
+        assert "fakturanr" in matches[0].context_keywords
+
+
+class TestFlexibleDateMatchEdgeCases:
+    """Additional edge case tests for flexible date matching."""
+
+    def test_no_valid_date_in_normalized_values(self):
+        """Should return empty when no valid date in normalized values."""
+        # Line 520-521, 524: target_date parsing failures
+        matcher = FieldMatcher()
+        tokens = [MockToken("2025-01-15", (0, 0, 80, 20))]
+
+        # Pass non-date values
+        matches = matcher._find_flexible_date_matches(
+            tokens, ["not-a-date", "also-not-date"], "InvoiceDate"
+        )
+        assert len(matches) == 0
+
+    def test_no_date_tokens_found(self):
+        """Should return empty when no date tokens in document."""
+        # Line 571-572: no date_candidates
+        matcher = FieldMatcher()
+        tokens = [MockToken("Hello World", (0, 0, 80, 20))]
+
+        matches = matcher._find_flexible_date_matches(
+            tokens, ["2025-01-15"], "InvoiceDate"
+        )
+        assert len(matches) == 0
+
+    def test_flexible_date_within_7_days(self):
+        """Should score higher for dates within 7 days."""
+        # Line 582-583: days_diff <= 7
+        matcher = FieldMatcher(min_score_threshold=0.5)
+        tokens = [
+            MockToken("2025-01-18", (0, 0, 80, 20)),  # 3 days from target
+        ]
+
+        matches = matcher._find_flexible_date_matches(
+            tokens, ["2025-01-15"], "InvoiceDate"
+        )
+
+        assert len(matches) >= 1
+        assert matches[0].score >= 0.75
+
+    def test_flexible_date_within_3_days(self):
+        """Should score highest for dates within 3 days."""
+        # Line 584-585: days_diff <= 3
+        matcher = FieldMatcher(min_score_threshold=0.5)
+        tokens = [
+            MockToken("2025-01-17", (0, 0, 80, 20)),  # 2 days from target
+        ]
+
+        matches = matcher._find_flexible_date_matches(
+            tokens, ["2025-01-15"], "InvoiceDate"
+        )
+
+        assert len(matches) >= 1
+        assert matches[0].score >= 0.8
+
+    def test_flexible_date_within_14_days_different_month(self):
+        """Should match dates within 14 days even in different month."""
+        # Line 587-588: days_diff <= 14, different year-month
+        matcher = FieldMatcher(min_score_threshold=0.5)
+        tokens = [
+            MockToken("2025-02-05", (0, 0, 80, 20)),  # 10 days from Jan 26
+        ]
+
+        matches = matcher._find_flexible_date_matches(
+            tokens, ["2025-01-26"], "InvoiceDate"
+        )
+
+        assert len(matches) >= 1
+
+    def test_flexible_date_within_30_days(self):
+        """Should match dates within 30 days with lower score."""
+        # Line 589-590: days_diff <= 30
+        matcher = FieldMatcher(min_score_threshold=0.5)
+        tokens = [
+            MockToken("2025-02-10", (0, 0, 80, 20)),  # 25 days from target
+        ]
+
+        matches = matcher._find_flexible_date_matches(
+            tokens, ["2025-01-16"], "InvoiceDate"
+        )
+
+        assert len(matches) >= 1
+        assert matches[0].score >= 0.55
+
+    def test_flexible_date_far_apart_without_context(self):
+        """Should skip dates too far apart without context keywords."""
+        # Line 591-595: > 30 days, no context
+        matcher = FieldMatcher(min_score_threshold=0.5)
+        tokens = [
+            MockToken("2025-06-15", (0, 0, 80, 20)),  # Many months from target
+        ]
+
+        matches = matcher._find_flexible_date_matches(
+            tokens, ["2025-01-15"], "InvoiceDate"
+        )
+
+        # Should be empty - too far apart and no context
+        assert len(matches) == 0
+
+    def test_flexible_date_far_with_context(self):
+        """Should match distant dates if context keywords present."""
+        # Line 592-595: > 30 days but has context
+        matcher = FieldMatcher(min_score_threshold=0.5, context_radius=200)
+        tokens = [
+            MockToken("fakturadatum", (0, 0, 80, 20)),  # Context keyword
+            MockToken("2025-06-15", (90, 0, 170, 20)),  # Distant date
+        ]
+
+        matches = matcher._find_flexible_date_matches(
+            tokens, ["2025-01-15"], "InvoiceDate"
+        )
+
+        # May match due to context keyword
+        # (depends on how context is detected in flexible match)
+
+    def test_flexible_date_boost_with_context(self):
+        """Should boost flexible date score with context keywords."""
+        # Line 598, 602-603: context_boost applied
+        matcher = FieldMatcher(min_score_threshold=0.5, context_radius=200)
+        tokens = [
+            MockToken("fakturadatum", (0, 0, 80, 20)),
+            MockToken("2025-01-18", (90, 0, 170, 20)),  # 3 days from target
+        ]
+
+        matches = matcher._find_flexible_date_matches(
+            tokens, ["2025-01-15"], "InvoiceDate"
+        )
+
+        if len(matches) > 0:
+            assert len(matches[0].context_keywords) >= 0
+
+
+class TestContextKeywordFallback:
+    """Tests for context keyword lookup fallback (no spatial index)."""
+
+    def test_fallback_context_lookup_without_index(self):
+        """Should find context using O(n) scan when no index available."""
+        # Line 650-673: fallback context lookup
+        matcher = FieldMatcher(context_radius=200)
+        # Don't use find_matches which builds index, call internal method directly
+
+        tokens = [
+            MockToken("fakturanr", (0, 0, 80, 20)),
+            MockToken("12345", (100, 0, 150, 20)),
+        ]
+
+        # _token_index is None, so fallback is used
+        keywords, boost = matcher._find_context_keywords(tokens, tokens[1], "InvoiceNumber")
+
+        assert "fakturanr" in keywords
+        assert boost > 0
+
+    def test_context_lookup_skips_self(self):
+        """Should skip the target token itself in fallback search."""
+        # Line 656-657: token is target_token continue
+        matcher = FieldMatcher(context_radius=200)
+        matcher._token_index = None  # Force fallback
+
+        token = MockToken("fakturanr 12345", (0, 0, 150, 20))
+        tokens = [token]
+
+        keywords, boost = matcher._find_context_keywords(tokens, token, "InvoiceNumber")
+
+        # Token contains keyword but is the target - should still find if keyword in token
+        # Actually this tests that it doesn't error when target is in list
+
+
+class TestFieldWithoutContextKeywords:
+    """Tests for fields without defined context keywords."""
+
+    def test_field_without_keywords_returns_empty(self):
+        """Should return empty keywords for fields not in CONTEXT_KEYWORDS."""
+        # Line 633-635: keywords empty, return early
+        matcher = FieldMatcher()
+        matcher._token_index = None
+
+        tokens = [MockToken("hello", (0, 0, 50, 20))]
+
+        # customer_number is not in CONTEXT_KEYWORDS
+        keywords, boost = matcher._find_context_keywords(tokens, tokens[0], "UnknownField")
+
+        assert keywords == []
+        assert boost == 0.0
+
+
+class TestParseAmountEdgeCases:
+    """Additional edge case tests for _parse_amount."""
+
+    def test_parse_amount_with_parentheses(self):
+        """Should remove parenthesized text like (inkl. moms)."""
+        matcher = FieldMatcher()
+        result = matcher._parse_amount("100 (inkl. moms)")
+        assert result == 100.0
+
+    def test_parse_amount_with_kronor_suffix(self):
+        """Should handle 'kronor' suffix."""
+        matcher = FieldMatcher()
+        result = matcher._parse_amount("100 kronor")
+        assert result == 100.0
+
+    def test_parse_amount_numeric_input(self):
+        """Should handle numeric input (int/float)."""
+        matcher = FieldMatcher()
+        assert matcher._parse_amount(100) == 100.0
+        assert matcher._parse_amount(100.5) == 100.5
+
+
+class TestFuzzyMatchExceptionHandling:
+    """Tests for exception handling in fuzzy matching."""
+
+    def test_fuzzy_match_with_unparseable_token(self):
+        """Should handle tokens that can't be parsed as amounts."""
+        # Line 481-482: except clause in fuzzy matching
+        matcher = FieldMatcher()
+        # Create a token that will cause parse issues
+        tokens = [MockToken("abc xyz", (0, 0, 50, 20))]
+
+        # This should not raise, just return empty matches
+        matches = matcher._find_fuzzy_matches(tokens, "100", "Amount")
+        assert len(matches) == 0
+
+    def test_fuzzy_match_exception_in_context_lookup(self):
+        """Should catch exceptions during fuzzy match processing."""
+        # Line 481-482: general exception handler
+        from unittest.mock import patch, MagicMock
+
+        matcher = FieldMatcher()
+        tokens = [MockToken("100", (0, 0, 50, 20))]
+
+        # Mock _find_context_keywords to raise an exception
+        with patch.object(matcher, '_find_context_keywords', side_effect=RuntimeError("Test error")):
+            # Should not raise, exception should be caught
+            matches = matcher._find_fuzzy_matches(tokens, "100", "Amount")
+            # Should return empty due to exception
+            assert len(matches) == 0
+
+
+class TestFlexibleDateInvalidDateParsing:
+    """Tests for invalid date parsing in flexible date matching."""
+
+    def test_invalid_date_in_normalized_values(self):
+        """Should handle invalid dates in normalized values gracefully."""
+        # Line 520-521: ValueError continue in target date parsing
+        matcher = FieldMatcher()
+        tokens = [MockToken("2025-01-15", (0, 0, 80, 20))]
+
+        # Pass an invalid date that matches the pattern but is not a valid date
+        # e.g., 2025-13-45 matches pattern but month 13 is invalid
+        matches = matcher._find_flexible_date_matches(
+            tokens, ["2025-13-45"], "InvoiceDate"
+        )
+        # Should return empty as no valid target date could be parsed
+        assert len(matches) == 0
+
+    def test_invalid_date_token_in_document(self):
+        """Should skip invalid date-like tokens in document."""
+        # Line 568-569: ValueError continue in date token parsing
+        matcher = FieldMatcher(min_score_threshold=0.5)
+        tokens = [
+            MockToken("2025-99-99", (0, 0, 80, 20)),  # Invalid date in doc
+            MockToken("2025-01-18", (0, 50, 80, 70)), # Valid date
+        ]
+
+        matches = matcher._find_flexible_date_matches(
+            tokens, ["2025-01-15"], "InvoiceDate"
+        )
+
+        # Should only match the valid date
+        assert len(matches) >= 1
+        assert matches[0].value == "2025-01-18"
+
+    def test_flexible_date_with_inline_keyword(self):
+        """Should detect inline keywords in date tokens."""
+        # Line 555: inline_keywords append
+        matcher = FieldMatcher(min_score_threshold=0.5)
+        tokens = [
+            MockToken("Fakturadatum: 2025-01-18", (0, 0, 150, 20)),
+        ]
+
+        matches = matcher._find_flexible_date_matches(
+            tokens, ["2025-01-15"], "InvoiceDate"
+        )
+
+        # Should find match with inline keyword
+        assert len(matches) >= 1
+        assert "fakturadatum" in matches[0].context_keywords
+
+
+if __name__ == "__main__":
+    pytest.main([__file__, "-v"])
--- a/src/normalize/normalizer.py
+++ b/src/normalize/normalizer.py
@@ -43,8 +43,8 @@ class FieldNormalizer:
        # Remove zero-width characters
        text = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', text)
        # Normalize different dash types to standard hyphen-minus (ASCII 45)
-        # en-dash (–, U+2013), em-dash (—, U+2014), minus sign (−, U+2212)
-        text = re.sub(r'[\u2013\u2014\u2212]', '-', text)
+        # en-dash (–, U+2013), em-dash (—, U+2014), minus sign (−, U+2212), middle dot (·, U+00B7)
+        text = re.sub(r'[\u2013\u2014\u2212\u00b7]', '-', text)
        # Normalize whitespace
        text = ' '.join(text.split())
        return text.strip()
@@ -571,6 +571,15 @@ class FieldNormalizer:
            # Short year with dot separator (e.g., 02.01.26)
            eu_dot_short = parsed_date.strftime('%d.%m.%y')

+            # Short year with slash separator (e.g., 20/10/24) - DD/MM/YY format
+            eu_slash_short = parsed_date.strftime('%d/%m/%y')
+
+            # Short year with hyphen separator (e.g., 23-11-01) - common in Swedish invoices
+            yy_mm_dd_short = parsed_date.strftime('%y-%m-%d')
+
+            # Middle dot separator (OCR sometimes reads hyphens as middle dots)
+            iso_middot = parsed_date.strftime('%Y·%m·%d')
+
            # Spaced formats (e.g., "2026 01 12", "26 01 12")
            spaced_full = parsed_date.strftime('%Y %m %d')
            spaced_short = parsed_date.strftime('%y %m %d')
@@ -581,10 +590,23 @@ class FieldNormalizer:
            swedish_format_full = f"{parsed_date.day} {month_full} {parsed_date.year}"
            swedish_format_abbrev = f"{parsed_date.day} {month_abbrev} {parsed_date.year}"

+            # Swedish month abbreviation with hyphen (e.g., "30-OKT-24", "30-okt-24")
+            month_abbrev_upper = month_abbrev.upper()
+            swedish_hyphen_short = f"{parsed_date.day:02d}-{month_abbrev_upper}-{parsed_date.strftime('%y')}"
+            swedish_hyphen_short_lower = f"{parsed_date.day:02d}-{month_abbrev}-{parsed_date.strftime('%y')}"
+            # Also without leading zero on day
+            swedish_hyphen_short_no_zero = f"{parsed_date.day}-{month_abbrev_upper}-{parsed_date.strftime('%y')}"
+
+            # Swedish month abbreviation with short year in different format (e.g., "SEP-24", "30 SEP 24")
+            month_year_only = f"{month_abbrev_upper}-{parsed_date.strftime('%y')}"
+            swedish_spaced = f"{parsed_date.day:02d} {month_abbrev_upper} {parsed_date.strftime('%y')}"
+
            variants.extend([
                iso, eu_slash, us_slash, eu_dot, iso_dot, compact, compact_short,
-                eu_dot_short, spaced_full, spaced_short,
-                swedish_format_full, swedish_format_abbrev
+                eu_dot_short, eu_slash_short, yy_mm_dd_short, iso_middot, spaced_full, spaced_short,
+                swedish_format_full, swedish_format_abbrev,
+                swedish_hyphen_short, swedish_hyphen_short_lower, swedish_hyphen_short_no_zero,
+                month_year_only, swedish_spaced
            ])

        return list(set(v for v in variants if v))
--- a/src/normalize/test_normalizer.py
+++ b/src/normalize/test_normalizer.py
@@ -0,0 +1,641 @@
+"""
+Tests for the Field Normalization Module.
+
+Tests cover all normalizer functions in src/normalize/normalizer.py
+
+Usage:
+    pytest src/normalize/test_normalizer.py -v
+"""
+
+import pytest
+from src.normalize.normalizer import (
+    FieldNormalizer,
+    NormalizedValue,
+    normalize_field,
+    NORMALIZERS,
+)
+
+
+class TestCleanText:
+    """Tests for FieldNormalizer.clean_text()"""
+
+    def test_removes_zero_width_characters(self):
+        """Should remove zero-width characters."""
+        text = "hello\u200bworld\u200c\u200d\ufeff"
+        assert FieldNormalizer.clean_text(text) == "helloworld"
+
+    def test_normalizes_dashes(self):
+        """Should normalize different dash types to standard hyphen."""
+        # en-dash
+        assert FieldNormalizer.clean_text("123\u2013456") == "123-456"
+        # em-dash
+        assert FieldNormalizer.clean_text("123\u2014456") == "123-456"
+        # minus sign
+        assert FieldNormalizer.clean_text("123\u2212456") == "123-456"
+        # middle dot
+        assert FieldNormalizer.clean_text("123\u00b7456") == "123-456"
+
+    def test_normalizes_whitespace(self):
+        """Should normalize multiple spaces to single space."""
+        assert FieldNormalizer.clean_text("hello   world") == "hello world"
+        assert FieldNormalizer.clean_text("  hello  world  ") == "hello world"
+
+    def test_strips_leading_trailing_whitespace(self):
+        """Should strip leading and trailing whitespace."""
+        assert FieldNormalizer.clean_text("  hello  ") == "hello"
+
+
+class TestNormalizeInvoiceNumber:
+    """Tests for FieldNormalizer.normalize_invoice_number()"""
+
+    def test_pure_digits(self):
+        """Should keep pure digit invoice numbers."""
+        variants = FieldNormalizer.normalize_invoice_number("100017500321")
+        assert "100017500321" in variants
+
+    def test_with_prefix(self):
+        """Should extract digits and keep original."""
+        variants = FieldNormalizer.normalize_invoice_number("INV-100017500321")
+        assert "INV-100017500321" in variants
+        assert "100017500321" in variants
+
+    def test_alphanumeric(self):
+        """Should handle alphanumeric invoice numbers."""
+        variants = FieldNormalizer.normalize_invoice_number("ABC123DEF456")
+        assert "ABC123DEF456" in variants
+        assert "123456" in variants
+
+    def test_empty_string(self):
+        """Should handle empty string gracefully."""
+        variants = FieldNormalizer.normalize_invoice_number("")
+        assert variants == []
+
+
+class TestNormalizeOcrNumber:
+    """Tests for FieldNormalizer.normalize_ocr_number()"""
+
+    def test_delegates_to_invoice_number(self):
+        """OCR normalization should behave like invoice number normalization."""
+        value = "123456789"
+        ocr_variants = FieldNormalizer.normalize_ocr_number(value)
+        invoice_variants = FieldNormalizer.normalize_invoice_number(value)
+        assert set(ocr_variants) == set(invoice_variants)
+
+
+class TestNormalizeBankgiro:
+    """Tests for FieldNormalizer.normalize_bankgiro()"""
+
+    def test_with_dash_8_digits(self):
+        """Should normalize 8-digit bankgiro with dash."""
+        variants = FieldNormalizer.normalize_bankgiro("5393-9484")
+        assert "5393-9484" in variants
+        assert "53939484" in variants
+
+    def test_without_dash_8_digits(self):
+        """Should add dash format for 8-digit bankgiro."""
+        variants = FieldNormalizer.normalize_bankgiro("53939484")
+        assert "53939484" in variants
+        assert "5393-9484" in variants
+
+    def test_7_digits(self):
+        """Should handle 7-digit bankgiro (XXX-XXXX format)."""
+        variants = FieldNormalizer.normalize_bankgiro("1234567")
+        assert "1234567" in variants
+        assert "123-4567" in variants
+
+    def test_with_dash_7_digits(self):
+        """Should normalize 7-digit bankgiro with dash."""
+        variants = FieldNormalizer.normalize_bankgiro("123-4567")
+        assert "123-4567" in variants
+        assert "1234567" in variants
+
+
+class TestNormalizePlusgiro:
+    """Tests for FieldNormalizer.normalize_plusgiro()"""
+
+    def test_with_dash_8_digits(self):
+        """Should normalize 8-digit plusgiro (XXXXXXX-X format)."""
+        variants = FieldNormalizer.normalize_plusgiro("1234567-8")
+        assert "1234567-8" in variants
+        assert "12345678" in variants
+
+    def test_without_dash_8_digits(self):
+        """Should add dash format for 8-digit plusgiro."""
+        variants = FieldNormalizer.normalize_plusgiro("12345678")
+        assert "12345678" in variants
+        assert "1234567-8" in variants
+
+    def test_7_digits(self):
+        """Should handle 7-digit plusgiro (XXXXXX-X format)."""
+        variants = FieldNormalizer.normalize_plusgiro("1234567")
+        assert "1234567" in variants
+        assert "123456-7" in variants
+
+
+class TestNormalizeOrganisationNumber:
+    """Tests for FieldNormalizer.normalize_organisation_number()"""
+
+    def test_with_dash(self):
+        """Should normalize org number with dash."""
+        variants = FieldNormalizer.normalize_organisation_number("556123-4567")
+        assert "556123-4567" in variants
+        assert "5561234567" in variants
+        assert "SE556123456701" in variants
+
+    def test_without_dash(self):
+        """Should add dash format for org number."""
+        variants = FieldNormalizer.normalize_organisation_number("5561234567")
+        assert "5561234567" in variants
+        assert "556123-4567" in variants
+        assert "SE556123456701" in variants
+
+    def test_from_vat_number(self):
+        """Should extract org number from Swedish VAT number."""
+        variants = FieldNormalizer.normalize_organisation_number("SE556123456701")
+        assert "SE556123456701" in variants
+        assert "5561234567" in variants
+        assert "556123-4567" in variants
+
+    def test_vat_variants(self):
+        """Should generate various VAT number formats."""
+        variants = FieldNormalizer.normalize_organisation_number("5561234567")
+        assert "SE556123456701" in variants
+        assert "se556123456701" in variants
+        assert "SE 5561234567 01" in variants
+        assert "SE5561234567" in variants
+
+    def test_12_digit_with_century(self):
+        """Should handle 12-digit org number with century prefix."""
+        variants = FieldNormalizer.normalize_organisation_number("195561234567")
+        assert "195561234567" in variants
+        assert "5561234567" in variants
+        assert "556123-4567" in variants
+
+
+class TestNormalizeSupplierAccounts:
+    """Tests for FieldNormalizer.normalize_supplier_accounts()"""
+
+    def test_single_plusgiro(self):
+        """Should normalize single plusgiro account."""
+        variants = FieldNormalizer.normalize_supplier_accounts("PG:48676043")
+        assert "PG:48676043" in variants
+        assert "48676043" in variants
+        assert "4867604-3" in variants
+
+    def test_single_bankgiro(self):
+        """Should normalize single bankgiro account."""
+        variants = FieldNormalizer.normalize_supplier_accounts("BG:5393-9484")
+        assert "BG:5393-9484" in variants
+        assert "5393-9484" in variants
+        assert "53939484" in variants
+
+    def test_multiple_accounts(self):
+        """Should handle multiple accounts separated by |."""
+        variants = FieldNormalizer.normalize_supplier_accounts(
+            "PG:48676043 | PG:49128028"
+        )
+        assert "PG:48676043" in variants
+        assert "48676043" in variants
+        assert "PG:49128028" in variants
+        assert "49128028" in variants
+
+    def test_prefix_normalization(self):
+        """Should normalize prefix formats."""
+        variants = FieldNormalizer.normalize_supplier_accounts("pg:12345678")
+        assert "PG:12345678" in variants
+        assert "PG: 12345678" in variants
+
+
+class TestNormalizeCustomerNumber:
+    """Tests for FieldNormalizer.normalize_customer_number()"""
+
+    def test_alphanumeric_with_space_and_dash(self):
+        """Should normalize customer number with space and dash."""
+        variants = FieldNormalizer.normalize_customer_number("EMM 256-6")
+        assert "EMM 256-6" in variants
+        assert "EMM256-6" in variants
+        assert "EMM2566" in variants
+
+    def test_alphanumeric_with_space(self):
+        """Should normalize customer number with space."""
+        variants = FieldNormalizer.normalize_customer_number("ABC 123")
+        assert "ABC 123" in variants
+        assert "ABC123" in variants
+
+    def test_case_variants(self):
+        """Should generate uppercase and lowercase variants."""
+        variants = FieldNormalizer.normalize_customer_number("Abc123")
+        assert "Abc123" in variants
+        assert "ABC123" in variants
+        assert "abc123" in variants
+
+
+class TestNormalizeAmount:
+    """Tests for FieldNormalizer.normalize_amount()"""
+
+    def test_integer_amount(self):
+        """Should normalize integer amount."""
+        variants = FieldNormalizer.normalize_amount("114")
+        assert "114" in variants
+        assert "114,00" in variants
+        assert "114.00" in variants
+
+    def test_with_comma_decimal(self):
+        """Should normalize amount with comma as decimal separator."""
+        variants = FieldNormalizer.normalize_amount("114,00")
+        assert "114,00" in variants
+        assert "114.00" in variants
+
+    def test_with_dot_decimal(self):
+        """Should normalize amount with dot as decimal separator."""
+        variants = FieldNormalizer.normalize_amount("114.00")
+        assert "114.00" in variants
+        assert "114,00" in variants
+
+    def test_with_space_thousand_separator(self):
+        """Should handle space as thousand separator."""
+        variants = FieldNormalizer.normalize_amount("1 234,56")
+        assert "1234,56" in variants
+        assert "1234.56" in variants
+
+    def test_space_as_decimal_separator(self):
+        """Should handle space as decimal separator (Swedish format)."""
+        variants = FieldNormalizer.normalize_amount("3045 52")
+        assert "3045.52" in variants
+        assert "3045,52" in variants
+        assert "304552" in variants
+
+    def test_us_format(self):
+        """Should handle US format (comma thousand, dot decimal)."""
+        variants = FieldNormalizer.normalize_amount("1,390.00")
+        assert "1390.00" in variants
+        assert "1390,00" in variants
+        assert "1.390,00" in variants  # European conversion
+
+    def test_european_format(self):
+        """Should handle European format (dot thousand, comma decimal)."""
+        variants = FieldNormalizer.normalize_amount("1.390,00")
+        assert "1390.00" in variants
+        assert "1390,00" in variants
+        assert "1,390.00" in variants  # US conversion
+
+    def test_space_thousand_with_decimal(self):
+        """Should handle space thousand separator with decimal."""
+        variants = FieldNormalizer.normalize_amount("10 571,00")
+        assert "10571,00" in variants
+        assert "10571.00" in variants
+
+    def test_removes_currency_symbols(self):
+        """Should remove currency symbols."""
+        variants = FieldNormalizer.normalize_amount("114 SEK")
+        assert "114" in variants
+
+    def test_large_amount_european_format(self):
+        """Should generate European format for large amounts."""
+        variants = FieldNormalizer.normalize_amount("20485")
+        assert "20485" in variants
+        assert "20.485" in variants
+        assert "20.485,00" in variants
+
+
+class TestNormalizeDate:
+    """Tests for FieldNormalizer.normalize_date()"""
+
+    def test_iso_format(self):
+        """Should parse and generate variants from ISO format."""
+        variants = FieldNormalizer.normalize_date("2025-12-13")
+        assert "2025-12-13" in variants
+        assert "13/12/2025" in variants
+        assert "13.12.2025" in variants
+        assert "20251213" in variants
+
+    def test_european_slash_format(self):
+        """Should parse European slash format DD/MM/YYYY."""
+        variants = FieldNormalizer.normalize_date("13/12/2025")
+        assert "2025-12-13" in variants
+        assert "13/12/2025" in variants
+
+    def test_european_dot_format(self):
+        """Should parse European dot format DD.MM.YYYY."""
+        variants = FieldNormalizer.normalize_date("13.12.2025")
+        assert "2025-12-13" in variants
+        assert "13.12.2025" in variants
+
+    def test_compact_format_yyyymmdd(self):
+        """Should parse compact format YYYYMMDD."""
+        variants = FieldNormalizer.normalize_date("20251213")
+        assert "2025-12-13" in variants
+        assert "20251213" in variants
+
+    def test_compact_format_yymmdd(self):
+        """Should parse compact format YYMMDD."""
+        variants = FieldNormalizer.normalize_date("251213")
+        assert "2025-12-13" in variants
+        assert "251213" in variants
+
+    def test_short_year_dot_format(self):
+        """Should parse DD.MM.YY format."""
+        variants = FieldNormalizer.normalize_date("02.08.25")
+        assert "2025-08-02" in variants
+        assert "02.08.25" in variants
+
+    def test_swedish_month_name(self):
+        """Should parse Swedish month names."""
+        variants = FieldNormalizer.normalize_date("13 december 2025")
+        assert "2025-12-13" in variants
+
+    def test_swedish_month_abbreviation(self):
+        """Should parse Swedish month abbreviations."""
+        variants = FieldNormalizer.normalize_date("13 dec 2025")
+        assert "2025-12-13" in variants
+
+    def test_generates_swedish_month_variants(self):
+        """Should generate Swedish month name variants."""
+        variants = FieldNormalizer.normalize_date("2025-01-09")
+        assert "9 januari 2025" in variants
+        assert "9 jan 2025" in variants
+
+    def test_generates_hyphen_month_abbrev_format(self):
+        """Should generate DD-MON-YY format."""
+        variants = FieldNormalizer.normalize_date("2024-10-30")
+        assert "30-OKT-24" in variants
+        assert "30-okt-24" in variants
+
+    def test_iso_with_time(self):
+        """Should handle ISO format with time component."""
+        variants = FieldNormalizer.normalize_date("2026-01-09 00:00:00")
+        assert "2026-01-09" in variants
+        assert "09/01/2026" in variants
+
+    def test_ambiguous_date_generates_both(self):
+        """Should generate both interpretations for ambiguous dates."""
+        # 01/02/2025 could be Jan 2 (US) or Feb 1 (EU)
+        variants = FieldNormalizer.normalize_date("01/02/2025")
+        # Both interpretations should be present
+        assert "2025-02-01" in variants  # European: DD/MM/YYYY
+        assert "2025-01-02" in variants  # US: MM/DD/YYYY
+
+    def test_middle_dot_separator(self):
+        """Should generate middle dot separator variant."""
+        variants = FieldNormalizer.normalize_date("2025-12-13")
+        assert "2025·12·13" in variants
+
+    def test_spaced_format(self):
+        """Should generate spaced format variants."""
+        variants = FieldNormalizer.normalize_date("2025-12-13")
+        assert "2025 12 13" in variants
+        assert "25 12 13" in variants
+
+
+class TestNormalizeField:
+    """Tests for the normalize_field() function."""
+
+    def test_uses_correct_normalizer(self):
+        """Should use the correct normalizer for each field type."""
+        # Test InvoiceNumber
+        result = normalize_field("InvoiceNumber", "INV-123")
+        assert "123" in result
+        assert "INV-123" in result
+
+        # Test Amount
+        result = normalize_field("Amount", "100")
+        assert "100" in result
+        assert "100,00" in result
+
+        # Test Date
+        result = normalize_field("InvoiceDate", "2025-01-01")
+        assert "2025-01-01" in result
+        assert "01/01/2025" in result
+
+    def test_unknown_field_cleans_text(self):
+        """Should clean text for unknown field types."""
+        result = normalize_field("UnknownField", "  hello  world  ")
+        assert result == ["hello world"]
+
+    def test_none_value(self):
+        """Should return empty list for None value."""
+        result = normalize_field("InvoiceNumber", None)
+        assert result == []
+
+    def test_empty_string(self):
+        """Should return empty list for empty string."""
+        result = normalize_field("InvoiceNumber", "")
+        assert result == []
+
+    def test_whitespace_only(self):
+        """Should return empty list for whitespace-only string."""
+        result = normalize_field("InvoiceNumber", "   ")
+        assert result == []
+
+    def test_converts_non_string_to_string(self):
+        """Should convert non-string values to string."""
+        result = normalize_field("Amount", 100)
+        assert "100" in result
+
+
+class TestNormalizersMapping:
+    """Tests for the NORMALIZERS mapping."""
+
+    def test_all_expected_fields_mapped(self):
+        """Should have normalizers for all expected field types."""
+        expected_fields = [
+            "InvoiceNumber",
+            "OCR",
+            "Bankgiro",
+            "Plusgiro",
+            "Amount",
+            "InvoiceDate",
+            "InvoiceDueDate",
+            "supplier_organisation_number",
+            "supplier_accounts",
+            "customer_number",
+        ]
+        for field in expected_fields:
+            assert field in NORMALIZERS, f"Missing normalizer for {field}"
+
+    def test_normalizers_are_callable(self):
+        """All normalizers should be callable."""
+        for name, normalizer in NORMALIZERS.items():
+            assert callable(normalizer), f"Normalizer {name} is not callable"
+
+
+class TestNormalizedValueDataclass:
+    """Tests for the NormalizedValue dataclass."""
+
+    def test_creation(self):
+        """Should create NormalizedValue with all fields."""
+        nv = NormalizedValue(
+            original="100",
+            variants=["100", "100.00", "100,00"],
+            field_type="Amount",
+        )
+        assert nv.original == "100"
+        assert nv.variants == ["100", "100.00", "100,00"]
+        assert nv.field_type == "Amount"
+
+
+class TestEdgeCases:
+    """Tests for edge cases and special scenarios."""
+
+    def test_unicode_normalization(self):
+        """Should handle unicode characters properly."""
+        # Non-breaking space
+        variants = FieldNormalizer.normalize_amount("1\xa0234,56")
+        assert "1234,56" in variants
+
+    def test_special_dashes_in_bankgiro(self):
+        """Should handle special dash characters in bankgiro."""
+        # en-dash
+        variants = FieldNormalizer.normalize_bankgiro("5393\u20139484")
+        assert "53939484" in variants
+        assert "5393-9484" in variants
+
+    def test_very_long_invoice_number(self):
+        """Should handle very long invoice numbers."""
+        long_number = "1" * 50
+        variants = FieldNormalizer.normalize_invoice_number(long_number)
+        assert long_number in variants
+
+    def test_mixed_case_vat_prefix(self):
+        """Should handle mixed case VAT prefix."""
+        variants = FieldNormalizer.normalize_organisation_number("Se556123456701")
+        assert "5561234567" in variants
+        assert "SE556123456701" in variants
+
+    def test_date_with_leading_zeros(self):
+        """Should handle dates with leading zeros."""
+        variants = FieldNormalizer.normalize_date("01.01.2025")
+        assert "2025-01-01" in variants
+
+    def test_amount_with_kr_suffix(self):
+        """Should handle amount with kr suffix."""
+        variants = FieldNormalizer.normalize_amount("100 kr")
+        assert "100" in variants
+
+    def test_amount_with_colon_dash(self):
+        """Should handle amount with :- suffix."""
+        variants = FieldNormalizer.normalize_amount("100:-")
+        assert "100" in variants
+
+
+class TestOrganisationNumberEdgeCases:
+    """Additional edge case tests for organisation number normalization."""
+
+    def test_vat_with_10_digits_after_se(self):
+        """Should handle VAT format SE + 10 digits (without trailing 01)."""
+        # Line 158-159: len(potential_org) == 10 case
+        variants = FieldNormalizer.normalize_organisation_number("SE5561234567")
+        assert "5561234567" in variants
+        assert "556123-4567" in variants
+
+    def test_vat_with_spaces(self):
+        """Should handle VAT with spaces."""
+        variants = FieldNormalizer.normalize_organisation_number("SE 5561234567 01")
+        assert "5561234567" in variants
+
+    def test_short_vat_prefix(self):
+        """Should handle SE prefix with less than 12 chars total."""
+        # This tests the fallback to digit extraction
+        variants = FieldNormalizer.normalize_organisation_number("SE12345")
+        assert "12345" in variants
+
+
+class TestSupplierAccountsEdgeCases:
+    """Additional edge case tests for supplier accounts normalization."""
+
+    def test_empty_account_in_list(self):
+        """Should skip empty accounts in list."""
+        # Line 224: empty account continue
+        variants = FieldNormalizer.normalize_supplier_accounts("PG:12345678 | | BG:53939484")
+        assert "12345678" in variants
+        assert "53939484" in variants
+
+    def test_account_without_prefix(self):
+        """Should handle account number without prefix."""
+        # Line 240: number = account (no colon)
+        variants = FieldNormalizer.normalize_supplier_accounts("12345678")
+        assert "12345678" in variants
+        assert "1234567-8" in variants
+
+    def test_7_digit_account(self):
+        """Should handle 7-digit account number."""
+        # Line 254-256: 7-digit format
+        variants = FieldNormalizer.normalize_supplier_accounts("1234567")
+        assert "1234567" in variants
+        assert "123456-7" in variants
+
+    def test_10_digit_account(self):
+        """Should handle 10-digit account number (org number format)."""
+        # Line 257-259: 10-digit format
+        variants = FieldNormalizer.normalize_supplier_accounts("5561234567")
+        assert "5561234567" in variants
+        assert "556123-4567" in variants
+
+    def test_mixed_format_accounts(self):
+        """Should handle multiple accounts with different formats."""
+        variants = FieldNormalizer.normalize_supplier_accounts("PG:1234567 | 53939484")
+        assert "1234567" in variants
+        assert "53939484" in variants
+
+
+class TestDateEdgeCases:
+    """Additional edge case tests for date normalization."""
+
+    def test_invalid_iso_date(self):
+        """Should handle invalid ISO date gracefully."""
+        # Line 483-484: ValueError in date parsing
+        variants = FieldNormalizer.normalize_date("2025-13-45")  # Invalid month/day
+        # Should still return original value
+        assert "2025-13-45" in variants
+
+    def test_invalid_european_date(self):
+        """Should handle invalid European date gracefully."""
+        # Line 496-497: ValueError in ambiguous date parsing
+        variants = FieldNormalizer.normalize_date("32/13/2025")  # Invalid day/month
+        assert "32/13/2025" in variants
+
+    def test_invalid_2digit_year_date(self):
+        """Should handle invalid 2-digit year date gracefully."""
+        # Line 521-522, 528-529: ValueError in 2-digit year parsing
+        variants = FieldNormalizer.normalize_date("99.99.25")  # Invalid day/month
+        assert "99.99.25" in variants
+
+    def test_swedish_month_with_short_year(self):
+        """Should handle Swedish month with 2-digit year."""
+        # Line 544: short year conversion
+        variants = FieldNormalizer.normalize_date("15 jan 25")
+        assert "2025-01-15" in variants
+
+    def test_swedish_month_with_old_year(self):
+        """Should handle Swedish month with old 2-digit year (50-99 -> 1900s)."""
+        variants = FieldNormalizer.normalize_date("15 jan 99")
+        assert "1999-01-15" in variants
+
+    def test_swedish_month_invalid_date(self):
+        """Should handle Swedish month with invalid day gracefully."""
+        # Line 548-549: ValueError continue
+        variants = FieldNormalizer.normalize_date("32 januari 2025")  # Invalid day
+        # Should still return original
+        assert "32 januari 2025" in variants
+
+    def test_ambiguous_date_both_invalid(self):
+        """Should handle ambiguous date where one interpretation is invalid."""
+        # 30/02/2025 - Feb 30 is invalid, but 02/30 would be invalid too
+        # This should still work for valid interpretations
+        variants = FieldNormalizer.normalize_date("15/06/2025")
+        assert "2025-06-15" in variants  # European interpretation
+        # US interpretation (month=15) would be invalid and skipped
+
+    def test_date_slash_format_2digit_year(self):
+        """Should parse DD/MM/YY format."""
+        variants = FieldNormalizer.normalize_date("15/06/25")
+        assert "2025-06-15" in variants
+
+    def test_date_dash_format_2digit_year(self):
+        """Should parse DD-MM-YY format."""
+        variants = FieldNormalizer.normalize_date("15-06-25")
+        assert "2025-06-15" in variants
+
+
+if __name__ == "__main__":
+    pytest.main([__file__, "-v"])
--- a/src/ocr/init.py
+++ b/src/ocr/init.py
@@ -1,3 +1,16 @@
 from .paddle_ocr import OCREngine, OCRResult, OCRToken, extract_ocr_tokens
+from .machine_code_parser import (
+    MachineCodeParser,
+    MachineCodeResult,
+    parse_machine_code,
+)

-__all__ = ['OCREngine', 'OCRResult', 'OCRToken', 'extract_ocr_tokens']
+__all__ = [
+    'OCREngine',
+    'OCRResult',
+    'OCRToken',
+    'extract_ocr_tokens',
+    'MachineCodeParser',
+    'MachineCodeResult',
+    'parse_machine_code',
+]
--- a/src/ocr/machine_code_parser.py
+++ b/src/ocr/machine_code_parser.py
@@ -0,0 +1,897 @@
+"""
+Machine Code Line Parser for Swedish Invoices
+
+Parses the bottom machine-readable payment line to extract:
+- OCR reference number (10-25 digits)
+- Amount (payment amount in SEK)
+- Bankgiro account number (XXX-XXXX or XXXX-XXXX format)
+- Plusgiro account number (XXXXXXX-X format)
+
+The machine code line is typically found at the bottom of Swedish invoices,
+in the payment slip (Inbetalningskort) section. It contains machine-readable
+data for automated payment processing.
+
+## Swedish Payment Line Standard Format
+
+The standard machine-readable payment line follows this structure:
+
+    # <OCR> # <Kronor> <Öre> <Type> > <Bankgiro>#<Control>#
+
+Example:
+    # 31130954410 # 315 00 2 > 8983025#14#
+
+Components:
+- `#` - Start delimiter
+- `31130954410` - OCR number (with Mod 10 check digit)
+- `#` - Separator
+- `315 00` - Amount: 315 SEK and 00 öre (315.00 SEK)
+- `2` - Payment type / record type
+- `>` - Points to recipient info
+- `8983025` - Bankgiro number
+- `#14#` - End marker with control code
+
+Legacy patterns also supported:
+- OCR: 8120000849965361 (10-25 consecutive digits)
+- Bankgiro: 5393-9484 or 53939484
+- Plusgiro: 1234567-8
+- Amount: 1234,56 or 1234.56 (with decimal separator)
+"""
+
+import re
+from dataclasses import dataclass, field
+from typing import Optional
+
+from src.pdf.extractor import Token as TextToken
+
+
+@dataclass
+class MachineCodeResult:
+    """Result of machine code parsing."""
+    ocr: Optional[str] = None
+    amount: Optional[str] = None
+    bankgiro: Optional[str] = None
+    plusgiro: Optional[str] = None
+    confidence: float = 0.0
+    source_tokens: list[TextToken] = field(default_factory=list)
+    raw_line: str = ""
+    # Region bounding box in PDF coordinates (x0, y0, x1, y1)
+    region_bbox: Optional[tuple[float, float, float, float]] = None
+
+    def to_dict(self) -> dict:
+        """Convert to dictionary for serialization."""
+        return {
+            'ocr': self.ocr,
+            'amount': self.amount,
+            'bankgiro': self.bankgiro,
+            'plusgiro': self.plusgiro,
+            'confidence': self.confidence,
+            'raw_line': self.raw_line,
+            'region_bbox': self.region_bbox,
+        }
+
+    def get_region_bbox(self) -> Optional[tuple[float, float, float, float]]:
+        """
+        Get the bounding box of the payment slip region.
+
+        Returns:
+            Tuple (x0, y0, x1, y1) in PDF coordinates, or None if no region detected
+        """
+        if self.region_bbox:
+            return self.region_bbox
+
+        if not self.source_tokens:
+            return None
+
+        # Calculate bbox from source tokens
+        x0 = min(t.bbox[0] for t in self.source_tokens)
+        y0 = min(t.bbox[1] for t in self.source_tokens)
+        x1 = max(t.bbox[2] for t in self.source_tokens)
+        y1 = max(t.bbox[3] for t in self.source_tokens)
+
+        return (x0, y0, x1, y1)
+
+
+class MachineCodeParser:
+    """
+    Parser for machine-readable payment lines on Swedish invoices.
+
+    The parser focuses on the bottom region of the invoice where
+    the payment slip (Inbetalningskort) is typically located.
+    """
+
+    # Payment slip detection keywords (Swedish)
+    PAYMENT_SLIP_KEYWORDS = [
+        'inbetalning', 'girering', 'avi', 'betalning',
+        'plusgiro', 'postgiro', 'bankgiro', 'bankgirot',
+        'betalningsavsändare', 'betalningsmottagare',
+        'maskinellt', 'ändringar',  # "DEN AVLÄSES MASKINELLT"
+    ]
+
+    # Patterns for field extraction
+    # OCR: 10-25 consecutive digits (may have spaces or # at end)
+    OCR_PATTERN = re.compile(r'(?<!\d)(\d{10,25})(?!\d)')
+
+    # Bankgiro: XXX-XXXX or XXXX-XXXX (7-8 digits with optional dash)
+    BANKGIRO_PATTERN = re.compile(r'\b(\d{3,4}[-\s]?\d{4})\b')
+
+    # Plusgiro: XXXXXXX-X (7-8 digits with dash before last digit)
+    PLUSGIRO_PATTERN = re.compile(r'\b(\d{6,7}[-\s]?\d)\b')
+
+    # Amount: digits with comma or dot as decimal separator
+    # Supports formats: 1234,56 | 1234.56 | 1 234,56 | 1.234,56
+    AMOUNT_PATTERN = re.compile(
+        r'\b(\d{1,3}(?:[\s\.\xa0]\d{3})*[,\.]\d{2})\b'
+    )
+
+    # Alternative amount pattern for integers (no decimal)
+    AMOUNT_INTEGER_PATTERN = re.compile(r'\b(\d{2,6})\b')
+
+    # Standard Swedish payment line pattern
+    # Format: # <OCR> # <Kronor> <Öre> <Type> > <Bankgiro/Plusgiro>#<Control>#
+    # Example: # 31130954410 # 315 00 2 > 8983025#14#
+    # This pattern captures both Bankgiro and Plusgiro accounts
+    PAYMENT_LINE_PATTERN = re.compile(
+        r'#\s*'                          # Start delimiter
+        r'(\d{5,25})\s*'                 # OCR number (capture group 1)
+        r'#\s*'                          # Separator
+        r'(\d{1,7})\s+'                  # Kronor (capture group 2)
+        r'(\d{2})\s+'                    # Öre (capture group 3)
+        r'(\d)\s*'                       # Type (capture group 4)
+        r'>\s*'                          # Direction marker
+        r'(\d{5,10})'                    # Bankgiro/Plusgiro (capture group 5)
+        r'(?:#\d{1,3}#)?'               # Optional end marker
+    )
+
+    # Alternative pattern with different spacing
+    PAYMENT_LINE_PATTERN_ALT = re.compile(
+        r'#?\s*'                         # Optional start delimiter
+        r'(\d{8,25})\s*'                 # OCR number
+        r'#?\s*'                         # Optional separator
+        r'(\d{1,7})\s+'                  # Kronor
+        r'(\d{2})\s+'                    # Öre
+        r'\d\s*'                         # Type
+        r'>?\s*'                         # Optional direction marker
+        r'(\d{5,10})'                    # Bankgiro
+    )
+
+    # Reverse format pattern (Bankgiro first, then OCR)
+    # Format: <Bankgiro>#<Control># <Kronor> <Öre> <Type> > <OCR> #
+    # Example: 53241469#41# 2428 00 1 > 4388595300 #
+    PAYMENT_LINE_PATTERN_REVERSE = re.compile(
+        r'(\d{7,8})'                     # Bankgiro (capture group 1)
+        r'#\d{1,3}#\s+'                  # Control marker
+        r'(\d{1,7})\s+'                  # Kronor (capture group 2)
+        r'(\d{2})\s+'                    # Öre (capture group 3)
+        r'\d\s*'                         # Type
+        r'>\s*'                          # Direction marker
+        r'(\d{5,25})'                    # OCR number (capture group 4)
+    )
+
+    def __init__(self, bottom_region_ratio: float = 0.35):
+        """
+        Initialize the parser.
+
+        Args:
+            bottom_region_ratio: Fraction of page height to consider as bottom region.
+                                 Default 0.35 means bottom 35% of the page.
+        """
+        self.bottom_region_ratio = bottom_region_ratio
+
+    def parse(
+        self,
+        tokens: list[TextToken],
+        page_height: float,
+        page_width: float | None = None,
+    ) -> MachineCodeResult:
+        """
+        Parse machine code from tokens.
+
+        Args:
+            tokens: List of text tokens from OCR or text extraction
+            page_height: Height of the page in points
+            page_width: Width of the page in points (optional)
+
+        Returns:
+            MachineCodeResult with extracted fields
+        """
+        if not tokens:
+            return MachineCodeResult()
+
+        # Filter to bottom region tokens
+        bottom_y_threshold = page_height * (1 - self.bottom_region_ratio)
+        bottom_tokens = [
+            t for t in tokens
+            if t.bbox[1] >= bottom_y_threshold  # y0 >= threshold
+        ]
+
+        if not bottom_tokens:
+            return MachineCodeResult()
+
+        # Sort by y position (top to bottom) then x (left to right)
+        bottom_tokens.sort(key=lambda t: (t.bbox[1], t.bbox[0]))
+
+        # Check if this looks like a payment slip region
+        combined_text = ' '.join(t.text for t in bottom_tokens).lower()
+        has_payment_keywords = any(
+            kw in combined_text for kw in self.PAYMENT_SLIP_KEYWORDS
+        )
+
+        # Build raw line from bottom tokens
+        raw_line = ' '.join(t.text for t in bottom_tokens)
+
+        # Try standard payment line format first and find the matching tokens
+        standard_result, matched_tokens = self._parse_standard_payment_line_with_tokens(
+            raw_line, bottom_tokens
+        )
+
+        if standard_result and matched_tokens:
+            # Calculate bbox only from tokens that contain the machine code
+            x0 = min(t.bbox[0] for t in matched_tokens)
+            y0 = min(t.bbox[1] for t in matched_tokens)
+            x1 = max(t.bbox[2] for t in matched_tokens)
+            y1 = max(t.bbox[3] for t in matched_tokens)
+            region_bbox = (x0, y0, x1, y1)
+
+            result = MachineCodeResult(
+                ocr=standard_result.get('ocr'),
+                amount=standard_result.get('amount'),
+                bankgiro=standard_result.get('bankgiro'),
+                plusgiro=standard_result.get('plusgiro'),
+                confidence=0.95,
+                source_tokens=matched_tokens,
+                raw_line=raw_line,
+                region_bbox=region_bbox,
+            )
+            return result
+
+        # Fall back to individual field extraction
+        result = MachineCodeResult(
+            source_tokens=bottom_tokens,
+            raw_line=raw_line,
+        )
+
+        # Extract OCR number (longest digit sequence 10-25 digits)
+        result.ocr = self._extract_ocr(bottom_tokens)
+
+        # Extract Bankgiro
+        result.bankgiro = self._extract_bankgiro(bottom_tokens)
+
+        # Extract Plusgiro (if no Bankgiro found)
+        if not result.bankgiro:
+            result.plusgiro = self._extract_plusgiro(bottom_tokens)
+
+        # Extract Amount
+        result.amount = self._extract_amount(bottom_tokens)
+
+        # Calculate confidence
+        result.confidence = self._calculate_confidence(
+            result, has_payment_keywords
+        )
+
+        # For fallback extraction, compute bbox from tokens that contain the extracted values
+        matched_tokens = self._find_tokens_with_values(bottom_tokens, result)
+        if matched_tokens:
+            x0 = min(t.bbox[0] for t in matched_tokens)
+            y0 = min(t.bbox[1] for t in matched_tokens)
+            x1 = max(t.bbox[2] for t in matched_tokens)
+            y1 = max(t.bbox[3] for t in matched_tokens)
+            result.region_bbox = (x0, y0, x1, y1)
+            result.source_tokens = matched_tokens
+
+        return result
+
+    def _find_tokens_with_values(
+        self,
+        tokens: list[TextToken],
+        result: MachineCodeResult
+    ) -> list[TextToken]:
+        """Find tokens that contain the extracted values (OCR, Amount, Bankgiro)."""
+        matched = []
+        values_to_find = []
+
+        if result.ocr:
+            values_to_find.append(result.ocr)
+        if result.amount:
+            # Amount might be just digits
+            amount_digits = re.sub(r'\D', '', result.amount)
+            values_to_find.append(amount_digits)
+            values_to_find.append(result.amount)
+        if result.bankgiro:
+            # Bankgiro might have dash or not
+            bg_digits = re.sub(r'\D', '', result.bankgiro)
+            values_to_find.append(bg_digits)
+            values_to_find.append(result.bankgiro)
+        if result.plusgiro:
+            pg_digits = re.sub(r'\D', '', result.plusgiro)
+            values_to_find.append(pg_digits)
+            values_to_find.append(result.plusgiro)
+
+        for token in tokens:
+            text = token.text.replace(' ', '').replace('#', '')
+            text_digits = re.sub(r'\D', '', token.text)
+
+            for value in values_to_find:
+                if value in text or value in text_digits:
+                    if token not in matched:
+                        matched.append(token)
+                    break
+
+        return matched
+
+    def _find_machine_code_line_tokens(
+        self,
+        tokens: list[TextToken]
+    ) -> list[TextToken]:
+        """
+        Find tokens that belong to the machine code line using pure regex patterns.
+
+        The machine code line typically contains:
+        - Control markers like #14#, #41#
+        - Direction marker >
+        - Account numbers with # suffix
+
+        Returns:
+            List of tokens belonging to the machine code line
+        """
+        # Find tokens with characteristic machine code patterns
+        ref_y = None
+
+        # First, find the reference y-coordinate from tokens with machine code patterns
+        for token in tokens:
+            text = token.text
+
+            # Check if token contains machine code patterns
+            # Priority 1: Control marker like #14#, 47304035#14#
+            has_control_marker = bool(re.search(r'#\d+#', text))
+            # Priority 2: Direction marker >
+            has_direction = '>' in text
+
+            if has_control_marker:
+                # This is very likely part of the machine code line
+                ref_y = token.bbox[1]
+                break
+            elif has_direction and ref_y is None:
+                # Direction marker is also a good indicator
+                ref_y = token.bbox[1]
+
+        if ref_y is None:
+            return []
+
+        # Collect all tokens on the same line (within 3 points of ref_y)
+        # Use very small tolerance because Swedish invoices often have duplicate
+        # machine code lines (upper and lower part of payment slip)
+        y_tolerance = 3
+        machine_code_tokens = []
+        for token in tokens:
+            if abs(token.bbox[1] - ref_y) < y_tolerance:
+                text = token.text
+                # Include token if it contains:
+                # - Digits (OCR, amount, account numbers)
+                # - # symbol (delimiters, control markers)
+                # - > symbol (direction marker)
+                if (re.search(r'\d', text) or '#' in text or '>' in text):
+                    machine_code_tokens.append(token)
+
+        # If we found very few tokens, try to expand to nearby y values
+        # that might be part of the same logical line
+        if len(machine_code_tokens) < 3:
+            y_tolerance = 10
+            machine_code_tokens = []
+            for token in tokens:
+                if abs(token.bbox[1] - ref_y) < y_tolerance:
+                    text = token.text
+                    if (re.search(r'\d', text) or '#' in text or '>' in text):
+                        machine_code_tokens.append(token)
+
+        return machine_code_tokens
+
+    def _parse_standard_payment_line_with_tokens(
+        self,
+        raw_line: str,
+        tokens: list[TextToken]
+    ) -> tuple[Optional[dict], list[TextToken]]:
+        """
+        Parse standard Swedish payment line format and find matching tokens.
+
+        Uses pure regex to identify the machine code line, then finds tokens
+        that are part of that line based on their position.
+
+        Format: # <OCR> # <Kronor> <Öre> <Type> > <Bankgiro/Plusgiro>#<Control>#
+        Example: # 31130954410 # 315 00 2 > 8983025#14#
+
+        Returns:
+            Tuple of (parsed_dict, matched_tokens) or (None, [])
+        """
+        # First find the machine code line tokens using pattern matching
+        machine_code_tokens = self._find_machine_code_line_tokens(tokens)
+
+        if not machine_code_tokens:
+            # Fall back to regex on raw_line
+            parsed = self._parse_standard_payment_line(raw_line, raw_line)
+            return parsed, []
+
+        # Build a line from just the machine code tokens (sorted by x position)
+        # Group tokens by approximate x position to handle duplicate OCR results
+        mc_tokens_sorted = sorted(machine_code_tokens, key=lambda t: t.bbox[0])
+
+        # Deduplicate tokens at similar x positions (keep the first one)
+        deduped_tokens = []
+        last_x = -100
+        for t in mc_tokens_sorted:
+            # Skip tokens that are too close to the previous one (likely duplicates)
+            if t.bbox[0] - last_x < 5:
+                continue
+            deduped_tokens.append(t)
+            last_x = t.bbox[2]  # Use end x for next comparison
+
+        mc_line = ' '.join(t.text for t in deduped_tokens)
+
+        # Try to parse this line, using raw_line for context detection
+        parsed = self._parse_standard_payment_line(mc_line, raw_line)
+        if parsed:
+            return parsed, deduped_tokens
+
+        # If machine code line parsing failed, try the full raw_line
+        parsed = self._parse_standard_payment_line(raw_line, raw_line)
+        if parsed:
+            return parsed, machine_code_tokens
+
+        return None, []
+
+    def _parse_standard_payment_line(
+        self,
+        raw_line: str,
+        context_line: str | None = None
+    ) -> Optional[dict]:
+        """
+        Parse standard Swedish payment line format.
+
+        Format: # <OCR> # <Kronor> <Öre> <Type> > <Bankgiro/Plusgiro>#<Control>#
+        Example: # 31130954410 # 315 00 2 > 8983025#14#
+
+        Args:
+            raw_line: The line to parse (may be just the machine code tokens)
+            context_line: Optional full line for context detection (e.g., to find "plusgiro" keywords)
+
+        Returns:
+            Dict with 'ocr', 'amount', and 'bankgiro' or 'plusgiro' if matched, None otherwise
+        """
+        # Use context_line for detecting Plusgiro/Bankgiro, fall back to raw_line
+        context = (context_line or raw_line).lower()
+        is_plusgiro_context = (
+            ('plusgiro' in context or 'postgiro' in context or 'plusgirokonto' in context)
+            and 'bankgiro' not in context
+        )
+
+        # Preprocess: remove spaces in the account number part (after >)
+        # This handles cases like "78 2 1 713" -> "7821713"
+        def normalize_account_spaces(line: str) -> str:
+            """Remove spaces in account number portion after > marker."""
+            if '>' in line:
+                parts = line.split('>', 1)
+                # After >, remove spaces between digits (but keep # markers)
+                after_arrow = parts[1]
+                # Extract digits and # markers, remove spaces between digits
+                normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', after_arrow)
+                # May need multiple passes for sequences like "78 2 1 713"
+                while re.search(r'(\d)\s+(\d)', normalized):
+                    normalized = re.sub(r'(\d)\s+(\d)', r'\1\2', normalized)
+                return parts[0] + '>' + normalized
+            return line
+
+        raw_line = normalize_account_spaces(raw_line)
+
+        def format_account(account_digits: str) -> tuple[str, str]:
+            """Format account and determine type (bankgiro or plusgiro).
+
+            Returns: (formatted_account, account_type)
+            """
+            if is_plusgiro_context:
+                # Plusgiro format: XXXXXXX-X
+                formatted = f"{account_digits[:-1]}-{account_digits[-1]}"
+                return formatted, 'plusgiro'
+            else:
+                # Bankgiro format: XXX-XXXX or XXXX-XXXX
+                if len(account_digits) == 7:
+                    formatted = f"{account_digits[:3]}-{account_digits[3:]}"
+                elif len(account_digits) == 8:
+                    formatted = f"{account_digits[:4]}-{account_digits[4:]}"
+                else:
+                    formatted = account_digits
+                return formatted, 'bankgiro'
+
+        # Try primary pattern
+        match = self.PAYMENT_LINE_PATTERN.search(raw_line)
+        if match:
+            ocr = match.group(1)
+            kronor = match.group(2)
+            ore = match.group(3)
+            account_digits = match.group(5)
+
+            # Format amount: combine kronor and öre
+            amount = f"{kronor},{ore}" if ore != "00" else kronor
+
+            formatted_account, account_type = format_account(account_digits)
+
+            return {
+                'ocr': ocr,
+                'amount': amount,
+                account_type: formatted_account,
+            }
+
+        # Try alternative pattern
+        match = self.PAYMENT_LINE_PATTERN_ALT.search(raw_line)
+        if match:
+            ocr = match.group(1)
+            kronor = match.group(2)
+            ore = match.group(3)
+            account_digits = match.group(4)
+
+            amount = f"{kronor},{ore}" if ore != "00" else kronor
+
+            formatted_account, account_type = format_account(account_digits)
+
+            return {
+                'ocr': ocr,
+                'amount': amount,
+                account_type: formatted_account,
+            }
+
+        # Try reverse pattern (Account first, then OCR)
+        match = self.PAYMENT_LINE_PATTERN_REVERSE.search(raw_line)
+        if match:
+            account_digits = match.group(1)
+            kronor = match.group(2)
+            ore = match.group(3)
+            ocr = match.group(4)
+
+            amount = f"{kronor},{ore}" if ore != "00" else kronor
+
+            formatted_account, account_type = format_account(account_digits)
+
+            return {
+                'ocr': ocr,
+                'amount': amount,
+                account_type: formatted_account,
+            }
+
+        return None
+
+    def _extract_ocr(self, tokens: list[TextToken]) -> Optional[str]:
+        """Extract OCR reference number."""
+        candidates = []
+
+        # First, collect all bankgiro-like patterns to exclude
+        bankgiro_digits = set()
+        for token in tokens:
+            text = token.text.strip()
+            bg_matches = self.BANKGIRO_PATTERN.findall(text)
+            for bg in bg_matches:
+                digits = re.sub(r'\D', '', bg)
+                bankgiro_digits.add(digits)
+                # Also add with potential check digits (common pattern)
+                for i in range(10):
+                    bankgiro_digits.add(digits + str(i))
+                    bankgiro_digits.add(digits + str(i) + str(i))
+
+        for token in tokens:
+            # Remove spaces and common suffixes
+            text = token.text.replace(' ', '').replace('#', '').strip()
+
+            # Find all digit sequences
+            matches = self.OCR_PATTERN.findall(text)
+            for match in matches:
+                # OCR numbers are typically 10-25 digits
+                if 10 <= len(match) <= 25:
+                    # Skip if this looks like a bankgiro number with check digit
+                    is_bankgiro_variant = any(
+                        match.startswith(bg) or match.endswith(bg)
+                        for bg in bankgiro_digits if len(bg) >= 7
+                    )
+
+                    # Also check if it's exactly bankgiro with 2-3 extra digits
+                    for bg in bankgiro_digits:
+                        if len(bg) >= 7 and (
+                            match == bg or
+                            (len(match) - len(bg) <= 3 and match.startswith(bg))
+                        ):
+                            is_bankgiro_variant = True
+                            break
+
+                    if not is_bankgiro_variant:
+                        candidates.append((match, len(match), token))
+
+        if not candidates:
+            return None
+
+        # Prefer longer sequences (more likely to be OCR)
+        candidates.sort(key=lambda x: x[1], reverse=True)
+        return candidates[0][0]
+
+    def _extract_bankgiro(self, tokens: list[TextToken]) -> Optional[str]:
+        """Extract Bankgiro account number.
+
+        Bankgiro format: XXX-XXXX or XXXX-XXXX (dash in middle)
+        NOT Plusgiro: XXXXXXX-X (dash before last digit)
+        """
+        candidates = []
+        context_text = ' '.join(t.text.lower() for t in tokens)
+
+        # Check if this is clearly a Plusgiro context (not Bankgiro)
+        is_plusgiro_only_context = (
+            ('plusgiro' in context_text or 'postgiro' in context_text or 'plusgirokonto' in context_text)
+            and 'bankgiro' not in context_text
+        )
+
+        # If clearly Plusgiro context, don't extract as Bankgiro
+        if is_plusgiro_only_context:
+            return None
+
+        for token in tokens:
+            text = token.text.strip()
+
+            # Look for Bankgiro pattern
+            matches = self.BANKGIRO_PATTERN.findall(text)
+            for match in matches:
+                # Check if this looks like Plusgiro format (dash before last digit)
+                # Plusgiro: 1234567-8 (dash at position -2)
+                if '-' in match:
+                    parts = match.replace(' ', '').split('-')
+                    if len(parts) == 2 and len(parts[1]) == 1:
+                        # This is Plusgiro format, skip
+                        continue
+
+                # Normalize: remove spaces, ensure dash
+                digits = re.sub(r'\D', '', match)
+                if len(digits) == 7:
+                    normalized = f"{digits[:3]}-{digits[3:]}"
+                elif len(digits) == 8:
+                    normalized = f"{digits[:4]}-{digits[4:]}"
+                else:
+                    continue
+
+                # Check if "bankgiro" or "bg" appears nearby
+                is_bankgiro_context = (
+                    'bankgiro' in context_text or
+                    'bg:' in context_text or
+                    'bg ' in context_text
+                )
+
+                candidates.append((normalized, is_bankgiro_context, token))
+
+        if not candidates:
+            return None
+
+        # Prefer matches with bankgiro context
+        candidates.sort(key=lambda x: (x[1], 1), reverse=True)
+        return candidates[0][0]
+
+    def _extract_plusgiro(self, tokens: list[TextToken]) -> Optional[str]:
+        """Extract Plusgiro account number."""
+        candidates = []
+
+        for token in tokens:
+            text = token.text.strip()
+
+            matches = self.PLUSGIRO_PATTERN.findall(text)
+            for match in matches:
+                # Normalize: remove spaces, ensure dash before last digit
+                digits = re.sub(r'\D', '', match)
+                if 7 <= len(digits) <= 8:
+                    normalized = f"{digits[:-1]}-{digits[-1]}"
+
+                    # Check context
+                    context_text = ' '.join(t.text.lower() for t in tokens)
+                    is_plusgiro_context = (
+                        'plusgiro' in context_text or
+                        'postgiro' in context_text or
+                        'pg:' in context_text or
+                        'pg ' in context_text
+                    )
+
+                    candidates.append((normalized, is_plusgiro_context, token))
+
+        if not candidates:
+            return None
+
+        candidates.sort(key=lambda x: (x[1], 1), reverse=True)
+        return candidates[0][0]
+
+    def _extract_amount(self, tokens: list[TextToken]) -> Optional[str]:
+        """Extract payment amount."""
+        candidates = []
+
+        for token in tokens:
+            text = token.text.strip()
+
+            # Try decimal amount pattern first
+            matches = self.AMOUNT_PATTERN.findall(text)
+            for match in matches:
+                # Normalize: remove thousand separators, use comma as decimal
+                normalized = match.replace(' ', '').replace('\xa0', '')
+                # Convert dot thousand separator to none, keep comma decimal
+                if '.' in normalized and ',' in normalized:
+                    # Format like 1.234,56 -> 1234,56
+                    normalized = normalized.replace('.', '')
+                elif '.' in normalized:
+                    # Could be 1234.56 -> 1234,56
+                    parts = normalized.split('.')
+                    if len(parts) == 2 and len(parts[1]) == 2:
+                        normalized = f"{parts[0]},{parts[1]}"
+
+                # Parse to verify it's a valid amount
+                try:
+                    value = float(normalized.replace(',', '.'))
+                    if 0 < value < 1000000:  # Reasonable amount range
+                        candidates.append((normalized, value, token))
+                except ValueError:
+                    continue
+
+        # If no decimal amounts found, try integer amounts
+        # Look for "Kronor" label nearby and extract integer
+        if not candidates:
+            for i, token in enumerate(tokens):
+                text = token.text.strip().lower()
+                if 'kronor' in text or 'kr' == text or text.endswith(' kr'):
+                    # Look at nearby tokens for amounts (wider range)
+                    for j in range(max(0, i - 5), min(len(tokens), i + 5)):
+                        nearby_text = tokens[j].text.strip()
+                        # Match pure integer (1-6 digits)
+                        int_match = re.match(r'^(\d{1,6})$', nearby_text)
+                        if int_match:
+                            value = int(int_match.group(1))
+                            if 0 < value < 1000000:
+                                candidates.append((str(value), float(value), tokens[j]))
+
+        # Also try to find amounts near "öre" label (Swedish cents)
+        if not candidates:
+            for i, token in enumerate(tokens):
+                text = token.text.strip().lower()
+                if 'öre' in text:
+                    # Look at nearby tokens for amounts
+                    for j in range(max(0, i - 5), min(len(tokens), i + 5)):
+                        nearby_text = tokens[j].text.strip()
+                        int_match = re.match(r'^(\d{1,6})$', nearby_text)
+                        if int_match:
+                            value = int(int_match.group(1))
+                            if 0 < value < 1000000:
+                                candidates.append((str(value), float(value), tokens[j]))
+
+        if not candidates:
+            return None
+
+        # Sort by value (prefer larger amounts - likely total)
+        candidates.sort(key=lambda x: x[1], reverse=True)
+        return candidates[0][0]
+
+    def _calculate_confidence(
+        self,
+        result: MachineCodeResult,
+        has_payment_keywords: bool
+    ) -> float:
+        """Calculate confidence score for the extraction."""
+        confidence = 0.0
+
+        # Base confidence from payment keywords
+        if has_payment_keywords:
+            confidence += 0.3
+
+        # Points for each extracted field
+        if result.ocr:
+            confidence += 0.25
+            # Bonus for typical OCR length (15-17 digits)
+            if 15 <= len(result.ocr) <= 17:
+                confidence += 0.1
+
+        if result.bankgiro or result.plusgiro:
+            confidence += 0.2
+
+        if result.amount:
+            confidence += 0.15
+
+        return min(confidence, 1.0)
+
+    def cross_validate(
+        self,
+        machine_result: MachineCodeResult,
+        csv_values: dict[str, str],
+    ) -> dict[str, dict]:
+        """
+        Cross-validate machine code extraction with CSV ground truth.
+
+        Args:
+            machine_result: Result from parse()
+            csv_values: Dict of field values from CSV
+                        (keys: 'ocr', 'amount', 'bankgiro', 'plusgiro')
+
+        Returns:
+            Dict with validation results for each field:
+            {
+                'ocr': {
+                    'machine': '123456789',
+                    'csv': '123456789',
+                    'match': True,
+                    'use_machine': False,  # CSV has value
+                },
+                ...
+            }
+        """
+        from src.normalize import normalize_field
+
+        results = {}
+
+        field_mapping = [
+            ('ocr', 'OCR', machine_result.ocr),
+            ('amount', 'Amount', machine_result.amount),
+            ('bankgiro', 'Bankgiro', machine_result.bankgiro),
+            ('plusgiro', 'Plusgiro', machine_result.plusgiro),
+        ]
+
+        for field_key, normalizer_name, machine_value in field_mapping:
+            csv_value = csv_values.get(field_key, '').strip()
+
+            result_entry = {
+                'machine': machine_value,
+                'csv': csv_value if csv_value else None,
+                'match': False,
+                'use_machine': False,
+            }
+
+            if machine_value and csv_value:
+                # Both have values - check if they match
+                machine_variants = normalize_field(normalizer_name, machine_value)
+                csv_variants = normalize_field(normalizer_name, csv_value)
+
+                # Check for any overlap
+                result_entry['match'] = bool(
+                    set(machine_variants) & set(csv_variants)
+                )
+
+                # Special handling for amounts - allow rounding differences
+                if not result_entry['match'] and field_key == 'amount':
+                    try:
+                        # Parse both values as floats
+                        machine_float = float(
+                            machine_value.replace(' ', '')
+                            .replace(',', '.').replace('\xa0', '')
+                        )
+                        csv_float = float(
+                            csv_value.replace(' ', '')
+                            .replace(',', '.').replace('\xa0', '')
+                        )
+                        # Allow 1 unit difference (rounding)
+                        if abs(machine_float - csv_float) <= 1.0:
+                            result_entry['match'] = True
+                            result_entry['rounding_diff'] = True
+                    except ValueError:
+                        pass
+
+            elif machine_value and not csv_value:
+                # CSV is missing, use machine value
+                result_entry['use_machine'] = True
+
+            results[field_key] = result_entry
+
+        return results
+
+
+def parse_machine_code(
+    tokens: list[TextToken],
+    page_height: float,
+    page_width: float | None = None,
+    bottom_ratio: float = 0.35,
+) -> MachineCodeResult:
+    """
+    Convenience function to parse machine code from tokens.
+
+    Args:
+        tokens: List of text tokens
+        page_height: Page height in points
+        page_width: Page width in points (optional)
+        bottom_ratio: Fraction of page to consider as bottom region
+
+    Returns:
+        MachineCodeResult with extracted fields
+    """
+    parser = MachineCodeParser(bottom_region_ratio=bottom_ratio)
+    return parser.parse(tokens, page_height, page_width)
--- a/src/ocr/test_machine_code_parser.py
+++ b/src/ocr/test_machine_code_parser.py
@@ -0,0 +1,251 @@
+"""
+Tests for Machine Code Parser
+
+Tests the parsing of Swedish invoice payment lines including:
+- Standard payment line format
+- Account number normalization (spaces removal)
+- Bankgiro/Plusgiro detection
+- OCR and Amount extraction
+"""
+
+import pytest
+from src.ocr.machine_code_parser import MachineCodeParser, MachineCodeResult
+
+
+class TestParseStandardPaymentLine:
+    """Tests for _parse_standard_payment_line method."""
+
+    @pytest.fixture
+    def parser(self):
+        return MachineCodeParser()
+
+    def test_standard_format_bankgiro(self, parser):
+        """Test standard payment line with Bankgiro."""
+        line = "# 31130954410 # 315 00 2 > 8983025#14#"
+        result = parser._parse_standard_payment_line(line)
+
+        assert result is not None
+        assert result['ocr'] == '31130954410'
+        assert result['amount'] == '315'
+        assert result['bankgiro'] == '898-3025'
+
+    def test_standard_format_with_ore(self, parser):
+        """Test payment line with non-zero öre."""
+        line = "# 12345678901 # 100 50 2 > 7821713#41#"
+        result = parser._parse_standard_payment_line(line)
+
+        assert result is not None
+        assert result['ocr'] == '12345678901'
+        assert result['amount'] == '100,50'
+        assert result['bankgiro'] == '782-1713'
+
+    def test_spaces_in_bankgiro(self, parser):
+        """Test payment line with spaces in Bankgiro number."""
+        line = "# 310196187399952 # 11699 00 6 > 78 2 1 713 #41#"
+        result = parser._parse_standard_payment_line(line)
+
+        assert result is not None
+        assert result['ocr'] == '310196187399952'
+        assert result['amount'] == '11699'
+        assert result['bankgiro'] == '782-1713'
+
+    def test_spaces_in_bankgiro_multiple(self, parser):
+        """Test payment line with multiple spaces in account number."""
+        line = "# 123456789 # 500 00 1 > 1 2 3 4 5 6 7 #99#"
+        result = parser._parse_standard_payment_line(line)
+
+        assert result is not None
+        assert result['bankgiro'] == '123-4567'
+
+    def test_8_digit_bankgiro(self, parser):
+        """Test 8-digit Bankgiro formatting."""
+        line = "# 12345678901 # 200 00 2 > 53939484#14#"
+        result = parser._parse_standard_payment_line(line)
+
+        assert result is not None
+        assert result['bankgiro'] == '5393-9484'
+
+    def test_plusgiro_context(self, parser):
+        """Test Plusgiro detection based on context."""
+        line = "# 12345678901 # 100 00 2 > 1234567#14#"
+        result = parser._parse_standard_payment_line(line, context_line="plusgiro payment")
+
+        assert result is not None
+        assert 'plusgiro' in result
+        assert result['plusgiro'] == '123456-7'
+
+    def test_no_match_invalid_format(self, parser):
+        """Test that invalid format returns None."""
+        line = "This is not a valid payment line"
+        result = parser._parse_standard_payment_line(line)
+
+        assert result is None
+
+    def test_alternative_pattern(self, parser):
+        """Test alternative payment line pattern."""
+        line = "8120000849965361 11699 00 1 > 7821713"
+        result = parser._parse_standard_payment_line(line)
+
+        assert result is not None
+        assert result['ocr'] == '8120000849965361'
+
+    def test_long_ocr_number(self, parser):
+        """Test OCR number up to 25 digits."""
+        line = "# 1234567890123456789012345 # 100 00 2 > 7821713#14#"
+        result = parser._parse_standard_payment_line(line)
+
+        assert result is not None
+        assert result['ocr'] == '1234567890123456789012345'
+
+    def test_large_amount(self, parser):
+        """Test large amount extraction."""
+        line = "# 12345678901 # 1234567 00 2 > 7821713#14#"
+        result = parser._parse_standard_payment_line(line)
+
+        assert result is not None
+        assert result['amount'] == '1234567'
+
+
+class TestNormalizeAccountSpaces:
+    """Tests for account number space normalization."""
+
+    @pytest.fixture
+    def parser(self):
+        return MachineCodeParser()
+
+    def test_no_spaces(self, parser):
+        """Test line without spaces in account."""
+        line = "# 123456789 # 100 00 1 > 7821713#14#"
+        result = parser._parse_standard_payment_line(line)
+        assert result['bankgiro'] == '782-1713'
+
+    def test_single_space(self, parser):
+        """Test single space between digits."""
+        line = "# 123456789 # 100 00 1 > 782 1713#14#"
+        result = parser._parse_standard_payment_line(line)
+        assert result['bankgiro'] == '782-1713'
+
+    def test_multiple_spaces(self, parser):
+        """Test multiple spaces."""
+        line = "# 123456789 # 100 00 1 > 7 8 2 1 7 1 3#14#"
+        result = parser._parse_standard_payment_line(line)
+        assert result['bankgiro'] == '782-1713'
+
+    def test_no_arrow_marker(self, parser):
+        """Test line without > marker - spaces not normalized."""
+        # Without >, the normalization won't happen
+        line = "# 123456789 # 100 00 1 7821713#14#"
+        result = parser._parse_standard_payment_line(line)
+        # This pattern might not match due to missing >
+        # Just ensure no crash
+        assert result is None or isinstance(result, dict)
+
+
+class TestMachineCodeResult:
+    """Tests for MachineCodeResult dataclass."""
+
+    def test_to_dict(self):
+        """Test conversion to dictionary."""
+        result = MachineCodeResult(
+            ocr='12345678901',
+            amount='100',
+            bankgiro='782-1713',
+            confidence=0.95,
+            raw_line='test line'
+        )
+
+        d = result.to_dict()
+        assert d['ocr'] == '12345678901'
+        assert d['amount'] == '100'
+        assert d['bankgiro'] == '782-1713'
+        assert d['confidence'] == 0.95
+        assert d['raw_line'] == 'test line'
+
+    def test_empty_result(self):
+        """Test empty result."""
+        result = MachineCodeResult()
+        d = result.to_dict()
+
+        assert d['ocr'] is None
+        assert d['amount'] is None
+        assert d['bankgiro'] is None
+        assert d['plusgiro'] is None
+
+
+class TestRealWorldExamples:
+    """Tests using real-world payment line examples."""
+
+    @pytest.fixture
+    def parser(self):
+        return MachineCodeParser()
+
+    def test_fastum_invoice(self, parser):
+        """Test Fastum invoice payment line (from Faktura_A3861)."""
+        line = "# 310196187399952 # 11699 00 6 > 78 2 1 713 #41#"
+        result = parser._parse_standard_payment_line(line)
+
+        assert result is not None
+        assert result['ocr'] == '310196187399952'
+        assert result['amount'] == '11699'
+        assert result['bankgiro'] == '782-1713'
+
+    def test_standard_bankgiro_invoice(self, parser):
+        """Test standard Bankgiro format."""
+        line = "# 31130954410 # 315 00 2 > 8983025#14#"
+        result = parser._parse_standard_payment_line(line)
+
+        assert result is not None
+        assert result['ocr'] == '31130954410'
+        assert result['amount'] == '315'
+        assert result['bankgiro'] == '898-3025'
+
+    def test_payment_line_with_extra_whitespace(self, parser):
+        """Test payment line with extra whitespace."""
+        line = "#  310196187399952  #  11699  00  6  >  7821713  #41#"
+        result = parser._parse_standard_payment_line(line)
+
+        # May or may not match depending on regex flexibility
+        # At minimum, should not crash
+        assert result is None or isinstance(result, dict)
+
+
+class TestEdgeCases:
+    """Tests for edge cases and boundary conditions."""
+
+    @pytest.fixture
+    def parser(self):
+        return MachineCodeParser()
+
+    def test_empty_string(self, parser):
+        """Test empty string input."""
+        result = parser._parse_standard_payment_line("")
+        assert result is None
+
+    def test_only_whitespace(self, parser):
+        """Test whitespace-only input."""
+        result = parser._parse_standard_payment_line("   \t\n  ")
+        assert result is None
+
+    def test_minimum_ocr_length(self, parser):
+        """Test minimum OCR length (5 digits)."""
+        line = "# 12345 # 100 00 1 > 7821713#14#"
+        result = parser._parse_standard_payment_line(line)
+        assert result is not None
+        assert result['ocr'] == '12345'
+
+    def test_minimum_bankgiro_length(self, parser):
+        """Test minimum Bankgiro length (5 digits)."""
+        line = "# 12345678901 # 100 00 1 > 12345#14#"
+        result = parser._parse_standard_payment_line(line)
+        assert result is not None
+
+    def test_special_characters_in_line(self, parser):
+        """Test handling of special characters."""
+        line = "# 12345678901 # 100 00 1 > 7821713#14# (SEK)"
+        result = parser._parse_standard_payment_line(line)
+        assert result is not None
+        assert result['ocr'] == '12345678901'
+
+
+if __name__ == '__main__':
+    pytest.main([__file__, '-v'])
--- a/src/pdf/detector.py
+++ b/src/pdf/detector.py
@@ -28,17 +28,69 @@ def extract_text_first_page(pdf_path: str | Path) -> str:

 def is_text_pdf(pdf_path: str | Path, min_chars: int = 30) -> bool:
    """
-    Check if PDF has extractable text layer.
+    Check if PDF has extractable AND READABLE text layer.
+
+    Some PDFs have custom font encodings that produce garbled text.
+    This function checks both the presence and readability of text.

    Args:
        pdf_path: Path to the PDF file
        min_chars: Minimum characters to consider it a text PDF

    Returns:
-        True if PDF has text layer, False if scanned
+        True if PDF has readable text layer, False if scanned or garbled
    """
    text = extract_text_first_page(pdf_path)
-    return len(text.strip()) > min_chars
+    stripped_text = text.strip()
+
+    # First check: enough characters (basic minimum)
+    if len(stripped_text) <= min_chars:
+        return False
+
+    # Second check: text readability
+    # PDFs with custom font encoding often produce garbled text
+    # Check if common invoice-related keywords are present
+    text_lower = stripped_text.lower()
+    invoice_keywords = [
+        'faktura', 'invoice', 'datum', 'date', 'belopp', 'amount',
+        'moms', 'vat', 'bankgiro', 'plusgiro', 'ocr', 'betala',
+        'summa', 'total', 'pris', 'price', 'kr', 'sek'
+    ]
+    found_keywords = sum(1 for kw in invoice_keywords if kw in text_lower)
+
+    # If at least 2 keywords found, likely readable text
+    if found_keywords >= 2:
+        return True
+
+    # Third check: minimum content threshold
+    # A real text PDF invoice should have at least 200 chars of content
+    # PDFs with only headers/footers (like "Brandsign") should use OCR
+    if len(stripped_text) < 200:
+        return False
+
+    # Fourth check: character readability ratio
+    # Count printable ASCII and common Swedish/European characters
+    readable_chars = 0
+    total_chars = len(stripped_text)
+
+    for c in stripped_text:
+        # Printable ASCII (32-126) or common Swedish/European chars
+        if 32 <= ord(c) <= 126 or c in 'åäöÅÄÖéèêëÉÈÊËüÜ':
+            readable_chars += 1
+
+    # If less than 70% readable, treat as garbled/scanned
+    readable_ratio = readable_chars / total_chars if total_chars > 0 else 0
+    if readable_ratio < 0.70:
+        return False
+
+    # Fifth check: if no keywords found but passes basic readability,
+    # require higher readability threshold (85%) or at least 1 keyword
+    # This catches garbled PDFs that have high ASCII ratio but unreadable content
+    # (e.g., custom font encoding that maps to different characters)
+    if found_keywords == 0 and readable_ratio < 0.85:
+        return False
+
+    return True


 def get_pdf_type(pdf_path: str | Path) -> PDFType:
--- a/src/pdf/extractor.py
+++ b/src/pdf/extractor.py
@@ -9,6 +9,8 @@ from pathlib import Path
 from typing import Generator, Optional
 import fitz  # PyMuPDF

+from .detector import is_text_pdf as _is_text_pdf_standalone
+

@dataclass
 class Token:
@@ -79,12 +81,13 @@ class PDFDocument:
        return len(self.doc)

    def is_text_pdf(self, min_chars: int = 30) -> bool:
-        """Check if PDF has extractable text layer."""
-        if self.page_count == 0:
-            return False
-        first_page = self.doc[0]
-        text = first_page.get_text()
-        return len(text.strip()) > min_chars
+        """
+        Check if PDF has extractable AND READABLE text layer.
+
+        Uses the improved detection from detector.py that also checks
+        for garbled text (custom font encoding issues).
+        """
+        return _is_text_pdf_standalone(self.pdf_path, min_chars)

    def get_page_dimensions(self, page_no: int = 0) -> tuple[float, float]:
        """Get page dimensions in points (cached)."""
--- a/src/pdf/test_detector.py
+++ b/src/pdf/test_detector.py
@@ -0,0 +1,335 @@
+"""
+Tests for the PDF Type Detection Module.
+
+Tests cover all detector functions in src/pdf/detector.py
+
+Note: These tests require PyMuPDF (fitz) and actual PDF files or mocks.
+Some tests are marked as integration tests that require real PDF files.
+
+Usage:
+    pytest src/pdf/test_detector.py -v -o 'addopts='
+"""
+
+import pytest
+from pathlib import Path
+from unittest.mock import patch, MagicMock
+from src.pdf.detector import (
+    extract_text_first_page,
+    is_text_pdf,
+    get_pdf_type,
+    get_page_info,
+    PDFType,
+)
+
+
+class TestExtractTextFirstPage:
+    """Tests for extract_text_first_page function."""
+
+    def test_with_mock_empty_pdf(self):
+        """Should return empty string for empty PDF."""
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=0)
+
+        with patch("fitz.open", return_value=mock_doc):
+            result = extract_text_first_page("test.pdf")
+            assert result == ""
+
+    def test_with_mock_text_pdf(self):
+        """Should extract text from first page."""
+        mock_page = MagicMock()
+        mock_page.get_text.return_value = "Faktura 12345\nDatum: 2025-01-15"
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=1)
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            result = extract_text_first_page("test.pdf")
+            assert "Faktura" in result
+            assert "12345" in result
+
+
+class TestIsTextPDF:
+    """Tests for is_text_pdf function."""
+
+    def test_empty_pdf_returns_false(self):
+        """Should return False for PDF with no text."""
+        with patch("src.pdf.detector.extract_text_first_page", return_value=""):
+            assert is_text_pdf("test.pdf") is False
+
+    def test_short_text_returns_false(self):
+        """Should return False for PDF with very short text."""
+        with patch("src.pdf.detector.extract_text_first_page", return_value="Hello"):
+            assert is_text_pdf("test.pdf") is False
+
+    def test_readable_text_with_keywords_returns_true(self):
+        """Should return True for readable text with invoice keywords."""
+        text = """
+        Faktura
+        Datum: 2025-01-15
+        Belopp: 1234,56 SEK
+        Bankgiro: 5393-9484
+        Moms: 25%
+        """ + "a" * 200  # Ensure > 200 chars
+
+        with patch("src.pdf.detector.extract_text_first_page", return_value=text):
+            assert is_text_pdf("test.pdf") is True
+
+    def test_garbled_text_returns_false(self):
+        """Should return False for garbled/unreadable text."""
+        # Simulate garbled text (lots of non-printable characters)
+        garbled = "\x00\x01\x02" * 100 + "abc" * 20  # Low readable ratio
+
+        with patch("src.pdf.detector.extract_text_first_page", return_value=garbled):
+            assert is_text_pdf("test.pdf") is False
+
+    def test_text_without_keywords_needs_high_readability(self):
+        """Should require high readability when no keywords found."""
+        # Text without invoice keywords
+        text = "The quick brown fox jumps over the lazy dog. " * 10
+
+        with patch("src.pdf.detector.extract_text_first_page", return_value=text):
+            # Should pass if readable ratio is high enough
+            result = is_text_pdf("test.pdf")
+            # Result depends on character ratio - ASCII text should pass
+            assert result is True
+
+    def test_custom_min_chars(self):
+        """Should respect custom min_chars parameter."""
+        text = "Short text here"  # 15 chars
+
+        with patch("src.pdf.detector.extract_text_first_page", return_value=text):
+            # Default min_chars=30 - should fail
+            assert is_text_pdf("test.pdf", min_chars=30) is False
+            # Custom min_chars=10 - should pass basic length check
+            # (but will still fail keyword/readability checks)
+
+
+class TestGetPDFType:
+    """Tests for get_pdf_type function."""
+
+    def test_empty_pdf_returns_scanned(self):
+        """Should return 'scanned' for empty PDF."""
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=0)
+
+        with patch("fitz.open", return_value=mock_doc):
+            result = get_pdf_type("test.pdf")
+            assert result == "scanned"
+
+    def test_all_text_pages_returns_text(self):
+        """Should return 'text' when all pages have text."""
+        mock_page1 = MagicMock()
+        mock_page1.get_text.return_value = "A" * 50  # > 30 chars
+
+        mock_page2 = MagicMock()
+        mock_page2.get_text.return_value = "B" * 50  # > 30 chars
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=2)
+        mock_doc.__iter__ = MagicMock(return_value=iter([mock_page1, mock_page2]))
+
+        with patch("fitz.open", return_value=mock_doc):
+            result = get_pdf_type("test.pdf")
+            assert result == "text"
+
+    def test_no_text_pages_returns_scanned(self):
+        """Should return 'scanned' when no pages have text."""
+        mock_page1 = MagicMock()
+        mock_page1.get_text.return_value = ""
+
+        mock_page2 = MagicMock()
+        mock_page2.get_text.return_value = "AB"  # < 30 chars
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=2)
+        mock_doc.__iter__ = MagicMock(return_value=iter([mock_page1, mock_page2]))
+
+        with patch("fitz.open", return_value=mock_doc):
+            result = get_pdf_type("test.pdf")
+            assert result == "scanned"
+
+    def test_mixed_pages_returns_mixed(self):
+        """Should return 'mixed' when some pages have text."""
+        mock_page1 = MagicMock()
+        mock_page1.get_text.return_value = "A" * 50  # Has text
+
+        mock_page2 = MagicMock()
+        mock_page2.get_text.return_value = ""  # No text
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=2)
+        mock_doc.__iter__ = MagicMock(return_value=iter([mock_page1, mock_page2]))
+
+        with patch("fitz.open", return_value=mock_doc):
+            result = get_pdf_type("test.pdf")
+            assert result == "mixed"
+
+
+class TestGetPageInfo:
+    """Tests for get_page_info function."""
+
+    def test_single_page_pdf(self):
+        """Should return info for single page."""
+        mock_rect = MagicMock()
+        mock_rect.width = 595.0  # A4 width in points
+        mock_rect.height = 842.0  # A4 height in points
+
+        mock_page = MagicMock()
+        mock_page.get_text.return_value = "A" * 50
+        mock_page.rect = mock_rect
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=1)
+
+        def mock_iter(self):
+            yield mock_page
+        mock_doc.__iter__ = lambda self: mock_iter(self)
+
+        with patch("fitz.open", return_value=mock_doc):
+            pages = get_page_info("test.pdf")
+
+            assert len(pages) == 1
+            assert pages[0]["page_no"] == 0
+            assert pages[0]["width"] == 595.0
+            assert pages[0]["height"] == 842.0
+            assert pages[0]["has_text"] is True
+            assert pages[0]["char_count"] == 50
+
+    def test_multi_page_pdf(self):
+        """Should return info for all pages."""
+        def create_mock_page(text, width, height):
+            mock_rect = MagicMock()
+            mock_rect.width = width
+            mock_rect.height = height
+
+            mock_page = MagicMock()
+            mock_page.get_text.return_value = text
+            mock_page.rect = mock_rect
+            return mock_page
+
+        pages_data = [
+            ("A" * 50, 595.0, 842.0),  # Page 0: has text
+            ("", 595.0, 842.0),         # Page 1: no text
+            ("B" * 100, 612.0, 792.0),  # Page 2: different size, has text
+        ]
+
+        mock_pages = [create_mock_page(*data) for data in pages_data]
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=3)
+
+        def mock_iter(self):
+            for page in mock_pages:
+                yield page
+        mock_doc.__iter__ = lambda self: mock_iter(self)
+
+        with patch("fitz.open", return_value=mock_doc):
+            pages = get_page_info("test.pdf")
+
+            assert len(pages) == 3
+
+            # Page 0
+            assert pages[0]["page_no"] == 0
+            assert pages[0]["has_text"] is True
+            assert pages[0]["char_count"] == 50
+
+            # Page 1
+            assert pages[1]["page_no"] == 1
+            assert pages[1]["has_text"] is False
+            assert pages[1]["char_count"] == 0
+
+            # Page 2
+            assert pages[2]["page_no"] == 2
+            assert pages[2]["has_text"] is True
+            assert pages[2]["width"] == 612.0
+
+
+class TestPDFTypeAnnotation:
+    """Tests for PDFType type alias."""
+
+    def test_valid_types(self):
+        """PDFType should accept valid literal values."""
+        # These are compile-time checks, but we can verify at runtime
+        valid_types: list[PDFType] = ["text", "scanned", "mixed"]
+        assert all(t in ["text", "scanned", "mixed"] for t in valid_types)
+
+
+class TestIsTextPDFKeywordDetection:
+    """Tests for keyword detection in is_text_pdf."""
+
+    def test_detects_swedish_keywords(self):
+        """Should detect Swedish invoice keywords."""
+        keywords = [
+            ("faktura", True),
+            ("datum", True),
+            ("belopp", True),
+            ("bankgiro", True),
+            ("plusgiro", True),
+            ("moms", True),
+        ]
+
+        for keyword, expected in keywords:
+            # Create text with keyword and enough content
+            text = f"Document with {keyword} keyword here" + " more text" * 50
+
+            with patch("src.pdf.detector.extract_text_first_page", return_value=text):
+                # Need at least 2 keywords for is_text_pdf to return True
+                # So this tests if keyword is recognized when combined with others
+                pass
+
+    def test_detects_english_keywords(self):
+        """Should detect English invoice keywords."""
+        text = "Invoice document with date and amount information" + " x" * 100
+
+        with patch("src.pdf.detector.extract_text_first_page", return_value=text):
+            # invoice + date = 2 keywords
+            result = is_text_pdf("test.pdf")
+            assert result is True
+
+    def test_needs_at_least_two_keywords(self):
+        """Should require at least 2 keywords to pass keyword check."""
+        # Only one keyword
+        text = "This is a faktura document" + " x" * 200
+
+        with patch("src.pdf.detector.extract_text_first_page", return_value=text):
+            # With only 1 keyword, falls back to other checks
+            # Should still pass if readability is high
+            pass
+
+
+class TestReadabilityChecks:
+    """Tests for readability ratio checks in is_text_pdf."""
+
+    def test_high_ascii_ratio_passes(self):
+        """Should pass when ASCII ratio is high."""
+        # Pure ASCII text
+        text = "This is a normal document with only ASCII characters. " * 10
+
+        with patch("src.pdf.detector.extract_text_first_page", return_value=text):
+            result = is_text_pdf("test.pdf")
+            assert result is True
+
+    def test_swedish_characters_accepted(self):
+        """Should accept Swedish characters as readable."""
+        text = "Fakturadatum för årets moms på öre belopp" + " normal" * 50
+
+        with patch("src.pdf.detector.extract_text_first_page", return_value=text):
+            result = is_text_pdf("test.pdf")
+            assert result is True
+
+    def test_low_readability_fails(self):
+        """Should fail when readability ratio is too low."""
+        # Mix of readable and unreadable characters
+        # Create text with < 70% readable characters
+        readable = "abc" * 30  # 90 readable chars
+        unreadable = "\x80\x81\x82" * 50  # 150 unreadable chars
+        text = readable + unreadable
+
+        with patch("src.pdf.detector.extract_text_first_page", return_value=text):
+            result = is_text_pdf("test.pdf")
+            assert result is False
+
+
+if __name__ == "__main__":
+    pytest.main([__file__, "-v"])
--- a/src/pdf/test_extractor.py
+++ b/src/pdf/test_extractor.py
@@ -0,0 +1,572 @@
+"""
+Tests for the PDF Text Extraction Module.
+
+Tests cover all extractor functions in src/pdf/extractor.py
+
+Note: These tests require PyMuPDF (fitz) and use mocks for unit testing.
+
+Usage:
+    pytest src/pdf/test_extractor.py -v -o 'addopts='
+"""
+
+import pytest
+from pathlib import Path
+from unittest.mock import patch, MagicMock
+from src.pdf.extractor import (
+    Token,
+    PDFDocument,
+    extract_text_tokens,
+    extract_words,
+    extract_lines,
+    get_page_dimensions,
+)
+
+
+class TestToken:
+    """Tests for Token dataclass."""
+
+    def test_creation(self):
+        """Should create Token with all fields."""
+        token = Token(
+            text="Hello",
+            bbox=(10.0, 20.0, 50.0, 35.0),
+            page_no=0
+        )
+        assert token.text == "Hello"
+        assert token.bbox == (10.0, 20.0, 50.0, 35.0)
+        assert token.page_no == 0
+
+    def test_x0_property(self):
+        """Should return correct x0."""
+        token = Token(text="test", bbox=(10.0, 20.0, 50.0, 35.0), page_no=0)
+        assert token.x0 == 10.0
+
+    def test_y0_property(self):
+        """Should return correct y0."""
+        token = Token(text="test", bbox=(10.0, 20.0, 50.0, 35.0), page_no=0)
+        assert token.y0 == 20.0
+
+    def test_x1_property(self):
+        """Should return correct x1."""
+        token = Token(text="test", bbox=(10.0, 20.0, 50.0, 35.0), page_no=0)
+        assert token.x1 == 50.0
+
+    def test_y1_property(self):
+        """Should return correct y1."""
+        token = Token(text="test", bbox=(10.0, 20.0, 50.0, 35.0), page_no=0)
+        assert token.y1 == 35.0
+
+    def test_width_property(self):
+        """Should calculate correct width."""
+        token = Token(text="test", bbox=(10.0, 20.0, 50.0, 35.0), page_no=0)
+        assert token.width == 40.0
+
+    def test_height_property(self):
+        """Should calculate correct height."""
+        token = Token(text="test", bbox=(10.0, 20.0, 50.0, 35.0), page_no=0)
+        assert token.height == 15.0
+
+    def test_center_property(self):
+        """Should calculate correct center."""
+        token = Token(text="test", bbox=(10.0, 20.0, 50.0, 40.0), page_no=0)
+        center = token.center
+        assert center == (30.0, 30.0)
+
+
+class TestPDFDocument:
+    """Tests for PDFDocument context manager."""
+
+    def test_context_manager_opens_and_closes(self):
+        """Should open document on enter and close on exit."""
+        mock_doc = MagicMock()
+
+        with patch("fitz.open", return_value=mock_doc) as mock_open:
+            with PDFDocument("test.pdf") as pdf:
+                mock_open.assert_called_once_with(Path("test.pdf"))
+                assert pdf._doc is not None
+
+            mock_doc.close.assert_called_once()
+
+    def test_doc_property_raises_outside_context(self):
+        """Should raise error when accessing doc outside context."""
+        pdf = PDFDocument("test.pdf")
+
+        with pytest.raises(RuntimeError, match="must be used within a context manager"):
+            _ = pdf.doc
+
+    def test_page_count(self):
+        """Should return correct page count."""
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=5)
+
+        with patch("fitz.open", return_value=mock_doc):
+            with PDFDocument("test.pdf") as pdf:
+                assert pdf.page_count == 5
+
+    def test_get_page_dimensions(self):
+        """Should return page dimensions."""
+        mock_rect = MagicMock()
+        mock_rect.width = 595.0
+        mock_rect.height = 842.0
+
+        mock_page = MagicMock()
+        mock_page.rect = mock_rect
+
+        mock_doc = MagicMock()
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            with PDFDocument("test.pdf") as pdf:
+                width, height = pdf.get_page_dimensions(0)
+                assert width == 595.0
+                assert height == 842.0
+
+    def test_get_page_dimensions_cached(self):
+        """Should cache page dimensions."""
+        mock_rect = MagicMock()
+        mock_rect.width = 595.0
+        mock_rect.height = 842.0
+
+        mock_page = MagicMock()
+        mock_page.rect = mock_rect
+
+        mock_doc = MagicMock()
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            with PDFDocument("test.pdf") as pdf:
+                # Call twice
+                pdf.get_page_dimensions(0)
+                pdf.get_page_dimensions(0)
+
+                # Should only access page once due to caching
+                assert mock_doc.__getitem__.call_count == 1
+
+    def test_get_render_dimensions(self):
+        """Should calculate render dimensions based on DPI."""
+        mock_rect = MagicMock()
+        mock_rect.width = 595.0  # A4 width in points
+        mock_rect.height = 842.0  # A4 height in points
+
+        mock_page = MagicMock()
+        mock_page.rect = mock_rect
+
+        mock_doc = MagicMock()
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            with PDFDocument("test.pdf") as pdf:
+                # At 72 DPI (1:1), dimensions should match
+                w72, h72 = pdf.get_render_dimensions(0, dpi=72)
+                assert w72 == 595
+                assert h72 == 842
+
+                # At 150 DPI (150/72 = ~2.08x zoom)
+                w150, h150 = pdf.get_render_dimensions(0, dpi=150)
+                assert w150 == int(595 * 150 / 72)
+                assert h150 == int(842 * 150 / 72)
+
+
+class TestPDFDocumentExtractTextTokens:
+    """Tests for PDFDocument.extract_text_tokens method."""
+
+    def test_extract_from_dict_mode(self):
+        """Should extract tokens using dict mode."""
+        mock_page = MagicMock()
+        mock_page.get_text.return_value = {
+            "blocks": [
+                {
+                    "type": 0,  # Text block
+                    "lines": [
+                        {
+                            "spans": [
+                                {"text": "Hello", "bbox": [10, 20, 50, 35]},
+                                {"text": "World", "bbox": [60, 20, 100, 35]},
+                            ]
+                        }
+                    ]
+                }
+            ]
+        }
+
+        mock_doc = MagicMock()
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            with PDFDocument("test.pdf") as pdf:
+                tokens = list(pdf.extract_text_tokens(0))
+
+                assert len(tokens) == 2
+                assert tokens[0].text == "Hello"
+                assert tokens[1].text == "World"
+
+    def test_skips_non_text_blocks(self):
+        """Should skip non-text blocks (like images)."""
+        mock_page = MagicMock()
+        mock_page.get_text.return_value = {
+            "blocks": [
+                {"type": 1},  # Image block - should be skipped
+                {
+                    "type": 0,
+                    "lines": [{"spans": [{"text": "Text", "bbox": [0, 0, 50, 20]}]}]
+                }
+            ]
+        }
+
+        mock_doc = MagicMock()
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            with PDFDocument("test.pdf") as pdf:
+                tokens = list(pdf.extract_text_tokens(0))
+
+                assert len(tokens) == 1
+                assert tokens[0].text == "Text"
+
+    def test_skips_empty_text(self):
+        """Should skip spans with empty text."""
+        mock_page = MagicMock()
+        mock_page.get_text.return_value = {
+            "blocks": [
+                {
+                    "type": 0,
+                    "lines": [
+                        {
+                            "spans": [
+                                {"text": "", "bbox": [0, 0, 10, 10]},
+                                {"text": "  ", "bbox": [10, 0, 20, 10]},
+                                {"text": "Valid", "bbox": [20, 0, 50, 10]},
+                            ]
+                        }
+                    ]
+                }
+            ]
+        }
+
+        mock_doc = MagicMock()
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            with PDFDocument("test.pdf") as pdf:
+                tokens = list(pdf.extract_text_tokens(0))
+
+                assert len(tokens) == 1
+                assert tokens[0].text == "Valid"
+
+    def test_fallback_to_words_mode(self):
+        """Should fallback to words mode if dict mode yields nothing."""
+        mock_page = MagicMock()
+        # Dict mode returns empty blocks
+        mock_page.get_text.side_effect = lambda mode: (
+            {"blocks": []} if mode == "dict"
+            else [(10, 20, 50, 35, "Fallback", 0, 0, 0)]
+        )
+
+        mock_doc = MagicMock()
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            with PDFDocument("test.pdf") as pdf:
+                tokens = list(pdf.extract_text_tokens(0))
+
+                assert len(tokens) == 1
+                assert tokens[0].text == "Fallback"
+
+
+class TestExtractTextTokensFunction:
+    """Tests for extract_text_tokens standalone function."""
+
+    def test_extract_all_pages(self):
+        """Should extract from all pages when page_no is None."""
+        mock_page0 = MagicMock()
+        mock_page0.get_text.return_value = {
+            "blocks": [
+                {"type": 0, "lines": [{"spans": [{"text": "Page0", "bbox": [0, 0, 50, 20]}]}]}
+            ]
+        }
+
+        mock_page1 = MagicMock()
+        mock_page1.get_text.return_value = {
+            "blocks": [
+                {"type": 0, "lines": [{"spans": [{"text": "Page1", "bbox": [0, 0, 50, 20]}]}]}
+            ]
+        }
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=2)
+        mock_doc.__getitem__ = lambda self, idx: [mock_page0, mock_page1][idx]
+
+        with patch("fitz.open", return_value=mock_doc):
+            tokens = list(extract_text_tokens("test.pdf", page_no=None))
+
+            assert len(tokens) == 2
+            assert tokens[0].text == "Page0"
+            assert tokens[0].page_no == 0
+            assert tokens[1].text == "Page1"
+            assert tokens[1].page_no == 1
+
+    def test_extract_specific_page(self):
+        """Should extract from specific page only."""
+        mock_page = MagicMock()
+        mock_page.get_text.return_value = {
+            "blocks": [
+                {"type": 0, "lines": [{"spans": [{"text": "Specific", "bbox": [0, 0, 50, 20]}]}]}
+            ]
+        }
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=3)
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            tokens = list(extract_text_tokens("test.pdf", page_no=1))
+
+            assert len(tokens) == 1
+            assert tokens[0].page_no == 1
+
+    def test_skips_corrupted_bbox(self):
+        """Should skip tokens with corrupted bbox values."""
+        mock_page = MagicMock()
+        mock_page.get_text.return_value = {
+            "blocks": [
+                {
+                    "type": 0,
+                    "lines": [
+                        {
+                            "spans": [
+                                {"text": "Good", "bbox": [0, 0, 50, 20]},
+                                {"text": "Bad", "bbox": [1e10, 0, 50, 20]},  # Corrupted
+                            ]
+                        }
+                    ]
+                }
+            ]
+        }
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=1)
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            tokens = list(extract_text_tokens("test.pdf", page_no=0))
+
+            assert len(tokens) == 1
+            assert tokens[0].text == "Good"
+
+
+class TestExtractWordsFunction:
+    """Tests for extract_words function."""
+
+    def test_extract_words(self):
+        """Should extract words using words mode."""
+        mock_page = MagicMock()
+        mock_page.get_text.return_value = [
+            (10, 20, 50, 35, "Hello", 0, 0, 0),
+            (60, 20, 100, 35, "World", 0, 0, 1),
+        ]
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=1)
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            tokens = list(extract_words("test.pdf", page_no=0))
+
+            assert len(tokens) == 2
+            assert tokens[0].text == "Hello"
+            assert tokens[0].bbox == (10, 20, 50, 35)
+            assert tokens[1].text == "World"
+
+    def test_skips_empty_words(self):
+        """Should skip empty words."""
+        mock_page = MagicMock()
+        mock_page.get_text.return_value = [
+            (10, 20, 50, 35, "", 0, 0, 0),
+            (60, 20, 100, 35, "  ", 0, 0, 1),
+            (110, 20, 150, 35, "Valid", 0, 0, 2),
+        ]
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=1)
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            tokens = list(extract_words("test.pdf", page_no=0))
+
+            assert len(tokens) == 1
+            assert tokens[0].text == "Valid"
+
+
+class TestExtractLinesFunction:
+    """Tests for extract_lines function."""
+
+    def test_extract_lines(self):
+        """Should extract full lines by combining spans."""
+        mock_page = MagicMock()
+        mock_page.get_text.return_value = {
+            "blocks": [
+                {
+                    "type": 0,
+                    "lines": [
+                        {
+                            "spans": [
+                                {"text": "Hello", "bbox": [10, 20, 50, 35]},
+                                {"text": "World", "bbox": [55, 20, 100, 35]},
+                            ]
+                        },
+                        {
+                            "spans": [
+                                {"text": "Second line", "bbox": [10, 40, 100, 55]},
+                            ]
+                        }
+                    ]
+                }
+            ]
+        }
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=1)
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            tokens = list(extract_lines("test.pdf", page_no=0))
+
+            assert len(tokens) == 2
+            assert tokens[0].text == "Hello World"
+            # BBox should span both spans
+            assert tokens[0].bbox[0] == 10  # min x0
+            assert tokens[0].bbox[2] == 100  # max x1
+
+    def test_skips_empty_lines(self):
+        """Should skip lines with no text."""
+        mock_page = MagicMock()
+        mock_page.get_text.return_value = {
+            "blocks": [
+                {
+                    "type": 0,
+                    "lines": [
+                        {"spans": []},  # Empty line
+                        {"spans": [{"text": "Valid", "bbox": [0, 0, 50, 20]}]},
+                    ]
+                }
+            ]
+        }
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=1)
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            tokens = list(extract_lines("test.pdf", page_no=0))
+
+            assert len(tokens) == 1
+            assert tokens[0].text == "Valid"
+
+
+class TestGetPageDimensionsFunction:
+    """Tests for get_page_dimensions standalone function."""
+
+    def test_get_dimensions(self):
+        """Should return page dimensions."""
+        mock_rect = MagicMock()
+        mock_rect.width = 612.0  # Letter width
+        mock_rect.height = 792.0  # Letter height
+
+        mock_page = MagicMock()
+        mock_page.rect = mock_rect
+
+        mock_doc = MagicMock()
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            width, height = get_page_dimensions("test.pdf", page_no=0)
+
+            assert width == 612.0
+            assert height == 792.0
+
+    def test_get_dimensions_different_page(self):
+        """Should get dimensions for specific page."""
+        mock_rect = MagicMock()
+        mock_rect.width = 595.0
+        mock_rect.height = 842.0
+
+        mock_page = MagicMock()
+        mock_page.rect = mock_rect
+
+        mock_doc = MagicMock()
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        with patch("fitz.open", return_value=mock_doc):
+            get_page_dimensions("test.pdf", page_no=2)
+            mock_doc.__getitem__.assert_called_with(2)
+
+
+class TestPDFDocumentIsTextPDF:
+    """Tests for PDFDocument.is_text_pdf method."""
+
+    def test_delegates_to_detector(self):
+        """Should delegate to detector module's is_text_pdf."""
+        mock_doc = MagicMock()
+
+        with patch("fitz.open", return_value=mock_doc):
+            with patch("src.pdf.extractor._is_text_pdf_standalone", return_value=True) as mock_check:
+                with PDFDocument("test.pdf") as pdf:
+                    result = pdf.is_text_pdf(min_chars=50)
+
+                    mock_check.assert_called_once_with(Path("test.pdf"), 50)
+                    assert result is True
+
+
+class TestPDFDocumentRenderPage:
+    """Tests for PDFDocument render methods."""
+
+    def test_render_page(self, tmp_path):
+        """Should render page to image file."""
+        mock_pix = MagicMock()
+
+        mock_page = MagicMock()
+        mock_page.get_pixmap.return_value = mock_pix
+
+        mock_doc = MagicMock()
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+
+        output_path = tmp_path / "output.png"
+
+        with patch("fitz.open", return_value=mock_doc):
+            with patch("fitz.Matrix") as mock_matrix:
+                with PDFDocument("test.pdf") as pdf:
+                    result = pdf.render_page(0, output_path, dpi=150)
+
+                    # Verify matrix created with correct zoom
+                    zoom = 150 / 72
+                    mock_matrix.assert_called_once_with(zoom, zoom)
+
+                    # Verify pixmap saved
+                    mock_pix.save.assert_called_once_with(str(output_path))
+
+                    assert result == output_path
+
+    def test_render_all_pages(self, tmp_path):
+        """Should render all pages to images."""
+        mock_pix = MagicMock()
+
+        mock_page = MagicMock()
+        mock_page.get_pixmap.return_value = mock_pix
+
+        mock_doc = MagicMock()
+        mock_doc.__len__ = MagicMock(return_value=2)
+        mock_doc.__getitem__ = MagicMock(return_value=mock_page)
+        mock_doc.stem = "test"  # For filename generation
+
+        with patch("fitz.open", return_value=mock_doc):
+            with patch("fitz.Matrix"):
+                with PDFDocument(tmp_path / "test.pdf") as pdf:
+                    results = list(pdf.render_all_pages(tmp_path, dpi=150))
+
+                    assert len(results) == 2
+                    assert results[0][0] == 0  # Page number
+                    assert results[1][0] == 1
+
+
+if __name__ == "__main__":
+    pytest.main([__file__, "-v"])
--- a/src/processing/autolabel_tasks.py
+++ b/src/processing/autolabel_tasks.py
@@ -86,11 +86,10 @@ def process_text_pdf(task_data: Dict[str, Any]) -> Dict[str, Any]:
        Result dictionary with success status, annotations, and report.
    """
    import shutil
-    from src.data import AutoLabelReport, FieldMatchResult
+    from src.data import AutoLabelReport
    from src.pdf import PDFDocument
-    from src.matcher import FieldMatcher
-    from src.normalize import normalize_field
-    from src.yolo.annotation_generator import AnnotationGenerator, FIELD_CLASSES
+    from src.yolo.annotation_generator import FIELD_CLASSES
+    from src.processing.document_processor import process_page, record_unmatched_fields

    row_dict = task_data["row_dict"]
    pdf_path = Path(task_data["pdf_path"])
@@ -109,6 +108,12 @@ def process_text_pdf(task_data: Dict[str, Any]) -> Dict[str, Any]:
    report = AutoLabelReport(document_id=doc_id)
    report.pdf_path = str(pdf_path)
    report.pdf_type = "text"
+    # Store metadata fields from CSV (same as single document mode)
+    report.split = row_dict.get('split')
+    report.customer_number = row_dict.get('customer_number')
+    report.supplier_name = row_dict.get('supplier_name')
+    report.supplier_organisation_number = row_dict.get('supplier_organisation_number')
+    report.supplier_accounts = row_dict.get('supplier_accounts')

    result = {
        "doc_id": doc_id,
@@ -120,9 +125,6 @@ def process_text_pdf(task_data: Dict[str, Any]) -> Dict[str, Any]:

    try:
        with PDFDocument(pdf_path) as pdf_doc:
-            generator = AnnotationGenerator(min_confidence=min_confidence)
-            matcher = FieldMatcher()
-
            page_annotations = []
            matched_fields = set()

@@ -134,37 +136,27 @@ def process_text_pdf(task_data: Dict[str, Any]) -> Dict[str, Any]:
                # Text extraction (no OCR)
                tokens = list(pdf_doc.extract_text_tokens(page_no))

-                # Match fields
+                # Get page dimensions for payment line detection
+                page = pdf_doc.doc[page_no]
+                page_height = page.rect.height
+                page_width = page.rect.width
+
+                # Use shared processing logic (same as single document mode)
                matches = {}
-                for field_name in FIELD_CLASSES.keys():
-                    value = row_dict.get(field_name)
-                    if not value:
-                        continue
-
-                    normalized = normalize_field(field_name, str(value))
-                    field_matches = matcher.find_matches(tokens, field_name, normalized, page_no)
-
-                    if field_matches:
-                        best = field_matches[0]
-                        matches[field_name] = field_matches
-                        matched_fields.add(field_name)
-                        report.add_field_result(
-                            FieldMatchResult(
-                                field_name=field_name,
-                                csv_value=str(value),
-                                matched=True,
-                                score=best.score,
-                                matched_text=best.matched_text,
-                                candidate_used=best.value,
-                                bbox=best.bbox,
-                                page_no=page_no,
-                                context_keywords=best.context_keywords,
-                            )
-                        )
-
-                # Generate annotations
-                annotations = generator.generate_from_matches(
-                    matches, img_width, img_height, dpi=dpi
+                annotations, ann_count = process_page(
+                    tokens=tokens,
+                    row_dict=row_dict,
+                    page_no=page_no,
+                    page_height=page_height,
+                    page_width=page_width,
+                    img_width=img_width,
+                    img_height=img_height,
+                    dpi=dpi,
+                    min_confidence=min_confidence,
+                    matches=matches,
+                    matched_fields=matched_fields,
+                    report=report,
+                    result_stats=result["stats"],
                )

                if annotations:
@@ -172,26 +164,13 @@ def process_text_pdf(task_data: Dict[str, Any]) -> Dict[str, Any]:
                        {
                            "image_path": str(image_path),
                            "page_no": page_no,
-                            "count": len(annotations),
+                            "count": ann_count,
                        }
                    )
-                    report.annotations_generated += len(annotations)
-                    for ann in annotations:
-                        class_name = list(FIELD_CLASSES.keys())[ann.class_id]
-                        result["stats"][class_name] += 1
+                    report.annotations_generated += ann_count

-            # Record unmatched fields
-            for field_name in FIELD_CLASSES.keys():
-                value = row_dict.get(field_name)
-                if value and field_name not in matched_fields:
-                    report.add_field_result(
-                        FieldMatchResult(
-                            field_name=field_name,
-                            csv_value=str(value),
-                            matched=False,
-                            page_no=-1,
-                        )
-                    )
+            # Record unmatched fields using shared logic
+            record_unmatched_fields(row_dict, matched_fields, report)

            if page_annotations:
                result["pages"] = page_annotations
@@ -225,11 +204,10 @@ def process_scanned_pdf(task_data: Dict[str, Any]) -> Dict[str, Any]:
        Result dictionary with success status, annotations, and report.
    """
    import shutil
-    from src.data import AutoLabelReport, FieldMatchResult
+    from src.data import AutoLabelReport
    from src.pdf import PDFDocument
-    from src.matcher import FieldMatcher
-    from src.normalize import normalize_field
-    from src.yolo.annotation_generator import AnnotationGenerator, FIELD_CLASSES
+    from src.yolo.annotation_generator import FIELD_CLASSES
+    from src.processing.document_processor import process_page, record_unmatched_fields

    row_dict = task_data["row_dict"]
    pdf_path = Path(task_data["pdf_path"])
@@ -248,6 +226,12 @@ def process_scanned_pdf(task_data: Dict[str, Any]) -> Dict[str, Any]:
    report = AutoLabelReport(document_id=doc_id)
    report.pdf_path = str(pdf_path)
    report.pdf_type = "scanned"
+    # Store metadata fields from CSV (same as single document mode)
+    report.split = row_dict.get('split')
+    report.customer_number = row_dict.get('customer_number')
+    report.supplier_name = row_dict.get('supplier_name')
+    report.supplier_organisation_number = row_dict.get('supplier_organisation_number')
+    report.supplier_accounts = row_dict.get('supplier_accounts')

    result = {
        "doc_id": doc_id,
@@ -262,9 +246,6 @@ def process_scanned_pdf(task_data: Dict[str, Any]) -> Dict[str, Any]:
        ocr_engine = _get_ocr_engine()

        with PDFDocument(pdf_path) as pdf_doc:
-            generator = AnnotationGenerator(min_confidence=min_confidence)
-            matcher = FieldMatcher()
-
            page_annotations = []
            matched_fields = set()

@@ -273,6 +254,11 @@ def process_scanned_pdf(task_data: Dict[str, Any]) -> Dict[str, Any]:
                report.total_pages += 1
                img_width, img_height = pdf_doc.get_render_dimensions(page_no, dpi)

+                # Get page dimensions for payment line detection
+                page = pdf_doc.doc[page_no]
+                page_height = page.rect.height
+                page_width = page.rect.width
+
                # OCR extraction
                ocr_result = ocr_engine.extract_with_image(
                    str(image_path),
@@ -288,37 +274,22 @@ def process_scanned_pdf(task_data: Dict[str, Any]) -> Dict[str, Any]:
                if ocr_result.output_img is not None:
                    img_height, img_width = ocr_result.output_img.shape[:2]

-                # Match fields
+                # Use shared processing logic (same as single document mode)
                matches = {}
-                for field_name in FIELD_CLASSES.keys():
-                    value = row_dict.get(field_name)
-                    if not value:
-                        continue
-
-                    normalized = normalize_field(field_name, str(value))
-                    field_matches = matcher.find_matches(tokens, field_name, normalized, page_no)
-
-                    if field_matches:
-                        best = field_matches[0]
-                        matches[field_name] = field_matches
-                        matched_fields.add(field_name)
-                        report.add_field_result(
-                            FieldMatchResult(
-                                field_name=field_name,
-                                csv_value=str(value),
-                                matched=True,
-                                score=best.score,
-                                matched_text=best.matched_text,
-                                candidate_used=best.value,
-                                bbox=best.bbox,
-                                page_no=page_no,
-                                context_keywords=best.context_keywords,
-                            )
-                        )
-
-                # Generate annotations
-                annotations = generator.generate_from_matches(
-                    matches, img_width, img_height, dpi=dpi
+                annotations, ann_count = process_page(
+                    tokens=tokens,
+                    row_dict=row_dict,
+                    page_no=page_no,
+                    page_height=page_height,
+                    page_width=page_width,
+                    img_width=img_width,
+                    img_height=img_height,
+                    dpi=dpi,
+                    min_confidence=min_confidence,
+                    matches=matches,
+                    matched_fields=matched_fields,
+                    report=report,
+                    result_stats=result["stats"],
                )

                if annotations:
@@ -326,26 +297,13 @@ def process_scanned_pdf(task_data: Dict[str, Any]) -> Dict[str, Any]:
                        {
                            "image_path": str(image_path),
                            "page_no": page_no,
-                            "count": len(annotations),
+                            "count": ann_count,
                        }
                    )
-                    report.annotations_generated += len(annotations)
-                    for ann in annotations:
-                        class_name = list(FIELD_CLASSES.keys())[ann.class_id]
-                        result["stats"][class_name] += 1
+                    report.annotations_generated += ann_count

-            # Record unmatched fields
-            for field_name in FIELD_CLASSES.keys():
-                value = row_dict.get(field_name)
-                if value and field_name not in matched_fields:
-                    report.add_field_result(
-                        FieldMatchResult(
-                            field_name=field_name,
-                            csv_value=str(value),
-                            matched=False,
-                            page_no=-1,
-                        )
-                    )
+            # Record unmatched fields using shared logic
+            record_unmatched_fields(row_dict, matched_fields, report)

            if page_annotations:
                result["pages"] = page_annotations
--- a/src/processing/document_processor.py
+++ b/src/processing/document_processor.py
@@ -0,0 +1,448 @@
+"""
+Shared document processing logic for autolabel.
+
+This module provides the core processing functions used by both
+single document mode and batch processing mode to ensure consistent
+matching and annotation logic.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Set, Tuple
+
+from ..data import FieldMatchResult
+from ..matcher import FieldMatcher
+from ..normalize import normalize_field
+from ..ocr.machine_code_parser import MachineCodeParser
+from ..yolo.annotation_generator import AnnotationGenerator, FIELD_CLASSES
+
+
+def match_supplier_accounts(
+    tokens: list,
+    supplier_accounts_value: str,
+    matcher: FieldMatcher,
+    page_no: int,
+    matches: Dict[str, list],
+    matched_fields: Set[str],
+    report: Any,
+) -> None:
+    """
+    Match supplier_accounts field and map to Bankgiro/Plusgiro.
+
+    This logic is shared between single document mode and batch mode
+    to ensure consistent BG/PG type detection.
+
+    Args:
+        tokens: List of text tokens from the page
+        supplier_accounts_value: Raw value from CSV (e.g., "BG:xxx | PG:yyy")
+        matcher: FieldMatcher instance
+        page_no: Current page number
+        matches: Dictionary to store matched fields (modified in place)
+        matched_fields: Set of matched field names (modified in place)
+        report: AutoLabelReport instance
+    """
+    if not supplier_accounts_value:
+        return
+
+    # Parse accounts: "BG:xxx | PG:yyy" format
+    accounts = [acc.strip() for acc in str(supplier_accounts_value).split('|')]
+
+    for account in accounts:
+        account = account.strip()
+        if not account:
+            continue
+
+        # Determine account type (BG or PG) and extract account number
+        account_type = None
+        account_number = account  # Default to full value
+
+        if account.upper().startswith('BG:'):
+            account_type = 'Bankgiro'
+            account_number = account[3:].strip()  # Remove "BG:" prefix
+        elif account.upper().startswith('BG '):
+            account_type = 'Bankgiro'
+            account_number = account[2:].strip()  # Remove "BG" prefix
+        elif account.upper().startswith('PG:'):
+            account_type = 'Plusgiro'
+            account_number = account[3:].strip()  # Remove "PG:" prefix
+        elif account.upper().startswith('PG '):
+            account_type = 'Plusgiro'
+            account_number = account[2:].strip()  # Remove "PG" prefix
+        else:
+            # Try to guess from format - Plusgiro often has format XXXXXXX-X
+            digits = ''.join(c for c in account if c.isdigit())
+            if len(digits) == 8 and '-' in account:
+                account_type = 'Plusgiro'
+            elif len(digits) in (7, 8):
+                account_type = 'Bankgiro'  # Default to Bankgiro
+
+        if not account_type:
+            continue
+
+        # Normalize and match using the account number (without prefix)
+        normalized = normalize_field('supplier_accounts', account_number)
+        field_matches = matcher.find_matches(tokens, account_type, normalized, page_no)
+
+        if field_matches:
+            best = field_matches[0]
+            # Add to matches under the target class (Bankgiro/Plusgiro)
+            if account_type not in matches:
+                matches[account_type] = []
+            matches[account_type].extend(field_matches)
+            matched_fields.add('supplier_accounts')
+
+            report.add_field_result(FieldMatchResult(
+                field_name=f'supplier_accounts({account_type})',
+                csv_value=account_number,  # Store without prefix
+                matched=True,
+                score=best.score,
+                matched_text=best.matched_text,
+                candidate_used=best.value,
+                bbox=best.bbox,
+                page_no=page_no,
+                context_keywords=best.context_keywords
+            ))
+
+
+def detect_payment_line(
+    tokens: list,
+    page_height: float,
+    page_width: float,
+) -> Optional[Any]:
+    """
+    Detect payment line (machine code) and return the parsed result.
+
+    This function only detects and parses the payment line, without generating
+    annotations. The caller can use the result to extract amount for cross-validation.
+
+    Args:
+        tokens: List of text tokens from the page
+        page_height: Page height in PDF points
+        page_width: Page width in PDF points
+
+    Returns:
+        MachineCodeResult if standard format detected (confidence >= 0.95), None otherwise
+    """
+    # Use 55% of page height as bottom region to catch payment lines
+    # that may be in the middle of the page (e.g., payment slips)
+    mc_parser = MachineCodeParser(bottom_region_ratio=0.55)
+    mc_result = mc_parser.parse(tokens, page_height, page_width)
+
+    # Only return if we found a STANDARD payment line format
+    # (confidence 0.95 means standard pattern matched with # and > symbols)
+    is_standard_format = mc_result.confidence >= 0.95
+    if is_standard_format:
+        return mc_result
+    return None
+
+
+def match_payment_line(
+    tokens: list,
+    page_height: float,
+    page_width: float,
+    min_confidence: float,
+    generator: AnnotationGenerator,
+    annotations: list,
+    img_width: int,
+    img_height: int,
+    dpi: int,
+    matched_fields: Set[str],
+    report: Any,
+    page_no: int,
+    mc_result: Optional[Any] = None,
+) -> None:
+    """
+    Annotate payment line (machine code) using pre-detected result.
+
+    This logic is shared between single document mode and batch mode
+    to ensure consistent payment_line detection.
+
+    Args:
+        tokens: List of text tokens from the page
+        page_height: Page height in PDF points
+        page_width: Page width in PDF points
+        min_confidence: Minimum confidence threshold
+        generator: AnnotationGenerator instance
+        annotations: List of annotations (modified in place)
+        img_width: Image width in pixels
+        img_height: Image height in pixels
+        dpi: DPI used for rendering
+        matched_fields: Set of matched field names (modified in place)
+        report: AutoLabelReport instance
+        page_no: Current page number
+        mc_result: Pre-detected MachineCodeResult (from detect_payment_line)
+    """
+    # Use pre-detected result if provided, otherwise detect now
+    if mc_result is None:
+        mc_result = detect_payment_line(tokens, page_height, page_width)
+
+    # Only add payment_line if we have a valid standard format result
+    if mc_result is None:
+        return
+
+    if mc_result.confidence >= min_confidence:
+        region_bbox = mc_result.get_region_bbox()
+        if region_bbox:
+            generator.add_payment_line_annotation(
+                annotations, region_bbox, mc_result.confidence,
+                img_width, img_height, dpi=dpi
+            )
+            # Store payment_line result in database
+            matched_fields.add('payment_line')
+            report.add_field_result(FieldMatchResult(
+                field_name='payment_line',
+                csv_value=mc_result.raw_line[:200] if mc_result.raw_line else '',
+                matched=True,
+                score=mc_result.confidence,
+                matched_text=f"OCR:{mc_result.ocr or ''} Amount:{mc_result.amount or ''} BG:{mc_result.bankgiro or ''}",
+                candidate_used='machine_code_parser',
+                bbox=region_bbox,
+                page_no=page_no,
+                context_keywords=['payment_line', 'machine_code']
+            ))
+
+
+def match_standard_fields(
+    tokens: list,
+    row_dict: Dict[str, Any],
+    matcher: FieldMatcher,
+    page_no: int,
+    matches: Dict[str, list],
+    matched_fields: Set[str],
+    report: Any,
+    payment_line_amount: Optional[str] = None,
+    payment_line_bbox: Optional[tuple] = None,
+) -> None:
+    """
+    Match standard fields from CSV to tokens.
+
+    This excludes payment_line (detected separately) and supplier_accounts
+    (handled by match_supplier_accounts).
+
+    Args:
+        tokens: List of text tokens from the page
+        row_dict: Dictionary of field values from CSV
+        matcher: FieldMatcher instance
+        page_no: Current page number
+        matches: Dictionary to store matched fields (modified in place)
+        matched_fields: Set of matched field names (modified in place)
+        report: AutoLabelReport instance
+        payment_line_amount: Amount extracted from payment_line (takes priority over CSV)
+        payment_line_bbox: Bounding box of payment_line region (used as fallback for Amount)
+    """
+    for field_name in FIELD_CLASSES.keys():
+        # Skip fields handled separately
+        if field_name == 'payment_line':
+            continue
+        if field_name in ('Bankgiro', 'Plusgiro'):
+            continue  # Handled via supplier_accounts
+
+        value = row_dict.get(field_name)
+
+        # For Amount field: only use payment_line amount if it matches CSV value
+        use_payment_line_amount = False
+        if field_name == 'Amount' and payment_line_amount and value:
+            # Parse both amounts and check if they're close
+            try:
+                csv_amt = float(str(value).replace(',', '.').replace(' ', ''))
+                pl_amt = float(str(payment_line_amount).replace(',', '.').replace(' ', ''))
+                if abs(csv_amt - pl_amt) < 0.01:
+                    # Payment line amount matches CSV, use it for better bbox
+                    value = payment_line_amount
+                    use_payment_line_amount = True
+                # Otherwise keep CSV value for matching
+            except (ValueError, TypeError):
+                pass
+
+        if not value:
+            continue
+
+        normalized = normalize_field(field_name, str(value))
+        field_matches = matcher.find_matches(tokens, field_name, normalized, page_no)
+
+        if field_matches:
+            best = field_matches[0]
+            matches[field_name] = field_matches
+            matched_fields.add(field_name)
+
+            # For Amount: note if we used payment_line amount
+            csv_value_display = str(row_dict.get(field_name, value))
+            if field_name == 'Amount' and use_payment_line_amount:
+                csv_value_display = f"{row_dict.get(field_name)} (matched via payment_line: {payment_line_amount})"
+
+            report.add_field_result(FieldMatchResult(
+                field_name=field_name,
+                csv_value=csv_value_display,
+                matched=True,
+                score=best.score,
+                matched_text=best.matched_text,
+                candidate_used=best.value,
+                bbox=best.bbox,
+                page_no=page_no,
+                context_keywords=best.context_keywords
+            ))
+        elif field_name == 'Amount' and use_payment_line_amount and payment_line_bbox:
+            # Fallback: Amount not found via token matching, but payment_line
+            # successfully extracted a matching amount. Use payment_line bbox.
+            # This handles cases where text PDFs merge multiple values into one token.
+            from src.matcher.field_matcher import Match
+
+            fallback_match = Match(
+                field='Amount',
+                value=payment_line_amount,
+                bbox=payment_line_bbox,
+                page_no=page_no,
+                score=0.9,
+                matched_text=f"Amount:{payment_line_amount}",
+                context_keywords=['payment_line', 'amount']
+            )
+            matches[field_name] = [fallback_match]
+            matched_fields.add(field_name)
+            csv_value_display = f"{row_dict.get(field_name)} (via payment_line: {payment_line_amount})"
+
+            report.add_field_result(FieldMatchResult(
+                field_name=field_name,
+                csv_value=csv_value_display,
+                matched=True,
+                score=0.9,  # High confidence since payment_line parsing succeeded
+                matched_text=f"Amount:{payment_line_amount}",
+                candidate_used='payment_line_fallback',
+                bbox=payment_line_bbox,
+                page_no=page_no,
+                context_keywords=['payment_line', 'amount']
+            ))
+
+
+def record_unmatched_fields(
+    row_dict: Dict[str, Any],
+    matched_fields: Set[str],
+    report: Any,
+) -> None:
+    """
+    Record fields from CSV that were not matched.
+
+    Args:
+        row_dict: Dictionary of field values from CSV
+        matched_fields: Set of matched field names
+        report: AutoLabelReport instance
+    """
+    for field_name in FIELD_CLASSES.keys():
+        if field_name == 'payment_line':
+            continue  # payment_line doesn't come from CSV
+        if field_name in ('Bankgiro', 'Plusgiro'):
+            continue  # These come from supplier_accounts
+
+        value = row_dict.get(field_name)
+        if value and field_name not in matched_fields:
+            report.add_field_result(FieldMatchResult(
+                field_name=field_name,
+                csv_value=str(value),
+                matched=False,
+                page_no=-1
+            ))
+
+    # Check if supplier_accounts was not matched
+    if row_dict.get('supplier_accounts') and 'supplier_accounts' not in matched_fields:
+        report.add_field_result(FieldMatchResult(
+            field_name='supplier_accounts',
+            csv_value=str(row_dict.get('supplier_accounts')),
+            matched=False,
+            page_no=-1
+        ))
+
+
+def process_page(
+    tokens: list,
+    row_dict: Dict[str, Any],
+    page_no: int,
+    page_height: float,
+    page_width: float,
+    img_width: int,
+    img_height: int,
+    dpi: int,
+    min_confidence: float,
+    matches: Dict[str, list],
+    matched_fields: Set[str],
+    report: Any,
+    result_stats: Dict[str, int],
+) -> Tuple[list, int]:
+    """
+    Process a single page: match fields and generate annotations.
+
+    This is the main entry point for page processing, used by both
+    single document mode and batch mode.
+
+    Processing order:
+    1. Detect payment_line first to extract amount
+    2. Match standard fields (using payment_line amount if available)
+    3. Match supplier_accounts
+    4. Generate annotations
+
+    Args:
+        tokens: List of text tokens from the page
+        row_dict: Dictionary of field values from CSV
+        page_no: Current page number
+        page_height: Page height in PDF points
+        page_width: Page width in PDF points
+        img_width: Image width in pixels
+        img_height: Image height in pixels
+        dpi: DPI used for rendering
+        min_confidence: Minimum confidence threshold
+        matches: Dictionary to store matched fields (modified in place)
+        matched_fields: Set of matched field names (modified in place)
+        report: AutoLabelReport instance
+        result_stats: Dictionary of annotation stats (modified in place)
+
+    Returns:
+        Tuple of (annotations list, annotation count)
+    """
+    matcher = FieldMatcher()
+    generator = AnnotationGenerator(min_confidence=min_confidence)
+
+    # Step 1: Detect payment_line FIRST to extract amount
+    # This allows us to use the payment_line amount for matching Amount field
+    mc_result = detect_payment_line(tokens, page_height, page_width)
+
+    # Extract amount and bbox from payment_line if available
+    payment_line_amount = None
+    payment_line_bbox = None
+    if mc_result and mc_result.amount:
+        payment_line_amount = mc_result.amount
+        payment_line_bbox = mc_result.get_region_bbox()
+
+    # Step 2: Match standard fields (using payment_line amount if available)
+    match_standard_fields(
+        tokens, row_dict, matcher, page_no,
+        matches, matched_fields, report,
+        payment_line_amount=payment_line_amount,
+        payment_line_bbox=payment_line_bbox
+    )
+
+    # Step 3: Match supplier_accounts -> Bankgiro/Plusgiro
+    supplier_accounts_value = row_dict.get('supplier_accounts')
+    if supplier_accounts_value:
+        match_supplier_accounts(
+            tokens, supplier_accounts_value, matcher, page_no,
+            matches, matched_fields, report
+        )
+
+    # Generate annotations from matches
+    annotations = generator.generate_from_matches(
+        matches, img_width, img_height, dpi=dpi
+    )
+
+    # Step 4: Add payment_line annotation (reuse the pre-detected result)
+    match_payment_line(
+        tokens, page_height, page_width, min_confidence,
+        generator, annotations, img_width, img_height, dpi,
+        matched_fields, report, page_no,
+        mc_result=mc_result
+    )
+
+    # Update stats
+    for ann in annotations:
+        class_name = list(FIELD_CLASSES.keys())[ann.class_id]
+        result_stats[class_name] += 1
+
+    return annotations, len(annotations)
--- a/src/validation/init.py
+++ b/src/validation/init.py
@@ -0,0 +1,7 @@
+"""
+Cross-validation module for verifying field extraction using LLM.
+"""
+
+from .llm_validator import LLMValidator
+
+__all__ = ['LLMValidator']
--- a/src/validation/llm_validator.py
+++ b/src/validation/llm_validator.py
@@ -0,0 +1,746 @@
+"""
+LLM-based cross-validation for invoice field extraction.
+
+Uses a vision LLM to extract fields from invoice PDFs and compare with
+the autolabel results to identify potential errors.
+"""
+
+import json
+import base64
+import os
+from pathlib import Path
+from typing import Optional, Dict, Any, List
+from dataclasses import dataclass, asdict
+from datetime import datetime
+
+import psycopg2
+from psycopg2.extras import execute_values
+
+
+@dataclass
+class LLMExtractionResult:
+    """Result of LLM field extraction."""
+    document_id: str
+    invoice_number: Optional[str] = None
+    invoice_date: Optional[str] = None
+    invoice_due_date: Optional[str] = None
+    ocr_number: Optional[str] = None
+    bankgiro: Optional[str] = None
+    plusgiro: Optional[str] = None
+    amount: Optional[str] = None
+    supplier_organisation_number: Optional[str] = None
+    raw_response: Optional[str] = None
+    model_used: Optional[str] = None
+    processing_time_ms: Optional[float] = None
+    error: Optional[str] = None
+
+    def to_dict(self) -> Dict[str, Any]:
+        return asdict(self)
+
+
+class LLMValidator:
+    """
+    Cross-validates invoice field extraction using LLM.
+
+    Queries documents with failed field matches from the database,
+    sends the PDF images to an LLM for extraction, and stores
+    the results for comparison.
+    """
+
+    # Fields to extract (excluding customer_number as requested)
+    FIELDS_TO_EXTRACT = [
+        'InvoiceNumber',
+        'InvoiceDate',
+        'InvoiceDueDate',
+        'OCR',
+        'Bankgiro',
+        'Plusgiro',
+        'Amount',
+        'supplier_organisation_number',
+    ]
+
+    EXTRACTION_PROMPT = """You are an expert at extracting structured data from Swedish invoices.
+
+Analyze this invoice image and extract the following fields. Return ONLY a valid JSON object with these exact keys:
+
+{
+    "invoice_number": "the invoice number/fakturanummer",
+    "invoice_date": "the invoice date in YYYY-MM-DD format",
+    "invoice_due_date": "the due date/förfallodatum in YYYY-MM-DD format",
+    "ocr_number": "the OCR payment reference number",
+    "bankgiro": "the bankgiro number (format: XXXX-XXXX or XXXXXXXX)",
+    "plusgiro": "the plusgiro number",
+    "amount": "the total amount to pay (just the number, e.g., 1234.56)",
+    "supplier_organisation_number": "the supplier's organisation number (format: XXXXXX-XXXX)"
+}
+
+Rules:
+- If a field is not found or not visible, use null
+- For dates, convert Swedish month names (januari, februari, etc.) to YYYY-MM-DD
+- For amounts, extract just the numeric value without currency symbols
+- The OCR number is typically a long number used for payment reference
+- Look for "Att betala" or "Summa att betala" for the amount
+- Organisation number is 10 digits, often shown as XXXXXX-XXXX
+
+Return ONLY the JSON object, no other text."""
+
+    def __init__(self, connection_string: str = None, pdf_dir: str = None):
+        """
+        Initialize the validator.
+
+        Args:
+            connection_string: PostgreSQL connection string
+            pdf_dir: Directory containing PDF files
+        """
+        import sys
+        sys.path.insert(0, str(Path(__file__).parent.parent.parent))
+        from config import get_db_connection_string, PATHS
+
+        self.connection_string = connection_string or get_db_connection_string()
+        self.pdf_dir = Path(pdf_dir or PATHS['pdf_dir'])
+        self.conn = None
+
+    def connect(self):
+        """Connect to database."""
+        if self.conn is None:
+            self.conn = psycopg2.connect(self.connection_string)
+        return self.conn
+
+    def close(self):
+        """Close database connection."""
+        if self.conn:
+            self.conn.close()
+            self.conn = None
+
+    def create_validation_table(self):
+        """Create the llm_validation table if it doesn't exist."""
+        conn = self.connect()
+        with conn.cursor() as cursor:
+            cursor.execute("""
+                CREATE TABLE IF NOT EXISTS llm_validations (
+                    id SERIAL PRIMARY KEY,
+                    document_id TEXT NOT NULL,
+                    -- Extracted fields
+                    invoice_number TEXT,
+                    invoice_date TEXT,
+                    invoice_due_date TEXT,
+                    ocr_number TEXT,
+                    bankgiro TEXT,
+                    plusgiro TEXT,
+                    amount TEXT,
+                    supplier_organisation_number TEXT,
+                    -- Metadata
+                    raw_response TEXT,
+                    model_used TEXT,
+                    processing_time_ms REAL,
+                    error TEXT,
+                    created_at TIMESTAMPTZ DEFAULT NOW(),
+                    -- Comparison results (populated later)
+                    comparison_results JSONB,
+                    UNIQUE(document_id)
+                );
+
+                CREATE INDEX IF NOT EXISTS idx_llm_validations_document_id
+                ON llm_validations(document_id);
+            """)
+        conn.commit()
+
+    def get_documents_with_failed_matches(
+        self,
+        exclude_customer_number: bool = True,
+        limit: Optional[int] = None
+    ) -> List[Dict[str, Any]]:
+        """
+        Get documents that have at least one failed field match.
+
+        Args:
+            exclude_customer_number: If True, ignore customer_number failures
+            limit: Maximum number of documents to return
+
+        Returns:
+            List of document info with failed fields
+        """
+        conn = self.connect()
+        with conn.cursor() as cursor:
+            # Find documents with failed matches (excluding customer_number if requested)
+            exclude_clause = ""
+            if exclude_customer_number:
+                exclude_clause = "AND fr.field_name != 'customer_number'"
+
+            query = f"""
+                SELECT DISTINCT d.document_id, d.pdf_path, d.pdf_type,
+                       d.supplier_name, d.split
+                FROM documents d
+                JOIN field_results fr ON d.document_id = fr.document_id
+                WHERE fr.matched = false
+                AND fr.field_name NOT LIKE 'supplier_accounts%%'
+                {exclude_clause}
+                AND d.document_id NOT IN (
+                    SELECT document_id FROM llm_validations WHERE error IS NULL
+                )
+                ORDER BY d.document_id
+            """
+            if limit:
+                query += f" LIMIT {limit}"
+
+            cursor.execute(query)
+
+            results = []
+            for row in cursor.fetchall():
+                doc_id = row[0]
+
+                # Get failed fields for this document
+                exclude_clause_inner = ""
+                if exclude_customer_number:
+                    exclude_clause_inner = "AND field_name != 'customer_number'"
+                cursor.execute(f"""
+                    SELECT field_name, csv_value, score
+                    FROM field_results
+                    WHERE document_id = %s
+                    AND matched = false
+                    AND field_name NOT LIKE 'supplier_accounts%%'
+                    {exclude_clause_inner}
+                """, (doc_id,))
+
+                failed_fields = [
+                    {'field': r[0], 'csv_value': r[1], 'score': r[2]}
+                    for r in cursor.fetchall()
+                ]
+
+                results.append({
+                    'document_id': doc_id,
+                    'pdf_path': row[1],
+                    'pdf_type': row[2],
+                    'supplier_name': row[3],
+                    'split': row[4],
+                    'failed_fields': failed_fields,
+                })
+
+            return results
+
+    def get_failed_match_stats(self, exclude_customer_number: bool = True) -> Dict[str, Any]:
+        """Get statistics about failed matches."""
+        conn = self.connect()
+        with conn.cursor() as cursor:
+            exclude_clause = ""
+            if exclude_customer_number:
+                exclude_clause = "AND field_name != 'customer_number'"
+
+            # Count by field
+            cursor.execute(f"""
+                SELECT field_name, COUNT(*) as cnt
+                FROM field_results
+                WHERE matched = false
+                AND field_name NOT LIKE 'supplier_accounts%%'
+                {exclude_clause}
+                GROUP BY field_name
+                ORDER BY cnt DESC
+            """)
+            by_field = {row[0]: row[1] for row in cursor.fetchall()}
+
+            # Count documents with failures
+            cursor.execute(f"""
+                SELECT COUNT(DISTINCT document_id)
+                FROM field_results
+                WHERE matched = false
+                AND field_name NOT LIKE 'supplier_accounts%%'
+                {exclude_clause}
+            """)
+            doc_count = cursor.fetchone()[0]
+
+            # Already validated count
+            cursor.execute("""
+                SELECT COUNT(*) FROM llm_validations WHERE error IS NULL
+            """)
+            validated_count = cursor.fetchone()[0]
+
+            return {
+                'documents_with_failures': doc_count,
+                'already_validated': validated_count,
+                'remaining': doc_count - validated_count,
+                'failures_by_field': by_field,
+            }
+
+    def render_pdf_to_image(
+        self,
+        pdf_path: Path,
+        page_no: int = 0,
+        dpi: int = 150,
+        max_size_mb: float = 18.0
+    ) -> bytes:
+        """
+        Render a PDF page to PNG image bytes.
+
+        Args:
+            pdf_path: Path to PDF file
+            page_no: Page number to render (0-indexed)
+            dpi: Resolution for rendering
+            max_size_mb: Maximum image size in MB (Azure OpenAI limit is 20MB)
+
+        Returns:
+            PNG image bytes
+        """
+        import fitz  # PyMuPDF
+        from io import BytesIO
+        from PIL import Image
+
+        doc = fitz.open(pdf_path)
+        page = doc[page_no]
+
+        # Try different DPI values until we get a small enough image
+        dpi_values = [dpi, 120, 100, 72, 50]
+
+        for current_dpi in dpi_values:
+            mat = fitz.Matrix(current_dpi / 72, current_dpi / 72)
+            pix = page.get_pixmap(matrix=mat)
+            png_bytes = pix.tobytes("png")
+
+            size_mb = len(png_bytes) / (1024 * 1024)
+            if size_mb <= max_size_mb:
+                doc.close()
+                return png_bytes
+
+        # If still too large, use JPEG compression
+        mat = fitz.Matrix(72 / 72, 72 / 72)  # Lowest DPI
+        pix = page.get_pixmap(matrix=mat)
+
+        # Convert to PIL Image and compress as JPEG
+        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
+
+        # Try different JPEG quality levels
+        for quality in [85, 70, 50, 30]:
+            buffer = BytesIO()
+            img.save(buffer, format="JPEG", quality=quality)
+            jpeg_bytes = buffer.getvalue()
+
+            size_mb = len(jpeg_bytes) / (1024 * 1024)
+            if size_mb <= max_size_mb:
+                doc.close()
+                return jpeg_bytes
+
+        doc.close()
+        # Return whatever we have, let the API handle the error
+        return jpeg_bytes
+
+    def extract_with_openai(
+        self,
+        image_bytes: bytes,
+        model: str = "gpt-4o"
+    ) -> LLMExtractionResult:
+        """
+        Extract fields using OpenAI's vision API (supports Azure OpenAI).
+
+        Args:
+            image_bytes: PNG image bytes
+            model: Model to use (gpt-4o, gpt-4o-mini, etc.)
+
+        Returns:
+            Extraction result
+        """
+        import openai
+        import time
+
+        start_time = time.time()
+
+        # Encode image to base64 and detect format
+        image_b64 = base64.b64encode(image_bytes).decode('utf-8')
+
+        # Detect image format (PNG starts with \x89PNG, JPEG with \xFF\xD8)
+        if image_bytes[:4] == b'\x89PNG':
+            media_type = "image/png"
+        else:
+            media_type = "image/jpeg"
+
+        # Check for Azure OpenAI configuration
+        azure_endpoint = os.environ.get('AZURE_OPENAI_ENDPOINT')
+        azure_api_key = os.environ.get('AZURE_OPENAI_API_KEY')
+        azure_deployment = os.environ.get('AZURE_OPENAI_DEPLOYMENT', model)
+
+        if azure_endpoint and azure_api_key:
+            # Use Azure OpenAI
+            client = openai.AzureOpenAI(
+                azure_endpoint=azure_endpoint,
+                api_key=azure_api_key,
+                api_version="2024-02-15-preview"
+            )
+            model = azure_deployment  # Use deployment name for Azure
+        else:
+            # Use standard OpenAI
+            client = openai.OpenAI()
+
+        try:
+            response = client.chat.completions.create(
+                model=model,
+                messages=[
+                    {
+                        "role": "user",
+                        "content": [
+                            {"type": "text", "text": self.EXTRACTION_PROMPT},
+                            {
+                                "type": "image_url",
+                                "image_url": {
+                                    "url": f"data:{media_type};base64,{image_b64}",
+                                    "detail": "high"
+                                }
+                            }
+                        ]
+                    }
+                ],
+                max_tokens=1000,
+                temperature=0,
+            )
+
+            raw_response = response.choices[0].message.content
+            processing_time = (time.time() - start_time) * 1000
+
+            # Parse JSON response
+            # Try to extract JSON from response (may have markdown code blocks)
+            json_str = raw_response
+            if "```json" in json_str:
+                json_str = json_str.split("```json")[1].split("```")[0]
+            elif "```" in json_str:
+                json_str = json_str.split("```")[1].split("```")[0]
+
+            data = json.loads(json_str.strip())
+
+            return LLMExtractionResult(
+                document_id="",  # Will be set by caller
+                invoice_number=data.get('invoice_number'),
+                invoice_date=data.get('invoice_date'),
+                invoice_due_date=data.get('invoice_due_date'),
+                ocr_number=data.get('ocr_number'),
+                bankgiro=data.get('bankgiro'),
+                plusgiro=data.get('plusgiro'),
+                amount=data.get('amount'),
+                supplier_organisation_number=data.get('supplier_organisation_number'),
+                raw_response=raw_response,
+                model_used=model,
+                processing_time_ms=processing_time,
+            )
+
+        except json.JSONDecodeError as e:
+            return LLMExtractionResult(
+                document_id="",
+                raw_response=raw_response if 'raw_response' in dir() else None,
+                model_used=model,
+                processing_time_ms=(time.time() - start_time) * 1000,
+                error=f"JSON parse error: {str(e)}"
+            )
+        except Exception as e:
+            return LLMExtractionResult(
+                document_id="",
+                model_used=model,
+                processing_time_ms=(time.time() - start_time) * 1000,
+                error=str(e)
+            )
+
+    def extract_with_anthropic(
+        self,
+        image_bytes: bytes,
+        model: str = "claude-sonnet-4-20250514"
+    ) -> LLMExtractionResult:
+        """
+        Extract fields using Anthropic's Claude API.
+
+        Args:
+            image_bytes: PNG image bytes
+            model: Model to use
+
+        Returns:
+            Extraction result
+        """
+        import anthropic
+        import time
+
+        start_time = time.time()
+
+        # Encode image to base64
+        image_b64 = base64.b64encode(image_bytes).decode('utf-8')
+
+        client = anthropic.Anthropic()
+
+        try:
+            response = client.messages.create(
+                model=model,
+                max_tokens=1000,
+                messages=[
+                    {
+                        "role": "user",
+                        "content": [
+                            {
+                                "type": "image",
+                                "source": {
+                                    "type": "base64",
+                                    "media_type": "image/png",
+                                    "data": image_b64,
+                                }
+                            },
+                            {
+                                "type": "text",
+                                "text": self.EXTRACTION_PROMPT
+                            }
+                        ]
+                    }
+                ],
+            )
+
+            raw_response = response.content[0].text
+            processing_time = (time.time() - start_time) * 1000
+
+            # Parse JSON response
+            json_str = raw_response
+            if "```json" in json_str:
+                json_str = json_str.split("```json")[1].split("```")[0]
+            elif "```" in json_str:
+                json_str = json_str.split("```")[1].split("```")[0]
+
+            data = json.loads(json_str.strip())
+
+            return LLMExtractionResult(
+                document_id="",
+                invoice_number=data.get('invoice_number'),
+                invoice_date=data.get('invoice_date'),
+                invoice_due_date=data.get('invoice_due_date'),
+                ocr_number=data.get('ocr_number'),
+                bankgiro=data.get('bankgiro'),
+                plusgiro=data.get('plusgiro'),
+                amount=data.get('amount'),
+                supplier_organisation_number=data.get('supplier_organisation_number'),
+                raw_response=raw_response,
+                model_used=model,
+                processing_time_ms=processing_time,
+            )
+
+        except json.JSONDecodeError as e:
+            return LLMExtractionResult(
+                document_id="",
+                raw_response=raw_response if 'raw_response' in dir() else None,
+                model_used=model,
+                processing_time_ms=(time.time() - start_time) * 1000,
+                error=f"JSON parse error: {str(e)}"
+            )
+        except Exception as e:
+            return LLMExtractionResult(
+                document_id="",
+                model_used=model,
+                processing_time_ms=(time.time() - start_time) * 1000,
+                error=str(e)
+            )
+
+    def save_validation_result(self, result: LLMExtractionResult):
+        """Save extraction result to database."""
+        conn = self.connect()
+        with conn.cursor() as cursor:
+            cursor.execute("""
+                INSERT INTO llm_validations (
+                    document_id, invoice_number, invoice_date, invoice_due_date,
+                    ocr_number, bankgiro, plusgiro, amount,
+                    supplier_organisation_number, raw_response, model_used,
+                    processing_time_ms, error
+                ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
+                ON CONFLICT (document_id) DO UPDATE SET
+                    invoice_number = EXCLUDED.invoice_number,
+                    invoice_date = EXCLUDED.invoice_date,
+                    invoice_due_date = EXCLUDED.invoice_due_date,
+                    ocr_number = EXCLUDED.ocr_number,
+                    bankgiro = EXCLUDED.bankgiro,
+                    plusgiro = EXCLUDED.plusgiro,
+                    amount = EXCLUDED.amount,
+                    supplier_organisation_number = EXCLUDED.supplier_organisation_number,
+                    raw_response = EXCLUDED.raw_response,
+                    model_used = EXCLUDED.model_used,
+                    processing_time_ms = EXCLUDED.processing_time_ms,
+                    error = EXCLUDED.error,
+                    created_at = NOW()
+            """, (
+                result.document_id,
+                result.invoice_number,
+                result.invoice_date,
+                result.invoice_due_date,
+                result.ocr_number,
+                result.bankgiro,
+                result.plusgiro,
+                result.amount,
+                result.supplier_organisation_number,
+                result.raw_response,
+                result.model_used,
+                result.processing_time_ms,
+                result.error,
+            ))
+        conn.commit()
+
+    def validate_document(
+        self,
+        doc_id: str,
+        provider: str = "openai",
+        model: str = None
+    ) -> LLMExtractionResult:
+        """
+        Validate a single document using LLM.
+
+        Args:
+            doc_id: Document ID
+            provider: LLM provider ("openai" or "anthropic")
+            model: Model to use (defaults based on provider)
+
+        Returns:
+            Extraction result
+        """
+        # Get PDF path
+        pdf_path = self.pdf_dir / f"{doc_id}.pdf"
+        if not pdf_path.exists():
+            return LLMExtractionResult(
+                document_id=doc_id,
+                error=f"PDF not found: {pdf_path}"
+            )
+
+        # Render first page
+        try:
+            image_bytes = self.render_pdf_to_image(pdf_path, page_no=0)
+        except Exception as e:
+            return LLMExtractionResult(
+                document_id=doc_id,
+                error=f"Failed to render PDF: {str(e)}"
+            )
+
+        # Extract with LLM
+        if provider == "openai":
+            model = model or "gpt-4o"
+            result = self.extract_with_openai(image_bytes, model)
+        elif provider == "anthropic":
+            model = model or "claude-sonnet-4-20250514"
+            result = self.extract_with_anthropic(image_bytes, model)
+        else:
+            return LLMExtractionResult(
+                document_id=doc_id,
+                error=f"Unknown provider: {provider}"
+            )
+
+        result.document_id = doc_id
+
+        # Save to database
+        self.save_validation_result(result)
+
+        return result
+
+    def validate_batch(
+        self,
+        limit: int = 10,
+        provider: str = "openai",
+        model: str = None,
+        verbose: bool = True
+    ) -> List[LLMExtractionResult]:
+        """
+        Validate a batch of documents with failed matches.
+
+        Args:
+            limit: Maximum number of documents to validate
+            provider: LLM provider
+            model: Model to use
+            verbose: Print progress
+
+        Returns:
+            List of extraction results
+        """
+        # Get documents to validate
+        docs = self.get_documents_with_failed_matches(limit=limit)
+
+        if verbose:
+            print(f"Found {len(docs)} documents with failed matches to validate")
+
+        results = []
+        for i, doc in enumerate(docs):
+            doc_id = doc['document_id']
+
+            if verbose:
+                failed_fields = [f['field'] for f in doc['failed_fields']]
+                print(f"[{i+1}/{len(docs)}] Validating {doc_id[:8]}... (failed: {', '.join(failed_fields)})")
+
+            result = self.validate_document(doc_id, provider, model)
+            results.append(result)
+
+            if verbose:
+                if result.error:
+                    print(f"  ERROR: {result.error}")
+                else:
+                    print(f"  OK ({result.processing_time_ms:.0f}ms)")
+
+        return results
+
+    def compare_results(self, doc_id: str) -> Dict[str, Any]:
+        """
+        Compare LLM extraction with autolabel results.
+
+        Args:
+            doc_id: Document ID
+
+        Returns:
+            Comparison results
+        """
+        conn = self.connect()
+        with conn.cursor() as cursor:
+            # Get autolabel results
+            cursor.execute("""
+                SELECT field_name, csv_value, matched, matched_text
+                FROM field_results
+                WHERE document_id = %s
+            """, (doc_id,))
+
+            autolabel = {}
+            for row in cursor.fetchall():
+                autolabel[row[0]] = {
+                    'csv_value': row[1],
+                    'matched': row[2],
+                    'matched_text': row[3],
+                }
+
+            # Get LLM results
+            cursor.execute("""
+                SELECT invoice_number, invoice_date, invoice_due_date,
+                       ocr_number, bankgiro, plusgiro, amount,
+                       supplier_organisation_number
+                FROM llm_validations
+                WHERE document_id = %s
+            """, (doc_id,))
+
+            row = cursor.fetchone()
+            if not row:
+                return {'error': 'No LLM validation found'}
+
+            llm = {
+                'InvoiceNumber': row[0],
+                'InvoiceDate': row[1],
+                'InvoiceDueDate': row[2],
+                'OCR': row[3],
+                'Bankgiro': row[4],
+                'Plusgiro': row[5],
+                'Amount': row[6],
+                'supplier_organisation_number': row[7],
+            }
+
+        # Compare
+        comparison = {}
+        for field in self.FIELDS_TO_EXTRACT:
+            auto = autolabel.get(field, {})
+            llm_value = llm.get(field)
+
+            comparison[field] = {
+                'csv_value': auto.get('csv_value'),
+                'autolabel_matched': auto.get('matched'),
+                'autolabel_text': auto.get('matched_text'),
+                'llm_value': llm_value,
+                'agreement': self._values_match(auto.get('csv_value'), llm_value),
+            }
+
+        return comparison
+
+    def _values_match(self, csv_value: str, llm_value: str) -> bool:
+        """Check if CSV value matches LLM extracted value."""
+        if csv_value is None or llm_value is None:
+            return csv_value == llm_value
+
+        # Normalize for comparison
+        csv_norm = str(csv_value).strip().lower().replace('-', '').replace(' ', '')
+        llm_norm = str(llm_value).strip().lower().replace('-', '').replace(' ', '')
+
+        return csv_norm == llm_norm
--- a/src/web/app.py
+++ b/src/web/app.py
@@ -81,6 +81,9 @@ def create_app(config: AppConfig | None = None) -> FastAPI:
        - Bankgiro
        - Plusgiro
        - Amount
+        - supplier_org_number (Swedish organization number)
+        - customer_number
+        - payment_line (machine-readable payment code)
        """,
        version="1.0.0",
        lifespan=lifespan,
@@ -161,17 +164,11 @@ def get_html_ui() -> str:
        }

        .main-content {
-            display: grid;
-            grid-template-columns: 1fr 1fr;
+            display: flex;
+            flex-direction: column;
            gap: 20px;
        }

-        @media (max-width: 900px) {
-            .main-content {
-                grid-template-columns: 1fr;
-            }
-        }
-
        .card {
            background: white;
            border-radius: 16px;
@@ -188,14 +185,28 @@ def get_html_ui() -> str:
            gap: 10px;
        }

+        .upload-card {
+            display: flex;
+            align-items: center;
+            gap: 20px;
+            flex-wrap: wrap;
+        }
+
+        .upload-card h2 {
+            margin-bottom: 0;
+            white-space: nowrap;
+        }
+
        .upload-area {
-            border: 3px dashed #ddd;
-            border-radius: 12px;
-            padding: 40px;
+            border: 2px dashed #ddd;
+            border-radius: 10px;
+            padding: 15px 25px;
            text-align: center;
            cursor: pointer;
            transition: all 0.3s;
            background: #fafafa;
+            flex: 1;
+            min-width: 200px;
        }

        .upload-area:hover, .upload-area.dragover {
@@ -209,17 +220,21 @@ def get_html_ui() -> str:
        }

        .upload-icon {
-            font-size: 48px;
-            margin-bottom: 15px;
+            font-size: 24px;
+            display: inline;
+            margin-right: 8px;
        }

        .upload-area p {
            color: #666;
-            margin-bottom: 10px;
+            margin: 0;
+            display: inline;
        }

        .upload-area small {
            color: #999;
+            display: block;
+            margin-top: 5px;
        }

        #file-input {
@@ -237,10 +252,10 @@ def get_html_ui() -> str:

        .btn {
            display: inline-block;
-            padding: 14px 28px;
+            padding: 12px 24px;
            border: none;
            border-radius: 10px;
-            font-size: 1rem;
+            font-size: 0.9rem;
            font-weight: 600;
            cursor: pointer;
            transition: all 0.3s;
@@ -251,8 +266,6 @@ def get_html_ui() -> str:
        .btn-primary {
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
-            width: 100%;
-            margin-top: 20px;
        }

        .btn-primary:hover:not(:disabled) {
@@ -267,22 +280,21 @@ def get_html_ui() -> str:

        .loading {
            display: none;
-            text-align: center;
-            padding: 20px;
+            align-items: center;
+            gap: 10px;
        }

        .loading.active {
-            display: block;
+            display: flex;
        }

        .spinner {
-            width: 40px;
-            height: 40px;
-            border: 4px solid #f3f3f3;
-            border-top: 4px solid #667eea;
+            width: 24px;
+            height: 24px;
+            border: 3px solid #f3f3f3;
+            border-top: 3px solid #667eea;
            border-radius: 50%;
            animation: spin 1s linear infinite;
-            margin: 0 auto 15px;
        }

        @keyframes spin {
@@ -331,7 +343,7 @@ def get_html_ui() -> str:

        .fields-grid {
            display: grid;
-            grid-template-columns: repeat(2, 1fr);
+            grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
            gap: 12px;
        }

@@ -380,6 +392,84 @@ def get_html_ui() -> str:
            margin-top: 15px;
        }

+        .cross-validation {
+            background: #f8fafc;
+            border: 1px solid #e2e8f0;
+            border-radius: 10px;
+            padding: 15px;
+            margin-top: 20px;
+        }
+
+        .cross-validation h3 {
+            margin: 0 0 10px 0;
+            color: #334155;
+            font-size: 1rem;
+        }
+
+        .cv-status {
+            font-weight: 600;
+            padding: 8px 12px;
+            border-radius: 6px;
+            margin-bottom: 10px;
+            display: inline-block;
+        }
+
+        .cv-status.valid {
+            background: #dcfce7;
+            color: #166534;
+        }
+
+        .cv-status.invalid {
+            background: #fef3c7;
+            color: #92400e;
+        }
+
+        .cv-details {
+            display: flex;
+            flex-wrap: wrap;
+            gap: 8px;
+            margin-top: 10px;
+        }
+
+        .cv-item {
+            background: white;
+            border: 1px solid #e2e8f0;
+            border-radius: 6px;
+            padding: 6px 12px;
+            font-size: 0.85rem;
+            display: flex;
+            align-items: center;
+            gap: 6px;
+        }
+
+        .cv-item.match {
+            border-color: #86efac;
+            background: #f0fdf4;
+        }
+
+        .cv-item.mismatch {
+            border-color: #fca5a5;
+            background: #fef2f2;
+        }
+
+        .cv-icon {
+            font-weight: bold;
+        }
+
+        .cv-item.match .cv-icon {
+            color: #16a34a;
+        }
+
+        .cv-item.mismatch .cv-icon {
+            color: #dc2626;
+        }
+
+        .cv-summary {
+            margin-top: 10px;
+            font-size: 0.8rem;
+            color: #64748b;
+        }
+
        .error-message {
            background: #fee2e2;
            color: #991b1b;
@@ -405,33 +495,35 @@ def get_html_ui() -> str:
        </header>

        <div class="main-content">
-            <div class="card">
-                <h2>📤 Upload Document</h2>
+            <!-- Upload Section - Compact -->
+            <div class="card upload-card">
+                <h2>📤 Upload</h2>

                <div class="upload-area" id="upload-area">
-                    <div class="upload-icon">📁</div>
-                    <p>Drag & drop your file here</p>
-                    <p>or <strong>click to browse</strong></p>
-                    <small>Supports PDF, PNG, JPG (max 50MB)</small>
+                    <span class="upload-icon">📁</span>
+                    <p>Drag & drop or <strong>click to browse</strong></p>
+                    <small>PDF, PNG, JPG (max 50MB)</small>
                    <input type="file" id="file-input" accept=".pdf,.png,.jpg,.jpeg">
-                    <div class="file-name" id="file-name" style="display: none;"></div>
                </div>

+                <div class="file-name" id="file-name" style="display: none;"></div>
+
                <button class="btn btn-primary" id="submit-btn" disabled>
-                    🚀 Extract Fields
+                    🚀 Extract
                </button>

                <div class="loading" id="loading">
                    <div class="spinner"></div>
-                    <p>Processing document...</p>
+                    <p>Processing...</p>
                </div>
            </div>

+            <!-- Results Section - Full Width -->
            <div class="card">
                <h2>📊 Extraction Results</h2>

-                <div id="placeholder" style="text-align: center; padding: 40px; color: #999;">
-                    <div style="font-size: 64px; margin-bottom: 15px;">🔍</div>
+                <div id="placeholder" style="text-align: center; padding: 30px; color: #999;">
+                    <div style="font-size: 48px; margin-bottom: 10px;">🔍</div>
                    <p>Upload a document to see extraction results</p>
                </div>

@@ -445,6 +537,8 @@ def get_html_ui() -> str:

                    <div class="processing-time" id="processing-time"></div>

+                    <div class="cross-validation" id="cross-validation" style="display: none;"></div>
+
                    <div class="error-message" id="error-message" style="display: none;"></div>

                    <div class="visualization" id="visualization" style="display: none;">
@@ -566,7 +660,11 @@ def get_html_ui() -> str:
            const fieldsGrid = document.getElementById('fields-grid');
            fieldsGrid.innerHTML = '';

-            const fieldOrder = ['InvoiceNumber', 'InvoiceDate', 'InvoiceDueDate', 'OCR', 'Amount', 'Bankgiro', 'Plusgiro'];
+            const fieldOrder = [
+                'InvoiceNumber', 'InvoiceDate', 'InvoiceDueDate', 'OCR',
+                'Amount', 'Bankgiro', 'Plusgiro',
+                'supplier_org_number', 'customer_number', 'payment_line'
+            ];

            fieldOrder.forEach(field => {
                const value = result.fields[field];
@@ -588,6 +686,45 @@ def get_html_ui() -> str:
            document.getElementById('processing-time').textContent =
                `⏱️ Processed in ${result.processing_time_ms.toFixed(0)}ms`;

+            // Cross-validation results
+            const cvDiv = document.getElementById('cross-validation');
+            if (result.cross_validation) {
+                const cv = result.cross_validation;
+                let cvHtml = '<h3>🔍 Cross-Validation (Payment Line)</h3>';
+                cvHtml += `<div class="cv-status ${cv.is_valid ? 'valid' : 'invalid'}">`;
+                cvHtml += cv.is_valid ? '✅ Valid' : '⚠️ Mismatch Detected';
+                cvHtml += '</div>';
+
+                cvHtml += '<div class="cv-details">';
+                if (cv.payment_line_ocr) {
+                    const matchIcon = cv.ocr_match === true ? '✓' : (cv.ocr_match === false ? '✗' : '—');
+                    cvHtml += `<div class="cv-item ${cv.ocr_match === true ? 'match' : (cv.ocr_match === false ? 'mismatch' : '')}">`;
+                    cvHtml += `<span class="cv-icon">${matchIcon}</span> OCR: ${cv.payment_line_ocr}</div>`;
+                }
+                if (cv.payment_line_amount) {
+                    const matchIcon = cv.amount_match === true ? '✓' : (cv.amount_match === false ? '✗' : '—');
+                    cvHtml += `<div class="cv-item ${cv.amount_match === true ? 'match' : (cv.amount_match === false ? 'mismatch' : '')}">`;
+                    cvHtml += `<span class="cv-icon">${matchIcon}</span> Amount: ${cv.payment_line_amount}</div>`;
+                }
+                if (cv.payment_line_account) {
+                    const accountType = cv.payment_line_account_type === 'bankgiro' ? 'Bankgiro' : 'Plusgiro';
+                    const matchField = cv.payment_line_account_type === 'bankgiro' ? cv.bankgiro_match : cv.plusgiro_match;
+                    const matchIcon = matchField === true ? '✓' : (matchField === false ? '✗' : '—');
+                    cvHtml += `<div class="cv-item ${matchField === true ? 'match' : (matchField === false ? 'mismatch' : '')}">`;
+                    cvHtml += `<span class="cv-icon">${matchIcon}</span> ${accountType}: ${cv.payment_line_account}</div>`;
+                }
+                cvHtml += '</div>';
+
+                if (cv.details && cv.details.length > 0) {
+                    cvHtml += '<div class="cv-summary">' + cv.details[cv.details.length - 1] + '</div>';
+                }
+
+                cvDiv.innerHTML = cvHtml;
+                cvDiv.style.display = 'block';
+            } else {
+                cvDiv.style.display = 'none';
+            }
+
            // Visualization
            if (result.visualization_url) {
                const vizDiv = document.getElementById('visualization');
@@ -608,7 +745,19 @@ def get_html_ui() -> str:
        }

        function formatFieldName(name) {
-            return name.replace(/([A-Z])/g, ' $1').trim();
+            const nameMap = {
+                'InvoiceNumber': 'Invoice Number',
+                'InvoiceDate': 'Invoice Date',
+                'InvoiceDueDate': 'Due Date',
+                'OCR': 'OCR Reference',
+                'Amount': 'Amount',
+                'Bankgiro': 'Bankgiro',
+                'Plusgiro': 'Plusgiro',
+                'supplier_org_number': 'Supplier Org Number',
+                'customer_number': 'Customer Number',
+                'payment_line': 'Payment Line'
+            };
+            return nameMap[name] || name.replace(/([A-Z])/g, ' $1').replace(/_/g, ' ').trim();
        }
    </script>
 </body>
--- a/src/web/config.py
+++ b/src/web/config.py
@@ -13,8 +13,8 @@ from typing import Any
 class ModelConfig:
    """YOLO model configuration."""

-    model_path: Path = Path("runs/train/invoice_yolo11n_full/weights/best.pt")
-    confidence_threshold: float = 0.3
+    model_path: Path = Path("runs/train/invoice_fields/weights/best.pt")
+    confidence_threshold: float = 0.5
    use_gpu: bool = True
    dpi: int = 150

--- a/src/yolo/annotation_generator.py
+++ b/src/yolo/annotation_generator.py
@@ -22,6 +22,7 @@ FIELD_CLASSES = {
    'Amount': 6,
    'supplier_organisation_number': 7,
    'customer_number': 8,
+    'payment_line': 9,  # Machine code payment line at bottom of invoice
 }

 # Fields that need matching but map to other YOLO classes
@@ -43,6 +44,7 @@ CLASS_NAMES = [
    'amount',
    'supplier_org_number',
    'customer_number',
+    'payment_line',  # Machine code payment line at bottom of invoice
 ]


@@ -160,6 +162,68 @@ class AnnotationGenerator:

        return annotations

+    def add_payment_line_annotation(
+        self,
+        annotations: list[YOLOAnnotation],
+        payment_line_bbox: tuple[float, float, float, float],
+        confidence: float,
+        image_width: float,
+        image_height: float,
+        dpi: int = 300
+    ) -> list[YOLOAnnotation]:
+        """
+        Add payment_line annotation from machine code parser result.
+
+        Args:
+            annotations: Existing list of annotations to append to
+            payment_line_bbox: Bounding box (x0, y0, x1, y1) in PDF coordinates
+            confidence: Confidence score from machine code parser
+            image_width: Width of the rendered image in pixels
+            image_height: Height of the rendered image in pixels
+            dpi: DPI used for rendering
+
+        Returns:
+            Updated annotations list with payment_line annotation added
+        """
+        if not payment_line_bbox or confidence < self.min_confidence:
+            return annotations
+
+        # Scale factor to convert PDF points (72 DPI) to rendered pixels
+        scale = dpi / 72.0
+
+        x0, y0, x1, y1 = payment_line_bbox
+        x0, y0, x1, y1 = x0 * scale, y0 * scale, x1 * scale, y1 * scale
+
+        # Add absolute padding
+        pad = self.bbox_padding_px
+        x0 = max(0, x0 - pad)
+        y0 = max(0, y0 - pad)
+        x1 = min(image_width, x1 + pad)
+        y1 = min(image_height, y1 + pad)
+
+        # Convert to YOLO format (normalized center + size)
+        x_center = (x0 + x1) / 2 / image_width
+        y_center = (y0 + y1) / 2 / image_height
+        width = (x1 - x0) / image_width
+        height = (y1 - y0) / image_height
+
+        # Clamp values to 0-1
+        x_center = max(0, min(1, x_center))
+        y_center = max(0, min(1, y_center))
+        width = max(0, min(1, width))
+        height = max(0, min(1, height))
+
+        annotations.append(YOLOAnnotation(
+            class_id=FIELD_CLASSES['payment_line'],
+            x_center=x_center,
+            y_center=y_center,
+            width=width,
+            height=height,
+            confidence=confidence
+        ))
+
+        return annotations
+
    def save_annotations(
        self,
        annotations: list[YOLOAnnotation],
--- a/src/yolo/db_dataset.py
+++ b/src/yolo/db_dataset.py
@@ -74,7 +74,7 @@ class DBYOLODataset:
        train_ratio: float = 0.8,
        val_ratio: float = 0.1,
        seed: int = 42,
-        dpi: int = 300,
+        dpi: int = 150,  # Must match the DPI used in autolabel_tasks.py for rendering
        min_confidence: float = 0.7,
        bbox_padding_px: int = 20,
        min_bbox_height_px: int = 30,
@@ -276,7 +276,14 @@ class DBYOLODataset:
                        continue

                    field_name = field_result.get('field_name')
-                    if field_name not in FIELD_CLASSES:
+
+                    # Map supplier_accounts(X) to the actual class name (Bankgiro/Plusgiro)
+                    yolo_class_name = field_name
+                    if field_name and field_name.startswith('supplier_accounts('):
+                        # Extract the account type: "supplier_accounts(Bankgiro)" -> "Bankgiro"
+                        yolo_class_name = field_name.split('(')[1].rstrip(')')
+
+                    if yolo_class_name not in FIELD_CLASSES:
                        continue

                    score = field_result.get('score', 0)
@@ -288,7 +295,7 @@ class DBYOLODataset:

                    if bbox and len(bbox) == 4:
                        annotation = self._create_annotation(
-                            field_name=field_name,
+                            field_name=yolo_class_name,  # Use mapped class name
                            bbox=bbox,
                            score=score
                        )