Re-structure the project.

This commit is contained in:
Yaojia Wang
2026-01-25 15:21:11 +01:00
parent 8fd61ea928
commit e599424a92
80 changed files with 10672 additions and 1584 deletions

@@ -0,0 +1,519 @@
# Performance Optimization Guide
This document provides performance optimization recommendations for the Invoice Field Extraction system.
## Table of Contents
1. [Batch Processing Optimization](#batch-processing-optimization)
2. [Database Query Optimization](#database-query-optimization)
3. [Caching Strategies](#caching-strategies)
4. [Memory Management](#memory-management)
5. [Profiling and Monitoring](#profiling-and-monitoring)
---
## Batch Processing Optimization
### Current State
The system processes invoices one at a time. For large batches, this can be inefficient.
### Recommendations
#### 1. Database Batch Operations
**Current**: Individual inserts for each document
```python
# Inefficient
for doc in documents:
    db.insert_document(doc)  # Individual DB call
```
**Optimized**: Use `execute_values` for batch inserts
```python
# Efficient - already implemented in db.py line 519
from psycopg2.extras import execute_values
execute_values(cursor, """
    INSERT INTO documents (...)
    VALUES %s
""", document_values)
```
**Impact**: 10-50x faster for batches of 100+ documents
#### 2. PDF Processing Batching
**Recommendation**: Process PDFs in parallel using multiprocessing
```python
from multiprocessing import Pool
def process_batch(pdf_paths, batch_size=10):
"""Process PDFs in parallel batches."""
with Pool(processes=batch_size) as pool:
results = pool.map(pipeline.process_pdf, pdf_paths)
return results
```
**Considerations**:
- GPU models should use a shared process pool (already exists: `src/processing/gpu_pool.py`)
- CPU-intensive tasks can use separate process pool (`src/processing/cpu_pool.py`)
- Current dual pool coordinator (`dual_pool_coordinator.py`) already supports this pattern
**Status**: ✅ Already implemented in `src/processing/` modules
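For intuition, a minimal sketch of the split using `concurrent.futures` (illustration only, not the project's coordinator API; `render_fn`/`detect_fn` stand in for module-level, picklable functions):
```python
from concurrent.futures import ProcessPoolExecutor

def process_batch_dual(pdf_paths, render_fn, detect_fn, cpu_workers=8, gpu_workers=2):
    """Render on a wide CPU pool, then run inference on a narrow pool sized to the GPU."""
    with ProcessPoolExecutor(max_workers=cpu_workers) as cpu_pool, \
         ProcessPoolExecutor(max_workers=gpu_workers) as gpu_pool:
        images = list(cpu_pool.map(render_fn, pdf_paths))  # CPU-bound rendering
        return list(gpu_pool.map(detect_fn, images))       # GPU-bound inference
```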
#### 3. Image Caching for Multi-Page PDFs
**Current**: Each page rendered independently
```python
# Current pattern in field_extractor.py
for page_num in range(total_pages):
    image = render_pdf_page(pdf_path, page_num, dpi=300)
```
**Optimized**: Pre-render all pages if processing multiple fields per page
```python
# Batch render
images = {
    page_num: render_pdf_page(pdf_path, page_num, dpi=300)
    for page_num in page_numbers_needed
}

# Reuse images
for detection in detections:
    image = images[detection.page_no]
    extract_field(detection, image)
```
**Impact**: Reduces redundant PDF rendering by 50-90% for multi-field invoices
---
## Database Query Optimization
### Current Performance
- **Parameterized queries**: ✅ Implemented (Phase 1)
- **Connection pooling**: ❌ Not implemented
- **Query batching**: ✅ Partially implemented
- **Index optimization**: ⚠️ Needs verification
### Recommendations
#### 1. Connection Pooling
**Current**: New connection for each operation
```python
def connect(self):
"""Create new database connection."""
return psycopg2.connect(**self.config)
```
**Optimized**: Use connection pooling
```python
from psycopg2 import pool
class DocumentDatabase:
    def __init__(self, config):
        self.pool = pool.SimpleConnectionPool(
            minconn=1,
            maxconn=10,
            **config
        )

    def connect(self):
        return self.pool.getconn()

    def close(self, conn):
        self.pool.putconn(conn)
```
**Impact**:
- Reduces connection overhead by 80-95%
- Especially important for high-frequency operations
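Borrowed connections must always be returned, even on errors. A minimal sketch of a context-manager wrapper around the pool sketched above (the `db` instance name is an assumption):
```python
from contextlib import contextmanager

@contextmanager
def pooled_connection(pool):
    """Borrow a connection from the pool and guarantee it is returned."""
    conn = pool.getconn()
    try:
        yield conn
    finally:
        pool.putconn(conn)

# Usage
with pooled_connection(db.pool) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT count(*) FROM documents")
        print(cursor.fetchone()[0])
```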
#### 2. Index Recommendations
**Check current indexes**:
```sql
-- Verify indexes exist on frequently queried columns
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public';
```
**Recommended indexes**:
```sql
-- If not already present
CREATE INDEX IF NOT EXISTS idx_documents_success
    ON documents(success);

CREATE INDEX IF NOT EXISTS idx_documents_timestamp
    ON documents(timestamp DESC);

CREATE INDEX IF NOT EXISTS idx_field_results_document_id
    ON field_results(document_id);

CREATE INDEX IF NOT EXISTS idx_field_results_matched
    ON field_results(matched);

CREATE INDEX IF NOT EXISTS idx_field_results_field_name
    ON field_results(field_name);
```
**Impact**:
- 10-100x faster queries for filtered/sorted results
- Critical for `get_failed_matches()` and `get_all_documents_summary()`
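To confirm Postgres actually uses an index, check the query plan; it should report an Index Scan rather than a Seq Scan (the query below assumes the `documents` schema implied by the indexes above):
```sql
-- Should show an Index Scan once idx_documents_success exists
EXPLAIN ANALYZE
SELECT * FROM documents
WHERE success = false
ORDER BY timestamp DESC
LIMIT 50;
```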
#### 3. Query Batching
**Status**: ✅ Already implemented for field results (line 519)
**Verify batching is used**:
```python
# Good pattern in db.py
execute_values(cursor, "INSERT INTO field_results (...) VALUES %s", field_values)
```
**Additional opportunity**: Batch `SELECT` queries
```python
# Current
docs = [get_document(doc_id) for doc_id in doc_ids] # N queries
# Optimized
docs = get_documents_batch(doc_ids) # 1 query with IN clause
```
**Status**: ✅ Already implemented (`get_documents_batch` exists in db.py)
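For reference, such a helper can be a single `ANY` query; a minimal sketch (the `id` column name is an assumption; the real implementation is in db.py):
```python
def get_documents_batch(cursor, doc_ids):
    """Fetch many documents in one round trip instead of N."""
    # psycopg2 adapts a Python list to a Postgres array for ANY(...)
    cursor.execute(
        "SELECT * FROM documents WHERE id = ANY(%s)",
        (list(doc_ids),),
    )
    return cursor.fetchall()
```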
---
## Caching Strategies
### 1. Model Loading Cache
**Current**: Models loaded per-instance
**Recommendation**: Singleton pattern for YOLO model
```python
class YOLODetectorSingleton:
    _instance = None

    @classmethod
    def get_instance(cls, model_path):
        # model_path is only honored on the first call
        if cls._instance is None:
            cls._instance = YOLODetector(model_path)
        return cls._instance
```
**Impact**: Avoids loading duplicate model copies, reducing model memory usage by up to ~90% when multiple pipeline instances would otherwise each load their own detector
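Usage is then a single call wherever a detector is needed; note that `model_path` only takes effect on the first call (the path below is a placeholder):
```python
detector = YOLODetectorSingleton.get_instance("path/to/model.pt")
```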
### 2. Parser Instance Caching
**Current**: ✅ Already optimal
```python
# Good pattern in field_extractor.py
def __init__(self):
    self.payment_line_parser = PaymentLineParser()  # Reused
    self.customer_number_parser = CustomerNumberParser()  # Reused
```
**Status**: No changes needed
### 3. OCR Result Caching
**Recommendation**: Cache OCR results for identical regions
```python
from functools import lru_cache
@lru_cache(maxsize=1000)
def ocr_region_cached(image_hash, bbox):
    """Cache OCR results by image hash + bbox (bbox must be a hashable tuple)."""
    image = _image_store[image_hash]  # recover the pixels; see hashing sketch below
    return paddle_ocr.ocr_region(image, bbox)
```
**Impact**: 50-80% speedup when re-processing similar documents
**Note**: Requires implementing image hashing (e.g., `hashlib.md5(image.tobytes())`)
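A minimal sketch of that hashing piece, assuming NumPy/OpenCV-style images; `_image_store` and the `ocr_region` wrapper are hypothetical names:
```python
import hashlib

_image_store = {}  # hash -> image, so the cached function can recover the pixels

def ocr_region(image, bbox):
    """Hash the image once, register it, and delegate to the cached function."""
    image_hash = hashlib.md5(image.tobytes()).hexdigest()
    _image_store[image_hash] = image  # note: grows unboundedly; evict in production
    return ocr_region_cached(image_hash, tuple(bbox))  # tuple: lru_cache needs hashable args
```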
---
## Memory Management
### Current Issues
**Potential memory leaks**:
1. Large images kept in memory after processing
2. OCR results accumulated without cleanup
3. Model outputs not explicitly cleared
### Recommendations
#### 1. Explicit Image Cleanup
```python
import gc
def process_pdf(pdf_path):
    image = None  # pre-bind so the finally block is safe if rendering fails
    try:
        image = render_pdf(pdf_path)
        result = extract_fields(image)
        return result
    finally:
        del image  # Explicit cleanup
        gc.collect()  # Force garbage collection
```
#### 2. Generator Pattern for Large Batches
**Current**: Load all documents into memory
```python
docs = [process_pdf(path) for path in pdf_paths] # All in memory
```
**Optimized**: Use generator for streaming processing
```python
def process_batch_streaming(pdf_paths):
"""Process documents one at a time, yielding results."""
for path in pdf_paths:
result = process_pdf(path)
yield result
# Result can be saved to DB immediately
# Previous result is garbage collected
```
**Impact**: Constant memory usage regardless of batch size
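Consuming the generator then keeps only one result alive at a time (`db.save_document` below is a hypothetical persistence call):
```python
for result in process_batch_streaming(pdf_paths):
    db.save_document(result)  # persist immediately; the result is freed next iteration
```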
#### 3. Context Managers for Resources
```python
class InferencePipeline:
    def __enter__(self):
        self.detector.load_model()
        return self

    def __exit__(self, *args):
        self.detector.unload_model()
        self.extractor.cleanup()

# Usage
with InferencePipeline(...) as pipeline:
    results = pipeline.process_pdf(path)
# Automatic cleanup
```
---
## Profiling and Monitoring
### Recommended Profiling Tools
#### 1. cProfile for CPU Profiling
```python
import cProfile
import pstats
profiler = cProfile.Profile()
profiler.enable()
# Your code here
pipeline.process_pdf(pdf_path)
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 slowest functions
```
#### 2. memory_profiler for Memory Analysis
```bash
pip install memory_profiler
python -m memory_profiler your_script.py
```
Or decorator-based:
```python
from memory_profiler import profile
@profile
def process_large_batch(pdf_paths):
    # Memory usage tracked line-by-line
    results = [process_pdf(path) for path in pdf_paths]
    return results
```
#### 3. py-spy for Production Profiling
```bash
pip install py-spy
# Profile running process
py-spy top --pid 12345
# Generate flamegraph
py-spy record -o profile.svg -- python your_script.py
```
**Advantage**: No code changes needed, minimal overhead
### Key Metrics to Monitor
1. **Processing Time per Document**
- Target: <10 seconds for single-page invoice
- Current: ~2-5 seconds (estimated)
2. **Memory Usage**
- Target: <2GB for batch of 100 documents
- Monitor: Peak memory usage
3. **Database Query Time**
- Target: <100ms per query (with indexes)
- Monitor: Slow query log
4. **OCR Accuracy vs Speed Trade-off**
- Current: PaddleOCR with GPU (~200ms per region)
- Alternative: Tesseract (~500ms, slightly more accurate)
### Logging Performance Metrics
**Add to pipeline.py**:
```python
import time
import logging
logger = logging.getLogger(__name__)
def process_pdf(self, pdf_path):
    start = time.time()

    # Processing...
    result = self._process_internal(pdf_path)

    elapsed = time.time() - start
    logger.info(f"Processed {pdf_path} in {elapsed:.2f}s")

    # Log to database for analysis
    self.db.log_performance({
        'document_id': result.document_id,
        'processing_time': elapsed,
        'field_count': len(result.fields)
    })
    return result
```
---
## Performance Optimization Priorities
### High Priority (Implement First)
1. **Database parameterized queries** - Already done (Phase 1)
2. **Database connection pooling** - Not implemented
3. **Index optimization** - Needs verification
### Medium Priority
4. **Batch PDF rendering** - Optimization possible
5. **Parser instance reuse** - Already done (Phase 2)
6. **Model caching** - Could improve
### Low Priority (Nice to Have)
7. **OCR result caching** - Complex implementation
8. **Generator patterns** - Refactoring needed
9. **Advanced profiling** - For production optimization
---
## Benchmarking Script
```python
"""
Benchmark script for invoice processing performance.
"""
import time
from pathlib import Path
from src.inference.pipeline import InferencePipeline
def benchmark_single_document(pdf_path, iterations=10):
    """Benchmark single document processing."""
    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    times = []
    for i in range(iterations):
        start = time.time()
        result = pipeline.process_pdf(pdf_path)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Iteration {i+1}: {elapsed:.2f}s")

    avg_time = sum(times) / len(times)
    print(f"\nAverage: {avg_time:.2f}s")
    print(f"Min: {min(times):.2f}s")
    print(f"Max: {max(times):.2f}s")


def benchmark_batch(pdf_paths, batch_size=10):
    """Benchmark batch processing."""
    from multiprocessing import Pool

    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    start = time.time()
    with Pool(processes=batch_size) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)
    elapsed = time.time() - start

    avg_per_doc = elapsed / len(pdf_paths)
    print(f"Total time: {elapsed:.2f}s")
    print(f"Documents: {len(pdf_paths)}")
    print(f"Average per document: {avg_per_doc:.2f}s")
    print(f"Throughput: {len(pdf_paths)/elapsed:.2f} docs/sec")


if __name__ == "__main__":
    # Single document benchmark
    benchmark_single_document("test.pdf")

    # Batch benchmark
    pdf_paths = list(Path("data/test_pdfs").glob("*.pdf"))
    benchmark_batch(pdf_paths[:100])
```
---
## Summary
**Implemented (Phase 1-2)**:
- Parameterized queries (SQL injection fix)
- Parser instance reuse (Phase 2 refactoring)
- Batch insert operations (execute_values)
- Dual pool processing (CPU/GPU separation)
**Quick Wins (Low effort, high impact)**:
- Database connection pooling (2-4 hours)
- Index verification and optimization (1-2 hours)
- Batch PDF rendering (4-6 hours)
**Long-term Improvements**:
- OCR result caching with hashing
- Generator patterns for streaming
- Advanced profiling and monitoring
**Expected Impact**:
- Connection pooling: 80-95% reduction in DB overhead
- Indexes: 10-100x faster queries
- Batch rendering: 50-90% less redundant work
- **Overall**: 2-5x throughput improvement for batch processing