# Performance Optimization Guide

This document provides performance optimization recommendations for the Invoice Field Extraction system.

## Table of Contents

1. [Batch Processing Optimization](#batch-processing-optimization)
2. [Database Query Optimization](#database-query-optimization)
3. [Caching Strategies](#caching-strategies)
4. [Memory Management](#memory-management)
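5. [Profiling and Monitoring](#profiling-and-monitoring)
6. [Performance Optimization Priorities](#performance-optimization-priorities)
7. [Benchmarking Script](#benchmarking-script)
8. [Summary](#summary)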

---

## Batch Processing Optimization

### Current State

The system processes invoices one at a time. For large batches, the per-document overhead (individual database calls, repeated page rendering) adds up.

### Recommendations

#### 1. Database Batch Operations

**Current**: Individual inserts for each document
```python
# Inefficient
for doc in documents:
    db.insert_document(doc)  # Individual DB call
```

**Optimized**: Use `execute_values` for batch inserts
```python
# Efficient - already implemented in db.py line 519
from psycopg2.extras import execute_values

execute_values(cursor, """
    INSERT INTO documents (...)
    VALUES %s
""", document_values)
```

**Impact**: 10-50x faster for batches of 100+ documents
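
`execute_values` expects its argument list to be a sequence of row tuples whose order matches the column list in the `INSERT`. The following is a minimal sketch of building `document_values` from dictionaries. The `to_rows` helper and the column names are hypothetical and are not taken from `db.py`.

```python
def to_rows(documents, columns):
    """Convert document dicts into row tuples in a fixed column order.

    Missing keys become None, which psycopg2 sends as SQL NULL.
    """
    return [tuple(doc.get(col) for col in columns) for doc in documents]


# Hypothetical column names, for illustration only.
COLUMNS = ("document_id", "success", "processing_time")

documents = [
    {"document_id": "inv-001", "success": True, "processing_time": 2.4},
    {"document_id": "inv-002", "success": False},
]
document_values = to_rows(documents, COLUMNS)
```

`execute_values` also accepts a `page_size` argument (default 100) that controls how many rows go into each generated statement, so very large batches may benefit from raising it.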

#### 2. PDF Processing Batching

**Recommendation**: Process PDFs in parallel using multiprocessing

```python
from multiprocessing import Pool

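def process_batch(pdf_paths, num_workers=10):
    """Process PDFs in parallel across worker processes."""
    # `pipeline` is assumed to be an existing InferencePipeline; it must be
    # picklable, because pool.map sends pipeline.process_pdf to each worker
    with Pool(processes=num_workers) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)
    return results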
```

**Considerations**:
- GPU models should use a shared process pool (already exists: `src/processing/gpu_pool.py`)
- CPU-intensive tasks can use a separate process pool (`src/processing/cpu_pool.py`)
- The existing dual pool coordinator (`dual_pool_coordinator.py`) already supports this pattern

**Status**: ✅ Already implemented in `src/processing/` modules
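
One limitation of `pool.map` is that a single failing PDF raises and discards the results of the whole batch. The sketch below shows one way to isolate failures per document using the standard library's `concurrent.futures`. It is independent of the existing `src/processing/` modules, and `process_batch_safe`, `worker`, and `executor_cls` are illustrative names, not part of the codebase.

```python
import concurrent.futures as cf


def process_batch_safe(pdf_paths, worker, max_workers=4,
                       executor_cls=cf.ProcessPoolExecutor):
    """Run worker(path) for each path and return (results, errors).

    Both return values are dicts keyed by path. With a process pool,
    worker must be picklable (for example a module-level function).
    executor_cls is injectable so the same code can run on threads.
    """
    results, errors = {}, {}
    with executor_cls(max_workers=max_workers) as executor:
        futures = {executor.submit(worker, path): path for path in pdf_paths}
        for future in cf.as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # Record the failure and keep going
                errors[path] = exc
    return results, errors
```

Failed paths can then be retried or logged separately without re-running the documents that succeeded.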

#### 3. Image Caching for Multi-Page PDFs

**Current**: Each page rendered independently
```python
# Current pattern in field_extractor.py
for page_num in range(total_pages):
    image = render_pdf_page(pdf_path, page_num, dpi=300)
```

**Optimized**: Pre-render all pages if processing multiple fields per page
```python
# Batch render
images = {
    page_num: render_pdf_page(pdf_path, page_num, dpi=300)
    for page_num in page_numbers_needed
}

# Reuse images
for detection in detections:
    image = images[detection.page_no]
    extract_field(detection, image)
```

**Impact**: Reduces redundant PDF rendering by 50-90% for multi-field invoices
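
As a variation on pre-rendering, pages can be rendered lazily on first use and then reused. The sketch below assumes a render function with the same call shape as `render_pdf_page(pdf_path, page_num, dpi=...)`. The `PageImageCache` class and the fake renderer are hypothetical and exist only to show the call count.

```python
class PageImageCache:
    """Render each PDF page at most once and reuse the image."""

    def __init__(self, render_page, dpi=300):
        self._render_page = render_page  # e.g. render_pdf_page
        self._dpi = dpi
        self._images = {}  # (pdf_path, page_num) -> rendered image

    def get(self, pdf_path, page_num):
        key = (pdf_path, page_num)
        if key not in self._images:
            self._images[key] = self._render_page(pdf_path, page_num, dpi=self._dpi)
        return self._images[key]

    def clear(self):
        """Drop cached images, e.g. once a document is finished."""
        self._images.clear()


# Stand-in renderer that records which pages were actually rendered.
calls = []

def fake_render(pdf_path, page_num, dpi):
    calls.append(page_num)
    return f"{pdf_path}#{page_num}@{dpi}"

cache = PageImageCache(fake_render)
for page_no in [0, 0, 1, 0, 1]:  # five detections spread over two pages
    cache.get("invoice.pdf", page_no)
```

In this example five lookups trigger only two renders. Calling `clear()` after each document keeps the cache from holding large images longer than needed.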

---

## Database Query Optimization

### Current Performance

- **Parameterized queries**: ✅ Implemented (Phase 1)
- **Connection pooling**: ❌ Not implemented
- **Query batching**: ✅ Partially implemented
- **Index optimization**: ⚠️ Needs verification

### Recommendations

#### 1. Connection Pooling

**Current**: New connection for each operation
```python
def connect(self):
    """Create new database connection."""
    return psycopg2.connect(**self.config)
```

**Optimized**: Use connection pooling
```python
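from psycopg2 import pool

class DocumentDatabase:
    def __init__(self, config):
        # SimpleConnectionPool is for single-threaded use; switch to
        # pool.ThreadedConnectionPool if connections are shared across threads
        self.pool = pool.SimpleConnectionPool(
            minconn=1,
            maxconn=10,
            **config
        )

    def connect(self):
        """Borrow a connection from the pool."""
        return self.pool.getconn()

    def close(self, conn):
        """Return the connection to the pool instead of closing it."""
        self.pool.putconn(conn)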
```

**Impact**:
- Reduces connection overhead by 80-95%
- Especially important for high-frequency operations
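
With pooling, a connection that is never handed back with `putconn` is effectively leaked, and the pool eventually runs out. A small context manager can make the borrow/return pairing hard to get wrong. This is a sketch that works with any object exposing `getconn()` and `putconn()`. The name `pooled_connection` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def pooled_connection(pool):
    """Borrow a connection from the pool and always return it.

    Commits on success and rolls back on error, so a failed operation
    does not leave an open transaction on a connection that is reused.
    """
    conn = pool.getconn()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        pool.putconn(conn)


# Usage (assuming `db` is a DocumentDatabase with a `pool` attribute):
# with pooled_connection(db.pool) as conn:
#     with conn.cursor() as cursor:
#         cursor.execute("SELECT 1")
```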

#### 2. Index Recommendations

**Check current indexes**:
```sql
-- Verify indexes exist on frequently queried columns
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public';
```

**Recommended indexes**:
```sql
-- If not already present
CREATE INDEX IF NOT EXISTS idx_documents_success
    ON documents(success);

CREATE INDEX IF NOT EXISTS idx_documents_timestamp
    ON documents(timestamp DESC);

CREATE INDEX IF NOT EXISTS idx_field_results_document_id
    ON field_results(document_id);

CREATE INDEX IF NOT EXISTS idx_field_results_matched
    ON field_results(matched);

CREATE INDEX IF NOT EXISTS idx_field_results_field_name
    ON field_results(field_name);
```

**Impact**:
- 10-100x faster queries for filtered/sorted results
- Critical for `get_failed_matches()` and `get_all_documents_summary()`
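
To turn the verification query into an automated check, the rows it returns can be compared against the recommended index names. A minimal sketch, assuming the rows arrive as `(tablename, indexname, indexdef)` tuples. The `missing_indexes` helper is hypothetical.

```python
RECOMMENDED_INDEXES = {
    "idx_documents_success",
    "idx_documents_timestamp",
    "idx_field_results_document_id",
    "idx_field_results_matched",
    "idx_field_results_field_name",
}


def missing_indexes(pg_index_rows):
    """Return the recommended index names that are absent from pg_indexes rows.

    pg_index_rows: iterable of (tablename, indexname, indexdef) tuples,
    in the column order of the verification query.
    """
    existing = {indexname for _table, indexname, _definition in pg_index_rows}
    return sorted(RECOMMENDED_INDEXES - existing)
```

An empty result means every recommended index already exists. Otherwise, the returned names identify which `CREATE INDEX` statements still need to run.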

#### 3. Query Batching

**Status**: ✅ Already implemented for field results (`db.py` line 519)

**Verify batching is used**:
```python
# Good pattern in db.py
execute_values(cursor, "INSERT INTO field_results (...) VALUES %s", field_values)
```

**Additional opportunity**: Batch `SELECT` queries
```python
# Avoid: one query per document
docs = [get_document(doc_id) for doc_id in doc_ids]  # N queries

# Batched
docs = get_documents_batch(doc_ids)  # 1 query with IN clause
```

**Status**: ✅ Already implemented (`get_documents_batch` exists in db.py)
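
For very large ID lists, it can help to split the lookup into bounded chunks. The sketch below builds the query and parameters without touching the database. It uses `= ANY(%s)` rather than an `IN` clause, relying on psycopg2 adapting a Python list to a PostgreSQL array. The table and column names, and both helper functions, are assumptions for illustration and do not describe how `get_documents_batch` is implemented.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_queries(doc_ids, chunk_size=1000):
    """Return (sql, params) pairs, one per chunk of document IDs.

    The SQL text is identical for every chunk; only the bound list changes,
    so no string formatting of IDs into the query is needed.
    """
    # Hypothetical table and column names.
    sql = "SELECT * FROM documents WHERE document_id = ANY(%s)"
    return [(sql, (chunk,)) for chunk in chunked(doc_ids, chunk_size)]
```

Each pair can be passed directly to `cursor.execute(sql, params)`, which keeps the queries parameterized.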

---

## Caching Strategies

### 1. Model Loading Cache

**Current**: Models loaded per-instance

**Recommendation**: Singleton pattern for YOLO model
```python
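class YOLODetectorSingleton:
    _instances = {}  # model_path -> YOLODetector

    @classmethod
    def get_instance(cls, model_path):
        # Load each model file once; later calls reuse the loaded detector
        if model_path not in cls._instances:
            cls._instances[model_path] = YOLODetector(model_path)
        return cls._instances[model_path]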
```

**Impact**: Loads the model once per process instead of once per detector instance; the estimated saving is up to 90% of model memory when processing multiple documents

### 2. Parser Instance Caching

**Current**: ✅ Already optimal
```python
# Good pattern in field_extractor.py
def __init__(self):
    self.payment_line_parser = PaymentLineParser()  # Reused
    self.customer_number_parser = CustomerNumberParser()  # Reused
```

**Status**: No changes needed

### 3. OCR Result Caching

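**Recommendation**: Cache OCR results for identical regions
```python
import hashlib

_ocr_cache = {}  # (image_hash, bbox) -> OCR result

def ocr_region_cached(image, bbox):
    """Cache OCR results by image hash + bbox."""
    # Hash the pixel data so byte-identical images share a cache entry
    key = (hashlib.md5(image.tobytes()).hexdigest(), tuple(bbox))
    if key not in _ocr_cache:
        _ocr_cache[key] = paddle_ocr.ocr_region(image, bbox)
    return _ocr_cache[key]
```

**Impact**: 50-80% speedup when re-processing documents whose page images are byte-identical (an exact hash does not match images that are merely similar)

**Note**: `image.tobytes()` assumes a NumPy array or PIL image. The cache shown here is unbounded, so cap its size or clear it between batches in long-running processes.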

---

## Memory Management

### Current Issues

**Potential memory leaks**:
1. Large images kept in memory after processing
2. OCR results accumulated without cleanup
3. Model outputs not explicitly cleared

### Recommendations

#### 1. Explicit Image Cleanup

```python
import gc

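def process_pdf(pdf_path):
    image = None  # Defined up front so the finally block is safe if rendering fails
    try:
        image = render_pdf(pdf_path)
        result = extract_fields(image)
        return result
    finally:
        del image  # Drop the local reference to the rendered image
        gc.collect()  # Optional: reclaims reference cycles promptly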
```

#### 2. Generator Pattern for Large Batches

**Current**: Load all documents into memory
```python
docs = [process_pdf(path) for path in pdf_paths]  # All in memory
```

**Optimized**: Use generator for streaming processing
```python
def process_batch_streaming(pdf_paths):
    """Process documents one at a time, yielding results."""
    for path in pdf_paths:
        result = process_pdf(path)
        yield result
        # Result can be saved to DB immediately
        # Previous result is garbage collected
```

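**Impact**: Memory usage stays roughly constant regardless of batch size, provided the caller saves or discards each result instead of collecting them in a list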
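
Streaming combines naturally with batch inserts: results can be consumed from the generator and written in fixed-size chunks, so only one chunk is held in memory at a time. A minimal sketch follows. `stream_results`, `save_in_chunks`, and the `save_chunk` callback are hypothetical names, and `str.upper` stands in for real PDF processing.

```python
def stream_results(paths, process_one):
    """Yield one result at a time instead of building a list."""
    for path in paths:
        yield process_one(path)


def save_in_chunks(results, save_chunk, chunk_size=50):
    """Consume a result stream, saving every chunk_size results.

    Only the current chunk is held in memory. Returns the total saved.
    """
    chunk, total = [], 0
    for result in results:
        chunk.append(result)
        if len(chunk) >= chunk_size:
            save_chunk(chunk)
            total += len(chunk)
            chunk = []
    if chunk:  # Flush the final partial chunk
        save_chunk(chunk)
        total += len(chunk)
    return total


saved_chunks = []
total = save_in_chunks(
    stream_results(["a.pdf", "b.pdf", "c.pdf"], str.upper),
    saved_chunks.append,  # Stand-in for a batch insert
    chunk_size=2,
)
```

In a real pipeline, `save_chunk` would be the function that performs the batch insert for one chunk of results.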

#### 3. Context Managers for Resources

```python
class InferencePipeline:
    def __enter__(self):
        self.detector.load_model()
        return self

    def __exit__(self, *args):
        self.detector.unload_model()
        self.extractor.cleanup()

# Usage
with InferencePipeline(...) as pipeline:
    results = pipeline.process_pdf(path)
# Automatic cleanup
```

---

## Profiling and Monitoring

### Recommended Profiling Tools

#### 1. cProfile for CPU Profiling

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

# Your code here
pipeline.process_pdf(pdf_path)

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 slowest functions
```

#### 2. memory_profiler for Memory Analysis

```bash
pip install memory_profiler
python -m memory_profiler your_script.py
```

Or decorator-based:
```python
from memory_profiler import profile

@profile
def process_large_batch(pdf_paths):
    # Memory usage tracked line-by-line
    results = [process_pdf(path) for path in pdf_paths]
    return results
```

#### 3. py-spy for Production Profiling

```bash
pip install py-spy

# Profile running process
py-spy top --pid 12345

# Generate flamegraph
py-spy record -o profile.svg -- python your_script.py
```

**Advantage**: No code changes needed, minimal overhead

### Key Metrics to Monitor

1. **Processing Time per Document**
   - Target: <10 seconds for single-page invoice
   - Current: ~2-5 seconds (estimated)

2. **Memory Usage**
   - Target: <2GB for batch of 100 documents
   - Monitor: Peak memory usage

3. **Database Query Time**
   - Target: <100ms per query (with indexes)
   - Monitor: Slow query log

4. **OCR Accuracy vs Speed Trade-off**
   - Current: PaddleOCR with GPU (~200ms per region)
   - Alternative: Tesseract (~500ms, slightly more accurate)
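
Averages can hide slow outliers, so it is worth tracking a high percentile alongside the mean when comparing against the per-document target. A minimal sketch using the standard library. The `summarize` helper is hypothetical, and the sample timings are made up for illustration.

```python
import statistics


def summarize(times):
    """Summarize per-document timings in seconds (times must be non-empty)."""
    ordered = sorted(times)
    n = len(ordered)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), in integer arithmetic
    p95_rank = (95 * n + 99) // 100
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_rank - 1],
        "max": ordered[-1],
    }


# Made-up sample: three typical documents and one slow outlier
summary = summarize([2.0, 3.0, 2.5, 9.0])
within_target = summary["p95"] < 10  # Target from the list above
```

Here the median is 2.75 s while the 95th percentile is 9.0 s, which shows how a single slow document can sit close to the target without moving the median.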

### Logging Performance Metrics

**Add to pipeline.py**:
```python
import time
import logging

logger = logging.getLogger(__name__)

def process_pdf(self, pdf_path):
    start = time.perf_counter()  # Monotonic clock, unaffected by system time changes

    # Processing...
    result = self._process_internal(pdf_path)

    elapsed = time.perf_counter() - start
    logger.info("Processed %s in %.2fs", pdf_path, elapsed)

    # Log to database for analysis
    self.db.log_performance({
        'document_id': result.document_id,
        'processing_time': elapsed,
        'field_count': len(result.fields)
    })

    return result
```
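
Total time per document does not show which stage is slow. One option is a small timing context manager that accumulates durations per stage. This is a sketch: the `timed` helper and the stage names are hypothetical, and `time.sleep` stands in for real work.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, timings):
    """Add the wall-clock duration of the enclosed block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Accumulate, so a stage that runs several times is summed
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start)


timings = {}
with timed("render", timings):
    time.sleep(0.01)  # Stand-in for page rendering
with timed("ocr", timings):
    time.sleep(0.01)  # Stand-in for OCR
```

The resulting dictionary can be logged next to the total processing time, which makes it easier to see whether rendering, detection, or OCR dominates.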

---

## Performance Optimization Priorities

### High Priority (Implement First)

1. ✅ **Database parameterized queries** - Already done (Phase 1)
2. ⚠️ **Database connection pooling** - Not implemented
3. ⚠️ **Index optimization** - Needs verification

### Medium Priority

4. ⚠️ **Batch PDF rendering** - Optimization possible
5. ✅ **Parser instance reuse** - Already done (Phase 2)
6. ⚠️ **Model caching** - Could improve

### Low Priority (Nice to Have)

7. ⚠️ **OCR result caching** - Complex implementation
8. ⚠️ **Generator patterns** - Refactoring needed
9. ⚠️ **Advanced profiling** - For production optimization

---

## Benchmarking Script

```python
"""
Benchmark script for invoice processing performance.
"""

import time
from pathlib import Path
from src.inference.pipeline import InferencePipeline

def benchmark_single_document(pdf_path, iterations=10):
    """Benchmark single document processing."""
    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    times = []
    for i in range(iterations):
        # The first iteration may include one-time warm-up cost
        start = time.perf_counter()
        pipeline.process_pdf(pdf_path)
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        print(f"Iteration {i+1}: {elapsed:.2f}s")

    avg_time = sum(times) / len(times)
    print(f"\nAverage: {avg_time:.2f}s")
    print(f"Min: {min(times):.2f}s")
    print(f"Max: {max(times):.2f}s")

def benchmark_batch(pdf_paths, num_workers=10):
    """Benchmark batch processing."""
    from multiprocessing import Pool

    if not pdf_paths:
        print("No PDFs to benchmark")
        return

    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    start = time.perf_counter()

    # pool.map pickles pipeline.process_pdf for the workers, so the
    # pipeline (including its loaded model) must be picklable
    with Pool(processes=num_workers) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)

    elapsed = time.perf_counter() - start
    avg_per_doc = elapsed / len(pdf_paths)

    print(f"Total time: {elapsed:.2f}s")
    print(f"Documents: {len(pdf_paths)}")
    print(f"Average per document: {avg_per_doc:.2f}s")
    print(f"Throughput: {len(pdf_paths)/elapsed:.2f} docs/sec")

if __name__ == "__main__":
    # Single document benchmark
    benchmark_single_document("test.pdf")

    # Batch benchmark
    pdf_paths = list(Path("data/test_pdfs").glob("*.pdf"))
    benchmark_batch(pdf_paths[:100])
```

---

## Summary

**Implemented (Phase 1-2)**:
- ✅ Parameterized queries (SQL injection fix)
- ✅ Parser instance reuse (Phase 2 refactoring)
- ✅ Batch insert operations (execute_values)
- ✅ Dual pool processing (CPU/GPU separation)

**Quick Wins (Low effort, high impact)**:
- Database connection pooling (2-4 hours)
- Index verification and optimization (1-2 hours)
- Batch PDF rendering (4-6 hours)

**Long-term Improvements**:
- OCR result caching with hashing
- Generator patterns for streaming
- Advanced profiling and monitoring

**Expected Impact**:
- Connection pooling: 80-95% reduction in DB overhead
- Indexes: 10-100x faster queries
- Batch rendering: 50-90% less redundant work
- **Overall**: 2-5x throughput improvement for batch processing