Performance Optimization Guide

This document provides performance optimization recommendations for the Invoice Field Extraction system.

Table of Contents

  1. Batch Processing Optimization
  2. Database Query Optimization
  3. Caching Strategies
  4. Memory Management
  5. Profiling and Monitoring

Batch Processing Optimization

Current State

The system processes invoices one at a time. For large batches, this can be inefficient.

Recommendations

1. Database Batch Operations

Current: Individual inserts for each document

# Inefficient
for doc in documents:
    db.insert_document(doc)  # Individual DB call

Optimized: Use execute_values for batch inserts

# Efficient - already implemented in db.py line 519
from psycopg2.extras import execute_values

execute_values(cursor, """
    INSERT INTO documents (...)
    VALUES %s
""", document_values)

Impact: 10-50x faster for batches of 100+ documents
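
execute_values also accepts a page_size parameter (rows per generated statement, default 100); raising it can reduce round trips further for very large batches. A hedged one-liner:

# Optional tuning: send up to 1000 rows per statement
execute_values(cursor, "INSERT INTO documents (...) VALUES %s",
               document_values, page_size=1000)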

2. PDF Processing Batching

Recommendation: Process PDFs in parallel using multiprocessing

from multiprocessing import Pool

def process_batch(pdf_paths, num_workers=10):
    """Process PDFs in parallel worker processes."""
    # Caveat: pool.map pickles its target into each worker; heavy GPU
    # pipelines may not survive pickling (see considerations below).
    with Pool(processes=num_workers) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)
    return results

Considerations:

  • GPU models should use a shared process pool (already exists: src/processing/gpu_pool.py)
  • CPU-intensive tasks can use separate process pool (src/processing/cpu_pool.py)
  • Current dual pool coordinator (dual_pool_coordinator.py) already supports this pattern

Status: Already implemented in src/processing/ modules

3. Image Caching for Multi-Page PDFs

Current: Each page rendered independently

# Current pattern in field_extractor.py
for page_num in range(total_pages):
    image = render_pdf_page(pdf_path, page_num, dpi=300)

Optimized: Pre-render all pages if processing multiple fields per page

# Batch render
images = {
    page_num: render_pdf_page(pdf_path, page_num, dpi=300)
    for page_num in page_numbers_needed
}

# Reuse images
for detection in detections:
    image = images[detection.page_no]
    extract_field(detection, image)
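
For completeness, page_numbers_needed can be collected from the detections up front (assuming each detection exposes page_no, as in the loop above):

page_numbers_needed = {detection.page_no for detection in detections}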

Impact: Reduces redundant PDF rendering by 50-90% for multi-field invoices


Database Query Optimization

Current Performance

  • Parameterized queries: Implemented (Phase 1)
  • Connection pooling: Not implemented
  • Query batching: Partially implemented
  • Index optimization: ⚠️ Needs verification

Recommendations

1. Connection Pooling

Current: New connection for each operation

def connect(self):
    """Create new database connection."""
    return psycopg2.connect(**self.config)

Optimized: Use connection pooling

from psycopg2 import pool

class DocumentDatabase:
    def __init__(self, config):
        self.pool = pool.SimpleConnectionPool(
            minconn=1,
            maxconn=10,
            **config
        )

    def connect(self):
        return self.pool.getconn()

    def close(self, conn):
        self.pool.putconn(conn)

Impact:

  • Reduces connection overhead by 80-95%
  • Especially important for high-frequency operations
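
To guarantee a borrowed connection is always returned, even when an operation raises, the getconn/putconn pair can be wrapped in a context manager (a minimal sketch extending the class above):

from contextlib import contextmanager

class DocumentDatabase:
    # ... __init__ as above ...

    @contextmanager
    def connection(self):
        """Borrow a pooled connection and always return it."""
        conn = self.pool.getconn()
        try:
            yield conn
        finally:
            self.pool.putconn(conn)

# Usage
with db.connection() as conn:
    cursor = conn.cursor()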

2. Index Recommendations

Check current indexes:

-- Verify indexes exist on frequently queried columns
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public';

Recommended indexes:

-- If not already present
CREATE INDEX IF NOT EXISTS idx_documents_success
    ON documents(success);

CREATE INDEX IF NOT EXISTS idx_documents_timestamp
    ON documents(timestamp DESC);

CREATE INDEX IF NOT EXISTS idx_field_results_document_id
    ON field_results(document_id);

CREATE INDEX IF NOT EXISTS idx_field_results_matched
    ON field_results(matched);

CREATE INDEX IF NOT EXISTS idx_field_results_field_name
    ON field_results(field_name);

Impact:

  • 10-100x faster queries for filtered/sorted results
  • Critical for get_failed_matches() and get_all_documents_summary()
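
To confirm the planner actually uses a new index, run the hot queries under EXPLAIN ANALYZE and look for an Index Scan node (a quick sketch via the existing cursor; on small tables the planner may still prefer a Seq Scan):

cursor.execute(
    "EXPLAIN ANALYZE SELECT * FROM field_results WHERE matched = FALSE"
)
for (line,) in cursor.fetchall():
    print(line)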

3. Query Batching

Status: Already implemented for field results (line 519)

Verify batching is used:

# Good pattern in db.py
execute_values(cursor, "INSERT INTO field_results (...) VALUES %s", field_values)

Additional opportunity: Batch SELECT queries

# Current
docs = [get_document(doc_id) for doc_id in doc_ids]  # N queries

# Optimized
docs = get_documents_batch(doc_ids)  # 1 query with IN clause

Status: Already implemented (get_documents_batch exists in db.py)
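
For reference, such a batched SELECT can be expressed with PostgreSQL's ANY over a list parameter (a sketch; the actual get_documents_batch in db.py may differ):

def get_documents_batch(cursor, doc_ids):
    """Fetch many documents in a single round trip."""
    cursor.execute(
        "SELECT * FROM documents WHERE id = ANY(%s)",
        (list(doc_ids),),  # psycopg2 adapts the list to a PostgreSQL array
    )
    return cursor.fetchall()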


Caching Strategies

1. Model Loading Cache

Current: Models loaded per-instance

Recommendation: Singleton pattern for YOLO model

class YOLODetectorSingleton:
    _instance = None

    @classmethod
    def get_instance(cls, model_path):
        """Return the shared detector, loading the model at most once."""
        # Note: model_path is only honored on the first call; later
        # calls return the already-loaded instance.
        if cls._instance is None:
            cls._instance = YOLODetector(model_path)
        return cls._instance

Impact: Reduces memory usage by 90% when processing multiple documents
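
An equivalent, arguably simpler variant uses functools.lru_cache, which also supports multiple model paths (one cached detector per path):

from functools import lru_cache

@lru_cache(maxsize=None)
def get_detector(model_path):
    """Load each YOLO model at most once per process."""
    return YOLODetector(model_path)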

2. Parser Instance Caching

Current: Already optimal

# Good pattern in field_extractor.py
def __init__(self):
    self.payment_line_parser = PaymentLineParser()  # Reused
    self.customer_number_parser = CustomerNumberParser()  # Reused

Status: No changes needed

3. OCR Result Caching

Recommendation: Cache OCR results for identical regions

import hashlib

_ocr_cache = {}  # (image_hash, bbox) -> OCR result

def ocr_region_cached(image, bbox):
    """Cache OCR results keyed by image hash + bbox."""
    key = (hashlib.md5(image.tobytes()).hexdigest(), tuple(bbox))
    if key not in _ocr_cache:
        _ocr_cache[key] = paddle_ocr.ocr_region(image, bbox)
    return _ocr_cache[key]

Impact: 50-80% speedup when re-processing similar documents

Note: the cache key relies on hashing the rendered image (hashlib.md5(image.tobytes()) above); in long-running processes, cap the cache size or evict old entries, since the dict grows without bound.


Memory Management

Current Issues

Potential memory leaks:

  1. Large images kept in memory after processing
  2. OCR results accumulated without cleanup
  3. Model outputs not explicitly cleared

Recommendations

1. Explicit Image Cleanup

import gc

def process_pdf(pdf_path):
    image = None  # Pre-initialize so cleanup is safe if rendering fails
    try:
        image = render_pdf(pdf_path)
        result = extract_fields(image)
        return result
    finally:
        del image  # Explicit cleanup
        gc.collect()  # Force garbage collection
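
If GPU inference is involved (the YOLO and OCR models above), releasing the framework's cached GPU memory can also help; a hedged sketch assuming a PyTorch backend:

import torch

def release_gpu_memory():
    """Hand cached CUDA blocks back to the driver (no-op without a GPU)."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()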

2. Generator Pattern for Large Batches

Current: Load all documents into memory

docs = [process_pdf(path) for path in pdf_paths]  # All in memory

Optimized: Use generator for streaming processing

def process_batch_streaming(pdf_paths):
    """Process documents one at a time, yielding results."""
    for path in pdf_paths:
        result = process_pdf(path)
        yield result
        # Result can be saved to DB immediately
        # Previous result is garbage collected

Impact: Constant memory usage regardless of batch size
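
Usage then becomes a simple loop that persists each result as it arrives (db.insert_document as in the batching section above):

for result in process_batch_streaming(pdf_paths):
    db.insert_document(result)  # Persisted immediately; nothing accumulates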

3. Context Managers for Resources

class InferencePipeline:
    def __enter__(self):
        self.detector.load_model()
        return self

    def __exit__(self, *args):
        self.detector.unload_model()
        self.extractor.cleanup()

# Usage
with InferencePipeline(...) as pipeline:
    results = pipeline.process_pdf(path)
# Automatic cleanup

Profiling and Monitoring

1. cProfile for CPU Profiling

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

# Your code here
pipeline.process_pdf(pdf_path)

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 slowest functions

2. memory_profiler for Memory Analysis

pip install memory_profiler
python -m memory_profiler your_script.py

Or decorator-based:

from memory_profiler import profile

@profile
def process_large_batch(pdf_paths):
    # Memory usage tracked line-by-line
    results = [process_pdf(path) for path in pdf_paths]
    return results

3. py-spy for Production Profiling

pip install py-spy

# Profile running process
py-spy top --pid 12345

# Generate flamegraph
py-spy record -o profile.svg -- python your_script.py

Advantage: No code changes needed, minimal overhead

Key Metrics to Monitor

  1. Processing Time per Document

    • Target: <10 seconds for single-page invoice
    • Current: ~2-5 seconds (estimated)
  2. Memory Usage

    • Target: <2GB for batch of 100 documents
    • Monitor: Peak memory usage (see the sketch after this list)
  3. Database Query Time

    • Target: <100ms per query (with indexes)
    • Monitor: Slow query log
  4. OCR Accuracy vs Speed Trade-off

    • Current: PaddleOCR with GPU (~200ms per region)
    • Alternative: Tesseract (~500ms, slightly more accurate)
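
For the memory target above, peak RSS can be read from the standard library with no extra dependencies (Unix only; a minimal sketch):

import resource
import sys

def peak_memory_mb():
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        peak /= 1024  # macOS reports bytes; Linux reports kilobytes
    return peak / 1024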

Logging Performance Metrics

Add to pipeline.py:

import time
import logging

logger = logging.getLogger(__name__)

def process_pdf(self, pdf_path):
    start = time.time()

    # Processing...
    result = self._process_internal(pdf_path)

    elapsed = time.time() - start
    logger.info(f"Processed {pdf_path} in {elapsed:.2f}s")

    # Log to database for analysis
    self.db.log_performance({
        'document_id': result.document_id,
        'processing_time': elapsed,
        'field_count': len(result.fields)
    })

    return result

Performance Optimization Priorities

High Priority (Implement First)

  1. Database parameterized queries - Already done (Phase 1)
  2. ⚠️ Database connection pooling - Not implemented
  3. ⚠️ Index optimization - Needs verification

Medium Priority

  1. ⚠️ Batch PDF rendering - Optimization possible
  2. Parser instance reuse - Already done (Phase 2)
  3. ⚠️ Model caching - Could improve

Low Priority (Nice to Have)

  1. ⚠️ OCR result caching - Complex implementation
  2. ⚠️ Generator patterns - Refactoring needed
  3. ⚠️ Advanced profiling - For production optimization

Benchmarking Script

"""
Benchmark script for invoice processing performance.
"""

import time
from pathlib import Path
from src.inference.pipeline import InferencePipeline

def benchmark_single_document(pdf_path, iterations=10):
    """Benchmark single document processing."""
    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    times = []
    for i in range(iterations):
        start = time.time()
        result = pipeline.process_pdf(pdf_path)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Iteration {i+1}: {elapsed:.2f}s")

    avg_time = sum(times) / len(times)
    print(f"\nAverage: {avg_time:.2f}s")
    print(f"Min: {min(times):.2f}s")
    print(f"Max: {max(times):.2f}s")

def benchmark_batch(pdf_paths, num_workers=10):
    """Benchmark batch processing."""
    from multiprocessing import Pool

    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    start = time.time()

    with Pool(processes=num_workers) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)

    elapsed = time.time() - start
    avg_per_doc = elapsed / len(pdf_paths)

    print(f"Total time: {elapsed:.2f}s")
    print(f"Documents: {len(pdf_paths)}")
    print(f"Average per document: {avg_per_doc:.2f}s")
    print(f"Throughput: {len(pdf_paths)/elapsed:.2f} docs/sec")

if __name__ == "__main__":
    # Single document benchmark
    benchmark_single_document("test.pdf")

    # Batch benchmark
    pdf_paths = list(Path("data/test_pdfs").glob("*.pdf"))
    benchmark_batch(pdf_paths[:100])

Summary

Implemented (Phase 1-2):

  • Parameterized queries (SQL injection fix)
  • Parser instance reuse (Phase 2 refactoring)
  • Batch insert operations (execute_values)
  • Dual pool processing (CPU/GPU separation)

Quick Wins (Low effort, high impact):

  • Database connection pooling (2-4 hours)
  • Index verification and optimization (1-2 hours)
  • Batch PDF rendering (4-6 hours)

Long-term Improvements:

  • OCR result caching with hashing
  • Generator patterns for streaming
  • Advanced profiling and monitoring

Expected Impact:

  • Connection pooling: 80-95% reduction in DB overhead
  • Indexes: 10-100x faster queries
  • Batch rendering: 50-90% less redundant work
  • Overall: 2-5x throughput improvement for batch processing