# Performance Optimization Guide

This document provides performance optimization recommendations for the Invoice Field Extraction system.

## Table of Contents

1. [Batch Processing Optimization](#batch-processing-optimization)
2. [Database Query Optimization](#database-query-optimization)
3. [Caching Strategies](#caching-strategies)
4. [Memory Management](#memory-management)
5. [Profiling and Monitoring](#profiling-and-monitoring)

---

## Batch Processing Optimization

### Current State

The system processes invoices one at a time. For large batches, this can be inefficient.

### Recommendations

#### 1. Database Batch Operations

**Current**: Individual inserts for each document

```python
# Inefficient
for doc in documents:
    db.insert_document(doc)  # Individual DB call
```

**Optimized**: Use `execute_values` for batch inserts

```python
# Efficient - already implemented in db.py line 519
from psycopg2.extras import execute_values

execute_values(cursor, """
    INSERT INTO documents (...) VALUES %s
""", document_values)
```

**Impact**: 10-50x faster for batches of 100+ documents

#### 2. PDF Processing Batching

**Recommendation**: Process PDFs in parallel using multiprocessing

```python
from multiprocessing import Pool

def process_batch(pdf_paths, batch_size=10):
    """Process PDFs in parallel batches."""
    with Pool(processes=batch_size) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)
    return results
```

**Considerations**:

- GPU models should use a shared process pool (already exists: `src/processing/gpu_pool.py`)
- CPU-intensive tasks can use a separate process pool (`src/processing/cpu_pool.py`)
- The current dual-pool coordinator (`dual_pool_coordinator.py`) already supports this pattern

**Status**: ✅ Already implemented in `src/processing/` modules

#### 3. Image Caching for Multi-Page PDFs

**Current**: Each page rendered independently

```python
# Current pattern in field_extractor.py
for page_num in range(total_pages):
    image = render_pdf_page(pdf_path, page_num, dpi=300)
```

**Optimized**: Pre-render all pages if processing multiple fields per page

```python
# Batch render
images = {
    page_num: render_pdf_page(pdf_path, page_num, dpi=300)
    for page_num in page_numbers_needed
}

# Reuse images
for detection in detections:
    image = images[detection.page_no]
    extract_field(detection, image)
```

**Impact**: Reduces redundant PDF rendering by 50-90% for multi-field invoices

---

## Database Query Optimization

### Current Performance

- **Parameterized queries**: ✅ Implemented (Phase 1)
- **Connection pooling**: ❌ Not implemented
- **Query batching**: ✅ Partially implemented
- **Index optimization**: ⚠️ Needs verification

### Recommendations

#### 1. Connection Pooling

**Current**: New connection for each operation

```python
def connect(self):
    """Create new database connection."""
    return psycopg2.connect(**self.config)
```

**Optimized**: Use connection pooling

```python
from psycopg2 import pool

class DocumentDatabase:
    def __init__(self, config):
        self.pool = pool.SimpleConnectionPool(
            minconn=1,
            maxconn=10,
            **config
        )

    def connect(self):
        return self.pool.getconn()

    def close(self, conn):
        self.pool.putconn(conn)
```

**Impact**:

- Reduces connection overhead by 80-95%
- Especially important for high-frequency operations
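One caveat with the `connect()`/`close()` pair above: if an exception occurs between borrowing and returning a connection, it leaks from the pool. Below is a minimal sketch of a safer borrowing pattern; the `pooled_connection` helper and the pool configuration are illustrative, not existing code.

```python
from contextlib import contextmanager

from psycopg2 import pool

# Placeholder pool configuration for illustration
db_pool = pool.SimpleConnectionPool(minconn=1, maxconn=10, dbname="invoices")

@contextmanager
def pooled_connection(db_pool):
    """Borrow a connection and guarantee it is returned to the pool."""
    conn = db_pool.getconn()
    try:
        yield conn
    finally:
        db_pool.putconn(conn)

# Usage: the connection goes back to the pool even if the query raises
with pooled_connection(db_pool) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM documents")
```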
#### 2. Index Recommendations

**Check current indexes**:

```sql
-- Verify indexes exist on frequently queried columns
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public';
```

**Recommended indexes**:

```sql
-- If not already present
CREATE INDEX IF NOT EXISTS idx_documents_success ON documents(success);
CREATE INDEX IF NOT EXISTS idx_documents_timestamp ON documents(timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_field_results_document_id ON field_results(document_id);
CREATE INDEX IF NOT EXISTS idx_field_results_matched ON field_results(matched);
CREATE INDEX IF NOT EXISTS idx_field_results_field_name ON field_results(field_name);
```

**Impact**:

- 10-100x faster queries for filtered/sorted results
- Critical for `get_failed_matches()` and `get_all_documents_summary()`

#### 3. Query Batching

**Status**: ✅ Already implemented for field results (line 519)

**Verify batching is used**:

```python
# Good pattern in db.py
execute_values(cursor,
    "INSERT INTO field_results (...) VALUES %s",
    field_values)
```

**Additional opportunity**: Batch `SELECT` queries

```python
# Current
docs = [get_document(doc_id) for doc_id in doc_ids]  # N queries

# Optimized
docs = get_documents_batch(doc_ids)  # 1 query with IN clause
```

**Status**: ✅ Already implemented (`get_documents_batch` exists in db.py)

---

## Caching Strategies

### 1. Model Loading Cache

**Current**: Models loaded per-instance

**Recommendation**: Singleton pattern for the YOLO model

```python
class YOLODetectorSingleton:
    _instance = None
    _model = None

    @classmethod
    def get_instance(cls, model_path):
        if cls._instance is None:
            cls._instance = YOLODetector(model_path)
        return cls._instance
```

**Impact**: Reduces memory usage by 90% when processing multiple documents

### 2. Parser Instance Caching

**Current**: ✅ Already optimal

```python
# Good pattern in field_extractor.py
def __init__(self):
    self.payment_line_parser = PaymentLineParser()        # Reused
    self.customer_number_parser = CustomerNumberParser()  # Reused
```

**Status**: No changes needed

### 3. OCR Result Caching

**Recommendation**: Cache OCR results for identical regions

```python
from functools import lru_cache

rendered_pages = {}  # image_hash -> image, maintained by the caller

@lru_cache(maxsize=1000)
def ocr_region_cached(image_hash, bbox):
    """Cache OCR results by image hash + bbox (bbox must be a hashable tuple)."""
    image = rendered_pages[image_hash]  # look up the image by its hash
    return paddle_ocr.ocr_region(image, bbox)
```

**Impact**: 50-80% speedup when re-processing similar documents

**Note**: Requires implementing image hashing (e.g., `hashlib.md5(image.tobytes()).hexdigest()`) plus a hash-to-image lookup like `rendered_pages` above, since `lru_cache` can only key on hashable arguments and cannot take the image itself.

---

## Memory Management

### Current Issues

**Potential memory leaks**:

1. Large images kept in memory after processing
2. OCR results accumulated without cleanup
3. Model outputs not explicitly cleared

### Recommendations

#### 1. Explicit Image Cleanup

```python
import gc

def process_pdf(pdf_path):
    image = None  # bind before try so the finally block is safe if rendering fails
    try:
        image = render_pdf(pdf_path)
        result = extract_fields(image)
        return result
    finally:
        del image     # Explicit cleanup
        gc.collect()  # Force garbage collection
```

#### 2. Generator Pattern for Large Batches

**Current**: Load all documents into memory

```python
docs = [process_pdf(path) for path in pdf_paths]  # All in memory
```

**Optimized**: Use a generator for streaming processing

```python
def process_batch_streaming(pdf_paths):
    """Process documents one at a time, yielding results."""
    for path in pdf_paths:
        result = process_pdf(path)
        yield result
        # Result can be saved to DB immediately
        # Previous result is garbage collected
```

**Impact**: Constant memory usage regardless of batch size
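A minimal consumer sketch for the streaming pattern above, assuming the `process_batch_streaming` generator and a `db.insert_document()` method as in the earlier examples: each result is persisted as soon as it is produced, so nothing but a counter accumulates in memory.

```python
def run_streaming_batch(pdf_paths, db):
    """Persist each result as it is produced; memory usage stays flat."""
    processed = 0
    for result in process_batch_streaming(pdf_paths):
        db.insert_document(result)  # save immediately, then drop the reference
        processed += 1
    return processed
```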
#### 3. Context Managers for Resources

```python
class InferencePipeline:
    def __enter__(self):
        self.detector.load_model()
        return self

    def __exit__(self, *args):
        self.detector.unload_model()
        self.extractor.cleanup()

# Usage
with InferencePipeline(...) as pipeline:
    results = pipeline.process_pdf(path)
# Automatic cleanup
```

---

## Profiling and Monitoring

### Recommended Profiling Tools

#### 1. cProfile for CPU Profiling

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

# Your code here
pipeline.process_pdf(pdf_path)

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 slowest functions
```

#### 2. memory_profiler for Memory Analysis

```bash
pip install memory_profiler
python -m memory_profiler your_script.py
```

Or decorator-based:

```python
from memory_profiler import profile

@profile
def process_large_batch(pdf_paths):
    # Memory usage tracked line-by-line
    results = [process_pdf(path) for path in pdf_paths]
    return results
```

#### 3. py-spy for Production Profiling

```bash
pip install py-spy

# Profile running process
py-spy top --pid 12345

# Generate flamegraph
py-spy record -o profile.svg -- python your_script.py
```

**Advantage**: No code changes needed, minimal overhead

### Key Metrics to Monitor

1. **Processing Time per Document**
   - Target: <10 seconds for single-page invoice
   - Current: ~2-5 seconds (estimated)

2. **Memory Usage**
   - Target: <2GB for batch of 100 documents
   - Monitor: Peak memory usage

3. **Database Query Time**
   - Target: <100ms per query (with indexes)
   - Monitor: Slow query log

4. **OCR Accuracy vs Speed Trade-off**
   - Current: PaddleOCR with GPU (~200ms per region)
   - Alternative: Tesseract (~500ms, slightly more accurate)

### Logging Performance Metrics

**Add to pipeline.py**:

```python
import time
import logging

logger = logging.getLogger(__name__)

def process_pdf(self, pdf_path):
    start = time.time()

    # Processing...
    result = self._process_internal(pdf_path)

    elapsed = time.time() - start
    logger.info(f"Processed {pdf_path} in {elapsed:.2f}s")

    # Log to database for analysis
    self.db.log_performance({
        'document_id': result.document_id,
        'processing_time': elapsed,
        'field_count': len(result.fields)
    })

    return result
```
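If several pipeline methods need the same timing-and-logging boilerplate, a small decorator can factor it out. This is a sketch using only the standard library, not existing code in the repository:

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)

def timed(func):
    """Log the wall-clock duration of the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            logger.info("%s took %.2fs", func.__qualname__, elapsed)
    return wrapper
```

Applying `@timed` to `process_pdf` and similar entry points keeps the timing logic in one place.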
""" import time from pathlib import Path from src.inference.pipeline import InferencePipeline def benchmark_single_document(pdf_path, iterations=10): """Benchmark single document processing.""" pipeline = InferencePipeline( model_path="path/to/model.pt", use_gpu=True ) times = [] for i in range(iterations): start = time.time() result = pipeline.process_pdf(pdf_path) elapsed = time.time() - start times.append(elapsed) print(f"Iteration {i+1}: {elapsed:.2f}s") avg_time = sum(times) / len(times) print(f"\nAverage: {avg_time:.2f}s") print(f"Min: {min(times):.2f}s") print(f"Max: {max(times):.2f}s") def benchmark_batch(pdf_paths, batch_size=10): """Benchmark batch processing.""" from multiprocessing import Pool pipeline = InferencePipeline( model_path="path/to/model.pt", use_gpu=True ) start = time.time() with Pool(processes=batch_size) as pool: results = pool.map(pipeline.process_pdf, pdf_paths) elapsed = time.time() - start avg_per_doc = elapsed / len(pdf_paths) print(f"Total time: {elapsed:.2f}s") print(f"Documents: {len(pdf_paths)}") print(f"Average per document: {avg_per_doc:.2f}s") print(f"Throughput: {len(pdf_paths)/elapsed:.2f} docs/sec") if __name__ == "__main__": # Single document benchmark benchmark_single_document("test.pdf") # Batch benchmark pdf_paths = list(Path("data/test_pdfs").glob("*.pdf")) benchmark_batch(pdf_paths[:100]) ``` --- ## Summary **Implemented (Phase 1-2)**: - ✅ Parameterized queries (SQL injection fix) - ✅ Parser instance reuse (Phase 2 refactoring) - ✅ Batch insert operations (execute_values) - ✅ Dual pool processing (CPU/GPU separation) **Quick Wins (Low effort, high impact)**: - Database connection pooling (2-4 hours) - Index verification and optimization (1-2 hours) - Batch PDF rendering (4-6 hours) **Long-term Improvements**: - OCR result caching with hashing - Generator patterns for streaming - Advanced profiling and monitoring **Expected Impact**: - Connection pooling: 80-95% reduction in DB overhead - Indexes: 10-100x faster queries - Batch rendering: 50-90% less redundant work - **Overall**: 2-5x throughput improvement for batch processing