# Performance Optimization Guide
This document provides performance optimization recommendations for the Invoice Field Extraction system.
## Table of Contents

- Batch Processing Optimization
- Database Query Optimization
- Caching Strategies
- Memory Management
- Profiling and Monitoring
- Performance Optimization Priorities
- Benchmarking Script
- Summary
## Batch Processing Optimization

### Current State

The system processes invoices one at a time, which is inefficient for large batches.

### Recommendations
#### 1. Database Batch Operations

**Current:** Individual inserts for each document:

```python
# Inefficient: one database round trip per document
for doc in documents:
    db.insert_document(doc)  # Individual DB call
```

**Optimized:** Use `execute_values` for batch inserts:

```python
# Efficient - already implemented in db.py line 519
from psycopg2.extras import execute_values

execute_values(cursor, """
    INSERT INTO documents (...)
    VALUES %s
""", document_values)
```

**Impact:** 10-50x faster for batches of 100+ documents
#### 2. PDF Processing Batching

**Recommendation:** Process PDFs in parallel using `multiprocessing`:

```python
from multiprocessing import Pool

def process_batch(pdf_paths, batch_size=10):
    """Process PDFs in parallel batches."""
    with Pool(processes=batch_size) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)
    return results
```
**Considerations:**

- GPU models should use a shared process pool (already exists: `src/processing/gpu_pool.py`)
- CPU-intensive tasks can use a separate process pool (`src/processing/cpu_pool.py`)
- The current dual pool coordinator (`dual_pool_coordinator.py`) already supports this pattern

**Status:** ✅ Already implemented in `src/processing/` modules
#### 3. Image Caching for Multi-Page PDFs

**Current:** Each page rendered independently:

```python
# Current pattern in field_extractor.py
for page_num in range(total_pages):
    image = render_pdf_page(pdf_path, page_num, dpi=300)
```

**Optimized:** Pre-render all pages if processing multiple fields per page:

```python
# Batch render
images = {
    page_num: render_pdf_page(pdf_path, page_num, dpi=300)
    for page_num in page_numbers_needed
}

# Reuse rendered images across detections
for detection in detections:
    image = images[detection.page_no]
    extract_field(detection, image)
```

**Impact:** Reduces redundant PDF rendering by 50-90% for multi-field invoices
## Database Query Optimization

### Current Performance

- Parameterized queries: ✅ Implemented (Phase 1)
- Connection pooling: ❌ Not implemented
- Query batching: ✅ Partially implemented
- Index optimization: ⚠️ Needs verification
### Recommendations

#### 1. Connection Pooling

**Current:** New connection for each operation:

```python
def connect(self):
    """Create new database connection."""
    return psycopg2.connect(**self.config)
```

**Optimized:** Use connection pooling:

```python
from psycopg2 import pool

class DocumentDatabase:
    def __init__(self, config):
        self.pool = pool.SimpleConnectionPool(
            minconn=1,
            maxconn=10,
            **config
        )

    def connect(self):
        return self.pool.getconn()

    def close(self, conn):
        self.pool.putconn(conn)
```
**Impact:**

- Reduces connection overhead by 80-95%
- Especially important for high-frequency operations
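A usage sketch, assuming a `DocumentDatabase` instance named `db` as defined above: connections borrowed from the pool must be returned with `putconn`, not closed:

```python
conn = db.connect()
try:
    with conn.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM documents")
        print(cursor.fetchone()[0])
finally:
    db.close(conn)  # Returns the connection to the pool instead of closing it
```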
#### 2. Index Recommendations

Check current indexes:

```sql
-- Verify indexes exist on frequently queried columns
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public';
```

Recommended indexes:

```sql
-- If not already present
CREATE INDEX IF NOT EXISTS idx_documents_success
    ON documents(success);
CREATE INDEX IF NOT EXISTS idx_documents_timestamp
    ON documents(timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_field_results_document_id
    ON field_results(document_id);
CREATE INDEX IF NOT EXISTS idx_field_results_matched
    ON field_results(matched);
CREATE INDEX IF NOT EXISTS idx_field_results_field_name
    ON field_results(field_name);
```
**Impact:**

- 10-100x faster queries for filtered/sorted results
- Critical for `get_failed_matches()` and `get_all_documents_summary()`
#### 3. Query Batching

**Status:** ✅ Already implemented for field results (`db.py` line 519)

Verify batching is used:

```python
# Good pattern in db.py
execute_values(cursor, "INSERT INTO field_results (...) VALUES %s", field_values)
```

**Additional opportunity:** Batch SELECT queries:

```python
# Current: N queries
docs = [get_document(doc_id) for doc_id in doc_ids]

# Optimized: 1 query with an IN clause
docs = get_documents_batch(doc_ids)
```

**Status:** ✅ Already implemented (`get_documents_batch` exists in `db.py`)
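For illustration only, a minimal shape such a helper might take; the actual `get_documents_batch` in `db.py` may differ in signature and columns:

```python
def get_documents_batch(cursor, doc_ids):
    """Fetch many documents in one round trip instead of N."""
    cursor.execute(
        "SELECT * FROM documents WHERE id = ANY(%s)",  # psycopg2 adapts lists to arrays
        (list(doc_ids),),
    )
    return cursor.fetchall()
```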
## Caching Strategies

### 1. Model Loading Cache

**Current:** Models loaded per-instance.

**Recommendation:** Singleton pattern for the YOLO model:
```python
class YOLODetectorSingleton:
    _instance = None

    @classmethod
    def get_instance(cls, model_path):
        # Note: model_path is only honored on the first call
        if cls._instance is None:
            cls._instance = YOLODetector(model_path)
        return cls._instance
```

**Impact:** Reduces memory usage by 90% when processing multiple documents
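Usage sketch; the `detect()` call is hypothetical, standing in for whatever inference method `YOLODetector` actually exposes:

```python
# First call loads the model; subsequent calls reuse the same instance
detector = YOLODetectorSingleton.get_instance("path/to/model.pt")
detections = detector.detect(image)  # Hypothetical inference call
```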
### 2. Parser Instance Caching

**Current:** ✅ Already optimal

```python
# Good pattern in field_extractor.py
def __init__(self):
    self.payment_line_parser = PaymentLineParser()        # Reused
    self.customer_number_parser = CustomerNumberParser()  # Reused
```

**Status:** No changes needed
### 3. OCR Result Caching

**Recommendation:** Cache OCR results for identical regions:

```python
from functools import lru_cache

_image_registry = {}  # image_hash -> image, populated by the caller

@lru_cache(maxsize=1000)
def ocr_region_cached(image_hash, bbox):
    """Cache OCR results by image hash + bbox (bbox must be a hashable tuple)."""
    image = _image_registry[image_hash]  # Look up the actual image by its hash
    return paddle_ocr.ocr_region(image, bbox)
```

**Impact:** 50-80% speedup when re-processing similar documents

**Note:** Requires implementing image hashing (e.g., `hashlib.md5(image.tobytes())`)
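A minimal caller-side sketch, assuming a NumPy-style image exposing `.tobytes()` and the `_image_registry` / `ocr_region_cached` names from the sketch above:

```python
import hashlib

def ocr_region_with_cache(image, bbox):
    """Hash the image, register it, then delegate to the cached OCR call."""
    image_hash = hashlib.md5(image.tobytes()).hexdigest()
    _image_registry[image_hash] = image  # Would need eviction in production
    return ocr_region_cached(image_hash, tuple(bbox))  # bbox made hashable
```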
## Memory Management

### Current Issues

Potential memory leaks:

- Large images kept in memory after processing
- OCR results accumulated without cleanup
- Model outputs not explicitly cleared
### Recommendations

#### 1. Explicit Image Cleanup

```python
import gc

def process_pdf(pdf_path):
    image = None  # Ensure the name exists even if rendering fails
    try:
        image = render_pdf(pdf_path)
        result = extract_fields(image)
        return result
    finally:
        del image     # Explicit cleanup
        gc.collect()  # Force garbage collection
```
#### 2. Generator Pattern for Large Batches

**Current:** Load all documents into memory:

```python
docs = [process_pdf(path) for path in pdf_paths]  # All in memory
```

**Optimized:** Use a generator for streaming processing:

```python
def process_batch_streaming(pdf_paths):
    """Process documents one at a time, yielding results."""
    for path in pdf_paths:
        result = process_pdf(path)
        yield result
        # Result can be saved to DB immediately;
        # the previous result is then garbage collected
```

**Impact:** Constant memory usage regardless of batch size
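The generator is consumed like any iterator; `db.save_result` here is a hypothetical persistence helper:

```python
for result in process_batch_streaming(pdf_paths):
    db.save_result(result)  # Only one result is held in memory at a time
```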
#### 3. Context Managers for Resources

```python
class InferencePipeline:
    def __enter__(self):
        self.detector.load_model()
        return self

    def __exit__(self, *args):
        self.detector.unload_model()
        self.extractor.cleanup()

# Usage
with InferencePipeline(...) as pipeline:
    results = pipeline.process_pdf(path)
# Automatic cleanup on exit
```
## Profiling and Monitoring

### Recommended Profiling Tools

#### 1. cProfile for CPU Profiling

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

# Your code here
pipeline.process_pdf(pdf_path)

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 slowest functions
```
#### 2. memory_profiler for Memory Analysis

```bash
pip install memory_profiler
python -m memory_profiler your_script.py
```

Or decorator-based:

```python
from memory_profiler import profile

@profile
def process_large_batch(pdf_paths):
    # Memory usage tracked line-by-line
    results = [process_pdf(path) for path in pdf_paths]
    return results
```
#### 3. py-spy for Production Profiling

```bash
pip install py-spy

# Profile running process
py-spy top --pid 12345

# Generate flamegraph
py-spy record -o profile.svg -- python your_script.py
```

**Advantage:** No code changes needed, minimal overhead
### Key Metrics to Monitor

- **Processing Time per Document**
  - Target: <10 seconds for a single-page invoice
  - Current: ~2-5 seconds (estimated)
- **Memory Usage**
  - Target: <2 GB for a batch of 100 documents
  - Monitor: peak memory usage
- **Database Query Time**
  - Target: <100 ms per query (with indexes)
  - Monitor: slow query log
- **OCR Accuracy vs. Speed Trade-off**
  - Current: PaddleOCR with GPU (~200 ms per region)
  - Alternative: Tesseract (~500 ms, slightly more accurate)
### Logging Performance Metrics

Add to pipeline.py:

```python
import time
import logging

logger = logging.getLogger(__name__)

def process_pdf(self, pdf_path):
    start = time.time()

    # Processing...
    result = self._process_internal(pdf_path)

    elapsed = time.time() - start
    logger.info(f"Processed {pdf_path} in {elapsed:.2f}s")

    # Log to database for analysis
    self.db.log_performance({
        'document_id': result.document_id,
        'processing_time': elapsed,
        'field_count': len(result.fields)
    })
    return result
```
## Performance Optimization Priorities

### High Priority (Implement First)

- ✅ Database parameterized queries - Already done (Phase 1)
- ⚠️ Database connection pooling - Not implemented
- ⚠️ Index optimization - Needs verification

### Medium Priority

- ⚠️ Batch PDF rendering - Optimization possible
- ✅ Parser instance reuse - Already done (Phase 2)
- ⚠️ Model caching - Could improve

### Low Priority (Nice to Have)

- ⚠️ OCR result caching - Complex implementation
- ⚠️ Generator patterns - Refactoring needed
- ⚠️ Advanced profiling - For production optimization
## Benchmarking Script

```python
"""
Benchmark script for invoice processing performance.
"""
import time
from pathlib import Path

from src.inference.pipeline import InferencePipeline

def benchmark_single_document(pdf_path, iterations=10):
    """Benchmark single document processing."""
    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    times = []
    for i in range(iterations):
        start = time.time()
        result = pipeline.process_pdf(pdf_path)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Iteration {i+1}: {elapsed:.2f}s")

    avg_time = sum(times) / len(times)
    print(f"\nAverage: {avg_time:.2f}s")
    print(f"Min: {min(times):.2f}s")
    print(f"Max: {max(times):.2f}s")

def benchmark_batch(pdf_paths, batch_size=10):
    """Benchmark batch processing."""
    from multiprocessing import Pool

    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    start = time.time()
    with Pool(processes=batch_size) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)
    elapsed = time.time() - start

    avg_per_doc = elapsed / len(pdf_paths)
    print(f"Total time: {elapsed:.2f}s")
    print(f"Documents: {len(pdf_paths)}")
    print(f"Average per document: {avg_per_doc:.2f}s")
    print(f"Throughput: {len(pdf_paths)/elapsed:.2f} docs/sec")

if __name__ == "__main__":
    # Single document benchmark
    benchmark_single_document("test.pdf")

    # Batch benchmark
    pdf_paths = list(Path("data/test_pdfs").glob("*.pdf"))
    benchmark_batch(pdf_paths[:100])
```
## Summary

**Implemented (Phase 1-2):**

- ✅ Parameterized queries (SQL injection fix)
- ✅ Parser instance reuse (Phase 2 refactoring)
- ✅ Batch insert operations (execute_values)
- ✅ Dual pool processing (CPU/GPU separation)

**Quick Wins (Low effort, high impact):**

- Database connection pooling (2-4 hours)
- Index verification and optimization (1-2 hours)
- Batch PDF rendering (4-6 hours)

**Long-term Improvements:**

- OCR result caching with hashing
- Generator patterns for streaming
- Advanced profiling and monitoring

**Expected Impact:**

- Connection pooling: 80-95% reduction in DB overhead
- Indexes: 10-100x faster queries
- Batch rendering: 50-90% less redundant work
- Overall: 2-5x throughput improvement for batch processing