invoice-master-poc-v2/docs/PERFORMANCE_OPTIMIZATION.md
2026-01-25 15:21:11 +01:00


# Performance Optimization Guide
This document provides performance optimization recommendations for the Invoice Field Extraction system.
## Table of Contents
1. [Batch Processing Optimization](#batch-processing-optimization)
2. [Database Query Optimization](#database-query-optimization)
3. [Caching Strategies](#caching-strategies)
4. [Memory Management](#memory-management)
5. [Profiling and Monitoring](#profiling-and-monitoring)
---
## Batch Processing Optimization
### Current State
The system processes invoices one at a time. For large batches, this can be inefficient.
### Recommendations
#### 1. Database Batch Operations
**Current**: Individual inserts for each document
```python
# Inefficient
for doc in documents:
    db.insert_document(doc)  # Individual DB call
```
**Optimized**: Use `execute_values` for batch inserts
```python
# Efficient - already implemented in db.py line 519
from psycopg2.extras import execute_values

execute_values(cursor, """
    INSERT INTO documents (...)
    VALUES %s
""", document_values)
```
**Impact**: 10-50x faster for batches of 100+ documents
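For very large batches it can help to bound how many rows are handed to each call. The following is a minimal sketch, assuming a hypothetical `chunked` helper that is not part of `db.py`; `insert_sql` in the usage comment is likewise a placeholder name.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical usage with the batch insert shown above:
# for chunk in chunked(document_values, 500):
#     execute_values(cursor, insert_sql, chunk, page_size=len(chunk))
```

`execute_values` also accepts a `page_size` argument (default 100) that controls how many rows go into each generated statement, so chunking on the Python side mainly limits memory held at once.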
#### 2. PDF Processing Batching
**Recommendation**: Process PDFs in parallel using multiprocessing
```python
from multiprocessing import Pool

def process_batch(pdf_paths, batch_size=10):
    """Process PDFs in parallel; batch_size is the number of worker processes."""
    # Note: pool.map pickles its callable, so `pipeline` must be picklable
    with Pool(processes=batch_size) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)
    return results
```
**Considerations**:
- GPU models should use a shared process pool (already exists: `src/processing/gpu_pool.py`)
- CPU-intensive tasks can use separate process pool (`src/processing/cpu_pool.py`)
- Current dual pool coordinator (`dual_pool_coordinator.py`) already supports this pattern
**Status**: ✅ Already implemented in `src/processing/` modules
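One common way to avoid pickling a pipeline that holds a loaded model is to build it once inside each worker. The sketch below illustrates that idea only; it is not how the `src/processing/` modules are implemented, and `init_worker`, `process_one`, and `pipeline_factory` are hypothetical names.

```python
from concurrent.futures import ProcessPoolExecutor

worker_pipeline = None  # One pipeline per worker process, set by init_worker


def init_worker(pipeline_factory):
    """Runs once in each worker: build the pipeline there instead of pickling it."""
    global worker_pipeline
    worker_pipeline = pipeline_factory()


def process_one(pdf_path):
    """Module-level function so it can be pickled and sent to workers."""
    return worker_pipeline.process_pdf(pdf_path)


def process_batch(pdf_paths, pipeline_factory, num_workers=4):
    """Process PDFs in parallel; results come back in input order."""
    with ProcessPoolExecutor(
        max_workers=num_workers,
        initializer=init_worker,
        initargs=(pipeline_factory,),
    ) as executor:
        return list(executor.map(process_one, pdf_paths))
```

`pipeline_factory` has to be picklable itself (for example a module-level function or class), and on platforms that spawn workers the call to `process_batch` should sit under an `if __name__ == "__main__":` guard.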
#### 3. Image Caching for Multi-Page PDFs
**Current**: Each page rendered independently
```python
# Current pattern in field_extractor.py
for page_num in range(total_pages):
    image = render_pdf_page(pdf_path, page_num, dpi=300)
```
**Optimized**: Pre-render all pages if processing multiple fields per page
```python
# Batch render
images = {
    page_num: render_pdf_page(pdf_path, page_num, dpi=300)
    for page_num in page_numbers_needed
}
# Reuse images
for detection in detections:
    image = images[detection.page_no]
    extract_field(detection, image)
```
**Impact**: Reduces redundant PDF rendering by 50-90% for multi-field invoices
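A lazy variant of the same idea renders a page only when it is first requested and drops all images when the document is done. This is a sketch under assumptions: `PageImageCache` is a hypothetical class, and `render_page` stands for any callable with the same signature as `render_pdf_page(pdf_path, page_num, dpi=...)`.

```python
class PageImageCache:
    """Render each page of one PDF at most once and reuse the image."""

    def __init__(self, pdf_path, render_page, dpi=300):
        self._pdf_path = pdf_path
        self._render_page = render_page  # e.g. render_pdf_page
        self._dpi = dpi
        self._images = {}

    def get(self, page_num):
        """Return the cached image for a page, rendering it on first access."""
        if page_num not in self._images:
            self._images[page_num] = self._render_page(
                self._pdf_path, page_num, dpi=self._dpi
            )
        return self._images[page_num]

    def clear(self):
        """Drop all cached images once the document is finished."""
        self._images.clear()
```

Compared with pre-rendering, this avoids rendering pages that have no detections, at the cost of holding every touched page in memory until `clear()` is called.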
---
## Database Query Optimization
### Current Performance
- **Parameterized queries**: ✅ Implemented (Phase 1)
- **Connection pooling**: ❌ Not implemented
- **Query batching**: ✅ Partially implemented
- **Index optimization**: ⚠️ Needs verification
### Recommendations
#### 1. Connection Pooling
**Current**: New connection for each operation
```python
def connect(self):
    """Create new database connection."""
    return psycopg2.connect(**self.config)
```
**Optimized**: Use connection pooling
```python
from psycopg2 import pool

class DocumentDatabase:
    def __init__(self, config):
        self.pool = pool.SimpleConnectionPool(
            minconn=1,
            maxconn=10,
            **config
        )

    def connect(self):
        return self.pool.getconn()

    def close(self, conn):
        self.pool.putconn(conn)
```
**Impact**:
- Reduces connection overhead by 80-95%
- Especially important for high-frequency operations
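A pooled connection that is never returned is effectively leaked, so callers usually wrap the borrow and return in a context manager. The sketch below assumes only that the pool exposes `getconn()` and `putconn()`, as psycopg2 pools do; `pooled_connection` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def pooled_connection(conn_pool):
    """Borrow a connection, commit on success, roll back on error, always return it."""
    conn = conn_pool.getconn()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        # Runs on both paths, so the connection always goes back to the pool
        conn_pool.putconn(conn)


# Hypothetical usage:
# with pooled_connection(db.pool) as conn:
#     with conn.cursor() as cursor:
#         cursor.execute("SELECT 1")
```

If connections are shared across threads, `psycopg2.pool.ThreadedConnectionPool` is the thread-safe counterpart of `SimpleConnectionPool`.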
#### 2. Index Recommendations
**Check current indexes**:
```sql
-- Verify indexes exist on frequently queried columns
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public';
```
**Recommended indexes**:
```sql
-- If not already present
CREATE INDEX IF NOT EXISTS idx_documents_success
ON documents(success);
CREATE INDEX IF NOT EXISTS idx_documents_timestamp
ON documents(timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_field_results_document_id
ON field_results(document_id);
CREATE INDEX IF NOT EXISTS idx_field_results_matched
ON field_results(matched);
CREATE INDEX IF NOT EXISTS idx_field_results_field_name
ON field_results(field_name);
```
**Impact**:
- 10-100x faster queries for filtered/sorted results
- Critical for `get_failed_matches()` and `get_all_documents_summary()`
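The verification step can be automated by comparing the index names returned from `pg_indexes` with the recommended set. A minimal sketch, assuming a hypothetical `missing_indexes` helper; the names in `RECOMMENDED_INDEXES` are copied from the statements above.

```python
RECOMMENDED_INDEXES = {
    "idx_documents_success",
    "idx_documents_timestamp",
    "idx_field_results_document_id",
    "idx_field_results_matched",
    "idx_field_results_field_name",
}


def missing_indexes(existing_index_names, recommended=RECOMMENDED_INDEXES):
    """Return the recommended index names that are not present, sorted."""
    return sorted(set(recommended) - set(existing_index_names))


# Hypothetical usage with a psycopg2 cursor:
# cursor.execute("SELECT indexname FROM pg_indexes WHERE schemaname = 'public'")
# print(missing_indexes(row[0] for row in cursor.fetchall()))
```

This checks names only; an index with a different name on the same column would still be reported as missing.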
#### 3. Query Batching
**Status**: ✅ Already implemented for field results (line 519)
**Verify batching is used**:
```python
# Good pattern in db.py
execute_values(cursor, "INSERT INTO field_results (...) VALUES %s", field_values)
```
**Additional opportunity**: Batch `SELECT` queries
```python
# Current
docs = [get_document(doc_id) for doc_id in doc_ids] # N queries
# Optimized
docs = get_documents_batch(doc_ids) # 1 query with IN clause
```
**Status**: ✅ Already implemented (`get_documents_batch` exists in db.py)
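For illustration, a batched lookup can be written with `= ANY(%s)`, since psycopg2 adapts a Python list to a PostgreSQL array. This sketch is not the `get_documents_batch` from `db.py`: the function name `fetch_documents_batch` and the selected columns (`document_id`, `success`) are assumptions.

```python
def fetch_documents_batch(cursor, doc_ids):
    """Fetch many documents in one round trip, returned in the order requested."""
    doc_ids = list(doc_ids)
    if not doc_ids:
        return []  # Skip the query entirely for an empty request
    cursor.execute(
        # Column names are assumed for this sketch
        "SELECT document_id, success FROM documents WHERE document_id = ANY(%s)",
        (doc_ids,),
    )
    rows_by_id = {row[0]: row for row in cursor.fetchall()}
    # Preserve caller order and silently skip ids that were not found
    return [rows_by_id[doc_id] for doc_id in doc_ids if doc_id in rows_by_id]
```

Restoring the caller's order on the Python side avoids relying on the database to return rows in any particular sequence.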
---
## Caching Strategies
### 1. Model Loading Cache
**Current**: Models loaded per-instance
**Recommendation**: Singleton pattern for YOLO model
```python
class YOLODetectorSingleton:
    _instance = None

    @classmethod
    def get_instance(cls, model_path):
        # The first call loads the model; later calls reuse it and ignore model_path
        if cls._instance is None:
            cls._instance = YOLODetector(model_path)
        return cls._instance
```
**Impact**: Reduces memory usage by 90% when processing multiple documents
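If more than one model file may be in use, or detectors are requested from several threads, a cache keyed by path is a small extension of the same pattern. A minimal sketch, assuming a hypothetical `get_shared_model` helper; `loader` stands for any callable that builds a model from a path, such as `YOLODetector`.

```python
import threading

_models = {}  # model_path -> loaded model
_models_lock = threading.Lock()


def get_shared_model(model_path, loader):
    """Return one shared model per path, loading it on the first request only."""
    with _models_lock:
        if model_path not in _models:
            _models[model_path] = loader(model_path)
        return _models[model_path]
```

The cache is per process, so each worker in a process pool still loads its own copy.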
### 2. Parser Instance Caching
**Current**: ✅ Already optimal
```python
# Good pattern in field_extractor.py
def __init__(self):
    self.payment_line_parser = PaymentLineParser()  # Reused
    self.customer_number_parser = CustomerNumberParser()  # Reused
```
**Status**: No changes needed
### 3. OCR Result Caching
**Recommendation**: Cache OCR results for identical regions
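```python
import hashlib

_ocr_cache = {}  # (image hash, bbox) -> OCR result

def ocr_region_cached(image, bbox):
    """Cache OCR results by image hash + bbox."""
    key = (hashlib.md5(image.tobytes()).hexdigest(), tuple(bbox))
    if key not in _ocr_cache:
        _ocr_cache[key] = paddle_ocr.ocr_region(image, bbox)
    return _ocr_cache[key]
```
**Impact**: 50-80% speedup when re-processing similar documents
**Note**: The cache key hashes the image bytes (`hashlib.md5(image.tobytes())`), so only byte-identical regions hit the cache. `functools.lru_cache` cannot be applied to the image argument directly because image arrays are not hashable. The dictionary above is unbounded and would need an eviction policy for long-running processes.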
---
## Memory Management
### Current Issues
**Potential memory leaks**:
1. Large images kept in memory after processing
2. OCR results accumulated without cleanup
3. Model outputs not explicitly cleared
### Recommendations
#### 1. Explicit Image Cleanup
```python
import gc

def process_pdf(pdf_path):
    image = None  # Defined up front so cleanup is safe if rendering fails
    try:
        image = render_pdf(pdf_path)
        result = extract_fields(image)
        return result
    finally:
        del image  # Explicit cleanup
        gc.collect()  # Force garbage collection
```
#### 2. Generator Pattern for Large Batches
**Current**: Load all documents into memory
```python
docs = [process_pdf(path) for path in pdf_paths] # All in memory
```
**Optimized**: Use generator for streaming processing
```python
def process_batch_streaming(pdf_paths):
    """Process documents one at a time, yielding results."""
    for path in pdf_paths:
        result = process_pdf(path)
        yield result
        # Result can be saved to DB immediately
        # Previous result is garbage collected
```
**Impact**: Constant memory usage regardless of batch size
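Streaming and batch inserts can be combined by flushing results to the database in small groups as they arrive. A minimal sketch, assuming a hypothetical `save_streaming` helper; `save_batch` stands for any callable that persists a list of results, for example a wrapper around `execute_values`.

```python
def save_streaming(results, save_batch, batch_size=50):
    """Persist results from an iterable in small batches; return the count saved."""
    pending = []
    saved = 0
    for result in results:
        pending.append(result)
        if len(pending) >= batch_size:
            save_batch(pending)
            saved += len(pending)
            pending = []  # New list, so a saved batch is never mutated afterwards
    if pending:
        # Flush the final partial batch
        save_batch(pending)
        saved += len(pending)
    return saved


# Hypothetical usage:
# save_streaming(process_batch_streaming(pdf_paths), db.insert_documents)
```

Peak memory is then bounded by `batch_size` results rather than by the total number of documents.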
#### 3. Context Managers for Resources
```python
class InferencePipeline:
    def __enter__(self):
        self.detector.load_model()
        return self

    def __exit__(self, *args):
        self.detector.unload_model()
        self.extractor.cleanup()

# Usage
with InferencePipeline(...) as pipeline:
    results = pipeline.process_pdf(path)
# Automatic cleanup
```
---
## Profiling and Monitoring
### Recommended Profiling Tools
#### 1. cProfile for CPU Profiling
```python
import cProfile
import pstats
profiler = cProfile.Profile()
profiler.enable()
# Your code here
pipeline.process_pdf(pdf_path)
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions by cumulative time
```
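To profile individual calls without scattering enable/disable pairs through the code, the same steps can be wrapped in a helper that returns the report as text. A minimal sketch using only the standard library; `profile_call` is a hypothetical name.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=20, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    # Direct the report into a string instead of stdout
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Hypothetical usage:
# result, report = profile_call(pipeline.process_pdf, pdf_path)
# logger.debug(report)
```

Returning the report as a string makes it easy to send to the application log rather than the console.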
#### 2. memory_profiler for Memory Analysis
```bash
pip install memory_profiler
python -m memory_profiler your_script.py
```
Or decorator-based:
```python
from memory_profiler import profile

@profile
def process_large_batch(pdf_paths):
    # Memory usage tracked line-by-line
    results = [process_pdf(path) for path in pdf_paths]
    return results
```
#### 3. py-spy for Production Profiling
```bash
pip install py-spy
# Profile running process
py-spy top --pid 12345
# Generate flamegraph
py-spy record -o profile.svg -- python your_script.py
```
**Advantage**: No code changes needed, minimal overhead
### Key Metrics to Monitor
1. **Processing Time per Document**
   - Target: <10 seconds for single-page invoice
   - Current: ~2-5 seconds (estimated)
2. **Memory Usage**
   - Target: <2GB for batch of 100 documents
   - Monitor: Peak memory usage
3. **Database Query Time**
   - Target: <100ms per query (with indexes)
   - Monitor: Slow query log
4. **OCR Accuracy vs Speed Trade-off**
   - Current: PaddleOCR with GPU (~200ms per region)
   - Alternative: Tesseract (~500ms, slightly more accurate)
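Averages alone can hide slow outliers, so it may be useful to track a high percentile against the per-document target as well. A minimal sketch, assuming a hypothetical `summarize_timings` helper and the 10-second target listed above; the percentile uses the nearest-rank method.

```python
def summarize_timings(seconds, target=10.0):
    """Return mean, nearest-rank p95, max, and the share of runs under target."""
    if not seconds:
        raise ValueError("no timings recorded")
    ordered = sorted(seconds)
    count = len(ordered)
    # Nearest-rank 95th percentile: ceil(0.95 * count) - 1, in integer arithmetic
    p95_index = (95 * count + 99) // 100 - 1
    return {
        "mean": sum(ordered) / count,
        "p95": ordered[p95_index],
        "max": ordered[-1],
        "within_target": sum(t < target for t in ordered) / count,
    }
```

The input can be any list of per-document processing times in seconds, such as the `elapsed` values a pipeline records.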
### Logging Performance Metrics
**Add to pipeline.py**:
```python
import time
import logging

logger = logging.getLogger(__name__)

def process_pdf(self, pdf_path):
    start = time.perf_counter()  # Monotonic clock, suited to measuring durations
    # Processing...
    result = self._process_internal(pdf_path)
    elapsed = time.perf_counter() - start
    logger.info(f"Processed {pdf_path} in {elapsed:.2f}s")
    # Log to database for analysis
    self.db.log_performance({
        'document_id': result.document_id,
        'processing_time': elapsed,
        'field_count': len(result.fields)
    })
    return result
```
---
## Performance Optimization Priorities
### High Priority (Implement First)
1. **Database parameterized queries** - Already done (Phase 1)
2. **Database connection pooling** - Not implemented
3. **Index optimization** - Needs verification
### Medium Priority
4. **Batch PDF rendering** - Optimization possible
5. **Parser instance reuse** - Already done (Phase 2)
6. **Model caching** - Could improve
### Low Priority (Nice to Have)
7. **OCR result caching** - Complex implementation
8. **Generator patterns** - Refactoring needed
9. **Advanced profiling** - For production optimization
---
## Benchmarking Script
```python
"""
Benchmark script for invoice processing performance.
"""
import time
from multiprocessing import Pool
from pathlib import Path

from src.inference.pipeline import InferencePipeline


def benchmark_single_document(pdf_path, iterations=10):
    """Benchmark single document processing."""
    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )
    times = []
    for i in range(iterations):
        start = time.perf_counter()
        result = pipeline.process_pdf(pdf_path)
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        print(f"Iteration {i+1}: {elapsed:.2f}s")
    avg_time = sum(times) / len(times)
    print(f"\nAverage: {avg_time:.2f}s")
    print(f"Min: {min(times):.2f}s")
    print(f"Max: {max(times):.2f}s")


def benchmark_batch(pdf_paths, batch_size=10):
    """Benchmark batch processing with batch_size worker processes."""
    if not pdf_paths:
        print("No PDFs found; skipping batch benchmark")
        return
    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )
    start = time.perf_counter()
    # pool.map pickles pipeline.process_pdf, so the pipeline must be picklable
    with Pool(processes=batch_size) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)
    elapsed = time.perf_counter() - start
    avg_per_doc = elapsed / len(pdf_paths)
    print(f"Total time: {elapsed:.2f}s")
    print(f"Documents: {len(pdf_paths)}")
    print(f"Average per document: {avg_per_doc:.2f}s")
    print(f"Throughput: {len(pdf_paths)/elapsed:.2f} docs/sec")


if __name__ == "__main__":
    # Single document benchmark
    benchmark_single_document("test.pdf")
    # Batch benchmark
    pdf_paths = list(Path("data/test_pdfs").glob("*.pdf"))
    benchmark_batch(pdf_paths[:100])
```
---
## Summary
**Implemented (Phase 1-2)**:
- Parameterized queries (SQL injection fix)
- Parser instance reuse (Phase 2 refactoring)
- Batch insert operations (execute_values)
- Dual pool processing (CPU/GPU separation)
**Quick Wins (Low effort, high impact)**:
- Database connection pooling (2-4 hours)
- Index verification and optimization (1-2 hours)
- Batch PDF rendering (4-6 hours)
**Long-term Improvements**:
- OCR result caching with hashing
- Generator patterns for streaming
- Advanced profiling and monitoring
**Expected Impact**:
- Connection pooling: 80-95% reduction in DB overhead
- Indexes: 10-100x faster queries
- Batch rendering: 50-90% less redundant work
- **Overall**: 2-5x throughput improvement for batch processing