# Performance Optimization Guide

This document provides performance optimization recommendations for the Invoice Field Extraction system.

## Table of Contents

1. [Batch Processing Optimization](#batch-processing-optimization)
2. [Database Query Optimization](#database-query-optimization)
3. [Caching Strategies](#caching-strategies)
4. [Memory Management](#memory-management)
5. [Profiling and Monitoring](#profiling-and-monitoring)

---

## Batch Processing Optimization

### Current State

The system processes invoices one at a time, so per-document overhead (rendering, database round trips) accumulates quickly across large batches.

### Recommendations

#### 1. Database Batch Operations

**Current**: Individual inserts for each document
```python
# Inefficient: one round trip per document
for doc in documents:
    db.insert_document(doc)  # Individual DB call
```

**Optimized**: Use `execute_values` for batch inserts
```python
# Efficient - already implemented in db.py (line 519)
from psycopg2.extras import execute_values

execute_values(cursor, """
    INSERT INTO documents (...)
    VALUES %s
""", document_values)
```

**Impact**: 10-50x faster for batches of 100+ documents

#### 2. PDF Processing Batching

**Recommendation**: Process PDFs in parallel using multiprocessing

```python
from multiprocessing import Pool

def process_batch(pdf_paths, batch_size=10):
    """Process PDFs in parallel batches."""
    # Assumes `pipeline.process_pdf` is defined at module level and picklable;
    # each worker process pays its own model-loading cost.
    with Pool(processes=batch_size) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)
    return results
```

**Considerations**:
- GPU models should use a shared process pool (already exists: `src/processing/gpu_pool.py`)
- CPU-intensive tasks can use a separate process pool (`src/processing/cpu_pool.py`)
- The current dual-pool coordinator (`dual_pool_coordinator.py`) already supports this pattern (a simplified sketch follows below)

**Status**: ✅ Already implemented in `src/processing/` modules
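
For orientation, here is a minimal sketch of the dual-pool idea, assuming hypothetical stage functions `run_cpu_stage` and `run_gpu_stage` (placeholders, not the actual `src/processing` API): a small pool for GPU-bound inference, a wider pool for CPU-bound rendering.

```python
from concurrent.futures import ProcessPoolExecutor

def run_cpu_stage(pdf_path):
    """CPU-bound work: render and pre-process pages (placeholder body)."""
    return pdf_path

def run_gpu_stage(rendered):
    """GPU-bound work: model inference on rendered pages (placeholder body)."""
    return rendered

def process_dual_pool(pdf_paths):
    # Few GPU workers (the model is the bottleneck), more CPU workers.
    with ProcessPoolExecutor(max_workers=8) as cpu_pool, \
         ProcessPoolExecutor(max_workers=2) as gpu_pool:
        rendered = list(cpu_pool.map(run_cpu_stage, pdf_paths))
        return list(gpu_pool.map(run_gpu_stage, rendered))
```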

#### 3. Image Caching for Multi-Page PDFs

**Current**: Each page rendered independently
```python
# Current pattern in field_extractor.py
for page_num in range(total_pages):
    image = render_pdf_page(pdf_path, page_num, dpi=300)
```

**Optimized**: Pre-render all pages if processing multiple fields per page
```python
# Batch render: one pass over the pages that are actually needed
images = {
    page_num: render_pdf_page(pdf_path, page_num, dpi=300)
    for page_num in page_numbers_needed
}

# Reuse the rendered images across detections
for detection in detections:
    image = images[detection.page_no]
    extract_field(detection, image)
```

**Impact**: Reduces redundant PDF rendering by 50-90% for multi-field invoices

---

## Database Query Optimization

### Current Performance

- **Parameterized queries**: ✅ Implemented (Phase 1)
- **Connection pooling**: ❌ Not implemented
- **Query batching**: ✅ Partially implemented
- **Index optimization**: ⚠️ Needs verification

### Recommendations

#### 1. Connection Pooling

**Current**: New connection for each operation
```python
def connect(self):
    """Create new database connection."""
    return psycopg2.connect(**self.config)
```

**Optimized**: Use connection pooling
```python
from psycopg2 import pool

class DocumentDatabase:
    def __init__(self, config):
        self.pool = pool.SimpleConnectionPool(
            minconn=1,
            maxconn=10,
            **config
        )

    def connect(self):
        return self.pool.getconn()

    def close(self, conn):
        self.pool.putconn(conn)
```

**Impact**:
- Reduces connection overhead by 80-95%
- Especially important for high-frequency operations
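
To keep call sites from leaking pooled connections, wrap borrow/return in a context manager. This is a sketch, not existing db.py code; `pooled_connection` and the query are illustrative:

```python
from contextlib import contextmanager

@contextmanager
def pooled_connection(db):
    """Borrow a connection from db.pool and always return it."""
    conn = db.pool.getconn()
    try:
        yield conn
    finally:
        db.pool.putconn(conn)

# Usage: the connection goes back to the pool even if the query raises.
with pooled_connection(db) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT count(*) FROM documents")
```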

#### 2. Index Recommendations

**Check current indexes**:
```sql
-- Verify indexes exist on frequently queried columns
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public';
```

**Recommended indexes**:
```sql
-- If not already present
CREATE INDEX IF NOT EXISTS idx_documents_success
    ON documents(success);

CREATE INDEX IF NOT EXISTS idx_documents_timestamp
    ON documents(timestamp DESC);

CREATE INDEX IF NOT EXISTS idx_field_results_document_id
    ON field_results(document_id);

CREATE INDEX IF NOT EXISTS idx_field_results_matched
    ON field_results(matched);

CREATE INDEX IF NOT EXISTS idx_field_results_field_name
    ON field_results(field_name);
```

**Impact**:
- 10-100x faster queries for filtered/sorted results
- Critical for `get_failed_matches()` and `get_all_documents_summary()`
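
Once the indexes exist, confirm the planner actually uses them by running the hot queries under `EXPLAIN ANALYZE`. A quick check from Python might look like this (the query is an illustrative stand-in for the filter behind `get_failed_matches()`):

```python
# Assumption: `conn` is an open psycopg2 connection. Look for
# "Index Scan using idx_field_results_matched" rather than "Seq Scan"
# in the printed plan.
with conn.cursor() as cursor:
    cursor.execute(
        "EXPLAIN ANALYZE SELECT * FROM field_results WHERE matched = %s",
        (False,),
    )
    for (line,) in cursor.fetchall():
        print(line)
```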

#### 3. Query Batching

**Status**: ✅ Already implemented for field results (line 519)

**Verify batching is used**:
```python
# Good pattern in db.py
execute_values(cursor, "INSERT INTO field_results (...) VALUES %s", field_values)
```

**Additional opportunity**: Batch `SELECT` queries
```python
# Current
docs = [get_document(doc_id) for doc_id in doc_ids]  # N queries

# Optimized
docs = get_documents_batch(doc_ids)  # 1 query with IN clause
```

**Status**: ✅ Already implemented (`get_documents_batch` exists in db.py)
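
For reference, a batched lookup of this shape can be written with PostgreSQL's `ANY` operator. This is a generic sketch assuming an `id` primary key, not the actual `get_documents_batch` implementation in db.py:

```python
def get_documents_batch_sketch(conn, doc_ids):
    """Fetch many documents in one round trip (illustrative only)."""
    with conn.cursor() as cursor:
        # ANY(%s) with a list parameter behaves like an IN clause
        # and stays fully parameterized.
        cursor.execute(
            "SELECT * FROM documents WHERE id = ANY(%s)",
            (list(doc_ids),),
        )
        return cursor.fetchall()
```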

---

## Caching Strategies

### 1. Model Loading Cache

**Current**: Models loaded per-instance

**Recommendation**: Singleton pattern for YOLO model
```python
class YOLODetectorSingleton:
    _instance = None

    @classmethod
    def get_instance(cls, model_path):
        # model_path is only used on the first call; later calls
        # return the already-loaded detector.
        if cls._instance is None:
            cls._instance = YOLODetector(model_path)
        return cls._instance
```

**Impact**: Avoids loading duplicate model copies; reduces model memory usage by up to 90% when processing multiple documents
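
Usage is then one call per site; repeated calls return the same loaded detector (the model path below is a placeholder):

```python
# The model loads once; later calls reuse the same instance.
detector = YOLODetectorSingleton.get_instance("path/to/model.pt")
same_detector = YOLODetectorSingleton.get_instance("path/to/model.pt")
assert detector is same_detector
```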

### 2. Parser Instance Caching

**Current**: ✅ Already optimal
```python
# Good pattern in field_extractor.py
def __init__(self):
    self.payment_line_parser = PaymentLineParser()  # Reused
    self.customer_number_parser = CustomerNumberParser()  # Reused
```

**Status**: No changes needed

### 3. OCR Result Caching

**Recommendation**: Cache OCR results for identical regions
```python
from functools import lru_cache

_images_by_hash = {}  # image_hash -> image, populated by the caller

@lru_cache(maxsize=1000)
def ocr_region_cached(image_hash, bbox):
    """Cache OCR results by image hash + bbox (bbox must be a hashable tuple)."""
    image = _images_by_hash[image_hash]
    return paddle_ocr.ocr_region(image, bbox)
```

**Impact**: 50-80% speedup when re-processing similar documents

**Note**: Requires implementing image hashing (e.g., `hashlib.md5(image.tobytes())`)
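
Caller-side wiring for that hashing might look like the sketch below; the `ocr_with_cache` helper is an assumption built on the registry from the previous block, and `_images_by_hash` grows without bound here, so in practice entries should be evicted in step with the LRU cache:

```python
import hashlib

def ocr_with_cache(image, bbox):
    """Hash the image, register it, and delegate to the cached OCR call."""
    image_hash = hashlib.md5(image.tobytes()).hexdigest()
    _images_by_hash[image_hash] = image  # Make the image recoverable by hash
    return ocr_region_cached(image_hash, tuple(bbox))  # lru_cache needs hashable args
```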

---

## Memory Management

### Current Issues

**Potential memory leaks**:
1. Large images kept in memory after processing
2. OCR results accumulated without cleanup
3. Model outputs not explicitly cleared

### Recommendations

#### 1. Explicit Image Cleanup

```python
import gc

def process_pdf(pdf_path):
    image = None  # Ensure the name exists even if rendering fails
    try:
        image = render_pdf(pdf_path)
        result = extract_fields(image)
        return result
    finally:
        del image     # Drop the reference so the large buffer can be freed
        gc.collect()  # Force garbage collection
```

#### 2. Generator Pattern for Large Batches

**Current**: Load all documents into memory
```python
docs = [process_pdf(path) for path in pdf_paths]  # All in memory
```

**Optimized**: Use generator for streaming processing
```python
def process_batch_streaming(pdf_paths):
    """Process documents one at a time, yielding results."""
    for path in pdf_paths:
        result = process_pdf(path)
        yield result
        # Result can be saved to DB immediately;
        # the previous result is then garbage collected.
```

**Impact**: Constant memory usage regardless of batch size
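
Consuming the generator keeps only one result alive at a time; `save_result` is a hypothetical persistence hook, not an existing function:

```python
# Stream results straight to storage; peak memory stays flat.
for result in process_batch_streaming(pdf_paths):
    save_result(result)  # Hypothetical: persist before processing the next document
```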

#### 3. Context Managers for Resources

```python
class InferencePipeline:
    def __enter__(self):
        self.detector.load_model()
        return self

    def __exit__(self, *args):
        self.detector.unload_model()
        self.extractor.cleanup()

# Usage
with InferencePipeline(...) as pipeline:
    results = pipeline.process_pdf(path)
# Automatic cleanup
```

---

## Profiling and Monitoring

### Recommended Profiling Tools

#### 1. cProfile for CPU Profiling

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

# Your code here
pipeline.process_pdf(pdf_path)

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 slowest functions
```

#### 2. memory_profiler for Memory Analysis

```bash
pip install memory_profiler
python -m memory_profiler your_script.py
```

Or decorator-based:
```python
from memory_profiler import profile

@profile
def process_large_batch(pdf_paths):
    # Memory usage tracked line-by-line
    results = [process_pdf(path) for path in pdf_paths]
    return results
```

#### 3. py-spy for Production Profiling

```bash
pip install py-spy

# Profile running process
py-spy top --pid 12345

# Generate flamegraph
py-spy record -o profile.svg -- python your_script.py
```

**Advantage**: No code changes needed, minimal overhead

### Key Metrics to Monitor

1. **Processing Time per Document**
   - Target: <10 seconds for single-page invoice
   - Current: ~2-5 seconds (estimated)

2. **Memory Usage**
   - Target: <2GB for batch of 100 documents
   - Monitor: Peak memory usage (see the snippet after this list)

3. **Database Query Time**
   - Target: <100ms per query (with indexes)
   - Monitor: Slow query log

4. **OCR Accuracy vs Speed Trade-off**
   - Current: PaddleOCR with GPU (~200ms per region)
   - Alternative: Tesseract (~500ms, slightly more accurate)
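
For the peak-memory metric, the standard library can report the process high-water mark without extra dependencies (note: `ru_maxrss` is in KiB on Linux but bytes on macOS):

```python
import resource  # Unix-only (Linux/macOS)

usage = resource.getrusage(resource.RUSAGE_SELF)
peak_mib = usage.ru_maxrss / 1024  # ru_maxrss is KiB on Linux
print(f"Peak RSS: {peak_mib:.1f} MiB")
```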

### Logging Performance Metrics

**Add to pipeline.py**:
```python
import time
import logging

logger = logging.getLogger(__name__)

def process_pdf(self, pdf_path):
    start = time.time()

    # Processing...
    result = self._process_internal(pdf_path)

    elapsed = time.time() - start
    logger.info(f"Processed {pdf_path} in {elapsed:.2f}s")

    # Log to database for analysis
    self.db.log_performance({
        'document_id': result.document_id,
        'processing_time': elapsed,
        'field_count': len(result.fields)
    })

    return result
```

---

## Performance Optimization Priorities

### High Priority (Implement First)

1. ✅ **Database parameterized queries** - Already done (Phase 1)
2. ⚠️ **Database connection pooling** - Not implemented
3. ⚠️ **Index optimization** - Needs verification

### Medium Priority

4. ⚠️ **Batch PDF rendering** - Optimization possible
5. ✅ **Parser instance reuse** - Already done (Phase 2)
6. ⚠️ **Model caching** - Could improve

### Low Priority (Nice to Have)

7. ⚠️ **OCR result caching** - Complex implementation
8. ⚠️ **Generator patterns** - Refactoring needed
9. ⚠️ **Advanced profiling** - For production optimization

---

## Benchmarking Script

```python
"""
Benchmark script for invoice processing performance.
"""

import time
from pathlib import Path
from src.inference.pipeline import InferencePipeline

def benchmark_single_document(pdf_path, iterations=10):
    """Benchmark single document processing."""
    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    times = []
    for i in range(iterations):
        start = time.time()
        result = pipeline.process_pdf(pdf_path)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Iteration {i+1}: {elapsed:.2f}s")

    avg_time = sum(times) / len(times)
    print(f"\nAverage: {avg_time:.2f}s")
    print(f"Min: {min(times):.2f}s")
    print(f"Max: {max(times):.2f}s")

def benchmark_batch(pdf_paths, batch_size=10):
    """Benchmark batch processing."""
    from multiprocessing import Pool

    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    start = time.time()

    with Pool(processes=batch_size) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)

    elapsed = time.time() - start
    avg_per_doc = elapsed / len(pdf_paths)

    print(f"Total time: {elapsed:.2f}s")
    print(f"Documents: {len(pdf_paths)}")
    print(f"Average per document: {avg_per_doc:.2f}s")
    print(f"Throughput: {len(pdf_paths)/elapsed:.2f} docs/sec")

if __name__ == "__main__":
    # Single document benchmark
    benchmark_single_document("test.pdf")

    # Batch benchmark
    pdf_paths = list(Path("data/test_pdfs").glob("*.pdf"))
    benchmark_batch(pdf_paths[:100])
```

---

## Summary

**Implemented (Phase 1-2)**:
- ✅ Parameterized queries (SQL injection fix)
- ✅ Parser instance reuse (Phase 2 refactoring)
- ✅ Batch insert operations (execute_values)
- ✅ Dual pool processing (CPU/GPU separation)

**Quick Wins (Low effort, high impact)**:
- Database connection pooling (2-4 hours)
- Index verification and optimization (1-2 hours)
- Batch PDF rendering (4-6 hours)

**Long-term Improvements**:
- OCR result caching with hashing
- Generator patterns for streaming
- Advanced profiling and monitoring

**Expected Impact**:
- Connection pooling: 80-95% reduction in DB overhead
- Indexes: 10-100x faster queries
- Batch rendering: 50-90% less redundant work
- **Overall**: 2-5x throughput improvement for batch processing