Re-structure the project.

This commit is contained in:
Yaojia Wang
2026-01-25 15:21:11 +01:00
parent 8fd61ea928
commit e599424a92
80 changed files with 10672 additions and 1584 deletions

@@ -0,0 +1,519 @@
# Performance Optimization Guide
This document provides performance optimization recommendations for the Invoice Field Extraction system.
## Table of Contents
1. [Batch Processing Optimization](#batch-processing-optimization)
2. [Database Query Optimization](#database-query-optimization)
3. [Caching Strategies](#caching-strategies)
4. [Memory Management](#memory-management)
5. [Profiling and Monitoring](#profiling-and-monitoring)
---
## Batch Processing Optimization
### Current State
The system processes invoices one at a time. For large batches, this can be inefficient.
### Recommendations
#### 1. Database Batch Operations
**Current**: Individual inserts for each document
```python
# Inefficient
for doc in documents:
    db.insert_document(doc)  # Individual DB call
```
**Optimized**: Use `execute_values` for batch inserts
```python
# Efficient - already implemented in db.py line 519
from psycopg2.extras import execute_values
execute_values(cursor, """
    INSERT INTO documents (...)
    VALUES %s
""", document_values)
```
**Impact**: 10-50x faster for batches of 100+ documents
#### 2. PDF Processing Batching
**Recommendation**: Process PDFs in parallel using multiprocessing
```python
from multiprocessing import Pool
def process_batch(pdf_paths, batch_size=10):
"""Process PDFs in parallel batches."""
with Pool(processes=batch_size) as pool:
results = pool.map(pipeline.process_pdf, pdf_paths)
return results
```
**Considerations**:
- GPU models should use a shared process pool (already exists: `src/processing/gpu_pool.py`)
- CPU-intensive tasks can use separate process pool (`src/processing/cpu_pool.py`)
- Current dual pool coordinator (`dual_pool_coordinator.py`) already supports this pattern
**Status**: ✅ Already implemented in `src/processing/` modules
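For intuition, a minimal sketch of the split using `concurrent.futures` (illustration only, not the project's coordinator API; `render_fn`/`detect_fn` stand in for module-level, picklable functions):
```python
from concurrent.futures import ProcessPoolExecutor

def process_batch_dual(pdf_paths, render_fn, detect_fn, cpu_workers=8, gpu_workers=2):
    """Render on a wide CPU pool, then run inference on a narrow pool sized to the GPU."""
    with ProcessPoolExecutor(max_workers=cpu_workers) as cpu_pool, \
         ProcessPoolExecutor(max_workers=gpu_workers) as gpu_pool:
        images = list(cpu_pool.map(render_fn, pdf_paths))  # CPU-bound rendering
        return list(gpu_pool.map(detect_fn, images))       # GPU-bound inference
```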
#### 3. Image Caching for Multi-Page PDFs
**Current**: Each page rendered independently
```python
# Current pattern in field_extractor.py
for page_num in range(total_pages):
    image = render_pdf_page(pdf_path, page_num, dpi=300)
```
**Optimized**: Pre-render all pages if processing multiple fields per page
```python
# Batch render
images = {
    page_num: render_pdf_page(pdf_path, page_num, dpi=300)
    for page_num in page_numbers_needed
}

# Reuse images
for detection in detections:
    image = images[detection.page_no]
    extract_field(detection, image)
```
**Impact**: Reduces redundant PDF rendering by 50-90% for multi-field invoices
---
## Database Query Optimization
### Current Performance
- **Parameterized queries**: ✅ Implemented (Phase 1)
- **Connection pooling**: ❌ Not implemented
- **Query batching**: ✅ Partially implemented
- **Index optimization**: ⚠️ Needs verification
### Recommendations
#### 1. Connection Pooling
**Current**: New connection for each operation
```python
def connect(self):
"""Create new database connection."""
return psycopg2.connect(**self.config)
```
**Optimized**: Use connection pooling
```python
from psycopg2 import pool
class DocumentDatabase:
    def __init__(self, config):
        self.pool = pool.SimpleConnectionPool(
            minconn=1,
            maxconn=10,
            **config
        )

    def connect(self):
        return self.pool.getconn()

    def close(self, conn):
        self.pool.putconn(conn)
```
**Impact**:
- Reduces connection overhead by 80-95%
- Especially important for high-frequency operations
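Borrowed connections must always be returned, even on errors. A minimal sketch of a context-manager wrapper around the pool sketched above (the `db` instance name is an assumption):
```python
from contextlib import contextmanager

@contextmanager
def pooled_connection(pool):
    """Borrow a connection from the pool and guarantee it is returned."""
    conn = pool.getconn()
    try:
        yield conn
    finally:
        pool.putconn(conn)

# Usage
with pooled_connection(db.pool) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT count(*) FROM documents")
        print(cursor.fetchone()[0])
```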
#### 2. Index Recommendations
**Check current indexes**:
```sql
-- Verify indexes exist on frequently queried columns
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public';
```
**Recommended indexes**:
```sql
-- If not already present
CREATE INDEX IF NOT EXISTS idx_documents_success
    ON documents(success);

CREATE INDEX IF NOT EXISTS idx_documents_timestamp
    ON documents(timestamp DESC);

CREATE INDEX IF NOT EXISTS idx_field_results_document_id
    ON field_results(document_id);

CREATE INDEX IF NOT EXISTS idx_field_results_matched
    ON field_results(matched);

CREATE INDEX IF NOT EXISTS idx_field_results_field_name
    ON field_results(field_name);
```
**Impact**:
- 10-100x faster queries for filtered/sorted results
- Critical for `get_failed_matches()` and `get_all_documents_summary()`
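To confirm Postgres actually uses an index, check the query plan; it should report an Index Scan rather than a Seq Scan (the query below assumes the `documents` schema implied by the indexes above):
```sql
-- Should show an Index Scan once idx_documents_success exists
EXPLAIN ANALYZE
SELECT * FROM documents
WHERE success = false
ORDER BY timestamp DESC
LIMIT 50;
```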
#### 3. Query Batching
**Status**: ✅ Already implemented for field results (line 519)
**Verify batching is used**:
```python
# Good pattern in db.py
execute_values(cursor, "INSERT INTO field_results (...) VALUES %s", field_values)
```
**Additional opportunity**: Batch `SELECT` queries
```python
# Current
docs = [get_document(doc_id) for doc_id in doc_ids] # N queries
# Optimized
docs = get_documents_batch(doc_ids) # 1 query with IN clause
```
**Status**: ✅ Already implemented (`get_documents_batch` exists in db.py)
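For reference, such a helper can be a single `ANY` query; a minimal sketch (the `id` column name is an assumption; the real implementation is in db.py):
```python
def get_documents_batch(cursor, doc_ids):
    """Fetch many documents in one round trip instead of N."""
    # psycopg2 adapts a Python list to a Postgres array for ANY(...)
    cursor.execute(
        "SELECT * FROM documents WHERE id = ANY(%s)",
        (list(doc_ids),),
    )
    return cursor.fetchall()
```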
---
## Caching Strategies
### 1. Model Loading Cache
**Current**: Models loaded per-instance
**Recommendation**: Singleton pattern for YOLO model
```python
class YOLODetectorSingleton:
    _instance = None

    @classmethod
    def get_instance(cls, model_path):
        # model_path is only honored on the first call
        if cls._instance is None:
            cls._instance = YOLODetector(model_path)
        return cls._instance
```
**Impact**: Avoids loading duplicate model copies, reducing model memory usage by up to ~90% when multiple pipeline instances would otherwise each load their own detector
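Usage is then a single call wherever a detector is needed; note that `model_path` only takes effect on the first call (the path below is a placeholder):
```python
detector = YOLODetectorSingleton.get_instance("path/to/model.pt")
```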
### 2. Parser Instance Caching
**Current**: ✅ Already optimal
```python
# Good pattern in field_extractor.py
def __init__(self):
    self.payment_line_parser = PaymentLineParser()  # Reused
    self.customer_number_parser = CustomerNumberParser()  # Reused
```
**Status**: No changes needed
### 3. OCR Result Caching
**Recommendation**: Cache OCR results for identical regions
```python
from functools import lru_cache
@lru_cache(maxsize=1000)
def ocr_region_cached(image_hash, bbox):
    """Cache OCR results by image hash + bbox (bbox must be a hashable tuple)."""
    image = _image_store[image_hash]  # recover the pixels; see hashing sketch below
    return paddle_ocr.ocr_region(image, bbox)
```
**Impact**: 50-80% speedup when re-processing similar documents
**Note**: Requires implementing image hashing (e.g., `hashlib.md5(image.tobytes())`)
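A minimal sketch of that hashing piece, assuming NumPy/OpenCV-style images; `_image_store` and the `ocr_region` wrapper are hypothetical names:
```python
import hashlib

_image_store = {}  # hash -> image, so the cached function can recover the pixels

def ocr_region(image, bbox):
    """Hash the image once, register it, and delegate to the cached function."""
    image_hash = hashlib.md5(image.tobytes()).hexdigest()
    _image_store[image_hash] = image  # note: grows unboundedly; evict in production
    return ocr_region_cached(image_hash, tuple(bbox))  # tuple: lru_cache needs hashable args
```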
---
## Memory Management
### Current Issues
**Potential memory leaks**:
1. Large images kept in memory after processing
2. OCR results accumulated without cleanup
3. Model outputs not explicitly cleared
### Recommendations
#### 1. Explicit Image Cleanup
```python
import gc
def process_pdf(pdf_path):
    image = None  # pre-bind so the finally block is safe if rendering fails
    try:
        image = render_pdf(pdf_path)
        result = extract_fields(image)
        return result
    finally:
        del image  # Explicit cleanup
        gc.collect()  # Force garbage collection
```
#### 2. Generator Pattern for Large Batches
**Current**: Load all documents into memory
```python
docs = [process_pdf(path) for path in pdf_paths] # All in memory
```
**Optimized**: Use generator for streaming processing
```python
def process_batch_streaming(pdf_paths):
"""Process documents one at a time, yielding results."""
for path in pdf_paths:
result = process_pdf(path)
yield result
# Result can be saved to DB immediately
# Previous result is garbage collected
```
**Impact**: Constant memory usage regardless of batch size
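Consuming the generator then keeps only one result alive at a time (`db.save_document` below is a hypothetical persistence call):
```python
for result in process_batch_streaming(pdf_paths):
    db.save_document(result)  # persist immediately; the result is freed next iteration
```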
#### 3. Context Managers for Resources
```python
class InferencePipeline:
    def __enter__(self):
        self.detector.load_model()
        return self

    def __exit__(self, *args):
        self.detector.unload_model()
        self.extractor.cleanup()

# Usage
with InferencePipeline(...) as pipeline:
    results = pipeline.process_pdf(path)
# Automatic cleanup
```
---
## Profiling and Monitoring
### Recommended Profiling Tools
#### 1. cProfile for CPU Profiling
```python
import cProfile
import pstats
profiler = cProfile.Profile()
profiler.enable()
# Your code here
pipeline.process_pdf(pdf_path)
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 slowest functions
```
#### 2. memory_profiler for Memory Analysis
```bash
pip install memory_profiler
python -m memory_profiler your_script.py
```
Or decorator-based:
```python
from memory_profiler import profile
@profile
def process_large_batch(pdf_paths):
    # Memory usage tracked line-by-line
    results = [process_pdf(path) for path in pdf_paths]
    return results
```
#### 3. py-spy for Production Profiling
```bash
pip install py-spy
# Profile running process
py-spy top --pid 12345
# Generate flamegraph
py-spy record -o profile.svg -- python your_script.py
```
**Advantage**: No code changes needed, minimal overhead
### Key Metrics to Monitor
1. **Processing Time per Document**
- Target: <10 seconds for single-page invoice
- Current: ~2-5 seconds (estimated)
2. **Memory Usage**
- Target: <2GB for batch of 100 documents
- Monitor: Peak memory usage
3. **Database Query Time**
- Target: <100ms per query (with indexes)
- Monitor: Slow query log
4. **OCR Accuracy vs Speed Trade-off**
- Current: PaddleOCR with GPU (~200ms per region)
- Alternative: Tesseract (~500ms, slightly more accurate)
### Logging Performance Metrics
**Add to pipeline.py**:
```python
import time
import logging
logger = logging.getLogger(__name__)
def process_pdf(self, pdf_path):
    start = time.time()

    # Processing...
    result = self._process_internal(pdf_path)

    elapsed = time.time() - start
    logger.info(f"Processed {pdf_path} in {elapsed:.2f}s")

    # Log to database for analysis
    self.db.log_performance({
        'document_id': result.document_id,
        'processing_time': elapsed,
        'field_count': len(result.fields)
    })
    return result
```
---
## Performance Optimization Priorities
### High Priority (Implement First)
1. **Database parameterized queries** - Already done (Phase 1)
2. **Database connection pooling** - Not implemented
3. **Index optimization** - Needs verification
### Medium Priority
4. **Batch PDF rendering** - Optimization possible
5. **Parser instance reuse** - Already done (Phase 2)
6. **Model caching** - Could improve
### Low Priority (Nice to Have)
7. **OCR result caching** - Complex implementation
8. **Generator patterns** - Refactoring needed
9. **Advanced profiling** - For production optimization
---
## Benchmarking Script
```python
"""
Benchmark script for invoice processing performance.
"""
import time
from pathlib import Path
from src.inference.pipeline import InferencePipeline
def benchmark_single_document(pdf_path, iterations=10):
    """Benchmark single document processing."""
    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    times = []
    for i in range(iterations):
        start = time.time()
        result = pipeline.process_pdf(pdf_path)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Iteration {i+1}: {elapsed:.2f}s")

    avg_time = sum(times) / len(times)
    print(f"\nAverage: {avg_time:.2f}s")
    print(f"Min: {min(times):.2f}s")
    print(f"Max: {max(times):.2f}s")


def benchmark_batch(pdf_paths, batch_size=10):
    """Benchmark batch processing."""
    from multiprocessing import Pool

    pipeline = InferencePipeline(
        model_path="path/to/model.pt",
        use_gpu=True
    )

    start = time.time()
    with Pool(processes=batch_size) as pool:
        results = pool.map(pipeline.process_pdf, pdf_paths)
    elapsed = time.time() - start

    avg_per_doc = elapsed / len(pdf_paths)
    print(f"Total time: {elapsed:.2f}s")
    print(f"Documents: {len(pdf_paths)}")
    print(f"Average per document: {avg_per_doc:.2f}s")
    print(f"Throughput: {len(pdf_paths)/elapsed:.2f} docs/sec")


if __name__ == "__main__":
    # Single document benchmark
    benchmark_single_document("test.pdf")

    # Batch benchmark
    pdf_paths = list(Path("data/test_pdfs").glob("*.pdf"))
    benchmark_batch(pdf_paths[:100])
```
---
## Summary
**Implemented (Phase 1-2)**:
- Parameterized queries (SQL injection fix)
- Parser instance reuse (Phase 2 refactoring)
- Batch insert operations (execute_values)
- Dual pool processing (CPU/GPU separation)
**Quick Wins (Low effort, high impact)**:
- Database connection pooling (2-4 hours)
- Index verification and optimization (1-2 hours)
- Batch PDF rendering (4-6 hours)
**Long-term Improvements**:
- OCR result caching with hashing
- Generator patterns for streaming
- Advanced profiling and monitoring
**Expected Impact**:
- Connection pooling: 80-95% reduction in DB overhead
- Indexes: 10-100x faster queries
- Batch rendering: 50-90% less redundant work
- **Overall**: 2-5x throughput improvement for batch processing