# Eval Command

Evaluate model performance and field extraction accuracy.

## Usage

`/eval [model|accuracy|compare|report]`

## Model Evaluation

`/eval model`

Evaluate YOLO model performance on the test dataset:

```bash
# Run model evaluation
python -m src.cli.train --model runs/train/invoice_fields/weights/best.pt --eval-only

# Or use ultralytics directly
yolo val model=runs/train/invoice_fields/weights/best.pt data=data.yaml
```

Output:

```
Model Evaluation: invoice_fields/best.pt
========================================
mAP@0.5:      93.5%
mAP@0.5-0.95: 83.0%

Per-class AP:
- invoice_number:    95.2%
- invoice_date:      94.8%
- invoice_due_date:  93.1%
- ocr_number:        91.5%
- bankgiro:          92.3%
- plusgiro:          90.8%
- amount:            88.7%
- supplier_org_num:  85.2%
- payment_line:      82.4%
- customer_number:   81.1%
```

## Accuracy Evaluation

`/eval accuracy`

Evaluate field extraction accuracy against ground truth:

```bash
# Run accuracy evaluation on labeled data
python -m src.cli.infer --model runs/train/invoice_fields/weights/best.pt \
  --input ~/invoice-data/test/*.pdf \
  --ground-truth ~/invoice-data/test/labels.csv \
  --output eval_results.json
```

Output:

```
Field Extraction Accuracy
=========================
Documents tested: 500

Per-field accuracy:
- InvoiceNumber:   98.9% (494/500)
- InvoiceDate:     95.5% (478/500)
- InvoiceDueDate:  95.9% (480/500)
- OCR:             99.1% (496/500)
- Bankgiro:        99.0% (495/500)
- Plusgiro:        99.4% (497/500)
- Amount:          91.3% (457/500)
- supplier_org:    78.2% (391/500)

Overall: 94.8%
```

## Compare Models

`/eval compare`

Compare two model versions:

```bash
# Compare old vs new model
python -m src.cli.eval compare \
  --model-a runs/train/invoice_v1/weights/best.pt \
  --model-b runs/train/invoice_v2/weights/best.pt \
  --test-data ~/invoice-data/test/
```

Output:

```
Model Comparison
================
             Model A    Model B    Delta
mAP@0.5:     91.2%      93.5%      +2.3%
Accuracy:    92.1%      94.8%      +2.7%
Speed (ms):  1850       1520       -330

Per-field improvements:
- amount:        +4.2%
- payment_line:  +3.8%
- customer_num:  +2.1%

Recommendation: Deploy Model B
```

## Generate Report

`/eval report`

Generate a comprehensive evaluation report:

```bash
python -m src.cli.eval report --output eval_report.md
```

Output:

```markdown
# Evaluation Report
Generated: 2026-01-25

## Model Performance
- Model: runs/train/invoice_fields/weights/best.pt
- mAP@0.5: 93.5%
- Training samples: 9,738

## Field Extraction Accuracy
| Field | Accuracy | Errors |
|-------|----------|--------|
| InvoiceNumber | 98.9% | 6 |
| Amount | 91.3% | 43 |
...

## Error Analysis
### Common Errors
1. Amount: OCR misreads comma as period
2. supplier_org: Missing from some invoices
3. payment_line: Partially obscured by stamps

## Recommendations
1. Add more training data for low-accuracy fields
2. Implement OCR error correction for amounts
3. Consider confidence threshold tuning
```

## Quick Commands

```bash
# Evaluate model metrics
yolo val model=runs/train/invoice_fields/weights/best.pt

# Test inference on a sample
python -m src.cli.infer --input sample.pdf --output result.json --gpu

# Check test coverage
pytest --cov=src --cov-report=html
```

## Evaluation Metrics

| Metric | Target | Current |
|--------|--------|---------|
| mAP@0.5 | >90% | 93.5% |
| Overall Accuracy | >90% | 94.8% |
| Test Coverage | >60% | 37% |
| Tests Passing | 100% | 100% |

## When to Evaluate

- After training a new model
- Before deploying to production
- After adding new training data
- When accuracy complaints arise
- Weekly performance monitoring
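## Sketch: Scoring Extractions Against Ground Truth

The `/eval accuracy` command above compares extracted field values against a labeled CSV. The snippet below is a minimal sketch of that comparison, assuming `eval_results.json` maps each document name to its extracted field values and `labels.csv` has a `document` column plus one column per field. The file layout, column names, and exact-match rule are illustrative assumptions, not the project's actual schema or scoring logic.

```python
"""Minimal sketch of the per-field comparison behind `/eval accuracy`.

Assumes eval_results.json maps document name -> {field: extracted value} and
labels.csv has a `document` column plus one column per field. File layout,
column names, and the exact-match rule are illustrative, not the project's
actual schema or scoring logic.
"""
import csv
import json
from collections import Counter

FIELDS = ["InvoiceNumber", "InvoiceDate", "InvoiceDueDate", "OCR",
          "Bankgiro", "Plusgiro", "Amount", "supplier_org"]


def load_ground_truth(path):
    """Read labels.csv into {document: {field: expected value}}."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["document"]: row for row in csv.DictReader(f)}


def score(results_path, labels_path):
    """Print per-field and overall extraction accuracy."""
    with open(results_path, encoding="utf-8") as f:
        predictions = json.load(f)
    truth = load_ground_truth(labels_path)
    correct, total = Counter(), Counter()
    for doc, expected in truth.items():
        extracted = predictions.get(doc, {})
        for field in FIELDS:
            total[field] += 1
            # Exact match after trimming whitespace; real scoring would likely
            # normalize dates and decimal separators per field first.
            if str(extracted.get(field, "")).strip() == str(expected.get(field, "")).strip():
                correct[field] += 1
    for field in FIELDS:
        pct = 100 * correct[field] / total[field] if total[field] else 0.0
        print(f"- {field}: {pct:.1f}% ({correct[field]}/{total[field]})")
    print(f"Overall: {100 * sum(correct.values()) / sum(total.values()):.1f}%")


if __name__ == "__main__":
    score("eval_results.json", "labels.csv")
```

Exact string matching keeps the sketch simple; field-specific normalization (dates, amounts) is where real accuracy numbers usually move the most.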
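## Sketch: Amount Normalization for OCR Errors

The error analysis flags amounts where OCR confuses comma and period separators, and the report recommends OCR error correction for amounts. The following is a small illustrative sketch of the kind of normalization that could address this; the function name and parsing rules are assumptions, not existing project code.

```python
"""Illustrative amount normalization for OCR'd invoice totals.

Handles the comma-as-decimal-separator confusion noted in the error analysis.
The function name and rules are assumptions, not existing project code.
"""
import re


def normalize_amount(raw):
    """Parse an OCR'd amount such as '1 234,56 kr', '1.234,56', or '1,234.56'."""
    cleaned = re.sub(r"[^\d,.\-]", "", raw)  # drop spaces, currency text, symbols
    if not cleaned:
        return None
    # Whichever separator appears last is treated as the decimal point;
    # the other one is assumed to be a thousands separator.
    last_comma, last_dot = cleaned.rfind(","), cleaned.rfind(".")
    decimal_sep = "," if last_comma > last_dot else "."
    thousands_sep = "." if decimal_sep == "," else ","
    cleaned = cleaned.replace(thousands_sep, "").replace(decimal_sep, ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


assert normalize_amount("1 234,56 kr") == 1234.56
assert normalize_amount("1,234.56") == 1234.56
```

A rule like this is deliberately conservative: ambiguous inputs (for example a lone `1.234`) still need a confidence check or human review rather than a silent guess.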