# Eval Command

Evaluate model performance and field extraction accuracy.

## Usage

```
/eval [model|accuracy|compare|report]
```

## Model Evaluation

```
/eval model
```

Evaluate YOLO model performance on the test dataset:

```bash
# Run model evaluation
python -m src.cli.train --model runs/train/invoice_fields/weights/best.pt --eval-only

# Or use ultralytics directly
yolo val model=runs/train/invoice_fields/weights/best.pt data=data.yaml
```

Output:

```
Model Evaluation: invoice_fields/best.pt
========================================
mAP@0.5:      93.5%
mAP@0.5:0.95: 83.0%

Per-class AP:
- invoice_number:    95.2%
- invoice_date:      94.8%
- invoice_due_date:  93.1%
- ocr_number:        91.5%
- bankgiro:          92.3%
- plusgiro:          90.8%
- amount:            88.7%
- supplier_org_num:  85.2%
- payment_line:      82.4%
- customer_number:   81.1%
```
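
The same numbers can be pulled programmatically through the ultralytics Python API. A minimal sketch, assuming the checkpoint path and `data.yaml` above; attribute names follow ultralytics' `DetMetrics` object:

```python
from ultralytics import YOLO

# Load the trained checkpoint and validate it on the dataset split in data.yaml
model = YOLO("runs/train/invoice_fields/weights/best.pt")
metrics = model.val(data="data.yaml")

print(f"mAP@0.5:      {metrics.box.map50:.1%}")
print(f"mAP@0.5:0.95: {metrics.box.map:.1%}")

# metrics.box.maps is a per-class mAP@0.5:0.95 array indexed by class id
for idx, name in model.names.items():
    print(f"- {name}: {metrics.box.maps[idx]:.1%}")
```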

## Accuracy Evaluation

```
/eval accuracy
```

Evaluate field extraction accuracy against ground truth:

```bash
# Run accuracy evaluation on labeled data
python -m src.cli.infer --model runs/train/invoice_fields/weights/best.pt \
    --input ~/invoice-data/test/*.pdf \
    --ground-truth ~/invoice-data/test/labels.csv \
    --output eval_results.json
```

Output:

```
Field Extraction Accuracy
=========================
Documents tested: 500

Per-field accuracy:
- InvoiceNumber:   98.9% (494/500)
- InvoiceDate:     95.5% (478/500)
- InvoiceDueDate:  95.9% (480/500)
- OCR:             99.1% (496/500)
- Bankgiro:        99.0% (495/500)
- Plusgiro:        99.4% (497/500)
- Amount:          91.3% (457/500)
- supplier_org:    78.2% (391/500)

Overall: 94.8%
```
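
A minimal sketch of the comparison step behind those numbers, assuming two hypothetical file layouts: `eval_results.json` maps document IDs to extracted field values, and `labels.csv` has a `doc_id` column plus one column per field. The real CLI formats may differ:

```python
import csv
import json
from collections import Counter

# Assumed (hypothetical) formats:
#   eval_results.json: {"<doc_id>": {"InvoiceNumber": "...", ...}, ...}
#   labels.csv:        doc_id,InvoiceNumber,InvoiceDate,...
with open("eval_results.json") as f:
    predictions = json.load(f)

correct, total = Counter(), Counter()
with open("labels.csv", newline="") as f:
    for row in csv.DictReader(f):
        pred = predictions.get(row["doc_id"], {})
        for field, truth in row.items():
            if field == "doc_id":
                continue
            total[field] += 1
            if pred.get(field, "").strip() == truth.strip():
                correct[field] += 1

print(f"Documents tested: {len(predictions)}")
for field in sorted(total):
    print(f"- {field}: {correct[field] / total[field]:.1%} ({correct[field]}/{total[field]})")
print(f"Overall: {sum(correct.values()) / sum(total.values()):.1%}")
```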

## Compare Models

```
/eval compare
```

Compare two model versions:

```bash
# Compare old vs new model
python -m src.cli.eval compare \
    --model-a runs/train/invoice_v1/weights/best.pt \
    --model-b runs/train/invoice_v2/weights/best.pt \
    --test-data ~/invoice-data/test/
```

Output:

```
Model Comparison
================
                Model A     Model B     Delta
mAP@0.5:        91.2%       93.5%       +2.3%
Accuracy:       92.1%       94.8%       +2.7%
Speed (ms):     1850        1520        -330

Per-field improvements:
- amount:       +4.2%
- payment_line: +3.8%
- customer_num: +2.1%

Recommendation: Deploy Model B
```
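
The detection half of that comparison can be reproduced with the ultralytics API alone. A minimal sketch, assuming the two checkpoint paths above and a shared `data.yaml`; note `metrics.speed` reports per-image milliseconds, which may not match exactly what the CLI's Speed row measures:

```python
from ultralytics import YOLO

def evaluate(weights: str) -> dict:
    """Validate one checkpoint and collect headline numbers."""
    metrics = YOLO(weights).val(data="data.yaml")
    return {
        "map50": metrics.box.map50,
        # Per-image preprocess + inference + postprocess time, in ms
        "speed_ms": sum(metrics.speed.values()),
    }

a = evaluate("runs/train/invoice_v1/weights/best.pt")
b = evaluate("runs/train/invoice_v2/weights/best.pt")

print(f"mAP@0.5:    A {a['map50']:.1%}  B {b['map50']:.1%}  Delta {b['map50'] - a['map50']:+.1%}")
print(f"Speed (ms): A {a['speed_ms']:.0f}  B {b['speed_ms']:.0f}  Delta {b['speed_ms'] - a['speed_ms']:+.0f}")
```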

## Generate Report

```
/eval report
```

Generate a comprehensive evaluation report:

```bash
python -m src.cli.eval report --output eval_report.md
```

Output:

```markdown
# Evaluation Report
Generated: 2026-01-25

## Model Performance
- Model: runs/train/invoice_fields/weights/best.pt
- mAP@0.5: 93.5%
- Training samples: 9,738

## Field Extraction Accuracy
| Field | Accuracy | Errors |
|-------|----------|--------|
| InvoiceNumber | 98.9% | 6 |
| Amount | 91.3% | 43 |
...

## Error Analysis
### Common Errors
1. Amount: OCR misreads comma as period
2. supplier_org: Missing from some invoices
3. payment_line: Partially obscured by stamps

## Recommendations
1. Add more training data for low-accuracy fields
2. Implement OCR error correction for amounts
3. Consider confidence threshold tuning
```
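
A minimal sketch of the report step, assuming the per-field counts live in `eval_results.json` under a hypothetical `fields` key (the real schema produced by `src.cli.infer` may differ):

```python
import json
from datetime import date

# Hypothetical schema:
#   {"fields": {"InvoiceNumber": {"correct": 494, "total": 500}, ...}}
with open("eval_results.json") as f:
    results = json.load(f)

lines = [
    "# Evaluation Report",
    f"Generated: {date.today().isoformat()}",
    "",
    "## Field Extraction Accuracy",
    "| Field | Accuracy | Errors |",
    "|-------|----------|--------|",
]
for field, counts in results["fields"].items():
    accuracy = counts["correct"] / counts["total"]
    lines.append(f"| {field} | {accuracy:.1%} | {counts['total'] - counts['correct']} |")

with open("eval_report.md", "w") as f:
    f.write("\n".join(lines) + "\n")
```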

## Quick Commands

```bash
# Evaluate model metrics
yolo val model=runs/train/invoice_fields/weights/best.pt

# Test inference on sample
python -m src.cli.infer --input sample.pdf --output result.json --gpu

# Check test coverage
pytest --cov=src --cov-report=html
```

## Evaluation Metrics

| Metric | Target | Current |
|------------------|--------|---------|
| mAP@0.5          | >90%   | 93.5%   |
| Overall Accuracy | >90%   | 94.8%   |
| Test Coverage    | >60%   | 37%     |
| Tests Passing    | 100%   | 100%    |
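
If these targets should gate a release, a minimal sketch of an automated check (metric names and values are placeholders wired to the table above, not an existing CLI):

```python
# Hypothetical release gate: abort when a metric misses its target.
TARGETS = {"mAP@0.5": 0.90, "overall_accuracy": 0.90}

def check(metrics: dict) -> None:
    failures = [
        f"{name}: {metrics.get(name, 0.0):.1%} < {target:.0%}"
        for name, target in TARGETS.items()
        if metrics.get(name, 0.0) < target
    ]
    if failures:
        raise SystemExit("Evaluation gate failed:\n" + "\n".join(failures))

check({"mAP@0.5": 0.935, "overall_accuracy": 0.948})  # passes with current numbers
```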

## When to Evaluate

- After training a new model
- Before deploying to production
- After adding new training data
- When accuracy complaints arise
- As part of weekly performance monitoring