# Eval Command

Evaluate model performance and field extraction accuracy.

## Usage

```
/eval [model|accuracy|compare|report]
```

## Model Evaluation

```
/eval model
```

Evaluate YOLO model performance on the test dataset:

```bash
# Run model evaluation
python -m src.cli.train --model runs/train/invoice_fields/weights/best.pt --eval-only

# Or use ultralytics directly
yolo val model=runs/train/invoice_fields/weights/best.pt data=data.yaml
```

Output:

```
Model Evaluation: invoice_fields/best.pt
========================================
mAP@0.5:      93.5%
mAP@0.5:0.95: 83.0%

Per-class AP:
- invoice_number:    95.2%
- invoice_date:      94.8%
- invoice_due_date:  93.1%
- ocr_number:        91.5%
- bankgiro:          92.3%
- plusgiro:          90.8%
- amount:            88.7%
- supplier_org_num:  85.2%
- payment_line:      82.4%
- customer_number:   81.1%
```
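
The same numbers can be pulled programmatically through the ultralytics Python API. A minimal sketch, assuming the checkpoint path and `data.yaml` above; attribute names follow ultralytics' `DetMetrics` object:

```python
from ultralytics import YOLO

# Load the trained checkpoint and validate it on the dataset split in data.yaml
model = YOLO("runs/train/invoice_fields/weights/best.pt")
metrics = model.val(data="data.yaml")

print(f"mAP@0.5:      {metrics.box.map50:.1%}")
print(f"mAP@0.5:0.95: {metrics.box.map:.1%}")

# metrics.box.maps is a per-class mAP@0.5:0.95 array indexed by class id
for idx, name in model.names.items():
    print(f"- {name}: {metrics.box.maps[idx]:.1%}")
```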

## Accuracy Evaluation

```
/eval accuracy
```

Evaluate field extraction accuracy against ground truth:

```bash
# Run accuracy evaluation on labeled data
python -m src.cli.infer --model runs/train/invoice_fields/weights/best.pt \
    --input ~/invoice-data/test/*.pdf \
    --ground-truth ~/invoice-data/test/labels.csv \
    --output eval_results.json
```

Output:

```
Field Extraction Accuracy
=========================
Documents tested: 500

Per-field accuracy:
- InvoiceNumber:   98.9% (494/500)
- InvoiceDate:     95.5% (478/500)
- InvoiceDueDate:  95.9% (480/500)
- OCR:             99.1% (496/500)
- Bankgiro:        99.0% (495/500)
- Plusgiro:        99.4% (497/500)
- Amount:          91.3% (457/500)
- supplier_org:    78.2% (391/500)

Overall: 94.8%
```
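
A minimal sketch of the comparison step behind those numbers, assuming two hypothetical file layouts: `eval_results.json` maps document IDs to extracted field values, and `labels.csv` has a `doc_id` column plus one column per field. The real CLI formats may differ:

```python
import csv
import json
from collections import Counter

# Assumed (hypothetical) formats:
#   eval_results.json: {"<doc_id>": {"InvoiceNumber": "...", ...}, ...}
#   labels.csv:        doc_id,InvoiceNumber,InvoiceDate,...
with open("eval_results.json") as f:
    predictions = json.load(f)

correct, total = Counter(), Counter()
with open("labels.csv", newline="") as f:
    for row in csv.DictReader(f):
        pred = predictions.get(row["doc_id"], {})
        for field, truth in row.items():
            if field == "doc_id":
                continue
            total[field] += 1
            if pred.get(field, "").strip() == truth.strip():
                correct[field] += 1

print(f"Documents tested: {len(predictions)}")
for field in sorted(total):
    print(f"- {field}: {correct[field] / total[field]:.1%} ({correct[field]}/{total[field]})")
print(f"Overall: {sum(correct.values()) / sum(total.values()):.1%}")
```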

## Compare Models

```
/eval compare
```

Compare two model versions:

```bash
# Compare old vs new model
python -m src.cli.eval compare \
    --model-a runs/train/invoice_v1/weights/best.pt \
    --model-b runs/train/invoice_v2/weights/best.pt \
    --test-data ~/invoice-data/test/
```

Output:

```
Model Comparison
================
                Model A     Model B     Delta
mAP@0.5:        91.2%       93.5%       +2.3%
Accuracy:       92.1%       94.8%       +2.7%
Speed (ms):     1850        1520        -330

Per-field improvements:
- amount:       +4.2%
- payment_line: +3.8%
- customer_num: +2.1%

Recommendation: Deploy Model B
```
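
The detection half of that comparison can be reproduced with the ultralytics API alone. A minimal sketch, assuming the two checkpoint paths above and a shared `data.yaml`; note `metrics.speed` reports per-image milliseconds, which may not match exactly what the CLI's Speed row measures:

```python
from ultralytics import YOLO

def evaluate(weights: str) -> dict:
    """Validate one checkpoint and collect headline numbers."""
    metrics = YOLO(weights).val(data="data.yaml")
    return {
        "map50": metrics.box.map50,
        # Per-image preprocess + inference + postprocess time, in ms
        "speed_ms": sum(metrics.speed.values()),
    }

a = evaluate("runs/train/invoice_v1/weights/best.pt")
b = evaluate("runs/train/invoice_v2/weights/best.pt")

print(f"mAP@0.5:    A {a['map50']:.1%}  B {b['map50']:.1%}  Delta {b['map50'] - a['map50']:+.1%}")
print(f"Speed (ms): A {a['speed_ms']:.0f}  B {b['speed_ms']:.0f}  Delta {b['speed_ms'] - a['speed_ms']:+.0f}")
```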

## Generate Report

```
/eval report
```

Generate a comprehensive evaluation report:

```bash
python -m src.cli.eval report --output eval_report.md
```

Output:

```markdown
# Evaluation Report
Generated: 2026-01-25

## Model Performance
- Model: runs/train/invoice_fields/weights/best.pt
- mAP@0.5: 93.5%
- Training samples: 9,738

## Field Extraction Accuracy
| Field | Accuracy | Errors |
|-------|----------|--------|
| InvoiceNumber | 98.9% | 6 |
| Amount | 91.3% | 43 |
...

## Error Analysis
### Common Errors
1. Amount: OCR misreads comma as period
2. supplier_org: Missing from some invoices
3. payment_line: Partially obscured by stamps

## Recommendations
1. Add more training data for low-accuracy fields
2. Implement OCR error correction for amounts
3. Consider confidence threshold tuning
```
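
A minimal sketch of the report step, assuming the per-field counts live in `eval_results.json` under a hypothetical `fields` key (the real schema produced by `src.cli.infer` may differ):

```python
import json
from datetime import date

# Hypothetical schema:
#   {"fields": {"InvoiceNumber": {"correct": 494, "total": 500}, ...}}
with open("eval_results.json") as f:
    results = json.load(f)

lines = [
    "# Evaluation Report",
    f"Generated: {date.today().isoformat()}",
    "",
    "## Field Extraction Accuracy",
    "| Field | Accuracy | Errors |",
    "|-------|----------|--------|",
]
for field, counts in results["fields"].items():
    accuracy = counts["correct"] / counts["total"]
    lines.append(f"| {field} | {accuracy:.1%} | {counts['total'] - counts['correct']} |")

with open("eval_report.md", "w") as f:
    f.write("\n".join(lines) + "\n")
```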

## Quick Commands

```bash
# Evaluate model metrics
yolo val model=runs/train/invoice_fields/weights/best.pt

# Test inference on sample
python -m src.cli.infer --input sample.pdf --output result.json --gpu

# Check test coverage
pytest --cov=src --cov-report=html
```

## Evaluation Metrics

| Metric | Target | Current |
|------------------|--------|---------|
| mAP@0.5          | >90%   | 93.5%   |
| Overall Accuracy | >90%   | 94.8%   |
| Test Coverage    | >60%   | 37%     |
| Tests Passing    | 100%   | 100%    |
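
If these targets should gate a release, a minimal sketch of an automated check (metric names and values are placeholders wired to the table above, not an existing CLI):

```python
# Hypothetical release gate: abort when a metric misses its target.
TARGETS = {"mAP@0.5": 0.90, "overall_accuracy": 0.90}

def check(metrics: dict) -> None:
    failures = [
        f"{name}: {metrics.get(name, 0.0):.1%} < {target:.0%}"
        for name, target in TARGETS.items()
        if metrics.get(name, 0.0) < target
    ]
    if failures:
        raise SystemExit("Evaluation gate failed:\n" + "\n".join(failures))

check({"mAP@0.5": 0.935, "overall_accuracy": 0.948})  # passes with current numbers
```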

## When to Evaluate

- After training a new model
- Before deploying to production
- After adding new training data
- When accuracy complaints arise
- As part of weekly performance monitoring