Add claude config

.claude/commands/eval.md (new file, 174 lines)
@@ -0,0 +1,174 @@

# Eval Command

Evaluate model performance and field extraction accuracy.

## Usage

`/eval [model|accuracy|compare|report]`

## Model Evaluation

`/eval model`

Evaluate YOLO model performance on the test dataset:

```bash
# Run model evaluation
python -m src.cli.train --model runs/train/invoice_fields/weights/best.pt --eval-only

# Or use ultralytics directly
yolo val model=runs/train/invoice_fields/weights/best.pt data=data.yaml
```

Output:

```
Model Evaluation: invoice_fields/best.pt
========================================
mAP@0.5: 93.5%
mAP@0.5-0.95: 83.0%

Per-class AP:
- invoice_number: 95.2%
- invoice_date: 94.8%
- invoice_due_date: 93.1%
- ocr_number: 91.5%
- bankgiro: 92.3%
- plusgiro: 90.8%
- amount: 88.7%
- supplier_org_num: 85.2%
- payment_line: 82.4%
- customer_number: 81.1%
```
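
These numbers can also be pulled programmatically. A minimal sketch using the ultralytics Python API, assuming `data.yaml` defines a test split (drop `split="test"` to validate on the val split instead):

```python
from ultralytics import YOLO

# Load the trained detector and validate it on the test split
model = YOLO("runs/train/invoice_fields/weights/best.pt")
metrics = model.val(data="data.yaml", split="test")

print(f"mAP@0.5:      {metrics.box.map50:.1%}")
print(f"mAP@0.5-0.95: {metrics.box.map:.1%}")

# Per-class AP at IoU 0.5, reported in class-index order
for idx, ap in zip(metrics.box.ap_class_index, metrics.box.ap50):
    print(f"- {model.names[idx]}: {ap:.1%}")
```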

## Accuracy Evaluation

`/eval accuracy`

Evaluate field extraction accuracy against ground truth:

```bash
# Run accuracy evaluation on labeled data
python -m src.cli.infer --model runs/train/invoice_fields/weights/best.pt \
    --input ~/invoice-data/test/*.pdf \
    --ground-truth ~/invoice-data/test/labels.csv \
    --output eval_results.json
```

Output:

```
Field Extraction Accuracy
=========================
Documents tested: 500

Per-field accuracy:
- InvoiceNumber: 98.9% (494/500)
- InvoiceDate: 95.5% (478/500)
- InvoiceDueDate: 95.9% (480/500)
- OCR: 99.1% (496/500)
- Bankgiro: 99.0% (495/500)
- Plusgiro: 99.4% (497/500)
- Amount: 91.3% (457/500)
- supplier_org: 78.2% (391/500)

Overall: 94.8%
```
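
Conceptually, each score is an exact-match comparison of extracted values against the labels. A rough sketch of that scoring step; the shapes of `eval_results.json` (a filename-to-fields mapping) and `labels.csv` (one row per document with a `filename` column) are assumptions for illustration, not a documented format:

```python
import csv
import json
from collections import defaultdict

# Predictions: assumed shape {"doc.pdf": {"InvoiceNumber": "12345", ...}, ...}
with open("eval_results.json") as f:
    predictions = json.load(f)

correct: dict[str, int] = defaultdict(int)
total: dict[str, int] = defaultdict(int)

# Ground truth: assumed one row per document, keyed by a "filename" column
with open("labels.csv", newline="") as f:
    for row in csv.DictReader(f):
        doc = row.pop("filename")
        for field, truth in row.items():
            total[field] += 1
            predicted = str(predictions.get(doc, {}).get(field, ""))
            if predicted.strip() == truth.strip():
                correct[field] += 1

for field in total:
    print(f"- {field}: {correct[field] / total[field]:.1%} ({correct[field]}/{total[field]})")
print(f"Overall: {sum(correct.values()) / sum(total.values()):.1%}")
```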

## Compare Models

`/eval compare`

Compare two model versions:

```bash
# Compare old vs new model
python -m src.cli.eval compare \
    --model-a runs/train/invoice_v1/weights/best.pt \
    --model-b runs/train/invoice_v2/weights/best.pt \
    --test-data ~/invoice-data/test/
```

Output:

```
Model Comparison
================
             Model A    Model B    Delta
mAP@0.5:     91.2%      93.5%      +2.3%
Accuracy:    92.1%      94.8%      +2.7%
Speed (ms):  1850       1520       -330

Per-field improvements:
- amount: +4.2%
- payment_line: +3.8%
- customer_num: +2.1%

Recommendation: Deploy Model B
```
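
At the detection level, a comparison like this can be reproduced by validating both checkpoints on identical data with the ultralytics API. A sketch; note the speed figure here is ultralytics' per-image inference time, which will differ from the full PDF-pipeline timing shown above:

```python
from ultralytics import YOLO

def evaluate(weights: str) -> dict:
    """Validate one checkpoint and return the headline numbers."""
    metrics = YOLO(weights).val(data="data.yaml", split="test")
    return {
        "map50": metrics.box.map50,
        "ms": metrics.speed["inference"],  # per-image inference time
    }

a = evaluate("runs/train/invoice_v1/weights/best.pt")
b = evaluate("runs/train/invoice_v2/weights/best.pt")

print(f"mAP@0.5:    {a['map50']:.1%}   {b['map50']:.1%}   {b['map50'] - a['map50']:+.1%}")
print(f"Speed (ms): {a['ms']:.0f}   {b['ms']:.0f}   {b['ms'] - a['ms']:+.0f}")
```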

## Generate Report

`/eval report`

Generate a comprehensive evaluation report:

```bash
python -m src.cli.eval report --output eval_report.md
```

Output:

```markdown
# Evaluation Report
Generated: 2026-01-25

## Model Performance
- Model: runs/train/invoice_fields/weights/best.pt
- mAP@0.5: 93.5%
- Training samples: 9,738

## Field Extraction Accuracy
| Field | Accuracy | Errors |
|-------|----------|--------|
| InvoiceNumber | 98.9% | 6 |
| Amount | 91.3% | 43 |
...

## Error Analysis
### Common Errors
1. Amount: OCR misreads comma as period
2. supplier_org: Missing from some invoices
3. payment_line: Partially obscured by stamps

## Recommendations
1. Add more training data for low-accuracy fields
2. Implement OCR error correction for amounts
3. Consider confidence threshold tuning
```
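
A sketch of how the report body could be assembled from the accuracy results; `field_stats` is a hypothetical structure mirroring the numbers shown earlier:

```python
from datetime import date

# Hypothetical per-field results: field name -> (correct, total)
field_stats = {"InvoiceNumber": (494, 500), "Amount": (457, 500)}

lines = [
    "# Evaluation Report",
    f"Generated: {date.today().isoformat()}",
    "",
    "## Field Extraction Accuracy",
    "| Field | Accuracy | Errors |",
    "|-------|----------|--------|",
]
for field, (ok, total) in field_stats.items():
    lines.append(f"| {field} | {ok / total:.1%} | {total - ok} |")

with open("eval_report.md", "w") as f:
    f.write("\n".join(lines) + "\n")
```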

## Quick Commands

```bash
# Evaluate model metrics
yolo val model=runs/train/invoice_fields/weights/best.pt

# Test inference on a sample
python -m src.cli.infer --input sample.pdf --output result.json --gpu

# Check test coverage
pytest --cov=src --cov-report=html
```

## Evaluation Metrics

| Metric | Target | Current |
|--------|--------|---------|
| mAP@0.5 | >90% | 93.5% |
| Overall Accuracy | >90% | 94.8% |
| Test Coverage | >60% | 37% |
| Tests Passing | 100% | 100% |

## When to Evaluate

- After training a new model
- Before deploying to production
- After adding new training data
- When accuracy complaints arise
- As part of weekly performance monitoring