# YOLO Model Fine-Tuning Best Practices

Production guide for continuous fine-tuning of YOLO object detection models with user feedback.

## Overview

When users report failed detections, those documents are collected, reviewed, and used to incrementally improve the model without degrading performance on existing data.

Key risks:
- **Catastrophic forgetting**: the model loses what it learned in the original training after fine-tuning on a small set of new data
- **Cumulative drift**: repeated fine-tuning sessions cause progressive degradation
- **Overfitting**: few samples + many epochs = memorizing noise

## 1. Data Management

```
Original training set (25K) --> permanently retained as "anchor dataset"
        |
User-reported failures --> human review & labeling --> "fine-tune pool"
        |
Fine-tune pool accumulates over time, never deleted
```

Every new sample MUST be human-verified before entering the fine-tune pool. Incorrect labels are more harmful than no labels.

### Data Mixing Ratios

| Accumulated New Samples | Old Data Multiplier (Old Samples) | Total Training Size |
|------------------------|--------------------|--------------------|
| 10 | 50x (500) | 510 |
| 50 | 20x (1,000) | 1,050 |
| 200 | 10x (2,000) | 2,200 |
| 500+ | 5x (2,500) | 3,000 |

Principle: the fewer new samples there are, the higher the ratio of old data must be. The multiplier stabilizes at 5x once the pool reaches 500+ samples.

Old samples are drawn at random from the original 25K on every run, so coverage of the original data stays broad.

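The mixing table above can be read as a tiered lookup. The sketch below is one possible interpretation, not a prescribed implementation: the tier boundaries between the listed pool sizes, the names `old_sample_count` and `build_mixed_dataset`, and the reuse of the 3,000 cap from the fine-tuning config in section 3 are all assumptions made for illustration.

```python
import random

# (lower bound on pool size, old-data multiplier), read off the mixing table.
# How pool sizes between the listed rows are treated is an assumption.
MIX_TIERS = [(500, 5), (200, 10), (50, 20), (0, 50)]
OLD_SAMPLE_CAP = 3000  # assumed: the "cap at 3000" from the fine-tuning config


def old_sample_count(n_new, anchor_size=25_000, cap=OLD_SAMPLE_CAP):
    """Number of anchor (old) samples to mix with n_new pool samples."""
    multiplier = next(m for lower, m in MIX_TIERS if n_new >= lower)
    return min(n_new * multiplier, cap, anchor_size)


def build_mixed_dataset(pool, anchor, seed=None):
    """Return the whole fine-tune pool plus a fresh random draw from the anchor set."""
    rng = random.Random(seed)
    anchor = list(anchor)
    k = old_sample_count(len(pool), anchor_size=len(anchor))
    return list(pool) + rng.sample(anchor, k)
```

With a pool of 50 samples this draws 1,000 anchor samples, matching the table row. Pool sizes between the listed rows fall back to the nearest lower tier.
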
## 2. Model Version Management

```
base_v1.pt (original 25K training)
  +-- ft_v1.1.pt (base + fine-tune batch 1)
  +-- ft_v1.2.pt (base + fine-tune batch 1+2)
  +-- ...

When fine-tune pool reaches 2000+ samples:
base_v2.pt (original 25K + all accumulated samples, trained from scratch)
  +-- ft_v2.1.pt
  +-- ...
```

CRITICAL: Never chain fine-tunes (ft_v1.1 -> ft_v1.2 -> ft_v1.3). Always start from the base model to avoid cumulative drift.

## 3. Fine-Tuning Parameters

```yaml
base_model: best.pt       # always start from base model
epochs: 10                # few epochs are sufficient
lr0: 0.001                # 1/10 of base training lr
freeze: 10                # freeze first 10 backbone layers
warmup_epochs: 1
cos_lr: true

# data mixing
new_samples: all                  # entire fine-tune pool
old_samples: min(5x_new, 3000)    # old data sampling, cap at 3000; use the higher multipliers from the mixing table while the pool is under 500
```

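The parameter names above (`lr0`, `freeze`, `cos_lr`, `warmup_epochs`) match training arguments of the Ultralytics `YOLO` Python API, so a fine-tuning run could look roughly like the sketch below. It assumes the `ultralytics` package is the training framework. `fine_tune_args`, `fine_tune`, and the `mixed.yaml` dataset file are hypothetical names. The `base_model`, `new_samples`, and `old_samples` keys are not training arguments in that API and would be handled when the dataset is assembled.

```python
def fine_tune_args(data_yaml):
    """Training arguments mirroring the config above."""
    return {
        "data": data_yaml,     # dataset YAML describing the mixed old + new data
        "epochs": 10,
        "lr0": 0.001,
        "freeze": 10,          # freeze the first 10 layers
        "warmup_epochs": 1,
        "cos_lr": True,
    }


def fine_tune(base_weights, data_yaml):
    """Fine-tune from the base checkpoint (never from a previous ft_* model)."""
    # Imported lazily so the argument helper works without the package installed.
    from ultralytics import YOLO

    model = YOLO(base_weights)
    return model.train(**fine_tune_args(data_yaml))
```

A run would then be started with something like `fine_tune("base_v1.pt", "mixed.yaml")`, always passing the base checkpoint as the first argument.
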
### Why These Settings

| Parameter | Rationale |
|-----------|-----------|
| `epochs: 10` | Enough for small datasets; keeping the epoch count low limits overfitting |
| `lr0: 0.001` | Low learning rate preserves base model knowledge |
| `freeze: 10` | Backbone features are general; only fine-tune detection head and later layers |
| `cos_lr: true` | Smooth learning-rate decay avoids abrupt weight updates |

## 4. Deployment Gating (Most Important)

Every fine-tuned model MUST pass Gates 1 and 2 before deployment; Gate 3 is an optional additional check:

### Gate 1: Regression Validation

Run evaluation on the original test set (held out from the 25K training data).

| mAP50 Change | Action |
|-------------|--------|
| Drop < 1% | PASS - proceed to Gate 2 |
| Drop 1-3% | REVIEW - human inspection required |
| Drop > 3% | REJECT - do not deploy |

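The Gate 1 thresholds can be expressed as a small decision function. This is a minimal sketch under stated assumptions: mAP50 values are on a 0 to 1 scale, the drop is measured in absolute percentage points against the base model (the guide does not say whether the drop is absolute or relative), and `regression_gate` is a hypothetical name.

```python
def regression_gate(base_map50, new_map50):
    """Classify the mAP50 change on the original test set."""
    # Drop in percentage points; negative values mean the new model improved.
    drop = (base_map50 - new_map50) * 100
    if drop < 1.0:
        return "PASS"
    if drop <= 3.0:
        return "REVIEW"
    return "REJECT"
```

An improvement counts as a negative drop and therefore passes.
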
### Gate 2: New Sample Validation

Run inference on the new failure documents.

| Detection Rate | Action |
|---------------|--------|
| > 80% correct | PASS |
| <= 80% correct | REVIEW - check label quality or increase training |

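Gate 2 reduces to a ratio check over the reviewed failure documents. The sketch below assumes each document has already been judged correct or incorrect by whatever matching rule the team uses (the guide does not define one); `new_sample_gate` is a hypothetical name, and a rate of exactly 80% is treated conservatively as REVIEW.

```python
def new_sample_gate(n_correct, n_total, threshold=0.80):
    """PASS when more than `threshold` of the new failure documents are now detected correctly."""
    if n_total <= 0:
        raise ValueError("Gate 2 needs at least one new sample")
    rate = n_correct / n_total
    return "PASS" if rate > threshold else "REVIEW"
```
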
### Gate 3: A/B Comparison (Optional)

Sample 100 production documents, run both old and new models:
- New model must not be worse on any field type
- Compare per-class mAP to detect targeted regressions

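The per-class comparison can be sketched as a dictionary diff. The function below assumes per-class scores (for example per-class mAP) have already been computed for both models on the same sampled documents; `per_class_regressions` and the `tolerance` parameter are hypothetical and not part of any library.

```python
def per_class_regressions(old_scores, new_scores, tolerance=0.0):
    """Return {class: (old, new)} for every class where the new model scores lower."""
    regressions = {}
    for cls, old in old_scores.items():
        new = new_scores.get(cls, 0.0)  # a class missing from the new results counts as 0
        if new < old - tolerance:
            regressions[cls] = (old, new)
    return regressions
```

An empty result means the new model is not worse on any field type. A small non-zero `tolerance` may be reasonable given that 100 documents is a noisy sample.
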
## 5. Fine-Tuning Frequency

| Strategy | Trigger | Recommendation |
|----------|---------|---------------|
| **By volume (recommended)** | Pool reaches 50+ new samples | Best signal-to-noise ratio |
| By schedule | Weekly or monthly | Predictable but may trigger with insufficient data |
| By performance | Monitored accuracy drops below threshold | Reactive, requires monitoring infrastructure |

Do NOT fine-tune daily with fewer than 50 samples. The noise outweighs the signal.

## 6. Complete Workflow

```
User marks failed document
        |
        v
Human reviews and labels annotations
        |
        v
Add to fine-tune pool
        |
        v
Pool >= 50 samples? --NO--> Wait for more samples
        |
       YES
        |
        v
Prepare mixed dataset:
  - All samples from fine-tune pool
  - Random sample 5x from original 25K
        |
        v
Fine-tune from base.pt:
  - 10 epochs
  - lr0 = 0.001
  - freeze first 10 layers
        |
        v
Gate 1: Original test set mAP drop < 1%?
        |
       PASS
        |
        v
Gate 2: New sample detection rate > 80%?
        |
       PASS
        |
        v
Deploy new model, retain old model for rollback
        |
        v
Pool accumulated 2000+ samples?
        |
       YES --> Merge all data, train new base from scratch
```

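The two pool-size checks in the workflow can be summarized as a small planning helper. This is only an illustrative sketch: `planned_steps` and the step labels are hypothetical, and gate failures (REVIEW or REJECT) are not modeled here.

```python
MIN_POOL_SIZE = 50      # fine-tuning trigger from the workflow
REBASE_POOL_SIZE = 2000  # pool size at which a new base is trained from scratch


def planned_steps(pool_size):
    """List the workflow steps that apply for the current fine-tune pool size."""
    if pool_size < MIN_POOL_SIZE:
        return ["wait_for_more_samples"]
    steps = ["prepare_mixed_dataset", "fine_tune_from_base", "gate_1", "gate_2", "deploy"]
    if pool_size >= REBASE_POOL_SIZE:
        steps.append("retrain_base_from_scratch")
    return steps
```
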
## 7. Monitoring in Production

Track these metrics continuously:

| Metric | Purpose | Alert Threshold |
|--------|---------|----------------|
| Detection rate per field | Catch field-specific regressions | < 90% for any field |
| Average confidence score | Detect model uncertainty drift | Drop > 5% from baseline |
| User-reported failures / week | Measure improvement trend | Increasing over 3 weeks |
| Inference latency | Ensure model size hasn't bloated | > 2x baseline |

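The alert thresholds in the table could be checked with a function like the one below. It is a sketch under several assumptions: the confidence drop is treated as relative to the baseline, "increasing over 3 weeks" is read as three consecutive week-over-week increases, and `monitoring_alerts` with its parameter names is hypothetical.

```python
def monitoring_alerts(field_rates, avg_conf, baseline_conf,
                      weekly_failures, latency_ms, baseline_latency_ms):
    """Return a list of human-readable alerts for the metrics in the table above."""
    alerts = []
    for field, rate in field_rates.items():
        if rate < 0.90:
            alerts.append(f"detection rate below 90% for field '{field}'")
    if avg_conf < 0.95 * baseline_conf:
        alerts.append("average confidence dropped more than 5% from baseline")
    # Three consecutive increases need the four most recent weekly counts.
    recent = list(weekly_failures)[-4:]
    if len(recent) == 4 and all(a < b for a, b in zip(recent, recent[1:])):
        alerts.append("user-reported failures increasing over 3 weeks")
    if latency_ms > 2 * baseline_latency_ms:
        alerts.append("inference latency above 2x baseline")
    return alerts
```
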
## 8. Summary of Rules

| Rule | Practice |
|------|----------|
| Never chain fine-tunes | Always start from base.pt |
| Never use only new data | Must mix with old data |
| Never fine-tune on < 50 samples | Accumulate before triggering |
| Never auto-deploy | Must pass gating validation |
| Never discard old models | Retain versions for rollback |
| Periodically retrain base | Merge all data at 2000+ new samples |
| Always human-review labels | Bad labels are worse than no labels |