5.8 KiB
YOLO Model Fine-Tuning Best Practices
Production guide for continuous fine-tuning of YOLO object detection models with user feedback.
Overview
When users report failed detections, those documents are collected, reviewed, and used to incrementally improve the model without degrading performance on existing data.
Key risks:
- Catastrophic forgetting: model forgets original training after fine-tuning on small new data
- Cumulative drift: repeated fine-tuning sessions cause progressive degradation
- Overfitting: few samples + many epochs = memorizing noise
1. Data Management
Original training set (25K) --> permanently retained as "anchor dataset"
|
User-reported failures --> human review & labeling --> "fine-tune pool"
|
Fine-tune pool accumulates over time, never deleted
Every new sample MUST be human-verified before entering the fine-tune pool. Incorrect labels are more harmful than no labels.
Data Mixing Ratios
| Accumulated New Samples | Old Data Multiplier | Total Training Size |
|---|---|---|
| 10 | 50x (500) | 510 |
| 50 | 20x (1,000) | 1,050 |
| 200 | 10x (2,000) | 2,200 |
| 500+ | 5x (2,500) | 3,000 |
Principle: fewer new samples require higher old data ratio. Stabilize at 5x once pool reaches 500+.
Old samples are randomly sampled from the original 25K each time, ensuring broad coverage.
2. Model Version Management
base_v1.pt (original 25K training)
+-- ft_v1.1.pt (base + fine-tune batch 1)
+-- ft_v1.2.pt (base + fine-tune batch 1+2)
+-- ...
When fine-tune pool reaches 2000+ samples:
base_v2.pt (original 25K + all accumulated samples, trained from scratch)
+-- ft_v2.1.pt
+-- ...
CRITICAL: Never chain fine-tunes (ft_v1.1 -> ft_v1.2 -> ft_v1.3). Always start from the base model to avoid cumulative drift.
3. Fine-Tuning Parameters
base_model: best.pt # always start from base model
epochs: 10 # few epochs are sufficient
lr0: 0.001 # 1/10 of base training lr
freeze: 10 # freeze first 10 backbone layers
warmup_epochs: 1
cos_lr: true
# data mixing
new_samples: all # entire fine-tune pool
old_samples: min(5x_new, 3000) # old data sampling, cap at 3000
Why These Settings
| Parameter | Rationale |
|---|---|
epochs: 10 |
More than enough for small datasets; prevents overfitting |
lr0: 0.001 |
Low learning rate preserves base model knowledge |
freeze: 10 |
Backbone features are general; only fine-tune detection head and later layers |
cos_lr: true |
Smooth decay prevents sharp weight updates |
4. Deployment Gating (Most Important)
Every fine-tuned model MUST pass three gates before deployment:
Gate 1: Regression Validation
Run evaluation on the original test set (held out from the 25K training data).
| mAP50 Change | Action |
|---|---|
| Drop < 1% | PASS - deploy |
| Drop 1-3% | REVIEW - human inspection required |
| Drop > 3% | REJECT - do not deploy |
Gate 2: New Sample Validation
Run inference on the new failure documents.
| Detection Rate | Action |
|---|---|
| > 80% correct | PASS |
| < 80% correct | REVIEW - check label quality or increase training |
Gate 3: A/B Comparison (Optional)
Sample 100 production documents, run both old and new models:
- New model must not be worse on any field type
- Compare per-class mAP to detect targeted regressions
5. Fine-Tuning Frequency
| Strategy | Trigger | Recommendation |
|---|---|---|
| By volume (recommended) | Pool reaches 50+ new samples | Best signal-to-noise ratio |
| By schedule | Weekly or monthly | Predictable but may trigger with insufficient data |
| By performance | Monitored accuracy drops below threshold | Reactive, requires monitoring infrastructure |
Do NOT fine-tune daily with fewer than 50 samples. The noise outweighs the signal.
6. Complete Workflow
User marks failed document
|
v
Human reviews and labels annotations
|
v
Add to fine-tune pool
|
v
Pool >= 50 samples? --NO--> Wait for more samples
|
YES
|
v
Prepare mixed dataset:
- All samples from fine-tune pool
- Random sample 5x from original 25K
|
v
Fine-tune from base.pt:
- 10 epochs
- lr0 = 0.001
- freeze first 10 layers
|
v
Gate 1: Original test set mAP drop < 1%?
|
PASS
|
v
Gate 2: New sample detection rate > 80%?
|
PASS
|
v
Deploy new model, retain old model for rollback
|
v
Pool accumulated 2000+ samples?
|
YES --> Merge all data, train new base from scratch
7. Monitoring in Production
Track these metrics continuously:
| Metric | Purpose | Alert Threshold |
|---|---|---|
| Detection rate per field | Catch field-specific regressions | < 90% for any field |
| Average confidence score | Detect model uncertainty drift | Drop > 5% from baseline |
| User-reported failures / week | Measure improvement trend | Increasing over 3 weeks |
| Inference latency | Ensure model size hasn't bloated | > 2x baseline |
8. Summary of Rules
| Rule | Practice |
|---|---|
| Never chain fine-tunes | Always start from base.pt |
| Never use only new data | Must mix with old data |
| Never fine-tune on < 50 samples | Accumulate before triggering |
| Never auto-deploy | Must pass gating validation |
| Never discard old models | Retain versions for rollback |
| Periodically retrain base | Merge all data at 2000+ new samples |
| Always human-review labels | Bad labels are worse than no labels |