WIP
This commit is contained in:
185
docs/fine-tuning-best-practices.md
Normal file
185
docs/fine-tuning-best-practices.md
Normal file
@@ -0,0 +1,185 @@
|
||||
# YOLO Model Fine-Tuning Best Practices
|
||||
|
||||
Production guide for continuous fine-tuning of YOLO object detection models with user feedback.
|
||||
|
||||
## Overview
|
||||
|
||||
When users report failed detections, those documents are collected, reviewed, and used to incrementally improve the model without degrading performance on existing data.
|
||||
|
||||
Key risks:
|
||||
- **Catastrophic forgetting**: model forgets original training after fine-tuning on small new data
|
||||
- **Cumulative drift**: repeated fine-tuning sessions cause progressive degradation
|
||||
- **Overfitting**: few samples + many epochs = memorizing noise
|
||||
|
||||
## 1. Data Management
|
||||
|
||||
```
|
||||
Original training set (25K) --> permanently retained as "anchor dataset"
|
||||
|
|
||||
User-reported failures --> human review & labeling --> "fine-tune pool"
|
||||
|
|
||||
Fine-tune pool accumulates over time, never deleted
|
||||
```
|
||||
|
||||
Every new sample MUST be human-verified before entering the fine-tune pool. Incorrect labels are more harmful than no labels.
|
||||
|
||||
### Data Mixing Ratios
|
||||
|
||||
| Accumulated New Samples | Old Data Multiplier | Total Training Size |
|
||||
|------------------------|--------------------|--------------------|
|
||||
| 10 | 50x (500) | 510 |
|
||||
| 50 | 20x (1,000) | 1,050 |
|
||||
| 200 | 10x (2,000) | 2,200 |
|
||||
| 500+ | 5x (2,500) | 3,000 |
|
||||
|
||||
Principle: fewer new samples require higher old data ratio. Stabilize at 5x once pool reaches 500+.
|
||||
|
||||
Old samples are randomly sampled from the original 25K each time, ensuring broad coverage.
|
||||
|
||||
## 2. Model Version Management
|
||||
|
||||
```
|
||||
base_v1.pt (original 25K training)
|
||||
+-- ft_v1.1.pt (base + fine-tune batch 1)
|
||||
+-- ft_v1.2.pt (base + fine-tune batch 1+2)
|
||||
+-- ...
|
||||
|
||||
When fine-tune pool reaches 2000+ samples:
|
||||
base_v2.pt (original 25K + all accumulated samples, trained from scratch)
|
||||
+-- ft_v2.1.pt
|
||||
+-- ...
|
||||
```
|
||||
|
||||
CRITICAL: Never chain fine-tunes (ft_v1.1 -> ft_v1.2 -> ft_v1.3). Always start from the base model to avoid cumulative drift.
|
||||
|
||||
## 3. Fine-Tuning Parameters
|
||||
|
||||
```yaml
|
||||
base_model: best.pt # always start from base model
|
||||
epochs: 10 # few epochs are sufficient
|
||||
lr0: 0.001 # 1/10 of base training lr
|
||||
freeze: 10 # freeze first 10 backbone layers
|
||||
warmup_epochs: 1
|
||||
cos_lr: true
|
||||
|
||||
# data mixing
|
||||
new_samples: all # entire fine-tune pool
|
||||
old_samples: min(5x_new, 3000) # old data sampling, cap at 3000
|
||||
```
|
||||
|
||||
### Why These Settings
|
||||
|
||||
| Parameter | Rationale |
|
||||
|-----------|-----------|
|
||||
| `epochs: 10` | More than enough for small datasets; prevents overfitting |
|
||||
| `lr0: 0.001` | Low learning rate preserves base model knowledge |
|
||||
| `freeze: 10` | Backbone features are general; only fine-tune detection head and later layers |
|
||||
| `cos_lr: true` | Smooth decay prevents sharp weight updates |
|
||||
|
||||
## 4. Deployment Gating (Most Important)
|
||||
|
||||
Every fine-tuned model MUST pass three gates before deployment:
|
||||
|
||||
### Gate 1: Regression Validation
|
||||
|
||||
Run evaluation on the original test set (held out from the 25K training data).
|
||||
|
||||
| mAP50 Change | Action |
|
||||
|-------------|--------|
|
||||
| Drop < 1% | PASS - deploy |
|
||||
| Drop 1-3% | REVIEW - human inspection required |
|
||||
| Drop > 3% | REJECT - do not deploy |
|
||||
|
||||
### Gate 2: New Sample Validation
|
||||
|
||||
Run inference on the new failure documents.
|
||||
|
||||
| Detection Rate | Action |
|
||||
|---------------|--------|
|
||||
| > 80% correct | PASS |
|
||||
| < 80% correct | REVIEW - check label quality or increase training |
|
||||
|
||||
### Gate 3: A/B Comparison (Optional)
|
||||
|
||||
Sample 100 production documents, run both old and new models:
|
||||
- New model must not be worse on any field type
|
||||
- Compare per-class mAP to detect targeted regressions
|
||||
|
||||
## 5. Fine-Tuning Frequency
|
||||
|
||||
| Strategy | Trigger | Recommendation |
|
||||
|----------|---------|---------------|
|
||||
| **By volume (recommended)** | Pool reaches 50+ new samples | Best signal-to-noise ratio |
|
||||
| By schedule | Weekly or monthly | Predictable but may trigger with insufficient data |
|
||||
| By performance | Monitored accuracy drops below threshold | Reactive, requires monitoring infrastructure |
|
||||
|
||||
Do NOT fine-tune daily with fewer than 50 samples. The noise outweighs the signal.
|
||||
|
||||
## 6. Complete Workflow
|
||||
|
||||
```
|
||||
User marks failed document
|
||||
|
|
||||
v
|
||||
Human reviews and labels annotations
|
||||
|
|
||||
v
|
||||
Add to fine-tune pool
|
||||
|
|
||||
v
|
||||
Pool >= 50 samples? --NO--> Wait for more samples
|
||||
|
|
||||
YES
|
||||
|
|
||||
v
|
||||
Prepare mixed dataset:
|
||||
- All samples from fine-tune pool
|
||||
- Random sample 5x from original 25K
|
||||
|
|
||||
v
|
||||
Fine-tune from base.pt:
|
||||
- 10 epochs
|
||||
- lr0 = 0.001
|
||||
- freeze first 10 layers
|
||||
|
|
||||
v
|
||||
Gate 1: Original test set mAP drop < 1%?
|
||||
|
|
||||
PASS
|
||||
|
|
||||
v
|
||||
Gate 2: New sample detection rate > 80%?
|
||||
|
|
||||
PASS
|
||||
|
|
||||
v
|
||||
Deploy new model, retain old model for rollback
|
||||
|
|
||||
v
|
||||
Pool accumulated 2000+ samples?
|
||||
|
|
||||
YES --> Merge all data, train new base from scratch
|
||||
```
|
||||
|
||||
## 7. Monitoring in Production
|
||||
|
||||
Track these metrics continuously:
|
||||
|
||||
| Metric | Purpose | Alert Threshold |
|
||||
|--------|---------|----------------|
|
||||
| Detection rate per field | Catch field-specific regressions | < 90% for any field |
|
||||
| Average confidence score | Detect model uncertainty drift | Drop > 5% from baseline |
|
||||
| User-reported failures / week | Measure improvement trend | Increasing over 3 weeks |
|
||||
| Inference latency | Ensure model size hasn't bloated | > 2x baseline |
|
||||
|
||||
## 8. Summary of Rules
|
||||
|
||||
| Rule | Practice |
|
||||
|------|----------|
|
||||
| Never chain fine-tunes | Always start from base.pt |
|
||||
| Never use only new data | Must mix with old data |
|
||||
| Never fine-tune on < 50 samples | Accumulate before triggering |
|
||||
| Never auto-deploy | Must pass gating validation |
|
||||
| Never discard old models | Retain versions for rollback |
|
||||
| Periodically retrain base | Merge all data at 2000+ new samples |
|
||||
| Always human-review labels | Bad labels are worse than no labels |
|
||||
Reference in New Issue
Block a user