# YOLO Model Fine-Tuning Best Practices

Production guide for continuous fine-tuning of YOLO object detection models with user feedback.

## Overview

When users report failed detections, those documents are collected, reviewed, and used to incrementally improve the model without degrading performance on existing data.

Key risks:

- **Catastrophic forgetting**: the model forgets its original training after fine-tuning on a small new dataset
- **Cumulative drift**: repeated fine-tuning sessions cause progressive degradation
- **Overfitting**: few samples + many epochs = memorizing noise

## 1. Data Management

```
Original training set (25K) --> permanently retained as "anchor dataset"
        |
User-reported failures --> human review & labeling --> "fine-tune pool"
        |
Fine-tune pool accumulates over time, never deleted
```

Every new sample MUST be human-verified before entering the fine-tune pool. Incorrect labels are more harmful than no labels.

### Data Mixing Ratios

| Accumulated New Samples | Old Data Multiplier | Total Training Size |
|------------------------|--------------------|--------------------|
| 10 | 50x (500) | 510 |
| 50 | 20x (1,000) | 1,050 |
| 200 | 10x (2,000) | 2,200 |
| 500+ | 5x (2,500) | 3,000 |

Principle: the fewer new samples there are, the higher the old-data ratio must be. Stabilize at 5x once the pool reaches 500+ samples.

Old samples are randomly re-sampled from the original 25K for each fine-tuning run, which ensures broad coverage over time.

## 2. Model Version Management

```
base_v1.pt (original 25K training)
  +-- ft_v1.1.pt (base + fine-tune batch 1)
  +-- ft_v1.2.pt (base + fine-tune batch 1+2)
  +-- ...

When fine-tune pool reaches 2000+ samples:

base_v2.pt (original 25K + all accumulated samples, trained from scratch)
  +-- ft_v2.1.pt
  +-- ...
```

CRITICAL: Never chain fine-tunes (ft_v1.1 -> ft_v1.2 -> ft_v1.3). Always start from the base model to avoid cumulative drift.

## 3. Fine-Tuning Parameters

```yaml
base_model: best.pt    # always start from base model
epochs: 10             # few epochs are sufficient
lr0: 0.001             # 1/10 of base training lr
freeze: 10             # freeze first 10 backbone layers
warmup_epochs: 1
cos_lr: true

# data mixing
new_samples: all                  # entire fine-tune pool
old_samples: min(5x_new, 3000)    # old data sampling: 5x at 500+ samples (higher multiplier for smaller pools, see Section 1), cap at 3000
```

### Why These Settings

| Parameter | Rationale |
|-----------|-----------|
| `epochs: 10` | More than enough for small datasets; limits overfitting |
| `lr0: 0.001` | Low learning rate preserves base model knowledge |
| `freeze: 10` | Backbone features are general; only fine-tune detection head and later layers |
| `cos_lr: true` | Smooth decay prevents sharp weight updates |

## 4. Deployment Gating (Most Important)

Every fine-tuned model MUST pass Gates 1 and 2 before deployment. Gate 3 is optional.

### Gate 1: Regression Validation

Run evaluation on the original test set (held out from the 25K training data).

| mAP50 Change | Action |
|-------------|--------|
| Drop < 1% | PASS - deploy |
| Drop 1-3% | REVIEW - human inspection required |
| Drop > 3% | REJECT - do not deploy |

### Gate 2: New Sample Validation

Run inference on the new failure documents.

| Detection Rate | Action |
|---------------|--------|
| > 80% correct | PASS |
| <= 80% correct | REVIEW - check label quality or increase training |

### Gate 3: A/B Comparison (Optional)

Sample 100 production documents and run both the old and new models:

- New model must not be worse on any field type
- Compare per-class mAP to detect targeted regressions

## 5. Fine-Tuning Frequency

| Strategy | Trigger | Recommendation |
|----------|---------|---------------|
| **By volume (recommended)** | Pool reaches 50+ new samples | Best signal-to-noise ratio |
| By schedule | Weekly or monthly | Predictable but may trigger with insufficient data |
| By performance | Monitored accuracy drops below threshold | Reactive, requires monitoring infrastructure |

Do NOT fine-tune daily with fewer than 50 samples. The noise outweighs the signal.

## 6. Complete Workflow

```
User marks failed document
        |
        v
Human reviews and labels annotations
        |
        v
Add to fine-tune pool
        |
        v
Pool >= 50 samples? --NO--> Wait for more samples
        | YES
        v
Prepare mixed dataset:
  - All samples from fine-tune pool
  - Random sample from original 25K (multiplier per Section 1)
        |
        v
Fine-tune from base.pt:
  - 10 epochs
  - lr0 = 0.001
  - freeze first 10 layers
        |
        v
Gate 1: Original test set mAP drop < 1%?
        | PASS
        v
Gate 2: New sample detection rate > 80%?
        | PASS
        v
Deploy new model, retain old model for rollback
        |
        v
Pool accumulated 2000+ samples?
        | YES --> Merge all data, train new base from scratch
```

## 7. Monitoring in Production

Track these metrics continuously:

| Metric | Purpose | Alert Threshold |
|--------|---------|----------------|
| Detection rate per field | Catch field-specific regressions | < 90% for any field |
| Average confidence score | Detect model uncertainty drift | Drop > 5% from baseline |
| User-reported failures / week | Measure improvement trend | Increasing over 3 weeks |
| Inference latency | Ensure model size hasn't bloated | > 2x baseline |

## 8. Summary of Rules

| Rule | Practice |
|------|----------|
| Never chain fine-tunes | Always start from base.pt |
| Never use only new data | Must mix with old data |
| Never fine-tune on < 50 samples | Accumulate before triggering |
| Never auto-deploy | Must pass gating validation |
| Never discard old models | Retain versions for rollback |
| Periodically retrain base | Merge all data at 2000+ new samples |
| Always human-review labels | Bad labels are worse than no labels |
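
The mixing ratios from Section 1 and the thresholds for Gates 1 and 2 from Section 4 could be sketched in Python roughly as follows. This is a minimal illustration, not the guide's actual tooling. All function and constant names are hypothetical. It assumes that each row of the mixing table applies until the next row's sample count is reached, that mAP50 is given as a fraction between 0 and 1, and that a "drop" is measured in percentage points. None of these assumptions is stated in the guide.

```python
import random

# (minimum new samples, old-data multiplier), read off the Data Mixing Ratios table.
# Assumption: each row applies until the next row's sample count is reached.
MIX_SCHEDULE = [(500, 5), (200, 10), (50, 20), (0, 50)]
OLD_SAMPLE_CAP = 3000  # cap taken from the fine-tuning parameters


def old_sample_count(n_new, anchor_size=25_000):
    """Number of anchor-set samples to mix with n_new fine-tune samples."""
    if n_new < 0:
        raise ValueError("n_new must be non-negative")
    multiplier = next(m for floor, m in MIX_SCHEDULE if n_new >= floor)
    return min(n_new * multiplier, OLD_SAMPLE_CAP, anchor_size)


def build_mixed_dataset(pool, anchor, seed=None):
    """Entire fine-tune pool plus a fresh random draw from the anchor set."""
    pool, anchor = list(pool), list(anchor)
    rng = random.Random(seed)
    return pool + rng.sample(anchor, old_sample_count(len(pool), len(anchor)))


def gate1(base_map50, new_map50):
    """Regression gate on the original test set.

    Assumption: mAP50 values are fractions (0-1) and the drop is in percentage points.
    """
    drop = (base_map50 - new_map50) * 100
    if drop < 1:
        return "PASS"
    if drop <= 3:
        return "REVIEW"
    return "REJECT"


def gate2(correct, total):
    """New-sample gate: PASS only if more than 80% of failure documents are now correct."""
    return "PASS" if total > 0 and correct / total > 0.80 else "REVIEW"
```

For example, `old_sample_count(50)` gives 1000, matching the second row of the table, and a pool of 1,000 samples hits the 3,000 cap. Passing a different `seed` to `build_mixed_dataset` on each run keeps the old-data draw fresh, in line with the guide's advice to re-sample the anchor set every time.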