Files
invoice-master-poc-v2/docs/fine-tuning-best-practices.md
Yaojia Wang ad5ed46b4c WIP
2026-02-11 23:40:38 +01:00

5.8 KiB

YOLO Model Fine-Tuning Best Practices

Production guide for continuous fine-tuning of YOLO object detection models with user feedback.

Overview

When users report failed detections, those documents are collected, reviewed, and used to incrementally improve the model without degrading performance on existing data.

Key risks:

  • Catastrophic forgetting: model forgets original training after fine-tuning on small new data
  • Cumulative drift: repeated fine-tuning sessions cause progressive degradation
  • Overfitting: few samples + many epochs = memorizing noise

1. Data Management

Original training set (25K) --> permanently retained as "anchor dataset"
         |
User-reported failures --> human review & labeling --> "fine-tune pool"
         |
Fine-tune pool accumulates over time, never deleted

Every new sample MUST be human-verified before entering the fine-tune pool. Incorrect labels are more harmful than no labels.

Data Mixing Ratios

Accumulated New Samples Old Data Multiplier Total Training Size
10 50x (500) 510
50 20x (1,000) 1,050
200 10x (2,000) 2,200
500+ 5x (2,500) 3,000

Principle: fewer new samples require higher old data ratio. Stabilize at 5x once pool reaches 500+.

Old samples are randomly sampled from the original 25K each time, ensuring broad coverage.

2. Model Version Management

base_v1.pt (original 25K training)
  +-- ft_v1.1.pt (base + fine-tune batch 1)
  +-- ft_v1.2.pt (base + fine-tune batch 1+2)
  +-- ...

When fine-tune pool reaches 2000+ samples:
base_v2.pt (original 25K + all accumulated samples, trained from scratch)
  +-- ft_v2.1.pt
  +-- ...

CRITICAL: Never chain fine-tunes (ft_v1.1 -> ft_v1.2 -> ft_v1.3). Always start from the base model to avoid cumulative drift.

3. Fine-Tuning Parameters

base_model: best.pt           # always start from base model
epochs: 10                    # few epochs are sufficient
lr0: 0.001                    # 1/10 of base training lr
freeze: 10                    # freeze first 10 backbone layers
warmup_epochs: 1
cos_lr: true

# data mixing
new_samples: all              # entire fine-tune pool
old_samples: min(5x_new, 3000) # old data sampling, cap at 3000

Why These Settings

Parameter Rationale
epochs: 10 More than enough for small datasets; prevents overfitting
lr0: 0.001 Low learning rate preserves base model knowledge
freeze: 10 Backbone features are general; only fine-tune detection head and later layers
cos_lr: true Smooth decay prevents sharp weight updates

4. Deployment Gating (Most Important)

Every fine-tuned model MUST pass three gates before deployment:

Gate 1: Regression Validation

Run evaluation on the original test set (held out from the 25K training data).

mAP50 Change Action
Drop < 1% PASS - deploy
Drop 1-3% REVIEW - human inspection required
Drop > 3% REJECT - do not deploy

Gate 2: New Sample Validation

Run inference on the new failure documents.

Detection Rate Action
> 80% correct PASS
< 80% correct REVIEW - check label quality or increase training

Gate 3: A/B Comparison (Optional)

Sample 100 production documents, run both old and new models:

  • New model must not be worse on any field type
  • Compare per-class mAP to detect targeted regressions

5. Fine-Tuning Frequency

Strategy Trigger Recommendation
By volume (recommended) Pool reaches 50+ new samples Best signal-to-noise ratio
By schedule Weekly or monthly Predictable but may trigger with insufficient data
By performance Monitored accuracy drops below threshold Reactive, requires monitoring infrastructure

Do NOT fine-tune daily with fewer than 50 samples. The noise outweighs the signal.

6. Complete Workflow

User marks failed document
       |
       v
Human reviews and labels annotations
       |
       v
Add to fine-tune pool
       |
       v
Pool >= 50 samples? --NO--> Wait for more samples
       |
      YES
       |
       v
Prepare mixed dataset:
  - All samples from fine-tune pool
  - Random sample 5x from original 25K
       |
       v
Fine-tune from base.pt:
  - 10 epochs
  - lr0 = 0.001
  - freeze first 10 layers
       |
       v
Gate 1: Original test set mAP drop < 1%?
       |
      PASS
       |
       v
Gate 2: New sample detection rate > 80%?
       |
      PASS
       |
       v
Deploy new model, retain old model for rollback
       |
       v
Pool accumulated 2000+ samples?
       |
      YES --> Merge all data, train new base from scratch

7. Monitoring in Production

Track these metrics continuously:

Metric Purpose Alert Threshold
Detection rate per field Catch field-specific regressions < 90% for any field
Average confidence score Detect model uncertainty drift Drop > 5% from baseline
User-reported failures / week Measure improvement trend Increasing over 3 weeks
Inference latency Ensure model size hasn't bloated > 2x baseline

8. Summary of Rules

Rule Practice
Never chain fine-tunes Always start from base.pt
Never use only new data Must mix with old data
Never fine-tune on < 50 samples Accumulate before triggering
Never auto-deploy Must pass gating validation
Never discard old models Retain versions for rollback
Periodically retrain base Merge all data at 2000+ new samples
Always human-review labels Bad labels are worse than no labels