kai/invoice-master-poc-v2

Fork 0

Files

Yaojia Wang ad5ed46b4c WIP

2026-02-11 23:40:38 +01:00

5.8 KiB

Raw Blame History

YOLO Model Fine-Tuning Best Practices

Production guide for continuous fine-tuning of YOLO object detection models with user feedback.

Overview

When users report failed detections, those documents are collected, reviewed, and used to incrementally improve the model without degrading performance on existing data.

Key risks:

Catastrophic forgetting: model forgets original training after fine-tuning on small new data
Cumulative drift: repeated fine-tuning sessions cause progressive degradation
Overfitting: few samples + many epochs = memorizing noise

1. Data Management

Original training set (25K) --> permanently retained as "anchor dataset"
         |
User-reported failures --> human review & labeling --> "fine-tune pool"
         |
Fine-tune pool accumulates over time, never deleted

Every new sample MUST be human-verified before entering the fine-tune pool. Incorrect labels are more harmful than no labels.

Data Mixing Ratios

Accumulated New Samples	Old Data Multiplier	Total Training Size
10	50x (500)	510
50	20x (1,000)	1,050
200	10x (2,000)	2,200
500+	5x (2,500)	3,000

Principle: fewer new samples require higher old data ratio. Stabilize at 5x once pool reaches 500+.

Old samples are randomly sampled from the original 25K each time, ensuring broad coverage.

2. Model Version Management

base_v1.pt (original 25K training)
  +-- ft_v1.1.pt (base + fine-tune batch 1)
  +-- ft_v1.2.pt (base + fine-tune batch 1+2)
  +-- ...

When fine-tune pool reaches 2000+ samples:
base_v2.pt (original 25K + all accumulated samples, trained from scratch)
  +-- ft_v2.1.pt
  +-- ...

CRITICAL: Never chain fine-tunes (ft_v1.1 -> ft_v1.2 -> ft_v1.3). Always start from the base model to avoid cumulative drift.

3. Fine-Tuning Parameters

base_model: best.pt           # always start from base model
epochs: 10                    # few epochs are sufficient
lr0: 0.001                    # 1/10 of base training lr
freeze: 10                    # freeze first 10 backbone layers
warmup_epochs: 1
cos_lr: true

# data mixing
new_samples: all              # entire fine-tune pool
old_samples: min(5x_new, 3000) # old data sampling, cap at 3000

Why These Settings

Parameter	Rationale
`epochs: 10`	More than enough for small datasets; prevents overfitting
`lr0: 0.001`	Low learning rate preserves base model knowledge
`freeze: 10`	Backbone features are general; only fine-tune detection head and later layers
`cos_lr: true`	Smooth decay prevents sharp weight updates

4. Deployment Gating (Most Important)

Every fine-tuned model MUST pass three gates before deployment:

Gate 1: Regression Validation

Run evaluation on the original test set (held out from the 25K training data).

mAP50 Change	Action
Drop < 1%	PASS - deploy
Drop 1-3%	REVIEW - human inspection required
Drop > 3%	REJECT - do not deploy

Gate 2: New Sample Validation

Run inference on the new failure documents.

Detection Rate	Action
> 80% correct	PASS
< 80% correct	REVIEW - check label quality or increase training

Gate 3: A/B Comparison (Optional)

Sample 100 production documents, run both old and new models:

New model must not be worse on any field type
Compare per-class mAP to detect targeted regressions

5. Fine-Tuning Frequency

Strategy	Trigger	Recommendation
By volume (recommended)	Pool reaches 50+ new samples	Best signal-to-noise ratio
By schedule	Weekly or monthly	Predictable but may trigger with insufficient data
By performance	Monitored accuracy drops below threshold	Reactive, requires monitoring infrastructure

Do NOT fine-tune daily with fewer than 50 samples. The noise outweighs the signal.

6. Complete Workflow

User marks failed document
       |
       v
Human reviews and labels annotations
       |
       v
Add to fine-tune pool
       |
       v
Pool >= 50 samples? --NO--> Wait for more samples
       |
      YES
       |
       v
Prepare mixed dataset:
  - All samples from fine-tune pool
  - Random sample 5x from original 25K
       |
       v
Fine-tune from base.pt:
  - 10 epochs
  - lr0 = 0.001
  - freeze first 10 layers
       |
       v
Gate 1: Original test set mAP drop < 1%?
       |
      PASS
       |
       v
Gate 2: New sample detection rate > 80%?
       |
      PASS
       |
       v
Deploy new model, retain old model for rollback
       |
       v
Pool accumulated 2000+ samples?
       |
      YES --> Merge all data, train new base from scratch

7. Monitoring in Production

Track these metrics continuously:

Metric	Purpose	Alert Threshold
Detection rate per field	Catch field-specific regressions	< 90% for any field
Average confidence score	Detect model uncertainty drift	Drop > 5% from baseline
User-reported failures / week	Measure improvement trend	Increasing over 3 weeks
Inference latency	Ensure model size hasn't bloated	> 2x baseline

8. Summary of Rules

Rule	Practice
Never chain fine-tunes	Always start from base.pt
Never use only new data	Must mix with old data
Never fine-tune on < 50 samples	Accumulate before triggering
Never auto-deploy	Must pass gating validation
Never discard old models	Retain versions for rollback
Periodically retrain base	Merge all data at 2000+ new samples
Always human-review labels	Bad labels are worse than no labels

5.8 KiB Raw Blame History