
# YOLO Model Fine-Tuning Best Practices

Production guide for continuous fine-tuning of YOLO object detection models with user feedback.

## Overview

When users report failed detections, those documents are collected, reviewed, and used to incrementally improve the model without degrading performance on existing data.

Key risks:
- **Catastrophic forgetting**: the model loses what it learned in original training after fine-tuning on a small new dataset
- **Cumulative drift**: repeated fine-tuning sessions cause progressive degradation
- **Overfitting**: few samples + many epochs = memorizing noise

## 1. Data Management

```
Original training set (25K) --> permanently retained as "anchor dataset"
         |
User-reported failures --> human review & labeling --> "fine-tune pool"
         |
Fine-tune pool accumulates over time, never deleted
```

Every new sample MUST be human-verified before entering the fine-tune pool. Incorrect labels are more harmful than no labels.

### Data Mixing Ratios

| Accumulated New Samples | Old Data Multiplier | Total Training Size |
|------------------------|--------------------|--------------------|
| 10                     | 50x (500)          | 510                |
| 50                     | 20x (1,000)        | 1,050              |
| 200                    | 10x (2,000)        | 2,200              |
| 500+                   | 5x (2,500)         | 3,000              |

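Principle: the fewer new samples there are, the higher the ratio of old data must be. The multiplier settles at 5x once the pool reaches 500+ samples. The 10-sample row only shows how the ratio scales; Section 5 recommends waiting for at least 50 samples before fine-tuning.

Old samples are re-drawn at random from the original 25K for every fine-tuning run, so repeated runs cover the anchor dataset broadly.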
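
The mixing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual data loader: the function names are hypothetical, the step-wise tiers between the table rows are an assumption (the table only gives values at 10, 50, 200, and 500+), and the 3,000 cap on old samples is taken from the config in Section 3.

```python
import random

# (minimum pool size, old-data multiplier), following the table rows.
# Treating the rows as step-wise tiers is an assumption of this sketch.
MIX_TIERS = [(500, 5), (200, 10), (50, 20), (0, 50)]


def old_data_multiplier(n_new: int) -> int:
    """Return the old-data multiplier for a pool of n_new samples."""
    for min_size, multiplier in MIX_TIERS:
        if n_new >= min_size:
            return multiplier
    raise ValueError("n_new must be non-negative")


def build_mixed_dataset(new_samples, anchor_samples, cap=3000, seed=None):
    """Combine the whole fine-tune pool with a fresh random draw of old data."""
    rng = random.Random(seed)
    wanted = old_data_multiplier(len(new_samples)) * len(new_samples)
    n_old = min(wanted, cap, len(anchor_samples))
    # Sampling without replacement, re-drawn on every call.
    old = rng.sample(anchor_samples, n_old)
    return list(new_samples) + old
```

Leaving `seed` unset gives a different draw of old samples on each run, which matches the "randomly sampled each time" rule; passing a seed makes a run reproducible.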

## 2. Model Version Management

```
base_v1.pt (original 25K training)
  +-- ft_v1.1.pt (base + fine-tune batch 1)
  +-- ft_v1.2.pt (base + fine-tune batch 1+2)
  +-- ...

When fine-tune pool reaches 2000+ samples:
base_v2.pt (original 25K + all accumulated samples, trained from scratch)
  +-- ft_v2.1.pt
  +-- ...
```

CRITICAL: Never chain fine-tunes (ft_v1.1 -> ft_v1.2 -> ft_v1.3). Always start from the base model to avoid cumulative drift.
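
One way to make the no-chaining rule hard to break is to derive both the starting checkpoint and the output name from the base model alone. The sketch below assumes the `base_vX.pt` / `ft_vX.N.pt` naming shown in the tree above; the function name `plan_finetune` is hypothetical.

```python
import re


def plan_finetune(base_name: str, existing: list[str]) -> dict:
    """Plan the next fine-tune: always start from base, never from a prior ft_*."""
    match = re.fullmatch(r"base_v(\d+)\.pt", base_name)
    if match is None:
        raise ValueError(f"not a base checkpoint: {base_name}")
    major = match.group(1)
    pattern = re.compile(rf"ft_v{major}\.(\d+)\.pt")
    # Collect the minor versions already produced from this base.
    minors = [int(m.group(1)) for name in existing if (m := pattern.fullmatch(name))]
    return {
        "start_from": base_name,
        "save_as": f"ft_v{major}.{max(minors, default=0) + 1}.pt",
    }
```

Because a fine-tuned checkpoint name is rejected as `base_name`, a call such as `plan_finetune("ft_v1.1.pt", [])` raises instead of silently chaining.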

## 3. Fine-Tuning Parameters

```yaml
base_model: best.pt           # always start from base model
epochs: 10                    # few epochs are sufficient
lr0: 0.001                    # 1/10 of base training lr
freeze: 10                    # freeze first 10 backbone layers
warmup_epochs: 1
cos_lr: true

# data mixing
new_samples: all              # entire fine-tune pool
old_samples: min(ratio * new, 3000) # ratio from Data Mixing Ratios (5x at 500+), cap at 3000
```

### Why These Settings

| Parameter | Rationale |
|-----------|-----------|
| `epochs: 10` | Enough for a few thousand mixed samples; keeping the count low limits overfitting |
| `lr0: 0.001` | Low learning rate preserves base model knowledge |
| `freeze: 10` | Backbone features are general; only fine-tune detection head and later layers |
| `cos_lr: true` | Cosine decay lowers the learning rate smoothly and avoids abrupt weight updates late in training |
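
As a rough illustration, these settings map onto the Ultralytics Python API roughly as follows. This assumes the project trains with the `ultralytics` package; the dataset file name `mixed.yaml` and the helper names are hypothetical.

```python
def finetune_args(data_yaml: str) -> dict:
    """Training arguments mirroring the config in this section."""
    return dict(
        data=data_yaml,     # dataset YAML describing the mixed old + new data
        epochs=10,
        lr0=0.001,
        freeze=10,          # freeze the first 10 layers
        warmup_epochs=1,
        cos_lr=True,
    )


def run_finetune(base_weights: str, data_yaml: str):
    """Fine-tune from the base checkpoint (never from a previous fine-tune)."""
    # Imported lazily so finetune_args() can be used without the package installed.
    from ultralytics import YOLO

    model = YOLO(base_weights)
    return model.train(**finetune_args(data_yaml))


if __name__ == "__main__":
    print(finetune_args("mixed.yaml"))
```

One thing worth checking in the training log: depending on the Ultralytics version, the default `optimizer="auto"` setting may choose its own learning rate and ignore `lr0`. If that happens, passing an explicit optimizer is one option, though which optimizer suits this model is not something this guide specifies.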

## 4. Deployment Gating (Most Important)

Every fine-tuned model MUST pass Gates 1 and 2 before deployment. Gate 3 is an optional extra check.

### Gate 1: Regression Validation

Run evaluation on the original test set (held out from the 25K training data).

| mAP50 Change | Action |
|-------------|--------|
| Drop < 1%   | PASS - continue to Gate 2 |
| Drop 1-3%   | REVIEW - human inspection required |
| Drop > 3%   | REJECT - do not deploy |
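
The thresholds in this table can be encoded as a small decision function. The sketch assumes mAP50 values on a 0 to 1 scale and reads "drop" as an absolute difference in percentage points; if the team means a relative drop, the calculation would need to change. The function name is hypothetical.

```python
def regression_gate(base_map50: float, new_map50: float) -> str:
    """Classify a fine-tuned model by its mAP50 drop on the original test set."""
    # Drop in percentage points (assumption: absolute, not relative).
    drop = (base_map50 - new_map50) * 100
    if drop < 1.0:
        return "PASS"      # includes the case where mAP50 improved
    if drop <= 3.0:
        return "REVIEW"    # human inspection required
    return "REJECT"
```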

### Gate 2: New Sample Validation

Run inference on the new failure documents.

| Detection Rate | Action |
|---------------|--------|
| > 80% correct | PASS |
| <= 80% correct | REVIEW - check label quality or increase training |
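
A minimal sketch of this gate, assuming each new failure document has already been judged correct or incorrect against its reviewed labels. How "correct" is decided per document is left to the caller, and treating exactly 80% as REVIEW is an assumption of this sketch.

```python
def new_sample_gate(results: list[bool], threshold: float = 0.80) -> tuple[float, str]:
    """Return (detection rate, decision) for the new failure documents."""
    if not results:
        raise ValueError("no new samples to evaluate")
    rate = sum(results) / len(results)
    # Strictly above the threshold passes; anything else goes to review.
    return rate, "PASS" if rate > threshold else "REVIEW"
```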

### Gate 3: A/B Comparison (Optional)

Sample 100 production documents, run both old and new models:
- New model must not be worse on any field type
- Compare per-class mAP to detect targeted regressions
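
The per-field comparison could look like the sketch below. It assumes per-class scores for both models are already available as dictionaries keyed by field name; the field names in the test data and the optional `tolerance` parameter are hypothetical.

```python
def per_class_regressions(
    old_scores: dict[str, float],
    new_scores: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return {field: score drop} for every field where the new model is worse."""
    regressions = {}
    for field, old_score in old_scores.items():
        # A field missing from the new results counts as a full regression.
        new_score = new_scores.get(field, 0.0)
        if new_score < old_score - tolerance:
            regressions[field] = old_score - new_score
    return regressions
```

An empty result means the new model is at least as good as the old one on every field, which is the pass condition stated above. On only 100 documents, small differences may be noise, which is why a non-zero `tolerance` might be worth considering.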

## 5. Fine-Tuning Frequency

| Strategy | Trigger | Recommendation |
|----------|---------|---------------|
| **By volume (recommended)** | Pool reaches 50+ new samples | Best signal-to-noise ratio |
| By schedule | Weekly or monthly | Predictable but may trigger with insufficient data |
| By performance | Monitored accuracy drops below threshold | Reactive, requires monitoring infrastructure |

Do NOT fine-tune daily with fewer than 50 samples. The noise outweighs the signal.
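
The trigger logic can be sketched as a single check. This version combines the volume rule with an optional schedule, so a scheduled run is skipped when fewer than 50 samples have accumulated. The parameter names are hypothetical, and counting samples "since the last run" is an assumption about how the pool is tracked.

```python
from typing import Optional


def should_finetune(
    new_samples: int,
    days_since_last_run: int,
    interval_days: Optional[int] = None,
    min_samples: int = 50,
) -> bool:
    """Decide whether to start a fine-tuning run."""
    if new_samples < min_samples:
        return False  # too few samples: noise outweighs signal
    if interval_days is None:
        return True   # pure volume trigger
    # Schedule trigger, still guarded by the minimum volume above.
    return days_since_last_run >= interval_days
```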

## 6. Complete Workflow

```
User marks failed document
       |
       v
Human reviews and labels annotations
       |
       v
Add to fine-tune pool
       |
       v
Pool >= 50 samples? --NO--> Wait for more samples
       |
      YES
       |
       v
Prepare mixed dataset:
  - All samples from fine-tune pool
  - Random sample from original 25K (multiplier per Data Mixing Ratios)
       |
       v
Fine-tune from base.pt:
  - 10 epochs
  - lr0 = 0.001
  - freeze first 10 layers
       |
       v
Gate 1: Original test set mAP50 drop < 1%? --NO--> Review (1-3%) or reject (> 3%)
       |
      PASS
       |
       v
Gate 2: New sample detection rate > 80%? --NO--> Review labels or training
       |
      PASS
       |
       v
Deploy new model, retain old model for rollback
       |
       v
Pool accumulated 2000+ samples?
       |
      YES --> Merge all data, train new base from scratch
```
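
The control flow of the diagram can be expressed as one function that takes each step as a callable. This is only a skeleton under stated assumptions: the step functions (`finetune`, `gate1`, `gate2`, `deploy`) and the status strings are hypothetical placeholders for whatever the project actually uses.

```python
from typing import Callable


def run_cycle(
    pool_size: int,
    finetune: Callable[[], str],       # trains from the base model, returns weights path
    gate1: Callable[[str], bool],      # regression check on the original test set
    gate2: Callable[[str], bool],      # detection rate on the new failure documents
    deploy: Callable[[str], None],     # caller keeps the old model for rollback
    min_pool: int = 50,
    rebase_pool: int = 2000,
) -> str:
    """Run one fine-tuning cycle and report what happened."""
    if pool_size < min_pool:
        return "WAIT"
    weights = finetune()
    if not gate1(weights):
        return "BLOCKED_GATE_1"
    if not gate2(weights):
        return "BLOCKED_GATE_2"
    deploy(weights)
    # Flag when the pool is large enough to retrain the base from scratch.
    return "DEPLOYED_RETRAIN_BASE_DUE" if pool_size >= rebase_pool else "DEPLOYED"
```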

## 7. Monitoring in Production

Track these metrics continuously:

| Metric | Purpose | Alert Threshold |
|--------|---------|----------------|
| Detection rate per field | Catch field-specific regressions | < 90% for any field |
| Average confidence score | Detect model uncertainty drift | Drop > 5% from baseline |
| User-reported failures / week | Measure improvement trend | Increasing over 3 weeks |
| Inference latency | Ensure model size hasn't bloated | > 2x baseline |
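
The alert thresholds in this table could be checked with a function like the one below. It is a sketch with hypothetical names and makes two assumptions: the confidence drop is relative to the baseline, and "increasing over 3 weeks" means the last three weekly counts are strictly increasing.

```python
def check_alerts(
    field_rates: dict[str, float],
    avg_confidence: float,
    baseline_confidence: float,
    weekly_failures: list[int],
    latency_ms: float,
    baseline_latency_ms: float,
) -> list[str]:
    """Return the names of all alerts that currently fire."""
    alerts = []
    for field, rate in field_rates.items():
        if rate < 0.90:
            alerts.append(f"detection_rate:{field}")
    # More than 5% below baseline (assumption: relative drop).
    if avg_confidence < baseline_confidence * 0.95:
        alerts.append("confidence_drop")
    last3 = weekly_failures[-3:]
    if len(last3) == 3 and last3[0] < last3[1] < last3[2]:
        alerts.append("failures_increasing")
    if latency_ms > 2 * baseline_latency_ms:
        alerts.append("latency")
    return alerts
```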

## 8. Summary of Rules

| Rule | Practice |
|------|----------|
| Never chain fine-tunes | Always start from base.pt |
| Never use only new data | Must mix with old data |
| Never fine-tune on < 50 samples | Accumulate before triggering |
| Never auto-deploy | Must pass gating validation |
| Never discard old models | Retain versions for rollback |
| Periodically retrain base | Merge all data at 2000+ new samples |
| Always human-review labels | Bad labels are worse than no labels |