WIP

2026-02-11 23:40:38 +01:00
parent f1a7bfe6b7
commit ad5ed46b4c
117 changed files with 5741 additions and 7669 deletions
--- a/docs/fine-tuning-best-practices.md
+++ b/docs/fine-tuning-best-practices.md
@@ -0,0 +1,185 @@
+# YOLO Model Fine-Tuning Best Practices
+
+Production guide for continuous fine-tuning of YOLO object detection models with user feedback.
+
+## Overview
+
+When users report failed detections, those documents are collected, reviewed, and used to incrementally improve the model without degrading performance on existing data.
+
+Key risks:
+- **Catastrophic forgetting**: the model loses what it learned in original training after fine-tuning on a small new dataset
+- **Cumulative drift**: repeated fine-tuning sessions cause progressive degradation
+- **Overfitting**: few samples + many epochs = memorizing noise
+
+## 1. Data Management
+
+```
+Original training set (25K) --> permanently retained as "anchor dataset"
+         |
+User-reported failures --> human review & labeling --> "fine-tune pool"
+         |
+Fine-tune pool accumulates over time, never deleted
+```
+
+Every new sample MUST be human-verified before entering the fine-tune pool. Incorrect labels are more harmful than no labels.
+
+### Data Mixing Ratios
+
+| Accumulated New Samples | Old Data Multiplier | Total Training Size |
+|------------------------|--------------------|--------------------|
+| 10                     | 50x (500)          | 510                |
+| 50                     | 20x (1,000)        | 1,050              |
+| 200                    | 10x (2,000)        | 2,200              |
+| 500+                   | 5x (2,500)         | 3,000              |
+
+Principle: the fewer the new samples, the higher the ratio of old to new data. Once the pool reaches 500 samples, hold the ratio at 5x.
+
+Old samples are randomly sampled from the original 25K each time, ensuring broad coverage.
+
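The mixing table can be expressed as a small helper. This is a minimal sketch, not the project's actual tooling: `old_sample_count` and `build_mixed_dataset` are hypothetical names, and the linear interpolation between table rows is an assumption, since the guide only specifies the four listed pool sizes. The 3,000 cap is taken from the `old_samples` setting in section 3.

```python
import random

# (pool size, old samples) pairs copied from the mixing table.
TABLE_POINTS = [(10, 500), (50, 1000), (200, 2000), (500, 2500)]
OLD_SAMPLE_CAP = 3000  # cap quoted in the fine-tuning parameters


def old_sample_count(num_new):
    """Return how many anchor samples to mix in for a pool of num_new samples."""
    if num_new < 0:
        raise ValueError("num_new must be non-negative")
    if num_new <= TABLE_POINTS[0][0]:
        return num_new * 50  # 50x tier for very small pools
    if num_new >= TABLE_POINTS[-1][0]:
        return min(num_new * 5, OLD_SAMPLE_CAP)  # steady-state 5x, capped
    # Assumption: interpolate linearly between neighbouring table rows.
    for (x0, y0), (x1, y1) in zip(TABLE_POINTS, TABLE_POINTS[1:]):
        if x0 <= num_new <= x1:
            return round(y0 + (y1 - y0) * (num_new - x0) / (x1 - x0))


def build_mixed_dataset(pool, anchor, seed=None):
    """Combine the whole fine-tune pool with a fresh random draw of anchor samples."""
    rng = random.Random(seed)
    anchor = list(anchor)
    k = min(old_sample_count(len(pool)), len(anchor))
    return list(pool) + rng.sample(anchor, k)
```

Calling `build_mixed_dataset` again with a different seed draws a different subset of the anchor set, which matches the guide's requirement that old samples are re-sampled for every fine-tuning run.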
+## 2. Model Version Management
+
+```
+base_v1.pt (original 25K training)
+  +-- ft_v1.1.pt (base + fine-tune batch 1)
+  +-- ft_v1.2.pt (base + fine-tune batch 1+2)
+  +-- ...
+
+When fine-tune pool reaches 2000+ samples:
+base_v2.pt (original 25K + all accumulated samples, trained from scratch)
+  +-- ft_v2.1.pt
+  +-- ...
+```
+
+CRITICAL: Never chain fine-tunes (ft_v1.1 -> ft_v1.2 -> ft_v1.3). Always start from the base model to avoid cumulative drift.
+
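The versioning rule can be captured in a few lines. The sketch below is illustrative only: `plan_next_model` and the returned dictionary keys are hypothetical, while the file naming and the 2000-sample threshold follow the tree above.

```python
RETRAIN_THRESHOLD = 2000  # pool size at which a new base model is trained


def plan_next_model(base_version, ft_count, pool_size):
    """Decide what to train next and which weights to start from.

    base_version: current base model number (1 for base_v1.pt)
    ft_count:     fine-tunes already produced from that base
    pool_size:    number of samples in the fine-tune pool
    """
    if pool_size >= RETRAIN_THRESHOLD:
        # New base: trained from scratch, so there are no initial weights.
        return {
            "action": "retrain_base",
            "init_weights": None,
            "output": f"base_v{base_version + 1}.pt",
        }
    # Every fine-tune starts from the base model, never from ft_vX.Y.
    return {
        "action": "fine_tune",
        "init_weights": f"base_v{base_version}.pt",
        "output": f"ft_v{base_version}.{ft_count + 1}.pt",
    }
```

Because `init_weights` is derived from `base_version` alone, there is no code path that starts from an earlier fine-tuned checkpoint.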
+## 3. Fine-Tuning Parameters
+
+```yaml
+base_model: best.pt           # best.pt of the base training run; never a previous fine-tune
+epochs: 10                    # few epochs are sufficient
+lr0: 0.001                    # 1/10 of base training lr
+freeze: 10                    # freeze first 10 backbone layers
+warmup_epochs: 1
+cos_lr: true
+
+# data mixing
+new_samples: all              # entire fine-tune pool
+old_samples: min(ratio * new, 3000) # ratio from the mixing table (5x at 500+), cap at 3000
+```
+
+### Why These Settings
+
+| Parameter | Rationale |
+|-----------|-----------|
+| `epochs: 10` | Enough for a small dataset to converge; a short schedule limits overfitting |
+| `lr0: 0.001` | Low learning rate preserves base model knowledge |
+| `freeze: 10` | Early backbone features are general; only the later layers and the detection head are updated |
+| `cos_lr: true` | Cosine schedule decays the learning rate smoothly, avoiding abrupt changes in step size |
+
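The parameter names above (`lr0`, `freeze`, `cos_lr`, `warmup_epochs`) match the Ultralytics training arguments, so the run could be launched roughly as follows. This is a hedged sketch that assumes the Ultralytics Python package; `build_train_args`, `fine_tune`, and the dataset YAML path are hypothetical. The explicit `optimizer` value is also an assumption that is not part of the guide: in recent Ultralytics releases the default `optimizer="auto"` chooses its own learning rate and can ignore `lr0`, so it is worth checking this against the installed version.

```python
def build_train_args(data_yaml):
    """Return keyword arguments for the fine-tuning run described above."""
    return {
        "data": data_yaml,      # dataset YAML for the mixed old + new data
        "epochs": 10,
        "lr0": 0.001,
        "freeze": 10,           # freeze the first 10 layers
        "warmup_epochs": 1,
        "cos_lr": True,
        "optimizer": "SGD",     # assumption: set explicitly so lr0 is honoured
    }


def fine_tune(base_weights, data_yaml):
    """Fine-tune starting from the base weights (requires `pip install ultralytics`)."""
    from ultralytics import YOLO  # imported lazily so the helper above has no dependency

    model = YOLO(base_weights)
    return model.train(**build_train_args(data_yaml))
```

Keeping the arguments in a plain dictionary makes it easy to log them alongside each model version.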
+## 4. Deployment Gating (Most Important)
+
+Every fine-tuned model MUST pass three gates before deployment:
+
+### Gate 1: Regression Validation
+
+Run evaluation on the original test set (held out from the 25K training data).
+
+| mAP50 Change | Action |
+|-------------|--------|
+| Drop < 1%   | PASS - deploy |
+| Drop 1-3%   | REVIEW - human inspection required |
+| Drop > 3%   | REJECT - do not deploy |
+
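Gate 1 reduces to a three-way threshold check. The sketch below assumes mAP50 values are fractions between 0 and 1 and that "drop" means an absolute difference in percentage points; the guide does not say whether the drop is absolute or relative, so that reading is an assumption. `regression_gate` is a hypothetical name.

```python
def regression_gate(base_map50, new_map50):
    """Classify the change in mAP50 on the original test set."""
    # Positive drop means the fine-tuned model is worse than the base model.
    drop_points = (base_map50 - new_map50) * 100
    if drop_points < 1.0:
        return "PASS"
    if drop_points <= 3.0:
        return "REVIEW"
    return "REJECT"
```

An improvement yields a negative drop and therefore passes.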
+### Gate 2: New Sample Validation
+
+Run inference on the new failure documents.
+
+| Detection Rate | Action |
+|---------------|--------|
+| > 80% correct | PASS |
+| <= 80% correct | REVIEW - check label quality or increase training |
+
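Gate 2 can be checked the same way. In this sketch, `new_sample_gate` is a hypothetical helper and each entry of `results` records whether one failure document was detected correctly; how "correct" is decided per document is left to the caller, since the guide does not define it.

```python
def new_sample_gate(results, threshold=0.80):
    """Return PASS when more than `threshold` of the new samples are correct."""
    if not results:
        raise ValueError("no new samples to evaluate")
    rate = sum(1 for ok in results if ok) / len(results)
    return "PASS" if rate > threshold else "REVIEW"
```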
+### Gate 3: A/B Comparison (Optional)
+
+Sample 100 production documents, run both old and new models:
+- New model must not be worse on any field type
+- Compare per-class mAP to detect targeted regressions
+
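The per-class comparison in Gate 3 might look like the following. This is a sketch under stated assumptions: the per-class mAP values are supplied as dictionaries keyed by class name, the class names in the usage example are invented, and `per_class_regressions` is a hypothetical helper. A class that is missing from the new model's results is treated as a score of 0.

```python
def per_class_regressions(old_scores, new_scores, tolerance=0.0):
    """Return {class_name: (old, new)} for every class where the new model is worse."""
    regressions = {}
    for name, old_value in old_scores.items():
        new_value = new_scores.get(name, 0.0)
        if new_value < old_value - tolerance:
            regressions[name] = (old_value, new_value)
    return regressions


# Usage with invented class names and scores:
old = {"invoice_number": 0.95, "amount": 0.90}
new = {"invoice_number": 0.96, "amount": 0.85}
print(per_class_regressions(old, new))
```

An empty result means no class got worse, which is the condition the gate asks for. A small non-zero `tolerance` can absorb evaluation noise on a 100-document sample.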
+## 5. Fine-Tuning Frequency
+
+| Strategy | Trigger | Recommendation |
+|----------|---------|---------------|
+| **By volume (recommended)** | Pool reaches 50+ new samples | Best signal-to-noise ratio |
+| By schedule | Weekly or monthly | Predictable but may trigger with insufficient data |
+| By performance | Monitored accuracy drops below threshold | Reactive, requires monitoring infrastructure |
+
+Do NOT fine-tune daily with fewer than 50 samples. The noise outweighs the signal.
+
+## 6. Complete Workflow
+
+```
+User marks failed document
+       |
+       v
+Human reviews and labels annotations
+       |
+       v
+Add to fine-tune pool
+       |
+       v
+Pool >= 50 samples? --NO--> Wait for more samples
+       |
+      YES
+       |
+       v
+Prepare mixed dataset:
+  - All samples from fine-tune pool
+  - Random sample from original 25K (ratio per mixing table; 5x at 500+)
+       |
+       v
+Fine-tune from base.pt:
+  - 10 epochs
+  - lr0 = 0.001
+  - freeze first 10 layers
+       |
+       v
+Gate 1: Original test set mAP drop < 1%?
+       |
+      PASS
+       |
+       v
+Gate 2: New sample detection rate > 80%?
+       |
+      PASS
+       |
+       v
+Deploy new model, retain old model for rollback
+       |
+       v
+Pool has accumulated 2000+ samples?
+       |
+      YES --> Merge all data, train new base from scratch
+```
+
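The flowchart can be sketched as a single orchestration function. Everything here is illustrative: `run_cycle` is a hypothetical name, the training, gating, and deployment steps are passed in as callables rather than implemented, and the `BLOCKED_*` statuses are an assumption because the flowchart only shows the PASS path.

```python
MIN_POOL_SIZE = 50        # volume trigger from section 5
RETRAIN_POOL_SIZE = 2000  # base retraining threshold from section 2


def run_cycle(pool_size, fine_tune, gate1, gate2, deploy):
    """Run one pass of the workflow and return (status, retrain_base_due).

    fine_tune() returns a candidate model; gate1 and gate2 take the
    candidate and return "PASS" or another verdict; deploy publishes it.
    """
    if pool_size < MIN_POOL_SIZE:
        return "WAIT", False
    candidate = fine_tune()  # caller's function; expected to start from the base weights
    if gate1(candidate) != "PASS":
        return "BLOCKED_GATE_1", False
    if gate2(candidate) != "PASS":
        return "BLOCKED_GATE_2", False
    deploy(candidate)  # caller's function; expected to keep the old model for rollback
    return "DEPLOYED", pool_size >= RETRAIN_POOL_SIZE
```

Passing the steps in as callables keeps the control flow testable without a GPU or a trained model.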
+## 7. Monitoring in Production
+
+Track these metrics continuously:
+
+| Metric | Purpose | Alert Threshold |
+|--------|---------|----------------|
+| Detection rate per field | Catch field-specific regressions | < 90% for any field |
+| Average confidence score | Detect model uncertainty drift | Drop > 5% from baseline |
+| User-reported failures / week | Measure improvement trend | Increasing over 3 weeks |
+| Inference latency | Ensure model size hasn't bloated | > 2x baseline |
+
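The alert thresholds in the table can be checked with a small function. This is a sketch with several assumptions: `check_alerts` and its parameters are hypothetical, the 5% confidence drop is read as relative to the baseline, and "increasing over 3 weeks" is read as three consecutive week-over-week increases, which needs four weekly counts.

```python
def check_alerts(field_rates, avg_conf, baseline_conf,
                 weekly_failures, latency_ms, baseline_latency_ms):
    """Return a list of alert messages for the monitored metrics."""
    alerts = []
    for field, rate in field_rates.items():
        if rate < 0.90:
            alerts.append(f"detection rate below 90% for field '{field}'")
    if avg_conf < baseline_conf * 0.95:
        alerts.append("average confidence dropped more than 5% from baseline")
    recent = weekly_failures[-4:]  # four counts give three week-over-week changes
    if len(recent) == 4 and all(b > a for a, b in zip(recent, recent[1:])):
        alerts.append("user-reported failures increased for 3 consecutive weeks")
    if latency_ms > 2 * baseline_latency_ms:
        alerts.append("inference latency above 2x baseline")
    return alerts
```

An empty list means no threshold in the table was crossed.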
+## 8. Summary of Rules
+
+| Rule | Practice |
+|------|----------|
+| Never chain fine-tunes | Always start from base.pt |
+| Never use only new data | Must mix with old data |
+| Never fine-tune on < 50 samples | Accumulate before triggering |
+| Never auto-deploy | Must pass gating validation |
+| Never discard old models | Retain versions for rollback |
+| Periodically retrain base | Merge all data at 2000+ new samples |
+| Always human-review labels | Bad labels are worse than no labels |