WIP

2026-02-01 00:08:40 +01:00
parent 33ada0350d
commit a516de4320
90 changed files with 11642 additions and 398 deletions
--- a/docs/azure-deployment-guide.md
+++ b/docs/azure-deployment-guide.md
@@ -0,0 +1,567 @@
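+# Complete Guide to Deploying on Azure
+
+## Table of Contents
+- [Key Questions](#key-questions)
+- [Storage Options](#storage-options)
+- [Training Options](#training-options)
+- [Inference Options](#inference-options)
+- [Cost Comparison](#cost-comparison)
+- [Recommended Architecture](#recommended-architecture)
+- [Implementation Steps](#implementation-steps)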
+
+---
+
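+## Key Questions
+
+| Question | Answer |
+|------|------|
+| Can Azure Blob Storage be used for training? | Yes, mounted with BlobFuse2 |
+| Can training read from Blob in real time? | Yes, but a local cache is recommended |
+| Can Azure Blob be mounted on a local machine? | Yes, with Rclone (Windows) or BlobFuse2 (Linux) |
+| Are idle VMs billed? | Yes, a VM is billed for as long as it is running, even when idle |
+| How can I pay only for what I use? | Use Serverless GPU, or a Compute Cluster with min=0 |
+| What should serve inference? | Container Apps (CPU) or Serverless GPU |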
+
+---
+
+## Storage Options
+
+### Azure Blob Storage + BlobFuse2 (Recommended)
+
+```bash
+# Install BlobFuse2 (assumes the Microsoft package repository is already configured)
+sudo apt-get install blobfuse2
+
+# Config file
+cat > ~/blobfuse-config.yaml << 'EOF'
+logging:
+  type: syslog
+  level: log_warning
+
+components:
+  - libfuse
+  - file_cache
+  - azstorage
+
+file_cache:
+  path: /tmp/blobfuse2
+  timeout-sec: 120
+  max-size-mb: 4096
+
+azstorage:
+  type: block
+  account-name: YOUR_ACCOUNT
+  account-key: YOUR_KEY
+  container: training-images
+EOF
+
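+# Mount
+sudo mkdir -p /mnt/azure-blob && sudo chown "$USER" /mnt/azure-blob
+# Use $HOME here: bash does not expand "~" after "=" in an option value
+blobfuse2 mount /mnt/azure-blob --config-file="$HOME/blobfuse-config.yaml"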
+```
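Once the container is mounted, a quick read test helps confirm that the mount works and gives a rough idea of how much the file cache helps. The sketch below is only an illustration, not part of the original guide: `measure_read_throughput` is a hypothetical helper, and it assumes nothing beyond a readable directory.

```python
import time
from pathlib import Path


def measure_read_throughput(directory, limit=100):
    """Read up to `limit` files under `directory`.

    Returns a tuple (bytes_read, megabytes_per_second).
    """
    files = [p for p in sorted(Path(directory).rglob("*")) if p.is_file()][:limit]
    start = time.perf_counter()
    total = sum(len(p.read_bytes()) for p in files)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small or empty samples.
    mb_per_s = total / 1e6 / elapsed if elapsed > 0 else float("inf")
    return total, mb_per_s
```

Calling `measure_read_throughput("/mnt/azure-blob")` twice in a row should show a faster second pass if the BlobFuse2 file cache is serving the reads, since the first pass downloads from Blob Storage.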
+
+### Local Development (Windows)
+
+```powershell
+# Install
+winget install WinFsp.WinFsp
+winget install Rclone.Rclone
+
+# Configure
+rclone config  # choose azureblob
+
+# Mount as drive Z:
+rclone mount azure:training-images Z: --vfs-cache-mode full
+```
+
+### Storage Costs
+
+| Tier | Price | Best for |
+|------|------|---------|
+| Hot | $0.018/GB/month | Frequent access |
+| Cool | $0.01/GB/month | Occasional access |
+| Archive | $0.002/GB/month | Long-term archiving |
+
+**This project**: ~10,000 images × 500 KB = ~5 GB → **~$0.09/month**
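
The estimate above can be reproduced with a few lines of Python. This is a minimal sketch that uses the per-GB prices from the table and assumes decimal units (1 GB = 1,000,000 KB); `monthly_storage_cost` is a name introduced here for illustration.

```python
# Per-GB monthly prices in USD, copied from the storage tier table.
PRICE_PER_GB_MONTH = {"Hot": 0.018, "Cool": 0.01, "Archive": 0.002}


def monthly_storage_cost(num_images, avg_size_kb, tier="Hot"):
    """Estimate the monthly storage cost in USD for a set of images."""
    size_gb = num_images * avg_size_kb / 1_000_000  # KB -> GB (decimal units)
    return size_gb * PRICE_PER_GB_MONTH[tier]


for tier in PRICE_PER_GB_MONTH:
    cost = monthly_storage_cost(10_000, 500, tier)
    print(f"{tier}: ${cost:.2f}/month")
```

Transaction and egress charges are not included, so the figure is a lower bound on the storage bill.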
+
+---
+
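+## Training Options
+
+### Overview
+
+| Option | Best for | Idle cost | Complexity |
+|------|---------|---------|--------|
+| Azure VM | Simple, direct setups | Billed 24/7 while running | Low |
+| Azure VM Spot | Low cost, interruptible jobs | Billed 24/7 while running | Low |
+| Azure ML Compute | MLOps integration | Scales to 0 | Medium |
+| Container Apps GPU | Serverless | Scales to 0 automatically | Medium |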
+
+### Azure VM vs Azure ML
+
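+| Feature | Azure VM | Azure ML |
+|------|----------|----------|
+| Nature | Virtual machine | Managed ML platform |
+| Compute cost | $3.06/hr (NC6s_v3) | $3.06/hr (same) |
+| Additional cost | ~$5/month | ~$20-30/month |
+| Experiment tracking | None | Built in |
+| Autoscaling | None | Supported, min=0 |
+| Target users | DevOps engineers | Data scientists |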
+
+### Azure ML Additional Cost Breakdown
+
+| Service | Purpose | Cost |
+|------|------|------|
+| Container Registry | Docker images | ~$5-20/month |
+| Blob Storage | Logs, models | ~$0.10/month |
+| Application Insights | Monitoring | ~$0-10/month |
+| Key Vault | Secret management | <$1/month |
+
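+### Spot Instances
+
+Both platforms support Spot (low-priority) instances, which can cut costs by up to 90%; the examples below save about 70%:
+
+| Type | Regular price | Spot price | Savings |
+|------|---------|----------|------|
+| NC6s_v3 (V100) | $3.06/hr | ~$0.92/hr | 70% |
+| NC24ads_A100_v4 | $3.67/hr | ~$1.15/hr | 69% |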
+
+### GPU Instance Pricing
+
+| Instance | GPU | GPU memory | Price/hour | Spot price/hour |
+|------|-----|------|---------|----------|
+| NC6s_v3 | 1x V100 | 16GB | $3.06 | $0.92 |
+| NC24s_v3 | 4x V100 | 64GB | $12.24 | $3.67 |
+| NC24ads_A100_v4 | 1x A100 | 80GB | $3.67 | $1.15 |
+| NC48ads_A100_v4 | 2x A100 | 160GB | $7.35 | $2.30 |
+
+---
+
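+## Inference Options
+
+### Comparison
+
+| Option | GPU support | Scaling | Price | Best for |
+|------|---------|--------|------|---------|
+| Container Apps (CPU) | No | Automatic, 0-N | ~$30/month | YOLO inference (sufficient) |
+| Container Apps (GPU) | Yes | Serverless | Per-second billing | High-throughput inference |
+| Azure App Service | No | Manual/automatic | ~$50/month | Simple deployments |
+| Azure ML Endpoint | Yes | Automatic | ~$100+/month | MLOps integration |
+| AKS (Kubernetes) | Yes | Automatic | Complex billing | Large-scale production |
+
+### Recommended: Container Apps (CPU)
+
+For YOLO inference, **a CPU is sufficient**; no GPU is needed:
+- YOLOv11n takes ~200-500 ms per inference on a CPU
+- It is much cheaper than a GPU and suits low-to-medium traffic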
+
+```yaml
+# Container Apps configuration
+name: invoice-inference
+image: myacr.azurecr.io/invoice-inference:v1
+resources:
+  cpu: 2.0
+  memory: 4Gi
+scale:
+  minReplicas: 1      # Keep at least 1 replica so the service stays responsive
+  maxReplicas: 10     # Scale out to at most 10 replicas
+  rules:
+    - name: http-scaling
+      http:
+        metadata:
+          concurrentRequests: "50"  # Scale out at 50 concurrent requests per replica
+```
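
To make the scaling rule concrete, the sketch below approximates how many replicas the configuration above would target for a given load. It is a simplified model written for this guide, not the platform's actual algorithm: the real HTTP scaler samples concurrency over time and applies its own stabilization, and `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(concurrent_requests, per_replica=50,
                     min_replicas=1, max_replicas=10):
    """Approximate the replica count targeted by an HTTP concurrency rule."""
    wanted = math.ceil(concurrent_requests / per_replica)
    # Clamp to the configured minReplicas/maxReplicas bounds.
    return max(min_replicas, min(max_replicas, wanted))


for load in (0, 50, 51, 240, 1000):
    print(load, "concurrent requests ->", desired_replicas(load), "replicas")
```

With `minReplicas: 1` the service never scales to zero, which avoids cold starts at the price of one replica billed around the clock.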
+
+### Inference Service Code Example
+
+```dockerfile
+# Dockerfile
+FROM python:3.11-slim
+
+WORKDIR /app
+
+# Install dependencies
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy code and model
+COPY src/ ./src/
+COPY models/best.pt ./models/
+
+# Start the service
+CMD ["uvicorn", "src.web.app:app", "--host", "0.0.0.0", "--port", "8000"]
+```
+
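+```python
+# src/web/app.py
+import os
+import tempfile
+from pathlib import Path
+
+from fastapi import FastAPI, File, UploadFile
+from ultralytics import YOLO
+
+app = FastAPI()
+model = YOLO("models/best.pt")
+
+
+def extract_fields(results):
+    """Placeholder helper: return each detection's class name and bounding box."""
+    return [
+        {"label": r.names[int(box.cls)], "bbox": box.xyxy[0].tolist()}
+        for r in results
+        for box in r.boxes
+    ]
+
+
+def get_confidence(results):
+    """Placeholder helper: lowest detection confidence, or 0.0 with no detections."""
+    scores = [float(box.conf) for r in results for box in r.boxes]
+    return min(scores, default=0.0)
+
+
+@app.post("/api/v1/infer")
+async def infer(file: UploadFile = File(...)):
+    # Save the upload to a temp file. YOLO reads images, not PDFs, so PDF
+    # pages must be rendered to images before they reach this endpoint.
+    suffix = Path(file.filename or "").suffix or ".png"
+    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
+        tmp.write(await file.read())
+        tmp_path = tmp.name
+
+    try:
+        # Run inference
+        results = model.predict(tmp_path, conf=0.5)
+        # Return the results
+        return {
+            "fields": extract_fields(results),
+            "confidence": get_confidence(results),
+        }
+    finally:
+        os.remove(tmp_path)  # delete=False leaves cleanup to us
+
+
+@app.get("/health")
+async def health():
+    return {"status": "healthy"}
+```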
+
+### Deployment Commands
+
+```bash
+# 1. Create a Container Registry
+az acr create --name invoiceacr --resource-group myRG --sku Basic
+
+# 2. Build and push the image
+az acr build --registry invoiceacr --image invoice-inference:v1 .
+
+# 3. Create the Container Apps environment
+az containerapp env create \
+  --name invoice-env \
+  --resource-group myRG \
+  --location eastus
+
+# 4. Deploy the app (it needs pull access to the registry,
+#    for example the ACR admin user or a managed identity)
+az containerapp create \
+  --name invoice-inference \
+  --resource-group myRG \
+  --environment invoice-env \
+  --image invoiceacr.azurecr.io/invoice-inference:v1 \
+  --registry-server invoiceacr.azurecr.io \
+  --cpu 2 --memory 4Gi \
+  --min-replicas 1 --max-replicas 10 \
+  --ingress external --target-port 8000
+
+# 5. Get the URL
+az containerapp show --name invoice-inference --resource-group myRG --query properties.configuration.ingress.fqdn --output tsv
+```
+
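+### High-Throughput Scenario: Serverless GPU
+
+If inference needs GPU acceleration (high concurrency, low latency):
+
+```bash
+# Add a GPU workload profile (requires GPU quota in the region)
+az containerapp env workload-profile add \
+  --name invoice-env \
+  --resource-group myRG \
+  --workload-profile-name gpu \
+  --workload-profile-type Consumption-GPU-NC8as-T4
+
+# Deploy the GPU version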
+az containerapp create \
+  --name invoice-inference-gpu \
+  --resource-group myRG \
+  --environment invoice-env \
+  --image invoiceacr.azurecr.io/invoice-inference-gpu:v1 \
+  --workload-profile-name gpu \
+  --cpu 4 --memory 8Gi \
+  --min-replicas 0 --max-replicas 5 \
+  --ingress external --target-port 8000
+```
+
+### Inference Performance Comparison
+
+| Configuration | Time per inference | Throughput | Estimated monthly cost |
+|------|------------|---------|---------|
+| CPU, 2 cores, 4 GB | ~300-500 ms | ~50 QPS | ~$30 |
+| CPU, 4 cores, 8 GB | ~200-300 ms | ~100 QPS | ~$60 |
+| GPU T4 | ~50-100 ms | ~200 QPS | Per-second billing |
+| GPU A100 | ~20-50 ms | ~500 QPS | Per-second billing |
+
+---
+
+## Cost Comparison
+
+### Monthly Cost Comparison (assuming 2 hours of training per day)
+
+| Option | Calculation | Monthly cost |
+|------|---------|------|
+| VM running 24/7 | 24 h × 30 days × $3.06 | ~$2,200 |
+| VM started on demand | 2 h × 30 days × $3.06 | ~$184 |
+| VM Spot on demand | 2 h × 30 days × $0.92 | ~$55 |
+| Serverless GPU | 2 h × 30 days × ~$3.50 | ~$210 |
+| Azure ML (min=0) | 2 h × 30 days × $3.06 | ~$184 |
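
The rows above all follow the same formula: hours per day × days × hourly rate. A minimal sketch for re-running the comparison with a different training schedule follows; the rates are the ones quoted in this guide, and `monthly_compute_cost` is a name introduced here for illustration.

```python
def monthly_compute_cost(hours_per_day, hourly_rate, days=30):
    """Monthly compute cost in USD for a fixed daily usage."""
    return hours_per_day * days * hourly_rate


# (label, hours per day, hourly rate in USD) taken from the table above.
SCENARIOS = [
    ("VM running 24/7", 24, 3.06),
    ("VM started on demand", 2, 3.06),
    ("VM Spot on demand", 2, 0.92),
    ("Serverless GPU", 2, 3.50),
    ("Azure ML (min=0)", 2, 3.06),
]

for label, hours, rate in SCENARIOS:
    print(f"{label}: ~${monthly_compute_cost(hours, rate):,.0f}/month")
```

Changing the hours-per-day value shows how quickly an always-on VM becomes the most expensive option for occasional training.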
+
+### Full Cost Estimate for This Project
+
+| Component | Recommended option | Monthly cost |
+|------|---------|------|
+| Image storage | Blob Storage (Hot) | ~$0.10 |
+| Database | PostgreSQL Flexible (Burstable B1ms) | ~$25 |
+| Inference service | Container Apps CPU (2 cores, 4 GB) | ~$30 |
+| Training service | Azure ML Spot (on demand) | ~$1-5 per run |
+| Container Registry | Basic | ~$5 |
+| **Total** | | **~$60/month** + training |
+
+---
+
+## Recommended Architecture
+
+### Architecture Overview
+
+```
+                            ┌─────────────────────────────────────┐
+                            │         Azure Blob Storage          │
+                            │  ├── training-images/               │
+                            │  ├── datasets/                      │
+                            │  └── models/                        │
+                            └─────────────────┬───────────────────┘
+                                              │
+            ┌─────────────────────────────────┼─────────────────────────────────┐
+            │                                 │                                 │
+            ▼                                 ▼                                 ▼
+┌───────────────────────┐       ┌───────────────────────┐       ┌───────────────────────┐
+│   Inference (24/7)    │       │   Training (on demand)│       │   Web UI (optional)   │
+│   Container Apps      │       │   Azure ML Compute    │       │   Static Web Apps     │
+│   2 vCPU, 4 GiB       │       │   min=0, Spot         │       │   ~$0 (free tier)     │
+│   ~$30/month          │       │   ~$1-5 per run       │       │                       │
+│                       │       │                       │       │                       │
+│ ┌───────────────────┐ │       │ ┌───────────────────┐ │       │ ┌───────────────────┐ │
+│ │ FastAPI + YOLO    │ │       │ │ YOLOv11 Training  │ │       │ │ React/Vue UI      │ │
+│ │ /api/v1/infer     │ │       │ │ 100 epochs        │ │       │ │ Invoice upload    │ │
+│ └───────────────────┘ │       │ └───────────────────┘ │       │ └───────────────────┘ │
+└───────────┬───────────┘       └───────────┬───────────┘       └───────────┬───────────┘
+            │                               │                               │
+            └───────────────────────────────┼───────────────────────────────┘
+                                            │
+                                            ▼
+                              ┌───────────────────────┐
+                              │   PostgreSQL          │
+                              │   Flexible Server     │
+                              │   Burstable B1ms      │
+                              │   ~$25/month          │
+                              └───────────────────────┘
+```
+
+### Inference Service Configuration
+
+```yaml
+# Container Apps - CPU (running 24/7)
+name: invoice-inference
+resources:
+  cpu: 2
+  memory: 4Gi
+scale:
+  minReplicas: 1
+  maxReplicas: 10
+env:
+  - name: MODEL_PATH
+    value: /app/models/best.pt
+  - name: DB_HOST
+    secretRef: db-host
+  - name: DB_PASSWORD
+    secretRef: db-password
+```
+
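+### Training Service Configuration
+
+**Option A: Azure ML Compute (recommended)**
+
+```python
+from azure.ai.ml.entities import AmlCompute
+
+gpu_cluster = AmlCompute(
+    name="gpu-cluster",
+    size="Standard_NC6s_v3",
+    min_instances=0,      # Scale to zero when idle
+    max_instances=1,
+    tier="LowPriority",   # Spot (low-priority) instances
+    idle_time_before_scale_down=120  # Idle seconds before scaling down
+)
+
+# Create the cluster with an authenticated MLClient:
+# ml_client.compute.begin_create_or_update(gpu_cluster).result()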
+```
+
+**Option B: Container Apps Serverless GPU**
+
+```yaml
+name: invoice-training
+resources:
+  gpu: 1
+  gpuType: A100
+scale:
+  minReplicas: 0
+  maxReplicas: 1
+```
+
+---
+
+## Implementation Steps
+
+### Phase 1: Storage Setup
+
+```bash
+# Create the Storage Account
+az storage account create \
+  --name invoicestorage \
+  --resource-group myRG \
+  --sku Standard_LRS
+
+# Create the containers
+az storage container create --name training-images --account-name invoicestorage
+az storage container create --name datasets --account-name invoicestorage
+az storage container create --name models --account-name invoicestorage
+
+# Upload the training data
+az storage blob upload-batch \
+  --destination training-images \
+  --source ./data/dataset/temp \
+  --account-name invoicestorage
+```
+
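+### Phase 2: Database Setup
+
+```bash
+# Create PostgreSQL (B-series SKUs belong to the Burstable tier)
+az postgres flexible-server create \
+  --name invoice-db \
+  --resource-group myRG \
+  --tier Burstable \
+  --sku-name Standard_B1ms \
+  --storage-size 32 \
+  --admin-user docmaster \
+  --admin-password YOUR_PASSWORD
+
+# Configure the firewall (the 0.0.0.0 range allows access from Azure services)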
+az postgres flexible-server firewall-rule create \
+  --name allow-azure \
+  --resource-group myRG \
+  --server-name invoice-db \
+  --start-ip-address 0.0.0.0 \
+  --end-ip-address 0.0.0.0
+```
+
+### Phase 3: Inference Service Deployment
+
+```bash
+# Create the Container Registry
+az acr create --name invoiceacr --resource-group myRG --sku Basic
+
+# Build the image
+az acr build --registry invoiceacr --image invoice-inference:v1 .
+
+# Create the environment
+az containerapp env create \
+  --name invoice-env \
+  --resource-group myRG \
+  --location eastus
+
+# Deploy the inference service
+az containerapp create \
+  --name invoice-inference \
+  --resource-group myRG \
+  --environment invoice-env \
+  --image invoiceacr.azurecr.io/invoice-inference:v1 \
+  --registry-server invoiceacr.azurecr.io \
+  --cpu 2 --memory 4Gi \
+  --min-replicas 1 --max-replicas 10 \
+  --ingress external --target-port 8000 \
+  --env-vars \
+    DB_HOST=invoice-db.postgres.database.azure.com \
+    DB_NAME=docmaster \
+    DB_USER=docmaster \
+    DB_PASSWORD=secretref:db-password \
+  --secrets db-password=YOUR_PASSWORD
+```
+
+### Phase 4: Training Service Setup
+
+```bash
+# Create the Azure ML Workspace
+az ml workspace create --name invoice-ml --resource-group myRG
+
+# Create the Compute Cluster
+az ml compute create --name gpu-cluster \
+  --resource-group myRG \
+  --workspace-name invoice-ml \
+  --type AmlCompute \
+  --size Standard_NC6s_v3 \
+  --min-instances 0 \
+  --max-instances 1 \
+  --tier low_priority
+```
+
+### Phase 5: Integrating the Training Trigger API
+
+```python
+# src/web/routes/training.py
+from fastapi import APIRouter
+from pydantic import BaseModel
+from azure.ai.ml import MLClient, command
+from azure.identity import DefaultAzureCredential
+
+router = APIRouter()
+
+ml_client = MLClient(
+    credential=DefaultAzureCredential(),
+    subscription_id="your-subscription-id",
+    resource_group_name="myRG",
+    workspace_name="invoice-ml"
+)
+
+
+class TrainingRequest(BaseModel):
+    """Request body; the original snippet used this type without defining it."""
+    epochs: int = 100
+
+
+@router.post("/api/v1/train")
+async def trigger_training(request: TrainingRequest):
+    """Trigger an Azure ML training job."""
+    training_job = command(
+        code="./training",
+        command=f"python train.py --epochs {request.epochs}",
+        environment="AzureML-pytorch-2.0-cuda11.8@latest",
+        compute="gpu-cluster",
+    )
+    job = ml_client.jobs.create_or_update(training_job)
+    return {
+        "job_id": job.name,
+        "status": job.status,
+        "studio_url": job.studio_url
+    }
+
+@router.get("/api/v1/train/{job_id}/status")
+async def get_training_status(job_id: str):
+    """Query the status of a training job."""
+    job = ml_client.jobs.get(job_id)
+    return {"status": job.status}
+```
+
+---
+
+## Summary
+
+### Recommended Configuration
+
+| Component | Recommended option | Estimated monthly cost |
+|------|---------|---------|
+| Image storage | Blob Storage (Hot) | ~$0.10 |
+| Database | PostgreSQL Flexible | ~$25 |
+| Inference service | Container Apps CPU | ~$30 |
+| Training service | Azure ML (min=0, Spot) | On demand, ~$1-5 per run |
+| Container Registry | Basic | ~$5 |
+| **Total** | | **~$60/month** + training |
+
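+### Key Decisions
+
+| Scenario | Choice |
+|------|------|
+| Occasional training, simple needs | Azure VM Spot + manual start/stop |
+| MLOps and team collaboration needed | Azure ML Compute |
+| Lowest possible idle cost | Container Apps Serverless GPU |
+| Production inference | Container Apps CPU |
+| High-concurrency inference | Container Apps Serverless GPU |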
+
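+### Notes
+
+1. **Cold start**: Serverless GPU takes 3-8 minutes to start
+2. **Spot eviction**: Instances can be preempted, so training needs checkpointing
+3. **Network latency**: A mounted Blob Storage container is slower than a local SSD; enable caching
+4. **Region selection**: Choose a region with GPU quota (East US, West Europe, etc.)
+5. **Inference optimization**: CPU inference is sufficient for YOLO; no GPU is needed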
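
The checkpointing mentioned in note 2 can be sketched in a framework-neutral way. The example below is an illustration under stated assumptions rather than the project's training code: the state is a plain JSON file holding only the epoch counter, `stop_after` merely simulates an eviction, and all function names are hypothetical. In practice, Ultralytics training can resume from its own `last.pt` checkpoint with `resume=True`.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # an eviction mid-write cannot corrupt `path`


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"epoch": 0}
    with open(path) as f:
        return json.load(f)


def train(total_epochs, checkpoint_path, stop_after=None):
    """Run epochs from the last checkpoint; `stop_after` simulates preemption."""
    state = load_checkpoint(checkpoint_path)
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            return state  # simulated Spot eviction
        # ... one epoch of real training would run here ...
        state = {"epoch": epoch + 1}
        save_checkpoint(checkpoint_path, state)
    return state
```

Storing the checkpoint on the mounted Blob container rather than the node's local disk keeps it available after the Spot instance is reclaimed.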