Yaojia Wang a516de4320 WIP

2026-02-01 00:08:40 +01:00

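Complete Guide to Azure Deployment Options

Key Questions

Question	Answer
Can Azure Blob Storage be used for training?	Yes, mounted with BlobFuse2
Can training read from Blob in real time?	Yes, but configuring a local cache is recommended
Can Azure Blob be mounted locally?	Yes, with Rclone (Windows) or BlobFuse2 (Linux)
Is a VM billed while idle?	Yes, it is billed hourly for as long as it is running
How can I pay only for what I use?	Use Serverless GPU or a Compute Cluster with min=0
What should the inference service run on?	Container Apps (CPU) or Serverless GPU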

Storage Options

Azure Blob Storage + BlobFuse2 (Recommended)

# Install BlobFuse2 (requires the Microsoft package repository to be configured)
sudo apt-get install blobfuse2

# Write the config file
cat > ~/blobfuse-config.yaml << 'EOF'
logging:
  type: syslog
  level: log_warning

components:
  - libfuse
  - file_cache
  - azstorage

file_cache:
  path: /tmp/blobfuse2
  timeout-sec: 120
  max-size-mb: 4096

azstorage:
  type: block
  account-name: YOUR_ACCOUNT
  account-key: YOUR_KEY
  container: training-images
EOF

# Mount (use $HOME: the shell does not expand ~ after "--config-file=")
sudo mkdir -p /mnt/azure-blob
sudo chown "$USER" /mnt/azure-blob
blobfuse2 mount /mnt/azure-blob --config-file="$HOME/blobfuse-config.yaml"

Local Development (Windows)

# Install
winget install WinFsp.WinFsp
winget install Rclone.Rclone

# Configure
rclone config  # choose azureblob and name the remote "azure"

# Mount as drive Z:
rclone mount azure:training-images Z: --vfs-cache-mode full

Storage Costs

Tier	Price	Use case
Hot	$0.018/GB/month	Frequent access
Cool	$0.01/GB/month	Occasional access
Archive	$0.002/GB/month	Long-term archival

This project: ~10,000 images × 500 KB ≈ 5 GB → $0.09/month (Hot tier)
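
The estimate above is simple enough to script, which helps when the image count or tier changes. The sketch below uses the per-GB prices from the table; the function name and the decimal unit convention (1 GB = 1,000,000 KB) are assumptions made for illustration.

```python
# Prices in USD per GB per month, copied from the storage cost table.
TIER_PRICE_PER_GB = {"hot": 0.018, "cool": 0.01, "archive": 0.002}


def monthly_storage_cost(num_files: int, avg_size_kb: float, tier: str = "hot") -> float:
    """Estimate the monthly storage cost in USD (decimal units: 1 GB = 1,000,000 KB)."""
    size_gb = num_files * avg_size_kb / 1_000_000
    return size_gb * TIER_PRICE_PER_GB[tier]


# The project's figures: about 10,000 images of about 500 KB each.
for tier in TIER_PRICE_PER_GB:
    print(f"{tier}: ${monthly_storage_cost(10_000, 500, tier):.2f}/month")
```

Transaction and egress charges are not modelled here, so the result is a lower bound on the storage bill rather than a full quote.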

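Training Options

Overview

Option	Best for	Idle cost	Complexity
Azure VM	Simple, direct setup	Billed 24/7 while running	Low
Azure VM Spot	Saving money, interruptible jobs	Billed 24/7 while running	Low
Azure ML Compute	MLOps integration	Can scale to 0	Medium
Container Apps GPU	Serverless	Scales to 0 automatically	Medium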

Azure VM vs Azure ML

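Feature	Azure VM	Azure ML
Nature	Virtual machine	Managed ML platform
Compute cost	$3.06/hr (NC6s_v3)	$3.06/hr (same)
Additional cost	~$5/month	~$20-30/month
Experiment tracking	None	Built in
Autoscaling	None	Supports min=0
Audience	DevOps	Data scientists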

Azure ML Additional Cost Breakdown

Service	Purpose	Cost
Container Registry	Docker images	~$5-20/month
Blob Storage	Logs, models	~$0.10/month
Application Insights	Monitoring	~$0-10/month
Key Vault	Secret management	<$1/month

Spot Instances

Both platforms support Spot/low-priority instances, with savings of up to 90%:

Type	Regular price	Spot price	Savings
NC6s_v3 (V100)	$3.06/hr	~$0.92/hr	70%
NC24ads_A100_v4	$3.67/hr	~$1.15/hr	69%
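
The savings column follows directly from the two prices. A minimal sketch of that arithmetic, using the figures from the table (the function name is an assumption for illustration):

```python
def spot_savings_percent(regular_hourly: float, spot_hourly: float) -> float:
    """Return how much cheaper the Spot price is, as a percentage of the regular price."""
    return (1 - spot_hourly / regular_hourly) * 100


# (regular, spot) hourly prices in USD, copied from the table.
PRICES = {
    "NC6s_v3 (V100)": (3.06, 0.92),
    "NC24ads_A100_v4": (3.67, 1.15),
}

for name, (regular, spot) in PRICES.items():
    print(f"{name}: {spot_savings_percent(regular, spot):.0f}% cheaper on Spot")
```

Spot prices fluctuate, so it is worth rerunning this with current prices before committing to a plan.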

GPU Instance Pricing

Instance	GPU	GPU memory	Price/hour	Spot price
NC6s_v3	1x V100	16GB	$3.06	$0.92
NC24s_v3	4x V100	64GB	$12.24	$3.67
NC24ads_A100_v4	1x A100	80GB	$3.67	$1.15
NC48ads_A100_v4	2x A100	160GB	$7.35	$2.30

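Inference Options

Comparison

Option	GPU support	Scaling	Price	Best for
Container Apps (CPU)	No	Automatic, 0-N	~$30/month	YOLO inference (sufficient)
Container Apps (GPU)	Yes	Serverless	Billed per second	High-throughput inference
Azure App Service	No	Manual/automatic	~$50/month	Simple deployments
Azure ML Endpoint	Yes	Automatic	~$100+/month	MLOps integration
AKS (Kubernetes)	Yes	Automatic	Complex billing	Large-scale production

Recommended: Container Apps (CPU)

For YOLO inference, a CPU is sufficient and no GPU is needed:

YOLOv11n inference takes ~200-500 ms on CPU
Much cheaper than GPU, and a good fit for low-to-medium traffic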

# Container Apps configuration
name: invoice-inference
image: invoiceacr.azurecr.io/invoice-inference:v1
resources:
  cpu: 2.0
  memory: 4Gi
scale:
  minReplicas: 1      # keep at least 1 replica running so requests are answered promptly
  maxReplicas: 10     # scale out to at most 10 replicas
  rules:
    - name: http-scaling
      http:
        metadata:
          concurrentRequests: "50"  # scale out at 50 concurrent requests per replica
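
To get a feel for what the scale rule implies, the target replica count can be approximated as the concurrent request count divided by the per-replica threshold, clamped to the configured minimum and maximum. The sketch below is a simplification under that assumption; the real autoscaler (Container Apps scaling is KEDA-based) also applies polling intervals and cool-down periods that are not modelled here, and the function name is hypothetical.

```python
import math


def desired_replicas(
    concurrent_requests: int,
    per_replica: int = 50,   # concurrentRequests from the scale rule
    min_replicas: int = 1,   # minReplicas
    max_replicas: int = 10,  # maxReplicas
) -> int:
    """Approximate the replica count the HTTP scale rule aims for."""
    wanted = math.ceil(concurrent_requests / per_replica)
    return max(min_replicas, min(max_replicas, wanted))


for load in (0, 50, 51, 120, 10_000):
    print(f"{load} concurrent requests -> {desired_replicas(load)} replica(s)")
```

With these settings the service never drops below one replica, and it stops scaling once the load passes 500 concurrent requests (10 replicas × 50).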

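Inference Service Code Example

# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
# (if requirements.txt pulls in opencv-python rather than opencv-python-headless,
#  the slim image also needs the libgl1 and libglib2.0-0 system packages)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy code and model
COPY src/ ./src/
COPY models/best.pt ./models/

# Start the service
CMD ["uvicorn", "src.web.app:app", "--host", "0.0.0.0", "--port", "8000"]

# src/web/app.py
import os
import tempfile

from fastapi import FastAPI, UploadFile, File
from ultralytics import YOLO

# extract_fields and get_confidence are project helpers that this guide does not show;
# import them from wherever they live in the project.

app = FastAPI()
model = YOLO("models/best.pt")  # loaded once at startup, not per request

@app.post("/api/v1/infer")
async def infer(file: UploadFile = File(...)):
    # Save the upload to a temporary file. The .pdf suffix is kept from the original;
    # Ultralytics reads images and video, so a PDF-to-image step is presumably needed.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name

    try:
        # Run inference
        results = model.predict(tmp_path, conf=0.5)
    finally:
        # delete=False leaves the file on disk, so remove it explicitly
        os.unlink(tmp_path)

    # Return the result
    return {
        "fields": extract_fields(results),
        "confidence": get_confidence(results)
    }

@app.get("/health")
async def health():
    return {"status": "healthy"}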
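
The endpoint above calls extract_fields and get_confidence without showing them. The sketch below is one hypothetical way to write them, not the project's actual helpers: it assumes each result exposes `names` (class id to label) and `boxes` with `cls`, `conf`, and `xyxy`, as Ultralytics results do, and it assumes "confidence" means the mean detection confidence. The labels in the stand-in data are invented for the demonstration.

```python
from statistics import mean
from types import SimpleNamespace


def extract_fields(results) -> list[dict]:
    """Flatten detections into plain dicts (hypothetical output shape)."""
    fields = []
    for result in results:
        boxes = result.boxes
        for cls_id, conf, xyxy in zip(boxes.cls, boxes.conf, boxes.xyxy):
            fields.append({
                "label": result.names[int(cls_id)],
                "confidence": float(conf),
                "box": [float(v) for v in xyxy],
            })
    return fields


def get_confidence(results) -> float:
    """Mean detection confidence, or 0.0 when nothing was detected (assumed definition)."""
    scores = [field["confidence"] for field in extract_fields(results)]
    return mean(scores) if scores else 0.0


# Stand-in for one result object, so the sketch runs without a model or GPU.
fake_result = SimpleNamespace(
    names={0: "invoice_number", 1: "amount"},
    boxes=SimpleNamespace(cls=[0, 1], conf=[0.9, 0.7], xyxy=[[1, 2, 3, 4], [5, 6, 7, 8]]),
)
print(extract_fields([fake_result]))
print(get_confidence([fake_result]))
```

Returning plain floats and lists keeps the response JSON-serializable, which FastAPI needs when the values originate from tensors.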

Deployment Commands

# 1. Create the Container Registry
az acr create --name invoiceacr --resource-group myRG --sku Basic

# 2. Build and push the image
az acr build --registry invoiceacr --image invoice-inference:v1 .

# 3. Create the Container Apps environment
az containerapp env create \
  --name invoice-env \
  --resource-group myRG \
  --location eastus

# 4. Deploy the app
az containerapp create \
  --name invoice-inference \
  --resource-group myRG \
  --environment invoice-env \
  --image invoiceacr.azurecr.io/invoice-inference:v1 \
  --registry-server invoiceacr.azurecr.io \
  --cpu 2 --memory 4Gi \
  --min-replicas 1 --max-replicas 10 \
  --ingress external --target-port 8000

# 5. Get the URL
az containerapp show --name invoice-inference --resource-group myRG --query properties.configuration.ingress.fqdn --output tsv
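
After deployment, a quick smoke test is to poll the service's /health endpoint until it answers with HTTP 200. The sketch below uses only the standard library; the function names, retry count, and delay are assumptions, and the `fetch` and `sleep` parameters exist only so the loop can be exercised without a network.

```python
import time
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except OSError:  # covers URLError, HTTPError, and timeouts
        return 0


def wait_until_healthy(
    url: str,
    attempts: int = 30,
    delay: float = 10.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200; give up after `attempts` tries."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_healthy` with `https://` plus the FQDN printed by step 5 plus `/health` would then block until the first replica is serving. For a scale-to-zero GPU app, the retry budget should be generous enough to cover a cold start of several minutes.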

High-Throughput Scenario: Serverless GPU

If inference needs GPU acceleration (high concurrency, low latency):

# Add a GPU workload profile to the environment (requires GPU quota in the region)
# (profile type names vary; `az containerapp env workload-profile list-supported --location eastus` lists them)
az containerapp env workload-profile add \
  --name invoice-env \
  --resource-group myRG \
  --workload-profile-name gpu \
  --workload-profile-type Consumption-GPU-T4

# Deploy the GPU version
az containerapp create \
  --name invoice-inference-gpu \
  --resource-group myRG \
  --environment invoice-env \
  --image invoiceacr.azurecr.io/invoice-inference-gpu:v1 \
  --workload-profile-name gpu \
  --cpu 4 --memory 8Gi \
  --min-replicas 0 --max-replicas 5 \
  --ingress external --target-port 8000

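Inference Performance Comparison

Configuration	Time per inference	Throughput capacity	Monthly cost estimate
CPU 2 cores, 4 GB	~300-500 ms	~50 QPS	~$30
CPU 4 cores, 8 GB	~200-300 ms	~100 QPS	~$60
GPU T4	~50-100 ms	~200 QPS	Billed per second
GPU A100	~20-50 ms	~500 QPS	Billed per second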

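Price Comparison

Monthly Cost Comparison (assuming 2 hours of training per day)

Option	Calculation	Monthly cost
VM running 24/7	24 h × 30 days × $3.06	~$2,200
VM started and stopped on demand	2 h × 30 days × $3.06	~$184
VM Spot on demand	2 h × 30 days × $0.92	~$55
Serverless GPU	2 h × 30 days × ~$3.50	~$210
Azure ML (min=0)	2 h × 30 days × $3.06	~$184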
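
The comparison is hours per day × days × hourly rate. A short sketch that reproduces the table and makes it easy to try other training schedules (the names are illustrative assumptions; the rates are the ones quoted above):

```python
def monthly_compute_cost(hours_per_day: float, hourly_rate: float, days: int = 30) -> float:
    """Monthly compute cost in USD for a fixed number of billed hours per day."""
    return hours_per_day * days * hourly_rate


# (billed hours per day, USD per hour), copied from the table.
SCENARIOS = {
    "VM running 24/7": (24, 3.06),
    "VM started and stopped on demand": (2, 3.06),
    "VM Spot on demand": (2, 0.92),
    "Serverless GPU": (2, 3.50),
    "Azure ML (min=0)": (2, 3.06),
}

for name, (hours, rate) in SCENARIOS.items():
    print(f"{name}: ~${monthly_compute_cost(hours, rate):,.0f}/month")
```

The gap between the first two rows is the cost of leaving a VM running while idle, which is the main argument for on-demand or scale-to-zero compute.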

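Full Cost Estimate for This Project

Component	Recommended option	Monthly cost
Image storage	Blob Storage (Hot)	~$0.10
Database	PostgreSQL Flexible (Burstable B1ms)	~$25
Inference service	Container Apps CPU (2 cores, 4 GB)	~$30
Training service	Azure ML Spot (on demand)	~$1-5 per run
Container Registry	Basic	~$5
Total		~$60/month + training cost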
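
Summing the fixed line items as listed gives roughly $60 per month before training. The sketch below adds a per-run training cost on top; the choice of four runs per month is purely an assumption for illustration.

```python
# Fixed monthly costs in USD, copied from the table (training is billed per run).
FIXED_MONTHLY_USD = {
    "image storage": 0.10,
    "database": 25.0,
    "inference service": 30.0,
    "container registry": 5.0,
}


def monthly_total(training_runs: int, cost_per_run: float) -> float:
    """Fixed monthly costs plus the cost of the month's training runs."""
    return sum(FIXED_MONTHLY_USD.values()) + training_runs * cost_per_run


runs = 4  # assumed number of training runs per month
low, high = monthly_total(runs, 1.0), monthly_total(runs, 5.0)
print(f"fixed: ${monthly_total(0, 0.0):.2f}, with {runs} runs: ${low:.2f} to ${high:.2f}")
```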

Recommended Architecture

Overall Architecture Diagram

                            ┌─────────────────────────────────────┐
                            │         Azure Blob Storage          │
                            │  ├── training-images/               │
                            │  ├── datasets/                      │
                            │  └── models/                        │
                            └─────────────────┬───────────────────┘
                                              │
            ┌─────────────────────────────────┼─────────────────────────────────┐
            │                                 │                                 │
            ▼                                 ▼                                 ▼
┌───────────────────────┐       ┌───────────────────────┐       ┌───────────────────────┐
│   Inference (24/7)    │       │   Training (on demand)│       │   Web UI (optional)   │
│   Container Apps      │       │   Azure ML Compute    │       │   Static Web Apps     │
│   CPU 2 cores, 4 GB   │       │   min=0, Spot         │       │   ~$0 (free tier)     │
│   ~$30/month          │       │   ~$1-5 per run       │       │                       │
│                       │       │                       │       │                       │
│ ┌───────────────────┐ │       │ ┌───────────────────┐ │       │ ┌───────────────────┐ │
│ │ FastAPI + YOLO    │ │       │ │ YOLOv11 Training  │ │       │ │ React/Vue frontend│ │
│ │ /api/v1/infer     │ │       │ │ 100 epochs        │ │       │ │ Invoice upload UI │ │
│ └───────────────────┘ │       │ └───────────────────┘ │       │ └───────────────────┘ │
└───────────┬───────────┘       └───────────┬───────────┘       └───────────┬───────────┘
            │                               │                               │
            └───────────────────────────────┼───────────────────────────────┘
                                            │
                                            ▼
                              ┌───────────────────────┐
                              │   PostgreSQL          │
                              │   Flexible Server     │
                              │   Burstable B1ms      │
                              │   ~$25/month          │
                              └───────────────────────┘

Inference Service Configuration

# Container Apps - CPU (running 24/7)
name: invoice-inference
resources:
  cpu: 2
  memory: 4Gi
scale:
  minReplicas: 1
  maxReplicas: 10
env:
  - name: MODEL_PATH
    value: /app/models/best.pt
  - name: DB_HOST
    secretRef: db-host
  - name: DB_PASSWORD
    secretRef: db-password

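Training Service Configuration

Option A: Azure ML Compute (Recommended)

from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

gpu_cluster = AmlCompute(
    name="gpu-cluster",
    size="Standard_NC6s_v3",
    min_instances=0,      # scale to zero when idle
    max_instances=1,
    tier="LowPriority",   # low-priority (Spot-style) nodes
    idle_time_before_scale_down=120,  # seconds of idleness before scaling down
)

# Defining the object creates nothing by itself; submit it to the workspace
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="your-subscription-id",
    resource_group_name="myRG",
    workspace_name="invoice-ml",
)
ml_client.compute.begin_create_or_update(gpu_cluster).result()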

Option B: Container Apps Serverless GPU

name: invoice-training
resources:
  gpu: 1
  gpuType: A100
scale:
  minReplicas: 0
  maxReplicas: 1

Implementation Steps

Phase 1: Storage Setup

# Create the Storage Account
az storage account create \
  --name invoicestorage \
  --resource-group myRG \
  --sku Standard_LRS

# Create the containers
az storage container create --name training-images --account-name invoicestorage
az storage container create --name datasets --account-name invoicestorage
az storage container create --name models --account-name invoicestorage

# Upload the training data
az storage blob upload-batch \
  --destination training-images \
  --source ./data/dataset/temp \
  --account-name invoicestorage

Phase 2: Database Setup

# Create PostgreSQL (B1ms is a Burstable-tier SKU)
az postgres flexible-server create \
  --name invoice-db \
  --resource-group myRG \
  --tier Burstable \
  --sku-name Standard_B1ms \
  --storage-size 32 \
  --admin-user docmaster \
  --admin-password YOUR_PASSWORD

# Configure the firewall (the 0.0.0.0 - 0.0.0.0 range allows access from Azure services)
az postgres flexible-server firewall-rule create \
  --name allow-azure \
  --resource-group myRG \
  --server-name invoice-db \
  --start-ip-address 0.0.0.0 \
  --end-ip-address 0.0.0.0

Phase 3: Inference Service Deployment

# Create the Container Registry
az acr create --name invoiceacr --resource-group myRG --sku Basic

# Build the image
az acr build --registry invoiceacr --image invoice-inference:v1 .

# Create the environment
az containerapp env create \
  --name invoice-env \
  --resource-group myRG \
  --location eastus

# Deploy the inference service
# (DB_PASSWORD references the db-password secret; without it the secret is never exposed to the app)
az containerapp create \
  --name invoice-inference \
  --resource-group myRG \
  --environment invoice-env \
  --image invoiceacr.azurecr.io/invoice-inference:v1 \
  --registry-server invoiceacr.azurecr.io \
  --cpu 2 --memory 4Gi \
  --min-replicas 1 --max-replicas 10 \
  --ingress external --target-port 8000 \
  --env-vars \
    DB_HOST=invoice-db.postgres.database.azure.com \
    DB_NAME=docmaster \
    DB_USER=docmaster \
    DB_PASSWORD=secretref:db-password \
  --secrets db-password=YOUR_PASSWORD

Phase 4: Training Service Setup

# Create the Azure ML Workspace
az ml workspace create --name invoice-ml --resource-group myRG

# Create the Compute Cluster
az ml compute create --name gpu-cluster \
  --resource-group myRG \
  --workspace-name invoice-ml \
  --type AmlCompute \
  --size Standard_NC6s_v3 \
  --min-instances 0 \
  --max-instances 1 \
  --tier low_priority

Phase 5: Integrating a Training Trigger API

# src/web/routes/training.py
from fastapi import APIRouter
from pydantic import BaseModel, Field
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

router = APIRouter()

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="your-subscription-id",
    resource_group_name="myRG",
    workspace_name="invoice-ml"
)

class TrainingRequest(BaseModel):
    # The request schema is not shown in the original; this minimal version is an assumption.
    # Validating epochs as a positive integer also keeps the command string below safe.
    epochs: int = Field(default=100, ge=1)

@router.post("/api/v1/train")
async def trigger_training(request: TrainingRequest):
    """Trigger an Azure ML training job."""
    training_job = command(
        code="./training",
        command=f"python train.py --epochs {request.epochs}",
        environment="AzureML-pytorch-2.0-cuda11.8@latest",
        compute="gpu-cluster",
    )
    job = ml_client.jobs.create_or_update(training_job)
    return {
        "job_id": job.name,
        "status": job.status,
        "studio_url": job.studio_url
    }

@router.get("/api/v1/train/{job_id}/status")
async def get_training_status(job_id: str):
    """Query the status of a training job."""
    job = ml_client.jobs.get(job_id)
    return {"status": job.status}

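Summary

Key Decisions

Scenario	Choice
Occasional training, simple needs	Azure VM Spot + manual start/stop
Need for MLOps and team collaboration	Azure ML Compute
Lowest possible idle cost	Container Apps Serverless GPU
Production inference	Container Apps CPU
High-concurrency inference	Container Apps Serverless GPU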

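Notes

Cold start: Serverless GPU takes 3-8 minutes to start
Spot interruption: instances can be preempted, so training needs a checkpointing mechanism
Network latency: a mounted Blob Storage container is slower than a local SSD, so enabling the cache is recommended
Region selection: pick a region with GPU quota available (East US, West Europe, etc.)
Inference optimization: CPU inference is already sufficient for YOLO; no GPU is needed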
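
For the Spot interruption note, one simple checkpointing approach is to have the training script look for the last saved checkpoint on persistent storage and resume from it when a preempted job restarts. The sketch below only decides which weights to start from. It assumes the run directory follows the Ultralytics layout (`weights/last.pt`); the function name, the `yolo11n.pt` default, and the paths in the usage comment are hypothetical.

```python
from pathlib import Path


def pick_start_weights(run_dir: Path, base_weights: str = "yolo11n.pt") -> tuple[str, bool]:
    """Return (weights_path, resume_flag) for a training run.

    If a checkpoint from an interrupted run exists, resume from it;
    otherwise start fresh from the base weights.
    """
    last_checkpoint = run_dir / "weights" / "last.pt"
    if last_checkpoint.is_file():
        return str(last_checkpoint), True
    return base_weights, False


# How the result might feed into training (not executed here; paths are assumptions):
#   weights, resume = pick_start_weights(Path("/mnt/azure-blob/runs/invoice"))
#   YOLO(weights).train(data="dataset.yaml", epochs=100, resume=resume)
```

The run directory has to live on storage that survives preemption, such as the mounted Blob container, or the checkpoint is lost together with the node.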
