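# Azure Deployment Guide

## Table of Contents

- [Key Questions](#key-questions)
- [Storage Options](#storage-options)
- [Training Options](#training-options)
- [Inference Options](#inference-options)
- [Cost Comparison](#cost-comparison)
- [Recommended Architecture](#recommended-architecture)
- [Implementation Steps](#implementation-steps)
- [Summary](#summary)

---
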
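## Key Questions

| Question | Answer |
|----------|--------|
| Can Azure Blob Storage be used for training? | Yes, by mounting it with BlobFuse2 |
| Can training read from Blob Storage in real time? | Yes, but a local cache is recommended |
| Can Azure Blob Storage be mounted locally? | Yes, with Rclone (Windows) or BlobFuse2 (Linux) |
| Is an idle VM billed? | Yes, compute is billed for as long as the VM is running (stop and deallocate it to stop the charge) |
| How can I pay only for what I use? | Use Serverless GPU or a Compute Cluster with min=0 |
| What should serve inference? | Container Apps (CPU) or Serverless GPU |

---
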
## Storage Options

### Azure Blob Storage + BlobFuse2 (Recommended)

```bash
# Install BlobFuse2 (requires the Microsoft package repository to be configured)
sudo apt-get install -y blobfuse2

# Write the config file
cat > ~/blobfuse-config.yaml << 'EOF'
logging:
  type: syslog
  level: log_warning

components:
  - libfuse
  - file_cache
  - azstorage

file_cache:
  path: /tmp/blobfuse2
  timeout-sec: 120
  max-size-mb: 4096

azstorage:
  type: block
  account-name: YOUR_ACCOUNT
  account-key: YOUR_KEY
  container: training-images
EOF

# Mount (use $HOME here: the shell does not expand "~" after "=")
sudo mkdir -p /mnt/azure-blob && sudo chown "$USER" /mnt/azure-blob
mkdir -p /tmp/blobfuse2
blobfuse2 mount /mnt/azure-blob --config-file="$HOME/blobfuse-config.yaml"
```

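Once the container is mounted, training code can treat it as an ordinary directory. The sketch below is a minimal illustration of that idea, assuming the mount point `/mnt/azure-blob` from the commands above; the helper name `list_training_images` and the set of image suffixes are hypothetical and not part of the original project.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_training_images(mount_dir):
    """Return sorted image paths found anywhere under mount_dir."""
    root = Path(mount_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


# Example usage against the BlobFuse2 mount point:
#   images = list_training_images("/mnt/azure-blob")
#   print(f"found {len(images)} images")
```

The first pass over a mounted container is slow because every file is fetched over the network; later reads are served from the `file_cache` directory until entries time out.
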
### Local Development (Windows)

```powershell
# Install WinFsp (required by rclone mount) and Rclone
winget install WinFsp.WinFsp
winget install Rclone.Rclone

# Configure a remote named "azure" (choose the azureblob backend)
rclone config

# Mount the container as drive Z:
rclone mount azure:training-images Z: --vfs-cache-mode full
```

### Storage Pricing

| Tier | Price | Use case |
|------|-------|----------|
| Hot | $0.018/GB/month | Frequent access |
| Cool | $0.01/GB/month | Occasional access |
| Archive | $0.002/GB/month | Long-term archival |

**This project**: ~10,000 images × 500 KB ≈ 5 GB → **~$0.09/month** on the Hot tier

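The storage estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses the Hot tier price from the table and decimal units (1 GB = 1,000,000 KB); the function name `monthly_storage_cost` is hypothetical, and transaction and egress charges are not modeled.

```python
HOT_PRICE_PER_GB_MONTH = 0.018  # USD, Hot tier price from the table above


def monthly_storage_cost(num_images, avg_size_kb, price_per_gb):
    """Estimate the monthly storage cost in USD (capacity only)."""
    total_gb = num_images * avg_size_kb / 1_000_000
    return total_gb * price_per_gb


cost = monthly_storage_cost(10_000, 500, HOT_PRICE_PER_GB_MONTH)
print(f"${cost:.2f}/month")  # prints $0.09/month
```
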

---

## Training Options

### Overview

| Option | Best for | Idle cost | Complexity |
|--------|----------|-----------|------------|
| Azure VM | Simple and direct | Billed while running | Low |
| Azure VM Spot | Cheap, interruptible | Billed while running (at the Spot rate) | Low |
| Azure ML Compute | MLOps integration | Can scale to 0 | Medium |
| Container Apps GPU | Serverless | Scales to 0 automatically | Medium |

### Azure VM vs Azure ML

| Feature | Azure VM | Azure ML |
|---------|----------|----------|
| What it is | Virtual machine | Managed ML platform |
| Compute cost | $3.06/hr (NC6s_v3) | $3.06/hr (same) |
| Additional cost | ~$5/month | ~$20-30/month |
| Experiment tracking | None | Built in |
| Autoscaling | None | Supported (min=0) |
| Target user | DevOps | Data scientists |

### Azure ML Additional Costs

| Service | Purpose | Cost |
|---------|---------|------|
| Container Registry | Docker images | ~$5-20/month |
| Blob Storage | Logs, models | ~$0.10/month |
| Application Insights | Monitoring | ~$0-10/month |
| Key Vault | Secret management | <$1/month |

### Spot Instances

Both platforms support Spot / low-priority instances. Savings can reach up to 90%; the instances below save about 70%:

| Type | Regular price | Spot price | Savings |
|------|---------------|------------|---------|
| NC6s_v3 (V100) | $3.06/hr | ~$0.92/hr | 70% |
| NC24ads_A100_v4 | $3.67/hr | ~$1.15/hr | 69% |

### GPU Instance Pricing

| Instance | GPU | GPU memory | Price/hour | Spot price/hour |
|----------|-----|------------|------------|-----------------|
| NC6s_v3 | 1x V100 | 16 GB | $3.06 | $0.92 |
| NC24s_v3 | 4x V100 | 64 GB | $12.24 | $3.67 |
| NC24ads_A100_v4 | 1x A100 | 80 GB | $3.67 | $1.15 |
| NC48ads_A100_v4 | 2x A100 | 160 GB | $7.35 | $2.30 |

---

## Inference Options

### Comparison

| Option | GPU support | Scaling | Price | Best for |
|--------|-------------|---------|-------|----------|
| Container Apps (CPU) | No | Automatic, 0-N | ~$30/month | YOLO inference (sufficient) |
| Container Apps (GPU) | Yes | Serverless | Per-second billing | High-throughput inference |
| Azure App Service | No | Manual/automatic | ~$50/month | Simple deployments |
| Azure ML Endpoint | Yes | Automatic | ~$100+/month | MLOps integration |
| AKS (Kubernetes) | Yes | Automatic | Complex billing | Large-scale production |

### Recommended: Container Apps (CPU)

For YOLO inference at low to medium traffic, **a CPU is sufficient** and no GPU is needed:

- YOLOv11n inference takes ~200-500 ms on CPU
- A CPU is much cheaper than a GPU, which suits low to medium traffic

```yaml
# Container Apps configuration (simplified sketch, not the full resource schema)
name: invoice-inference
image: invoiceacr.azurecr.io/invoice-inference:v1
resources:
  cpu: 2.0
  memory: 4Gi
scale:
  minReplicas: 1   # keep at least 1 replica so requests get a fast response
  maxReplicas: 10  # scale out to at most 10 replicas
  rules:
    - name: http-scaling
      http:
        metadata:
          concurrentRequests: "50"  # scale out at 50 concurrent requests per replica
```

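To make the scale rule concrete, the sketch below approximates how many replicas an HTTP concurrency rule asks for at a given load. It is an illustration only: the function name `desired_replicas` is hypothetical, and the real scaler in Container Apps also applies polling intervals and cool-down periods that are not modeled here.

```python
import math


def desired_replicas(concurrent_requests, target_per_replica=50,
                     min_replicas=1, max_replicas=10):
    """Replicas needed so each handles at most target_per_replica requests."""
    needed = math.ceil(concurrent_requests / target_per_replica)
    # Clamp to the configured minReplicas / maxReplicas bounds.
    return max(min_replicas, min(max_replicas, needed))


for load in (0, 50, 51, 120, 1000):
    print(load, "->", desired_replicas(load))
```

With the values from the configuration above, 51 concurrent requests already need a second replica, and anything beyond 500 is capped at 10 replicas.
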
### Inference Service Code Example

```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
# Note: OpenCV (pulled in by ultralytics) may need system libraries such as
# libgl1 and libglib2.0-0; install them with apt-get if the import fails.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy code and model
COPY src/ ./src/
COPY models/best.pt ./models/

# Start the service
CMD ["uvicorn", "src.web.app:app", "--host", "0.0.0.0", "--port", "8000"]
```

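```python
# src/web/app.py
import os
import tempfile
from pathlib import Path

from fastapi import FastAPI, File, HTTPException, UploadFile
from ultralytics import YOLO

app = FastAPI()
model = YOLO(os.getenv("MODEL_PATH", "models/best.pt"))

# YOLO reads image files; PDFs must be rendered to images before inference.
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def extract_fields(results):
    """Placeholder: return each detection's class label and bounding box."""
    return [
        {"label": r.names[int(box.cls)], "bbox": box.xyxy[0].tolist()}
        for r in results
        for box in r.boxes
    ]


def get_confidence(results):
    """Placeholder: mean detection confidence, 0.0 when nothing is detected."""
    scores = [float(box.conf) for r in results for box in r.boxes]
    return sum(scores) / len(scores) if scores else 0.0


# Plain "def": FastAPI runs the blocking model call in a thread pool.
@app.post("/api/v1/infer")
def infer(file: UploadFile = File(...)):
    suffix = Path(file.filename or "").suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        raise HTTPException(status_code=415, detail="Upload a JPG or PNG image")

    # Save the upload to a temporary file
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(file.file.read())
        tmp_path = tmp.name

    # Run inference, then always remove the temporary file
    try:
        results = model.predict(tmp_path, conf=0.5)
    finally:
        os.remove(tmp_path)

    return {
        "fields": extract_fields(results),
        "confidence": get_confidence(results),
    }


@app.get("/health")
def health():
    return {"status": "healthy"}
```
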
### Deployment Commands

```bash
# 1. Create the Container Registry
az acr create --name invoiceacr --resource-group myRG --sku Basic

# 2. Build and push the image
az acr build --registry invoiceacr --image invoice-inference:v1 .

# 3. Create the Container Apps environment
az containerapp env create \
  --name invoice-env \
  --resource-group myRG \
  --location eastus

# 4. Deploy the app
# Pulling from ACR needs credentials: enable the registry admin user,
# or pass --registry-identity to authenticate with a managed identity.
az containerapp create \
  --name invoice-inference \
  --resource-group myRG \
  --environment invoice-env \
  --image invoiceacr.azurecr.io/invoice-inference:v1 \
  --registry-server invoiceacr.azurecr.io \
  --cpu 2 --memory 4.0Gi \
  --min-replicas 1 --max-replicas 10 \
  --ingress external --target-port 8000

# 5. Get the URL
az containerapp show --name invoice-inference --resource-group myRG \
  --query properties.configuration.ingress.fqdn --output tsv
```

### High-Throughput Scenario: Serverless GPU

If you need GPU-accelerated inference (high concurrency, low latency):

```bash
# Add a serverless GPU workload profile (requires GPU quota in the region)
az containerapp env workload-profile add \
  --name invoice-env \
  --resource-group myRG \
  --workload-profile-name gpu \
  --workload-profile-type Consumption-GPU-NC8as-T4

# Deploy the GPU version
az containerapp create \
  --name invoice-inference-gpu \
  --resource-group myRG \
  --environment invoice-env \
  --image invoiceacr.azurecr.io/invoice-inference-gpu:v1 \
  --workload-profile-name gpu \
  --cpu 4 --memory 8.0Gi \
  --min-replicas 0 --max-replicas 5 \
  --ingress external --target-port 8000
```

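### Inference Performance Comparison

| Configuration | Latency per inference | Throughput (est.) | Monthly cost (est.) |
|---------------|-----------------------|-------------------|---------------------|
| CPU, 2 vCPU / 4 GB | ~300-500 ms | ~50 QPS | ~$30 |
| CPU, 4 vCPU / 8 GB | ~200-300 ms | ~100 QPS | ~$60 |
| GPU T4 | ~50-100 ms | ~200 QPS | Per-second billing |
| GPU A100 | ~20-50 ms | ~500 QPS | Per-second billing |

These figures are rough estimates rather than measured benchmarks; real throughput depends on replica count, batching, and image size.
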
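A quick sanity check relates the latency and throughput columns. Assuming a replica handles requests one at a time (an assumption, since batching or several worker processes raise the limit), its throughput is bounded by 1 / latency, so the table's throughput figures are only reachable with several replicas or batching. The helper names below are hypothetical.

```python
import math


def max_qps_per_replica(latency_ms, parallel_workers=1):
    """Upper bound on requests per second for one replica."""
    return parallel_workers * 1000.0 / latency_ms


def replicas_for(target_qps, latency_ms, parallel_workers=1):
    """Replicas needed to sustain target_qps at the given latency."""
    per_replica = max_qps_per_replica(latency_ms, parallel_workers)
    return math.ceil(target_qps / per_replica)


print(max_qps_per_replica(400))  # 2.5 requests/s at 400 ms per inference
print(replicas_for(50, 400))     # 20 replicas to reach 50 QPS
```
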

---

## Cost Comparison

### Monthly Cost Comparison (assuming 2 hours of training per day)

| Option | Calculation | Monthly cost |
|--------|-------------|--------------|
| VM running 24/7 | 24 h × 30 days × $3.06 | ~$2,200 |
| VM started on demand | 2 h × 30 days × $3.06 | ~$184 |
| VM Spot, on demand | 2 h × 30 days × $0.92 | ~$55 |
| Serverless GPU | 2 h × 30 days × ~$3.50 | ~$210 |
| Azure ML (min=0) | 2 h × 30 days × $3.06 | ~$184 |

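The table above follows from one formula: hours per day × days × hourly rate. The sketch below reproduces it so other usage patterns can be tried; the hourly rates are the ones quoted in this guide, and the function name `monthly_cost` is hypothetical.

```python
DAYS_PER_MONTH = 30


def monthly_cost(hours_per_day, hourly_rate, days=DAYS_PER_MONTH):
    """Monthly compute cost in USD for a fixed number of hours per day."""
    return hours_per_day * days * hourly_rate


# (hours per day, USD per hour), taken from the table above
scenarios = {
    "VM running 24/7": (24, 3.06),
    "VM on demand": (2, 3.06),
    "VM Spot on demand": (2, 0.92),
    "Serverless GPU": (2, 3.50),
    "Azure ML (min=0)": (2, 3.06),
}

for name, (hours, rate) in scenarios.items():
    print(f"{name:<20} ${monthly_cost(hours, rate):,.2f}")
```

Changing the hours per day shows how quickly an always-on VM becomes the most expensive choice for an occasional training workload.
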
### Full Cost Estimate for This Project

| Component | Recommended option | Monthly cost |
|-----------|--------------------|--------------|
| Image storage | Blob Storage (Hot) | ~$0.10 |
| Database | PostgreSQL Flexible Server (Burstable B1ms) | ~$25 |
| Inference service | Container Apps CPU (2 vCPU, 4 GB) | ~$30 |
| Training service | Azure ML Spot (on demand) | ~$1-5 per run |
| Container Registry | Basic | ~$5 |
| **Total** | | **~$60/month** + training |

---

## Recommended Architecture

### Architecture Overview

```
                    ┌─────────────────────────────────────┐
                    │         Azure Blob Storage          │
                    │  ├── training-images/               │
                    │  ├── datasets/                      │
                    │  └── models/                        │
                    └──────────────────┬──────────────────┘
                                       │
            ┌──────────────────────────┼──────────────────────────┐
            │                          │                          │
            ▼                          ▼                          ▼
┌───────────────────────┐  ┌───────────────────────┐  ┌───────────────────────┐
│  Inference (24/7)     │  │  Training (on demand) │  │  Web UI (optional)    │
│  Container Apps       │  │  Azure ML Compute     │  │  Static Web Apps      │
│  CPU, 2 vCPU / 4 GB   │  │  min=0, Spot          │  │  ~$0 (free tier)      │
│  ~$30/month           │  │  ~$1-5 per run        │  │                       │
│                       │  │                       │  │                       │
│ ┌───────────────────┐ │  │ ┌───────────────────┐ │  │ ┌───────────────────┐ │
│ │ FastAPI + YOLO    │ │  │ │ YOLOv11 training  │ │  │ │ React/Vue frontend│ │
│ │ /api/v1/infer     │ │  │ │ 100 epochs        │ │  │ │ Invoice upload UI │ │
│ └───────────────────┘ │  │ └───────────────────┘ │  │ └───────────────────┘ │
└───────────┬───────────┘  └───────────┬───────────┘  └───────────┬───────────┘
            │                          │                          │
            └──────────────────────────┼──────────────────────────┘
                                       │
                                       ▼
                           ┌───────────────────────┐
                           │  PostgreSQL           │
                           │  Flexible Server      │
                           │  Burstable B1ms       │
                           │  ~$25/month           │
                           └───────────────────────┘
```

### Inference Service Configuration

```yaml
# Container Apps - CPU (runs 24/7); simplified sketch, not the full resource schema
name: invoice-inference
resources:
  cpu: 2
  memory: 4Gi
scale:
  minReplicas: 1
  maxReplicas: 10
env:
  - name: MODEL_PATH
    value: /app/models/best.pt
  - name: DB_HOST
    secretRef: db-host
  - name: DB_PASSWORD
    secretRef: db-password
```

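### Training Service Configuration

**Option A: Azure ML Compute (Recommended)**

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

gpu_cluster = AmlCompute(
    name="gpu-cluster",
    size="Standard_NC6s_v3",
    min_instances=0,                  # scale to zero when idle
    max_instances=1,
    tier="LowPriority",               # low-priority (Spot) VMs
    idle_time_before_scale_down=120,  # seconds of idle time before scaling down
)

# Create the cluster in the workspace (names match the implementation steps)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="your-subscription-id",
    resource_group_name="myRG",
    workspace_name="invoice-ml",
)
ml_client.compute.begin_create_or_update(gpu_cluster).result()
```
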
**Option B: Container Apps Serverless GPU**

```yaml
# Simplified sketch, not the full resource schema
name: invoice-training
resources:
  gpu: 1
  gpuType: A100
scale:
  minReplicas: 0
  maxReplicas: 1
```

---

## Implementation Steps

### Stage 1: Storage Setup

```bash
# Create the storage account (the name must be globally unique)
az storage account create \
  --name invoicestorage \
  --resource-group myRG \
  --sku Standard_LRS

# Create the containers
az storage container create --name training-images --account-name invoicestorage
az storage container create --name datasets --account-name invoicestorage
az storage container create --name models --account-name invoicestorage

# Upload the training data
az storage blob upload-batch \
  --destination training-images \
  --source ./data/dataset/temp \
  --account-name invoicestorage
```

### Stage 2: Database Setup

```bash
# Create the PostgreSQL server (B1ms is a Burstable-tier SKU)
az postgres flexible-server create \
  --name invoice-db \
  --resource-group myRG \
  --tier Burstable \
  --sku-name Standard_B1ms \
  --storage-size 32 \
  --admin-user docmaster \
  --admin-password YOUR_PASSWORD

# Configure the firewall (0.0.0.0 to 0.0.0.0 allows access from Azure services)
# Flag names vary by Azure CLI version; check
# `az postgres flexible-server firewall-rule create --help` if this fails.
az postgres flexible-server firewall-rule create \
  --name allow-azure \
  --resource-group myRG \
  --server-name invoice-db \
  --start-ip-address 0.0.0.0 \
  --end-ip-address 0.0.0.0
```

### Stage 3: Inference Service Deployment

```bash
# Create the Container Registry
az acr create --name invoiceacr --resource-group myRG --sku Basic

# Build the image
az acr build --registry invoiceacr --image invoice-inference:v1 .

# Create the environment
az containerapp env create \
  --name invoice-env \
  --resource-group myRG \
  --location eastus

# Deploy the inference service
# DB_PASSWORD reads the db-password secret defined by --secrets
az containerapp create \
  --name invoice-inference \
  --resource-group myRG \
  --environment invoice-env \
  --image invoiceacr.azurecr.io/invoice-inference:v1 \
  --registry-server invoiceacr.azurecr.io \
  --cpu 2 --memory 4.0Gi \
  --min-replicas 1 --max-replicas 10 \
  --ingress external --target-port 8000 \
  --secrets db-password=YOUR_PASSWORD \
  --env-vars \
    DB_HOST=invoice-db.postgres.database.azure.com \
    DB_NAME=docmaster \
    DB_USER=docmaster \
    DB_PASSWORD=secretref:db-password
```

### Stage 4: Training Service Setup

```bash
# The "az ml" commands require the ml extension: az extension add --name ml

# Create the Azure ML workspace
az ml workspace create --name invoice-ml --resource-group myRG

# Create the compute cluster
az ml compute create --name gpu-cluster \
  --resource-group myRG \
  --workspace-name invoice-ml \
  --type AmlCompute \
  --size Standard_NC6s_v3 \
  --min-instances 0 \
  --max-instances 1 \
  --tier low_priority
```

### Stage 5: Training Trigger API

```python
# src/web/routes/training.py
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="your-subscription-id",
    resource_group_name="myRG",
    workspace_name="invoice-ml",
)


class TrainingRequest(BaseModel):
    """Request body; typing epochs as int keeps the command string safe."""
    epochs: int = 100


# Plain "def": FastAPI runs the blocking SDK calls in a thread pool.
@router.post("/api/v1/train")
def trigger_training(request: TrainingRequest):
    """Submit an Azure ML training job."""
    training_job = command(
        code="./training",
        command=f"python train.py --epochs {request.epochs}",
        # Use an environment that exists in your workspace or registry
        environment="AzureML-pytorch-2.0-cuda11.8@latest",
        compute="gpu-cluster",
    )
    job = ml_client.jobs.create_or_update(training_job)
    return {
        "job_id": job.name,
        "status": job.status,
        "studio_url": job.studio_url,
    }


@router.get("/api/v1/train/{job_id}/status")
def get_training_status(job_id: str):
    """Look up the status of a training job."""
    job = ml_client.jobs.get(job_id)
    return {"status": job.status}
```

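A client of the status endpoint usually polls until the job finishes. The sketch below shows one way to do that, under stated assumptions: `wait_for_job` is a hypothetical helper, the set of terminal status strings is an assumption to verify against the statuses your workspace actually reports, and `get_status` is any callable that returns the current status (for example, a function that calls `GET /api/v1/train/{job_id}/status`). Injecting `get_status` and `sleep` keeps the loop testable without Azure access.

```python
import time

# Assumed terminal states; verify against the statuses reported by Azure ML.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=30.0, timeout_seconds=3600.0,
                 sleep=time.sleep):
    """Poll get_status() until a terminal state or the timeout is reached.

    Returns the last status seen, which may be non-terminal on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status not in TERMINAL_STATES and time.monotonic() < deadline:
        sleep(poll_seconds)
        status = get_status()
    return status
```
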

---

## Summary

### Recommended Configuration

| Component | Recommended option | Monthly cost (est.) |
|-----------|--------------------|---------------------|
| Image storage | Blob Storage (Hot) | ~$0.10 |
| Database | PostgreSQL Flexible Server | ~$25 |
| Inference service | Container Apps CPU | ~$30 |
| Training service | Azure ML (min=0, Spot) | On demand, ~$1-5 per run |
| Container Registry | Basic | ~$5 |
| **Total** | | **~$60/month** + training |

### Key Decisions

| Scenario | Choice |
|----------|--------|
| Occasional training, simple needs | Azure VM Spot + manual start/stop |
| MLOps and team collaboration | Azure ML Compute |
| Lowest possible idle cost | Container Apps Serverless GPU |
| Production inference | Container Apps CPU |
| High-concurrency inference | Container Apps Serverless GPU |

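### Caveats

1. **Cold start**: Serverless GPU takes 3-8 minutes to start
2. **Spot preemption**: Spot instances can be evicted, so training needs a checkpoint mechanism
3. **Network latency**: a mounted Blob Storage container is slower than a local SSD; enable caching
4. **Region selection**: choose a region where you have GPU quota (East US, West Europe, etc.)
5. **Inference optimization**: CPU inference is sufficient for YOLO at low to medium traffic; a GPU is not required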
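
The checkpoint mechanism mentioned under Spot preemption can be sketched in a few lines. This is a generic illustration, not the project's training code: `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical names, the state holds only an epoch counter (a real run would also persist model and optimizer weights), and the checkpoint path is assumed to live on storage that survives eviction, such as the mounted Blob container.

```python
import json
import os


def save_checkpoint(path, state):
    """Write state atomically so an eviction mid-write cannot corrupt it."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"epoch": 0}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def train(checkpoint_path, total_epochs, train_one_epoch):
    """Run the remaining epochs, saving a checkpoint after each one."""
    state = load_checkpoint(checkpoint_path)
    for epoch in range(state["epoch"], total_epochs):
        train_one_epoch(epoch)
        save_checkpoint(checkpoint_path, {"epoch": epoch + 1})
```

After an eviction, calling `train` again with the same checkpoint path resumes from the first unfinished epoch instead of starting over.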