WIP
772
docs/aws-deployment-guide.md
Normal file
@@ -0,0 +1,772 @@
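# Complete Guide to AWS Deployment Options

## Table of Contents
- [Core Questions](#core-questions)
- [Storage Options](#storage-options)
- [Training Options](#training-options)
- [Inference Options](#inference-options)
- [Price Comparison](#price-comparison)
- [Recommended Architecture](#recommended-architecture)
- [Implementation Steps](#implementation-steps)
- [AWS vs Azure Comparison](#aws-vs-azure-comparison)

---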

## Core Questions

| Question | Answer |
|------|------|
| Can S3 be used for training? | Yes, via Mountpoint for Amazon S3 or SageMaker's native S3 support |
| Can training read from S3 in real time? | Yes, SageMaker Pipe Mode streams data from S3 |
| Can S3 be mounted locally? | Yes, with s3fs-fuse or Rclone |
| Is an idle EC2 instance billed? | Yes, a running instance is billed whether or not it is doing work |
| How do I pay only for what I use? | Use SageMaker Managed Spot Training or Lambda |
| What should serve inference? | Lambda (serverless) or ECS/Fargate (containers) |

---

## Storage Options

### Amazon S3 (Recommended)

S3 is AWS's core storage service and is tightly integrated with SageMaker.

```bash
# Create the S3 bucket
aws s3 mb s3://invoice-training-data --region us-east-1

# Upload the training data
aws s3 sync ./data/dataset/temp s3://invoice-training-data/images/

# Create the folder layout (zero-byte keys ending in "/" act as folder markers)
aws s3api put-object --bucket invoice-training-data --key datasets/
aws s3api put-object --bucket invoice-training-data --key models/
```

### Mountpoint for Amazon S3

Mountpoint is AWS's official S3 mount client and performs better than s3fs:

```bash
# Install Mountpoint
wget https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.deb
sudo dpkg -i mount-s3.deb

# Mount the bucket
mkdir -p /mnt/s3-data
mount-s3 invoice-training-data /mnt/s3-data --region us-east-1

# Alternatively, mount with a local cache (recommended)
mount-s3 invoice-training-data /mnt/s3-data \
  --region us-east-1 \
  --cache /tmp/s3-cache \
  --metadata-ttl 60
```

### Mounting for Local Development

**Linux/Mac (s3fs-fuse):**
```bash
# Install (Debian/Ubuntu shown; on macOS use a package manager such as Homebrew)
sudo apt-get install s3fs

# Store the credentials
echo ACCESS_KEY_ID:SECRET_ACCESS_KEY > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs

# Mount
s3fs invoice-training-data /mnt/s3 -o passwd_file=${HOME}/.passwd-s3fs
```

**Windows (Rclone):**
```powershell
# Install (rclone mount on Windows requires WinFsp)
winget install WinFsp.WinFsp
winget install Rclone.Rclone

# Configure a remote named "aws"
rclone config  # choose s3

# Mount
rclone mount aws:invoice-training-data Z: --vfs-cache-mode full
```

### Storage Costs

| Tier | Price | Use case |
|------|------|---------|
| S3 Standard | $0.023/GB/month | Frequent access |
| S3 Intelligent-Tiering | $0.023/GB/month | Automatic tiering |
| S3 Infrequent Access | $0.0125/GB/month | Occasional access |
| S3 Glacier | $0.004/GB/month | Long-term archive |

**This project**: ~10,000 images × 500 KB = ~5 GB → **~$0.12/month**
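
The estimate above is plain arithmetic, and the following minimal sketch makes it easy to redo for a different dataset size or storage tier. The function name `monthly_storage_cost` is hypothetical, and the calculation assumes decimal units (1 GB = 1,000,000 KB) and ignores request and transfer charges.

```python
def monthly_storage_cost(num_objects: int, avg_size_kb: float,
                         price_per_gb_month: float) -> float:
    """Return the estimated monthly storage cost in USD (storage only)."""
    total_gb = num_objects * avg_size_kb / 1_000_000  # KB -> GB, decimal units
    return total_gb * price_per_gb_month


# The figures from this section: 10,000 images of ~500 KB on S3 Standard
cost = monthly_storage_cost(10_000, 500, 0.023)
print(f"${cost:.3f}/month")
```

Passing a different per-GB price (for example the Infrequent Access price from the table) gives the figure for another tier.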

### SageMaker Data Input Modes

| Mode | Description | Use case |
|------|------|---------|
| File Mode | Downloads the dataset to local storage before training starts | Small datasets |
| Pipe Mode | Streams data from S3 without using local disk space | Large datasets |
| FastFile Mode | Fetches objects on demand, up to 3x faster | Recommended |

---

## Training Options

### Overview

| Option | Use case | Idle cost | Complexity | Spot support |
|------|---------|---------|--------|----------|
| EC2 GPU | Simple and direct | Billed 24/7 while running | Low | Yes |
| SageMaker Training | MLOps integration | Billed per job | Medium | Yes |
| EKS + GPU | Kubernetes | Complex billing | High | Yes |

### EC2 vs SageMaker

| Feature | EC2 | SageMaker |
|------|-----|-----------|
| Nature | Virtual machine | Managed ML platform |
| Compute cost | $3.06/hr (p3.2xlarge) | $3.825/hr (+25%) |
| Management overhead | You configure everything | Fully managed |
| Spot discount | Up to 90% | Up to 90% |
| Experiment tracking | None | Built in |
| Automatic shutdown | None | Stops automatically when the job finishes |

### GPU Instance Prices (after the June 2025 price cut)

| Instance | GPU | GPU memory | On-Demand | Spot price |
|------|-----|------|-----------|----------|
| g4dn.xlarge | 1x T4 | 16GB | $0.526/hr | ~$0.16/hr |
| g4dn.2xlarge | 1x T4 | 16GB | $0.752/hr | ~$0.23/hr |
| p3.2xlarge | 1x V100 | 16GB | $3.06/hr | ~$0.92/hr |
| p3.8xlarge | 4x V100 | 64GB | $12.24/hr | ~$3.67/hr |
| p4d.24xlarge | 8x A100 | 320GB | $32.77/hr | ~$9.83/hr |

**Note**: In June 2025, AWS announced price cuts of up to 45% for the P4/P5 families.

### Spot Instances

```bash
# EC2 Spot request
aws ec2 request-spot-instances \
  --instance-count 1 \
  --type "one-time" \
  --launch-specification '{
    "ImageId": "ami-0123456789abcdef0",
    "InstanceType": "p3.2xlarge",
    "KeyName": "my-key"
  }'
```

### SageMaker Managed Spot Training

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="./src",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.0",
    py_version="py310",

    # Enable Spot instances
    use_spot_instances=True,
    max_run=3600,   # run for at most 1 hour
    max_wait=7200,  # wait for at most 2 hours (must be >= max_run)

    # Checkpoint configuration (for recovery after a Spot interruption)
    checkpoint_s3_uri="s3://invoice-training-data/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints",

    hyperparameters={
        "epochs": 100,
        "batch-size": 16,
    }
)

estimator.fit({
    "training": "s3://invoice-training-data/datasets/train/",
    "validation": "s3://invoice-training-data/datasets/val/"
})
```
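
SageMaker syncs the local checkpoint directory with `checkpoint_s3_uri` and copies the files back when a Spot job restarts, but the training script still has to look for a checkpoint and resume from it. The following is a minimal sketch of that lookup, assuming a hypothetical naming scheme of `epoch_<N>.pt`; neither the file names nor the helper `find_latest_checkpoint` come from the original `train.py`.

```python
import re
from pathlib import Path
from typing import Optional

# Matches checkpoint_local_path in the estimator configuration
CHECKPOINT_DIR = Path("/opt/ml/checkpoints")


def find_latest_checkpoint(checkpoint_dir: Path = CHECKPOINT_DIR) -> Optional[Path]:
    """Return the checkpoint with the highest epoch number, or None if none exists.

    Assumes (hypothetically) that checkpoints are saved as "epoch_<N>.pt".
    """
    pattern = re.compile(r"^epoch_(\d+)\.pt$")
    candidates = []
    if checkpoint_dir.is_dir():
        for path in checkpoint_dir.iterdir():
            match = pattern.match(path.name)
            if match:
                candidates.append((int(match.group(1)), path))
    if not candidates:
        return None  # first run: start training from scratch
    # Compare by epoch number only, so epoch_10 sorts after epoch_2
    return max(candidates, key=lambda item: item[0])[1]
```

At startup, `train.py` would call `find_latest_checkpoint()` and, if it returns a path, load that file before continuing; how the weights are loaded depends on the training framework.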

---

## Inference Options

### Comparison of Options

| Option | GPU support | Scaling | Cold start | Price | Use case |
|------|---------|--------|--------|------|---------|
| Lambda | No | Automatic, 0-N | Fast | Per invocation | Low traffic, CPU inference |
| Lambda + Container | No | Automatic, 0-N | Slower | Per invocation | Complex dependencies |
| ECS Fargate | No | Automatic | Medium | ~$30/month | Containerized services |
| ECS + EC2 GPU | Yes | Manual/automatic | Slow | ~$100+/month | GPU inference |
| SageMaker Endpoint | Yes | Automatic | Slow | ~$80+/month | MLOps integration |
| SageMaker Serverless | No | Automatic, 0-N | Medium | Per invocation | Intermittent traffic |

### Recommended Option 1: AWS Lambda (Low Traffic)

For CPU-based YOLO inference, Lambda is the most economical option:

```python
# lambda_function.py
import json

import boto3
from ultralytics import YOLO

# The model is loaded from a Lambda Layer or from /tmp.
# It is cached at module level so warm invocations reuse it.
model = None


def load_model():
    global model
    if model is None:
        # Download the model from S3 to /tmp
        s3 = boto3.client('s3')
        s3.download_file('invoice-models', 'best.pt', '/tmp/best.pt')
        model = YOLO('/tmp/best.pt')
    return model


def lambda_handler(event, context):
    model = load_model()

    # Fetch the image from S3
    s3 = boto3.client('s3')
    bucket = event['bucket']
    key = event['key']

    local_path = f'/tmp/{key.split("/")[-1]}'
    s3.download_file(bucket, key, local_path)

    # Run inference
    results = model.predict(local_path, conf=0.5)

    # extract_fields and get_confidence are project helpers that are not shown here
    return {
        'statusCode': 200,
        'body': json.dumps({
            'fields': extract_fields(results),
            'confidence': get_confidence(results)
        })
    }
```
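
The handler above reads `event['bucket']` and `event['key']` directly, which matches a direct invocation. When the function sits behind an API Gateway proxy integration, the request payload typically arrives as a JSON string under `event["body"]` instead. The sketch below shows one way to accept both shapes; `parse_event` is a hypothetical helper, not part of the original handler.

```python
import base64
import json
from typing import Any, Dict, Tuple


def parse_event(event: Dict[str, Any]) -> Tuple[str, str]:
    """Return (bucket, key) from a direct invocation or an API Gateway proxy event."""
    payload = event
    if "body" in event:
        # Proxy integration: the JSON payload is a string, possibly base64-encoded
        body = event["body"] or "{}"
        if event.get("isBase64Encoded"):
            body = base64.b64decode(body).decode("utf-8")
        payload = json.loads(body)
    try:
        return payload["bucket"], payload["key"]
    except KeyError as missing:
        raise ValueError(f"missing required field: {missing}") from None


# Direct invocation and proxy-style invocation yield the same result
print(parse_event({"bucket": "invoice-uploads", "key": "in/a.png"}))
print(parse_event({"body": json.dumps({"bucket": "invoice-uploads", "key": "in/a.png"})}))
```

Inside `lambda_handler`, the two dictionary lookups could then be replaced by `bucket, key = parse_event(event)`.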

**Lambda configuration:**
```yaml
# serverless.yml
service: invoice-inference

provider:
  name: aws
  runtime: python3.11
  timeout: 30
  memorySize: 4096  # 4 GB of memory

functions:
  infer:
    handler: lambda_function.lambda_handler
    events:
      - http:
          path: /infer
          method: post
    layers:
      - arn:aws:lambda:us-east-1:123456789012:layer:yolo-deps:1
```

### Recommended Option 2: ECS Fargate (Medium Traffic)

**task-definition.json** (the `executionRoleArn` value is a placeholder; Fargate needs a task execution role to pull the image from ECR and write to CloudWatch Logs):
```json
{
  "family": "invoice-inference",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "2048",
  "memory": "4096",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "inference",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/invoice-inference:latest",
      "portMappings": [
        {
          "containerPort": 8000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {"name": "MODEL_PATH", "value": "/app/models/best.pt"}
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/invoice-inference",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
```

**Auto Scaling configuration:**
```bash
# Register the Auto Scaling target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/invoice-cluster/invoice-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 1 \
  --max-capacity 10

# Scale in and out based on CPU utilization
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/invoice-cluster/invoice-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 120
  }'
```

### Option 3: SageMaker Serverless Inference

```python
from sagemaker.serverless import ServerlessInferenceConfig
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://invoice-models/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    entry_point="inference.py",
    framework_version="2.0",
    py_version="py310"
)

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10
)

predictor = model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="invoice-inference-serverless"
)
```

### Inference Performance Comparison

| Configuration | Time per inference | Concurrency | Estimated monthly cost |
|------|------------|---------|---------|
| Lambda 4GB | ~500-800ms | Scales on demand | ~$15 (10K requests) |
| Fargate 2vCPU 4GB | ~300-500ms | ~50 QPS | ~$30 |
| Fargate 4vCPU 8GB | ~200-300ms | ~100 QPS | ~$60 |
| EC2 g4dn.xlarge (T4) | ~50-100ms | ~200 QPS | ~$380 |

---

## Price Comparison

### Training Cost Comparison (assuming 2 hours of training per day)

| Option | Calculation | Monthly cost |
|------|---------|------|
| EC2 running 24/7 | 24 h × 30 days × $3.06 | ~$2,200 |
| EC2 started on demand | 2 h × 30 days × $3.06 | ~$184 |
| EC2 Spot, on demand | 2 h × 30 days × $0.92 | ~$55 |
| SageMaker On-Demand | 2 h × 30 days × $3.825 | ~$230 |
| SageMaker Spot | 2 h × 30 days × $1.15 | ~$69 |
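
Each row of the table is hourly rate × hours per day × 30 days. The short sketch below reproduces the figures so they can be recomputed for a different training schedule; the hourly rates are copied from this guide, and the function name `monthly_cost` is hypothetical.

```python
# Hourly rates (USD) as quoted in this guide
HOURLY_RATES = {
    "EC2 On-Demand": 3.06,
    "EC2 Spot": 0.92,
    "SageMaker On-Demand": 3.825,
    "SageMaker Spot": 1.15,
}


def monthly_cost(rate_per_hour: float, hours_per_day: float, days: int = 30) -> float:
    """Return the compute cost in USD for one month of daily usage."""
    return rate_per_hour * hours_per_day * days


print(f"EC2 24/7: ${monthly_cost(3.06, 24):,.2f}/month")
for name, rate in HOURLY_RATES.items():
    print(f"{name}, 2 h/day: ${monthly_cost(rate, 2):,.2f}/month")
```

Changing `hours_per_day` shows how quickly an always-on instance diverges from on-demand or Spot usage.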

### Full Cost Estimate for This Project

| Component | Recommended option | Monthly cost |
|------|---------|------|
| Data storage | S3 Standard (5GB) | ~$0.12 |
| Database | RDS PostgreSQL (db.t3.micro) | ~$15 |
| Inference service | Lambda (10K requests/month) | ~$15 |
| Inference service (alternative) | ECS Fargate | ~$30 |
| Training service | SageMaker Spot (on demand) | ~$2-5 per run |
| ECR (image storage) | Basic usage | ~$1 |
| **Total (Lambda)** | | **~$35/month** + training |
| **Total (Fargate)** | | **~$50/month** + training |

---

## Recommended Architecture

### Overall Architecture Diagram

```
                     ┌─────────────────────────────────────┐
                     │              Amazon S3              │
                     │  ├── training-images/               │
                     │  ├── datasets/                      │
                     │  ├── models/                        │
                     │  └── checkpoints/                   │
                     └──────────────────┬──────────────────┘
                                        │
            ┌───────────────────────────┼───────────────────────────┐
            │                           │                           │
            ▼                           ▼                           ▼
┌───────────────────────┐   ┌───────────────────────┐   ┌───────────────────────┐
│ Inference service     │   │ Training service      │   │ API Gateway           │
│                       │   │                       │   │                       │
│ Option A: Lambda      │   │ SageMaker             │   │ REST API              │
│ ~$15/mo (10K req)     │   │ Managed Spot          │   │ Triggers Lambda/ECS   │
│                       │   │ ~$2-5 per run         │   │                       │
│ Option B: ECS Fargate │   │                       │   │                       │
│ ~$30/mo               │   │ - Starts automatically│   │                       │
│                       │   │ - Stops when finished │   │                       │
│ ┌───────────────────┐ │   │ - Saves checkpoints   │   │                       │
│ │ FastAPI + YOLO    │ │   │                       │   │                       │
│ │ CPU inference     │ │   │                       │   │                       │
│ └───────────────────┘ │   └───────────┬───────────┘   └───────────────────────┘
└───────────┬───────────┘               │
            │                           │
            └───────────────────────────┤
                                        │
                                        ▼
                            ┌───────────────────────┐
                            │      Amazon RDS       │
                            │      PostgreSQL       │
                            │      db.t3.micro      │
                            │        ~$15/mo        │
                            └───────────────────────┘
```

### Lambda Inference Configuration

```yaml
# SAM template
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  InferenceFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.lambda_handler
      Runtime: python3.11
      MemorySize: 4096
      Timeout: 30
      Environment:
        Variables:
          MODEL_BUCKET: invoice-models
          MODEL_KEY: best.pt
      Policies:
        - S3ReadPolicy:
            BucketName: invoice-models
        - S3ReadPolicy:
            BucketName: invoice-uploads
      Events:
        InferApi:
          Type: Api
          Properties:
            Path: /infer
            Method: post
```

### SageMaker Training Configuration

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="./src",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",  # T4 GPU
    framework_version="2.0",
    py_version="py310",

    # Spot instance configuration
    use_spot_instances=True,
    max_run=7200,
    max_wait=14400,

    # Checkpoints
    checkpoint_s3_uri="s3://invoice-training-data/checkpoints/",

    hyperparameters={
        "epochs": 100,
        "batch-size": 16,
        "model": "yolo11n.pt"
    }
)
```

---

## Implementation Steps

### Stage 1: Storage Setup

```bash
# Create the S3 buckets
aws s3 mb s3://invoice-training-data --region us-east-1
aws s3 mb s3://invoice-models --region us-east-1

# Upload the training data
aws s3 sync ./data/dataset/temp s3://invoice-training-data/images/

# Configure a lifecycle rule (optional; moves objects to colder storage automatically)
# Each rule needs a Filter; an empty prefix applies the rule to the whole bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket invoice-training-data \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "MoveToIA",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{
        "Days": 30,
        "StorageClass": "STANDARD_IA"
      }]
    }]
  }'
```

### Stage 2: Database Setup

```bash
# Create the RDS PostgreSQL instance
aws rds create-db-instance \
  --db-instance-identifier invoice-db \
  --db-instance-class db.t3.micro \
  --engine postgres \
  --engine-version 15 \
  --master-username docmaster \
  --master-user-password YOUR_PASSWORD \
  --allocated-storage 20

# Configure the security group (allow PostgreSQL traffic from the application's group)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --protocol tcp \
  --port 5432 \
  --source-group sg-yyy
```

### Stage 3: Inference Service Deployment

**Option A: Lambda**

```bash
# Build the Lambda Layer (dependencies)
# Note: a function plus its layers may not exceed 250 MB unzipped. If ultralytics
# and its PyTorch dependency do not fit, package the function as a container image.
cd lambda-layer
pip install ultralytics opencv-python-headless -t python/
zip -r layer.zip python/
aws lambda publish-layer-version \
  --layer-name yolo-deps \
  --zip-file fileb://layer.zip \
  --compatible-runtimes python3.11

# Deploy the Lambda function
cd ../lambda
zip function.zip lambda_function.py
aws lambda create-function \
  --function-name invoice-inference \
  --runtime python3.11 \
  --handler lambda_function.lambda_handler \
  --role arn:aws:iam::123456789012:role/LambdaRole \
  --zip-file fileb://function.zip \
  --memory-size 4096 \
  --timeout 30 \
  --layers arn:aws:lambda:us-east-1:123456789012:layer:yolo-deps:1

# Create the API Gateway HTTP API
aws apigatewayv2 create-api \
  --name invoice-api \
  --protocol-type HTTP \
  --target arn:aws:lambda:us-east-1:123456789012:function:invoice-inference
```

**Option B: ECS Fargate**

```bash
# Create the ECR repository
aws ecr create-repository --repository-name invoice-inference

# Build and push the image
aws ecr get-login-password | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker build -t invoice-inference .
docker tag invoice-inference:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/invoice-inference:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/invoice-inference:latest

# Create the ECS cluster
aws ecs create-cluster --cluster-name invoice-cluster

# Register the task definition
aws ecs register-task-definition --cli-input-json file://task-definition.json

# Create the service
aws ecs create-service \
  --cluster invoice-cluster \
  --service-name invoice-service \
  --task-definition invoice-inference \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-xxx"],
      "securityGroups": ["sg-xxx"],
      "assignPublicIp": "ENABLED"
    }
  }'
```

### Stage 4: Training Service Setup

```python
# setup_sagemaker.py
import json

from sagemaker.pytorch import PyTorch

# ARN of the SageMaker execution role (the role must already exist in IAM)
role_arn = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Training job settings, kept in a plain dict so they can be saved and reused
training_config = {
    "entry_point": "train.py",
    "source_dir": "./src/training",
    "role": role_arn,
    "instance_count": 1,
    "instance_type": "ml.g4dn.xlarge",
    "framework_version": "2.0",
    "py_version": "py310",
    "use_spot_instances": True,
    "max_run": 7200,
    "max_wait": 14400,
    "checkpoint_s3_uri": "s3://invoice-training-data/checkpoints/",
}

# Configure the training job
estimator = PyTorch(**training_config)

# Save the settings for later use (Estimator objects have no save() method,
# so the plain dict is written out as JSON instead)
with open("training_config.json", "w") as f:
    json.dump(training_config, f, indent=2)
```

### Stage 5: Training Trigger API Integration

```python
# lambda_trigger_training.py
from sagemaker.pytorch import PyTorch


def lambda_handler(event, context):
    """Start a SageMaker training job."""

    epochs = event.get('epochs', 100)

    estimator = PyTorch(
        entry_point="train.py",
        # An S3 source_dir must point to a tar.gz archive;
        # the file name sourcedir.tar.gz is a placeholder
        source_dir="s3://invoice-training-data/code/sourcedir.tar.gz",
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        instance_count=1,
        instance_type="ml.g4dn.xlarge",
        framework_version="2.0",
        py_version="py310",
        use_spot_instances=True,
        max_run=7200,
        max_wait=14400,
        hyperparameters={
            "epochs": epochs,
            "batch-size": 16,
        }
    )

    estimator.fit(
        inputs={
            "training": "s3://invoice-training-data/datasets/train/",
            "validation": "s3://invoice-training-data/datasets/val/"
        },
        wait=False  # run asynchronously
    )

    return {
        'statusCode': 200,
        'body': {
            'training_job_name': estimator.latest_training_job.name,
            'status': 'Started'
        }
    }
```

---

## AWS vs Azure Comparison

### Service Mapping

| Function | AWS | Azure |
|------|-----|-------|
| Object storage | S3 | Blob Storage |
| Mount tool | Mountpoint for S3 | BlobFuse2 |
| ML platform | SageMaker | Azure ML |
| Container service | ECS/Fargate | Container Apps |
| Serverless | Lambda | Functions |
| GPU VM | EC2 P3/G4dn | NC/ND series |
| Container registry | ECR | ACR |
| Database | RDS PostgreSQL | PostgreSQL Flexible |

### Price Comparison

| Component | AWS | Azure |
|------|-----|-------|
| Storage (5GB) | ~$0.12/month | ~$0.09/month |
| Database | ~$15/month | ~$25/month |
| Inference (serverless) | ~$15/month | ~$30/month |
| Inference (containers) | ~$30/month | ~$30/month |
| Training (Spot GPU) | ~$2-5 per run | ~$1-5 per run |
| **Total** | **~$35-50/month** | **~$65/month** |

### Strengths and Weaknesses

| Aspect | AWS advantage | Azure advantage |
|------|---------|-----------|
| Price | Lambda is cheaper | GPU Spot is cheaper |
| ML platform | SageMaker is more mature | Azure ML is easier to use |
| Serverless GPU | None (no native support) | Container Apps GPU |
| Documentation | More extensive | Better Chinese-language documentation |
| Ecosystem | Larger | Office 365 integration |

---

## Summary

### Recommended Configuration

| Component | Recommended option | Estimated monthly cost |
|------|---------|---------|
| Data storage | S3 Standard | ~$0.12 |
| Database | RDS db.t3.micro | ~$15 |
| Inference service | Lambda 4GB | ~$15 |
| Training service | SageMaker Spot | On demand, ~$2-5 per run |
| ECR | Basic usage | ~$1 |
| **Total** | | **~$35/month** + training |

### Key Decisions

| Scenario | Choice |
|------|------|
| Lowest cost | Lambda + SageMaker Spot |
| Stable inference | ECS Fargate |
| GPU inference | ECS + EC2 GPU |
| MLOps integration | The full SageMaker suite |
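
### Caveats

1. **Lambda cold start**: the first invocation takes ~3-5 seconds; Provisioned Concurrency mitigates this
2. **Spot interruptions**: configure checkpoints so that SageMaker can resume the job automatically
3. **S3 data transfer**: free within the same region, billed across regions
4. **Fargate has no GPU**: workloads that need a GPU must run on ECS + EC2
5. **SageMaker markup**: ~25% more expensive than EC2, but it saves management effort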
567
docs/azure-deployment-guide.md
Normal file
@@ -0,0 +1,567 @@
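# Complete Guide to Azure Deployment Options

## Table of Contents
- [Core Questions](#core-questions)
- [Storage Options](#storage-options)
- [Training Options](#training-options)
- [Inference Options](#inference-options)
- [Price Comparison](#price-comparison)
- [Recommended Architecture](#recommended-architecture)
- [Implementation Steps](#implementation-steps)

---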

## Core Questions

| Question | Answer |
|------|------|
| Can Azure Blob Storage be used for training? | Yes, mounted with BlobFuse2 |
| Can training read from Blob Storage in real time? | Yes, but configuring a local cache is recommended |
| Can Azure Blob Storage be mounted locally? | Yes, with Rclone (Windows) or BlobFuse2 (Linux) |
| Is an idle VM billed? | Yes, a VM is billed for as long as it is powered on |
| How do I pay only for what I use? | Use Serverless GPU or a Compute Cluster with min=0 |
| What should serve inference? | Container Apps (CPU) or Serverless GPU |

---

## Storage Options

### Azure Blob Storage + BlobFuse2 (Recommended)

```bash
# Install BlobFuse2
sudo apt-get install blobfuse2

# Write the configuration file
cat > ~/blobfuse-config.yaml << 'EOF'
logging:
  type: syslog
  level: log_warning

components:
  - libfuse
  - file_cache
  - azstorage

file_cache:
  path: /tmp/blobfuse2
  timeout-sec: 120
  max-size-mb: 4096

azstorage:
  type: block
  account-name: YOUR_ACCOUNT
  account-key: YOUR_KEY
  container: training-images
EOF

# Mount (use $HOME here: the shell does not expand "~" after "--config-file=")
mkdir -p /mnt/azure-blob
blobfuse2 mount /mnt/azure-blob --config-file=$HOME/blobfuse-config.yaml
```

### Local Development (Windows)

```powershell
# Install
winget install WinFsp.WinFsp
winget install Rclone.Rclone

# Configure a remote named "azure"
rclone config  # choose azureblob

# Mount as drive Z:
rclone mount azure:training-images Z: --vfs-cache-mode full
```

### Storage Costs

| Tier | Price | Use case |
|------|------|---------|
| Hot | $0.018/GB/month | Frequent access |
| Cool | $0.01/GB/month | Occasional access |
| Archive | $0.002/GB/month | Long-term archive |

**This project**: ~10,000 images × 500 KB = ~5 GB → **~$0.09/month**

---

## Training Options

### Overview

| Option | Use case | Idle cost | Complexity |
|------|---------|---------|--------|
| Azure VM | Simple and direct | Billed 24/7 while running | Low |
| Azure VM Spot | Cheaper, interruptible | Billed 24/7 while running | Low |
| Azure ML Compute | MLOps integration | Can scale to 0 | Medium |
| Container Apps GPU | Serverless | Scales to 0 automatically | Medium |

### Azure VM vs Azure ML

| Feature | Azure VM | Azure ML |
|------|----------|----------|
| Nature | Virtual machine | Managed ML platform |
| Compute cost | $3.06/hr (NC6s_v3) | $3.06/hr (same) |
| Add-on costs | ~$5/month | ~$20-30/month |
| Experiment tracking | None | Built in |
| Autoscaling | None | Supports min=0 |
| Target users | DevOps | Data scientists |

### Azure ML Add-on Cost Breakdown

| Service | Purpose | Cost |
|------|------|------|
| Container Registry | Docker images | ~$5-20/month |
| Blob Storage | Logs, models | ~$0.10/month |
| Application Insights | Monitoring | ~$0-10/month |
| Key Vault | Secret management | <$1/month |

### Spot Instances

Both Azure VMs and Azure ML support Spot/low-priority instances, with savings of up to 90%:

| Type | Regular price | Spot price | Savings |
|------|---------|----------|------|
| NC6s_v3 (V100) | $3.06/hr | ~$0.92/hr | 70% |
| NC24ads_A100_v4 | $3.67/hr | ~$1.15/hr | 69% |
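
The savings column is 1 minus the Spot price divided by the regular price, rounded to a whole percent. A minimal sketch of that calculation, using the prices quoted in the table (the function name `spot_savings_percent` is hypothetical):

```python
def spot_savings_percent(regular_price: float, spot_price: float) -> int:
    """Return the Spot discount as a whole-number percentage."""
    return round((1 - spot_price / regular_price) * 100)


print(spot_savings_percent(3.06, 0.92))  # NC6s_v3 (V100)
print(spot_savings_percent(3.67, 1.15))  # NC24ads_A100_v4
```

Spot prices fluctuate, so the result is only as current as the prices passed in.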

### GPU Instance Prices

| Instance | GPU | GPU memory | Price per hour | Spot price |
|------|-----|------|---------|----------|
| NC6s_v3 | 1x V100 | 16GB | $3.06 | $0.92 |
| NC24s_v3 | 4x V100 | 64GB | $12.24 | $3.67 |
| NC24ads_A100_v4 | 1x A100 | 80GB | $3.67 | $1.15 |
| NC48ads_A100_v4 | 2x A100 | 160GB | $7.35 | $2.30 |

---

## Inference Options

### Comparison of Options

| Option | GPU support | Scaling | Price | Use case |
|------|---------|--------|------|---------|
| Container Apps (CPU) | No | Automatic, 0-N | ~$30/month | YOLO inference (sufficient) |
| Container Apps (GPU) | Yes | Serverless | Billed per second | High-throughput inference |
| Azure App Service | No | Manual/automatic | ~$50/month | Simple deployments |
| Azure ML Endpoint | Yes | Automatic | ~$100+/month | MLOps integration |
| AKS (Kubernetes) | Yes | Automatic | Complex billing | Large-scale production |

### Recommended: Container Apps (CPU)

For YOLO inference, **a CPU is sufficient** and no GPU is needed:
- YOLOv11n inference takes ~200-500ms on a CPU
- It is much cheaper than a GPU and suits low to medium traffic

```yaml
# Container Apps configuration
name: invoice-inference
image: myacr.azurecr.io/invoice-inference:v1
resources:
  cpu: 2.0
  memory: 4Gi
scale:
  minReplicas: 1   # keep at least 1 replica running so the service stays responsive
  maxReplicas: 10  # scale out to at most 10 replicas
  rules:
    - name: http-scaling
      http:
        metadata:
          concurrentRequests: "50"  # scale out at 50 concurrent requests per replica
```
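
To make the scale rule concrete, the sketch below approximates how many replicas the configuration above would settle on for a given load: the concurrent request count divided by the per-replica target, rounded up and clamped to the replica limits. This is a simplified model, not the platform's actual scaler (which also applies polling intervals and cooldowns), and the function name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(concurrent_requests: int, per_replica_target: int = 50,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Approximate the replica count for an HTTP concurrency scale rule."""
    needed = math.ceil(concurrent_requests / per_replica_target)
    # Clamp to the configured minReplicas/maxReplicas
    return max(min_replicas, min(max_replicas, needed))


for load in (0, 50, 51, 120, 10_000):
    print(load, "->", desired_replicas(load))
```

With `minReplicas: 1`, the service never scales to zero, which avoids cold starts at the cost of paying for one replica around the clock.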

### Inference Service Code Example

```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the code and the model
COPY src/ ./src/
COPY models/best.pt ./models/

# Start the service
CMD ["uvicorn", "src.web.app:app", "--host", "0.0.0.0", "--port", "8000"]
```

```python
# src/web/app.py
from fastapi import FastAPI, UploadFile, File
from ultralytics import YOLO
import tempfile

app = FastAPI()
model = YOLO("models/best.pt")


@app.post("/api/v1/infer")
async def infer(file: UploadFile = File(...)):
    # Save the uploaded file
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name

    # Run inference
    results = model.predict(tmp_path, conf=0.5)

    # Return the result
    # (extract_fields and get_confidence are project helpers that are not shown here)
    return {
        "fields": extract_fields(results),
        "confidence": get_confidence(results)
    }


@app.get("/health")
async def health():
    return {"status": "healthy"}
```

### Deployment Commands

```bash
# 1. Create the Container Registry
az acr create --name invoiceacr --resource-group myRG --sku Basic

# 2. Build and push the image
az acr build --registry invoiceacr --image invoice-inference:v1 .

# 3. Create the Container Apps environment
az containerapp env create \
  --name invoice-env \
  --resource-group myRG \
  --location eastus

# 4. Deploy the application
az containerapp create \
  --name invoice-inference \
  --resource-group myRG \
  --environment invoice-env \
  --image invoiceacr.azurecr.io/invoice-inference:v1 \
  --registry-server invoiceacr.azurecr.io \
  --cpu 2 --memory 4Gi \
  --min-replicas 1 --max-replicas 10 \
  --ingress external --target-port 8000

# 5. Get the URL
az containerapp show --name invoice-inference --resource-group myRG --query properties.configuration.ingress.fqdn
```

### High-Throughput Scenario: Serverless GPU

If inference needs GPU acceleration (high concurrency, low latency):

```bash
# Add a GPU workload profile (this requires GPU quota on the subscription)
# Check the exact profile type name for your region with:
#   az containerapp env workload-profile list-supported --location eastus
az containerapp env workload-profile add \
  --name invoice-env \
  --resource-group myRG \
  --workload-profile-name gpu \
  --workload-profile-type Consumption-GPU-T4

# Deploy the GPU version
az containerapp create \
  --name invoice-inference-gpu \
  --resource-group myRG \
  --environment invoice-env \
  --image invoiceacr.azurecr.io/invoice-inference-gpu:v1 \
  --workload-profile-name gpu \
  --cpu 4 --memory 8Gi \
  --min-replicas 0 --max-replicas 5 \
  --ingress external --target-port 8000
```

### Inference Performance Comparison

| Configuration | Time per inference | Concurrency | Estimated monthly cost |
|------|------------|---------|---------|
| CPU, 2 cores, 4GB | ~300-500ms | ~50 QPS | ~$30 |
| CPU, 4 cores, 8GB | ~200-300ms | ~100 QPS | ~$60 |
| GPU T4 | ~50-100ms | ~200 QPS | Billed per second |
| GPU A100 | ~20-50ms | ~500 QPS | Billed per second |

---

## Price Comparison

### Monthly Cost Comparison (assuming 2 hours of training per day)

| Option | Calculation | Monthly cost |
|------|---------|------|
| VM running 24/7 | 24 h × 30 days × $3.06 | ~$2,200 |
| VM started on demand | 2 h × 30 days × $3.06 | ~$184 |
| VM Spot, on demand | 2 h × 30 days × $0.92 | ~$55 |
| Serverless GPU | 2 h × 30 days × ~$3.50 | ~$210 |
| Azure ML (min=0) | 2 h × 30 days × $3.06 | ~$184 |

### Full Cost Estimate for This Project

| Component | Recommended option | Monthly cost |
|------|---------|------|
| Image storage | Blob Storage (Hot) | ~$0.10 |
| Database | PostgreSQL Flexible (Burstable B1ms) | ~$25 |
| Inference service | Container Apps CPU (2 cores, 4GB) | ~$30 |
| Training service | Azure ML Spot (on demand) | ~$1-5 per run |
| Container Registry | Basic | ~$5 |
| **Total** | | **~$65/month** + training |

---

## Recommended Architecture

### Overall Architecture Diagram

```
                     ┌─────────────────────────────────────┐
                     │         Azure Blob Storage          │
                     │  ├── training-images/               │
                     │  ├── datasets/                      │
                     │  └── models/                        │
                     └──────────────────┬──────────────────┘
                                        │
            ┌───────────────────────────┼───────────────────────────┐
            │                           │                           │
            ▼                           ▼                           ▼
┌───────────────────────┐   ┌───────────────────────┐   ┌───────────────────────┐
│ Inference (24/7)      │   │ Training (on demand)  │   │ Web UI (optional)     │
│ Container Apps        │   │ Azure ML Compute      │   │ Static Web Apps       │
│ CPU, 2 cores, 4 GB    │   │ min=0, Spot           │   │ ~$0 (free tier)       │
│ ~$30/mo               │   │ ~$1-5 per run         │   │                       │
│                       │   │                       │   │                       │
│ ┌───────────────────┐ │   │ ┌───────────────────┐ │   │ ┌───────────────────┐ │
│ │ FastAPI + YOLO    │ │   │ │ YOLOv11 Training  │ │   │ │ React/Vue frontend│ │
│ │ /api/v1/infer     │ │   │ │ 100 epochs        │ │   │ │ Invoice upload UI │ │
│ └───────────────────┘ │   │ └───────────────────┘ │   │ └───────────────────┘ │
└───────────┬───────────┘   └───────────┬───────────┘   └───────────┬───────────┘
            │                           │                           │
            └───────────────────────────┼───────────────────────────┘
                                        │
                                        ▼
                            ┌───────────────────────┐
                            │      PostgreSQL       │
                            │    Flexible Server    │
                            │    Burstable B1ms     │
                            │        ~$25/mo        │
                            └───────────────────────┘
```

### Inference Service Configuration

```yaml
# Container Apps - CPU (running 24/7)
name: invoice-inference
resources:
  cpu: 2
  memory: 4Gi
scale:
  minReplicas: 1
  maxReplicas: 10
env:
  - name: MODEL_PATH
    value: /app/models/best.pt
  - name: DB_HOST
    secretRef: db-host
  - name: DB_PASSWORD
    secretRef: db-password
```

### Training Service Configuration

**Option A: Azure ML Compute (recommended)**

```python
from azure.ai.ml.entities import AmlCompute

gpu_cluster = AmlCompute(
    name="gpu-cluster",
    size="Standard_NC6s_v3",
    min_instances=0,  # scale to zero when idle
    max_instances=1,
    tier="LowPriority",  # Spot (low-priority) instances
    idle_time_before_scale_down=120,  # seconds of idle time before scaling down
)

# Create the cluster with an authenticated MLClient:
# ml_client.compute.begin_create_or_update(gpu_cluster).result()
```

**Option B: Container Apps Serverless GPU**

```yaml
name: invoice-training
resources:
  gpu: 1
  gpuType: A100
scale:
  minReplicas: 0
  maxReplicas: 1
```

---

## Implementation Steps

### Phase 1: Storage Setup

```bash
# Create the Storage Account
az storage account create \
  --name invoicestorage \
  --resource-group myRG \
  --sku Standard_LRS

# Create the containers
az storage container create --name training-images --account-name invoicestorage
az storage container create --name datasets --account-name invoicestorage
az storage container create --name models --account-name invoicestorage

# Upload the training data
az storage blob upload-batch \
  --destination training-images \
  --source ./data/dataset/temp \
  --account-name invoicestorage
```

### Phase 2: Database Setup

```bash
# Create the PostgreSQL server (Standard_B1ms belongs to the Burstable tier)
az postgres flexible-server create \
  --name invoice-db \
  --resource-group myRG \
  --tier Burstable \
  --sku-name Standard_B1ms \
  --storage-size 32 \
  --admin-user docmaster \
  --admin-password YOUR_PASSWORD

# Configure the firewall
# (the 0.0.0.0 - 0.0.0.0 range allows access from Azure services)
az postgres flexible-server firewall-rule create \
  --name allow-azure \
  --resource-group myRG \
  --server-name invoice-db \
  --start-ip-address 0.0.0.0 \
  --end-ip-address 0.0.0.0
```

### Phase 3: Inference Service Deployment

```bash
# Create the Container Registry
az acr create --name invoiceacr --resource-group myRG --sku Basic

# Build the image
az acr build --registry invoiceacr --image invoice-inference:v1 .

# Create the environment
az containerapp env create \
  --name invoice-env \
  --resource-group myRG \
  --location eastus

# Deploy the inference service
# (DB_PASSWORD is read from the db-password secret via secretref)
az containerapp create \
  --name invoice-inference \
  --resource-group myRG \
  --environment invoice-env \
  --image invoiceacr.azurecr.io/invoice-inference:v1 \
  --registry-server invoiceacr.azurecr.io \
  --cpu 2 --memory 4Gi \
  --min-replicas 1 --max-replicas 10 \
  --ingress external --target-port 8000 \
  --secrets db-password=YOUR_PASSWORD \
  --env-vars \
    DB_HOST=invoice-db.postgres.database.azure.com \
    DB_NAME=docmaster \
    DB_USER=docmaster \
    DB_PASSWORD=secretref:db-password
```
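
Inside the container, the service has to turn these environment variables into a database connection. The sketch below shows one way to do that with only the standard library. `Settings` and `load_settings` are hypothetical helpers that are not part of the original project, and both `sslmode=require` and port 5432 are assumptions to adjust for your server.

```python
import os
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Settings:
    """Runtime settings for the inference container (hypothetical helper)."""

    db_host: str
    db_name: str
    db_user: str
    db_password: str
    model_path: str

    @property
    def dsn(self) -> str:
        # Percent-encode credentials so special characters survive in the URL.
        # sslmode=require and port 5432 are assumptions, not project settings.
        user = quote(self.db_user, safe="")
        password = quote(self.db_password, safe="")
        return (
            f"postgresql://{user}:{password}@{self.db_host}:5432/"
            f"{self.db_name}?sslmode=require"
        )


def load_settings(env=os.environ) -> Settings:
    """Build Settings from environment variables, failing fast on missing ones."""
    required = ["DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        db_host=env["DB_HOST"],
        db_name=env["DB_NAME"],
        db_user=env["DB_USER"],
        db_password=env["DB_PASSWORD"],
        # Default mirrors the MODEL_PATH value from the inference configuration.
        model_path=env.get("MODEL_PATH", "/app/models/best.pt"),
    )
```

Failing at startup when a variable is missing tends to surface a misconfigured deployment earlier than a connection error on the first request would.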

### Phase 4: Training Service Setup

```bash
# Create the Azure ML Workspace
az ml workspace create --name invoice-ml --resource-group myRG

# Create the Compute Cluster (scales to zero, low-priority VMs)
az ml compute create --name gpu-cluster \
  --resource-group myRG \
  --workspace-name invoice-ml \
  --type AmlCompute \
  --size Standard_NC6s_v3 \
  --min-instances 0 \
  --max-instances 1 \
  --tier low_priority
```

### Phase 5: Integrate the Training Trigger API

```python
# src/web/routes/training.py
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="your-subscription-id",
    resource_group_name="myRG",
    workspace_name="invoice-ml",
)


class TrainingRequest(BaseModel):
    """Request body for a training run (assumed schema; adjust to your needs)."""

    epochs: int = 100  # typed as int so it is safe to interpolate into the command


@router.post("/api/v1/train")
async def trigger_training(request: TrainingRequest):
    """Submit an Azure ML training job."""
    training_job = command(
        code="./training",
        command=f"python train.py --epochs {request.epochs}",
        environment="AzureML-pytorch-2.0-cuda11.8@latest",
        compute="gpu-cluster",
    )
    job = ml_client.jobs.create_or_update(training_job)
    return {
        "job_id": job.name,
        "status": job.status,
        "studio_url": job.studio_url,
    }


@router.get("/api/v1/train/{job_id}/status")
async def get_training_status(job_id: str):
    """Return the current status of a training job."""
    job = ml_client.jobs.get(job_id)
    return {"status": job.status}
```
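
A client of the status endpoint usually needs to wait until the job finishes. The sketch below polls with exponential backoff and a timeout. `wait_for_job` is a hypothetical helper, and the set of terminal states is an assumption that should be checked against the statuses your SDK version actually reports.

```python
import time
from typing import Callable

# Assumed terminal states; verify against the statuses your SDK version returns.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(
    get_status: Callable[[], str],
    timeout_s: float = 3600.0,
    initial_delay_s: float = 5.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() with exponential backoff until a terminal state or timeout."""
    waited = 0.0
    delay = initial_delay_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"Job still '{status}' after {waited:.0f}s")
        sleep(delay)
        waited += delay
        # Double the delay each round, capped at max_delay_s.
        delay = min(delay * 2, max_delay_s)
```

With the client from the route module, a call could look like `wait_for_job(lambda: ml_client.jobs.get(job_id).status)`. The `sleep` parameter is injectable so the helper can be exercised in tests without real waiting.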

---

## Summary

### Recommended Configuration

| Component | Recommended option | Estimated monthly cost |
|------|---------|---------|
| Image storage | Blob Storage (Hot) | ~$0.10 |
| Database | PostgreSQL Flexible | ~$25 |
| Inference service | Container Apps CPU | ~$30 |
| Training service | Azure ML (min=0, Spot) | On demand, ~$1-5 per run |
| Container Registry | Basic | ~$5 |
| **Total** | | **~$65/month** + training costs |
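
To see how training frequency moves the monthly bill, the estimate can be written as a small calculation. This is a rough sketch that uses the approximate figures from the table; `monthly_cost_range` is a hypothetical helper. The listed fixed items add up to roughly $60, so the table's ~$65 figure appears to include some rounding headroom.

```python
from typing import Tuple

# Fixed monthly line items from the table (USD, approximate).
FIXED_MONTHLY_USD = {
    "blob_storage": 0.10,
    "postgresql": 25.0,
    "container_apps_cpu": 30.0,
    "container_registry": 5.0,
}

# Per-run training cost range from the table (USD, approximate).
TRAINING_RUN_USD = (1.0, 5.0)


def monthly_cost_range(training_runs: int) -> Tuple[float, float]:
    """Return the (low, high) estimated monthly cost for a number of training runs."""
    fixed = sum(FIXED_MONTHLY_USD.values())
    low, high = TRAINING_RUN_USD
    return (
        round(fixed + training_runs * low, 2),
        round(fixed + training_runs * high, 2),
    )


print(monthly_cost_range(4))
```

Under these assumptions the fixed services dominate the bill unless training runs become frequent.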

### Key Decisions

| Scenario | Choice |
|------|------|
| Occasional training, simple requirements | Azure VM Spot + manual start/stop |
| MLOps and team collaboration needed | Azure ML Compute |
| Lowest idle cost | Container Apps Serverless GPU |
| Production inference | Container Apps CPU |
| High-concurrency inference | Container Apps Serverless GPU |
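
### Notes

1. **Cold start**: Serverless GPU takes 3-8 minutes to start.
2. **Spot interruptions**: Spot instances can be preempted, so training needs a checkpointing mechanism.
3. **Network latency**: A mounted Blob Storage container is slower than a local SSD; enabling caching is recommended.
4. **Region selection**: Choose a region with available GPU quota (East US, West Europe, etc.).
5. **Inference optimization**: CPU inference is sufficient for YOLO; no GPU is needed.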
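
The checkpointing mechanism mentioned in note 2 can be sketched as follows. This is a minimal illustration using only the standard library. `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical names, and a real training script would also persist model weights and optimizer state rather than only the epoch counter.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state via a temp file so a preemption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic on a local filesystem


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"epoch": 0}
    with open(path) as f:
        return json.load(f)


def train(total_epochs: int, checkpoint_path: str, run_epoch=lambda epoch: None) -> int:
    """Resume from the last completed epoch; return how many epochs ran this time."""
    state = load_checkpoint(checkpoint_path)
    start = state["epoch"]
    for epoch in range(start, total_epochs):
        run_epoch(epoch)  # placeholder for one real training epoch
        save_checkpoint(checkpoint_path, {"epoch": epoch + 1})
    return total_epochs - start
```

If the Spot instance is reclaimed partway through, rerunning `train` with the same checkpoint path continues from the first unfinished epoch. For this to survive preemption, the checkpoint path has to live on durable storage rather than the instance's local disk, and the atomic-rename guarantee may not hold on a mounted blob container.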