invoice-master-poc-v2/docs/aws-deployment-guide.md
Yaojia Wang a516de4320 WIP
2026-02-01 00:08:40 +01:00

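# Complete Guide to AWS Deployment Options
## Table of Contents
- [Key Questions](#key-questions)
- [Storage Options](#storage-options)
- [Training Options](#training-options)
- [Inference Options](#inference-options)
- [Price Comparison](#price-comparison)
- [Recommended Architecture](#recommended-architecture)
- [Implementation Steps](#implementation-steps)
- [AWS vs Azure](#aws-vs-azure)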
---
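## Key Questions
| Question | Answer |
|------|------|
| Can S3 be used for training? | Yes, through Mountpoint for Amazon S3 or SageMaker's native S3 support |
| Can training read from S3 in real time? | Yes, SageMaker supports streaming reads in Pipe mode |
| Can S3 be mounted locally? | Yes, with s3fs-fuse or Rclone |
| Is an idle EC2 instance billed? | Yes, a running instance is billed at its hourly rate whether or not it is doing work |
| How do I pay only for what I use? | Use SageMaker Managed Spot Training or Lambda |
| What should serve inference? | Lambda (serverless) or ECS/Fargate (containers) |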
---
## Storage Options
### Amazon S3 (Recommended)
S3 is AWS's core storage service and integrates closely with SageMaker.
```bash
# Create the S3 bucket
aws s3 mb s3://invoice-training-data --region us-east-1
# Upload the training data
aws s3 sync ./data/dataset/temp s3://invoice-training-data/images/
# Create the directory layout (zero-byte objects that act as prefix markers)
aws s3api put-object --bucket invoice-training-data --key datasets/
aws s3api put-object --bucket invoice-training-data --key models/
```
### Mountpoint for Amazon S3
Mountpoint is AWS's official S3 mount client and performs better than s3fs.
```bash
# Install Mountpoint
wget https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.deb
sudo dpkg -i mount-s3.deb
# Mount the bucket
mkdir -p /mnt/s3-data
mount-s3 invoice-training-data /mnt/s3-data --region us-east-1
# Mount with a local cache (recommended)
mount-s3 invoice-training-data /mnt/s3-data \
    --region us-east-1 \
    --cache /tmp/s3-cache \
    --metadata-ttl 60
```
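
Once the bucket is mounted, training code can treat the objects as ordinary files. The sketch below lists the image files under a mount point; the function name, the suffix set, and the `/mnt/s3-data/images` path in the usage comment are assumptions for illustration, not part of the project.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the dataset's real formats.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_training_images(mount_root: str) -> list[Path]:
    """Return image files under mount_root in a stable, sorted order."""
    root = Path(mount_root)
    if not root.is_dir():
        # Bucket not mounted (or wrong path): report no files instead of failing.
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


# Usage (hypothetical path): images = list_training_images("/mnt/s3-data/images")
```

Sorting the paths keeps the file order reproducible between runs, which helps when a train/validation split is derived from the listing.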
### Mounting for Local Development
**Linux/Mac (s3fs-fuse):**
```bash
# Install (Debian/Ubuntu package shown)
sudo apt-get install s3fs
# Configure credentials
echo ACCESS_KEY_ID:SECRET_ACCESS_KEY > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs
# Mount
s3fs invoice-training-data /mnt/s3 -o passwd_file=~/.passwd-s3fs
```
**Windows (Rclone):**
```powershell
# Install
winget install Rclone.Rclone
# Configure
rclone config # choose s3
# Mount
rclone mount aws:invoice-training-data Z: --vfs-cache-mode full
```
### Storage Cost
| Tier | Price | Use case |
|------|------|---------|
| S3 Standard | $0.023/GB-month | Frequent access |
| S3 Intelligent-Tiering | $0.023/GB-month | Automatic tiering |
| S3 Infrequent Access | $0.0125/GB-month | Occasional access |
| S3 Glacier | $0.004/GB-month | Long-term archive |
**This project**: ~10,000 images × 500 KB ≈ 5 GB → **~$0.12/month**
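
The estimate above is simple arithmetic and can be rerun for other dataset sizes. This is a minimal sketch that assumes decimal units (1 GB = 1,000,000 KB) and uses the per-GB prices from the table; the function name is illustrative.

```python
def s3_monthly_cost(num_objects: int, avg_object_kb: float, usd_per_gb_month: float) -> float:
    """Monthly storage cost in USD for num_objects objects of avg_object_kb each."""
    total_gb = num_objects * avg_object_kb / 1_000_000  # decimal units: 1 GB = 1,000,000 KB
    return total_gb * usd_per_gb_month


# Figures from the table: 10,000 images of about 500 KB each.
standard = s3_monthly_cost(10_000, 500, 0.023)     # S3 Standard
infrequent = s3_monthly_cost(10_000, 500, 0.0125)  # S3 Infrequent Access
print(f"S3 Standard:          ${standard:.4f}/month")
print(f"S3 Infrequent Access: ${infrequent:.4f}/month")
```

At this scale the storage tier barely matters; the difference between tiers is a few cents per month.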
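### SageMaker Data Input Modes
| Mode | Description | Use case |
|------|------|---------|
| File mode | Downloads the data to local storage before training starts | Small datasets |
| Pipe mode | Streams the data and uses no local disk space | Large datasets |
| FastFile mode | Downloads on demand, up to 3x faster | Recommended |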
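
The input mode is chosen per data channel. As a rough sketch, the helper below builds one channel entry in the shape that the `InputDataConfig` parameter of the SageMaker `CreateTrainingJob` API expects; the helper name and the default of `FastFile` are assumptions made for this example.

```python
VALID_INPUT_MODES = {"File", "Pipe", "FastFile"}


def make_channel(name: str, s3_uri: str, input_mode: str = "FastFile") -> dict:
    """Build one InputDataConfig channel for a SageMaker training job."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"input_mode must be one of {sorted(VALID_INPUT_MODES)}")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


channels = [
    make_channel("training", "s3://invoice-training-data/datasets/train/"),
    make_channel("validation", "s3://invoice-training-data/datasets/val/", input_mode="File"),
]
```

With the SageMaker Python SDK the same choice is usually expressed through `sagemaker.inputs.TrainingInput(..., input_mode="FastFile")`.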
---
## Training Options
### Overview
| Option | Best for | Idle cost | Complexity | Spot support |
|------|---------|---------|--------|----------|
| EC2 GPU | Simple, direct setups | Billed 24/7 while running | Low | Yes |
| SageMaker Training | MLOps integration | Billed per job | Medium | Yes |
| EKS + GPU | Kubernetes | Complex billing | High | Yes |
### EC2 vs SageMaker
| Feature | EC2 | SageMaker |
|------|-----|-----------|
| What it is | Virtual machine | Managed ML platform |
| Compute price | $3.06/hr (p3.2xlarge) | $3.825/hr (+25%) |
| Management overhead | Self-configured | Fully managed |
| Spot discount | Up to 90% | Up to 90% |
| Experiment tracking | None | Built in |
| Automatic shutdown | None | Stops when the job finishes |
### GPU Instance Prices (after the June 2025 price cut)
| Instance | GPU | GPU memory | On-Demand | Spot price |
|------|-----|------|-----------|----------|
| g4dn.xlarge | 1x T4 | 16GB | $0.526/hr | ~$0.16/hr |
| g4dn.2xlarge | 1x T4 | 16GB | $0.752/hr | ~$0.23/hr |
| p3.2xlarge | 1x V100 | 16GB | $3.06/hr | ~$0.92/hr |
| p3.8xlarge | 4x V100 | 64GB | $12.24/hr | ~$3.67/hr |
| p4d.24xlarge | 8x A100 | 320GB | $32.77/hr | ~$9.83/hr |
**Note**: In June 2025 AWS announced price cuts of up to 45% on P4/P5 instances; confirm current rates on the EC2 pricing page before budgeting.
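
To see what discount the Spot column actually assumes, the figures from the table can be compared row by row. This sketch only reuses the table's numbers; real Spot prices fluctuate by region and over time.

```python
# (on-demand $/hr, approximate Spot $/hr) copied from the table above.
PRICES = {
    "g4dn.xlarge": (0.526, 0.16),
    "g4dn.2xlarge": (0.752, 0.23),
    "p3.2xlarge": (3.06, 0.92),
    "p3.8xlarge": (12.24, 3.67),
    "p4d.24xlarge": (32.77, 9.83),
}


def spot_discount(on_demand: float, spot: float) -> float:
    """Fraction saved by Spot relative to On-Demand (0.7 means 70% cheaper)."""
    return 1 - spot / on_demand


for instance, (on_demand, spot) in PRICES.items():
    print(f"{instance:14s} {spot_discount(on_demand, spot):.1%}")
```

Every row works out to a discount of about 70%, which is more conservative than the "up to 90%" ceiling quoted in the EC2 vs SageMaker table.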
### Spot Instances
```bash
# EC2 Spot request
aws ec2 request-spot-instances \
    --instance-count 1 \
    --type "one-time" \
    --launch-specification '{
        "ImageId": "ami-0123456789abcdef0",
        "InstanceType": "p3.2xlarge",
        "KeyName": "my-key"
    }'
```
### SageMaker Managed Spot Training
```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="./src",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.0",
    py_version="py310",
    # Enable Spot instances
    use_spot_instances=True,
    max_run=3600,   # run for at most 1 hour
    max_wait=7200,  # wait for at most 2 hours (including run time)
    # Checkpoint settings (used to resume after a Spot interruption)
    checkpoint_s3_uri="s3://invoice-training-data/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints",
    hyperparameters={
        "epochs": 100,
        "batch-size": 16,
    },
)
estimator.fit({
    "training": "s3://invoice-training-data/datasets/train/",
    "validation": "s3://invoice-training-data/datasets/val/",
})
```
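
SageMaker syncs `checkpoint_local_path` with `checkpoint_s3_uri` and restores the directory when an interrupted job restarts, but the training script still has to look for an existing checkpoint itself. The sketch below shows that resume logic using JSON files as stand-ins for real model checkpoints; the file naming scheme and the placeholder training step are assumptions, not the project's actual `train.py`.

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("/opt/ml/checkpoints")  # matches checkpoint_local_path above


def save_checkpoint(checkpoint_dir: Path, epoch: int, state: dict) -> Path:
    """Write one checkpoint per epoch; zero-padded names sort chronologically."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    path = checkpoint_dir / f"epoch_{epoch:04d}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"epoch": epoch, "state": state}))
    tmp.replace(path)  # rename last so a partially written file is never picked up
    return path


def load_latest_checkpoint(checkpoint_dir: Path):
    """Return the newest checkpoint as a dict, or None on a fresh start."""
    candidates = sorted(checkpoint_dir.glob("epoch_*.json"))
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text())


def train(total_epochs: int, checkpoint_dir: Path = CHECKPOINT_DIR) -> int:
    """Run the remaining epochs and return the epoch index training started from."""
    checkpoint = load_latest_checkpoint(checkpoint_dir)
    start_epoch = checkpoint["epoch"] + 1 if checkpoint else 0
    for epoch in range(start_epoch, total_epochs):
        state = {"loss": 1.0 / (epoch + 1)}  # placeholder for a real training step
        save_checkpoint(checkpoint_dir, epoch, state)
    return start_epoch
```

A real script would save model and optimizer state instead of a dict, but the same pattern applies: list the checkpoint directory on startup and continue from the newest entry.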
---
## Inference Options
### Comparison
| Option | GPU support | Scaling | Cold start | Price | Best for |
|------|---------|--------|--------|------|---------|
| Lambda | No | Automatic, 0-N | Fast | Per invocation | Low traffic, CPU inference |
| Lambda + Container | No | Automatic, 0-N | Slower | Per invocation | Complex dependencies |
| ECS Fargate | No | Automatic | Medium | ~$30/month | Containerized services |
| ECS + EC2 GPU | Yes | Manual/automatic | Slow | ~$100+/month | GPU inference |
| SageMaker Endpoint | Yes | Automatic | Slow | ~$80+/month | MLOps integration |
| SageMaker Serverless | No | Automatic, 0-N | Medium | Per invocation | Intermittent traffic |
### Recommended Option 1: AWS Lambda (Low Traffic)
For YOLO inference on CPU, Lambda is the most economical choice:
```python
# lambda_function.py
import json
import boto3
from ultralytics import YOLO

# The model is loaded from a Lambda Layer or /tmp and cached across warm invocations
model = None


def load_model():
    global model
    if model is None:
        # Download the model from S3 to /tmp
        s3 = boto3.client('s3')
        s3.download_file('invoice-models', 'best.pt', '/tmp/best.pt')
        model = YOLO('/tmp/best.pt')
    return model


def lambda_handler(event, context):
    model = load_model()
    # Fetch the image from S3
    s3 = boto3.client('s3')
    bucket = event['bucket']
    key = event['key']
    local_path = f'/tmp/{key.split("/")[-1]}'
    s3.download_file(bucket, key, local_path)
    # Run inference
    results = model.predict(local_path, conf=0.5)
    # extract_fields and get_confidence are project helpers assumed to be defined elsewhere
    return {
        'statusCode': 200,
        'body': json.dumps({
            'fields': extract_fields(results),
            'confidence': get_confidence(results)
        })
    }
```
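
The handler calls `extract_fields` and `get_confidence`, which are not shown in this guide. The following is one hypothetical way to write them, assuming `results` is the list returned by Ultralytics `model.predict`, where each item exposes `names` and a `boxes` object with `cls`, `conf`, and `xyxy`. The output keys (`label`, `confidence`, `bbox`) are invented for illustration and the project's real helpers may differ.

```python
def extract_fields(results) -> list[dict]:
    """Collect one entry per detected box: class name, confidence, bounding box."""
    fields = []
    for result in results:
        boxes = result.boxes
        for cls_id, conf, xyxy in zip(boxes.cls, boxes.conf, boxes.xyxy):
            fields.append({
                "label": result.names[int(cls_id)],
                "confidence": float(conf),
                "bbox": [float(v) for v in xyxy],  # x1, y1, x2, y2 in pixels
            })
    return fields


def get_confidence(results) -> float:
    """Mean confidence over all detections, or 0.0 when nothing was detected."""
    confidences = [float(c) for result in results for c in result.boxes.conf]
    return sum(confidences) / len(confidences) if confidences else 0.0
```

Converting every value with `int()` and `float()` keeps the output JSON-serializable, which matters because the handler passes it to `json.dumps`.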
**Lambda configuration:**
```yaml
# serverless.yml
service: invoice-inference
provider:
  name: aws
  runtime: python3.11
  timeout: 30
  memorySize: 4096 # 4 GB of memory
functions:
  infer:
    handler: lambda_function.lambda_handler
    events:
      - http:
          path: /infer
          method: post
    layers:
      - arn:aws:lambda:us-east-1:123456789012:layer:yolo-deps:1
```
### Recommended Option 2: ECS Fargate (Medium Traffic)
`task-definition.json` (the `executionRoleArn` value is a placeholder for a role that carries the `AmazonECSTaskExecutionRolePolicy` managed policy, which Fargate needs to pull the image from ECR and write logs):
```json
{
  "family": "invoice-inference",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "2048",
  "memory": "4096",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "inference",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/invoice-inference:latest",
      "portMappings": [
        {
          "containerPort": 8000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {"name": "MODEL_PATH", "value": "/app/models/best.pt"}
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/invoice-inference",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
```
**Auto Scaling configuration:**
```bash
# Register the service as an Auto Scaling target
aws application-autoscaling register-scalable-target \
    --service-namespace ecs \
    --resource-id service/invoice-cluster/invoice-service \
    --scalable-dimension ecs:service:DesiredCount \
    --min-capacity 1 \
    --max-capacity 10
# Scale in and out on average CPU utilization
aws application-autoscaling put-scaling-policy \
    --service-namespace ecs \
    --resource-id service/invoice-cluster/invoice-service \
    --scalable-dimension ecs:service:DesiredCount \
    --policy-name cpu-scaling \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration '{
        "TargetValue": 70,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120
    }'
```
### Option 3: SageMaker Serverless Inference
```python
from sagemaker.serverless import ServerlessInferenceConfig
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://invoice-models/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    entry_point="inference.py",
    framework_version="2.0",
    py_version="py310",
)
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10,
)
predictor = model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="invoice-inference-serverless",
)
```
### Inference Performance Comparison
| Configuration | Latency per inference | Concurrency | Estimated monthly cost |
|------|------------|---------|---------|
| Lambda 4GB | ~500-800ms | Scales on demand | ~$15 (10K requests) |
| Fargate 2vCPU 4GB | ~300-500ms | ~50 QPS | ~$30 |
| Fargate 4vCPU 8GB | ~200-300ms | ~100 QPS | ~$60 |
| EC2 g4dn.xlarge (T4) | ~50-100ms | ~200 QPS | ~$380 |
---
## Price Comparison
### Training Cost Comparison (assuming 2 hours of training per day)
| Option | Calculation | Monthly cost |
|------|---------|------|
| EC2 running 24/7 | 24h × 30 days × $3.06 | ~$2,200 |
| EC2 started and stopped on demand | 2h × 30 days × $3.06 | ~$184 |
| EC2 Spot on demand | 2h × 30 days × $0.92 | ~$55 |
| SageMaker On-Demand | 2h × 30 days × $3.825 | ~$230 |
| SageMaker Spot | 2h × 30 days × $1.15 | ~$69 |
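
The rows above all follow the same formula, hours per day × days × hourly rate, so the table is easy to regenerate for a different training schedule. A minimal sketch using the rates from the table:

```python
def monthly_cost(hours_per_day: float, hourly_rate: float, days: int = 30) -> float:
    """Monthly compute cost in USD for a fixed number of billed hours per day."""
    return hours_per_day * days * hourly_rate


# (hours per day, $/hr) for each row of the table above.
SCENARIOS = {
    "EC2 running 24/7": (24, 3.06),
    "EC2 on demand": (2, 3.06),
    "EC2 Spot": (2, 0.92),
    "SageMaker On-Demand": (2, 3.825),
    "SageMaker Spot": (2, 1.15),
}

for name, (hours, rate) in SCENARIOS.items():
    print(f"{name:22s} ${monthly_cost(hours, rate):,.2f}/month")
```

The gap between the first two rows is the cost of leaving an instance running when it is idle 22 hours a day.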
### Full Cost Estimate for This Project
| Component | Recommended option | Monthly cost |
|------|---------|------|
| Data storage | S3 Standard (5GB) | ~$0.12 |
| Database | RDS PostgreSQL (db.t3.micro) | ~$15 |
| Inference service | Lambda (10K requests/month) | ~$15 |
| Inference service (alternative) | ECS Fargate | ~$30 |
| Training service | SageMaker Spot (on demand) | ~$2-5 per run |
| ECR (image storage) | Basic usage | ~$1 |
| **Total (Lambda)** | | **~$35/month** + training |
| **Total (Fargate)** | | **~$50/month** + training |
---
## Recommended Architecture
### Overall Architecture Diagram
```
                     ┌─────────────────────────────────────┐
                     │ Amazon S3                           │
                     │ ├── training-images/                │
                     │ ├── datasets/                       │
                     │ ├── models/                         │
                     │ └── checkpoints/                    │
                     └──────────────────┬──────────────────┘
            ┌───────────────────────────┼───────────────────────────┐
            │                           │                           │
            ▼                           ▼                           ▼
┌───────────────────────┐   ┌───────────────────────┐   ┌───────────────────────┐
│ Inference service     │   │ Training service      │   │ API Gateway           │
│                       │   │                       │   │                       │
│ Option A: Lambda      │   │ SageMaker             │   │ REST API              │
│ ~$15/mo (10K req)     │   │ Managed Spot          │   │ Triggers Lambda/ECS   │
│                       │   │ ~$2-5 per run         │   │                       │
│ Option B: ECS Fargate │   │                       │   │                       │
│ ~$30/mo               │   │ - Auto start          │   │                       │
│                       │   │ - Auto stop when done │   │                       │
│ ┌───────────────────┐ │   │ - Auto checkpointing  │   │                       │
│ │ FastAPI + YOLO    │ │   │                       │   │                       │
│ │ CPU inference     │ │   │                       │   │                       │
│ └───────────────────┘ │   └───────────┬───────────┘   └───────────────────────┘
└───────────┬───────────┘               │
            │                           │
            └───────────────────────────┤
                                        ▼
                            ┌───────────────────────┐
                            │ Amazon RDS            │
                            │ PostgreSQL            │
                            │ db.t3.micro           │
                            │ ~$15/mo               │
                            └───────────────────────┘
```
### Lambda Inference Configuration
```yaml
# SAM template
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  InferenceFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.lambda_handler
      Runtime: python3.11
      MemorySize: 4096
      Timeout: 30
      Environment:
        Variables:
          MODEL_BUCKET: invoice-models
          MODEL_KEY: best.pt
      Policies:
        - S3ReadPolicy:
            BucketName: invoice-models
        - S3ReadPolicy:
            BucketName: invoice-uploads
      Events:
        InferApi:
          Type: Api
          Properties:
            Path: /infer
            Method: post
```
### SageMaker Training Configuration
```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="./src",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",  # T4 GPU
    framework_version="2.0",
    py_version="py310",
    # Spot instance settings
    use_spot_instances=True,
    max_run=7200,
    max_wait=14400,
    # Checkpoints
    checkpoint_s3_uri="s3://invoice-training-data/checkpoints/",
    hyperparameters={
        "epochs": 100,
        "batch-size": 16,
        "model": "yolo11n.pt",
    },
)
```
---
## Implementation Steps
### Stage 1: Storage Setup
```bash
# Create the S3 buckets
aws s3 mb s3://invoice-training-data --region us-east-1
aws s3 mb s3://invoice-models --region us-east-1
# Upload the training data
aws s3 sync ./data/dataset/temp s3://invoice-training-data/images/
# Configure a lifecycle rule (optional; moves objects to colder storage automatically)
# The empty-prefix Filter applies the rule to every object in the bucket
aws s3api put-bucket-lifecycle-configuration \
    --bucket invoice-training-data \
    --lifecycle-configuration '{
        "Rules": [{
            "ID": "MoveToIA",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [{
                "Days": 30,
                "StorageClass": "STANDARD_IA"
            }]
        }]
    }'
```
### Stage 2: Database Setup
```bash
# Create the RDS PostgreSQL instance
aws rds create-db-instance \
    --db-instance-identifier invoice-db \
    --db-instance-class db.t3.micro \
    --engine postgres \
    --engine-version 15 \
    --master-username docmaster \
    --master-user-password YOUR_PASSWORD \
    --allocated-storage 20
# Configure the security group (allow PostgreSQL traffic from the application's group)
aws ec2 authorize-security-group-ingress \
    --group-id sg-xxx \
    --protocol tcp \
    --port 5432 \
    --source-group sg-yyy
```
### Stage 3: Inference Service Deployment
**Option A: Lambda**
```bash
# Create the Lambda Layer (dependencies)
# Note: the function plus all its layers must stay under Lambda's 250 MB unzipped limit;
# if ultralytics and its PyTorch dependency exceed it, deploy as a container image instead
cd lambda-layer
pip install ultralytics opencv-python-headless -t python/
zip -r layer.zip python/
aws lambda publish-layer-version \
    --layer-name yolo-deps \
    --zip-file fileb://layer.zip \
    --compatible-runtimes python3.11
# Deploy the Lambda function
cd ../lambda
zip function.zip lambda_function.py
aws lambda create-function \
    --function-name invoice-inference \
    --runtime python3.11 \
    --handler lambda_function.lambda_handler \
    --role arn:aws:iam::123456789012:role/LambdaRole \
    --zip-file fileb://function.zip \
    --memory-size 4096 \
    --timeout 30 \
    --layers arn:aws:lambda:us-east-1:123456789012:layer:yolo-deps:1
# Create the API Gateway HTTP API
aws apigatewayv2 create-api \
    --name invoice-api \
    --protocol-type HTTP \
    --target arn:aws:lambda:us-east-1:123456789012:function:invoice-inference
```
**Option B: ECS Fargate**
```bash
# Create the ECR repository
aws ecr create-repository --repository-name invoice-inference
# Build and push the image
aws ecr get-login-password | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker build -t invoice-inference .
docker tag invoice-inference:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/invoice-inference:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/invoice-inference:latest
# Create the ECS cluster
aws ecs create-cluster --cluster-name invoice-cluster
# Register the task definition
aws ecs register-task-definition --cli-input-json file://task-definition.json
# Create the service
aws ecs create-service \
    --cluster invoice-cluster \
    --service-name invoice-service \
    --task-definition invoice-inference \
    --desired-count 1 \
    --launch-type FARGATE \
    --network-configuration '{
        "awsvpcConfiguration": {
            "subnets": ["subnet-xxx"],
            "securityGroups": ["sg-xxx"],
            "assignPublicIp": "ENABLED"
        }
    }'
```
### Stage 4: Training Service Setup
```python
# setup_sagemaker.py
import json

from sagemaker.pytorch import PyTorch

# ARN of the SageMaker execution role (the role must already exist in IAM)
role_arn = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
# Training job configuration
training_config = {
    "entry_point": "train.py",
    "source_dir": "./src/training",
    "role": role_arn,
    "instance_count": 1,
    "instance_type": "ml.g4dn.xlarge",
    "framework_version": "2.0",
    "py_version": "py310",
    "use_spot_instances": True,
    "max_run": 7200,
    "max_wait": 14400,
    "checkpoint_s3_uri": "s3://invoice-training-data/checkpoints/",
}
estimator = PyTorch(**training_config)
# Save the configuration for later use. The estimator has no save() method,
# so the keyword arguments are written out as JSON instead.
with open("training_config.json", "w") as f:
    json.dump(training_config, f, indent=2)
```
### Stage 5: Integrating the Training Trigger API
```python
# lambda_trigger_training.py
import json

from sagemaker.pytorch import PyTorch


def lambda_handler(event, context):
    """Start a SageMaker training job."""
    epochs = event.get('epochs', 100)
    estimator = PyTorch(
        entry_point="train.py",
        # An S3 source_dir must point to a tar.gz archive; the file name here is a placeholder
        source_dir="s3://invoice-training-data/code/sourcedir.tar.gz",
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        instance_count=1,
        instance_type="ml.g4dn.xlarge",
        framework_version="2.0",
        py_version="py310",
        use_spot_instances=True,
        max_run=7200,
        max_wait=14400,
        hyperparameters={
            "epochs": epochs,
            "batch-size": 16,
        },
    )
    estimator.fit(
        inputs={
            "training": "s3://invoice-training-data/datasets/train/",
            "validation": "s3://invoice-training-data/datasets/val/",
        },
        wait=False,  # run asynchronously
    )
    return {
        'statusCode': 200,
        # API Gateway proxy integrations expect the body to be a string
        'body': json.dumps({
            'training_job_name': estimator.latest_training_job.name,
            'status': 'Started'
        })
    }
```
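
Because this endpoint starts billable GPU jobs, it is worth validating the request before it reaches SageMaker. The sketch below is one possible guard for the `epochs` parameter; the function name and the `max_epochs` cap of 300 are hypothetical and should be set to whatever limit fits the project.

```python
def parse_training_request(event: dict, default_epochs: int = 100, max_epochs: int = 300) -> dict:
    """Validate the incoming event and return the hyperparameters for the job."""
    epochs = event.get("epochs", default_epochs)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(epochs, bool) or not isinstance(epochs, int):
        raise ValueError("epochs must be an integer")
    if not 1 <= epochs <= max_epochs:
        raise ValueError(f"epochs must be between 1 and {max_epochs}")
    return {"epochs": epochs, "batch-size": 16}


print(parse_training_request({}))              # falls back to the default
print(parse_training_request({"epochs": 50}))  # explicit value
```

In the handler, the returned dict could be passed straight to the estimator's `hyperparameters` argument, and a `ValueError` could be mapped to a 400 response.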
---
## AWS vs Azure
### Service Mapping
| Function | AWS | Azure |
|------|-----|-------|
| Object storage | S3 | Blob Storage |
| Mount tool | Mountpoint for S3 | BlobFuse2 |
| ML platform | SageMaker | Azure ML |
| Container service | ECS/Fargate | Container Apps |
| Serverless | Lambda | Functions |
| GPU VM | EC2 P3/G4dn | NC/ND series |
| Container registry | ECR | ACR |
| Database | RDS PostgreSQL | PostgreSQL Flexible |
### Price Comparison
| Component | AWS | Azure |
|------|-----|-------|
| Storage (5GB) | ~$0.12/month | ~$0.09/month |
| Database | ~$15/month | ~$25/month |
| Inference (serverless) | ~$15/month | ~$30/month |
| Inference (container) | ~$30/month | ~$30/month |
| Training (Spot GPU) | ~$2-5 per run | ~$1-5 per run |
| **Total** | **~$35-50/month** | **~$65/month** |
### Strengths and Weaknesses
| Aspect | AWS advantage | Azure advantage |
|------|---------|-----------|
| Price | Lambda is cheaper | GPU Spot is cheaper |
| ML platform | SageMaker is more mature | Azure ML is easier to use |
| Serverless GPU | No native support | Container Apps GPU |
| Documentation | More extensive | Better Chinese-language documentation |
| Ecosystem | Larger | Office 365 integration |
---
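## Summary
### Recommended Configuration
| Component | Recommended option | Estimated monthly cost |
|------|---------|---------|
| Data storage | S3 Standard | ~$0.12 |
| Database | RDS db.t3.micro | ~$15 |
| Inference service | Lambda 4GB | ~$15 |
| Training service | SageMaker Spot | On demand, ~$2-5 per run |
| ECR | Basic usage | ~$1 |
| **Total** | | **~$35/month** + training |
### Key Decisions
| Scenario | Choice |
|------|------|
| Lowest cost | Lambda + SageMaker Spot |
| Stable inference | ECS Fargate |
| GPU inference | ECS + EC2 GPU |
| MLOps integration | The full SageMaker stack |
### Caveats
1. **Lambda cold start**: The first invocation takes ~3-5 seconds; Provisioned Concurrency can remove this delay
2. **Spot interruptions**: Configure checkpoints so that SageMaker can resume the job automatically
3. **S3 transfer**: Free within the same region, billed across regions
4. **Fargate has no GPU**: Workloads that need a GPU must run on ECS + EC2
5. **SageMaker markup**: ~25% more expensive than EC2, but it saves management effort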