Yaojia Wang a516de4320 WIP

2026-02-01 00:08:40 +01:00

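AWS Deployment Options: A Complete Guide

Key Questions

Question	Answer
Can S3 be used for training?	Yes, via Mountpoint for S3 or SageMaker's native S3 support
Can training read from S3 in real time?	Yes, SageMaker Pipe Mode streams data from S3
Can S3 be mounted locally?	Yes, with s3fs-fuse or Rclone
Is an idle EC2 instance billed?	Yes, a running instance is billed even when it is idle
How can I pay only for what I use?	Use SageMaker Managed Spot for training or Lambda for inference
What should serve inference?	Lambda (serverless) or ECS/Fargate (containers)

Storage Options

Amazon S3 (recommended)

S3 is AWS's core object storage service and integrates closely with SageMaker.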

# Create the S3 bucket
aws s3 mb s3://invoice-training-data --region us-east-1

# Upload the training data
aws s3 sync ./data/dataset/temp s3://invoice-training-data/images/

# Create folder-style prefixes (optional; S3 has no real directories)
aws s3api put-object --bucket invoice-training-data --key datasets/
aws s3api put-object --bucket invoice-training-data --key models/

Mountpoint for Amazon S3

Mountpoint is AWS's official S3 mount client and generally outperforms s3fs:

# Install Mountpoint
wget https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.deb
sudo dpkg -i mount-s3.deb

# Mount the bucket
mkdir -p /mnt/s3-data
mount-s3 invoice-training-data /mnt/s3-data --region us-east-1

# Alternatively, mount with a local cache (recommended)
mount-s3 invoice-training-data /mnt/s3-data \
  --region us-east-1 \
  --cache /tmp/s3-cache \
  --metadata-ttl 60
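
Once the bucket is mounted, training code can read it like an ordinary directory. The following is a minimal sketch under that assumption; the helper name `list_images` and the set of image extensions are illustrative and not part of the original project.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_images(root):
    """Return all image files under root, sorted by path.

    The function only lists files and never writes, so it works on any
    directory, including a read-only Mountpoint for S3 mount.
    """
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

With the mount from above, `list_images("/mnt/s3-data/images")` would return the uploaded training images. Listing a large prefix through a mount issues S3 list requests, so caching the result is worth considering for big datasets.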

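Mounting for Local Development

Linux/Mac (s3fs-fuse):

# Install (Debian/Ubuntu)
sudo apt-get install s3fs

# Store the credentials
echo ACCESS_KEY_ID:SECRET_ACCESS_KEY > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs

# Mount (use ${HOME} here: the shell does not expand ~ inside the -o option)
sudo mkdir -p /mnt/s3
s3fs invoice-training-data /mnt/s3 -o passwd_file=${HOME}/.passwd-s3fs

Windows (Rclone):

# Install (rclone mount on Windows also requires WinFsp)
winget install Rclone.Rclone

# Configure a remote
rclone config  # choose s3

# Mount
rclone mount aws:invoice-training-data Z: --vfs-cache-mode full

Storage Costs

Storage class	Price	Best for
S3 Standard	$0.023/GB/month	Frequent access
S3 Intelligent-Tiering	$0.023/GB/month	Automatic tiering
S3 Infrequent Access	$0.0125/GB/month	Occasional access
S3 Glacier	$0.004/GB/month	Long-term archival

This project: ~10,000 images × 500 KB ≈ 5 GB → about **$0.12/month**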
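
The storage estimate above is simple arithmetic. A small sketch, assuming decimal units (1 GB = 1,000,000 KB) and the S3 Standard price from the table; the function name is illustrative:

```python
# Price per GB-month for S3 Standard, taken from the table above.
S3_STANDARD_USD_PER_GB_MONTH = 0.023


def monthly_storage_cost(num_files, avg_file_kb,
                         usd_per_gb_month=S3_STANDARD_USD_PER_GB_MONTH):
    """Estimate the monthly storage cost in USD, using decimal units."""
    total_gb = num_files * avg_file_kb / 1_000_000
    return total_gb * usd_per_gb_month


cost = monthly_storage_cost(num_files=10_000, avg_file_kb=500)
print(f"~${cost:.3f}/month")
```

For 10,000 images of 500 KB each this gives roughly $0.115, which the estimate above rounds to $0.12. Request and data transfer charges are not included.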

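SageMaker Data Input Modes

Mode	Description	Best for
File Mode	Downloads the full dataset to local disk before training starts	Small datasets
Pipe Mode	Streams data from S3 without using local disk space	Large datasets
FastFile Mode	Fetches objects on demand, up to 3x speedup	Recommended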
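
The input mode is chosen per data channel. The sketch below builds channel definitions in the shape used by the `InputDataConfig` parameter of the SageMaker CreateTrainingJob API. The helper name `make_channel` is hypothetical, and the S3 paths simply reuse the bucket from this guide.

```python
VALID_INPUT_MODES = {"File", "Pipe", "FastFile"}


def make_channel(name, s3_uri, input_mode="FastFile"):
    """Build one InputDataConfig entry for a SageMaker training job."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"input_mode must be one of {sorted(VALID_INPUT_MODES)}")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


channels = [
    make_channel("training", "s3://invoice-training-data/datasets/train/"),
    make_channel("validation", "s3://invoice-training-data/datasets/val/",
                 input_mode="File"),
]
```

With the SageMaker Python SDK, the same choice is made through `sagemaker.inputs.TrainingInput(s3_data=..., input_mode="FastFile")` passed to `estimator.fit`.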

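Training Options

Overview

Option	Best for	Idle cost	Complexity	Spot support
EC2 GPU	Simple, direct setups	Billed while running, even when idle	Low	Yes
SageMaker Training	MLOps integration	Billed per job	Medium	Yes
EKS + GPU	Kubernetes	Complex billing	High	Yes

EC2 vs SageMaker

Feature	EC2	SageMaker
What it is	Virtual machine	Managed ML platform
Compute cost	$3.06/hr (p3.2xlarge)	$3.825/hr (ml.p3.2xlarge, +25%)
Management overhead	Self-managed	Fully managed
Spot discount	Up to 90%	Up to 90%
Experiment tracking	None	Built in
Automatic shutdown	None	Stops automatically when the job finishes

GPU Instance Prices (after the June 2025 price cut)

Instance	GPU	GPU memory	On-Demand	Spot price
g4dn.xlarge	1x T4	16GB	$0.526/hr	~$0.16/hr
g4dn.2xlarge	1x T4	16GB	$0.752/hr	~$0.23/hr
p3.2xlarge	1x V100	16GB	$3.06/hr	~$0.92/hr
p3.8xlarge	4x V100	64GB	$12.24/hr	~$3.67/hr
p4d.24xlarge	8x A100	320GB	$32.77/hr	~$9.83/hr

Note: In June 2025 AWS announced price cuts of up to 45% for the P4 and P5 instance families.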
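
To see what the Spot column implies, the discount can be computed from the two price columns. A short sketch using only the figures listed in the table; Spot prices change over time and by region, so these are estimates rather than quotes:

```python
# (on_demand_usd_per_hr, spot_usd_per_hr), copied from the table above.
PRICES = {
    "g4dn.xlarge": (0.526, 0.16),
    "g4dn.2xlarge": (0.752, 0.23),
    "p3.2xlarge": (3.06, 0.92),
    "p3.8xlarge": (12.24, 3.67),
    "p4d.24xlarge": (32.77, 9.83),
}


def spot_discount_pct(on_demand, spot):
    """Return how far the Spot price is below On-Demand, in percent."""
    return (1 - spot / on_demand) * 100


discounts = {name: spot_discount_pct(*pair) for name, pair in PRICES.items()}
for name, pct in discounts.items():
    print(f"{name}: about {pct:.0f}% below On-Demand")
```

Every row works out to roughly 70% off, so the Spot column reflects a typical discount rather than the "up to 90%" maximum.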

Spot Instances

# EC2 Spot request
aws ec2 request-spot-instances \
  --instance-count 1 \
  --type "one-time" \
  --launch-specification '{
    "ImageId": "ami-0123456789abcdef0",
    "InstanceType": "p3.2xlarge",
    "KeyName": "my-key"
  }'

SageMaker Managed Spot Training

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="./src",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.0",
    py_version="py310",

    # Enable Spot instances
    use_spot_instances=True,
    max_run=3600,           # training may run for at most 1 hour
    max_wait=7200,          # at most 2 hours in total, including waiting for Spot capacity (must be >= max_run)

    # Checkpoints (used to resume after a Spot interruption)
    checkpoint_s3_uri="s3://invoice-training-data/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints",

    hyperparameters={
        "epochs": 100,
        "batch-size": 16,
    }
)

estimator.fit({
    "training": "s3://invoice-training-data/datasets/train/",
    "validation": "s3://invoice-training-data/datasets/val/"
})
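
Managed Spot only saves work if `train.py` can pick up where it left off. SageMaker syncs the local checkpoint directory with `checkpoint_s3_uri`, so a restarted job finds earlier checkpoints on disk, but the script has to load them itself. A minimal sketch of that lookup, assuming checkpoints are saved as `epoch_<n>.pt` (the naming scheme and helper name are illustrative, not taken from the project):

```python
import re
from pathlib import Path

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # matches checkpoint_local_path above
CHECKPOINT_PATTERN = re.compile(r"^epoch_(\d+)\.pt$")  # assumed naming scheme


def find_latest_checkpoint(checkpoint_dir=CHECKPOINT_DIR):
    """Return (path, epoch) of the highest-numbered checkpoint, or (None, 0)."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None, 0
    best_path, best_epoch = None, -1
    for path in directory.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_path, best_epoch = path, int(match.group(1))
    if best_path is None:
        return None, 0
    return best_path, best_epoch


latest_path, last_epoch = find_latest_checkpoint()
if latest_path is None:
    print("No checkpoint found, starting from scratch")
else:
    print(f"Resuming from {latest_path} (epoch {last_epoch})")
```

The training loop would then load the weights from `latest_path` and continue from the following epoch.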

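Inference Options

Comparison

Option	GPU support	Scaling	Cold start	Cost	Best for
Lambda	No	Automatic, 0 to N	Fast	Per invocation	Low traffic, CPU inference
Lambda + Container	No	Automatic, 0 to N	Slower	Per invocation	Complex dependencies
ECS Fargate	No	Automatic	Medium	~$30/month	Containerized services
ECS + EC2 GPU	Yes	Manual or automatic	Slow	~$100+/month	GPU inference
SageMaker Endpoint	Yes	Automatic	Slow	~$80+/month	MLOps integration
SageMaker Serverless	No	Automatic, 0 to N	Medium	Per invocation	Intermittent traffic

Recommended Option 1: AWS Lambda (low traffic)

For CPU-only YOLO inference at low traffic, Lambda is the most economical choice: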

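# lambda_function.py
import json
import os

import boto3
from ultralytics import YOLO

MODEL_BUCKET = os.environ.get("MODEL_BUCKET", "invoice-models")
MODEL_KEY = os.environ.get("MODEL_KEY", "best.pt")

s3 = boto3.client("s3")
model = None  # cached across warm invocations


def load_model():
    """Download the weights from S3 to /tmp once per container, then reuse them."""
    global model
    if model is None:
        s3.download_file(MODEL_BUCKET, MODEL_KEY, "/tmp/best.pt")
        model = YOLO("/tmp/best.pt")
    return model


def extract_fields(results):
    """Illustrative helper (an assumption): one entry per detected box."""
    boxes = results[0].boxes
    names = results[0].names
    return [
        {"label": names[int(cls_id)], "box": [float(v) for v in xyxy]}
        for cls_id, xyxy in zip(boxes.cls.tolist(), boxes.xyxy.tolist())
    ]


def get_confidence(results):
    """Illustrative helper (an assumption): confidence of each detected box."""
    return [float(c) for c in results[0].boxes.conf.tolist()]


def lambda_handler(event, context):
    model = load_model()

    # Behind API Gateway the payload is a JSON string in event["body"];
    # a direct invocation passes bucket and key at the top level.
    payload = json.loads(event["body"]) if isinstance(event.get("body"), str) else event
    bucket = payload["bucket"]
    key = payload["key"]

    # Fetch the image from S3
    local_path = f"/tmp/{os.path.basename(key)}"
    s3.download_file(bucket, key, local_path)

    # Run inference
    results = model.predict(local_path, conf=0.5)

    return {
        "statusCode": 200,
        "body": json.dumps({
            "fields": extract_fields(results),
            "confidence": get_confidence(results),
        }),
    }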

Lambda configuration:

# serverless.yml
service: invoice-inference

provider:
  name: aws
  runtime: python3.11
  timeout: 30
  memorySize: 4096  # 4 GB of memory

functions:
  infer:
    handler: lambda_function.lambda_handler
    events:
      - http:
          path: /infer
          method: post
    layers:
      - arn:aws:lambda:us-east-1:123456789012:layer:yolo-deps:1

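Recommended Option 2: ECS Fargate (medium traffic)

task-definition.json (the execution role ARN is a placeholder; Fargate needs such a role to pull the image from ECR and to write to CloudWatch Logs):

{
  "family": "invoice-inference",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "cpu": "2048",
  "memory": "4096",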
  "containerDefinitions": [
    {
      "name": "inference",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/invoice-inference:latest",
      "portMappings": [
        {
          "containerPort": 8000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {"name": "MODEL_PATH", "value": "/app/models/best.pt"}
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/invoice-inference",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
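
Fargate only accepts certain combinations of task-level `cpu` and `memory`. A small sketch that checks a pair before registering the task definition; it covers task sizes up to 4 vCPU only, and the function name is illustrative:

```python
# Allowed memory values (MiB) for each Fargate CPU setting (CPU units).
# Larger task sizes exist but are left out of this sketch.
FARGATE_MEMORY_BY_CPU = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu, memory):
    """Check a cpu/memory pair as written (as strings) in a task definition."""
    allowed = FARGATE_MEMORY_BY_CPU.get(int(cpu))
    return allowed is not None and int(memory) in allowed


print(is_valid_fargate_size("2048", "4096"))  # the pair used in the task definition
```

The pair used above, 2048 CPU units with 4096 MiB, is the smallest memory size allowed for 2 vCPU.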

Auto Scaling configuration:

# Register the service as a scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/invoice-cluster/invoice-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 1 \
  --max-capacity 10

# Scale in and out based on average CPU utilization
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/invoice-cluster/invoice-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 120
  }'

Option 3: SageMaker Serverless Inference

from sagemaker.serverless import ServerlessInferenceConfig
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://invoice-models/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    entry_point="inference.py",
    framework_version="2.0",
    py_version="py310"
)

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10
)

predictor = model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="invoice-inference-serverless"
)

Inference Performance Comparison

Configuration	Latency per inference	Concurrency	Estimated monthly cost
Lambda 4GB	~500-800 ms	Scales on demand	~$15 (10K requests)
Fargate 2vCPU 4GB	~300-500 ms	~50 QPS	~$30
Fargate 4vCPU 8GB	~200-300 ms	~100 QPS	~$60
EC2 g4dn.xlarge (T4)	~50-100 ms	~200 QPS	~$380

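Cost Comparison

Training cost comparison (assuming 2 hours of training per day)

Option	Calculation	Monthly cost
EC2 running 24/7	24 h × 30 days × $3.06	~$2,200
EC2 started and stopped on demand	2 h × 30 days × $3.06	~$184
EC2 Spot on demand	2 h × 30 days × $0.92	~$55
SageMaker On-Demand	2 h × 30 days × $3.825	~$230
SageMaker Spot	2 h × 30 days × $1.15	~$69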
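
The table rows follow one formula: hourly rate × hours per day × days per month. A short sketch that reproduces them from the hourly rates listed above, assuming a 30-day month; the function and dictionary names are illustrative:

```python
# Hourly rates in USD, taken from the table above.
HOURLY_USD = {
    "EC2 On-Demand": 3.06,
    "EC2 Spot": 0.92,
    "SageMaker On-Demand": 3.825,
    "SageMaker Spot": 1.15,
}


def monthly_cost(hourly_usd, hours_per_day, days=30):
    """Return the monthly compute cost in USD for a fixed daily usage."""
    return hourly_usd * hours_per_day * days


print(f"EC2 running 24/7: ${monthly_cost(HOURLY_USD['EC2 On-Demand'], 24):,.0f}")
for name, rate in HOURLY_USD.items():
    print(f"{name}, 2 h/day: ${monthly_cost(rate, 2):,.0f}")
```

Changing `hours_per_day` shows how quickly an always-on instance overtakes per-job billing: at 2 hours per day the On-Demand instance costs about one twelfth of running it around the clock.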

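Full cost estimate for this project

Component	Recommended option	Monthly cost
Data storage	S3 Standard (5 GB)	~$0.12
Database	RDS PostgreSQL (db.t3.micro)	~$15
Inference service	Lambda (10K requests/month)	~$15
Inference service (alternative)	ECS Fargate	~$30
Training service	SageMaker Spot (on demand)	~$2-5 per job
ECR (image storage)	Basic usage	~$1
Total (Lambda)		~$35/month + training
Total (Fargate)		~$50/month + training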

Recommended Architecture

Overall architecture diagram

                            ┌─────────────────────────────────────┐
                            │           Amazon S3                 │
                            │  ├── training-images/               │
                            │  ├── datasets/                      │
                            │  ├── models/                        │
                            │  └── checkpoints/                   │
                            └─────────────────┬───────────────────┘
                                              │
            ┌─────────────────────────────────┼─────────────────────────────────┐
            │                                 │                                 │
            ▼                                 ▼                                 ▼
┌───────────────────────┐       ┌───────────────────────┐       ┌───────────────────────┐
│   Inference service   │       │   Training service    │       │   API Gateway         │
│                       │       │                       │       │                       │
│  Option A: Lambda     │       │   SageMaker           │       │   REST API            │
│  ~$15/mo (10K req)    │       │   Managed Spot        │       │   Invokes Lambda/ECS  │
│                       │       │   ~$2-5 per job       │       │                       │
│  Option B: ECS Fargate│       │                       │       │                       │
│  ~$30/mo              │       │   - Auto start        │       │                       │
│                       │       │   - Stops when done   │       │                       │
│ ┌───────────────────┐ │       │   - Auto checkpoints  │       │                       │
│ │ FastAPI + YOLO    │ │       │                       │       │                       │
│ │ CPU inference     │ │       │                       │       │                       │
│ └───────────────────┘ │       └───────────┬───────────┘       └───────────────────────┘
└───────────┬───────────┘                   │
            │                               │
            └───────────────────────────────┤
                                            │
                                            ▼
                              ┌───────────────────────┐
                              │   Amazon RDS          │
                              │   PostgreSQL          │
                              │   db.t3.micro         │
                              │   ~$15/mo             │
                              └───────────────────────┘

Lambda inference configuration

# SAM template
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  InferenceFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.lambda_handler
      Runtime: python3.11
      MemorySize: 4096
      Timeout: 30
      Environment:
        Variables:
          MODEL_BUCKET: invoice-models
          MODEL_KEY: best.pt
      Policies:
        - S3ReadPolicy:
            BucketName: invoice-models
        - S3ReadPolicy:
            BucketName: invoice-uploads
      Events:
        InferApi:
          Type: Api
          Properties:
            Path: /infer
            Method: post

SageMaker training configuration

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="./src",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",  # T4 GPU
    framework_version="2.0",
    py_version="py310",

    # Spot instance settings
    use_spot_instances=True,
    max_run=7200,
    max_wait=14400,

    # Checkpoints
    checkpoint_s3_uri="s3://invoice-training-data/checkpoints/",

    hyperparameters={
        "epochs": 100,
        "batch-size": 16,
        "model": "yolo11n.pt"
    }
)

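Implementation Steps

Phase 1: Storage setup

# Create the S3 buckets
aws s3 mb s3://invoice-training-data --region us-east-1
aws s3 mb s3://invoice-models --region us-east-1

# Upload the training data
aws s3 sync ./data/dataset/temp s3://invoice-training-data/images/

# Configure a lifecycle rule (optional; moves objects to colder storage automatically)
# Each rule needs a Filter; an empty prefix applies the rule to the whole bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket invoice-training-data \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "MoveToIA",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{
        "Days": 30,
        "StorageClass": "STANDARD_IA"
      }]
    }]
  }'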

Phase 2: Database setup

# Create the RDS PostgreSQL instance
aws rds create-db-instance \
  --db-instance-identifier invoice-db \
  --db-instance-class db.t3.micro \
  --engine postgres \
  --engine-version 15 \
  --master-username docmaster \
  --master-user-password YOUR_PASSWORD \
  --allocated-storage 20

# Allow the application's security group (sg-yyy) to reach PostgreSQL
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --protocol tcp \
  --port 5432 \
  --source-group sg-yyy

Phase 3: Deploy the inference service

Option A: Lambda

# Build a Lambda layer with the dependencies
# Note: function code plus layers must fit within 250 MB unzipped. ultralytics pulls in
# PyTorch, which usually exceeds that limit; in that case package the function as a
# container image instead (the "Lambda + Container" option in the comparison table).
cd lambda-layer
pip install ultralytics opencv-python-headless -t python/
zip -r layer.zip python/
aws lambda publish-layer-version \
  --layer-name yolo-deps \
  --zip-file fileb://layer.zip \
  --compatible-runtimes python3.11

# Deploy the Lambda function
cd ../lambda
zip function.zip lambda_function.py
aws lambda create-function \
  --function-name invoice-inference \
  --runtime python3.11 \
  --handler lambda_function.lambda_handler \
  --role arn:aws:iam::123456789012:role/LambdaRole \
  --zip-file fileb://function.zip \
  --memory-size 4096 \
  --timeout 30 \
  --layers arn:aws:lambda:us-east-1:123456789012:layer:yolo-deps:1

# Create the API Gateway HTTP API
aws apigatewayv2 create-api \
  --name invoice-api \
  --protocol-type HTTP \
  --target arn:aws:lambda:us-east-1:123456789012:function:invoice-inference

Option B: ECS Fargate

# Create the ECR repository
aws ecr create-repository --repository-name invoice-inference

# Build and push the image
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker build -t invoice-inference .
docker tag invoice-inference:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/invoice-inference:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/invoice-inference:latest

# Create the ECS cluster
aws ecs create-cluster --cluster-name invoice-cluster

# Register the task definition
aws ecs register-task-definition --cli-input-json file://task-definition.json

# Create the service
aws ecs create-service \
  --cluster invoice-cluster \
  --service-name invoice-service \
  --task-definition invoice-inference \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-xxx"],
      "securityGroups": ["sg-xxx"],
      "assignPublicIp": "ENABLED"
    }
  }'

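Phase 4: Training service setup

# setup_sagemaker.py
import json

from sagemaker.pytorch import PyTorch

# ARN of an existing SageMaker execution role (placeholder account ID)
role_arn = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Training job settings, kept in a plain dict so they can be saved and reused
training_config = {
    "entry_point": "train.py",
    "source_dir": "./src/training",
    "instance_count": 1,
    "instance_type": "ml.g4dn.xlarge",
    "framework_version": "2.0",
    "py_version": "py310",
    "use_spot_instances": True,
    "max_run": 7200,
    "max_wait": 14400,
    "checkpoint_s3_uri": "s3://invoice-training-data/checkpoints/",
}

# Configure the training job
estimator = PyTorch(role=role_arn, **training_config)

# Save the settings for later use (the SDK estimator has no save() method)
with open("training_config.json", "w") as f:
    json.dump(training_config, f, indent=2)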

Phase 5: Integrate the training trigger API

# lambda_trigger_training.py
# The sagemaker SDK is not part of the Lambda runtime and must be packaged with the function.
from sagemaker.pytorch import PyTorch

def lambda_handler(event, context):
    """Start a SageMaker training job."""

    epochs = event.get('epochs', 100)

    estimator = PyTorch(
        entry_point="train.py",
        # An S3 source_dir must point to a .tar.gz archive; this file name is a placeholder
        source_dir="s3://invoice-training-data/code/sourcedir.tar.gz",
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        instance_count=1,
        instance_type="ml.g4dn.xlarge",
        framework_version="2.0",
        py_version="py310",
        use_spot_instances=True,
        max_run=7200,
        max_wait=14400,
        # Checkpoints let an interrupted Spot job resume
        checkpoint_s3_uri="s3://invoice-training-data/checkpoints/",
        hyperparameters={
            "epochs": epochs,
            "batch-size": 16,
        }
    )

    estimator.fit(
        inputs={
            "training": "s3://invoice-training-data/datasets/train/",
            "validation": "s3://invoice-training-data/datasets/val/"
        },
        wait=False  # return immediately; the job runs asynchronously
    )

    return {
        'statusCode': 200,
        'body': {
            'training_job_name': estimator.latest_training_job.name,
            'status': 'Started'
        }
    }

AWS vs Azure

Service Mapping

Function	AWS	Azure
Object storage	S3	Blob Storage
Mount tool	Mountpoint for S3	BlobFuse2
ML platform	SageMaker	Azure ML
Container service	ECS/Fargate	Container Apps
Serverless	Lambda	Functions
GPU VMs	EC2 P3/G4dn	NC/ND series
Container registry	ECR	ACR
Database	RDS PostgreSQL	PostgreSQL Flexible Server

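Cost Comparison

Component	AWS	Azure
Storage (5 GB)	~$0.12/month	~$0.09/month
Database	~$15/month	~$25/month
Inference (serverless)	~$15/month	~$30/month
Inference (containers)	~$30/month	~$30/month
Training (Spot GPU)	~$2-5 per job	~$1-5 per job
Total	~$35-50/month	~$65/month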

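Strengths and Weaknesses

Aspect	AWS advantage	Azure advantage
Price	Lambda is cheaper	GPU Spot is cheaper
ML platform	SageMaker is more mature	Azure ML is easier to use
Serverless GPU	No native support	Container Apps GPU
Documentation	More extensive	Better Chinese-language documentation
Ecosystem	Larger	Office 365 integration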

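Summary

Recommended Configuration

Component	Recommended option	Estimated monthly cost
Data storage	S3 Standard	~$0.12
Database	RDS db.t3.micro	~$15
Inference service	Lambda 4GB	~$15
Training service	SageMaker Spot	On demand, ~$2-5 per job
ECR	Basic usage	~$1
Total		~$35/month + training

Key Decisions

Scenario	Choice
Lowest cost	Lambda + SageMaker Spot
Stable inference	ECS Fargate
GPU inference	ECS + EC2 GPU
MLOps integration	The full SageMaker suite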

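Caveats

Lambda cold starts: the first invocation takes ~3-5 seconds; Provisioned Concurrency avoids this
Spot interruptions: configure checkpoints; SageMaker restarts the job automatically, and train.py must reload the latest checkpoint to resume
S3 data transfer: free within the same region, billed across regions
No GPU on Fargate: GPU workloads must run on ECS with EC2 instances
SageMaker premium: about 25% more expensive than EC2, offset by lower management overhead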
