invoice-master-poc-v2/docs/aws-deployment-guide.md

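# AWS Deployment Guide

## Table of Contents
- [Key Questions](#key-questions)
- [Storage](#storage)
- [Training](#training)
- [Inference](#inference)
- [Cost Comparison](#cost-comparison)
- [Recommended Architecture](#recommended-architecture)
- [Implementation Steps](#implementation-steps)
- [AWS vs Azure](#aws-vs-azure)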

---

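## Key Questions

| Question | Answer |
|------|------|
| Can S3 be used for training? | Yes, via Mountpoint for Amazon S3 or SageMaker's native S3 support |
| Can training read from S3 in real time? | Yes, SageMaker Pipe Mode streams data from S3 |
| Can S3 be mounted locally? | Yes, with s3fs-fuse or Rclone |
| Is an idle EC2 instance billed? | Yes, a running instance is billed whether or not it is doing work |
| How do I pay only for what I use? | Use SageMaker Managed Spot Training or Lambda |
| What should serve inference? | Lambda (serverless) or ECS/Fargate (containers) |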

---

## Storage

### Amazon S3 (recommended)

S3 is AWS's core object storage service and integrates closely with SageMaker.

```bash
# Create the S3 bucket
aws s3 mb s3://invoice-training-data --region us-east-1

# Upload the training data
aws s3 sync ./data/dataset/temp s3://invoice-training-data/images/

# Create folder placeholders (S3 has no real directories; these are zero-byte keys)
aws s3api put-object --bucket invoice-training-data --key datasets/
aws s3api put-object --bucket invoice-training-data --key models/
```

### Mountpoint for Amazon S3

Mountpoint is AWS's official S3 mount client and performs better than s3fs:

```bash
# Install Mountpoint
wget https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.deb
sudo dpkg -i mount-s3.deb

# Mount the bucket
mkdir -p /mnt/s3-data
mount-s3 invoice-training-data /mnt/s3-data --region us-east-1

# Alternatively, mount with a local cache (recommended)
mount-s3 invoice-training-data /mnt/s3-data \
  --region us-east-1 \
  --cache /tmp/s3-cache \
  --metadata-ttl 60
```

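### Mounting for Local Development

**Linux/Mac (s3fs-fuse):**
```bash
# Install (Debian/Ubuntu shown)
sudo apt-get install s3fs

# Store the credentials
echo ACCESS_KEY_ID:SECRET_ACCESS_KEY > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs

# Mount (use ${HOME}: the shell does not expand "~" inside the -o option)
s3fs invoice-training-data /mnt/s3 -o passwd_file=${HOME}/.passwd-s3fs
```

**Windows (Rclone):**
```powershell
# Install (rclone mount on Windows also requires WinFsp)
winget install Rclone.Rclone

# Configure: choose the s3 backend and name the remote "aws"
rclone config

# Mount
rclone mount aws:invoice-training-data Z: --vfs-cache-mode full
```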

### Storage Costs

| Tier | Price | Use case |
|------|------|---------|
| S3 Standard | $0.023/GB/month | Frequent access |
| S3 Intelligent-Tiering | $0.023/GB/month | Automatic tiering |
| S3 Infrequent Access | $0.0125/GB/month | Occasional access |
| S3 Glacier | $0.004/GB/month | Long-term archive |

**This project**: ~10,000 images × 500 KB ≈ 5 GB → **~$0.12/month**
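
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the image count and average size are the rough figures from the text, the price is the S3 Standard rate from the table, and the function name is made up for illustration. Request and transfer charges are ignored.

```python
# Rough figures taken from the text above; adjust to your own dataset.
IMAGE_COUNT = 10_000
AVG_IMAGE_KB = 500
PRICE_PER_GB_MONTH = 0.023  # S3 Standard, USD


def monthly_storage_cost(image_count, avg_kb, price_per_gb):
    """Return (size in GB, monthly cost in USD), using 1 GB = 1,000,000 KB."""
    size_gb = image_count * avg_kb / 1_000_000
    return size_gb, size_gb * price_per_gb


size_gb, cost = monthly_storage_cost(IMAGE_COUNT, AVG_IMAGE_KB, PRICE_PER_GB_MONTH)
print(f"{size_gb:.1f} GB -> ${cost:.3f}/month")
```

At this scale storage is a rounding error next to compute, so the choice of storage tier matters little for this project.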

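### SageMaker Data Input Modes

| Mode | Description | Use case |
|------|------|---------|
| File Mode | Downloads the data to local disk before training starts | Small datasets |
| Pipe Mode | Streams the data without using local disk space | Large datasets |
| FastFile Mode | Fetches files on demand, up to 3x faster | Recommended |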
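
As a sketch of where the input mode is actually set: in the SageMaker `CreateTrainingJob` API, each entry of `InputDataConfig` can carry its own `InputMode`. The helper below only builds that dictionary; the function name `build_channel` is hypothetical, and the S3 paths are the ones used elsewhere in this guide.

```python
VALID_INPUT_MODES = {"File", "Pipe", "FastFile"}


def build_channel(name, s3_uri, input_mode="FastFile"):
    """Build one InputDataConfig entry for a SageMaker training job."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"Unknown input mode: {input_mode}")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


channels = [
    build_channel("training", "s3://invoice-training-data/datasets/train/"),
    build_channel("validation", "s3://invoice-training-data/datasets/val/", "File"),
]
print([c["InputMode"] for c in channels])
```

The resulting list would be passed as `InputDataConfig` when creating the job with boto3; the SageMaker Python SDK exposes the same setting through the `input_mode` argument.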

---

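## Training

### Overview

| Option | Best for | Idle cost | Complexity | Spot support |
|------|---------|---------|--------|----------|
| EC2 GPU | Simple, direct setups | Billed 24/7 while running | Low | Yes |
| SageMaker Training | MLOps integration | Billed per job | Medium | Yes |
| EKS + GPU | Kubernetes | Complex billing | High | Yes |

### EC2 vs SageMaker

| Feature | EC2 | SageMaker |
|------|-----|-----------|
| What it is | Virtual machine | Managed ML platform |
| Compute cost | $3.06/hr (p3.2xlarge) | $3.825/hr (+25%) |
| Management overhead | Self-managed setup | Fully managed |
| Spot discount | Up to 90% | Up to 90% |
| Experiment tracking | None | Built in |
| Auto shutdown | None | Stops automatically when the job finishes |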

### GPU Instance Prices (after the June 2025 price cut)

| Instance | GPU | GPU memory | On-Demand | Spot price |
|------|-----|------|-----------|----------|
| g4dn.xlarge | 1x T4 | 16GB | $0.526/hr | ~$0.16/hr |
| g4dn.2xlarge | 1x T4 | 16GB | $0.752/hr | ~$0.23/hr |
| p3.2xlarge | 1x V100 | 16GB | $3.06/hr | ~$0.92/hr |
| p3.8xlarge | 4x V100 | 64GB | $12.24/hr | ~$3.67/hr |
| p4d.24xlarge | 8x A100 | 320GB | $32.77/hr | ~$9.83/hr |

**Note**: In June 2025, AWS announced price cuts of up to 45% on the P4 and P5 instance families.
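
To make the Spot column easier to read, the sketch below derives the discount implied by each row of the table. The prices are copied from the table as-is; real Spot prices fluctuate by region and over time, so treat the output as illustrative only.

```python
# (On-Demand, Spot) hourly prices in USD, copied from the table above.
PRICES = {
    "g4dn.xlarge": (0.526, 0.16),
    "g4dn.2xlarge": (0.752, 0.23),
    "p3.2xlarge": (3.06, 0.92),
    "p3.8xlarge": (12.24, 3.67),
    "p4d.24xlarge": (32.77, 9.83),
}


def spot_discount(on_demand, spot):
    """Fraction saved by paying the Spot price instead of On-Demand."""
    return 1 - spot / on_demand


discounts = {name: spot_discount(*pair) for name, pair in PRICES.items()}
for name, discount in discounts.items():
    print(f"{name}: {discount:.0%} below On-Demand")
```

Every row works out to roughly a 70% discount, which suggests the Spot column is a flat estimate rather than a set of observed market prices.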

### Spot Instances

```bash
# EC2 Spot request
aws ec2 request-spot-instances \
  --instance-count 1 \
  --type "one-time" \
  --launch-specification '{
    "ImageId": "ami-0123456789abcdef0",
    "InstanceType": "p3.2xlarge",
    "KeyName": "my-key"
  }'
```

### SageMaker Managed Spot Training

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="./src",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.0",
    py_version="py310",

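    # Enable Spot instances
    use_spot_instances=True,
    max_run=3600,           # maximum training time: 1 hour
    max_wait=7200,          # maximum total time, including waiting for Spot capacity: 2 hours (must be >= max_run)

    # Checkpointing (to resume after a Spot interruption)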
    checkpoint_s3_uri="s3://invoice-training-data/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints",

    hyperparameters={
        "epochs": 100,
        "batch-size": 16,
    }
)

estimator.fit({
    "training": "s3://invoice-training-data/datasets/train/",
    "validation": "s3://invoice-training-data/datasets/val/"
})
```
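
Checkpointing only helps if `train.py` looks for an existing checkpoint when it starts, because after a Spot interruption SageMaker restores the contents of `checkpoint_s3_uri` into the local checkpoint directory and reruns the script. A minimal sketch of that lookup follows; the file naming scheme `epoch_<N>.pt` and the function name are assumptions for illustration, not something the training script in this project is known to use.

```python
import os
import re

CHECKPOINT_DIR = "/opt/ml/checkpoints"
# Hypothetical naming scheme: epoch_1.pt, epoch_2.pt, ...
CHECKPOINT_PATTERN = re.compile(r"^epoch_(\d+)\.pt$")


def find_latest_checkpoint(checkpoint_dir=CHECKPOINT_DIR):
    """Return (path, epoch) of the newest checkpoint, or (None, 0) if none exist."""
    if not os.path.isdir(checkpoint_dir):
        return None, 0
    best_epoch, best_name = 0, None
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_name = int(match.group(1)), name
    if best_name is None:
        return None, 0
    return os.path.join(checkpoint_dir, best_name), best_epoch


path, last_epoch = find_latest_checkpoint()
if path is None:
    print("No checkpoint found, starting from scratch")
else:
    print(f"Resuming from {path} (epoch {last_epoch} already finished)")
```

The training loop would then load the weights from `path` and continue from epoch `last_epoch + 1`, writing each new checkpoint into the same directory so SageMaker can sync it back to S3.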

---

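## Inference

### Options Compared

| Option | GPU support | Scaling | Cold start | Price | Best for |
|------|---------|--------|--------|------|---------|
| Lambda | No | Automatic, 0-N | Fast | Per invocation | Low traffic, CPU inference |
| Lambda + Container | No | Automatic, 0-N | Slower | Per invocation | Complex dependencies |
| ECS Fargate | No | Automatic | Medium | ~$30/month | Containerized services |
| ECS + EC2 GPU | Yes | Manual/automatic | Slow | ~$100+/month | GPU inference |
| SageMaker Endpoint | Yes | Automatic | Slow | ~$80+/month | MLOps integration |
| SageMaker Serverless | No | Automatic, 0-N | Medium | Per invocation | Intermittent traffic |

### Recommended Option 1: AWS Lambda (low traffic)

For CPU-based YOLO inference at low traffic, Lambda is the cheapest option: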

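```python
# lambda_function.py
import json
import os

import boto3
from ultralytics import YOLO

# Created once per execution environment and reused across warm invocations
s3 = boto3.client("s3")
model = None


def load_model():
    """Download the weights from S3 to /tmp on a cold start and cache the model."""
    global model
    if model is None:
        s3.download_file("invoice-models", "best.pt", "/tmp/best.pt")
        model = YOLO("/tmp/best.pt")
    return model


def extract_fields(results):
    """Illustrative helper: label and bounding box of every detection."""
    boxes, names = results[0].boxes, results[0].names
    return [
        {"label": names[int(cls)], "bbox": xyxy}
        for cls, xyxy in zip(boxes.cls.tolist(), boxes.xyxy.tolist())
    ]


def get_confidence(results):
    """Illustrative helper: confidence score of every detection."""
    return results[0].boxes.conf.tolist()


def lambda_handler(event, context):
    model = load_model()

    # API Gateway delivers the payload as a JSON string under "body";
    # direct invocations pass the fields at the top level
    body = event.get("body")
    payload = json.loads(body) if isinstance(body, str) else event
    bucket = payload["bucket"]
    key = payload["key"]

    # Download the image from S3
    local_path = os.path.join("/tmp", os.path.basename(key))
    s3.download_file(bucket, key, local_path)

    # Run inference
    results = model.predict(local_path, conf=0.5)

    return {
        "statusCode": 200,
        "body": json.dumps({
            "fields": extract_fields(results),
            "confidence": get_confidence(results),
        }),
    }
```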

**Lambda configuration:**
```yaml
# serverless.yml
service: invoice-inference

provider:
  name: aws
  runtime: python3.11
  timeout: 30
  memorySize: 4096  # 4 GB of memory

functions:
  infer:
    handler: lambda_function.lambda_handler
    events:
      - http:
          path: /infer
          method: post
    layers:
      - arn:aws:lambda:us-east-1:123456789012:layer:yolo-deps:1
```

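### Recommended Option 2: ECS Fargate (medium traffic)

**task-definition.json:**
```json
{
  "family": "invoice-inference",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "cpu": "2048",
  "memory": "4096",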
  "containerDefinitions": [
    {
      "name": "inference",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/invoice-inference:latest",
      "portMappings": [
        {
          "containerPort": 8000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {"name": "MODEL_PATH", "value": "/app/models/best.pt"}
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/invoice-inference",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
```

**Auto Scaling configuration:**
```bash
# Register the ECS service as a scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/invoice-cluster/invoice-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 1 \
  --max-capacity 10

# Scale on average CPU utilization
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/invoice-cluster/invoice-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 120
  }'
```

### Option 3: SageMaker Serverless Inference

```python
from sagemaker.serverless import ServerlessInferenceConfig
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://invoice-models/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    entry_point="inference.py",
    framework_version="2.0",
    py_version="py310"
)

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10
)

predictor = model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="invoice-inference-serverless"
)
```
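
The `entry_point="inference.py"` referenced above is not shown in this guide. The SageMaker PyTorch serving container looks for the handler functions `model_fn`, `input_fn`, `predict_fn`, and `output_fn` in that file. The sketch below shows one possible shape for it. It assumes that `best.pt` sits at the root of `model.tar.gz`, that `ultralytics` is installed in the container (for example via a `requirements.txt`), and that requests use the same `{"bucket": ..., "key": ...}` JSON format as the Lambda handler; none of this is confirmed by the project.

```python
# inference.py (sketch)
import json
import os


def model_fn(model_dir):
    """Load the model once per container from the extracted model.tar.gz."""
    from ultralytics import YOLO  # lazy import: only needed inside the container
    return YOLO(os.path.join(model_dir, "best.pt"))


def input_fn(request_body, request_content_type):
    """Parse a JSON request of the form {"bucket": "...", "key": "..."}."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    payload = json.loads(request_body)
    return payload["bucket"], payload["key"]


def predict_fn(input_data, model):
    """Download the image named in the request and run detection on it."""
    import boto3  # lazy import for the same reason as above
    bucket, key = input_data
    local_path = os.path.join("/tmp", os.path.basename(key))
    boto3.client("s3").download_file(bucket, key, local_path)
    return model.predict(local_path, conf=0.5)


def output_fn(prediction, accept):
    """Serialize class ids and confidence scores as JSON."""
    boxes = prediction[0].boxes
    return json.dumps({
        "classes": boxes.cls.tolist(),
        "confidence": boxes.conf.tolist(),
    })
```

Keeping the request format identical to the Lambda option means a client could switch between the two backends without changes.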

### Inference Performance Comparison

| Configuration | Latency per inference | Throughput | Estimated monthly cost |
|------|------------|---------|---------|
| Lambda 4GB | ~500-800ms | Scales on demand | ~$15 (10K requests) |
| Fargate 2vCPU 4GB | ~300-500ms | ~50 QPS | ~$30 |
| Fargate 4vCPU 8GB | ~200-300ms | ~100 QPS | ~$60 |
| EC2 g4dn.xlarge (T4) | ~50-100ms | ~200 QPS | ~$380 |

---

## Cost Comparison

### Training Cost Comparison (assuming 2 hours of training per day)

| Option | Calculation | Monthly cost |
|------|---------|------|
| EC2 running 24/7 | 24h × 30 days × $3.06 | ~$2,200 |
| EC2 started and stopped as needed | 2h × 30 days × $3.06 | ~$184 |
| EC2 Spot as needed | 2h × 30 days × $0.92 | ~$55 |
| SageMaker On-Demand | 2h × 30 days × $3.825 | ~$230 |
| SageMaker Spot | 2h × 30 days × $1.15 | ~$69 |
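
The table above is plain multiplication, so it is easy to rerun with different assumptions. The sketch below reproduces the five rows; the hourly rates are the ones used in the table, and the dictionary keys and function name are only labels chosen for this example.

```python
DAYS_PER_MONTH = 30

# (hours per day, hourly rate in USD) for each row of the table above
SCENARIOS = {
    "EC2 24/7": (24, 3.06),
    "EC2 on demand": (2, 3.06),
    "EC2 Spot": (2, 0.92),
    "SageMaker On-Demand": (2, 3.825),
    "SageMaker Spot": (2, 1.15),
}


def monthly_cost(hours_per_day, hourly_rate, days=DAYS_PER_MONTH):
    """Monthly compute cost in USD for a fixed number of hours per day."""
    return hours_per_day * days * hourly_rate


costs = {name: monthly_cost(*args) for name, args in SCENARIOS.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:,.2f}/month")
```

Changing the hours per day shows how quickly an always-on instance becomes the most expensive choice for a workload that trains only occasionally.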

### Full Cost Estimate for This Project

| Component | Recommended option | Monthly cost |
|------|---------|------|
| Data storage | S3 Standard (5GB) | ~$0.12 |
| Database | RDS PostgreSQL (db.t3.micro) | ~$15 |
| Inference | Lambda (10K requests/month) | ~$15 |
| Inference (alternative) | ECS Fargate | ~$30 |
| Training | SageMaker Spot (on demand) | ~$2-5 per run |
| ECR (image storage) | Basic usage | ~$1 |
| **Total (Lambda)** | | **~$35/month** + training |
| **Total (Fargate)** | | **~$50/month** + training |

---

## Recommended Architecture

### Architecture Overview

```
                            ┌─────────────────────────────────────┐
                            │           Amazon S3                 │
                            │  ├── training-images/               │
                            │  ├── datasets/                      │
                            │  ├── models/                        │
                            │  └── checkpoints/                   │
                            └─────────────────┬───────────────────┘
                                              │
            ┌─────────────────────────────────┼─────────────────────────────────┐
            │                                 │                                 │
            ▼                                 ▼                                 ▼
┌───────────────────────┐       ┌───────────────────────┐       ┌───────────────────────┐
│   Inference service   │       │   Training service    │       │   API Gateway         │
│                       │       │                       │       │                       │
│ Option A: Lambda      │       │   SageMaker           │       │   REST API            │
│ ~$15/mo (10K req)     │       │   Managed Spot        │       │   Invokes Lambda/ECS  │
│                       │       │   ~$2-5 per run       │       │                       │
│ Option B: ECS Fargate │       │                       │       │                       │
│ ~$30/mo               │       │   - Starts on demand  │       │                       │
│                       │       │   - Stops when done   │       │                       │
│ ┌───────────────────┐ │       │   - Auto checkpoints  │       │                       │
│ │ FastAPI + YOLO    │ │       │                       │       │                       │
│ │ CPU inference     │ │       │                       │       │                       │
│ └───────────────────┘ │       └───────────┬───────────┘       └───────────────────────┘
└───────────┬───────────┘                   │
            │                               │
            └───────────────────────────────┼───────────────────────────────────────────┘
                                            │
                                            ▼
                              ┌───────────────────────┐
                              │   Amazon RDS          │
                              │   PostgreSQL          │
                              │   db.t3.micro         │
                              │   ~$15/mo             │
                              └───────────────────────┘
```

### Lambda Inference Configuration

```yaml
# SAM template
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  InferenceFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: lambda_function.lambda_handler
      Runtime: python3.11
      MemorySize: 4096
      Timeout: 30
      Environment:
        Variables:
          MODEL_BUCKET: invoice-models
          MODEL_KEY: best.pt
      Policies:
        - S3ReadPolicy:
            BucketName: invoice-models
        - S3ReadPolicy:
            BucketName: invoice-uploads
      Events:
        InferApi:
          Type: Api
          Properties:
            Path: /infer
            Method: post
```

### SageMaker Training Configuration

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="./src",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",  # T4 GPU
    framework_version="2.0",
    py_version="py310",

    # Spot instance settings
    use_spot_instances=True,
    max_run=7200,
    max_wait=14400,

    # Checkpoints
    checkpoint_s3_uri="s3://invoice-training-data/checkpoints/",

    hyperparameters={
        "epochs": 100,
        "batch-size": 16,
        "model": "yolo11n.pt"
    }
)
```

---

## Implementation Steps

### Stage 1: Storage Setup

```bash
# Create the S3 buckets
aws s3 mb s3://invoice-training-data --region us-east-1
aws s3 mb s3://invoice-models --region us-east-1

# Upload the training data
aws s3 sync ./data/dataset/temp s3://invoice-training-data/images/

# Configure a lifecycle rule (optional; moves objects to colder storage automatically)
# The empty prefix filter applies the rule to every object in the bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket invoice-training-data \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "MoveToIA",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{
        "Days": 30,
        "StorageClass": "STANDARD_IA"
      }]
    }]
  }'
```

### Stage 2: Database Setup

```bash
# Create the RDS PostgreSQL instance
aws rds create-db-instance \
  --db-instance-identifier invoice-db \
  --db-instance-class db.t3.micro \
  --engine postgres \
  --engine-version 15 \
  --master-username docmaster \
  --master-user-password YOUR_PASSWORD \
  --allocated-storage 20

# Allow the application's security group (sg-yyy) to reach PostgreSQL
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --protocol tcp \
  --port 5432 \
  --source-group sg-yyy
```

### Stage 3: Deploying the Inference Service

**Option A: Lambda**

```bash
# Build the Lambda layer (dependencies)
# Note: function code plus layers must fit within Lambda's 250 MB unzipped limit.
# ultralytics pulls in PyTorch, which usually exceeds it; in that case package
# the function as a container image (up to 10 GB) instead of using a layer.
cd lambda-layer
pip install ultralytics opencv-python-headless -t python/
zip -r layer.zip python/
aws lambda publish-layer-version \
  --layer-name yolo-deps \
  --zip-file fileb://layer.zip \
  --compatible-runtimes python3.11

# Deploy the Lambda function
cd ../lambda
zip function.zip lambda_function.py
aws lambda create-function \
  --function-name invoice-inference \
  --runtime python3.11 \
  --handler lambda_function.lambda_handler \
  --role arn:aws:iam::123456789012:role/LambdaRole \
  --zip-file fileb://function.zip \
  --memory-size 4096 \
  --timeout 30 \
  --layers arn:aws:lambda:us-east-1:123456789012:layer:yolo-deps:1

# Create the API Gateway HTTP API
aws apigatewayv2 create-api \
  --name invoice-api \
  --protocol-type HTTP \
  --target arn:aws:lambda:us-east-1:123456789012:function:invoice-inference
```

**Option B: ECS Fargate**

```bash
# Create the ECR repository
aws ecr create-repository --repository-name invoice-inference

# Build and push the image
aws ecr get-login-password | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker build -t invoice-inference .
docker tag invoice-inference:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/invoice-inference:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/invoice-inference:latest

# Create the ECS cluster
aws ecs create-cluster --cluster-name invoice-cluster

# Register the task definition
aws ecs register-task-definition --cli-input-json file://task-definition.json

# Create the service
aws ecs create-service \
  --cluster invoice-cluster \
  --service-name invoice-service \
  --task-definition invoice-inference \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-xxx"],
      "securityGroups": ["sg-xxx"],
      "assignPublicIp": "ENABLED"
    }
  }'
```

### Stage 4: Training Service Setup

```python
# setup_sagemaker.py
import json

from sagemaker.pytorch import PyTorch

# ARN of an existing SageMaker execution role (create it in IAM beforehand)
role_arn = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Training job settings, kept in a plain dict so they can be saved and reused
training_config = {
    "entry_point": "train.py",
    "source_dir": "./src/training",
    "role": role_arn,
    "instance_count": 1,
    "instance_type": "ml.g4dn.xlarge",
    "framework_version": "2.0",
    "py_version": "py310",
    "use_spot_instances": True,
    "max_run": 7200,
    "max_wait": 14400,
    "checkpoint_s3_uri": "s3://invoice-training-data/checkpoints/",
}

# Configure the training job
estimator = PyTorch(**training_config)

# Persist the settings as JSON for later use
with open("training_config.json", "w") as f:
    json.dump(training_config, f, indent=2)
```

### Stage 5: Training Trigger API

```python
# lambda_trigger_training.py
from sagemaker.pytorch import PyTorch

def lambda_handler(event, context):
    """Start a SageMaker training job."""

    epochs = event.get('epochs', 100)

    estimator = PyTorch(
        entry_point="train.py",
        # When source_dir is an S3 URI it must point to a tar.gz archive, not a prefix
        source_dir="s3://invoice-training-data/code/sourcedir.tar.gz",
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        instance_count=1,
        instance_type="ml.g4dn.xlarge",
        framework_version="2.0",
        py_version="py310",
        use_spot_instances=True,
        max_run=7200,
        max_wait=14400,
        hyperparameters={
            "epochs": epochs,
            "batch-size": 16,
        }
    )

    estimator.fit(
        inputs={
            "training": "s3://invoice-training-data/datasets/train/",
            "validation": "s3://invoice-training-data/datasets/val/"
        },
        wait=False  # return immediately; the job runs asynchronously
    )

    return {
        'statusCode': 200,
        'body': {
            'training_job_name': estimator.latest_training_job.name,
            'status': 'Started'
        }
    }
```
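
Because the trigger returns immediately, the caller needs some way to find out when the job has finished. One option is to poll `describe_training_job` on a boto3 SageMaker client. The helper below is a sketch: its name, the polling interval, and the poll limit are arbitrary choices for illustration, and the client is passed in so the function does not depend on boto3 directly.

```python
import time

# Statuses after which a training job no longer changes
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=60, max_polls=240,
                          sleep=time.sleep):
    """Poll a SageMaker training job until it reaches a terminal status.

    `client` is expected to behave like boto3.client("sagemaker").
    """
    for _ in range(max_polls):
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# Example (requires AWS credentials):
#   import boto3
#   status = wait_for_training_job(boto3.client("sagemaker"), "my-training-job")
```

A long-running poll like this belongs in a script or a scheduled check rather than in the trigger Lambda itself, since the trigger's timeout is far shorter than a training run.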

---

## AWS vs Azure

### Service Mapping

| Function | AWS | Azure |
|------|-----|-------|
| Object storage | S3 | Blob Storage |
| Mount tool | Mountpoint for S3 | BlobFuse2 |
| ML platform | SageMaker | Azure ML |
| Container service | ECS/Fargate | Container Apps |
| Serverless | Lambda | Functions |
| GPU VM | EC2 P3/G4dn | NC/ND series |
| Container registry | ECR | ACR |
| Database | RDS PostgreSQL | PostgreSQL Flexible |

### Price Comparison

| Component | AWS | Azure |
|------|-----|-------|
| Storage (5GB) | ~$0.12/month | ~$0.09/month |
| Database | ~$15/month | ~$25/month |
| Inference (serverless) | ~$15/month | ~$30/month |
| Inference (containers) | ~$30/month | ~$30/month |
| Training (Spot GPU) | ~$2-5 per run | ~$1-5 per run |
| **Total** | **~$35-50/month** | **~$65/month** |

### Strengths and Weaknesses

| Aspect | AWS advantage | Azure advantage |
|------|---------|-----------|
| Price | Lambda is cheaper | GPU Spot is cheaper |
| ML platform | SageMaker is more mature | Azure ML is easier to use |
| Serverless GPU | No native support | Container Apps GPU |
| Documentation | More extensive | Better Chinese-language documentation |
| Ecosystem | Larger | Office 365 integration |

---

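## Summary

### Recommended Configuration

| Component | Recommended option | Estimated monthly cost |
|------|---------|---------|
| Data storage | S3 Standard | ~$0.12 |
| Database | RDS db.t3.micro | ~$15 |
| Inference | Lambda 4GB | ~$15 |
| Training | SageMaker Spot | On demand, ~$2-5 per run |
| ECR | Basic usage | ~$1 |
| **Total** | | **~$35/month** + training |

### Key Decisions

| Scenario | Choice |
|------|------|
| Lowest cost | Lambda + SageMaker Spot |
| Steady inference | ECS Fargate |
| GPU inference | ECS + EC2 GPU |
| MLOps integration | The full SageMaker stack |

### Notes

1. **Lambda cold starts**: the first invocation takes ~3-5 seconds; Provisioned Concurrency avoids this
2. **Spot interruptions**: configure checkpoints so that SageMaker can resume the job automatically
3. **S3 data transfer**: free within the same region, billed across regions
4. **No GPU on Fargate**: GPU workloads must run on ECS with EC2 instances
5. **SageMaker markup**: ~25% more expensive than EC2, but it removes the management overhead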