---
created: "2026-03-24"
type: project
status: active
deadline: ""
tags: [langgraph, python, devops, automation]
---
# Billo Release Agent
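## Goals
Convert the existing Claude Code release workflow skill into a standalone LangGraph Python service that provides:
- Automatic triggering by Azure DevOps webhooks (replacing manually pasted PR URLs)
- Human-in-the-loop approval via LangGraph `interrupt()`
- State persisted in PostgreSQL (replacing JSON files)
- Concurrent processing across multiple threads (one independent thread per PR/release)
- Slack notifications with approval buttons
## Architecture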
```
Azure DevOps PR Webhook → FastAPI → LangGraph Agent → Azure DevOps / Jira / Slack / Claude API
Slack Button / API (human approval resume)
PostgreSQL (checkpointer + store)
```
## Code Locations
- Project directory: `/c/Users/yaoji/git/Billo/billo-release-agent/`
- Source code: `src/release_agent/`
- Tests: `tests/`
- Original skill: `/c/Users/yaoji/git/Billo/release-workflow/.claude/skills/billo-release-workflow/SKILL.md`
## Project Structure
```
billo-release-agent/
├── pyproject.toml
├── Dockerfile
├── docker-compose.yml
├── src/release_agent/
│   ├── main.py               # FastAPI app + lifespan + task management
│   ├── config.py             # pydantic-settings (all environment variables)
│   ├── state.py              # ReleaseState TypedDict (LangGraph state)
│   ├── exceptions.py         # Exception hierarchy
│   ├── branch_parser.py      # Pure function: extract the ticket ID from a branch name
│   ├── versioning.py         # Pure function: version number calculation
│   ├── models/               # Pydantic data models
│   │   ├── pr.py, ticket.py, release.py, pipeline.py
│   │   └── webhook.py, review.py, jira.py
│   ├── tools/                # External service clients
│   │   ├── azdo.py, jira.py, slack.py, claude_review.py
│   │   └── _http.py, _retry.py   # Shared helpers
│   ├── graph/                # LangGraph graph definitions
│   │   ├── dependencies.py   # ToolClients, StagingStore
│   │   ├── routing.py        # 6 pure routing functions
│   │   ├── pr_completed.py   # 12 nodes + graph builder
│   │   ├── release.py        # 14 nodes + graph builder
│   │   └── full_cycle.py     # Subgraph composition
│   └── api/                  # FastAPI routes
│       ├── models.py         # HTTP request/response models
│       ├── dependencies.py   # Depends() injection
│       └── webhooks.py, approvals.py, status.py
└── tests/                    # 647 tests, 99.11% coverage
    ├── test_*.py             # Phase 1 unit tests
    ├── tools/test_*.py       # Phase 2 client tests
    ├── graph/test_*.py       # Phase 3 graph tests
    └── api/test_*.py         # Phase 4 API tests
```
## Implementation Phases
### Phase 1: Foundation (completed 2026-03-23)
Project structure, Pydantic models, config, versioning, and branch parser.
**Results:**
- 152 tests → still 152 tests after review, 100% coverage
- Files: branch_parser.py, versioning.py, config.py, state.py
- Models: pr.py, ticket.py, release.py, pipeline.py, webhook.py
**Review fixes:**
- `postgres_dsn` changed to `SecretStr`
- `import re` moved to module level, with the pattern precompiled
- `ReleasePipelineStage` gained a consistency check between approval_id and requires_approval
- `WebhookResource.status` now uses `Literal`
- Removed duplicate tests
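The module-level precompiled pattern mentioned in the review fixes could look roughly like the sketch below. The branch naming convention (`feature/ALLPOST-123-description`), the regex, and the function name `extract_ticket_id` are assumptions made for illustration; the real `branch_parser.py` may differ.

```python
import re
from typing import Optional

# Compiled once at import time rather than on every call.
# Assumed convention: an uppercase project key, a hyphen, then digits.
_TICKET_RE = re.compile(r"(?<![A-Za-z0-9])([A-Z][A-Z0-9]+-\d+)(?![0-9])")


def extract_ticket_id(branch: str) -> Optional[str]:
    """Return the first Jira-style ticket ID in a branch name, or None."""
    # Azure DevOps reports branches as refs/heads/<name>; strip that prefix.
    name = branch.removeprefix("refs/heads/")
    match = _TICKET_RE.search(name)
    return match.group(1) if match else None
```

Because the function is pure (no I/O, no shared state), it can be unit-tested with plain string inputs.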
### Phase 2: Service Clients (completed 2026-03-24)
Four external service clients, an exception hierarchy, and shared HTTP helpers.
**Results:**
- 364 tests, 99.6% coverage
- Added: exceptions.py, models/review.py, models/jira.py
- Clients: tools/azdo.py, tools/jira.py, tools/slack.py, tools/claude_review.py
- Shared: tools/_http.py, tools/_retry.py
**Key design decisions:**
- httpx.AsyncClient is injected for testability
- Custom exception hierarchy: ServiceError → AuthenticationError / NotFoundError / RateLimitError / ServiceUnavailableError
- Exponential-backoff retry decorator `with_retry`
- Claude tool_use produces structured code review output
- Two-step Jira transition logic (first Dev in Progress, then code review)
**Review fixes:**
- Broad `except Exception` narrowed to `(ValueError, KeyError)`
- Fixed the implicit `None` return path in the retry decorator
- Added a type annotation to the ClaudeReviewer client parameter
- 401/403 errors now pass along the detail message
- Added support for the Jira errorMessages format
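As an illustration of the retry decorator and exception hierarchy described above, here is a minimal sketch. Only the names `with_retry`, `ServiceError`, `RateLimitError`, and `ServiceUnavailableError` come from this note; the parameters, defaults, and jitter are assumptions, not the project's actual signature.

```python
import asyncio
import functools
import random


class ServiceError(Exception):
    """Base class for external-service failures."""


class RateLimitError(ServiceError):
    """HTTP 429 or equivalent."""


class ServiceUnavailableError(ServiceError):
    """HTTP 503 or equivalent."""


def with_retry(max_attempts=3, base_delay=0.5,
               retry_on=(RateLimitError, ServiceUnavailableError)):
    """Retry an async function with exponential backoff and a little jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Delay doubles each attempt: base, 2*base, 4*base, ...
                    delay = base_delay * 2 ** (attempt - 1)
                    await asyncio.sleep(delay + random.uniform(0, delay / 10))
            # Guard against falling through and returning None implicitly.
            raise AssertionError("unreachable")
        return wrapper
    return decorator
```

Errors outside `retry_on` (for example an authentication failure) propagate immediately, since retrying them would not help.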
### Phase 3: LangGraph Graphs (completed 2026-03-24)
Three graphs, dependency injection, routing, and a staging store.
**Results:**
- 520 tests (156 new), 99.42% coverage
- Files: graph/dependencies.py, graph/routing.py, graph/pr_completed.py, graph/release.py, graph/full_cycle.py
- state.py extended with 17 new fields
**Key design decisions:**
- `ToolClients` is a frozen dataclass injected through `config["configurable"]["clients"]`
- `StagingStore` Protocol + `JsonFileStagingStore` file-based implementation (to be migrated to PostgreSQL later)
- Dedicated interrupt nodes (rather than inline interrupts)
- Subgraph composition: full_cycle contains the pr_completed and release subgraphs
- 6 pure routing functions: is_pr_already_merged, is_review_approved, has_ticket, should_continue_to_release, has_pipelines, has_pending_approvals
- Error handling: non-critical nodes catch ReleaseAgentError and append it to errors; critical nodes re-raise
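The injection and routing decisions above can be sketched without importing LangGraph. The field names on `ToolClients`, the state key `review_approved`, and the helpers `get_clients` and `record_error` are hypothetical; only `ToolClients`, `is_review_approved`, and the `config["configurable"]["clients"]` path come from this note.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolClients:
    """Immutable bundle of service clients (fields here are placeholders)."""
    azdo: Any
    jira: Any
    slack: Any


def get_clients(config: dict) -> ToolClients:
    """Read the injected clients out of a LangGraph-style config dict."""
    return config["configurable"]["clients"]


def is_review_approved(state: dict) -> str:
    """Pure router: inspects state only and returns the name of the next edge."""
    return "approve" if state.get("review_approved") else "request_changes"


def record_error(state: dict, exc: Exception) -> dict:
    """Non-critical nodes return an extended errors list instead of raising."""
    return {"errors": [*state.get("errors", []), str(exc)]}
```

Keeping routers free of I/O means each branch of the graph can be tested with a plain dict, without compiling a graph or mocking any client.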
**Graph: PR Completed (12 nodes):**
```
parse_webhook → fetch_pr_details → [already merged?]
├─ yes → move_jira_ready_for_stage
└─ no → move_jira_code_review → run_code_review → evaluate_review
├─ approve → interrupt_confirm_merge → merge_pr
└─ request_changes → notify_request_changes → END
→ move_jira_ready_for_stage → add_jira_pr_link → calculate_version → update_staging → END
```
**Graph: Release (14 nodes):**
```
load_staging → interrupt_confirm_release → create_release_pr → interrupt_confirm_merge_release
→ merge_release_pr → move_tickets_to_done → send_slack_notification → archive_release
→ list_pipelines → [has pipelines?]
├─ yes → interrupt_confirm_trigger → trigger_pipelines → check_release_approvals → END
└─ no → END
```
**5 interrupt points:**
1. After code review passes → confirm merge
2. Before creating the release PR → confirm create
3. Before merging the release PR → confirm merge
4. Before triggering the build pipeline → confirm trigger
5. Approve release stage → confirm approve (per stage)
### Phase 4: API Layer + Deployment (completed 2026-03-24)
FastAPI application plus Docker deployment configuration.
**Results:**
- 647 tests (127 new), 99.11% coverage
- Files: main.py, api/models.py, api/dependencies.py, api/webhooks.py, api/approvals.py, api/status.py
- Deployment: Dockerfile, docker-compose.yml
**API Endpoints:**
| Method | Path | Purpose |
|--------|------|---------|
| POST | `/webhooks/azdo` | Receives Azure DevOps PR webhooks |
| POST | `/approvals/{thread_id}` | Resumes an interrupted graph (human approval) |
| GET | `/approvals/pending` | Lists threads awaiting approval |
| GET | `/status` | Health check |
| GET | `/releases/{repo}` | Lists all versions for a repo |
| GET | `/staging` | Current staging state |
| POST | `/manual/pr/{pr_id}` | Manually triggers PR processing (webhook fallback) |
| POST | `/manual/release` | Manually triggers a release |
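**Key design decisions:**
- Singleton compiled graphs are stored in `app.state` and compiled once at startup
- The `agent_threads` PostgreSQL table tracks thread status (running/interrupted/completed/error)
- `asyncio.create_task` plus the checkpointer provide background execution and crash recovery
- The webhook secret is verified via the `X-Webhook-Secret` header and `hmac.compare_digest`
- FastAPI dependencies are injected through `request.app.state` and `Depends()`
- Graceful shutdown: wait 30 seconds, then cancel any remaining background tasks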
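A minimal sketch of the webhook secret check, assuming the header value has already been read from the request. The function name `is_valid_webhook_secret` is hypothetical; `hmac.compare_digest` is the standard-library constant-time comparison named in the list above.

```python
import hmac
from typing import Optional


def is_valid_webhook_secret(provided: Optional[str], expected: str) -> bool:
    """Constant-time comparison of the X-Webhook-Secret header value."""
    # An unconfigured (empty) secret must reject every request rather than
    # match an empty or missing header.
    if not expected or not provided:
        return False
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

Comparing encoded bytes avoids the `TypeError` that `compare_digest` raises for non-ASCII `str` arguments.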
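**Review fixes (3 CRITICAL):**
- `webhook_secret` is now required (empty default removed), so authentication cannot be bypassed when it is unconfigured
- `submit_approval` looks up `graph_name` in the DB before resuming (previously hard-coded to pr_completed)
- `_resume_graph` catches exceptions and returns an ApprovalResponse instead of leaking a 500 error
**Deployment configuration:**
- Dockerfile: Python 3.12-slim, non-root user, dependencies installed with uv
- docker-compose: agent + postgres:16-alpine, health check, pgdata volume
- Required environment variables: AZDO_PAT, ANTHROPIC_API_KEY, POSTGRES_DSN, JIRA_EMAIL, JIRA_API_TOKEN, SLACK_WEBHOOK_URL, WEBHOOK_SECRET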
### Phase 5: Migration + Hardening (completed 2026-03-24)
Data migration, PostgreSQL store, operator authentication, and documentation.
**Results:**
- 760 tests (113 new), 99.22% coverage
- Added: graph/postgres_staging_store.py, scripts/migrate_json_to_db.py, .env.example, README.md
- The StagingStore Protocol became async, with `await` added at every call site
**Key design decisions:**
- `PostgresStagingStore` uses a psycopg3 async pool and stores tickets as JSONB
- `archive()` uses an explicit transaction (`conn.transaction()`) so that the INSERT and DELETE are atomic
- `staging_releases` table (per-repo upsert) + `archived_releases` table (unique on repo + version)
- Operator token authentication: the `require_operator_token` dependency is applied to the POST /approvals and POST /manual/* endpoints
- Migration script: extracted pure functions plus a dry-run mode; reads JSON files and inserts into PostgreSQL
- `JsonFileStagingStore` is kept as a fallback for local development
**Review fixes (1 HIGH):**
- `archive()` now wraps the INSERT + DELETE in `async with conn.transaction()`
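To show why the transaction matters, the sketch below reproduces the archive step with the standard-library `sqlite3` module instead of psycopg3, so it runs without a PostgreSQL server. This is a swapped-in, synchronous stand-in: the column layout is assumed, and the real store uses `async with conn.transaction()` on a psycopg connection.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create in-memory tables loosely mirroring the two tables named above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging_releases (repo TEXT PRIMARY KEY, tickets TEXT)")
    conn.execute(
        "CREATE TABLE archived_releases ("
        "repo TEXT, version TEXT, tickets TEXT, UNIQUE (repo, version))"
    )
    conn.commit()
    return conn


def archive(conn: sqlite3.Connection, repo: str, version: str) -> None:
    """Move a repo's staging row into the archive in one transaction."""
    # `with conn` commits on success and rolls back if any statement raises,
    # so the INSERT and DELETE either both apply or neither does.
    with conn:
        conn.execute(
            "INSERT INTO archived_releases (repo, version, tickets) "
            "SELECT repo, ?, tickets FROM staging_releases WHERE repo = ?",
            (version, repo),
        )
        conn.execute("DELETE FROM staging_releases WHERE repo = ?", (repo,))
```

If the INSERT violates the (repo, version) uniqueness constraint, the staging row is left untouched, so a failed archive never loses staged tickets.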
## Tech Stack
| Component | Technology |
|------|------|
| Agent framework | LangGraph |
| Web framework | FastAPI + uvicorn |
| HTTP client | httpx (async) |
| AI code review | Claude Code CLI (`claude -p`), billed against the subscription quota |
| Database | PostgreSQL (checkpointer + store) |
| Validation | Pydantic v2 + pydantic-settings |
| Database driver | psycopg3 + psycopg_pool (async PostgreSQL) |
| Testing | pytest + pytest-asyncio + httpx.MockTransport + FastAPI TestClient |
| Deployment | Docker Compose on homelab |
## External Service Integrations
| Service | Purpose | Authentication |
|------|------|---------|
| Azure DevOps | PR management, pipeline triggering | PAT (Basic auth) |
| Jira | Ticket status transitions | Email + API token (Basic auth) |
| Slack | Release notifications, approval requests | Incoming Webhook |
| Claude Code CLI | Automated code review | Subscription (not an API key) |
## Azure DevOps Pipeline Mapping
| Repo | Build Pipeline ID | Release Pipeline | Release ID |
|------|------------------|-----------------|------------|
| Billo.Platform.Payment | 41 | Billo Payment | 37 |
| Billo.Platform.Payment (Scheduler) | 51 | Billo Payment Scheduler | 47 |
| Billo.Platform.Document.DocumentAnalyser | 75 | DocumentAnalyser | 58 |
## Release Pipeline Approval Configuration
| Pipeline | Sandbox | Production |
|----------|---------|------------|
| Billo Payment | Project Admins approve | Release Admins approve |
| DocumentAnalyser | Automatic | Release Admins approve |
## Jira Workflow Status Transitions
```
IN PROGRESS → CODE REVIEW → WAITING FOR TEST → IN TEST
→ READY FOR STAGE → DEPLOYED IN STAGE → IN PRODUCTION → CLOSED
```
Note: CODE REVIEW can only be entered from IN PROGRESS.
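The constraint above is what motivates the two-step Jira transition. A minimal sketch, assuming that a ticket in any other status can first be moved back to IN PROGRESS (that assumption, and the function name `transitions_to_code_review`, are not confirmed by this note):

```python
def transitions_to_code_review(current: str) -> list[str]:
    """Return the statuses to step through, in order, to reach CODE REVIEW."""
    status = current.strip().upper()
    if status == "CODE REVIEW":
        return []                       # already there, nothing to do
    if status == "IN PROGRESS":
        return ["CODE REVIEW"]          # direct transition is allowed
    # From any other status, go through IN PROGRESS first (two-step path).
    return ["IN PROGRESS", "CODE REVIEW"]
```

A caller would apply each returned status in sequence and stop at the first transition Jira rejects.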
## Completed Phases Overview
| Phase | Status | Tests | Coverage |
|-------|------|-------|----------|
| 1. Foundation | Done | 152 | 100% |
| 2. Service Clients | Done | +212 = 364 | 99.6% |
| 3. LangGraph Graphs | Done | +156 = 520 | 99.4% |
| 4. API + Deploy | Done | +127 = 647 | 99.1% |
| 5. Migration + Hardening | Done | +113 = 760 | 99.2% |
| Final Code Review + Fix | Done | +12 = 772 | 98.4% |
| 6. Slack + CI/CD | Done | +193 = 965 | 96.6% |
| 7. PR Polling + Auto Ticket | Done | +96 = 1061 | 96.0% |
## Code Review Approach Change (2026-03-24)
The original approach called Claude directly through the Anthropic API; it was replaced with a Claude Code CLI subprocess.
| Item | Before | After |
|------|------|------|
| Invocation | `anthropic.AsyncAnthropic` API | `claude -p` subprocess |
| Billing | API key (billed per token) | Subscription quota |
| Code understanding | Sees only the diff text passed in | Can Read/Glob/Grep the whole codebase on its own |
| Structured output | tool_use schema | `--json-schema` + `--output-format json` |
| Dependencies | ANTHROPIC_API_KEY | `claude` CLI on PATH + REPOS_BASE_DIR |
Key configuration: set `REPOS_BASE_DIR=/c/Users/yaoji/git/Billo` in `.env`; Claude Code runs the review inside the corresponding repo directory.
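The subprocess call could be structured along the lines of the sketch below, which pipes the prompt through stdin instead of passing it as an argument. The helper name `run_cli` and its parameters are hypothetical, and the command is left as a parameter so the example makes no claim about the exact CLI flags the project passes.

```python
import asyncio
from typing import Optional


async def run_cli(cmd: list[str], prompt: str, cwd: Optional[str] = None,
                  timeout: float = 300.0) -> str:
    """Run a CLI with the prompt piped through stdin and return its stdout."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
        cwd=cwd,  # e.g. the repo directory under REPOS_BASE_DIR
    )
    try:
        out, err = await asyncio.wait_for(
            proc.communicate(prompt.encode("utf-8")), timeout=timeout
        )
    except asyncio.TimeoutError:
        proc.kill()        # do not leave a runaway child process behind
        await proc.wait()
        raise
    if proc.returncode != 0:
        detail = err.decode("utf-8", errors="replace")
        raise RuntimeError(f"CLI exited with {proc.returncode}: {detail}")
    return out.decode("utf-8")
```

A review call might pass something like `["claude", "-p", "--output-format", "json"]` as `cmd`; whether the installed CLI version reads the prompt from stdin in that form should be checked against its documentation.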
### Phase 6: Slack Interactive + CI/CD (completed 2026-03-24)
Approval via Slack buttons, plus automatic CI/CD triggering, polling, and approval.
**Results:**
- 965 tests (+193 new), 96.55% coverage
- Added: models/build.py, graph/polling.py, graph/ci_nodes.py, api/slack_interactions.py
- SlackClient became dual-mode (webhook fallback + Web API)
**Slack interaction flow:**
```
Graph interrupt → Slack message with [Approve] [Cancel] buttons
→ user clicks a button → POST /slack/interactions
→ verify signature (HMAC-SHA256 + 5-minute replay protection)
→ extract thread_id + decision → _resume_graph
→ update the Slack message to show the result
```
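The signature step in the flow above follows Slack's documented request-signing scheme: the signature is an HMAC-SHA256 over `v0:<timestamp>:<raw body>`, hex-encoded and prefixed with `v0=`. The sketch below is one way to implement it with the 5-minute replay window; the function name and the `now` parameter (added for testability) are assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(signing_secret: str, timestamp: str, body: bytes,
                           signature: str, now: Optional[float] = None,
                           max_age_seconds: int = 300) -> bool:
    """Check a Slack request signature and reject stale timestamps."""
    if not signing_secret:
        return False  # the caller should treat this as a configuration error
    try:
        sent_at = int(timestamp)
    except (TypeError, ValueError):
        return False
    current = time.time() if now is None else now
    if abs(current - sent_at) > max_age_seconds:
        return False  # outside the 5-minute replay window
    # Slack signs the byte string "v0:<timestamp>:<raw request body>".
    base = b"v0:" + timestamp.encode("utf-8") + b":" + body
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The `timestamp` and `signature` arguments correspond to the `X-Slack-Request-Timestamp` and `X-Slack-Signature` headers, and `body` must be the raw, unparsed request body.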
**CI/CD flow:**
```
PR merge → develop:
merge_pr → trigger_ci_build(develop) → poll_ci_build → notify_ci_result → END
Release merge → main:
merge_release_pr → trigger_ci_build(main) → poll_ci_build
→ ci_passed: wait_for_cd → approval loop (Sandbox → Production)
→ ci_failed: notify_failure → END
```
**New configuration:**
- `SLACK_BOT_TOKEN`: Slack App Bot Token (xoxb-...)
- `SLACK_SIGNING_SECRET`: Slack signing secret (must be non-empty)
- `SLACK_CHANNEL_ID`: channel that messages are sent to
- `CI_POLL_INTERVAL_SECONDS`: CI polling interval (default 30s)
- `CI_POLL_MAX_WAIT_SECONDS`: maximum CI wait time (default 30min)
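The two polling settings above suggest a loop like the following sketch. The status strings, the `"timeout"` return value, and the function name `poll_until_done` are illustrative assumptions rather than the actual contents of `graph/polling.py`.

```python
import asyncio
from typing import Awaitable, Callable

# Assumed terminal build states; the real Azure DevOps values may differ.
TERMINAL_STATES = {"succeeded", "failed", "canceled"}


async def poll_until_done(fetch_status: Callable[[], Awaitable[str]],
                          interval_seconds: float = 30.0,
                          max_wait_seconds: float = 1800.0) -> str:
    """Poll until a terminal status is seen or the wait budget is spent."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_seconds
    while True:
        status = await fetch_status()
        if status in TERMINAL_STATES:
            return status
        # Stop if the next sleep would run past the deadline.
        if loop.time() + interval_seconds > deadline:
            return "timeout"
        await asyncio.sleep(interval_seconds)
```

Returning a distinct `"timeout"` value lets the graph route to a notification node instead of raising from inside the polling node.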
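**Review fixes (2 CRITICAL + 4 HIGH):**
- Added 5-minute timestamp protection against replay attacks
- An empty signing_secret now returns 503 instead of silently skipping verification
- Decision values are validated against a whitelist
- Corrected the CI branch logic (develop for PR, main for release)
- Added type validation for ci_build_id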
### Phase 7: PR Polling + Auto-Create Jira Ticket (completed 2026-03-24)
Periodically scans the active PRs of every repo and automatically creates a Jira ticket when a PR has none.
**Results:**
- 1061 tests (+96 new), 95.96% coverage
- Added: services/pr_dedup.py, services/pr_poller.py
- Modified: azdo.py (list_active_prs), jira.py (create_issue + _text_to_adf), claude_review.py (generate_ticket_content), routing.py (route_after_fetch), pr_completed.py (auto_create_ticket node)
**PR polling flow:**
```
Every 5 minutes → scan all active PRs in WATCHED_REPOS (target=develop)
→ deduplicate against agent_threads
→ synthesize a webhook payload → trigger the pr_completed graph
```
**Auto-create Jira ticket flow:**
```
fetch_pr_details → route_after_fetch (3-way routing)
├─ merged → calculate_version (skips review)
├─ active_with_ticket → move_jira_code_review (normal flow)
└─ active_no_ticket → auto_create_ticket
→ Claude CLI generates summary + description
→ Jira create_issue (ALLPOST project)
→ set ticket_id + has_ticket=True
→ move_jira_code_review (continues the normal flow)
```
**New configuration:**
- `WATCHED_REPOS`: comma-separated list of repos
- `PR_POLL_INTERVAL_SECONDS=300`: polling interval
- `PR_POLL_ENABLED=False`: polling on/off switch
- `DEFAULT_JIRA_PROJECT=ALLPOST`: project in which tickets are auto-created
**Review fixes (1 CRITICAL + 2 HIGH):**
- A mismatched schedule_fn parameter signature made polling fail silently → fixed to pass only initial_state
- The dedup SQL did not enforce (pr_id, repo_name) pairing → switched to a paired query using unnest
- run_graph_in_background was missing repos_base_dir + default_jira_project → both added
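The pairing fix can be illustrated as follows. The SQL is a sketch of a paired `unnest` query with assumed column names, and `filter_new_prs` is a hypothetical pure-Python equivalent of the same rule; neither is taken from `services/pr_dedup.py`.

```python
# PostgreSQL's multi-argument unnest() zips the two arrays row by row, so
# PR 1 in repo "a" never matches a stored row for PR 1 in repo "b".
# Table and column names are assumptions for illustration.
DEDUP_SQL = """
SELECT t.pr_id, t.repo_name
FROM agent_threads AS t
JOIN unnest(%s::int[], %s::text[]) AS c(pr_id, repo_name)
  ON t.pr_id = c.pr_id AND t.repo_name = c.repo_name
"""


def filter_new_prs(candidates, already_seen):
    """Keep candidates whose (pr_id, repo_name) pair has no existing thread."""
    seen = set(already_seen)
    return [pair for pair in candidates if pair not in seen]
```

Checking `pr_id` and `repo_name` independently (for example with two separate `IN` clauses) would wrongly drop a new PR whose ID happens to exist in another repo; matching on the pair avoids that.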
## Final Code Review Fixes (2026-03-24)
A full code review found 3 CRITICAL + 8 HIGH issues, all of which have been fixed:
| # | Severity | Issue | Fix |
|---|--------|------|------|
| 1 | CRITICAL | AzDoClient constructor arguments did not match, crashing at startup | Pass the correct `base_url`, `vsrm_base_url`, `vsrm_http_client` |
| 2 | CRITICAL | Empty webhook_secret bypassed authentication | An empty expected value rejects all requests |
| 3 | CRITICAL | docker-compose default password `secret` | Changed to `${POSTGRES_PASSWORD:?must be set}` |
| 4 | HIGH | `graph_name` was not stored in agent_threads | `_upsert_thread` gained `graph_name`, `repo_name`, `pr_id` parameters |
| 5 | HIGH | No httpx timeout configured | Added `timeout=30.0` |
| 6 | HIGH | httpx.AsyncClient was never closed | Lifespan shutdown closes all HTTP clients |
| 7 | HIGH | Error handling leaked internal details | `_generic_error_handler` returns a fixed message |
| 8 | HIGH | Approvals returned 200 with an error body | Changed to HTTPException(404/400) |
Additional fixes:
- `anthropic_api_key` is now optional (the CLI uses the subscription and does not need it)
- docker-compose: `WEBHOOK_SECRET` required, agent health check, `REPOS_BASE_DIR` environment variable
- `_run_graph` now logs via `logger.exception`
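## Follow-up Improvements (non-blocking)
- [ ] `get_pr_diff` currently returns only file names and should be extended to return the actual diff content (lower priority, since the Claude Code CLI can read files itself)
- [ ] `list_build_pipelines` should filter the API request by repo
- [ ] The `@with_retry` decorator has not yet been applied to the client methods
- [ ] The Jira fallback transition name should be configurable rather than hard-coded
- [ ] `check_release_approvals` is a stub; actual approval gate detection still needs to be implemented
- [ ] `last_merge_source_commit` is always None and needs to be fetched from the AzDo API
- [ ] Interrupt nodes do not check the resume value, so any resume continues execution (post-interrupt routing is needed)
- [ ] `archive_release` uses `date.today()`, which is not testable; the date should be injected
- [ ] Extract `_upsert_thread` from webhooks.py into a shared `api/db.py` to eliminate the circular import
- [ ] Convert the Dockerfile to a multi-stage build
- [ ] A CLI prompt longer than 100K characters may exceed the OS ARG_MAX; switch to a stdin pipe
- [ ] `PostgresStagingStore.save` has a concurrency race (needs SELECT FOR UPDATE or an application-level lock)
- [ ] The 30s shutdown timeout may be too short for the Claude CLI's 300s timeout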
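For the shutdown-timeout item, one option is to make the grace period a parameter so it can be raised above the CLI timeout. The sketch below shows the wait-then-cancel pattern with plain asyncio; the function name `shutdown_tasks` and its return value are assumptions, not the code in `main.py`.

```python
import asyncio


async def shutdown_tasks(tasks: set, grace_seconds: float = 30.0) -> int:
    """Wait for background tasks, cancel stragglers, return how many were cancelled."""
    if not tasks:
        return 0
    # Give running graphs a chance to finish within the grace period.
    done, pending = await asyncio.wait(tasks, timeout=grace_seconds)
    for task in pending:
        task.cancel()
    # Let cancelled tasks unwind; their exceptions are collected, not raised.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)
```

With a checkpointer in place, a cancelled run could be resumed from its last checkpoint after restart, which limits the cost of choosing a shorter grace period.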
## Runtime Environment: WSL (recommended)
Running directly on Windows has two problems:
1. psycopg async requires SelectorEventLoop; the Windows default ProactorEventLoop is incompatible
2. The Claude CLI subprocess returns empty stdout under uvicorn on Windows
**Solution: run the app in WSL Ubuntu, with PostgreSQL in Docker**
```bash
# WSL startup commands
cd /mnt/c/Users/yaoji/git/Billo/billo-release-agent
docker compose up -d db
uv run uvicorn release_agent.main:app --host 0.0.0.0 --port 8080
```
Key .env settings:
- `CLAUDE_CMD=claude` (not claude.cmd)
- `REPOS_BASE_DIR=/mnt/c/Users/yaoji/git/Billo` (or clone the repos onto the WSL native filesystem, which is faster)
## Integration Test Results (2026-03-24)
**Verified:**
- App startup + /status health check
- Azure DevOps API (get_pr, list_active_prs, iterations/changes)
- PR info parsing (repo_name, ticket_id, branch)
- Full graph execution (parse → fetch → route → review → notify)
- Database reads and writes (agent_threads)
- Claude CLI ticket generation (returns structured JSON successfully under WSL)
- Claude CLI code review startup (invoked successfully under WSL)
- RunnableConfig type fix (eliminates the LangGraph warning)
- URL encoding fix (project names containing spaces)
- AzDo iterations/changes API (replacing the nonexistent diffs endpoint)
**Open issues:**
- Claude CLI code review is extremely slow under WSL + /mnt/c (10+ minutes, due to cross-filesystem I/O)
- The graph has no checkpointer (interrupts are not persisted)
- CI polling times out in environments without a pipeline
## Deployment Steps
1. `cp .env.example .env` and fill in all REQUIRED variables
2. `docker compose up -d db` to start only PostgreSQL
3. In WSL: `uv run uvicorn release_agent.main:app --port 8080`
4. Run the migration: `python scripts/migrate_json_to_db.py --source ../release-workflow/releases`
5. Optional: configure an Azure DevOps Service Hook / Cloudflare Tunnel
## Related Notes
- [[Billo Release Workflow Skill]]: workflow definition of the original Claude Code skill