| created | type | status | deadline | tags |
|---|---|---|---|---|
| 2026-03-24 | project | active | | |
Billo Release Agent
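Goal
Convert the existing Claude Code release workflow skill into a standalone LangGraph Python service that provides:
- Automatic triggering via Azure DevOps webhooks (replacing manually pasted PR URLs)
- Human-in-the-loop approval via LangGraph `interrupt()`
- Persistent state in PostgreSQL (replacing JSON files)
- Concurrent multi-thread processing (a separate thread per PR/release)
- Slack notifications with approval buttons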
Architecture
Azure DevOps PR Webhook → FastAPI → LangGraph Agent → Azure DevOps / Jira / Slack / Claude API
↑
Slack Button / API (human approval resume)
↑
PostgreSQL (checkpointer + store)
Code location
- Project directory: `/c/Users/yaoji/git/Billo/billo-release-agent/`
- Source code: `src/release_agent/`
- Tests: `tests/`
- Original skill: `/c/Users/yaoji/git/Billo/release-workflow/.claude/skills/billo-release-workflow/SKILL.md`
Project structure
billo-release-agent/
├── pyproject.toml
├── Dockerfile
├── docker-compose.yml
├── src/release_agent/
│ ├── main.py # FastAPI app + lifespan + task management
│ ├── config.py # pydantic-settings (all environment variables)
│ ├── state.py # ReleaseState TypedDict (LangGraph state)
│ ├── exceptions.py # exception hierarchy
│ ├── branch_parser.py # pure function: extract the ticket ID from a branch name
│ ├── versioning.py # pure function: version number calculation
│ ├── models/ # Pydantic data models
│ │ ├── pr.py, ticket.py, release.py, pipeline.py
│ │ ├── webhook.py, review.py, jira.py
│ ├── tools/ # external service clients
│ │ ├── azdo.py, jira.py, slack.py, claude_review.py
│ │ ├── _http.py, _retry.py # shared helpers
│ ├── graph/ # LangGraph graph definitions
│ │ ├── dependencies.py # ToolClients, StagingStore
│ │ ├── routing.py # 6 pure routing functions
│ │ ├── pr_completed.py # 12 nodes + graph builder
│ │ ├── release.py # 14 nodes + graph builder
│ │ └── full_cycle.py # subgraph composition
│ └── api/ # FastAPI routes
│ ├── models.py # HTTP request/response models
│ ├── dependencies.py # Depends() injection
│ ├── webhooks.py, approvals.py, status.py
└── tests/ # 647 tests, 99.11% coverage
├── test_*.py # Phase 1 unit tests
├── tools/test_*.py # Phase 2 client tests
├── graph/test_*.py # Phase 3 graph tests
└── api/test_*.py # Phase 4 API tests
Implementation phases
Phase 1: Foundation (completed 2026-03-23)
Project structure, Pydantic models, config, versioning, and branch parser.
Results:
- 152 tests (unchanged at 152 after review), 100% coverage
- Files: branch_parser.py, versioning.py, config.py, state.py
- Models: pr.py, ticket.py, release.py, pipeline.py, webhook.py
Review fixes:
- `postgres_dsn` changed to `SecretStr`
- `import re` moved to module level, with the patterns precompiled
- `ReleasePipelineStage` gained approval_id/requires_approval consistency validation
- `WebhookResource.status` now uses `Literal`
- Duplicate tests removed
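As an illustration of the branch parser and the "precompile at module level" fix, here is a minimal sketch. The function name `extract_ticket_id` and the ticket pattern (an uppercase project key, a dash, then digits, as in `ALLPOST-123`) are assumptions for the example, not the project's actual code.

```python
import re
from typing import Optional

# Compiled once at import time instead of on every call.
# Assumption: ticket IDs look like "ALLPOST-123".
_TICKET_RE = re.compile(r"([A-Z][A-Z0-9]+-\d+)")


def extract_ticket_id(branch: str) -> Optional[str]:
    """Return the first Jira-style ticket ID in a branch name, or None."""
    match = _TICKET_RE.search(branch)
    return match.group(1) if match else None


with_ticket = extract_ticket_id("refs/heads/feature/ALLPOST-123-add-refunds")
without_ticket = extract_ticket_id("refs/heads/hotfix/no-ticket")
```

Because the function is pure (no I/O, no shared state), it can be covered exhaustively with plain unit tests.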
Phase 2: Service Clients (completed 2026-03-24)
Four external service clients, an exception hierarchy, and shared HTTP helpers.
Results:
- 364 tests, 99.6% coverage
- Added: exceptions.py, models/review.py, models/jira.py
- Clients: tools/azdo.py, tools/jira.py, tools/slack.py, tools/claude_review.py
- Shared: tools/_http.py, tools/_retry.py
Key design:
- `httpx.AsyncClient` is injected for testability
- Custom exception hierarchy: ServiceError → AuthenticationError / NotFoundError / RateLimitError / ServiceUnavailableError
- Exponential-backoff retry decorator `with_retry`
- Claude tool_use for structured code review output
- Two-step Jira transition logic (first Dev in Progress, then code review)
Review fixes:
- Overly broad `except Exception` narrowed to `(ValueError, KeyError)`
- Fixed the implicit `None` return path in the retry decorator
- Added a type annotation to the ClaudeReviewer client parameter
- 401/403 errors now pass along their detail information
- Added support for the Jira errorMessages format
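The retry decorator and the "implicit None return path" fix can be sketched as follows. This is not the project's `with_retry`; the parameters (`max_attempts`, `base_delay`, `retry_on`) and the jitter are assumptions, and only the exception names come from the hierarchy listed above.

```python
import asyncio
import functools
import random


class ServiceError(Exception):
    """Base class for client errors."""


class ServiceUnavailableError(ServiceError):
    """Transient failure that is worth retrying."""


def with_retry(max_attempts=3, base_delay=0.5, retry_on=(ServiceUnavailableError,)):
    """Retry an async callable with exponential backoff plus a little jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    delay = base_delay * 2 ** (attempt - 1)
                    await asyncio.sleep(delay + random.uniform(0, delay / 10))
            # Never reached; makes it explicit that the wrapper cannot
            # fall off the end and silently return None.
            raise AssertionError("unreachable")

        return wrapper

    return decorator


calls = {"n": 0}


@with_retry(max_attempts=3, base_delay=0.0)
async def flaky():
    """Fails twice with a retryable error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServiceUnavailableError("try again")
    return "ok"


result = asyncio.run(flaky())
```

Only errors listed in `retry_on` are retried; anything else (for example an authentication error) propagates immediately.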
Phase 3: LangGraph Graphs (completed 2026-03-24)
Three graphs, dependency injection, routing, and a staging store.
Results:
- 520 tests (156 new), 99.42% coverage
- Files: graph/dependencies.py, graph/routing.py, graph/pr_completed.py, graph/release.py, graph/full_cycle.py
- state.py extended with 17 new fields
Key design:
- `ToolClients` frozen dataclass injected via `config["configurable"]["clients"]`
- `StagingStore` Protocol plus a file-based `JsonFileStagingStore` implementation (to be migrated to PostgreSQL later)
- Dedicated interrupt nodes (rather than inline interrupts)
- Subgraph composition: full_cycle contains the pr_completed and release subgraphs
- 6 pure routing functions: is_pr_already_merged, is_review_approved, has_ticket, should_continue_to_release, has_pipelines, has_pending_approvals
- Error handling: non-critical nodes catch ReleaseAgentError and append it to errors; critical nodes re-raise
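The injection and error-handling conventions above can be sketched without LangGraph itself. This example assumes a particular split (`add_jira_pr_link` as non-critical, `merge_pr` as critical), illustrative client method names, and synchronous nodes for brevity; none of that is confirmed by the notes.

```python
from dataclasses import dataclass
from typing import Any


class ReleaseAgentError(Exception):
    """Base error for the agent (class name taken from the notes)."""


@dataclass(frozen=True)
class ToolClients:
    """Immutable bundle of service clients; the fields are illustrative."""
    azdo: Any
    jira: Any


def add_jira_pr_link(state: dict, config: dict) -> dict:
    """Treated as non-critical here: record the failure and keep going."""
    clients: ToolClients = config["configurable"]["clients"]
    try:
        clients.jira.add_pr_link(state["ticket_id"], state["pr_url"])
    except ReleaseAgentError as exc:
        return {"errors": [*state.get("errors", []), str(exc)]}
    return {}


def merge_pr(state: dict, config: dict) -> dict:
    """Treated as critical here: a ReleaseAgentError propagates to the caller."""
    clients: ToolClients = config["configurable"]["clients"]
    clients.azdo.merge(state["pr_id"])
    return {"merged": True}


class FailingClient:
    """Test double whose every method raises ReleaseAgentError."""

    def __getattr__(self, name):
        def _fail(*args, **kwargs):
            raise ReleaseAgentError(f"{name} failed")
        return _fail


config = {"configurable": {"clients": ToolClients(azdo=FailingClient(), jira=FailingClient())}}
state = {"pr_id": 42, "ticket_id": "ALLPOST-1",
         "pr_url": "https://example.invalid/pr/42", "errors": []}

update = add_jira_pr_link(state, config)  # failure is collected, not raised

try:
    merge_pr(state, config)
    merge_raised = False
except ReleaseAgentError:
    merge_raised = True
```

Passing the clients through `config` rather than importing them as globals is what lets tests swap in doubles like `FailingClient`.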
Graph: PR Completed (12 nodes):
parse_webhook → fetch_pr_details → [already merged?]
├─ yes → move_jira_ready_for_stage
└─ no → move_jira_code_review → run_code_review → evaluate_review
    ├─ approve → interrupt_confirm_merge → merge_pr
    └─ request_changes → notify_request_changes → END
→ move_jira_ready_for_stage → add_jira_pr_link → calculate_version → update_staging → END
Graph: Release (14 nodes):
load_staging → interrupt_confirm_release → create_release_pr → interrupt_confirm_merge_release
→ merge_release_pr → move_tickets_to_done → send_slack_notification → archive_release
→ list_pipelines → [pipelines found?]
├─ yes → interrupt_confirm_trigger → trigger_pipelines → check_release_approvals → END
└─ no → END
5 interrupt points:
- After code review passes → confirm merge
- Before creating the release PR → confirm create
- Before merging the release PR → confirm merge
- Before triggering the build pipeline → confirm trigger
- Approve release stage → confirm approve (per stage)
Phase 4: API Layer + Deployment (completed 2026-03-24)
FastAPI application plus Docker deployment configuration.
Results:
- 647 tests (127 new), 99.11% coverage
- Files: main.py, api/models.py, api/dependencies.py, api/webhooks.py, api/approvals.py, api/status.py
- Deployment: Dockerfile, docker-compose.yml
API Endpoints:
| Method | Path | Purpose |
|---|---|---|
| POST | /webhooks/azdo | Receive Azure DevOps PR webhooks |
| POST | /approvals/{thread_id} | Resume an interrupted graph (human approval) |
| GET | /approvals/pending | List threads waiting for approval |
| GET | /status | Health check |
| GET | /releases/{repo} | List all versions for a repo |
| GET | /staging | Current staging state |
| POST | /manual/pr/{pr_id} | Manually trigger PR processing (webhook fallback) |
| POST | /manual/release | Manually trigger a release |
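Key design:
- Singleton compiled graphs are stored on `app.state` and compiled once at startup
- The `agent_threads` PostgreSQL table tracks thread status (running/interrupted/completed/error)
- `asyncio.create_task` plus the checkpointer provide background execution and crash recovery
- The webhook secret is verified via the `X-Webhook-Secret` header and `hmac.compare_digest`
- FastAPI dependencies are injected through `request.app.state` and `Depends()`
- Graceful shutdown: wait 30 seconds, then cancel the remaining background tasks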
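The webhook secret check can be sketched as a small pure function. The name `is_valid_webhook_secret` is hypothetical; the sketch also folds in the rule that an empty configured secret must reject every request rather than accept it.

```python
import hmac
from typing import Optional


def is_valid_webhook_secret(provided: Optional[str], expected: str) -> bool:
    """Compare the X-Webhook-Secret header value in constant time."""
    if not expected or provided is None:
        # An unconfigured secret or a missing header never authenticates.
        return False
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))


accepted = is_valid_webhook_secret("s3cret", "s3cret")
rejected_wrong = is_valid_webhook_secret("guess", "s3cret")
rejected_unconfigured = is_valid_webhook_secret("", "")
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not reveal how many leading characters matched.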
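Review fixes (3 CRITICAL):
- `webhook_secret` is now required (empty default removed), so authentication cannot be bypassed when it is left unconfigured
- `submit_approval` looks up `graph_name` in the DB before resuming (previously hard-coded to pr_completed)
- `_resume_graph` catches exceptions and returns an ApprovalResponse instead of leaking a 500 error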
Deployment configuration:
- Dockerfile: Python 3.12-slim, non-root user, dependencies installed with uv
- docker-compose: agent + postgres:16-alpine, health check, pgdata volume
- Required environment variables: AZDO_PAT, ANTHROPIC_API_KEY, POSTGRES_DSN, JIRA_EMAIL, JIRA_API_TOKEN, SLACK_WEBHOOK_URL, WEBHOOK_SECRET
Phase 5: Migration + Hardening (completed 2026-03-24)
Data migration, PostgreSQL store, operator authentication, and documentation.
Results:
- 760 tests (113 new), 99.22% coverage
- Added: graph/postgres_staging_store.py, scripts/migrate_json_to_db.py, .env.example, README.md
- The StagingStore Protocol became async, with `await` added at every call site
Key design:
- `PostgresStagingStore` uses a psycopg3 async pool and stores tickets as JSONB
- `archive()` uses an explicit transaction (`conn.transaction()`) so that the INSERT + DELETE are atomic
- `staging_releases` table (per-repo upsert) + `archived_releases` table (unique on repo+version)
- Operator token authentication: the `require_operator_token` dependency is applied to the POST /approvals and POST /manual/* endpoints
- Migration script: extracted pure functions plus a dry-run mode; reads the JSON files and inserts into PostgreSQL
- `JsonFileStagingStore` is kept as a fallback for local development
Review fixes (1 HIGH):
- `archive()` now wraps the INSERT + DELETE in `async with conn.transaction()`
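To show why the INSERT + DELETE in `archive()` need a single transaction, here is a sketch that deliberately swaps in the standard-library `sqlite3` module for psycopg3 so it runs without a database server. The table columns, the `fail_midway` switch, and the version string are assumptions for the demo; the real store is async and uses `conn.transaction()`.

```python
import json
import sqlite3


def archive(conn: sqlite3.Connection, repo: str, version: str,
            fail_midway: bool = False) -> None:
    """Move a repo's staged release into the archive in one transaction."""
    with conn:  # commits on success, rolls back if the block raises
        row = conn.execute(
            "SELECT tickets FROM staging_releases WHERE repo = ?", (repo,)
        ).fetchone()
        if row is None:
            raise LookupError(f"nothing staged for {repo}")
        conn.execute(
            "INSERT INTO archived_releases (repo, version, tickets) VALUES (?, ?, ?)",
            (repo, version, row[0]),
        )
        if fail_midway:
            raise RuntimeError("simulated crash between INSERT and DELETE")
        conn.execute("DELETE FROM staging_releases WHERE repo = ?", (repo,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_releases (repo TEXT PRIMARY KEY, tickets TEXT)")
conn.execute(
    "CREATE TABLE archived_releases (repo TEXT, version TEXT, tickets TEXT, "
    "UNIQUE (repo, version))"
)
conn.execute(
    "INSERT INTO staging_releases VALUES (?, ?)",
    ("Billo.Platform.Payment", json.dumps(["ALLPOST-1"])),
)
conn.commit()

# A failure between the two statements must leave both tables untouched.
try:
    archive(conn, "Billo.Platform.Payment", "1.2.0", fail_midway=True)
except RuntimeError:
    pass
archived_after_failure = conn.execute("SELECT COUNT(*) FROM archived_releases").fetchone()[0]
staged_after_failure = conn.execute("SELECT COUNT(*) FROM staging_releases").fetchone()[0]

archive(conn, "Billo.Platform.Payment", "1.2.0")
archived_after_success = conn.execute("SELECT COUNT(*) FROM archived_releases").fetchone()[0]
staged_after_success = conn.execute("SELECT COUNT(*) FROM staging_releases").fetchone()[0]
```

Without the transaction, a crash after the INSERT would leave the release both staged and archived, and a retry would then hit the unique constraint on (repo, version).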
Tech stack
| Component | Technology |
|---|---|
| Agent framework | LangGraph |
| Web framework | FastAPI + uvicorn |
| HTTP client | httpx (async) |
| AI Code Review | Claude Code CLI (`claude -p`), billed against the subscription quota |
| Database | PostgreSQL (checkpointer + store) |
| Validation | Pydantic v2 + pydantic-settings |
| Database driver | psycopg3 + psycopg_pool (async PostgreSQL) |
| Testing | pytest + pytest-asyncio + httpx.MockTransport + FastAPI TestClient |
| Deployment | Docker Compose on homelab |
External service integrations
| Service | Purpose | Authentication |
|---|---|---|
| Azure DevOps | PR management, pipeline triggering | PAT (Basic auth) |
| Jira | Ticket status transitions | Email + API token (Basic auth) |
| Slack | Release notifications, approval requests | Incoming Webhook |
| Claude Code CLI | Automated code review | Subscription (not an API key) |
Azure DevOps pipeline mapping
| Repo | Build Pipeline ID | Release Pipeline | Release ID |
|---|---|---|---|
| Billo.Platform.Payment | 41 | Billo Payment | 37 |
| Billo.Platform.Payment (Scheduler) | 51 | Billo Payment Scheduler | 47 |
| Billo.Platform.Document.DocumentAnalyser | 75 | DocumentAnalyser | 58 |
Release pipeline approval configuration
| Pipeline | Sandbox | Production |
|---|---|---|
| Billo Payment | Project Admins approve | Release Admins approve |
| DocumentAnalyser | Automatic | Release Admins approve |
Jira workflow status transitions
IN PROGRESS → CODE REVIEW → WAITING FOR TEST → IN TEST
→ READY FOR STAGE → DEPLOYED IN STAGE → IN PRODUCTION → CLOSED
Note: CODE REVIEW can only be entered from IN PROGRESS.
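The constraint above is what forces a two-step transition. A minimal sketch of planning those steps is shown here; the function name is hypothetical, and it assumes every other status is allowed to move back to IN PROGRESS, which the notes do not confirm.

```python
def transitions_to_code_review(current_status: str) -> list:
    """Return the ordered status transitions needed to reach CODE REVIEW."""
    current = current_status.strip().upper()
    if current == "CODE REVIEW":
        return []  # already there
    if current == "IN PROGRESS":
        return ["CODE REVIEW"]  # the only directly allowed entry point
    # Assumption: any other status can first be moved to IN PROGRESS.
    return ["IN PROGRESS", "CODE REVIEW"]


from_in_progress = transitions_to_code_review("In Progress")
from_ready_for_stage = transitions_to_code_review("READY FOR STAGE")
```

Keeping the plan as a pure function separates the workflow rule from the Jira API calls that execute each step.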
Completion overview
| Phase | Status | Tests | Coverage |
|---|---|---|---|
| 1. Foundation | Done | 152 | 100% |
| 2. Service Clients | Done | +212 = 364 | 99.6% |
| 3. LangGraph Graphs | Done | +156 = 520 | 99.4% |
| 4. API + Deploy | Done | +127 = 647 | 99.1% |
| 5. Migration + Hardening | Done | +113 = 760 | 99.2% |
| Final Code Review + Fix | Done | +12 = 772 | 98.4% |
| 6. Slack + CI/CD | Done | +193 = 965 | 96.6% |
| 7. PR Polling + Auto Ticket | Done | +96 = 1061 | 96.0% |
Code review approach change (2026-03-24)
The original approach called Claude directly through the Anthropic API. It was replaced with a Claude Code CLI subprocess:
| Item | Before | After |
|---|---|---|
| Invocation | `anthropic.AsyncAnthropic` API | `claude -p` subprocess |
| Billing | API key (billed per token) | Subscription quota |
| Code understanding | Sees only the diff text passed in | Can Read/Glob/Grep the whole codebase on its own |
| Structured output | tool_use schema | `--json-schema` + `--output-format json` |
| Dependencies | ANTHROPIC_API_KEY | `claude` CLI on PATH + REPOS_BASE_DIR |
Key configuration: set `REPOS_BASE_DIR=/c/Users/yaoji/git/Billo` in .env; Claude Code then runs the review inside the corresponding repo directory.
Phase 6: Slack Interactive + CI/CD (completed 2026-03-24)
Slack button approvals plus automatic CI/CD triggering, polling, and approval.
Results:
- 965 tests (+193 new), 96.55% coverage
- Added: models/build.py, graph/polling.py, graph/ci_nodes.py, api/slack_interactions.py
- SlackClient became dual-mode (webhook fallback + Web API)
Slack interaction flow:
Graph interrupt → Slack message with [Approve] [Cancel] buttons
→ user clicks a button → POST /slack/interactions
→ verify the signature (HMAC-SHA256 + 5-minute replay protection)
→ extract thread_id + decision → _resume_graph
→ update the Slack message to show the result
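The signature step in this flow follows Slack's documented request signing scheme: an HMAC-SHA256 over `v0:{timestamp}:{body}`, compared against the `X-Slack-Signature` header. The sketch below is an assumption about how that could look in code; the function and exception names are hypothetical, and the caller is expected to map the missing-secret error to a 503.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 5 * 60  # replay window


class SigningSecretMissing(RuntimeError):
    """Raised when no signing secret is configured."""


def verify_slack_signature(signing_secret: str, timestamp: str, body: bytes,
                           signature: str, now: Optional[float] = None) -> bool:
    """Check X-Slack-Signature against the raw request body and timestamp."""
    if not signing_secret:
        raise SigningSecretMissing("SLACK_SIGNING_SECRET is empty")
    try:
        sent_at = int(timestamp)
    except (TypeError, ValueError):
        return False
    current = time.time() if now is None else now
    if abs(current - sent_at) > MAX_SKEW_SECONDS:
        return False  # too old or too far in the future: possible replay
    base = b"v0:" + timestamp.encode("utf-8") + b":" + body
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


# Build a correctly signed request to exercise the function.
secret = "test-secret"
ts = "1700000000"
body = b"payload=%7B%22actions%22%3A%5B%5D%7D"
good_sig = "v0=" + hmac.new(
    secret.encode(), b"v0:" + ts.encode() + b":" + body, hashlib.sha256
).hexdigest()

fresh_ok = verify_slack_signature(secret, ts, body, good_sig, now=1700000010)
stale_rejected = verify_slack_signature(secret, ts, body, good_sig, now=1700000301)
tampered_rejected = verify_slack_signature(secret, ts, body + b"x", good_sig, now=1700000010)
```

The signature must be computed over the raw request bytes, before any form or JSON parsing, otherwise re-encoding differences will break the comparison.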
CI/CD flow:
PR merge → develop:
merge_pr → trigger_ci_build(develop) → poll_ci_build → notify_ci_result → END
Release merge → main:
merge_release_pr → trigger_ci_build(main) → poll_ci_build
→ ci_passed: wait_for_cd → approval loop (Sandbox → Production)
→ ci_failed: notify_failure → END
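New configuration:
- `SLACK_BOT_TOKEN`: Slack App Bot Token (xoxb-...)
- `SLACK_SIGNING_SECRET`: Slack signing secret (must be non-empty)
- `SLACK_CHANNEL_ID`: channel that messages are posted to
- `CI_POLL_INTERVAL_SECONDS`: CI polling interval (default 30 s)
- `CI_POLL_MAX_WAIT_SECONDS`: maximum CI wait time (default 30 min)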
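The two polling settings suggest a loop of roughly the following shape. This is a sketch, not the project's `poll_ci_build`: the `fetch_status` callable, the status strings, and the `"timeout"` return value are all assumptions.

```python
import asyncio
import time

TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}  # assumed status names


async def poll_until_done(fetch_status, interval_seconds=30.0, max_wait_seconds=1800.0):
    """Poll fetch_status() until a terminal status or the deadline is reached."""
    deadline = time.monotonic() + max_wait_seconds
    while True:
        status = await fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() + interval_seconds > deadline:
            return "timeout"  # the next poll would land after the deadline
        await asyncio.sleep(interval_seconds)


def make_fake_build(statuses):
    """Return an async callable that yields the given statuses in order."""
    remaining = iter(statuses)

    async def fetch_status():
        return next(remaining)

    return fetch_status


finished = asyncio.run(
    poll_until_done(make_fake_build(["inProgress", "inProgress", "succeeded"]),
                    interval_seconds=0.0, max_wait_seconds=5.0)
)
timed_out = asyncio.run(
    poll_until_done(make_fake_build(["inProgress"] * 3),
                    interval_seconds=0.01, max_wait_seconds=0.0)
)
```

Using `time.monotonic()` for the deadline keeps the timeout unaffected by wall-clock adjustments.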
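Review fixes (2 CRITICAL + 4 HIGH):
- Added 5-minute timestamp protection against replay attacks
- An empty signing_secret now returns 503 instead of silently skipping verification
- Decision values are validated against an allowlist
- CI branch logic corrected: develop for PRs, main for releases
- Type validation added for ci_build_id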
Phase 7: PR Polling + Auto-Create Jira Ticket (completed 2026-03-24)
Periodically scans the active PRs of all repos and automatically creates a Jira ticket when a PR has none.
Results:
- 1061 tests (+96 new), 95.96% coverage
- Added: services/pr_dedup.py, services/pr_poller.py
- Modified: azdo.py (list_active_prs), jira.py (create_issue + _text_to_adf), claude_review.py (generate_ticket_content), routing.py (route_after_fetch), pr_completed.py (auto_create_ticket node)
PR polling flow:
Every 5 minutes → scan all active PRs in WATCHED_REPOS (target=develop)
→ deduplicate against agent_threads
→ synthesize a webhook payload → trigger the pr_completed graph
Automatic Jira ticket creation flow:
fetch_pr_details → route_after_fetch (3-way routing)
├─ merged → calculate_version (skips review)
├─ active_with_ticket → move_jira_code_review (normal flow)
└─ active_no_ticket → auto_create_ticket
→ Claude CLI generates summary + description
→ Jira create_issue (ALLPOST project)
→ sets ticket_id + has_ticket=True
→ move_jira_code_review (continues the normal flow)
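The 3-way routing can be expressed as a pure function over the graph state. In this sketch the `pr_status` field and its `"completed"` value are assumptions about how a merged PR is represented; `has_ticket`, the three route labels, and the target node names come from the flow above.

```python
def route_after_fetch(state: dict) -> str:
    """Pick the next branch after fetch_pr_details."""
    if state.get("pr_status") == "completed":  # assumed marker for a merged PR
        return "merged"
    if state.get("has_ticket"):
        return "active_with_ticket"
    return "active_no_ticket"


# Route label -> next node, as listed in the flow above.
NEXT_NODE = {
    "merged": "calculate_version",
    "active_with_ticket": "move_jira_code_review",
    "active_no_ticket": "auto_create_ticket",
}

merged_target = NEXT_NODE[route_after_fetch({"pr_status": "completed", "has_ticket": True})]
no_ticket_target = NEXT_NODE[route_after_fetch({"pr_status": "active", "has_ticket": False})]
```

Because the router only reads the state and returns a label, every branch can be unit-tested without building the graph.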
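New configuration:
- `WATCHED_REPOS`: comma-separated list of repos
- `PR_POLL_INTERVAL_SECONDS=300`: polling interval
- `PR_POLL_ENABLED=False`: polling on/off switch
- `DEFAULT_JIRA_PROJECT=ALLPOST`: project used for auto-created tickets
Review fixes (1 CRITICAL + 2 HIGH):
- A schedule_fn parameter signature mismatch made polling fail silently → fixed to pass only initial_state
- The dedup SQL did not enforce (pr_id, repo_name) pairing → switched to a paired query using unnest
- run_graph_in_background was missing repos_base_dir + default_jira_project → both added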
Final Code Review fixes (2026-03-24)
A full code review found 3 CRITICAL + 8 HIGH issues, all of which have been fixed:
| # | Severity | Issue | Fix |
|---|---|---|---|
| 1 | CRITICAL | AzDoClient constructor arguments mismatched, crashing at startup | Pass the correct base_url, vsrm_base_url, vsrm_http_client |
| 2 | CRITICAL | Empty webhook_secret bypassed authentication | An empty expected value rejects all requests |
| 3 | CRITICAL | docker-compose default password `secret` | Changed to `${POSTGRES_PASSWORD:?must be set}` |
| 4 | HIGH | graph_name not stored in agent_threads | `_upsert_thread` gained graph_name, repo_name, pr_id parameters |
| 5 | HIGH | No httpx timeout configured | Added timeout=30.0 |
| 6 | HIGH | httpx.AsyncClient never closed | lifespan shutdown closes all HTTP clients |
| 7 | HIGH | Error handling leaked internal details | _generic_error_handler returns a fixed message |
| 8 | HIGH | Approvals returned 200 + error body | Changed to HTTPException(404/400) |
Additional fixes:
- `anthropic_api_key` made optional (the CLI uses the subscription and does not need it)
- docker-compose: `WEBHOOK_SECRET` required, agent health check, `REPOS_BASE_DIR` environment variable
- `_run_graph` now logs failures with `logger.exception`
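Follow-up improvements (non-blocking)
- `get_pr_diff` currently returns only file names and should return the actual diff content (lower priority, since Claude Code CLI can read files on its own)
- `list_build_pipelines` needs to filter its API requests by repo
- The `@with_retry` decorator is not yet applied to the client methods
- The Jira fallback transition name should be configurable rather than hard-coded
- `check_release_approvals` is a stub; actual approval gate detection still needs to be implemented
- `last_merge_source_commit` is always None and needs to be fetched from the AzDo API
- Interrupt nodes do not check the resume value, so any resume continues execution (post-interrupt routing is needed)
- `archive_release` uses `date.today()`, which is not testable; the date should be injected
- `_upsert_thread` should be extracted from webhooks.py into a shared `api/db.py` to remove the circular import
- Switch the Dockerfile to a multi-stage build
- A CLI prompt longer than 100K characters may exceed the OS ARG_MAX and should be piped via stdin instead
- `PostgresStagingStore.save` has a concurrency race (needs SELECT FOR UPDATE or an application-level lock)
- The 30 s shutdown timeout may be too short for the Claude CLI's 300 s timeout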
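The ARG_MAX item can be addressed by sending the prompt through the child's stdin rather than as a command-line argument. The sketch below uses the Python interpreter as a stand-in child process so it runs anywhere; the helper name is hypothetical, and how the `claude` CLI consumes piped input should be checked against its documentation before relying on this.

```python
import subprocess
import sys


def run_with_prompt_on_stdin(cmd, prompt, timeout=300.0):
    """Run cmd, passing the prompt via stdin so argv stays small."""
    completed = subprocess.run(
        cmd,
        input=prompt,         # written to the child's stdin
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout


# Stand-in child: reports how many characters it received on stdin.
echo_length = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
big_prompt = "x" * 200_000  # larger than the 100K threshold mentioned above

output = run_with_prompt_on_stdin(echo_length, big_prompt)
```

Data sent through a pipe is not subject to the argument-length limit, so the prompt size is bounded only by memory and the timeout.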
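Runtime environment: WSL (recommended)
Running directly on Windows has two problems:
- psycopg async requires SelectorEventLoop, and the Windows default ProactorEventLoop is incompatible
- The Claude CLI subprocess returns empty stdout when run under uvicorn on Windows
Solution: run the app inside WSL Ubuntu, with PostgreSQL in Docker
# WSL startup commands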
cd /mnt/c/Users/yaoji/git/Billo/billo-release-agent
docker compose up -d db
uv run uvicorn release_agent.main:app --host 0.0.0.0 --port 8080
Key .env configuration:
- `CLAUDE_CMD=claude` (not claude.cmd)
- `REPOS_BASE_DIR=/mnt/c/Users/yaoji/git/Billo` (or clone the repos onto the WSL-native filesystem, which is faster)
Integration test results (2026-03-24)
Verified:
- App startup + /status health check
- Azure DevOps API (get_pr, list_active_prs, iterations/changes)
- PR information parsing (repo_name, ticket_id, branch)
- Full graph flow execution (parse → fetch → route → review → notify)
- Database reads and writes (agent_threads)
- Claude CLI ticket generation (returns structured JSON successfully under WSL)
- Claude CLI code review startup (invoked successfully under WSL)
- RunnableConfig type fix (eliminates LangGraph warnings)
- URL encoding fix (project names containing spaces)
- AzDo iterations/changes API (replacing the nonexistent diffs endpoint)
Open issues:
- Claude CLI code review is extremely slow under WSL + /mnt/c (10+ minutes, due to cross-filesystem I/O)
- The graph has no checkpointer (interrupts are not persisted)
- CI polling times out in environments without pipelines
Deployment steps
- `cp .env.example .env` and fill in all REQUIRED variables
- `docker compose up -d db` to start only PostgreSQL
- Inside WSL: `uv run uvicorn release_agent.main:app --port 8080`
- Run the migration: `python scripts/migrate_json_to_db.py --source ../release-workflow/releases`
- Optional: configure an Azure DevOps Service Hook / Cloudflare Tunnel
Related notes
- Billo Release Workflow Skill: the workflow definition of the original Claude Code skill