---
created: "2026-03-24"
type: project
status: active
deadline: ""
tags: [langgraph, python, devops, automation]
---

# Billo Release Agent

## Goals

Convert the existing Claude Code release workflow skill into a standalone LangGraph Python service that provides:

- Automatic triggering from Azure DevOps webhooks (replacing manual pasting of PR URLs)
- Human-in-the-loop approval via LangGraph `interrupt()`
- State persisted in PostgreSQL (replacing JSON files)
- Concurrent processing across threads (one independent thread per PR/release)
- Slack notifications with approval buttons

## Architecture

```
Azure DevOps PR Webhook → FastAPI → LangGraph Agent → Azure DevOps / Jira / Slack / Claude API
                                         ↑
                          Slack Button / API (human approval resume)
                                         ↑
                          PostgreSQL (checkpointer + store)
```

## Code locations

- Project directory: `/c/Users/yaoji/git/Billo/billo-release-agent/`
- Source code: `src/release_agent/`
- Tests: `tests/`
- Original skill: `/c/Users/yaoji/git/Billo/release-workflow/.claude/skills/billo-release-workflow/SKILL.md`

## Project structure

```
billo-release-agent/
├── pyproject.toml
├── Dockerfile
├── docker-compose.yml
├── src/release_agent/
│   ├── main.py              # FastAPI app + lifespan + task management
│   ├── config.py            # pydantic-settings (all environment variables)
│   ├── state.py             # ReleaseState TypedDict (LangGraph state)
│   ├── exceptions.py        # exception hierarchy
│   ├── branch_parser.py     # pure function: extract ticket ID from branch
│   ├── versioning.py        # pure function: version number calculation
│   ├── models/              # Pydantic data models
│   │   ├── pr.py, ticket.py, release.py, pipeline.py
│   │   ├── webhook.py, review.py, jira.py
│   ├── tools/               # external service clients
│   │   ├── azdo.py, jira.py, slack.py, claude_review.py
│   │   ├── _http.py, _retry.py  # shared helpers
│   ├── graph/               # LangGraph graph definitions
│   │   ├── dependencies.py  # ToolClients, StagingStore
│   │   ├── routing.py       # 6 pure routing functions
│   │   ├── pr_completed.py  # 12 nodes + graph builder
│   │   ├── release.py       # 14 nodes + graph builder
│   │   └── full_cycle.py    # subgraph composition
│   └── api/                 # FastAPI routes
│       ├── models.py        # HTTP request/response models
│       ├── dependencies.py  # Depends() injection
│       ├── webhooks.py, approvals.py, status.py
└── tests/                   # 647 tests, 99.11% coverage (as of Phase 4)
    ├── test_*.py            # Phase 1 unit tests
    ├── tools/test_*.py      # Phase 2 client tests
    ├── graph/test_*.py      # Phase 3 graph tests
    └── api/test_*.py        # Phase 4 API tests
```

## Implementation phases

### Phase 1: Foundation (completed 2026-03-23)

Project structure, Pydantic models, config, versioning, branch parser.

**Results:**

- 152 tests; still 152 tests after review, 100% coverage
- Files: branch_parser.py, versioning.py, config.py, state.py
- Models: pr.py, ticket.py, release.py, pipeline.py, webhook.py

**Review fixes:**

- `postgres_dsn` changed to `SecretStr`
- `import re` moved to module level, with patterns precompiled
- `ReleasePipelineStage` gained a consistency check between approval_id and requires_approval
- `WebhookResource.status` now uses `Literal`
- Duplicate tests removed

### Phase 2: Service Clients (completed 2026-03-24)

Four external service clients, the exception hierarchy, and shared HTTP helpers.

**Results:**

- 364 tests, 99.6% coverage
- Added: exceptions.py, models/review.py, models/jira.py
- Clients: tools/azdo.py, tools/jira.py, tools/slack.py, tools/claude_review.py
- Shared: tools/_http.py, tools/_retry.py

**Key design decisions:**

- `httpx.AsyncClient` is injected for testability
- Custom exception hierarchy: ServiceError → AuthenticationError / NotFoundError / RateLimitError / ServiceUnavailableError
- Exponential backoff retry decorator `with_retry`
- Claude tool_use produces structured code review output
- Two-step Jira transition (first to Dev in Progress, then to code review)

**Review fixes:**

- Bare `except Exception` narrowed to `(ValueError, KeyError)`
- Fixed the implicit `None` return path in the retry decorator
- Added a type annotation to the ClaudeReviewer client parameter
- 401/403 errors now pass through the detail message
- Support for the Jira errorMessages format

### Phase 3: LangGraph Graphs (completed 2026-03-24)

Three graphs, dependency injection, routing, and the staging store.

**Results:**

- 520 tests (156 new), 99.42% coverage
- Files: graph/dependencies.py, graph/routing.py, graph/pr_completed.py, graph/release.py, graph/full_cycle.py
- state.py extended with 17 new fields

**Key design decisions:**

- `ToolClients` is a frozen dataclass injected via `config["configurable"]["clients"]`
- `StagingStore` Protocol with a file-based `JsonFileStagingStore` implementation (to be migrated to PostgreSQL later)
- Dedicated interrupt nodes (rather than inline interrupts)
- Subgraph composition: full_cycle contains the pr_completed and release subgraphs
- Six pure routing functions: is_pr_already_merged, is_review_approved, has_ticket, should_continue_to_release, has_pipelines, has_pending_approvals
- Error handling: non-critical nodes catch ReleaseAgentError and append it to errors; critical nodes re-raise

**Graph: PR Completed (12 nodes):**

```
parse_webhook → fetch_pr_details → [already merged?]
  ├─ yes → move_jira_ready_for_stage
  └─ no  → move_jira_code_review → run_code_review → evaluate_review
       ├─ approve → interrupt_confirm_merge → merge_pr
       └─ request_changes → notify_request_changes → END
  → move_jira_ready_for_stage → add_jira_pr_link → calculate_version → update_staging → END
```

**Graph: Release (14 nodes):**

```
load_staging → interrupt_confirm_release → create_release_pr
  → interrupt_confirm_merge_release → merge_release_pr
  → move_tickets_to_done → send_slack_notification → archive_release
  → list_pipelines → [any pipelines?]
       ├─ yes → interrupt_confirm_trigger → trigger_pipelines → check_release_approvals → END
       └─ no  → END
```

**Five interrupt points:**

1. After code review passes → confirm merge
2. Before creating the release PR → confirm create
3. Before merging the release PR → confirm merge
4. Before triggering the build pipeline → confirm trigger
5. Approving a release stage → confirm approve (per stage)

### Phase 4: API Layer + Deployment (completed 2026-03-24)

FastAPI application and Docker deployment configuration.

**Results:**

- 647 tests (127 new), 99.11% coverage
- Files: main.py, api/models.py, api/dependencies.py, api/webhooks.py, api/approvals.py, api/status.py
- Deployment: Dockerfile, docker-compose.yml

**API Endpoints:**

| Method | Path | Purpose |
|--------|------|---------|
| POST | `/webhooks/azdo` | Receives Azure DevOps PR webhooks |
| POST | `/approvals/{thread_id}` | Resumes an interrupted graph (human approval) |
| GET | `/approvals/pending` | Lists threads waiting for approval |
| GET | `/status` | Health check |
| GET | `/releases/{repo}` | Lists all versions for a repo |
| GET | `/staging` | Current staging state |
| POST | `/manual/pr/{pr_id}` | Manually triggers PR processing (webhook fallback) |
| POST | `/manual/release` | Manually triggers a release |

**Key design decisions:**

- Singleton compiled graphs are stored on `app.state` and compiled once at startup
- The `agent_threads` PostgreSQL table tracks thread status (running/interrupted/completed/error)
- `asyncio.create_task` plus the checkpointer provide background execution and crash recovery
- The webhook secret is verified via the `X-Webhook-Secret` header and `hmac.compare_digest`
- FastAPI dependencies are injected via `request.app.state` and `Depends()`
- Graceful shutdown: wait 30 seconds, then cancel the remaining background tasks

**Review fixes (3 CRITICAL):**

- `webhook_secret` is now required (empty default removed), so authentication cannot be bypassed when it is unconfigured
- `submit_approval` looks up `graph_name` in the DB before resuming (previously hard-coded to pr_completed)
- `_resume_graph` catches exceptions and returns an ApprovalResponse instead of leaking a 500 error

**Deployment configuration:**

- Dockerfile: Python 3.12-slim, non-root user, dependencies installed with uv
- docker-compose: agent + postgres:16-alpine, health check, pgdata volume
- Required environment variables: AZDO_PAT, ANTHROPIC_API_KEY, POSTGRES_DSN, JIRA_EMAIL, JIRA_API_TOKEN, SLACK_WEBHOOK_URL, WEBHOOK_SECRET

### Phase 5: Migration + Hardening (completed 2026-03-24)

Data migration, PostgreSQL store, operator authentication, documentation.

**Results:**

- 760 tests (113 new), 99.22% coverage
- Added: graph/postgres_staging_store.py, scripts/migrate_json_to_db.py, .env.example, README.md
- The StagingStore Protocol became async, with `await` added at every call site

**Key design decisions:**

- `PostgresStagingStore` uses a psycopg3 async pool and stores tickets as JSONB
- `archive()` uses an explicit transaction (`conn.transaction()`) so that INSERT + DELETE are atomic
- `staging_releases` table (per-repo upsert) and `archived_releases` table (unique on repo + version)
- Operator token authentication: the `require_operator_token` dependency is applied to the POST /approvals and POST /manual/* endpoints
- Migration script: extraction as pure functions plus a dry-run mode; reads the JSON files and inserts into PostgreSQL
- `JsonFileStagingStore` is kept as a fallback for local development

**Review fixes (1 HIGH):**

- `archive()` now wraps INSERT + DELETE in `async with conn.transaction()`

## Tech stack

| Component | Technology |
|-----------|------------|
| Agent framework | LangGraph |
| Web framework | FastAPI + uvicorn |
| HTTP client | httpx (async) |
| AI code review | Claude Code CLI (`claude -p`), billed against the subscription quota |
| Database | PostgreSQL (checkpointer + store) |
| Validation | Pydantic v2 + pydantic-settings |
| Database driver | psycopg3 + psycopg_pool (async PostgreSQL) |
| Testing | pytest + pytest-asyncio + httpx.MockTransport + FastAPI TestClient |
| Deployment | Docker Compose on homelab |

## External service integrations

| Service | Purpose | Authentication |
|---------|---------|----------------|
| Azure DevOps | PR management, pipeline triggering | PAT (Basic auth) |
| Jira | Ticket status transitions | Email + API token (Basic auth) |
| Slack | Release notifications, approval requests | Incoming Webhook |
| Claude Code CLI | Automated code review | Subscription (not an API key) |

## Azure DevOps pipeline mapping

| Repo | Build Pipeline ID | Release Pipeline | Release ID |
|------|------------------|-----------------|------------|
| Billo.Platform.Payment | 41 | Billo Payment | 37 |
| Billo.Platform.Payment (Scheduler) | 51 | Billo Payment Scheduler | 47 |
| Billo.Platform.Document.DocumentAnalyser | 75 | DocumentAnalyser | 58 |

## Release pipeline approval configuration

| Pipeline | Sandbox | Production |
|----------|---------|------------|
| Billo Payment | Project Admins approve | Release Admins approve |
| DocumentAnalyser | Automatic | Release Admins approve |

## Jira workflow status transitions

```
IN PROGRESS → CODE REVIEW → WAITING FOR TEST → IN TEST → READY FOR STAGE → DEPLOYED IN STAGE → IN PRODUCTION → CLOSED
```

Note: a ticket can only move to CODE REVIEW from IN PROGRESS.

## Completion overview

| Phase | Status | Tests | Coverage |
|-------|--------|-------|----------|
| 1. Foundation | Done | 152 | 100% |
| 2. Service Clients | Done | +212 = 364 | 99.6% |
| 3. LangGraph Graphs | Done | +156 = 520 | 99.4% |
| 4. API + Deploy | Done | +127 = 647 | 99.1% |
| 5. Migration + Hardening | Done | +113 = 760 | 99.2% |
| Final Code Review + Fix | Done | +12 = 772 | 98.4% |
| 6. Slack + CI/CD | Done | +193 = 965 | 96.6% |
| 7. PR Polling + Auto Ticket | Done | +96 = 1061 | 96.0% |

## Code review approach change (2026-03-24)

The original approach called Claude directly through the Anthropic API; it was replaced with a Claude Code CLI subprocess:

| Item | Before | After |
|------|--------|-------|
| Invocation | `anthropic.AsyncAnthropic` API | `claude -p` subprocess |
| Billing | API key (billed per token) | Subscription quota |
| Code understanding | Sees only the diff text passed in | Can Read/Glob/Grep the whole codebase on its own |
| Structured output | tool_use schema | `--json-schema` + `--output-format json` |
| Dependencies | ANTHROPIC_API_KEY | `claude` CLI on PATH + REPOS_BASE_DIR |

Key configuration: set `REPOS_BASE_DIR=/c/Users/yaoji/git/Billo` in `.env`; Claude Code runs the review inside the corresponding repo directory.

### Phase 6: Slack Interactive + CI/CD (completed 2026-03-24)

Slack button approvals plus automatic CI/CD triggering, polling, and approval.

**Results:**

- 965 tests (+193 new), 96.55% coverage
- Added: models/build.py, graph/polling.py, graph/ci_nodes.py, api/slack_interactions.py
- SlackClient became dual-mode (webhook fallback + Web API)

**Slack interaction flow:**

```
Graph interrupt → Slack message with [Approve] [Cancel] buttons
  → user clicks a button → POST /slack/interactions
  → verify signature (HMAC-SHA256 + 5-minute replay protection)
  → extract thread_id + decision → _resume_graph
  → update the Slack message to show the result
```

**CI/CD flow:**

```
PR merge → develop:
  merge_pr → trigger_ci_build(develop) → poll_ci_build → notify_ci_result → END

Release merge → main:
  merge_release_pr → trigger_ci_build(main) → poll_ci_build
    → ci_passed: wait_for_cd → approval loop (Sandbox → Production)
    → ci_failed: notify_failure → END
```

**New configuration:**

- `SLACK_BOT_TOKEN`: Slack App bot token (xoxb-...)
- `SLACK_SIGNING_SECRET`: Slack signing secret (must be non-empty)
- `SLACK_CHANNEL_ID`: channel that messages are sent to
- `CI_POLL_INTERVAL_SECONDS`: CI polling interval (default 30 s)
- `CI_POLL_MAX_WAIT_SECONDS`: maximum CI wait time (default 30 min)

**Review fixes (2 CRITICAL + 4 HIGH):**

- Added 5-minute timestamp protection against replay attacks
- An empty signing_secret returns 503 instead of silently skipping verification
- Decision values are validated against a whitelist
- CI branch logic corrected: develop for PRs, main for releases
- Type validation for ci_build_id

### Phase 7: PR Polling + Auto-Create Jira Ticket (completed 2026-03-24)

Periodically scans the active PRs of all repos and automatically creates a Jira ticket when a PR has none.

**Results:**

- 1061 tests (+96 new), 95.96% coverage
- Added: services/pr_dedup.py, services/pr_poller.py
- Modified: azdo.py (list_active_prs), jira.py (create_issue + _text_to_adf), claude_review.py (generate_ticket_content), routing.py (route_after_fetch), pr_completed.py (auto_create_ticket node)

**PR polling flow:**

```
Every 5 minutes → scan all active PRs in WATCHED_REPOS (target=develop)
  → deduplicate against agent_threads
  → synthesize a webhook payload
  → trigger the pr_completed graph
```

**Auto-create Jira ticket flow:**

```
fetch_pr_details → route_after_fetch (3-way routing)
  ├─ merged → calculate_version (skip review)
  ├─ active_with_ticket → move_jira_code_review (normal flow)
  └─ active_no_ticket → auto_create_ticket
       → Claude CLI generates summary + description
       → Jira create_issue (ALLPOST project)
       → set ticket_id + has_ticket=True
       → move_jira_code_review (continue the normal flow)
```

**New configuration:**

- `WATCHED_REPOS`: comma-separated list of repos
- `PR_POLL_INTERVAL_SECONDS=300`: polling interval
- `PR_POLL_ENABLED=False`: polling switch
- `DEFAULT_JIRA_PROJECT=ALLPOST`: project in which tickets are auto-created

**Review fixes (1 CRITICAL + 2 HIGH):**

- A mismatched schedule_fn parameter signature made polling fail silently → fixed to pass only initial_state
- The dedup SQL did not enforce (pr_id, repo_name) pairing → switched to a paired query using unnest
- run_graph_in_background was missing repos_base_dir and default_jira_project → both added

## Final code review fixes (2026-03-24)

A full code review found 3 CRITICAL + 8 HIGH issues, all of which have been fixed:

| # | Severity | Issue | Fix |
|---|----------|-------|-----|
| 1 | CRITICAL | AzDoClient constructor arguments did not match, crashing at startup | Pass the correct `base_url`, `vsrm_base_url`, `vsrm_http_client` |
| 2 | CRITICAL | Empty webhook_secret bypassed authentication | An empty expected value rejects all requests |
| 3 | CRITICAL | docker-compose default password `secret` | Changed to `${POSTGRES_PASSWORD:?must be set}` |
| 4 | HIGH | `graph_name` was not stored in agent_threads | `_upsert_thread` gained `graph_name`, `repo_name`, `pr_id` parameters |
| 5 | HIGH | No httpx timeout configured | Added `timeout=30.0` |
| 6 | HIGH | httpx.AsyncClient was never closed | Lifespan shutdown closes all HTTP clients |
| 7 | HIGH | Error handling leaked internal information | `_generic_error_handler` returns a fixed message |
| 8 | HIGH | Approvals returned 200 with an error body | Changed to HTTPException(404/400) |

Additional fixes:

- `anthropic_api_key` became optional (the CLI uses the subscription and does not need it)
- docker-compose: `WEBHOOK_SECRET` required, agent health check, `REPOS_BASE_DIR` environment variable
- `_run_graph` now logs with `logger.exception`

## Follow-up improvements (non-blocking)

- [ ] `get_pr_diff` currently returns only file names and should return the actual diff content (lower priority, since the Claude Code CLI can read files on its own)
- [ ] `list_build_pipelines` should filter the API request by repo
- [ ] The `@with_retry` decorator has not yet been applied to client methods
- [ ] The Jira fallback transition name should be configurable rather than hard-coded
- [ ] `check_release_approvals` is a stub; actual approval gate detection still needs to be implemented
- [ ] `last_merge_source_commit` is always None and should be fetched from the AzDo API
- [ ] Interrupt nodes do not check the returned value, so any resume continues execution (post-interrupt routing is needed)
- [ ] `archive_release` uses `date.today()`, which is not testable; the date should be injected
- [ ] Extract `_upsert_thread` from webhooks.py into a shared `api/db.py` to remove the circular import
- [ ] Convert the Dockerfile to a multi-stage build
- [ ] A CLI prompt longer than 100K characters may exceed the OS ARG_MAX; switch to a stdin pipe
- [ ] `PostgresStagingStore.save` has a concurrency race (needs SELECT FOR UPDATE or an application-level lock)
- [ ] The 30 s shutdown timeout may be too short given the Claude CLI's 300 s timeout

## Runtime environment: WSL (recommended)

Running directly on Windows has two problems:

1. psycopg async requires SelectorEventLoop, and the Windows default ProactorEventLoop is incompatible with it
2. The Claude CLI subprocess returns empty stdout when run under uvicorn on Windows

**Solution: run the app inside WSL Ubuntu, with PostgreSQL in Docker**

```bash
# WSL startup commands
cd /mnt/c/Users/yaoji/git/Billo/billo-release-agent
docker compose up -d db
uv run uvicorn release_agent.main:app --host 0.0.0.0 --port 8080
```

Key .env settings:

- `CLAUDE_CMD=claude` (not claude.cmd)
- `REPOS_BASE_DIR=/mnt/c/Users/yaoji/git/Billo` (or clone into the native WSL filesystem, which is faster)

## Integration test results (2026-03-24)

**Verified:**

- App startup + /status health check
- Azure DevOps API (get_pr, list_active_prs, iterations/changes)
- PR information parsing (repo_name, ticket_id, branch)
- Full graph execution (parse → fetch → route → review → notify)
- Database reads and writes (agent_threads)
- Claude CLI ticket generation (returns structured JSON successfully under WSL)
- Claude CLI code review startup (invoked successfully under WSL)
- RunnableConfig type fix (removes the LangGraph warning)
- URL encoding fix (project name containing spaces)
- AzDo iterations/changes API (replacing the nonexistent diffs endpoint)

**Open issues:**

- Claude CLI code review is extremely slow under WSL with /mnt/c (10+ minutes, caused by cross-filesystem I/O)
- The graph has no checkpointer (interrupts are not persisted)
- CI polling times out in environments without a pipeline

## Deployment steps

1. `cp .env.example .env` and fill in all REQUIRED variables
2. `docker compose up -d db` to start only PostgreSQL
3. Inside WSL: `uv run uvicorn release_agent.main:app --port 8080`
4. Run the migration: `python scripts/migrate_json_to_db.py --source ../release-workflow/releases`
5. Optional: configure an Azure DevOps Service Hook / Cloudflare Tunnel

## Related notes

- [[Billo Release Workflow Skill]]: workflow definition of the original Claude Code skill