From 91884751e8ae2ea591240a12cef696f79654ff94 Mon Sep 17 00:00:00 2001
From: Yaojia Wang
Date: Wed, 25 Mar 2026 23:37:37 +0100
Subject: [PATCH] vault: sync pending changes

---
 2 - Projects/Billo Release Agent.md | 459 ++++++++++++++++++++++++++++
 Billo Release Workflow Skill.md     |   0
 scripts/auto-sync.sh                |  23 ++
 3 files changed, 482 insertions(+)
 create mode 100644 2 - Projects/Billo Release Agent.md
 create mode 100644 Billo Release Workflow Skill.md
 create mode 100644 scripts/auto-sync.sh

diff --git a/2 - Projects/Billo Release Agent.md b/2 - Projects/Billo Release Agent.md
new file mode 100644
index 0000000..07d8f41
--- /dev/null
+++ b/2 - Projects/Billo Release Agent.md
@@ -0,0 +1,459 @@
+---
+created: "2026-03-24"
+type: project
+status: active
+deadline: ""
+tags: [langgraph, python, devops, automation]
+---
+
+# Billo Release Agent
+
+## Goal
+
+Convert the existing Claude Code release workflow skill into a standalone LangGraph Python service that provides:
+- Automatic triggering via Azure DevOps webhooks (replacing manual pasting of PR URLs)
+- Human-in-the-loop approval via LangGraph `interrupt()`
+- State persisted in PostgreSQL (replacing JSON files)
+- Concurrent processing (a separate thread per PR/release)
+- Slack notifications with approval buttons
+
+## Architecture
+
+```
+Azure DevOps PR Webhook → FastAPI → LangGraph Agent → Azure DevOps / Jira / Slack / Claude API
+                                          ↑
+                          Slack Button / API (human approval resume)
+                                          ↑
+                          PostgreSQL (checkpointer + store)
+```
+
+## Code Location
+
+- Project directory: `/c/Users/yaoji/git/Billo/billo-release-agent/`
+- Source code: `src/release_agent/`
+- Tests: `tests/`
+- Original skill: `/c/Users/yaoji/git/Billo/release-workflow/.claude/skills/billo-release-workflow/SKILL.md`
+
+## Project Structure
+
+```
+billo-release-agent/
+├── pyproject.toml
+├── Dockerfile
+├── docker-compose.yml
+├── src/release_agent/
+│   ├── main.py              # FastAPI app + lifespan + task management
+│   ├── config.py            # pydantic-settings (all environment variables)
+│   ├── state.py             # ReleaseState TypedDict (LangGraph state)
+│   ├── exceptions.py        # Exception hierarchy
+│   ├── branch_parser.py     # Pure function: extract ticket ID from branch name
+│   ├── versioning.py        # Pure function: version number calculation
+│   ├── models/              # Pydantic data models
+│   │   ├── pr.py, ticket.py, release.py, pipeline.py
+│   │   ├── webhook.py, review.py, jira.py
+│   ├── tools/               # External service clients
+│   │   ├── azdo.py, jira.py, slack.py, claude_review.py
+│   │   ├── _http.py, _retry.py   # Shared helpers
+│   ├── graph/               # LangGraph graph definitions
+│   │   ├── dependencies.py  # ToolClients, StagingStore
+│   │   ├── routing.py       # 6 pure routing functions
+│   │   ├── pr_completed.py  # 12 nodes + graph builder
+│   │   ├── release.py       # 14 nodes + graph builder
+│   │   └── full_cycle.py    # Subgraph composition
+│   └── api/                 # FastAPI routes
+│       ├── models.py        # HTTP request/response models
+│       ├── dependencies.py  # Depends() injection
+│       ├── webhooks.py, approvals.py, status.py
+└── tests/                   # 647 tests, 99.11% coverage
+    ├── test_*.py            # Phase 1 unit tests
+    ├── tools/test_*.py      # Phase 2 client tests
+    ├── graph/test_*.py      # Phase 3 graph tests
+    └── api/test_*.py        # Phase 4 API tests
+```
+
+## Implementation Phases
+
+### Phase 1: Foundation (completed 2026-03-23)
+
+Project structure, Pydantic models, config, versioning, branch parser.
+
+**Results:**
+- 152 tests → 152 tests after review, 100% coverage
+- Files: branch_parser.py, versioning.py, config.py, state.py
+- Models: pr.py, ticket.py, release.py, pipeline.py, webhook.py
+
+**Review fixes:**
+- `postgres_dsn` changed to `SecretStr`
+- `import re` moved to module level, with the pattern precompiled
+- `ReleasePipelineStage` gained approval_id/requires_approval consistency validation
+- `WebhookResource.status` now uses `Literal`
+- Removed duplicate tests
+
+### Phase 2: Service Clients (completed 2026-03-24)
+
+4 external service clients + exception hierarchy + shared HTTP helpers.
+
+**Results:**
+- 364 tests, 99.6% coverage
+- Added: exceptions.py, models/review.py, models/jira.py
+- Clients: tools/azdo.py, tools/jira.py, tools/slack.py, tools/claude_review.py
+- Shared: tools/_http.py, tools/_retry.py
+
+**Key design decisions:**
+- httpx.AsyncClient is injected for testability
+- Custom exception hierarchy: ServiceError → AuthenticationError / NotFoundError / RateLimitError / ServiceUnavailableError
+- Exponential-backoff retry decorator `with_retry`
+- Claude tool_use produces structured code review output
+- Two-step Jira transition logic (first Dev in Progress, then code review)
+
+**Review fixes:**
+- Bare `except Exception` narrowed to `(ValueError, KeyError)`
+- Fixed the implicit None return path in the retry decorator
+- Added a type annotation to the ClaudeReviewer client parameter
+- 401/403 errors now pass along the detail message
+- Support for the Jira errorMessages format
+
+### Phase 3: LangGraph Graphs (completed 2026-03-24)
+
+3 graphs + dependency injection + routing + staging store.
+
+**Results:**
+- 520 tests (156 new), 99.42% coverage
+- Files: graph/dependencies.py, graph/routing.py, graph/pr_completed.py, graph/release.py, graph/full_cycle.py
+- state.py extended with 17 new fields
+
+**Key design decisions:**
+- `ToolClients` frozen dataclass injected via `config["configurable"]["clients"]`
+- `StagingStore` Protocol + `JsonFileStagingStore` file-based implementation (to be migrated to PostgreSQL later)
+- Dedicated interrupt nodes (not inline interrupts)
+- Subgraph composition: full_cycle contains the pr_completed and release subgraphs
+- 6 pure routing functions: is_pr_already_merged, is_review_approved, has_ticket, should_continue_to_release, has_pipelines, has_pending_approvals
+- Error handling: non-critical nodes catch ReleaseAgentError and append it to errors; critical nodes re-raise
+
+**Graph: PR Completed (12 nodes):**
+```
+parse_webhook → fetch_pr_details → [already merged?]
+  ├─ yes → move_jira_ready_for_stage
+  └─ no → move_jira_code_review → run_code_review → evaluate_review
+       ├─ approve → interrupt_confirm_merge → merge_pr
+       └─ request_changes → notify_request_changes → END
+  → move_jira_ready_for_stage → add_jira_pr_link → calculate_version → update_staging → END
+```
+
+**Graph: Release (14 nodes):**
+```
+load_staging → interrupt_confirm_release → create_release_pr → interrupt_confirm_merge_release
+  → merge_release_pr → move_tickets_to_done → send_slack_notification → archive_release
+  → list_pipelines → [has pipelines?]
+       ├─ yes → interrupt_confirm_trigger → trigger_pipelines → check_release_approvals → END
+       └─ no → END
+```
+
+**5 interrupt points:**
+1. After code review passes → confirm merge
+2. Before creating the release PR → confirm create
+3. Before merging the release PR → confirm merge
+4. Before triggering the build pipeline → confirm trigger
+5. Approve release stage → confirm approve (per stage)
+
+### Phase 4: API Layer + Deployment (completed 2026-03-24)
+
+FastAPI application + Docker deployment configuration.
+
+**Results:**
+- 647 tests (127 new), 99.11% coverage
+- Files: main.py, api/models.py, api/dependencies.py, api/webhooks.py, api/approvals.py, api/status.py
+- Deployment: Dockerfile, docker-compose.yml
+
+**API Endpoints:**
+
+| Method | Path | Purpose |
+|--------|------|---------|
+| POST | `/webhooks/azdo` | Receives Azure DevOps PR webhooks |
+| POST | `/approvals/{thread_id}` | Resumes an interrupted graph (human approval) |
+| GET | `/approvals/pending` | Lists threads awaiting approval |
+| GET | `/status` | Health check |
+| GET | `/releases/{repo}` | Lists all versions for a repo |
+| GET | `/staging` | Current staging state |
+| POST | `/manual/pr/{pr_id}` | Manually triggers PR processing (webhook fallback) |
+| POST | `/manual/release` | Manually triggers a release |
+
+**Key design decisions:**
+- Singleton compiled graphs are stored in `app.state` and compiled once at startup
+- The `agent_threads` PostgreSQL table tracks thread status (running/interrupted/completed/error)
+- `asyncio.create_task` + checkpointer provide background execution and crash recovery
+- The webhook secret is verified via the `X-Webhook-Secret` header + `hmac.compare_digest`
+- FastAPI dependencies are injected via `request.app.state` + `Depends()`
+- Graceful shutdown: wait 30 seconds, then cancel the remaining background tasks
+
+**Review fixes (3 CRITICAL):**
+- `webhook_secret` is now required (empty default removed), preventing an auth bypass when it is not configured
+- `submit_approval` looks up `graph_name` in the DB before resuming (previously hard-coded to pr_completed)
+- `_resume_graph` catches exceptions and returns an ApprovalResponse instead of leaking a 500 error
+
+**Deployment configuration:**
+- Dockerfile: Python 3.12-slim, non-root user, dependencies installed with uv
+- docker-compose: agent + postgres:16-alpine, health check, pgdata volume
+- Required environment variables: AZDO_PAT, ANTHROPIC_API_KEY, POSTGRES_DSN, JIRA_EMAIL, JIRA_API_TOKEN, SLACK_WEBHOOK_URL, WEBHOOK_SECRET
+
+### Phase 5: Migration + Hardening (completed 2026-03-24)
+
+Data migration, PostgreSQL store, operator authentication, documentation.
+
+**Results:**
+- 760 tests (113 new), 99.22% coverage
+- Added: graph/postgres_staging_store.py, scripts/migrate_json_to_db.py, .env.example, README.md
+- StagingStore Protocol is now async; every call site gained an await
+
+**Key design decisions:**
+- `PostgresStagingStore` uses a psycopg3 async pool and stores tickets as JSONB
+- `archive()` uses an explicit transaction (`conn.transaction()`) so that INSERT + DELETE are atomic
+- `staging_releases` table (per-repo upsert) + `archived_releases` table (repo+version unique)
+- Operator token authentication: the `require_operator_token` dependency applies to the POST /approvals and POST /manual/* endpoints
+- Migration script: extracted pure functions + dry-run mode; reads JSON files and inserts into PostgreSQL
+- `JsonFileStagingStore` is kept as a local-development fallback
+
+**Review fixes (1 HIGH):**
+- `archive()` now wraps INSERT + DELETE in `async with conn.transaction()`
+
+## Tech Stack
+
+| Component | Technology |
+|-----------|------------|
+| Agent framework | LangGraph |
+| Web framework | FastAPI + uvicorn |
+| HTTP client | httpx (async) |
+| AI code review | Claude Code CLI (`claude -p`), billed against the subscription quota |
+| Database | PostgreSQL (checkpointer + store) |
+| Validation | Pydantic v2 + pydantic-settings |
+| Database driver | psycopg3 + psycopg_pool (async PostgreSQL) |
+| Testing | pytest + pytest-asyncio + httpx.MockTransport + FastAPI TestClient |
+| Deployment | Docker Compose on homelab |
+
+## External Service Integrations
+
+| Service | Purpose | Authentication |
+|---------|---------|----------------|
+| Azure DevOps | PR management, pipeline triggering | PAT (Basic auth) |
+| Jira | Ticket status transitions | Email + API token (Basic auth) |
+| Slack | Release notifications, approval requests | Incoming Webhook |
+| Claude Code CLI | Automated code review | Subscription (not an API key) |
+
+## Azure DevOps Pipeline Mapping
+
+| Repo | Build Pipeline ID | Release Pipeline | Release ID |
+|------|------------------|-----------------|------------|
+| Billo.Platform.Payment | 41 | Billo Payment | 37 |
+| Billo.Platform.Payment (Scheduler) | 51 | Billo Payment Scheduler | 47 |
+| Billo.Platform.Document.DocumentAnalyser | 75 | DocumentAnalyser | 58 |
+
+## Release Pipeline Approval Configuration
+
+| Pipeline | Sandbox | Production |
+|----------|---------|------------|
+| Billo Payment | Project Admins approve | Release Admins approve |
+| DocumentAnalyser | Automatic | Release Admins approve |
+
+## Jira Workflow Status Transitions
+
+```
+IN PROGRESS → CODE REVIEW → WAITING FOR TEST → IN TEST
+  → READY FOR STAGE → DEPLOYED IN STAGE → IN PRODUCTION → CLOSED
+```
+
+Note: CODE REVIEW can only be entered from IN PROGRESS.
+
+## Completed Phases Overview
+
+| Phase | Status | Tests | Coverage |
+|-------|--------|-------|----------|
+| 1. Foundation | Done | 152 | 100% |
+| 2. Service Clients | Done | +212 = 364 | 99.6% |
+| 3. LangGraph Graphs | Done | +156 = 520 | 99.4% |
+| 4. API + Deploy | Done | +127 = 647 | 99.1% |
+| 5. Migration + Hardening | Done | +113 = 760 | 99.2% |
+| Final Code Review + Fix | Done | +12 = 772 | 98.4% |
+| 6. Slack + CI/CD | Done | +193 = 965 | 96.6% |
+| 7. PR Polling + Auto Ticket | Done | +96 = 1061 | 96.0% |
+
+## Code Review Approach Change (2026-03-24)
+
+The original approach called Claude directly through the Anthropic API; it was replaced with a Claude Code CLI subprocess:
+
+| Item | Before | After |
+|------|--------|-------|
+| Invocation | `anthropic.AsyncAnthropic` API | `claude -p` subprocess |
+| Billing | API key (per-token billing) | Subscription quota |
+| Code understanding | Sees only the diff text passed in | Can Read/Glob/Grep the whole codebase on its own |
+| Structured output | tool_use schema | `--json-schema` + `--output-format json` |
+| Dependencies | ANTHROPIC_API_KEY | `claude` CLI on PATH + REPOS_BASE_DIR |
+
+Key configuration: set `REPOS_BASE_DIR=/c/Users/yaoji/git/Billo` in `.env`; Claude Code runs the review inside the corresponding repo directory.
+
+### Phase 6: Slack Interactive + CI/CD (completed 2026-03-24)
+
+Slack button approvals + automatic CI/CD triggering, polling, and approval.
+
+**Results:**
+- 965 tests (+193 new), 96.55% coverage
+- Added: models/build.py, graph/polling.py, graph/ci_nodes.py, api/slack_interactions.py
+- SlackClient now has two modes (webhook fallback + Web API)
+
+**Slack interaction flow:**
+```
+Graph interrupt → Slack message with [Approve] [Cancel] buttons
+  → user clicks a button → POST /slack/interactions
+  → verify signature (HMAC-SHA256 + 5-minute replay protection)
+  → extract thread_id + decision → _resume_graph
+  → update the Slack message to show the result
+```
+
+**CI/CD flow:**
+```
+PR merge → develop:
+  merge_pr → trigger_ci_build(develop) → poll_ci_build → notify_ci_result → END
+
+Release merge → main:
+  merge_release_pr → trigger_ci_build(main) → poll_ci_build
+    → ci_passed: wait_for_cd → approval loop (Sandbox → Production)
+    → ci_failed: notify_failure → END
+```
+
+**New configuration:**
+- `SLACK_BOT_TOKEN`: Slack App Bot Token (xoxb-...)
+- `SLACK_SIGNING_SECRET`: Slack signing secret (must be non-empty)
+- `SLACK_CHANNEL_ID`: channel to post messages to
+- `CI_POLL_INTERVAL_SECONDS`: CI polling interval (default 30s)
+- `CI_POLL_MAX_WAIT_SECONDS`: maximum CI wait time (default 30min)
+
+**Review fixes (2 CRITICAL + 4 HIGH):**
+- Added 5-minute timestamp replay-attack protection
+- An empty signing_secret returns 503 instead of silently skipping verification
+- Decision values validated against an allowlist
+- Fixed CI branch logic: develop for PR, main for release
+- ci_build_id type validation
+
+### Phase 7: PR Polling + Auto-Create Jira Ticket (completed 2026-03-24)
+
+Periodically scans active PRs across all repos, and automatically creates a Jira ticket when a PR has none.
+
+**Results:**
+- 1061 tests (+96 new), 95.96% coverage
+- Added: services/pr_dedup.py, services/pr_poller.py
+- Modified: azdo.py (list_active_prs), jira.py (create_issue + _text_to_adf), claude_review.py (generate_ticket_content), routing.py (route_after_fetch), pr_completed.py (auto_create_ticket node)
+
+**PR polling flow:**
+```
+Every 5 minutes → scan all active PRs in WATCHED_REPOS (target=develop)
+  → deduplicate against agent_threads
+  → synthesize a webhook payload → trigger the pr_completed graph
+```
+
+**Auto-create Jira ticket flow:**
+```
+fetch_pr_details → route_after_fetch (3-way routing)
+  ├─ merged → calculate_version (skips review)
+  ├─ active_with_ticket → move_jira_code_review (normal flow)
+  └─ active_no_ticket → auto_create_ticket
+       → Claude CLI generates summary + description
+       → Jira create_issue (ALLPOST project)
+       → set ticket_id + has_ticket=True
+       → move_jira_code_review (continues the normal flow)
+```
+
+**New configuration:**
+- `WATCHED_REPOS`: comma-separated list of repos
+- `PR_POLL_INTERVAL_SECONDS=300`: polling interval
+- `PR_POLL_ENABLED=False`: polling switch
+- `DEFAULT_JIRA_PROJECT=ALLPOST`: project for auto-created tickets
+
+**Review fixes (1 CRITICAL + 2 HIGH):**
+- schedule_fn parameter signature mismatch caused polling to fail silently → fixed to pass only initial_state
+- Dedup SQL did not enforce (pr_id, repo_name) pairing → switched to a paired query using unnest
+- run_graph_in_background was missing repos_base_dir + default_jira_project → added
+
+## Final Code Review Fixes (2026-03-24)
+
+A full code review found 3 CRITICAL + 8 HIGH issues; all have been fixed:
+
+| # | Severity | Issue | Fix |
+|---|----------|-------|-----|
+| 1 | CRITICAL | AzDoClient constructor arguments mismatched, crashing at startup | Pass the correct `base_url`, `vsrm_base_url`, `vsrm_http_client` |
+| 2 | CRITICAL | Empty webhook_secret bypasses authentication | An empty expected value rejects all requests |
+| 3 | CRITICAL | docker-compose default password `secret` | Changed to `${POSTGRES_PASSWORD:?must be set}` |
+| 4 | HIGH | `graph_name` not stored in agent_threads | `_upsert_thread` gained `graph_name`, `repo_name`, `pr_id` parameters |
+| 5 | HIGH | No httpx timeout configured | Added `timeout=30.0` |
+| 6 | HIGH | httpx.AsyncClient never closed | Lifespan shutdown closes all HTTP clients |
+| 7 | HIGH | Error handling leaks internal details | `_generic_error_handler` returns a fixed message |
+| 8 | HIGH | Approvals return 200 + error body | Changed to HTTPException(404/400) |
+
+Additional fixes:
+- `anthropic_api_key` made optional (the CLI uses the subscription, so it is not needed)
+- docker-compose: `WEBHOOK_SECRET` required, agent health check, `REPOS_BASE_DIR` environment variable
+- `_run_graph` now logs via `logger.exception`
+
+## Follow-up Improvements (non-blocking)
+
+- [ ] `get_pr_diff` currently returns only file names; it should return the actual diff content (lower priority, since the Claude Code CLI can read files on its own)
+- [ ] `list_build_pipelines` needs to filter the API request by repo
+- [ ] The `@with_retry` decorator has not yet been applied to client methods
+- [ ] The Jira fallback transition name should be configurable rather than hard-coded
+- [ ] `check_release_approvals` is a stub; actual approval gate detection needs to be implemented
+- [ ] `last_merge_source_commit` is always None; it needs to be fetched from the AzDo API
+- [ ] Interrupt nodes do not check the resume value, so any resume continues execution (post-interrupt routing is needed)
+- [ ] `archive_release` uses `date.today()`, which is not testable; the date should be injected
+- [ ] Extract `_upsert_thread` from webhooks.py into a shared `api/db.py` to eliminate the circular import
+- [ ] Switch the Dockerfile to a multi-stage build
+- [ ] A CLI prompt longer than 100K characters may exceed the OS ARG_MAX; pipe it via stdin instead
+- [ ] `PostgresStagingStore.save` has a concurrency race (needs SELECT FOR UPDATE or an application-level lock)
+- [ ] The 30s shutdown timeout may be too short for the Claude CLI's 300s timeout
+
+## Runtime Environment: WSL (recommended)
+
+Running directly on Windows has two problems:
+1. psycopg async requires SelectorEventLoop, and the Windows default ProactorEventLoop is incompatible
+2. The Claude CLI subprocess returns empty stdout under uvicorn on Windows
+
+**Solution: run the app in WSL Ubuntu, with PostgreSQL in Docker**
+
+```bash
+# WSL startup commands
+cd /mnt/c/Users/yaoji/git/Billo/billo-release-agent
+docker compose up -d db
+uv run uvicorn release_agent.main:app --host 0.0.0.0 --port 8080
+```
+
+Key .env settings:
+- `CLAUDE_CMD=claude` (not claude.cmd)
+- `REPOS_BASE_DIR=/mnt/c/Users/yaoji/git/Billo` (or clone onto the native WSL filesystem for better speed)
+
+## Integration Test Results (2026-03-24)
+
+**Verified:**
+- App startup + /status health check
+- Azure DevOps API (get_pr, list_active_prs, iterations/changes)
+- PR info parsing (repo_name, ticket_id, branch)
+- Full graph execution (parse → fetch → route → review → notify)
+- Database reads and writes (agent_threads)
+- Claude CLI ticket generation (returns structured JSON successfully under WSL)
+- Claude CLI code review startup (invoked successfully under WSL)
+- RunnableConfig type fix (eliminates a LangGraph warning)
+- URL encoding fix (project names containing spaces)
+- AzDo iterations/changes API (replacing the nonexistent diffs endpoint)
+
+**Open issues:**
+- Claude CLI code review is extremely slow under WSL + /mnt/c (10+ minutes, cross-filesystem I/O)
+- The graph has no checkpointer (interrupts are not persisted)
+- CI polling times out in environments without a pipeline
+
+## Deployment Steps
+
+1. `cp .env.example .env` and fill in all REQUIRED variables
+2. `docker compose up -d db` to start only PostgreSQL
+3. In WSL: `uv run uvicorn release_agent.main:app --port 8080`
+4. Run the migration: `python scripts/migrate_json_to_db.py --source ../release-workflow/releases`
+5. Optional: configure an Azure DevOps Service Hook / Cloudflare Tunnel
+
+## Related Notes
+
+- [[Billo Release Workflow Skill]]: workflow definition of the original Claude Code skill
diff --git a/Billo Release Workflow Skill.md b/Billo Release Workflow Skill.md
new file mode 100644
index 0000000..e69de29
diff --git a/scripts/auto-sync.sh b/scripts/auto-sync.sh
new file mode 100644
index 0000000..32660de
--- /dev/null
+++ b/scripts/auto-sync.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+# Auto-sync Obsidian vault to git remote
+# Runs daily via Windows Task Scheduler
+
+VAULT_DIR="/c/Users/yaoji/git/Knowledge"
+cd "$VAULT_DIR" || exit 1
+
+# Check if there are any changes
+if git diff --quiet && git diff --cached --quiet && [ -z "$(git ls-files --others --exclude-standard)" ]; then
+  echo "$(date): No changes to sync"
+  exit 0
+fi
+
+# Stage all changes
+git add -A
+
+# Commit with timestamp
+git commit -m "vault: auto-sync $(date '+%Y-%m-%d %H:%M')"
+
+# Push to remote
+git push origin main
+
+echo "$(date): Sync complete"