billo-release-agent/README.md
Yaojia Wang b67cbcfd93 fix: runtime fixes for WSL deployment and integration testing
- Fix RunnableConfig type annotations (dict -> RunnableConfig) for LangGraph compat
- Fix AzDo PR URL parsing (_links.web.href fallback + remoteUrl construction)
- Fix AzDo diff endpoint (use iterations/changes instead of non-existent diffs API)
- Fix _format_diff to read changeEntries field (not changes)
- Fix URL encoding for project names with spaces (Billo App Platform)
- Fix subprocess.run for Windows (replace asyncio.create_subprocess_exec with thread pool)
- Fix SlackClient to handle empty webhook URL gracefully
- Fix notify_request_changes to catch all exceptions (not just ReleaseAgentError)
- Fix JSON parsing to strip whitespace before json.loads
- Add CLAUDE_CMD config field for cross-platform CLI path
- Add run.py for Windows SelectorEventLoop workaround
- Add db port mapping in docker-compose for local dev
- Add comprehensive README sections: WSL setup, known issues, TODO list
2026-03-24 23:05:04 +01:00

Billo Release Agent

A LangGraph-based release automation agent for Billo. Automates the full release pipeline: PR discovery, code review (via Claude Code CLI), Jira ticket management, staging release tracking, CI/CD pipeline triggering with approval gates, and Slack interactive notifications.

Architecture

  +--- Azure DevOps Webhook ---+    +--- PR Poller (every 5 min) ---+
  |  POST /webhooks/azdo       |    |  Scans WATCHED_REPOS for      |
  |  (push-based)              |    |  active PRs (pull-based)       |
  +------------+---------------+    +------------+------------------+
               |                                 |
               v                                 v
         +---------------------------------------------+
         |             FastAPI Application              |
         |  /webhooks/azdo  /slack/interactions         |
         |  /approvals/*    /manual/*   /status         |
         +---------------------+------------------------+
                               |
                               v
         +---------------------------------------------+
         |            LangGraph Graphs                  |
         |                                             |
         |  pr_completed:                              |
         |    parse -> fetch -> [has ticket?]           |
         |      no  -> Claude generates ticket          |
         |      yes -> Claude code review               |
         |    -> merge -> Jira -> staging -> CI build   |
         |                                             |
         |  release:                                   |
         |    create release PR -> merge -> CI build    |
         |    -> CD release -> [Sandbox approve]        |
         |    -> [Production approve] -> Slack notify   |
         +---------------------+------------------------+
                               |
              +----------------+----------------+
              |                |                |
              v                v                v
         PostgreSQL      Azure DevOps     Slack (buttons)
         - threads       - PRs/Pipelines  - Notifications
         - staging       - Builds/Releases - Approvals
         - releases

Key Features

Feature                  Description
PR Discovery             Webhook-based (push), polling-based (pull), or both
Auto-Create Jira Ticket  When PR branch has no ticket ID, Claude generates summary + description and creates a Jira Story
AI Code Review           Claude Code CLI reviews PRs with full repo context (Read/Glob/Grep), using your subscription
CI/CD Integration        Triggers CI builds after merge, polls for completion, handles CD release approval gates
Slack Interactive        Approval requests with [Approve]/[Cancel] buttons, CI/CD status notifications
Human-in-the-loop        5 interrupt points where operator confirmation is required before destructive actions
Per-repo Versioning      Independent semantic versioning per repository (patch auto-increment)
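
The patch auto-increment in the last row can be illustrated with a small sketch. The function names and the initial version 0.1.0 are assumptions for illustration; they are not taken from the agent's versioning.py.

```python
def bump_patch(version: str) -> str:
    """Return the next patch version for a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


def next_version(latest_by_repo: dict[str, str], repo: str,
                 initial: str = "0.1.0") -> str:
    """Pick the next version for one repo, independent of all other repos.

    `initial` is a hypothetical starting version for repos with no history.
    """
    latest = latest_by_repo.get(repo)
    return initial if latest is None else bump_patch(latest)
```

Because the lookup is keyed by repository name, releasing one repo never changes the version of another.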

Prerequisites

  • Python 3.12+
  • PostgreSQL 16+
  • uv (recommended) or pip
  • Claude Code CLI installed and authenticated (claude in PATH)
  • Slack App (for interactive buttons) or Slack Incoming Webhook (for notifications only)

Quick Start

Local Development

# Install dependencies
uv sync --all-extras

# Copy and configure environment
cp .env.example .env
# Edit .env -- fill in all REQUIRED variables

# Start PostgreSQL
docker compose up -d db

# Run the server
uv run uvicorn release_agent.main:app --reload --port 8000

# Verify
curl http://localhost:8000/status

Docker Compose (Production)

cp .env.example .env
# Edit .env -- POSTGRES_PASSWORD, WEBHOOK_SECRET, etc. are required

docker compose up -d

Tables are created automatically on first startup.

Configuration

All configuration is via environment variables. See .env.example for the full list.

Required Variables

Variable           Description
AZDO_ORGANIZATION  Azure DevOps organization name
AZDO_PROJECT       Azure DevOps project name
AZDO_PAT           Azure DevOps personal access token
POSTGRES_DSN       PostgreSQL connection string
POSTGRES_PASSWORD  PostgreSQL password (used by docker-compose)
JIRA_EMAIL         Jira account email
JIRA_API_TOKEN     Jira API token
SLACK_WEBHOOK_URL  Slack incoming webhook URL
WEBHOOK_SECRET     Shared secret for validating AzDo webhooks (must be non-empty)

Optional Variables

Variable                    Default                          Description
REPOS_BASE_DIR              ""                               Base dir with Billo repos (e.g., /c/Users/yaoji/git/Billo)
WATCHED_REPOS               ""                               Comma-separated repos to poll (e.g., Billo.Platform.Payment,Billo.Platform.Document.DocumentAnalyser)
PR_POLL_ENABLED             False                            Enable periodic PR polling
PR_POLL_INTERVAL_SECONDS    300                              Polling interval (5 min)
PR_POLL_TARGET_BRANCH       refs/heads/develop               Target branch filter
DEFAULT_JIRA_PROJECT        ALLPOST                          Jira project key for auto-created tickets
AUTO_CREATE_TICKET_ENABLED  True                             Auto-create Jira ticket when branch has no ticket ID
SLACK_BOT_TOKEN             ""                               Slack App bot token (for interactive buttons)
SLACK_SIGNING_SECRET        ""                               Slack App signing secret (required for /slack/interactions)
SLACK_CHANNEL_ID            ""                               Channel for interactive messages
CI_POLL_INTERVAL_SECONDS    30                               CI build status poll interval
CI_POLL_MAX_WAIT_SECONDS    1800                             Max wait for CI completion (30 min)
OPERATOR_TOKEN              ""                               Token for operator endpoints (empty = no auth)
JIRA_BASE_URL               https://billolife.atlassian.net  Jira instance URL
PORT                        8000                             HTTP server port
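
CI_POLL_INTERVAL_SECONDS and CI_POLL_MAX_WAIT_SECONDS describe a bounded polling loop. The sketch below shows one way such a loop could look; the poll_until signature here is an assumption and may differ from the utility in graph/polling.py.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def poll_until(
    fetch: Callable[[], Awaitable[T]],
    is_done: Callable[[T], bool],
    interval_seconds: float = 30,     # mirrors CI_POLL_INTERVAL_SECONDS
    max_wait_seconds: float = 1800,   # mirrors CI_POLL_MAX_WAIT_SECONDS
) -> T:
    """Call `fetch` repeatedly until `is_done` accepts the result.

    Raises TimeoutError once another sleep would pass the deadline.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_seconds
    while True:
        result = await fetch()
        if is_done(result):
            return result
        if loop.time() + interval_seconds > deadline:
            raise TimeoutError(f"gave up after {max_wait_seconds} seconds")
        await asyncio.sleep(interval_seconds)
```

Checking the deadline before sleeping keeps the total wait within the configured maximum, apart from the duration of the final fetch.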

Security Notes

  • WEBHOOK_SECRET must be non-empty; if it is empty, all webhooks are rejected
  • POSTGRES_PASSWORD has no default in docker-compose; startup fails if it is unset
  • SLACK_SIGNING_SECRET must be set for /slack/interactions to accept requests (returns 503 if empty)
  • Slack signature verification includes 5-minute replay attack prevention
  • All secrets use SecretStr and are never logged or included in error responses
  • Set OPERATOR_TOKEN in production to protect approval and manual trigger endpoints

API Endpoints

Webhooks

Method  Path                 Auth                  Description
POST    /webhooks/azdo       X-Webhook-Secret      Receive Azure DevOps PR webhook events
POST    /slack/interactions  Slack Signing Secret  Receive Slack button click callbacks

Approvals (requires X-Operator-Token when configured)

Method  Path                    Description
GET     /approvals/pending      List threads awaiting operator approval
POST    /approvals/{thread_id}  Submit approval decision (merge/cancel/approve/skip)

Status and Manual Triggers

Method  Path                  Auth              Description
GET     /status               None              Health check
GET     /releases/{repo}      None              List versions for a repo
GET     /staging?repo={repo}  None              Current staging release
POST    /manual/pr/{pr_id}    X-Operator-Token  Manually trigger PR processing
POST    /manual/release       X-Operator-Token  Manually trigger a release
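
As an alternative to curl, a manual PR trigger can be built with the Python standard library. This is a sketch only: the helper name is hypothetical, and it assumes the endpoint needs no request body, as in the curl example elsewhere in this README.

```python
import urllib.request


def build_manual_pr_request(base_url: str, pr_id: int,
                            operator_token: str = "") -> urllib.request.Request:
    """Build (but do not send) a POST /manual/pr/{pr_id} request."""
    headers = {}
    if operator_token:
        # Only needed when OPERATOR_TOKEN is configured on the server.
        headers["X-Operator-Token"] = operator_token
    url = f"{base_url.rstrip('/')}/manual/pr/{pr_id}"
    return urllib.request.Request(url, method="POST", headers=headers)
```

Passing the returned object to urllib.request.urlopen sends the request.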

Graph Workflows

PR Completed

parse_webhook -> fetch_pr_details -> route_after_fetch
  |-- merged -----------------> calculate_version -> update_staging -> CI build -> END
  |-- active_with_ticket -----> move_jira_code_review -+
  |-- active_no_ticket -------> auto_create_ticket ----+
                                                       |
                                  run_code_review -> evaluate_review
                                    |-- approve -> [Slack: Merge?] -> merge_pr
                                    |-- request_changes -> notify -> END
                                  -> Jira transitions -> calculate_version
                                  -> update_staging -> CI build -> notify -> END

Release

load_staging -> [Slack: Create release?] -> create_release_pr
  -> [Slack: Merge release?] -> merge_release_pr
  -> CI build on main -> poll until complete
    |-- ci_failed -> notify failure -> END
    |-- ci_passed -> wait for CD release -> approval loop:
        |-- [Slack: Approve Sandbox?] -> approve -> poll again
        |-- [Slack: Approve Production?] -> approve -> poll again
        |-- all_deployed -> move tickets to Done
            -> Slack release notification -> archive -> END

Interrupt Points (Slack buttons)

#  When                            Slack Message              Buttons
1  After code review approves      PR title + review summary  [Merge] [Cancel]
2  Before creating release PR      Version + ticket list      [Create] [Cancel]
3  Before merging release PR       Release PR link            [Merge] [Cancel]
4  Before triggering pipelines     Pipeline list              [Trigger] [Skip]
5  Before approving release stage  Stage name + status        [Approve] [Skip]

PR Polling (Alternative to Webhooks)

When PR_POLL_ENABLED=True, the agent periodically scans all WATCHED_REPOS for active PRs targeting the configured branch. New PRs not yet tracked in agent_threads are automatically processed through the pr_completed graph.

This eliminates the need for Azure DevOps webhook configuration and works behind firewalls without public endpoint exposure.
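
The "new PRs not yet tracked" step amounts to a set difference on (repo, PR id) pairs. The sketch below works on in-memory data with a hypothetical helper name; the actual poller queries agent_threads in PostgreSQL.

```python
def select_new_prs(
    active_prs: list[tuple[str, int]],
    tracked: set[tuple[str, int]],
) -> list[tuple[str, int]]:
    """Return active (repo, pr_id) pairs that are not tracked yet.

    Order is preserved and duplicates within one scan are dropped.
    """
    seen = set(tracked)
    new_prs = []
    for key in active_prs:
        if key not in seen:
            seen.add(key)
            new_prs.append(key)
    return new_prs
```

Matching on the pair rather than the PR id alone matters if the same PR number can occur in more than one repository.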

Auto-Create Jira Ticket

When a PR branch has no ticket ID (e.g., chore/update-dependencies instead of feature/ALLPOST-4028_login-page), the agent automatically:

  1. Sends the PR diff to Claude Code CLI
  2. Claude generates a concise ticket summary and description
  3. Creates a Jira Story in the DEFAULT_JIRA_PROJECT
  4. Continues the normal workflow with the created ticket

Database Schema

Tables are created automatically on startup:

-- Thread tracking for LangGraph interrupts and PR dedup
agent_threads (thread_id, graph_name, repo_name, pr_id, status, state JSONB,
               slack_message_ts, created_at, updated_at)

-- Current in-progress releases (one per repo)
staging_releases (repo, version, started_at, tickets JSONB, updated_at)

-- Completed releases (immutable history)
archived_releases (repo, version, started_at, tickets JSONB, released_at)

Migrating Existing JSON Files

If you have existing release data from the Claude Code skill:

# Dry run
uv run python scripts/migrate_json_to_db.py \
    --source ../release-workflow/releases --dry-run

# Execute
uv run python scripts/migrate_json_to_db.py \
    --source ../release-workflow/releases \
    --dsn "postgresql://agent:password@localhost/agent"

Development

Running Tests

# Run all tests with coverage (1061 tests, 96%+ coverage)
uv run pytest

# Run without coverage (faster)
uv run pytest --no-cov

# Run specific module
uv run pytest tests/graph/test_pr_completed.py -v

Project Structure

src/release_agent/
    main.py                        # FastAPI app, lifespan, task management
    config.py                      # pydantic-settings (all env vars)
    state.py                       # LangGraph ReleaseState TypedDict
    exceptions.py                  # Exception hierarchy
    branch_parser.py               # Extract ticket ID from branch name
    versioning.py                  # Per-repo version calculation
    api/
        models.py                  # HTTP request/response Pydantic models
        dependencies.py            # FastAPI Depends() + operator auth
        webhooks.py                # POST /webhooks/azdo
        approvals.py               # Approval endpoints
        status.py                  # Status, releases, manual triggers
        slack_interactions.py      # POST /slack/interactions (button callbacks)
    graph/
        dependencies.py            # ToolClients, StagingStore Protocol
        postgres_staging_store.py  # PostgreSQL-backed store
        routing.py                 # Pure routing functions (route_after_fetch, etc.)
        pr_completed.py            # PR graph nodes + auto_create_ticket
        release.py                 # Release graph nodes + CI/CD approval loop
        full_cycle.py              # Subgraph composition
        ci_nodes.py                # CI trigger, poll, notify nodes
        polling.py                 # Reusable async poll_until utility
    models/
        pr.py, ticket.py, release.py, pipeline.py
        webhook.py, review.py, jira.py, build.py
    tools/
        azdo.py                    # Azure DevOps REST client
        jira.py                    # Jira REST client (transitions + create_issue)
        slack.py                   # Slack dual-mode (webhook + Web API)
        claude_review.py           # Claude Code CLI (review + ticket generation)
        _http.py, _retry.py       # Shared helpers
    services/
        pr_poller.py               # Background PR polling loop
        pr_dedup.py                # PR deduplication via agent_threads
scripts/
    migrate_json_to_db.py          # One-time JSON -> PostgreSQL migration
tests/                             # 1061 tests, 96%+ coverage

Docker

# Required: set POSTGRES_PASSWORD and WEBHOOK_SECRET in .env
docker compose up -d

The agent service includes a health check at /status. PostgreSQL uses pg_isready with service_healthy dependency.

The app runs best on WSL (Ubuntu) because:

  • psycopg async requires SelectorEventLoop (incompatible with Windows ProactorEventLoop)
  • subprocess.run captures Claude CLI stdout correctly on Linux but not reliably on Windows
  • PostgreSQL runs in Docker (accessible from both Windows and WSL via localhost)
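
Running the blocking subprocess.run call in a worker thread keeps the event loop free while the CLI works. The sketch below shows the pattern; run_cli is a hypothetical helper, and the command and its arguments are left to the caller rather than assuming any particular Claude CLI flags.

```python
import asyncio
import subprocess
import sys


async def run_cli(cmd: list[str], stdin_text: str = "",
                  timeout: float = 600) -> str:
    """Run a command in a worker thread and return its stripped stdout."""
    completed = await asyncio.to_thread(
        subprocess.run,
        cmd,
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    if completed.returncode != 0:
        raise RuntimeError(f"{cmd[0]} exited with {completed.returncode}: {completed.stderr.strip()}")
    # Strip surrounding whitespace so the output can go straight to json.loads.
    return completed.stdout.strip()
```

In the agent, the first element of cmd would come from the CLAUDE_CMD setting.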

Setup

# 1. Start PostgreSQL (from Windows or WSL)
cd /mnt/c/Users/yaoji/git/Billo/billo-release-agent
docker compose up -d db

# 2. Install uv in WSL (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# 3. Install dependencies
uv sync --all-extras

# 4. Configure .env
#    Key settings:
#    CLAUDE_CMD=claude          (not claude.cmd — WSL finds it via PATH)
#    REPOS_BASE_DIR=/mnt/c/Users/yaoji/git/Billo
#    PR_POLL_ENABLED=False      (disable during dev to avoid noise)
#    SLACK_WEBHOOK_URL=          (leave empty during dev)

# 5. Start the server
uv run uvicorn release_agent.main:app --host 0.0.0.0 --port 8080

# 6. Test
curl http://localhost:8080/status
curl -X POST http://localhost:8080/manual/pr/10443

Windows-only (Fallback)

If you must run on Windows directly, use the provided run.py script which sets WindowsSelectorEventLoopPolicy before starting uvicorn:

uv run python run.py

Note: Claude CLI subprocess may return empty stdout on Windows due to event loop incompatibility. WSL is the recommended approach.
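
For reference, a launcher of this kind can be very small. The following is a sketch of what such a script might contain, not the contents of the provided run.py; the host and port values are assumptions, and some uvicorn versions may need extra loop options for the policy to take effect.

```python
import asyncio
import sys


def use_selector_event_loop() -> bool:
    """Switch to the selector event loop policy on Windows only.

    Returns True if the policy was changed, False on other platforms.
    """
    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
        return True
    return False


def main() -> None:
    # The policy must be set before uvicorn creates its event loop.
    use_selector_event_loop()
    import uvicorn  # third-party; imported lazily so the helper stays stdlib-only

    uvicorn.run("release_agent.main:app", host="0.0.0.0", port=8000)
```

Calling main() starts the server; on Linux and WSL the helper is a no-op.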

Performance Note: WSL + /mnt/c

Claude Code CLI with --allowedTools Read,Glob,Grep on /mnt/c (Windows filesystem mounted in WSL) is very slow: a review can take 10+ minutes (see Known Issues). For faster code reviews, either:

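  1. Clone repos to the WSL native filesystem (e.g., ~/git/Billo/) and point REPOS_BASE_DIR at that directory (an absolute path is safer, since ~ may not be expanded when the value is read from .env)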
  2. Remove --allowedTools so Claude only reviews the diff text (faster but less thorough)

Slack App Setup

To use interactive buttons (optional; REST API approvals still work without them):

  1. Create a Slack App at https://api.slack.com/apps
  2. Enable Interactivity with Request URL: https://<your-domain>/slack/interactions
  3. Add Bot Token Scopes: chat:write, chat:update
  4. Install to workspace, get Bot Token (xoxb-...)
  5. Set SLACK_BOT_TOKEN, SLACK_SIGNING_SECRET, SLACK_CHANNEL_ID in .env

Current Status

Working

  • App startup, health check, API endpoints
  • Azure DevOps API integration (get PR, list active PRs, get iterations/changes)
  • PR info parsing (repo_name, ticket_id, branch extraction)
  • Graph execution (full pr_completed flow: parse -> fetch -> route -> review -> notify)
  • Database read/write (agent_threads table)
  • Slack error handling (empty webhook URL gracefully skipped)
  • Claude CLI ticket generation (tested: returns structured JSON)
  • Claude CLI code review (tested: returns structured JSON with verdict + issues)
  • PR review comments posted to Azure DevOps (inline + summary)
  • Node type annotations fixed (RunnableConfig instead of dict)

Known Issues

Issue                                                             Severity  Workaround
Windows: Claude CLI subprocess returns empty stdout               HIGH      Run in WSL
WSL + /mnt/c: Claude CLI Read/Glob very slow (10+ min)            MEDIUM    Clone repos to WSL native fs
Graph has no LangGraph checkpointer (interrupt not persistent)    MEDIUM    Graphs run to completion or fail; no resume
_upsert_thread only writes final state (no intermediate updates)  LOW       Query DB only after graph completes
CI poll may run indefinitely (no build to poll in dev)            LOW       Leave PR_POLL_ENABLED=False
Config test failures (env var leakage from .env)                  LOW       Run with -k "not test_config"

TODO (Not Yet Implemented)

  • Wire LangGraph checkpointer (PostgreSQL) for interrupt persistence
  • Interrupt decision validation (currently any resume value proceeds)
  • Slack interactive buttons end-to-end (Slack App not yet created)
  • CI/CD pipeline trigger end-to-end testing
  • Release approval gate detection (check_release_approvals is a stub)
  • last_merge_source_commit from AzDo API for safe merge
  • Operator token auth testing in production
  • Multi-stage Dockerfile for smaller images
  • Centralize _upsert_thread into shared api/db.py module
  • Remove dead has_ticket routing function
  • PR poller dedup query correctness (unnest pair-wise matching untested against real DB)
  • archive_release date injection (replace date.today() with config)
  • Approval loop max iteration guard (prevent infinite loops)
  • Migrate existing release JSON data to PostgreSQL