From f93e8baef11e80530e5f62c54a364527032cb77d Mon Sep 17 00:00:00 2001
From: Yaojia Wang
Date: Sun, 29 Mar 2026 21:11:36 +0200
Subject: [PATCH] feat: initial project setup with planning docs

Smart Support - AI customer service action layer framework.
Includes design doc, CEO plan, eng review, test plan, and README.
---
 .claude/settings.local.json |   8 ++
 .gitignore                  |  26 ++++
 README.md                   | 160 +++++++++++++++++++++++
 TODOS.md                    |  25 ++++
 ceo-plan.md                 |  52 ++++++++
 design-doc.md               | 250 ++++++++++++++++++++++++++++++++++++
 eng-review-plan.md          | 194 ++++++++++++++++++++++++++++
 eng-review-test-plan.md     |  47 +++++++
 8 files changed, 762 insertions(+)
 create mode 100644 .claude/settings.local.json
 create mode 100644 .gitignore
 create mode 100644 README.md
 create mode 100644 TODOS.md
 create mode 100644 ceo-plan.md
 create mode 100644 design-doc.md
 create mode 100644 eng-review-plan.md
 create mode 100644 eng-review-test-plan.md

diff --git a/.claude/settings.local.json b/.claude/settings.local.json
new file mode 100644
index 0000000..0cc3e37
--- /dev/null
+++ b/.claude/settings.local.json
@@ -0,0 +1,8 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(find:*)",
+      "WebSearch"
+    ]
+  }
+}
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..ffa9be3
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,26 @@
+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+dist/
+build/
+.venv/
+venv/
+.env
+
+# Node
+node_modules/
+.next/
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Logs
+*.log
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..3fb03e4
--- /dev/null
+++ b/README.md
@@ -0,0 +1,160 @@
+# Smart Support
+
+An action-layer framework for AI customer support. Paste your API and get an AI support agent that can execute real operations.
+
+## The Problem
+
+Existing support tools (Zendesk, Intercom, Ada) are good at answering FAQs, but their automation rate plateaus at 20-30%. The remaining 70% of tickets require a human agent to log into internal systems and manually look up orders, cancel orders, or issue coupons.
+
+Smart Support is the "action layer" that closes this gap. It does not replace your existing support platform; it lets the AI call your internal systems directly to complete these operations.
+
+## How It Works
+
+```
+Customer message → Chat UI → FastAPI WebSocket → LangGraph Supervisor → Specialist Agents → MCP Tools → Your internal systems
+                                                        ↑                      ↑
+                                                  Agent registry          interrupt()
+                                                  (YAML config)      (human confirmation)
+                                                        ↑
+                                                  PostgresSaver
+                                          (session state persistence)
+```
+
+1. The customer sends a message in the chat interface
+2. The LangGraph Supervisor analyzes intent and routes to the matching specialist agent
+3. The agent calls your internal systems over the MCP protocol (look up orders, cancel orders, issue discounts...)
+4. Write operations automatically trigger a human confirmation flow
+5. Every action is logged end to end, with replay and analytics support
+
+## Core Features
+
+- **Multi-agent collaboration** - Different operations are handled by different agents, each with its own permission boundary and tool set
+- **Plug and play** - Paste an OpenAPI spec URL; MCP tools and agent configs are generated automatically
+- **Human confirmation** - All write operations (cancel, refund, modify) require human approval; read operations execute directly
+- **Session context** - Multi-turn conversations; agents resolve references like "cancel that order"
+- **Real-time streaming** - Bidirectional WebSocket communication with token-by-token streaming
+- **Conversation replay** - Step through agent decisions, tool calls, and returned results
+- **Analytics** - Resolution rate, agent usage, escalation rate, cost per conversation
+- **YAML-driven configuration** - Agent definitions, personalities, and vertical templates are all configured via YAML
+
+## Tech Stack
+
+| Component | Choice |
+|-----------|--------|
+| Backend | Python 3.11+, FastAPI |
+| Agent orchestration | LangGraph v1.1, langgraph-supervisor |
+| Tool integration | langchain-mcp-adapters, @tool |
+| State persistence | PostgreSQL + langgraph-checkpoint-postgres |
+| LLM | Claude Sonnet 4.6 (swappable to OpenAI, Google, etc.) |
+| Frontend | React |
+| Deployment | Docker Compose |
+
+## Project Structure
+
+```
+smart-support/
+├── backend/
+│   ├── app/
+│   │   ├── main.py          # FastAPI + WebSocket entry point
+│   │   ├── graph.py         # LangGraph supervisor setup
+│   │   ├── agents/          # Agent definitions + tools
+│   │   ├── registry.py      # YAML agent registry loader
+│   │   ├── openapi/         # OpenAPI parsing + MCP server generation
+│   │   ├── replay/          # Conversation replay API
+│   │   ├── analytics/       # Analytics queries + API
+│   │   └── callbacks.py     # Token usage tracking
+│   ├── agents.yaml          # Agent registry config
+│   ├── templates/           # Vertical industry templates
+│   └── tests/
+├── frontend/                # React chat UI + replay + dashboard
+├── docker-compose.yml       # PostgreSQL + app
+└── pyproject.toml
+```
+
+## Quick Start
+
+```bash
+# Start PostgreSQL and the app
+docker compose up

+# Open the chat interface
+open http://localhost:8000
+```
+
+## Example Agent Configuration
+
+```yaml
+# agents.yaml
+agents:
+  - name: order_lookup
+    description: Look up order status and shipping information
+    permission: read
+    personality:
+      tone: professional
+      greeting: "Hello, I can help you look up your order."
+    tools:
+      - get_order_status
+      - get_tracking_info
+
+  - name: order_actions
+    description: Cancel or modify orders
+    permission: write  # triggers human confirmation
+    personality:
+      tone: careful
+      greeting: "I can help with order changes. Every action will be confirmed with you first."
+    tools:
+      - cancel_order
+      - modify_order
+
+  - name: discount
+    description: Issue coupons and discount codes
+    permission: write
+    tools:
+      - apply_discount
+      - generate_coupon
+```
+
+## Automatic OpenAPI Onboarding
+
+No hand-written MCP connectors needed. Paste your API spec URL:
+
+1. The framework parses the OpenAPI 3.0 spec
+2. An LLM classifies each endpoint (read/write, customer parameters, agent grouping)
+3. An operator reviews the classification results
+4. An MCP server and agent YAML config are generated automatically
+5. The new tools are immediately available
+
+## Security Design
+
+- **Human confirmation** - All write operations require approval by the customer or an operator
+- **SSRF protection** - OpenAPI URL imports block internal network addresses and DNS rebinding attacks
+- **Action audit** - Every action is logged with agent, parameters, result, and timestamp
+- **Permission isolation** - Each agent can only access its configured tool set
+- **Interrupt timeout** - Operations unconfirmed after 30 minutes are cancelled automatically, preventing stale approvals
+
+## Development Phases
+
+| Phase | Timeline | Scope |
+|-------|----------|-------|
+| Phase 1 | Weeks 1-3 | Core framework: chat UI + supervisor + agent registry + interrupt flow |
+| Phase 2 | Weeks 3-4 | Multi-agent routing + webhook escalation + vertical templates |
+| Phase 3 | Weeks 4-6 | OpenAPI auto-discovery + MCP server generation + SSRF protection |
+| Phase 4 | Weeks 6-7 | Conversation replay + analytics dashboard |
+
+## Target Users
+
+Heads of customer experience at mid-size e-commerce companies (500-5000 orders/day, 5-20 support agents).
+
+Their pain: support agents constantly switch between Zendesk and the Shopify admin, running lookups and actions by hand. Smart Support lets the AI perform these operations directly, with humans approving only the critical steps.
+
+## Related Documents
+
+- [Design doc](design-doc.md) - Problem definition, constraints, approach selection
+- [CEO plan](ceo-plan.md) - Product vision, scope decisions
+- [Eng review plan](eng-review-plan.md) - Architecture decisions, test strategy, failure modes
+- [Test plan](eng-review-test-plan.md) - Test paths, edge cases, E2E flows
+- [TODOs](TODOS.md) - Work deferred to later phases
+
+## License
+
+MIT
diff --git a/TODOS.md b/TODOS.md
new file mode 100644
index 0000000..16e7c64
--- /dev/null
+++ b/TODOS.md
@@ -0,0 +1,25 @@
+# TODOS
+
+## Before Phase 3
+- [ ] **Tool interface decision:** The tool layer should support multiple backends, not just MCP:
+  1. **MCP tools** — for complex, stateful interactions via MCP protocol (stdio/SSE)
+  2. **CLI tools** — wrap existing CLIs (Shopify CLI, AWS CLI, Stripe CLI, etc.). Parse stdout/stderr.
+  3. **Direct API tools** — simple REST/GraphQL HTTP calls, no MCP overhead.
+  LangChain tools are just Python functions with descriptions — the backend is an implementation detail. Research MCP Python SDK (`mcp` on PyPI) for the MCP path. Design the tool base class to abstract over all three backends. ~2-3 hours research.
Flagged by eng review outside voice + user feedback. + +## Before Production Deployment (P1) +- [ ] **Auth system:** API key auth for chat WebSocket, session-based auth for dashboard/replay/import. Rate limiting on all endpoints. Blocks any real client deployment. Effort: M (CC: ~2 days). Depends on: Phase 4 completion. + +## Before Phase 4 (Client Engagement) +- [ ] **Checkpointer migration plan:** InMemorySaver → PostgresSaver (or SQLiteSaver as intermediate). InMemorySaver loses all state on restart/crash. PostgresSaver requires schema, connection pooling, serialization compatibility. Not a simple config swap. Plan the migration before any real client deployment. + +## Design Changes from Eng Review +- [x] **Use LangGraph built-ins:** Checkpointers for session state, interrupt() for human-in-the-loop, supervisor pattern for multi-agent routing. Don't rebuild what LangGraph provides. +- [x] **WebSocket for streaming:** Bidirectional connection for streaming tokens + interrupt flow. +- [x] **Supervisor pattern:** Despite latency concern (8-15s per response), founder chose multi-agent supervisor over single-agent. Stream all tokens to mitigate perceived wait. +- [x] **YAML agent registry:** Declarative agent definitions for client configurability. +- [x] **Prompt caching:** Enabled from day one to reduce LLM costs. +- [x] **Multi-LLM provider support:** Use LangChain's provider abstractions (ChatAnthropic, ChatOpenAI, ChatGoogleGenerativeAI). Provider configurable per deployment. +- [x] **Multi-backend tool support:** Tool layer supports MCP servers, CLI wrappers, and direct API calls. LangChain tools abstract over all three backends. +- [x] **Interrupt resume flow:** Design WebSocket reconnection + re-send interrupt prompt on reconnect. +- [x] **Tests per phase:** 28 unit/integration + 4 E2E, written alongside each phase. 
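The tool-interface TODO above boils down to: a tool is a named, described callable, and whether it is backed by MCP, a CLI, or a direct HTTP call is an implementation detail. A minimal stdlib-only sketch of that idea, assuming no framework; every name here is illustrative, not code from this patch:

```python
import subprocess
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    """A backend-agnostic tool: the router only ever sees name + description."""
    name: str
    description: str
    run: Callable[..., Any]  # MCP call, CLI wrapper, or direct HTTP under the hood

def cli_tool(name: str, description: str, argv: list[str]) -> Tool:
    """Wrap an existing CLI; stdout becomes the tool result, stderr the error."""
    def run(*extra: str) -> str:
        proc = subprocess.run(argv + list(extra), capture_output=True, text=True)
        if proc.returncode != 0:
            raise RuntimeError(proc.stderr.strip())
        return proc.stdout.strip()
    return Tool(name, description, run)

def api_tool(name: str, description: str, fetch: Callable[..., dict]) -> Tool:
    """Direct REST/GraphQL call, no MCP overhead. `fetch` performs the request."""
    return Tool(name, description, fetch)

# Registry: agents pick tools by name; the backend never leaks through.
registry = {
    t.name: t
    for t in [
        cli_tool("echo", "Echo text back (demo CLI wrapper)", ["echo"]),
        api_tool("get_order_status", "Look up an order (mock direct-API tool)",
                 lambda order_id: {"order_id": order_id, "status": "shipped"}),
    ]
}

result = registry["get_order_status"].run("1042")
```

An MCP-backed variant would follow the same shape, with `run` delegating to an MCP client session instead of `subprocess` or HTTP.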
diff --git a/ceo-plan.md b/ceo-plan.md new file mode 100644 index 0000000..2f5dff8 --- /dev/null +++ b/ceo-plan.md @@ -0,0 +1,52 @@ +--- +status: ACTIVE +--- +# CEO Plan: Smart Support — AI Customer Service Action Layer Framework + +Generated by /plan-ceo-review on 2026-03-29 +Branch: unknown | Mode: SCOPE EXPANSION +Repo: smart-support + +## Vision + +### 10x Check +A framework that comes alive the moment a client connects their system. Client pastes an OpenAPI spec URL, the framework auto-generates MCP tool wrappers and agent definitions, and a working chatbot appears in minutes. Every conversation is logged with full replay capability. Analytics dashboard shows ROI in real-time. The client doesn't configure anything. They paste a URL. + +### Platonic Ideal +Open smart-support.com. Paste your Shopify store URL. In 90 seconds, a chat widget appears connected to your store. Type "cancel order #1042" and watch the agent look up the order, ask for confirmation, and cancel it. No setup. No config. No code. Deploy to Zendesk with one click. Dashboard shows 60% automated resolution rate and $4,200/month savings. Sleep at night knowing every destructive action requires human approval, every action is logged, and the system pages you if something unusual happens. + +## Scope Decisions + +| # | Proposal | Effort | Decision | Reasoning | +|---|----------|--------|----------|-----------| +| 1 | Auto-discovery from OpenAPI specs | L (CC: ~3-4 days) | ACCEPTED | 10x differentiator. "Paste URL, get chatbot." | +| 2 | Conversation analytics dashboard | M (CC: ~2-3 days) | ACCEPTED | Proves ROI. Makes the product sticky. | +| 3 | Agent personality config (YAML) | S (CC: ~1 hour) | ACCEPTED | Near-zero cost, customizable feel. | +| 4 | Conversation replay / debugger | M (CC: ~2 days) | ACCEPTED | Trust = adoption. Clients need to see WHY. | +| 5 | Webhook escalation to Slack/email | S (CC: ~1 hour) | ACCEPTED | Bridge between AI-only and human-in-the-loop. 
| +| 6 | Quick-start vertical templates | S (CC: ~30 min) | ACCEPTED | First 5 minutes feel magical. | + +## Accepted Scope (added to this plan) +- Auto-discovery: parse OpenAPI/Swagger specs, generate MCP tool wrappers + agent YAML +- Analytics dashboard: resolution rate, turns, agent usage, common intents, escalation % +- Agent personality: tone/greeting/escalation style configurable in YAML +- Conversation replay: step-by-step replay of agent decisions, tool calls, results +- Webhook escalation: HTTP POST with full conversation context on escalation +- Vertical templates: pre-built YAML for e-commerce, SaaS, fintech + +## Revised Phasing (with expansions) +- **Phase 1 (Week 1):** Chat UI (React) + FastAPI + LangGraph graph with PostgresSaver checkpointer + agent registry from YAML + single mock agent + agent personality config + tests +- **Phase 2 (Week 2):** Multi-agent supervisor (uses registry from Phase 1) + vertical templates (e-commerce, SaaS, fintech) + interrupt() for write ops + webhook escalation + tests +- **Phase 3 (Week 2-3):** OpenAPI auto-discovery: parse spec (REST, OpenAPI 3.0+), generate tool wrappers + agent YAML. SSRF protection on URL import. Pluggable tool interface (MCP/CLI/API backends). +- **Phase 4 (Week 3-4):** Conversation replay UI (browsable checkpointer state) + analytics dashboard (resolution rate, agent usage, escalation %, conversation count). PostgreSQL schema for conversation data locked in Phase 1. +- **Phase 5 (client engagement):** Real connectors for first client's systems. + +**Dependencies resolved:** Agent registry ships in Phase 1 (before supervisor in Phase 2). Conversation data schema locked in Phase 1 so Phase 4 analytics can query it without migration. + +## Deferred to TODOS.md +- (none — all proposals accepted) + +## Effort Estimate +- Original: 3 weeks (1 engineer) +- With expansions: 4-5 weeks (1 engineer). The expansions add ~1.5-2 weeks. +- With CC+gstack: 2-3 weeks realistic. 
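The webhook-escalation item in the accepted scope is essentially an HTTP POST of the full conversation context plus a retry loop. A hedged sketch with the transport injected so the retry logic is testable; the URL and payload shape are illustrative assumptions, not part of this plan:

```python
import json
import time
from typing import Callable

def escalate(post: Callable[[str, bytes], int], url: str,
             conversation: dict, retries: int = 3, backoff: float = 0.0) -> bool:
    """POST the full conversation context to the configured webhook URL.

    `post(url, body) -> status_code` abstracts the HTTP client
    (urllib/httpx in practice). Retries on non-2xx with exponential backoff.
    """
    body = json.dumps(conversation).encode()
    for attempt in range(retries):
        if 200 <= post(url, body) < 300:
            return True
        time.sleep(backoff * (2 ** attempt))
    return False  # caller should surface the failure to a human operator

# Fake endpoint that fails once, then succeeds on the second attempt
calls = []
def flaky(url, body):
    calls.append(url)
    return 500 if len(calls) == 1 else 200

ok = escalate(flaky, "https://hooks.example.com/escalations",
              {"conversation_id": "conv-1", "history": []})
```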
diff --git a/design-doc.md b/design-doc.md new file mode 100644 index 0000000..ef031fd --- /dev/null +++ b/design-doc.md @@ -0,0 +1,250 @@ +# Design: Smart Support — AI Customer Service Action Layer + +Generated by /office-hours on 2026-03-28 +Branch: unknown +Repo: smart-support +Status: APPROVED +Mode: Startup + +## Problem Statement + +Existing customer support tools (Zendesk, Intercom, Ada) handle FAQ-style queries well but plateau at 20-30% automation because they can't execute actions in internal systems. The remaining 70% of support volume requires a human agent to manually log into internal tools, look up orders, cancel subscriptions, apply discounts, etc. Smart Support is the "action layer" — a multi-agent AI system that connects to internal services via MCP to actually perform these operations, complementing (not replacing) existing support platforms. + +## Demand Evidence + +- Founder's own pain (observed, not firsthand in support operations) +- No paying customers or pilots yet +- No specific companies contacted +- Market evidence: Zendesk/Intercom AI agents plateau at 20-30% automation (Qualtrics, Swifteq). Klarna reversed course after replacing 700 human agents with AI — quality collapsed because the AI could answer questions but couldn't reliably execute workflows +- $10.9B market growing 40% CAGR, but the "action execution" sub-segment is underserved by incumbents + +**Demand risk: HIGH.** The thesis is sharp but unvalidated with real buyers. Priority #1 after this design: customer conversations. + +## Status Quo + +Companies currently handle the "action" part of support via: +1. Human agents manually switching between Zendesk/Intercom and internal tools (Shopify admin, CRM, billing systems) +2. Internal dashboards and admin panels built by engineering teams +3. Macros and automations that handle simple cases (auto-refund under $X) but can't reason about context +4. 
Some use Retool/internal tools, but these still require human judgment to select the right action + +The gap: no tool bridges "understanding what the customer wants" to "executing the action in the internal system" autonomously. + +## Target User & Narrowest Wedge + +**Buyer type:** Head of Customer Experience or VP of Operations at a mid-size e-commerce company (Shopify-based, 500-5000 orders/day, 5-20 support agents). + +**What gets them promoted:** Reducing support cost per ticket while maintaining or improving CSAT scores. + +**What gets them fired:** Customer churn from slow resolution times, or a support incident that goes viral. + +**Narrowest wedge:** E-commerce order management — check order status, cancel orders, apply discounts/credits, track shipments. These are the highest-volume, most repetitive "action" tasks in e-commerce support. + +**Note:** No specific buyer identified by name yet. This is a critical gap to close within the first 2 weeks. + +## Constraints + +- Must complement existing support tools (Zendesk, Intercom), not replace them +- Must use LangGraph for multi-agent orchestration (founder's architecture choice) +- Must use MCP (Model Context Protocol) for internal service connectivity, with a pluggable connector pattern (no specific vertical baked in) +- Must manage session context across multi-turn conversations +- Must include human-in-the-loop confirmation for destructive actions (cancellations, refunds) +- Framework-first: no specific vertical in prototype. Client-specific MCP connectors built per engagement. + +## Premises + +1. Existing support tools handle FAQ well but CAN'T execute internal system actions — **AGREED** +2. The value is in the "action layer" connecting to internal services via MCP — **AGREED** +3. 
Multi-agent architecture is the right approach (different actions need different permissions and safety checks) — **AGREED** (challenged by second opinion as premature for prototype stage; founder defended with conviction but without specific reasoning) +4. Session context management matters for multi-step action workflows — **AGREED** +5. Horizontal is the long-term vision; vertical e-commerce is the prototype scope — **AGREED**. The prototype does NOT need to generalize. Build tight Shopify integration first, abstract later. + +**Unvalidated premise (HIGH RISK):** Can compete in this market without existing customer relationships, domain expertise, or proprietary training data. + +## Cross-Model Perspective + +Independent cold read (Claude subagent): + +- **Steelman:** Most enterprise support costs aren't in answering questions — they're in the 5-10 minute human tasks that follow. A thin MCP-based action layer captures that tail without displacing the incumbent. If MCP standardizes the integration, the cost drops enough for per-action pricing at SMB scale. +- **Key insight:** "I have a type in mind" is the whole problem. The action layer thesis is technically sharp but commercially unanchored. "Any company with internal services" is a TAM slide, not a buyer. +- **Challenged premise:** Multi-agent is architecturally correct but commercially wrong at prototype stage. Multi-agent requires buyers to trust your orchestration with production credentials across multiple systems simultaneously — that's a security review and procurement cycle before you've proven anything. +- **48-hour prototype suggestion:** One Shopify merchant, one action (cancel order), LangGraph single agent, MCP wrapping Shopify Admin API, Slack as UI, human-in-the-loop confirmation. Goal: 90-second video. + +## Approaches Considered + +### Approach A: One Vertical, Full Stack (CHOSEN) + +Multi-agent LangGraph system targeting e-commerce. 
Three agents (order lookup, order actions, discount/refund), MCP connectors to Shopify Admin API, session context via Redis, web chat UI. Deploy as a working demo for Shopify merchants. + +- Effort: M (2-3 weeks) +- Risk: Medium +- Proves multi-agent orchestration end-to-end +- Concrete demo for a specific buyer type +- Shopify has 4.6M merchants + +### Approach B: Horizontal Framework + One Demo + +Build the multi-agent orchestrator as a generic framework first (agent registry, MCP tool discovery, session manager, permission system), then one vertical demo on top. + +- Effort: L (4-6 weeks) +- Risk: High — more code before first customer feedback +- Framework without customers is just code + +### Approach C: Video-First Prototype + +Thinnest possible multi-agent demo, hardcoded to one test merchant, no auth, minimal UI. Goal: 90-second screen recording showing real actions. + +- Effort: S (1 week) +- Risk: Low +- Fastest to customer conversations, but not production-ready + +## Recommended Approach + +**Revised Approach: Pluggable Multi-Agent Framework.** + +The product is the framework itself — not a Shopify integration or any specific vertical. When a client comes in, we build (or they build) MCP connectors for their systems. The framework handles everything else: chat, routing, context, safety. + +### Core Components (prototype scope) + +**1. Chat Interface** +- Web-based chat UI (HTML + fetch or lightweight React). This is a real product surface, not throwaway scaffolding. +- Supports multi-turn conversations with streaming responses. +- Displays agent actions and confirmation prompts inline. + +**2. Agent Router (Orchestrator)** +- LangGraph graph that classifies customer intent and routes to the correct agent. +- Intent classification via LLM structured output. +- Multi-intent requests (e.g., "cancel my order and give me a discount") are sequenced by the orchestrator. Ambiguous or conflicting intents escalate to human. 
+- **Agent registry:** Agents are registered declaratively (name, description, available MCP tools, permission level). The router uses agent descriptions to select the right one. Adding a new agent = adding a config entry + connecting its MCP tools. + +**3. Context Manager (Session State)** +- In-memory Python dict for prototype phase. Redis introduced before any external pilot. +- Session keyed by conversation ID, 30-minute sliding window TTL (reset on each turn). +- Stores: conversation history, resolved entities (e.g., "that order" → order #1042), customer profile, current agent state. +- Pending human-in-the-loop confirmations extend the TTL until resolved or cancelled with a user-facing notice. +- Context is passed to agents on each turn so they have full conversation awareness. + +**4. Pluggable MCP Layer** +- Framework defines a standard interface for MCP tool connectors. +- Each client engagement produces a set of MCP servers wrapping their specific systems (Shopify Admin API, custom REST APIs, internal gRPC services, databases, etc.). +- **No specific MCP connectors are built in the prototype.** Instead, provide 1-2 example/mock MCP tools (e.g., a mock "order lookup" and "order cancel") to demonstrate the plug-in pattern and enable end-to-end testing. +- When onboarding a real client: build MCP wrappers for their APIs, register them with the agent registry, done. + +**5. Safety Layer** +- Human-in-the-loop confirmation for write/destructive operations, surfaced as a confirmation prompt in the chat UI. +- Permission boundaries per agent (read-only agents skip confirmation, write agents require it). +- All actions logged with action ID, timestamp, agent, parameters, and outcome. +- On MCP call failure: log error, escalate to human with full context. 
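The Context Manager behavior described above (sessions keyed by conversation ID, a 30-minute sliding TTL reset on each turn, TTL extended while a confirmation is pending) can be pinned down in a few lines. An illustrative in-memory sketch, not code from the repo; the class and method names are assumptions:

```python
import time

SESSION_TTL = 30 * 60  # 30-minute sliding window, reset on each turn

class ContextManager:
    """In-memory session store for the prototype (Redis post-pilot)."""

    def __init__(self, now=time.time):
        self._now = now      # injectable clock for testing
        self._sessions = {}  # conversation_id -> (state, expires_at, pending)

    def touch(self, conversation_id: str) -> dict:
        """Fetch session state for a turn, resetting the sliding TTL."""
        state, expires_at, pending = self._sessions.get(
            conversation_id, ({"history": [], "entities": {}}, 0.0, False))
        if expires_at and self._now() > expires_at and not pending:
            state = {"history": [], "entities": {}}  # expired: start fresh
        self._sessions[conversation_id] = (state, self._now() + SESSION_TTL, pending)
        return state

    def set_pending_confirmation(self, conversation_id: str, pending: bool) -> None:
        """A pending human-in-the-loop confirmation keeps the session alive."""
        state, expires_at, _ = self._sessions[conversation_id]
        self._sessions[conversation_id] = (state, expires_at, pending)

# Example: an entity resolved on one turn ("that order" -> #1042) is
# available to the agent on the next turn via the same session state.
cm = ContextManager()
state = cm.touch("conv-1")
state["entities"]["that order"] = "#1042"
```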
+ +### Architecture Diagram + +``` +Customer Chat UI + │ + ▼ + FastAPI Server + │ + ▼ + Context Manager ◄── session store (in-memory / Redis) + │ + ▼ + Agent Router (LangGraph Orchestrator) + │ + ├──► Agent A ──► MCP Tools (client-specific) + ├──► Agent B ──► MCP Tools (client-specific) + └──► Agent C ──► MCP Tools (client-specific) + │ + ▼ + Client's Internal Systems + (Shopify, custom APIs, etc.) +``` + +### Tech Stack + +- Python (LangGraph, FastAPI) +- In-memory Python dict (prototype) / Redis (post-pilot) +- MCP SDK (for building client-specific connectors) +- LLM: Claude Sonnet 4.6 via Anthropic API. Abstracted behind a provider interface (`complete(messages, tools) -> response`) so it can be swapped. +- Web chat frontend (HTML + fetch or lightweight React) + +### Phasing + +- **Phase 1 (Week 1):** Chat UI + Context Manager + basic LangGraph orchestrator with a single mock agent. Proves: chat works, context persists across turns, agent receives full conversation history. +- **Phase 2 (Week 2):** Agent Router with multi-agent support + agent registry. Add 2-3 mock agents with different capabilities. Proves: router correctly selects agent based on intent, multi-agent handoff works. +- **Phase 3 (Week 3):** Safety layer (human-in-the-loop confirmation) + pluggable MCP interface with example mock tools. Proves: write operations require confirmation, new MCP tools can be added without changing framework code. +- **Phase 4 (client engagement):** Build real MCP connectors for first client's systems. This is where Shopify, custom APIs, etc. get wired in. +- **Fallback:** If multi-agent graph complexity blocks progress past day 5, fall back to single-agent with tool routing and refactor to multi-agent post-validation. + +**Effort estimate:** 3 weeks for a full-time senior Python engineer. LangGraph experience assumed; add 3-4 days if new to LangGraph. No client-specific MCP connectors included in this estimate — those are per-client engagement work. 
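The provider interface named in the tech stack, `complete(messages, tools) -> response`, is small enough to express as a structural type. A hypothetical sketch with a deterministic stub standing in for the Anthropic-backed provider; all names are illustrative:

```python
from typing import Protocol

Message = dict[str, str]      # e.g. {"role": "user", "content": "..."}
ToolSpec = dict[str, object]  # name/description/parameters, provider-agnostic

class LLMProvider(Protocol):
    """Anything exposing complete(messages, tools) -> response can back the agents."""
    def complete(self, messages: list[Message], tools: list[ToolSpec]) -> Message: ...

class StubProvider:
    """Deterministic provider for tests: echoes the last user message."""
    def complete(self, messages: list[Message], tools: list[ToolSpec]) -> Message:
        last = messages[-1]["content"]
        return {"role": "assistant", "content": f"echo: {last}"}

def answer(provider: LLMProvider, text: str) -> str:
    # The framework depends only on the protocol, so Claude, OpenAI,
    # or a test stub are interchangeable per deployment.
    reply = provider.complete([{"role": "user", "content": text}], tools=[])
    return reply["content"]

print(answer(StubProvider(), "cancel order #1042"))  # echo: cancel order #1042
```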
+
+## Open Questions
+
+1. **Pricing model:** Per-action? Per-seat? Per-resolution? Per-action aligns value with usage but creates billing uncertainty for buyers (same problem Intercom has). Defer until first pilot — let the customer's willingness to pay inform the model.
+2. **Multi-tenant architecture:** Single-tenant for prototype. Multi-tenant architecture decision deferred until first paid customer.
+
+## Success Criteria
+
+### Engineering Done
+1. Working framework: Chat UI + Agent Router + Context Manager functioning end-to-end with mock agents
+2. Multi-agent routing: router correctly selects agent based on conversation intent
+3. Session context: agent correctly resolves references ("cancel that one") across turns using context manager
+4. Human-in-the-loop: write operations require confirmation before execution
+5. Pluggable MCP: new MCP tools can be added via config without changing framework code
+6. 90-second screen recording of the framework in action with mock agents
+
+### Business Validation
+7. At least 5 customer conversations with real e-commerce operators within 2 weeks of demo completion
+8. At least 1 paid pilot within 4 weeks of demo completion
+
+## Distribution Plan
+
+- **Initial:** Direct demo to potential clients via cold outreach, showing the framework with mock agents and explaining per-client MCP customization
+- **Demo hosting:** Deploy on a cloud provider (Fly.io, Railway, or AWS) with a shareable demo link
+- **Video:** 90-second screen recording of the framework in action for async sales
+- **Future:** Self-service onboarding where clients can configure their own MCP connectors; Zendesk/Intercom marketplace integrations
+- **CI/CD:** Deferred to post-validation. Manual deploy for prototype phase.
+ +## Dependencies + +- LangGraph (open source) +- MCP SDK (open source, for building client-specific connectors) +- LLM API access (Claude Sonnet 4.6 via Anthropic API) +- Domain for demo hosting + +## GSTACK REVIEW REPORT + +| Review | Trigger | Why | Runs | Status | Findings | +|--------|---------|-----|------|--------|----------| +| CEO Review | `/plan-ceo-review` | Scope & strategy | 1 | CLEAR | 6 proposals, 6 accepted, 0 deferred. Scope expanded: OpenAPI auto-discovery, analytics dashboard, conversation replay, agent personality, webhook escalation, vertical templates. | +| Codex Review | `/codex review` | Independent 2nd opinion | 0 | — | — | +| Eng Review | `/plan-eng-review` | Architecture & tests (required) | 1 | CLEAR | 4 issues, 0 critical gaps. Scope reduced: LangGraph built-ins replace 3/5 custom components. Note: ran before CEO expansion. | +| Design Review | `/plan-design-review` | UI/UX gaps | 0 | — | — | +| Outside Voice | via eng review | Independent challenge | 1 | issues_found | 8 issues: latency (kept supervisor), interrupt resume (added), YAML registry (kept), MCP interface (TODO), auth gap (accepted for demo) | + +**VERDICT:** CEO + ENG CLEARED. Eng review may be stale (ran before CEO expanded scope with 6 new features). Consider re-running `/plan-eng-review`. Run `/plan-design-review` for the 3 UI surfaces (chat, analytics, replay). + +## Reviewer Concerns + +The following issues were flagged by adversarial review and deferred to implementation phase: + +1. **Error taxonomy for MCP failures:** Define retriable vs. non-retriable errors, retry policy (e.g., 3 attempts with exponential backoff for transient errors, immediate escalation for auth failures). Address when building first real MCP connector for a client. +2. **Destructive action boundary:** Create explicit table of which operations require human-in-the-loop confirmation. Default rule: all write operations require confirmation; all read operations do not. 
Client-specific overrides configurable per agent. +3. **Multi-intent atomicity:** Clarify whether multi-intent sequences are "all-or-nothing" or "best-effort" with partial failure escalation. Address during orchestrator implementation. +4. **External integration (Zendesk/Intercom):** Webhook integration flow (payload shape, async acknowledgment, response posting) to be designed when a client requires it. Not in prototype scope. + +## Appendix: Founder Action Plan (The Assignment) + +**Do not write more code this week.** Instead: + +1. Go to the Shopify Community forums and find 5 merchants who have posted about support tool frustrations in the last 30 days. DM them. Ask: "What does your support team spend the most time doing in the Shopify admin panel during a customer conversation?" +2. Find 3 customer support managers on LinkedIn at Shopify-based e-commerce companies (100-500 employees). Send a 3-sentence cold message: "I'm building an AI agent that can cancel orders, apply discounts, and look up shipments automatically during support conversations. Would you spend 15 minutes showing me how your team handles these tasks today?" +3. If even ONE person responds with enthusiasm, you have a design partner. Build for them specifically. + +The code can wait. The customer can't. + +## What I noticed about how you think + +- You clarified your positioning mid-session — shifting from "competing with Zendesk" to "complementing Zendesk as the action layer." That pivot from replacement to complement is a much sharper thesis, and you got there on your own when pushed. +- You chose multi-agent architecture and defended it against both my skepticism and the independent second opinion's challenge. You didn't articulate the specific reasoning, but you held your ground. Next time someone challenges this, be ready with the "why" — "because different actions need different permission boundaries and different failure modes" is the answer. 
+- Your examples were concrete: "check orders, cancel orders, give discounts." You think in terms of specific actions, not abstract capabilities. That's good product instinct for a developer tool. +- You chose the full vertical (Approach A) over the quick prototype (Approach C). That tells me you want to build something real, not just validate. I respect that, but I'll push back: the risk is that you spend 3 weeks building something beautiful that no merchant wants. Talk to merchants this week. The code will be better for it. diff --git a/eng-review-plan.md b/eng-review-plan.md new file mode 100644 index 0000000..8e384f1 --- /dev/null +++ b/eng-review-plan.md @@ -0,0 +1,194 @@ +# Smart Support Framework — Eng Review Plan + +## Context + +Build a pluggable AI customer support framework. Core value: "Paste your API, get an AI agent that executes actions." This plan incorporates all CEO review expansions (6 features) with re-sequenced phasing (core first). Timeline extended to 6-7 weeks per outside voice feedback. + +No code exists yet. Greenfield project. + +## Architecture Decisions + +``` +Customer → React Chat UI → FastAPI WebSocket → LangGraph Supervisor → Agents → MCP Tools → Client APIs + ↑ ↑ + Agent Registry interrupt() + (YAML config) (HITL safety) + ↑ + PostgresSaver + (checkpoint persistence) +``` + +| Decision | Choice | Rationale | +|----------|--------|-----------| +| Agent orchestration | `langgraph-supervisor` v1.1 | Built-in supervisor with middleware. Don't rebuild. [Layer 1] | +| MCP integration | `langchain-mcp-adapters` + `@tool` | MultiServerMCPClient for MCP, @tool for CLI/API. No custom base class. [Layer 1] | +| Checkpointer | PostgresSaver from day one (app + tests) | Phase 4 analytics/replay needs queryable data. Docker Compose. | +| LLM provider | LangChain `BaseChatModel` + env config | No custom wrapper. `LLM_PROVIDER` + `LLM_MODEL` env vars. [Layer 1] | +| Streaming | FastAPI WebSocket + `astream_events()` | Built-in. 
No custom streaming layer. [Layer 1] | +| OpenAPI import | Full MCP server generation + LLM classification + human review | Parse spec → generate tools → LLM classifies read/write/params → operator reviews | +| OpenAPI import UX | Async background task with WebSocket progress | Don't block chat during import | +| Replay | Custom paginated API endpoint | Not raw `get_state_history()`. Design for 200+ turn threads. | +| Interrupt TTL | Auto-cancel + retry offer after 30 min | Stale approvals are dangerous. Re-evaluate current state on retry. | +| Routing fallback | General-purpose fallback agent | Catches misroutes. TODO for routing accuracy eval. | +| Resolution metric | Tool call success + no escalation | Honest starting definition. Refine with customer satisfaction signals. | +| Cost tracking | LangChain callback logging tokens per conversation | Surface cost-per-resolution in analytics. | +| SSRF protection | Block private IPs + DNS rebinding protection | Mandatory for OpenAPI URL fetching. Build as standalone utility. | +| DB error handling | try/except around graph invocation | Return clear error message to user, don't fail silently. | + +## Project Structure + +``` +smart-support/ +├── backend/ +│ ├── app/ +│ │ ├── main.py # FastAPI app + WebSocket +│ │ ├── graph.py # LangGraph supervisor setup +│ │ ├── agents/ # Agent definitions + tools +│ │ ├── registry.py # YAML agent registry loader +│ │ ├── openapi/ # OpenAPI parser + MCP server generator +│ │ ├── replay/ # Replay API endpoint +│ │ ├── analytics/ # Analytics queries + endpoint +│ │ └── callbacks.py # Token usage logging callback +│ ├── agents.yaml # Agent registry config +│ ├── templates/ # Vertical templates (e-commerce.yaml, etc.) 
+│ └── tests/ +├── frontend/ # React chat UI + replay + dashboard +├── docker-compose.yml # Postgres + app +└── pyproject.toml +``` + +## Phasing (6-7 weeks) + +### Phase 1 (Weeks 1-3): Core Framework +- FastAPI backend with WebSocket for chat +- LangGraph supervisor with 2-3 demo agents (order lookup, FAQ, escalation) +- PostgresSaver checkpointer via Docker Compose +- YAML-based agent registry with validation +- React chat UI with streaming tokens +- Agent personality via YAML config +- Basic interrupt() flow for write operations +- Fallback agent for misrouted queries +- Token usage logging callback +- Try/except for DB errors +- **Integration checkpoint:** End of week 3, full chat loop works end-to-end + +### Phase 2 (Weeks 3-4): Multi-Agent + Safety +- Full supervisor routing with intent classification +- Webhook escalation (HTTP POST to configured URL + retry) +- Vertical templates (YAML configs for e-commerce, SaaS) +- Expired interrupt handling (auto-cancel + retry offer after 30-min TTL) +- **Integration checkpoint:** End of week 4, multi-agent routing + interrupt flow works + +### Phase 3 (Weeks 4-6): OpenAPI Auto-Discovery +- Parse OpenAPI 3.0 specs from user-provided URLs +- SSRF protection (block private IPs, DNS rebinding, URL allowlist) +- Generate full MCP servers wrapping each endpoint +- LLM-assisted endpoint classification (read/write, customer params, agent groupings) +- Operator review/correction UI for classifications +- Auto-generate agent YAML from classified spec +- Async import with WebSocket progress updates +- **Integration checkpoint:** End of week 6, paste a real API spec → tools work in chat + +### Phase 4 (Weeks 6-7): Analytics + Replay +- Custom paginated replay API endpoint +- Replay UI (step-by-step timeline in React) +- Analytics queries (resolution rate, agent usage, escalation %, cost-per-resolution) +- Analytics dashboard UI with zero-state handling +- Resolution rate = successful tool call + no escalation +- **Integration 
checkpoint:** End of week 7, full product demo ready + +### Phase 5 (Buffer): Polish + Demo Prep +- Error handling hardening +- Demo script and sample data +- Docker Compose for full-stack deployment + +## Tech Stack + +- Python 3.11+, FastAPI, LangGraph v1.1.0 +- langgraph-supervisor, langchain-mcp-adapters, langgraph-checkpoint-postgres v3.0.5 +- React (frontend), PostgreSQL 16 (via Docker Compose) +- Claude Sonnet 4.6 via `ChatAnthropic` (configurable via env) +- pytest + FastAPI TestClient for backend tests +- openapi-spec-validator for spec validation + +## NOT in scope + +- Authentication/authorization (deferred to pre-production) +- Multi-tenant architecture (deferred to first paid customer) +- CI/CD pipeline (manual deploy for prototype) +- Rate limiting (deferred to pre-production) +- Zendesk/Intercom marketplace integration (deferred to post-validation) +- Mobile-responsive chat UI (desktop-only for demo) +- Internationalization/i18n +- Billing/pricing infrastructure +- Distribution pipeline (manual Docker Compose deploy) + +## What already exists (reuse, don't rebuild) + +- `langgraph-supervisor` — agent orchestration +- `langgraph-checkpoint-postgres` — state persistence +- LangGraph `interrupt()` — human-in-the-loop +- `langchain-mcp-adapters` (`MultiServerMCPClient`) — MCP tool integration +- LangChain `BaseChatModel` — LLM provider abstraction +- FastAPI WebSocket + `astream_events()` — streaming +- `openapi-spec-validator` — OpenAPI spec validation + +## Testing Strategy + +TDD per phase. 80%+ coverage target. pytest + FastAPI TestClient. + +45 codepaths identified (33 code paths + 12 user flows, 6 E2E). + +Key test categories: +1. **Graph tests** — invoke supervisor with mock tools, assert routing + state +2. **MCP tool tests** — mock external HTTP, test structured responses +3. **WebSocket tests** — FastAPI TestClient, test message → response cycle +4. **Interrupt tests** — test approval, rejection, and TTL expiry flows +5. 
**OpenAPI tests** — test spec parsing, SSRF blocking, MCP generation
+6. **E2E tests** — 6 critical flows (happy path, cancel+approve, cancel+reject, multi-turn, OpenAPI import, replay)
+
+## Failure Modes
+
+| Codepath | Failure | Mitigation |
+|----------|---------|------------|
+| LLM API call | Timeout/rate limit | Error message to user |
+| MCP tool call | External API down | Escalation + error message |
+| Interrupt resume | 30-min TTL expired | Auto-cancel + retry offer |
+| PostgresSaver | DB connection lost | try/except + user-facing error |
+| OpenAPI URL fetch | SSRF attempt | Block private IPs + DNS rebinding |
+| Supervisor routing | Wrong agent | Fallback agent catches misroutes |
+| Webhook POST | Target unreachable | Retry with backoff + log |
+
+## Parallelization Strategy
+
+| Lane | Steps | Modules |
+|------|-------|---------|
+| A | Phase 1 backend + Phase 2 | backend/app/ |
+| B | Phase 1 frontend | frontend/ |
+| C | SSRF utility (standalone) | backend/app/openapi/ssrf.py |
+
+Launch A + B + C in parallel. Merge after Phase 1. Phases 3-4 are sequential (they depend on the core).
+
+## Verification
+
+1. `docker compose up` — Postgres + app starts
+2. Open `http://localhost:8000` — chat UI loads
+3. Send "What's the status of order 1042?" — get streaming response
+4. Send "Cancel order 1042" — get interrupt prompt → approve → confirmation
+5. `pytest --cov` — 80%+ coverage
+6. Paste sample OpenAPI spec → tools generated → chat uses them (Phase 3)
+7. View replay of completed conversation (Phase 4)
+8. 
View analytics dashboard (Phase 4) + +## GSTACK REVIEW REPORT + +| Review | Trigger | Why | Runs | Status | Findings | +|--------|---------|-----|------|--------|----------| +| CEO Review | `/plan-ceo-review` | Scope & strategy | 1 | CLEAR | 6 proposals, 6 accepted, 0 deferred | +| Codex Review | `/codex review` | Independent 2nd opinion | 0 | — | — | +| Eng Review | `/plan-eng-review` | Architecture & tests (required) | 2 | CLEAR | 10 issues, 0 critical gaps | +| Design Review | `/plan-design-review` | UI/UX gaps | 0 | — | — | + +- **OUTSIDE VOICE:** Claude subagent review found 10 issues. 3 cross-model tensions resolved (PostgresSaver timing, OpenAPI feasibility, timeline). 3 new findings adopted (routing fallback, resolution metric definition, LLM cost tracking). +- **UNRESOLVED:** 0 +- **VERDICT:** CEO + ENG CLEARED — ready to implement diff --git a/eng-review-test-plan.md b/eng-review-test-plan.md new file mode 100644 index 0000000..af11749 --- /dev/null +++ b/eng-review-test-plan.md @@ -0,0 +1,47 @@ +# Test Plan +Generated by /plan-eng-review on 2026-03-29 +Branch: unknown +Repo: smart-support + +## Affected Pages/Routes +- WebSocket `/ws` — Main chat endpoint. Test connection, message flow, streaming, interrupt responses, reconnection +- GET `/` — Chat UI serving. Test static file serving +- GET `/api/replay/{thread_id}` — Conversation replay. Test pagination, 404, structured timeline JSON +- GET `/api/analytics` — Analytics dashboard. Test resolution rate, agent usage, escalation %, zero state +- POST (webhook URL) — Escalation webhook. 
Test payload shape, retry on failure + +## Key Interactions to Verify +- Send message via WebSocket → receive streaming tokens back +- Send message triggering write action → receive interrupt prompt → send approval → receive confirmation +- Send message triggering write action → receive interrupt prompt → send rejection → receive cancellation +- Multi-turn: send order lookup, then "cancel that one" → agent resolves reference from context +- Paste OpenAPI spec URL → tools generated → agents registered → chat works with new tools +- View conversation replay → see step-by-step agent decisions and tool calls +- View analytics dashboard → see resolution rate, agent usage metrics + +## Edge Cases +- Empty message → graceful error response +- Very long message (10K+ chars) → handled without crash +- Rapid-fire messages → no race condition in graph execution +- WebSocket disconnect mid-stream → server cleans up gracefully, client reconnects + resumes interrupt +- LLM API timeout → error message returned to client +- Invalid YAML agent config → clear startup error with file/line reference +- MCP tool timeout → timeout error returned to user with agent name +- Cancel already-cancelled order → appropriate error message +- Ambiguous intent with no context → clarifying question asked +- OpenAPI spec with 100+ endpoints → generation completes without timeout +- Invalid/malformed OpenAPI spec → clear error with what's wrong +- SSRF attempt (private IP, localhost, 169.254.x) → blocked with clear error +- DNS rebinding attack → blocked +- Replay of thread with 200+ turns → pagination works, no slow query +- Analytics with no conversations → zero state displayed correctly +- Webhook URL unreachable → retry with backoff, log failure +- Multi-intent request ("cancel my order and give me a discount") → sequenced correctly + +## Critical Paths (E2E) +- Happy path: "What's the status of order 1042?" 
→ lookup → answer +- Cancel with approval: "Cancel order 1042" → interrupt → approve → cancel confirmed +- Cancel with rejection: "Cancel order 1042" → interrupt → reject → no action taken +- Multi-turn context: "Check order 1042" then "cancel that one" → correct entity resolution +- OpenAPI import: paste spec URL → tools generated → use new tool in chat +- Conversation replay: select completed conversation → step-by-step replay renders correctly