feat: initial project setup with planning docs
Smart Support - AI customer service action layer framework. Includes design doc, CEO plan, eng review, test plan, and README.
# Design: Smart Support — AI Customer Service Action Layer

Generated by /office-hours on 2026-03-28
Branch: unknown
Repo: smart-support
Status: APPROVED
Mode: Startup

## Problem Statement

Existing customer support tools (Zendesk, Intercom, Ada) handle FAQ-style queries well but plateau at 20-30% automation because they can't execute actions in internal systems. The remaining 70% of support volume requires a human agent to manually log into internal tools, look up orders, cancel subscriptions, apply discounts, etc. Smart Support is the "action layer" — a multi-agent AI system that connects to internal services via MCP to actually perform these operations, complementing (not replacing) existing support platforms.

## Demand Evidence

- Founder's own pain (observed, not firsthand in support operations)
- No paying customers or pilots yet
- No specific companies contacted
- Market evidence: Zendesk/Intercom AI agents plateau at 20-30% automation (Qualtrics, Swifteq). Klarna reversed course after replacing 700 human agents with AI — quality collapsed because the AI could answer questions but couldn't reliably execute workflows
- $10.9B market growing at a 40% CAGR, but the "action execution" sub-segment is underserved by incumbents

**Demand risk: HIGH.** The thesis is sharp but unvalidated with real buyers. Priority #1 after this design: customer conversations.

## Status Quo

Companies currently handle the "action" part of support via:

1. Human agents manually switching between Zendesk/Intercom and internal tools (Shopify admin, CRM, billing systems)
2. Internal dashboards and admin panels built by engineering teams
3. Macros and automations that handle simple cases (auto-refund under $X) but can't reason about context
4. Retool-style internal tools, which still require human judgment to select the right action

The gap: no tool bridges "understanding what the customer wants" to "executing the action in the internal system" autonomously.

## Target User & Narrowest Wedge

**Buyer type:** Head of Customer Experience or VP of Operations at a mid-size e-commerce company (Shopify-based, 500-5000 orders/day, 5-20 support agents).

**What gets them promoted:** Reducing support cost per ticket while maintaining or improving CSAT scores.

**What gets them fired:** Customer churn from slow resolution times, or a support incident that goes viral.

**Narrowest wedge:** E-commerce order management — check order status, cancel orders, apply discounts/credits, track shipments. These are the highest-volume, most repetitive "action" tasks in e-commerce support.

**Note:** No specific buyer identified by name yet. This is a critical gap to close within the first 2 weeks.

## Constraints

- Must complement existing support tools (Zendesk, Intercom), not replace them
- Must use LangGraph for multi-agent orchestration (founder's architecture choice)
- Must use MCP (Model Context Protocol) for internal service connectivity, with a pluggable connector pattern (no specific vertical baked in)
- Must manage session context across multi-turn conversations
- Must include human-in-the-loop confirmation for destructive actions (cancellations, refunds)
- Framework-first: no specific vertical in the prototype. Client-specific MCP connectors built per engagement.

## Premises

1. Existing support tools handle FAQ well but CAN'T execute internal system actions — **AGREED**
2. The value is in the "action layer" connecting to internal services via MCP — **AGREED**
3. Multi-agent architecture is the right approach (different actions need different permissions and safety checks) — **AGREED** (challenged by second opinion as premature for prototype stage; founder defended with conviction but without specific reasoning)
4. Session context management matters for multi-step action workflows — **AGREED**
5. Horizontal is the long-term vision; vertical e-commerce is the prototype scope — **AGREED**. The prototype does NOT need to generalize. Build a tight Shopify integration first, abstract later.

**Unvalidated premise (HIGH RISK):** That the founder can compete in this market without existing customer relationships, domain expertise, or proprietary training data.

## Cross-Model Perspective

Independent cold read (Claude subagent):

- **Steelman:** Most enterprise support costs aren't in answering questions — they're in the 5-10 minute human tasks that follow. A thin MCP-based action layer captures that tail without displacing the incumbent. If MCP standardizes the integration, the cost drops enough for per-action pricing at SMB scale.
- **Key insight:** "I have a type in mind" is the whole problem. The action layer thesis is technically sharp but commercially unanchored. "Any company with internal services" is a TAM slide, not a buyer.
- **Challenged premise:** Multi-agent is architecturally correct but commercially wrong at prototype stage. Multi-agent requires buyers to trust your orchestration with production credentials across multiple systems simultaneously — that's a security review and procurement cycle before you've proven anything.
- **48-hour prototype suggestion:** One Shopify merchant, one action (cancel order), LangGraph single agent, MCP wrapping Shopify Admin API, Slack as UI, human-in-the-loop confirmation. Goal: 90-second video.

## Approaches Considered

### Approach A: One Vertical, Full Stack (CHOSEN)

Multi-agent LangGraph system targeting e-commerce. Three agents (order lookup, order actions, discount/refund), MCP connectors to Shopify Admin API, session context via Redis, web chat UI. Deploy as a working demo for Shopify merchants.

- Effort: M (2-3 weeks)
- Risk: Medium
- Proves multi-agent orchestration end-to-end
- Concrete demo for a specific buyer type
- Shopify has 4.6M merchants

### Approach B: Horizontal Framework + One Demo
|
||||
|
||||
Build the multi-agent orchestrator as a generic framework first (agent registry, MCP tool discovery, session manager, permission system), then one vertical demo on top.
|
||||
|
||||
- Effort: L (4-6 weeks)
|
||||
- Risk: High — more code before first customer feedback
|
||||
- Framework without customers is just code
|
||||
|
||||
### Approach C: Video-First Prototype
|
||||
|
||||
Thinnest possible multi-agent demo, hardcoded to one test merchant, no auth, minimal UI. Goal: 90-second screen recording showing real actions.
|
||||
|
||||
- Effort: S (1 week)
|
||||
- Risk: Low
|
||||
- Fastest to customer conversations, but not production-ready
|
||||
|
||||
## Recommended Approach

**Revised Approach: Pluggable Multi-Agent Framework.**

The product is the framework itself — not a Shopify integration or any specific vertical. When a client comes in, we build (or they build) MCP connectors for their systems. The framework handles everything else: chat, routing, context, safety.

### Core Components (prototype scope)

**1. Chat Interface**
- Web-based chat UI (HTML + fetch or lightweight React). This is a real product surface, not throwaway scaffolding.
- Supports multi-turn conversations with streaming responses.
- Displays agent actions and confirmation prompts inline.

**2. Agent Router (Orchestrator)**
- LangGraph graph that classifies customer intent and routes to the correct agent.
- Intent classification via LLM structured output.
- Multi-intent requests (e.g., "cancel my order and give me a discount") are sequenced by the orchestrator. Ambiguous or conflicting intents escalate to a human.
- **Agent registry:** Agents are registered declaratively (name, description, available MCP tools, permission level). The router uses agent descriptions to select the right one. Adding a new agent = adding a config entry + connecting its MCP tools.

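The registry described above fits in a few lines of plain Python. A minimal sketch, assuming nothing beyond the design's own fields (name, description, MCP tools, permission level); names like `AgentSpec` and `register_agent` are hypothetical, not LangGraph APIs:

```python
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    name: str
    description: str          # the router's LLM uses this to pick an agent
    mcp_tools: list[str] = field(default_factory=list)
    permission: str = "read"  # "read" agents skip confirmation; "write" agents require it

AGENT_REGISTRY: dict[str, AgentSpec] = {}

def register_agent(spec: AgentSpec) -> None:
    AGENT_REGISTRY[spec.name] = spec

# Adding a new agent = a config entry plus its MCP tools:
register_agent(AgentSpec(
    name="order_lookup",
    description="Answers questions about order status and shipment tracking.",
    mcp_tools=["orders.get", "shipments.track"],
    permission="read",
))
register_agent(AgentSpec(
    name="order_actions",
    description="Cancels or modifies orders on the customer's behalf.",
    mcp_tools=["orders.cancel", "orders.update"],
    permission="write",
))

def registry_prompt() -> str:
    """Render agent descriptions for the router LLM's classification prompt."""
    return "\n".join(f"- {s.name}: {s.description}" for s in AGENT_REGISTRY.values())
```

The router then asks the LLM (via structured output) to choose one of the registered names given `registry_prompt()` and the conversation so far.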
**3. Context Manager (Session State)**
- In-memory Python dict for prototype phase. Redis introduced before any external pilot.
- Session keyed by conversation ID, 30-minute sliding window TTL (reset on each turn).
- Stores: conversation history, resolved entities (e.g., "that order" → order #1042), customer profile, current agent state.
- Pending human-in-the-loop confirmations extend the TTL until resolved or cancelled with a user-facing notice.
- Context is passed to agents on each turn so they have full conversation awareness.

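A hedged sketch of the in-memory store: the 30-minute sliding TTL and the stored fields come from the bullets above, while the class and method names are illustrative (the injectable clock just makes the TTL testable):

```python
import time

SESSION_TTL_SECONDS = 30 * 60  # 30-minute sliding window, per the design

class ContextManager:
    def __init__(self, clock=time.monotonic):
        self._clock = clock                       # injectable for testing
        self._sessions: dict[str, dict] = {}
        self._expires: dict[str, float] = {}

    def get(self, conversation_id: str) -> dict:
        """Return the session, creating a fresh one if missing or expired."""
        now = self._clock()
        if self._expires.get(conversation_id, float("-inf")) < now:
            self._sessions[conversation_id] = {
                "history": [],              # conversation turns
                "entities": {},             # resolved references, e.g. "that order" -> "#1042"
                "profile": {},              # customer profile
                "agent_state": None,        # current agent state
                "pending_confirmation": None,
            }
        # Sliding window: every access resets the TTL.
        self._expires[conversation_id] = now + SESSION_TTL_SECONDS
        return self._sessions[conversation_id]

    def record_turn(self, conversation_id: str, role: str, text: str) -> None:
        self.get(conversation_id)["history"].append({"role": role, "text": text})
```

Swapping the dicts for Redis keys with `EXPIRE` preserves the same interface when moving past the prototype.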
**4. Pluggable MCP Layer**
- Framework defines a standard interface for MCP tool connectors.
- Each client engagement produces a set of MCP servers wrapping their specific systems (Shopify Admin API, custom REST APIs, internal gRPC services, databases, etc.).
- **No specific MCP connectors are built in the prototype.** Instead, provide 1-2 example/mock MCP tools (e.g., a mock "order lookup" and "order cancel") to demonstrate the plug-in pattern and enable end-to-end testing.
- When onboarding a real client: build MCP wrappers for their APIs, register them with the agent registry, done.

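The plug-in pattern with the two mock tools mentioned above can be sketched as follows. This only illustrates the connector interface the framework would define; the real MCP SDK has its own server and tool abstractions, and all names here (`MCPTool`, `MockTool`) are hypothetical:

```python
from typing import Any, Callable, Protocol

class MCPTool(Protocol):
    """The framework's standard connector interface (structural typing)."""
    name: str
    def call(self, **kwargs: Any) -> dict: ...

class MockTool:
    """In-memory stand-in used for prototype demos and end-to-end tests."""
    def __init__(self, name: str, handler: Callable[..., dict]):
        self.name = name
        self._handler = handler

    def call(self, **kwargs: Any) -> dict:
        return self._handler(**kwargs)

# Fake backing data for the demo.
_FAKE_ORDERS = {"1042": {"status": "shipped"}}

# The two mock tools from the design: order lookup (read) and order cancel (write).
mock_order_lookup = MockTool(
    "orders.get",
    lambda order_id: _FAKE_ORDERS.get(order_id, {"error": "not_found"}),
)
mock_order_cancel = MockTool(
    "orders.cancel",
    lambda order_id: {"order_id": order_id, "status": "cancelled"},
)

# "Register" the tools: onboarding a real client means adding entries here
# backed by real MCP servers instead of mocks.
TOOLS: dict[str, MCPTool] = {t.name: t for t in (mock_order_lookup, mock_order_cancel)}
```

A real engagement replaces each `MockTool` with an MCP client call, but nothing upstream (router, agents, safety layer) needs to change.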
**5. Safety Layer**
- Human-in-the-loop confirmation for write/destructive operations, surfaced as a confirmation prompt in the chat UI.
- Permission boundaries per agent (read-only agents skip confirmation, write agents require it).
- All actions logged with action ID, timestamp, agent, parameters, and outcome.
- On MCP call failure: log error, escalate to human with full context.

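The gate described above reduces to one function: write agents pause for confirmation, every executed action is audit-logged, and MCP failures escalate. A sketch under the design's rules, with illustrative names throughout:

```python
import datetime
import uuid

AUDIT_LOG: list[dict] = []  # action ID, timestamp, params, outcome (per the design)

def execute_action(agent_permission: str, tool_name: str, params: dict,
                   confirmed: bool, run_tool) -> dict:
    """Gate an MCP call behind human-in-the-loop confirmation for write agents."""
    if agent_permission == "write" and not confirmed:
        # Surface a confirmation prompt in the chat UI; nothing executes yet.
        return {"status": "needs_confirmation", "tool": tool_name, "params": params}
    try:
        outcome = run_tool(**params)
        status = "ok"
    except Exception as exc:
        # On MCP failure: log the error and escalate to a human with full context.
        outcome, status = {"error": str(exc)}, "escalated_to_human"
    AUDIT_LOG.append({
        "action_id": str(uuid.uuid4()),
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "tool": tool_name,
        "params": params,
        "status": status,
        "outcome": outcome,
    })
    return {"status": status, "outcome": outcome}
```

Read-only agents pass straight through (`agent_permission="read"`), so only destructive paths pay the confirmation round-trip.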
### Architecture Diagram

```
Customer Chat UI
       │
       ▼
 FastAPI Server
       │
       ▼
Context Manager ◄── session store (in-memory / Redis)
       │
       ▼
Agent Router (LangGraph Orchestrator)
       │
       ├──► Agent A ──► MCP Tools (client-specific)
       ├──► Agent B ──► MCP Tools (client-specific)
       └──► Agent C ──► MCP Tools (client-specific)
                               │
                               ▼
                Client's Internal Systems
                (Shopify, custom APIs, etc.)
```

### Tech Stack

- Python (LangGraph, FastAPI)
- In-memory Python dict (prototype) / Redis (post-pilot)
- MCP SDK (for building client-specific connectors)
- LLM: Claude Sonnet 4.6 via Anthropic API. Abstracted behind a provider interface (`complete(messages, tools) -> response`) so it can be swapped.
- Web chat frontend (HTML + fetch or lightweight React)

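The provider abstraction pinned above (`complete(messages, tools) -> response`) might look like the sketch below. The Anthropic calls follow the public SDK, but treat the wiring and the model id as assumptions; `FakeProvider` is a hypothetical stub showing the swap:

```python
from typing import Protocol

class LLMProvider(Protocol):
    def complete(self, messages: list[dict], tools: list[dict]) -> dict: ...

class AnthropicProvider:
    """Production provider. Model id is an assumption; use whatever your account exposes."""
    def __init__(self, model: str = "claude-sonnet-4-6"):
        import anthropic  # deferred import so tests can run without the SDK
        self._client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
        self._model = model

    def complete(self, messages: list[dict], tools: list[dict]) -> dict:
        resp = self._client.messages.create(
            model=self._model, max_tokens=1024, messages=messages, tools=tools,
        )
        return resp.model_dump()

class FakeProvider:
    """Swap-in stub for tests and offline demos: returns a canned response."""
    def __init__(self, canned: dict):
        self._canned = canned

    def complete(self, messages: list[dict], tools: list[dict]) -> dict:
        return self._canned
```

Because agents only depend on the `LLMProvider` protocol, swapping vendors (or pinning a fake for CI) is a one-line change at construction time.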
### Phasing

- **Phase 1 (Week 1):** Chat UI + Context Manager + basic LangGraph orchestrator with a single mock agent. Proves: chat works, context persists across turns, agent receives full conversation history.
- **Phase 2 (Week 2):** Agent Router with multi-agent support + agent registry. Add 2-3 mock agents with different capabilities. Proves: router correctly selects agent based on intent, multi-agent handoff works.
- **Phase 3 (Week 3):** Safety layer (human-in-the-loop confirmation) + pluggable MCP interface with example mock tools. Proves: write operations require confirmation, new MCP tools can be added without changing framework code.
- **Phase 4 (client engagement):** Build real MCP connectors for the first client's systems. This is where Shopify, custom APIs, etc. get wired in.
- **Fallback:** If multi-agent graph complexity blocks progress past day 5, fall back to single-agent with tool routing and refactor to multi-agent post-validation.

**Effort estimate:** 3 weeks for a full-time senior Python engineer. LangGraph experience assumed; add 3-4 days if new to LangGraph. No client-specific MCP connectors included in this estimate — those are per-client engagement work.

## Open Questions

1. **Pricing model:** Per-action? Per-seat? Per-resolution? Per-action aligns value with usage but creates billing uncertainty for buyers (the same problem Intercom has). Defer until first pilot — let the customer's willingness to pay inform the model.
2. **Multi-tenant architecture:** Single-tenant for prototype. Multi-tenant architecture decision deferred until first paid customer.

## Success Criteria

### Engineering Done
1. Working framework: Chat UI + Agent Router + Context Manager functioning end-to-end with mock agents
2. Multi-agent routing: router correctly selects agent based on conversation intent
3. Session context: agent correctly resolves references ("cancel that one") across turns using the context manager
4. Human-in-the-loop: write operations require confirmation before execution
5. Pluggable MCP: new MCP tools can be added via config without changing framework code
6. 90-second screen recording of the framework in action with mock agents

### Business Validation
1. At least 5 customer conversations with real e-commerce operators within 2 weeks of demo completion
2. At least 1 paid pilot within 4 weeks of demo completion

## Distribution Plan

- **Initial:** Direct demo to potential clients via cold outreach, showing the framework with mock agents and explaining per-client MCP customization
- **Demo hosting:** Deploy on a cloud provider (Fly.io, Railway, or AWS) with a shareable demo link
- **Video:** 90-second screen recording of the framework in action for async sales
- **Future:** Self-service onboarding where clients can configure their own MCP connectors; Zendesk/Intercom marketplace integrations
- **CI/CD:** Deferred to post-validation. Manual deploy for prototype phase.

## Dependencies

- LangGraph (open source)
- MCP SDK (open source, for building client-specific connectors)
- LLM API access (Claude Sonnet 4.6 via Anthropic API)
- Domain for demo hosting

## GSTACK REVIEW REPORT

| Review | Trigger | Why | Runs | Status | Findings |
|--------|---------|-----|------|--------|----------|
| CEO Review | `/plan-ceo-review` | Scope & strategy | 1 | CLEAR | 6 proposals, 6 accepted, 0 deferred. Scope expanded: OpenAPI auto-discovery, analytics dashboard, conversation replay, agent personality, webhook escalation, vertical templates. |
| Codex Review | `/codex review` | Independent 2nd opinion | 0 | — | — |
| Eng Review | `/plan-eng-review` | Architecture & tests (required) | 1 | CLEAR | 4 issues, 0 critical gaps. Scope reduced: LangGraph built-ins replace 3/5 custom components. Note: ran before CEO expansion. |
| Design Review | `/plan-design-review` | UI/UX gaps | 0 | — | — |
| Outside Voice | via eng review | Independent challenge | 1 | issues_found | 8 issues: latency (kept supervisor), interrupt resume (added), YAML registry (kept), MCP interface (TODO), auth gap (accepted for demo) |

**VERDICT:** CEO + ENG CLEARED. Eng review may be stale (ran before CEO expanded scope with 6 new features). Consider re-running `/plan-eng-review`. Run `/plan-design-review` for the 3 UI surfaces (chat, analytics, replay).

## Reviewer Concerns

The following issues were flagged by adversarial review and deferred to the implementation phase:

1. **Error taxonomy for MCP failures:** Define retriable vs. non-retriable errors and a retry policy (e.g., 3 attempts with exponential backoff for transient errors, immediate escalation for auth failures). Address when building the first real MCP connector for a client.
2. **Destructive action boundary:** Create an explicit table of which operations require human-in-the-loop confirmation. Default rule: all write operations require confirmation; all read operations do not. Client-specific overrides configurable per agent.
3. **Multi-intent atomicity:** Clarify whether multi-intent sequences are "all-or-nothing" or "best-effort" with partial-failure escalation. Address during orchestrator implementation.
4. **External integration (Zendesk/Intercom):** Webhook integration flow (payload shape, async acknowledgment, response posting) to be designed when a client requires it. Not in prototype scope.

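Concern 1's retry policy is small enough to sketch now. This follows the numbers stated above (3 attempts, exponential backoff for transient errors, immediate escalation for auth failures); the exception classes are illustrative placeholders for whatever the real MCP connector raises:

```python
import time

class TransientMCPError(Exception):
    """Retriable: timeouts, rate limits, transient 5xx from the wrapped API."""

class AuthMCPError(Exception):
    """Non-retriable: bad or expired credentials. Escalate immediately."""

def call_with_retry(fn, attempts: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Run an MCP call with up to `attempts` tries and exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except AuthMCPError:
            raise  # non-retriable: escalate to a human immediately
        except TransientMCPError:
            if attempt == attempts:
                raise  # retries exhausted: escalate with full context
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
```

The `sleep` parameter is injectable so the backoff schedule is testable without real delays.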
## Appendix: Founder Action Plan (The Assignment)

**Do not write more code this week.** Instead:

1. Go to the Shopify Community forums and find 5 merchants who have posted about support tool frustrations in the last 30 days. DM them. Ask: "What does your support team spend the most time doing in the Shopify admin panel during a customer conversation?"
2. Find 3 customer support managers on LinkedIn at Shopify-based e-commerce companies (100-500 employees). Send a 3-sentence cold message: "I'm building an AI agent that can cancel orders, apply discounts, and look up shipments automatically during support conversations. Would you spend 15 minutes showing me how your team handles these tasks today?"
3. If even ONE person responds with enthusiasm, you have a design partner. Build for them specifically.

The code can wait. The customer can't.

## What I noticed about how you think

- You clarified your positioning mid-session — shifting from "competing with Zendesk" to "complementing Zendesk as the action layer." That pivot from replacement to complement is a much sharper thesis, and you got there on your own when pushed.
- You chose multi-agent architecture and defended it against both my skepticism and the independent second opinion's challenge. You didn't articulate the specific reasoning, but you held your ground. Next time someone challenges this, be ready with the "why" — "because different actions need different permission boundaries and different failure modes" is the answer.
- Your examples were concrete: "check orders, cancel orders, give discounts." You think in terms of specific actions, not abstract capabilities. That's good product instinct for a developer tool.
- You chose the full vertical (Approach A) over the quick prototype (Approach C). That tells me you want to build something real, not just validate. I respect that, but I'll push back: the risk is that you spend 3 weeks building something beautiful that no merchant wants. Talk to merchants this week. The code will be better for it.