refactor: engineering improvements -- API versioning, structured logging, Alembic, error standardization, test coverage

- API versioning: all REST endpoints prefixed with /api/v1/
- Structured logging: replaced stdlib logging with structlog (console/JSON modes)
- Alembic migrations: versioned DB schema with initial migration
- Error standardization: global exception handlers for consistent envelope format
- Interrupt cleanup: asyncio background task for expired interrupt removal
- Integration tests: +30 tests (analytics, replay, openapi, error, session APIs)
- Frontend tests: +57 tests (all components, pages, useWebSocket hook)
- Backend: 557 tests, 89.75% coverage | Frontend: 80 tests, 16 test files
Yaojia Wang
2026-04-06 23:19:29 +02:00
parent af53111928
commit f0699436c5
59 changed files with 2846 additions and 149 deletions

@@ -99,7 +99,12 @@ smart-support/
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI + WebSocket entry point
│ │ ├── graph.py # LangGraph Supervisor configuration
│ │ ├── graph.py # LangGraph Supervisor construction
│ │ ├── graph_context.py # GraphContext: typed wrapper around the graph, classifier, and registry
│ │ ├── ws_handler.py # WebSocket message dispatch + rate limiting
│ │ ├── ws_context.py # WebSocketContext: WS dependency bundle
│ │ ├── auth.py # API key authentication middleware
│ │ ├── api_utils.py # Shared API response utilities (envelope)
│ │ ├── agents/ # Agent definitions + tool bindings
│ │ ├── registry.py # YAML agent registry loader
│ │ ├── openapi/ # OpenAPI parsing + MCP server generation
@@ -139,7 +144,11 @@ smart-support/
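| Module | Responsibility |
|------|------|
| main.py | Application entry point, WebSocket endpoint, static file serving |
| WebSocket Handler | Bidirectional communication: receives user messages, streams tokens back, handles interrupt responses |
| auth.py | API key authentication: admin endpoints use the `X-API-Key` header, WebSocket uses the `?token=` query param |
| ws_handler.py | Bidirectional communication: receives user messages, streams tokens back, handles interrupt responses |
| graph_context.py | Typed wrapper: binds the compiled graph to the classifier and registry, replacing monkey-patching |
| ws_context.py | Dependency bundle: packs the 9 dependencies needed for WebSocket handling into a single immutable object |
| api_utils.py | Shared response format: a unified `envelope()` function |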
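
As a rough illustration of the shared response format, the sketch below assumes the `success` / `data` / `error` shape that the development log in this commit gives for error responses. The signature of `envelope()` shown here is hypothetical; the real helper in `api_utils.py` may take different arguments.

```python
from typing import Any, Optional


def envelope(data: Any = None, error: Optional[str] = None) -> dict:
    """Wrap a payload in a success/data/error envelope (assumed shape)."""
    succeeded = error is None
    return {
        "success": succeeded,
        # Error responses carry no data, matching the documented error shape.
        "data": data if succeeded else None,
        "error": error,
    }


ok = envelope({"job_id": "abc123"})
failed = envelope(error="Job not found")
```

Keeping success and error responses in one shape lets clients branch on a single `success` field instead of inspecting status codes and body layouts separately.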
### 2.3 Agent Orchestration Layer (LangGraph)
@@ -427,6 +436,19 @@ CREATE INDEX idx_interrupts_ttl ON interrupts(ttl_expires_at)
WHERE status = 'pending';
```
#### sessions (custom - session state persistence)
```sql
-- PostgreSQL session state management for multi-worker deployments
-- PgSessionManager uses this table in place of an in-memory dict
CREATE TABLE sessions (
thread_id TEXT PRIMARY KEY,
last_activity TIMESTAMPTZ NOT NULL DEFAULT NOW(),
has_pending_interrupt BOOLEAN NOT NULL DEFAULT FALSE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
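
A minimal sketch of how the `last_activity` column could drive session expiry, assuming the timeout comes from `SESSION_TTL_MINUTES` (default 30, per the deployment notes). The function name and the comparison rule are illustrative assumptions; how `PgSessionManager` actually decides expiry is not shown in this commit.

```python
from datetime import datetime, timedelta, timezone


def is_session_expired(
    last_activity: datetime, now: datetime, ttl_minutes: int = 30
) -> bool:
    """Return True once more than ttl_minutes have passed since last_activity."""
    return now - last_activity > timedelta(minutes=ttl_minutes)


# Both timestamps are timezone-aware, mirroring the TIMESTAMPTZ column type.
last_seen = datetime(2026, 4, 6, 12, 0, tzinfo=timezone.utc)
still_active = is_session_expired(last_seen, last_seen + timedelta(minutes=29))
timed_out = is_session_expired(last_seen, last_seen + timedelta(minutes=31))
```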
#### analytics_events (custom - analytics event stream)
```sql

@@ -54,11 +54,19 @@ Set these in production (never commit secrets):
| `ANTHROPIC_API_KEY` | Yes* | LLM provider API key |
| `LLM_PROVIDER` | Yes | `anthropic`, `openai`, or `google` |
| `LLM_MODEL` | Yes | Model name for your provider |
| `ADMIN_API_KEY` | Recommended | API key for admin endpoints (analytics, replay, openapi, WS). Leave empty to disable auth (dev mode only) |
| `WEBHOOK_URL` | No | Escalation notification endpoint |
| `SESSION_TTL_MINUTES` | No | Session timeout (default: 30) |
*Or `OPENAI_API_KEY` / `GOOGLE_API_KEY` depending on `LLM_PROVIDER`.
### Authentication
When `ADMIN_API_KEY` is set, all admin REST endpoints require the `X-API-Key` header,
and WebSocket connections require a `?token=<key>` query parameter.
When unset or empty, authentication is disabled (suitable for local development only).
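
The rule above can be sketched as a single check. This is an illustration only: the function name is hypothetical, and the constant-time comparison via `hmac.compare_digest` is an assumption about good practice, not a statement about what `auth.py` does.

```python
import hmac
from typing import Optional


def is_authorized(presented_key: Optional[str], admin_api_key: str) -> bool:
    """Return True when the request may proceed."""
    if not admin_api_key:
        # Unset or empty ADMIN_API_KEY: authentication is disabled (dev mode only).
        return True
    if presented_key is None:
        # Auth is enabled but the header or query parameter was missing.
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(presented_key.encode(), admin_api_key.encode())
```

The same check would apply to both transports: the value comes from the `X-API-Key` header for REST requests and from the `?token=` query parameter for WebSocket connections.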
### HTTPS
For production, place a reverse proxy (nginx, Caddy, or a load balancer) in
@@ -87,10 +95,12 @@ cat backup.sql | docker compose exec -T postgres psql -U smart_support smart_sup
### Scaling
The backend is stateless (session state is in PostgreSQL via LangGraph's
PostgresSaver). You can run multiple backend replicas behind a load balancer.
The backend supports multi-worker deployments. LangGraph session state is
persisted in PostgreSQL via PostgresSaver. For full horizontal scaling, use
`PgSessionManager` and `PgInterruptManager` (instead of the default in-memory
managers) to share session and interrupt state across workers.
The WebSocket connections are session-specific. Use sticky sessions or a shared
WebSocket connections are session-specific. Use sticky sessions or a shared
session backend if load balancing WebSockets across multiple instances.
## Manual / Development Setup
@@ -139,7 +149,7 @@ GET /api/health
Response:
```json
{"status": "ok", "version": "0.5.0"}
{"status": "ok", "version": "0.6.0"}
```
### WebSocket health

@@ -86,7 +86,21 @@ Content-Type: application/json
POST /api/openapi/jobs/{job_id}/approve
```
No request body. Changes the job status to `approved`.
No request body. Generates tool code for each classified endpoint and produces
an agent YAML configuration. Response includes `generated_tools_count`.
Response:
```json
{
"job_id": "abc123",
"status": "approved",
"spec_url": "https://api.example.com/openapi.yaml",
"total_endpoints": 5,
"classified_count": 5,
"error_message": null,
"generated_tools_count": 5
}
```
## Access Type Classification

@@ -0,0 +1,76 @@
# Engineering Improvements -- Development Log
> Status: COMPLETED
> Branch: `eng/engineering-improvements`
> Date started: 2026-04-06
> Date completed: 2026-04-06
## What Was Built
### Phase 1: Quick Wins (no new deps)
1. **Interrupt Cleanup Background Task** -- Added asyncio background task in lifespan that calls `interrupt_manager.cleanup_expired()` every 60 seconds. Prevents unbounded memory growth from expired interrupts.
2. **API Versioning** -- All REST endpoints prefixed with `/api/v1/` (was `/api/`). Updated 4 router prefixes, Docker healthcheck, all frontend fetch URLs, and all test assertions. WebSocket `/ws` endpoint unchanged.
3. **Error Response Standardization** -- Added global exception handlers for `HTTPException`, `RequestValidationError`, and `Exception`. All error responses now use the same envelope format as success responses: `{"success": false, "data": null, "error": "..."}`.
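
The cleanup task from item 1 can be sketched with plain asyncio. Only `cleanup_expired()` and the 60-second interval come from this log; the stand-in manager class, the function names, and the assumption that `cleanup_expired()` is a synchronous call are illustrative.

```python
import asyncio
import contextlib


class FakeInterruptManager:
    """Hypothetical stand-in; only cleanup_expired() is named in the log."""

    def __init__(self) -> None:
        self.calls = 0

    def cleanup_expired(self) -> int:
        self.calls += 1
        return 0  # number of interrupts removed (assumed return value)


async def cleanup_loop(manager, interval_seconds: float = 60.0) -> None:
    """Call manager.cleanup_expired() once per interval until cancelled."""
    while True:
        await asyncio.sleep(interval_seconds)
        try:
            manager.cleanup_expired()
        except Exception:
            # A failed sweep should not kill the loop; real code would log it.
            pass


async def run_demo() -> int:
    manager = FakeInterruptManager()
    # Startup: a lifespan handler would create the task here.
    task = asyncio.create_task(cleanup_loop(manager, interval_seconds=0.01))
    await asyncio.sleep(0.05)  # stands in for the application's lifetime
    # Shutdown: cancel the task and wait for it to finish.
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return manager.calls


calls = asyncio.run(run_demo())
```

Cancelling and then awaiting the task on shutdown keeps the event loop from closing with the sweep still pending.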
### Phase 2: Medium Items (new deps)
4. **Alembic Database Migrations** -- Replaced inline DDL in `setup_app_tables()` with versioned Alembic migrations. Initial migration `001_initial_schema.py` captures all 4 tables + ALTER TABLE migration. `setup_app_tables()` preserved for tests. Production uses `run_alembic_migrations()`.
5. **Structured Logging** -- Replaced stdlib `logging.getLogger()` with `structlog.get_logger()` across 10 files. Added `logging_config.py` with console (dev) and JSON (production) modes. Configurable via `LOG_FORMAT` env var.
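
To show the idea of switching output modes from `LOG_FORMAT`, here is a sketch that uses only the standard library as a stand-in. It is not the structlog configuration in `logging_config.py`, and the accepted values `console` and `json` are assumptions.

```python
import json
import logging
import os


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname.lower(),
                "logger": record.name,
                "event": record.getMessage(),
            }
        )


def build_handler(log_format: str) -> logging.Handler:
    """Return a stream handler for the requested mode (assumed values)."""
    handler = logging.StreamHandler()
    if log_format == "json":
        # Production mode: machine-readable output for log aggregators.
        handler.setFormatter(JsonFormatter())
    else:
        # Dev mode: short human-readable lines.
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    return handler


handler = build_handler(os.environ.get("LOG_FORMAT", "console"))
```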
### Phase 3: Test Coverage
7. **Integration Tests (+30)** -- Created 5 new test files: analytics API, replay API, OpenAPI API, error responses, session/interrupt lifecycle. Uses httpx.AsyncClient with ASGITransport for full API layer testing.
8. **Frontend Tests (+57)** -- Created 12 new test files covering all components (ChatInput, ChatMessages, InterruptPrompt, ErrorBanner, NavBar, MetricCard, ReplayTimeline, AgentAction, Layout), pages (ChatPage, ReviewPage), and hooks (useWebSocket).
## Code Structure
### New files created
- `backend/app/logging_config.py` -- structlog configuration
- `backend/alembic.ini` -- Alembic config
- `backend/alembic/env.py` -- Migration environment
- `backend/alembic/versions/001_initial_schema.py` -- Initial migration
- `backend/tests/unit/test_interrupt_cleanup.py` (3 tests)
- `backend/tests/unit/test_error_responses.py` (6 tests)
- `backend/tests/unit/test_logging_config.py` (2 tests)
- `backend/tests/integration/test_analytics_api.py` (6 tests)
- `backend/tests/integration/test_replay_api.py` (6 tests)
- `backend/tests/integration/test_openapi_api.py` (5 tests)
- `backend/tests/integration/test_error_responses.py` (5 tests)
- `backend/tests/integration/test_session_interrupt_lifecycle.py` (8 tests)
- 12 frontend test files (57 tests total)
### Modified files
- `backend/app/main.py` -- cleanup task, exception handlers, alembic, structlog
- `backend/app/db.py` -- added run_alembic_migrations()
- `backend/app/config.py` -- added log_format setting
- `backend/pyproject.toml` -- added alembic, structlog deps
- 4 router files -- `/api/v1/` prefix
- 10 files -- structlog migration
- `docker-compose.yml` -- healthcheck URL
- `frontend/src/api.ts` -- `/api/v1/` URLs
- All existing test files -- API path updates + error envelope assertions
## Test Coverage
- Backend: 557 tests (was 516), 89.75% coverage
- Unit: ~490 tests
- Integration: ~60 tests
- E2E: ~7 tests
- Frontend: 80 tests (was 23), 16 test files (was 4)
## Deviations from Plan
- Redis rate limiting deferred (single-worker sufficient for now)
- ConversationTracker verified correct by design (pool per-method), skipped
- Coverage dropped slightly from 90.26% to 89.75% due to new alembic/logging modules with partial test coverage (still well above 80% threshold)
## Known Issues / Tech Debt
- Rate limiting remains process-global (needs Redis for multi-worker)
- Alembic migrations not tested against real PostgreSQL in CI (would need running DB)
- Frontend test coverage could be deeper (e.g., WebSocket reconnect edge cases)