refactor: engineering improvements -- API versioning, structured logging, Alembic, error standardization, test coverage

- API versioning: all REST endpoints prefixed with /api/v1/
- Structured logging: replaced stdlib logging with structlog (console/JSON modes)
- Alembic migrations: versioned DB schema with initial migration
- Error standardization: global exception handlers for consistent envelope format
- Interrupt cleanup: asyncio background task for expired interrupt removal
- Integration tests: +30 tests (analytics, replay, openapi, error, session APIs)
- Frontend tests: +57 tests (all components, pages, useWebSocket hook)
- Backend: 557 tests, 89.75% coverage | Frontend: 80 tests, 16 test files
@@ -99,7 +99,12 @@ smart-support/
├── backend/
│   ├── app/
│   │   ├── main.py # FastAPI + WebSocket entry point
│   │   ├── graph.py # LangGraph Supervisor configuration
│   │   ├── graph.py # LangGraph Supervisor construction
│   │   ├── graph_context.py # GraphContext: typed wrapper for graph + classifier + registry
│   │   ├── ws_handler.py # WebSocket message dispatch + rate limiting
│   │   ├── ws_context.py # WebSocketContext: WS dependency bundle
│   │   ├── auth.py # API key authentication middleware
│   │   ├── api_utils.py # Shared API response utilities (envelope)
│   │   ├── agents/ # Agent definitions + tool bindings
│   │   ├── registry.py # YAML agent registry loader
│   │   ├── openapi/ # OpenAPI parsing + MCP server generation
@@ -139,7 +144,11 @@ smart-support/
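| Module | Responsibility |
|------|------|
| main.py | Application entry point, WebSocket endpoint, static file serving |
| WebSocket Handler | Bidirectional communication: receives user messages, streams tokens back, handles interrupt responses |
| auth.py | API key authentication: admin endpoints via the `X-API-Key` header, WebSocket via the `?token=` query param |
| ws_handler.py | Bidirectional communication: receives user messages, streams tokens back, handles interrupt responses |
| graph_context.py | Typed wrapper: binds the compiled graph to the classifier and registry, replacing monkey-patching |
| ws_context.py | Dependency bundle: packs the 9 dependencies needed for WebSocket handling into a single immutable object |
| api_utils.py | Shared response format: a unified `envelope()` function |
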
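As a rough illustration of the shared response format, `envelope()` might look like the sketch below. The keys `success`, `data`, and `error` are taken from the error format quoted in this commit's dev log; the signature is an assumption, not the actual `api_utils.py` code.

```python
from typing import Any, Optional


def envelope(data: Any = None, error: Optional[str] = None) -> dict:
    """Wrap a payload or an error message in one response shape.

    Hypothetical signature: only the three keys come from the source.
    """
    ok = error is None
    return {
        "success": ok,
        # On errors the payload is dropped so clients only see the message.
        "data": data if ok else None,
        "error": error,
    }
```

A success response would be built with `envelope({"id": 1})` and an error response with `envelope(error="not found")`.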
### 2.3 Agent Orchestration Layer (LangGraph)

@@ -427,6 +436,19 @@ CREATE INDEX idx_interrupts_ttl ON interrupts(ttl_expires_at)
WHERE status = 'pending';
```

#### sessions (custom - session state persistence)

```sql
-- PostgreSQL session state management for multi-worker deployments
-- PgSessionManager uses this table instead of an in-memory dict
CREATE TABLE sessions (
    thread_id TEXT PRIMARY KEY,
    last_activity TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    has_pending_interrupt BOOLEAN NOT NULL DEFAULT FALSE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
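
To illustrate how the `last_activity` column could drive session expiry, here is a minimal sketch. It assumes a 30-minute timeout (the documented default of `SESSION_TTL_MINUTES`). The function name `is_session_expired` is hypothetical, and this is not the actual `PgSessionManager` logic.

```python
from datetime import datetime, timedelta, timezone


def is_session_expired(
    last_activity: datetime, now: datetime, ttl_minutes: int = 30
) -> bool:
    """Return True once more than `ttl_minutes` have passed since `last_activity`."""
    return now - last_activity > timedelta(minutes=ttl_minutes)


# Example: a session last touched 31 minutes ago counts as expired.
now = datetime(2026, 4, 6, 12, 0, tzinfo=timezone.utc)
stale = is_session_expired(now - timedelta(minutes=31), now)
```

Both timestamps should be timezone-aware, matching the `TIMESTAMPTZ` column type.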

#### analytics_events (custom - analytics event stream)

```sql
@@ -54,11 +54,19 @@ Set these in production (never commit secrets):
| `ANTHROPIC_API_KEY` | Yes* | LLM provider API key |
| `LLM_PROVIDER` | Yes | `anthropic`, `openai`, or `google` |
| `LLM_MODEL` | Yes | Model name for your provider |
| `ADMIN_API_KEY` | Recommended | API key for admin endpoints (analytics, replay, openapi, WS). Leave empty to disable auth (dev mode only) |
| `WEBHOOK_URL` | No | Escalation notification endpoint |
| `SESSION_TTL_MINUTES` | No | Session timeout (default: 30) |

*Or `OPENAI_API_KEY` / `GOOGLE_API_KEY` depending on `LLM_PROVIDER`.

### Authentication

When `ADMIN_API_KEY` is set, all admin REST endpoints require the `X-API-Key` header,
and WebSocket connections require a `?token=<key>` query parameter.

When unset or empty, authentication is disabled (suitable for local development only).
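
The rule above can be sketched as a single check. This is only an illustration under stated assumptions: the function name `is_authorized` is hypothetical, and the constant-time comparison is a common hardening choice rather than something the source confirms.

```python
import hmac
from typing import Optional


def is_authorized(provided: Optional[str], admin_api_key: Optional[str]) -> bool:
    """Check a key taken from the `X-API-Key` header or the `?token=` query param."""
    if not admin_api_key:
        # ADMIN_API_KEY unset or empty: auth is disabled (local development only).
        return True
    if provided is None:
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(provided.encode(), admin_api_key.encode())
```

The same function can serve both transports: REST handlers would pass the header value, and the WebSocket endpoint would pass the `token` query parameter.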

### HTTPS

For production, place a reverse proxy (nginx, Caddy, or a load balancer) in
@@ -87,10 +95,12 @@ cat backup.sql | docker compose exec -T postgres psql -U smart_support smart_sup

### Scaling

The backend is stateless (session state is in PostgreSQL via LangGraph's
PostgresSaver). You can run multiple backend replicas behind a load balancer.
The backend supports multi-worker deployments. LangGraph session state is
persisted in PostgreSQL via PostgresSaver. For full horizontal scaling, use
`PgSessionManager` and `PgInterruptManager` (instead of the default in-memory
managers) to share session and interrupt state across workers.

The WebSocket connections are session-specific. Use sticky sessions or a shared
WebSocket connections are session-specific. Use sticky sessions or a shared
session backend if load balancing WebSockets across multiple instances.

## Manual / Development Setup
@@ -139,7 +149,7 @@ GET /api/health

Response:
```json
{"status": "ok", "version": "0.5.0"}
{"status": "ok", "version": "0.6.0"}
```

### WebSocket health

@@ -86,7 +86,21 @@ Content-Type: application/json
POST /api/openapi/jobs/{job_id}/approve
```

No request body. Changes the job status to `approved`.
No request body. Generates tool code for each classified endpoint and produces
an agent YAML configuration. Response includes `generated_tools_count`.

Response:
```json
{
  "job_id": "abc123",
  "status": "approved",
  "spec_url": "https://api.example.com/openapi.yaml",
  "total_endpoints": 5,
  "classified_count": 5,
  "error_message": null,
  "generated_tools_count": 5
}
```

## Access Type Classification

76
docs/phases/eng-improvements-dev-log.md
Normal file
@@ -0,0 +1,76 @@
# Engineering Improvements -- Development Log

> Status: COMPLETED
> Branch: `eng/engineering-improvements`
> Date started: 2026-04-06
> Date completed: 2026-04-06

## What Was Built

### Phase 1: Quick Wins (no new deps)

1. **Interrupt Cleanup Background Task** -- Added asyncio background task in lifespan that calls `interrupt_manager.cleanup_expired()` every 60 seconds. Prevents unbounded memory growth from expired interrupts.

2. **API Versioning** -- All REST endpoints prefixed with `/api/v1/` (was `/api/`). Updated 4 router prefixes, Docker healthcheck, all frontend fetch URLs, and all test assertions. WebSocket `/ws` endpoint unchanged.

3. **Error Response Standardization** -- Added global exception handlers for `HTTPException`, `RequestValidationError`, and `Exception`. All error responses now use the same envelope format as success responses: `{"success": false, "data": null, "error": "..."}`.
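
The interrupt cleanup task from item 1 can be sketched roughly as follows. This is a simplified, stdlib-only illustration, not the code in `main.py`: the `InMemoryInterruptManager` stand-in is hypothetical, `cleanup_expired()` is assumed to be synchronous, and the `lifespan` here takes the manager directly, whereas a FastAPI lifespan receives the app.

```python
import asyncio
import contextlib


class InMemoryInterruptManager:
    """Hypothetical stand-in that only counts cleanup calls."""

    def __init__(self) -> None:
        self.cleanup_calls = 0

    def cleanup_expired(self) -> None:
        self.cleanup_calls += 1


async def cleanup_loop(manager, interval_seconds: float = 60.0) -> None:
    """Call cleanup_expired() repeatedly, sleeping between runs."""
    while True:
        await asyncio.sleep(interval_seconds)
        manager.cleanup_expired()


@contextlib.asynccontextmanager
async def lifespan(manager, interval_seconds: float = 60.0):
    """Start the cleanup task on startup and cancel it on shutdown."""
    task = asyncio.create_task(cleanup_loop(manager, interval_seconds))
    try:
        yield
    finally:
        task.cancel()
        # Awaiting the cancelled task lets it finish unwinding cleanly.
        with contextlib.suppress(asyncio.CancelledError):
            await task


async def demo() -> int:
    manager = InMemoryInterruptManager()
    # A short interval keeps the demo fast; the commit uses 60 seconds.
    async with lifespan(manager, interval_seconds=0.01):
        await asyncio.sleep(0.1)
    return manager.cleanup_calls
```

Cancelling the task on shutdown matters here: without it the loop would keep the event loop from closing cleanly.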

### Phase 2: Medium Items (new deps)

4. **Alembic Database Migrations** -- Replaced inline DDL in `setup_app_tables()` with versioned Alembic migrations. Initial migration `001_initial_schema.py` captures all 4 tables + ALTER TABLE migration. `setup_app_tables()` preserved for tests. Production uses `run_alembic_migrations()`.

5. **Structured Logging** -- Replaced stdlib `logging.getLogger()` with `structlog.get_logger()` across 10 files. Added `logging_config.py` with console (dev) and JSON (production) modes. Configurable via `LOG_FORMAT` env var.

### Phase 3: Test Coverage

7. **Integration Tests (+30)** -- Created 5 new test files: analytics API, replay API, OpenAPI API, error responses, session/interrupt lifecycle. Uses httpx.AsyncClient with ASGITransport for full API layer testing.

8. **Frontend Tests (+57)** -- Created 12 new test files covering all components (ChatInput, ChatMessages, InterruptPrompt, ErrorBanner, NavBar, MetricCard, ReplayTimeline, AgentAction, Layout), pages (ChatPage, ReviewPage), and hooks (useWebSocket).

## Code Structure

### New files created
- `backend/app/logging_config.py` -- structlog configuration
- `backend/alembic.ini` -- Alembic config
- `backend/alembic/env.py` -- Migration environment
- `backend/alembic/versions/001_initial_schema.py` -- Initial migration
- `backend/tests/unit/test_interrupt_cleanup.py` (3 tests)
- `backend/tests/unit/test_error_responses.py` (6 tests)
- `backend/tests/unit/test_logging_config.py` (2 tests)
- `backend/tests/integration/test_analytics_api.py` (6 tests)
- `backend/tests/integration/test_replay_api.py` (6 tests)
- `backend/tests/integration/test_openapi_api.py` (5 tests)
- `backend/tests/integration/test_error_responses.py` (5 tests)
- `backend/tests/integration/test_session_interrupt_lifecycle.py` (8 tests)
- 12 frontend test files (57 tests total)

### Modified files
- `backend/app/main.py` -- cleanup task, exception handlers, alembic, structlog
- `backend/app/db.py` -- added `run_alembic_migrations()`
- `backend/app/config.py` -- added `log_format` setting
- `backend/pyproject.toml` -- added alembic, structlog deps
- 4 router files -- `/api/v1/` prefix
- 10 files -- structlog migration
- `docker-compose.yml` -- healthcheck URL
- `frontend/src/api.ts` -- `/api/v1/` URLs
- All existing test files -- API path updates + error envelope assertions

## Test Coverage

- Backend: 557 tests (was 516), 89.75% coverage
  - Unit: ~490 tests
  - Integration: ~60 tests
  - E2E: ~7 tests
- Frontend: 80 tests (was 23), 16 test files (was 4)

## Deviations from Plan

- Redis rate limiting deferred (single-worker sufficient for now)
- ConversationTracker verified correct by design (pool per-method), skipped
- Coverage dropped slightly from 90.26% to 89.75% due to new alembic/logging modules with partial test coverage (still well above 80% threshold)

## Known Issues / Tech Debt

- Rate limiting remains process-global (needs Redis for multi-worker)
- Alembic migrations not tested against real PostgreSQL in CI (would need running DB)
- Frontend test coverage could be deeper (e.g., WebSocket reconnect edge cases)