Architecture Decision Record - ColaFlow Enterprise Multi-Tenancy
Document Type: ADR (Architecture Decision Record)
Date: 2025-11-03
Status: Accepted
Decision Makers: Architecture Team, Product Manager, Technical Leads
Project: ColaFlow - M1 Sprint 2 (Enterprise Multi-Tenant Upgrade)
Document Purpose
This Architecture Decision Record (ADR) documents the key architectural decisions made for ColaFlow's transition from a single-tenant to an enterprise-ready multi-tenant SaaS platform. It follows the ADR format to capture context, options considered, chosen solutions, and consequences.
Table of Contents
- ADR-001: Tenant Identification Strategy
- ADR-002: Data Isolation Strategy
- ADR-003: SSO Library Selection
- ADR-004: MCP Token Format
- ADR-005: Frontend State Management
- ADR-006: Token Storage Strategy
- Summary of Decisions
ADR-001: Tenant Identification Strategy
Status
Accepted - 2025-11-03
Context
ColaFlow is transitioning to a multi-tenant architecture where multiple companies (tenants) will share the same application instance. We need a reliable, performant, and secure method to identify which tenant a user or API request belongs to.
Requirements:
- Must work across web, mobile, and API clients
- Must be stateless (no session storage required)
- Must be secure (prevent tenant spoofing)
- Must be performant (no database lookup per request)
- Must support both human users and AI agents (MCP tokens)
- Must work with subdomain-based URLs (e.g., acme.colaflow.com)
Decision Drivers
- Performance: System must handle 10,000+ requests/second without database lookups
- Security: Tenant ID cannot be tampered with by malicious users
- Scalability: Solution must work for mobile apps, APIs, and web simultaneously
- Developer Experience: Easy to implement and maintain across all layers
- User Experience: Friendly tenant selection (via subdomain)
Options Considered
Option 1: JWT Claims (Primary) + Subdomain (Secondary)
Approach:
- Store tenant_id and tenant_slug in JWT access token claims
- Resolve tenant from subdomain on login/registration
- Inject tenant context from JWT claims into all API requests
- No database lookup required after authentication
Pros:
- Stateless: No session storage or database lookup per request
- Secure: JWT signature prevents tampering
- Cross-platform: Works for web, mobile, API, MCP tokens
- Fast: O(1) lookup from JWT claims
- Tenant context available in middleware layer
Cons:
- JWT cannot be updated until refresh (stale tenant info for up to 60 minutes)
- Requires careful token expiration management
- Subdomain only used for initial tenant resolution (login page)
Example JWT Payload:
{
"sub": "user-id-123",
"email": "john@acme.com",
"tenant_id": "tenant-uuid-456",
"tenant_slug": "acme",
"tenant_plan": "Enterprise",
"auth_provider": "AzureAD",
"role": "User",
"exp": 1730678400,
"iat": 1730674800
}
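On the client, the tenant claims in the payload above can be read directly for display purposes (e.g. showing the current tenant in the UI). A minimal TypeScript sketch, assuming a Node-style environment; signature verification still happens server-side, this only decodes the payload segment:

```typescript
// Illustrative only: read tenant claims from a JWT payload client-side.
// This does NOT verify the signature; it just decodes the middle
// (base64url-encoded) segment so the UI can show tenant context.

interface TenantClaims {
  sub: string;
  tenant_id: string;
  tenant_slug: string;
  tenant_plan: string;
}

function readTenantClaims(jwt: string): TenantClaims {
  const payloadB64 = jwt.split(".")[1];
  // Convert base64url to standard base64 before decoding
  const normalized = payloadB64.replace(/-/g, "+").replace(/_/g, "/");
  const json = Buffer.from(normalized, "base64").toString("utf8");
  return JSON.parse(json) as TenantClaims;
}
```

Any authorization decision must still rely on the server-validated token, never on this client-side decode.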
Option 2: Session-Based Tenant Storage
Approach:
- Store tenant ID in server-side session (Redis)
- Lookup tenant on every request via session ID
- Subdomain used for tenant resolution on login
Pros:
- Can update tenant info without re-login
- Works well for web applications
- Session can store additional context
Cons:
- Not stateless: Requires Redis/session storage infrastructure
- Database/Redis lookup on every request (performance hit)
- Difficult to scale horizontally (session affinity required)
- Doesn't work well for mobile apps or API-only clients
- MCP tokens would still need separate mechanism
Option 3: Subdomain-Only Identification
Approach:
- Parse subdomain from HTTP Host header on every request
- Lookup tenant by slug in database
- No JWT claims for tenant
Pros:
- Simple conceptual model
- User-friendly (URL shows tenant)
- Easy to test locally
Cons:
- Database lookup on every request (performance bottleneck)
- Doesn't work for API clients (no subdomain in API calls)
- Doesn't work for mobile apps
- Vulnerable to DNS spoofing
- MCP tokens cannot carry subdomain context
Option 4: Tenant ID in URL Path
Approach:
- Include tenant ID in every API route: /api/tenants/{tenantId}/projects
- Frontend passes tenant ID explicitly
Pros:
- Explicit tenant context in every request
- Easy to debug
- Works across all client types
Cons:
- Poor user experience (ugly URLs)
- Easy to make mistakes (wrong tenant ID)
- Difficult to enforce (requires middleware validation)
- Security risk (users could try other tenant IDs)
- Requires frontend to manage tenant ID everywhere
Decision
Chosen Option: Option 1 - JWT Claims (Primary) + Subdomain (Secondary)
Rationale:
- Performance: No database lookup per request; O(1) from JWT claims
- Security: JWT signature prevents tampering; middleware validates on every request
- Scalability: Works for web, mobile, API, and MCP tokens uniformly
- Stateless: No session storage required; easy to scale horizontally
- Developer Experience: TenantContext injected automatically via middleware
Implementation Strategy:
- Login Flow: User visits acme.colaflow.com/login → Tenant resolved from subdomain → JWT contains tenant_id and tenant_slug
- API Requests: JWT extracted from Authorization header → tenant_id injected into TenantContext → EF Core Global Query Filter applies automatic filtering
- MCP Tokens: Opaque tokens stored with tenant_id → Middleware validates token → Tenant context injected (same as JWT)
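The API-request step can be sketched as a small resolver that runs after signature verification. This TypeScript sketch is illustrative only; the type and function names are assumptions, not the actual ColaFlow middleware:

```typescript
// Illustrative tenant-context resolution: after the JWT signature has
// been verified, pull the tenant claims out of the payload and attach
// them to the request so downstream handlers and query filters can use them.

interface TenantContext {
  tenantId: string;
  tenantSlug: string;
}

interface JwtPayload {
  tenant_id?: string;
  tenant_slug?: string;
}

function resolveTenantContext(payload: JwtPayload): TenantContext {
  if (!payload.tenant_id || !payload.tenant_slug) {
    // Reject tokens issued before the multi-tenant upgrade
    throw new Error("Token is missing tenant claims");
  }
  return { tenantId: payload.tenant_id, tenantSlug: payload.tenant_slug };
}
```

Rejecting tokens without tenant claims (rather than defaulting to some tenant) is what prevents pre-upgrade tokens from silently landing in the wrong tenant.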
Consequences
Positive:
- Fast authentication and authorization
- No session storage infrastructure required
- Uniform tenant resolution across all client types
- Easy to test and debug (tenant visible in JWT payload)
- Supports multi-tenant mobile apps
Negative:
- Tenant changes require re-login (or wait for token refresh)
- JWT size increases slightly (+50 bytes for tenant claims)
- Middleware must validate JWT on every request (minor CPU cost)
Neutral:
- Subdomain is only used for initial tenant selection (login page)
- Tenant switching requires logout and login to different subdomain
Mitigation Strategies:
- Keep JWT expiration short (60 minutes) to allow tenant updates on refresh
- Implement automatic token refresh to minimize user disruption
- Cache JWT validation results per request to avoid redundant checks
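The automatic-refresh mitigation above amounts to scheduling a refresh a little before the 60-minute expiry. A minimal TypeScript sketch; the 2-minute skew is an assumption, not a specified value:

```typescript
// Illustrative silent-refresh scheduling: refresh the access token a
// couple of minutes before its `exp` claim so updated tenant claims are
// picked up without interrupting the user.

function msUntilRefresh(expUnixSeconds: number, skewSeconds = 120): number {
  // Refresh `skewSeconds` before the token actually expires
  const refreshAt = (expUnixSeconds - skewSeconds) * 1000;
  return Math.max(0, refreshAt - Date.now());
}

function scheduleRefresh(
  expUnixSeconds: number,
  refresh: () => void
): ReturnType<typeof setTimeout> {
  return setTimeout(refresh, msUntilRefresh(expUnixSeconds));
}
```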
Validation
Acceptance Criteria:
- JWT contains tenant_id, tenant_slug, and tenant_plan claims
- Middleware extracts tenant from JWT and injects into TenantContext
- All database queries automatically filter by tenant via Global Query Filter
- Cross-tenant access attempts return 403 Forbidden
- Performance: <5ms overhead for JWT validation per request
Testing:
- Unit tests: TenantContext injection
- Integration tests: Cross-tenant isolation
- Performance tests: 10,000 req/s with JWT validation
- Security tests: Attempt to access other tenant's data (should fail)
References
- Architecture Doc: docs/architecture/multi-tenancy-architecture.md
- JWT Implementation: docs/architecture/jwt-authentication-architecture.md
- MCP Token Format: docs/architecture/mcp-authentication-architecture.md
ADR-002: Data Isolation Strategy
Status
Accepted - 2025-11-03
Context
In a multi-tenant system, data isolation is critical to ensure that one tenant cannot access another tenant's data. We need to choose an isolation strategy that balances security, performance, cost, and maintainability.
Requirements:
- Strong data isolation (no cross-tenant leaks)
- Good query performance (<50ms for typical queries)
- Cost-effective (avoid database proliferation)
- Easy to maintain and backup
- Scalable to 10,000+ tenants
- Support for per-tenant data export (GDPR compliance)
Decision Drivers
- Security: Absolute data isolation between tenants
- Cost: Minimize infrastructure costs (PostgreSQL instances, storage)
- Performance: Fast queries with proper indexing
- Scalability: Support thousands of tenants on shared infrastructure
- Maintainability: Easy schema migrations, backups, monitoring
Options Considered
Option 1: Shared Database + tenant_id Column + Global Query Filter
Approach:
- All tenants share one PostgreSQL database
- Every table has a tenant_id column (NOT NULL)
- EF Core Global Query Filter automatically adds .Where(e => e.TenantId == currentTenantId) to all queries
- Composite indexes: (tenant_id, other_columns)
Schema Example:
CREATE TABLE projects (
id UUID PRIMARY KEY,
tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
name VARCHAR(200) NOT NULL,
key VARCHAR(20) NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
CONSTRAINT uq_projects_tenant_key UNIQUE (tenant_id, key)
);
CREATE INDEX idx_projects_tenant_id ON projects(tenant_id);
CREATE INDEX idx_projects_tenant_key ON projects(tenant_id, key);
EF Core Configuration:
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
modelBuilder.Entity<Project>().HasQueryFilter(
p => p.TenantId == _tenantContext.CurrentTenantId
);
}
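The core idea of the Global Query Filter — the tenant predicate is applied by infrastructure, never typed by hand — is framework-independent. A minimal TypeScript sketch of the same pattern, with hypothetical types (not the actual data-access layer):

```typescript
// Illustrative tenant-scoped query builder: every query produced through
// the wrapper carries the tenant_id predicate, so callers cannot forget
// it or override it.

interface TenantScopedQuery {
  table: string;
  where: Record<string, unknown>;
}

function tenantScoped(tenantId: string) {
  return {
    query(table: string, where: Record<string, unknown> = {}): TenantScopedQuery {
      // tenant_id is merged in last so a caller-supplied value cannot win
      return { table, where: { ...where, tenant_id: tenantId } };
    },
  };
}
```

Spreading the caller's predicate first and the tenant filter last mirrors why the EF Core filter is safe by default: bypassing it requires an explicit, reviewable act.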
Pros:
- Cost-effective: One database for all tenants
- Easy to maintain: Single schema, one backup process
- Good performance with proper indexing (composite indexes)
- Easy to add new tenants (just insert into tenants table)
- Per-tenant data export is a single SQL query: SELECT * FROM projects WHERE tenant_id = 'xxx'
- Scales to 10,000+ tenants on one database
- Automatic filtering via Global Query Filter (developer-friendly)
Cons:
- Risk of data leak if Global Query Filter is bypassed (.IgnoreQueryFilters())
- All tenants affected by database downtime
- Cannot isolate noisy neighbors (one tenant's heavy queries affect others)
- Database size grows with all tenants (monitoring required)
Cost Estimate: 1 database instance (~$100-200/month for medium workload)
Option 2: Database-per-Tenant
Approach:
- Each tenant gets a dedicated PostgreSQL database
- Connection string stored in tenants table
- Middleware switches database context per request
Schema Example:
-- Shared management database
CREATE TABLE tenants (
id UUID PRIMARY KEY,
slug VARCHAR(50) UNIQUE NOT NULL,
connection_string TEXT NOT NULL -- Encrypted
);
-- Tenant-specific database (one per tenant)
CREATE DATABASE tenant_acme;
CREATE DATABASE tenant_beta;
Pros:
- Strong isolation: One tenant's database cannot access another
- Tenant-specific customization (different schema versions)
- Easy to back up per tenant
- Noisy neighbors don't affect each other
- Easy to migrate tenant to different database server
Cons:
- Expensive: N databases for N tenants (~$10-20/month per tenant minimum)
- Complex maintenance: Schema migrations across 1000s of databases
- Connection pool exhaustion (need one pool per tenant)
- Difficult to implement cross-tenant features (analytics, admin tools)
- Onboarding delay (new database provisioning takes time)
Cost Estimate: 1000 tenants × $15/month = $15,000/month (vs $200 for shared)
Option 3: Schema-per-Tenant (PostgreSQL Schemas)
Approach:
- One database with multiple PostgreSQL schemas
- Each tenant gets a schema: tenant_acme.projects, tenant_beta.projects
- Middleware switches search_path per request: SET search_path = tenant_acme;
Pros:
- Better isolation than shared database
- Lower cost than database-per-tenant
- All tenants in one PostgreSQL instance (easier backups)
- Can support ~1000 schemas per database
Cons:
- PostgreSQL schema limit (~1000 schemas per database)
- Schema creation overhead for new tenants
- Complex schema migrations (run migration on each schema)
- Search_path switching per request (performance overhead)
- Difficult to enforce (easy to forget to set search_path)
Cost Estimate: Same as shared database, but limited scalability
Option 4: Separate Infrastructure per Tenant (Fully Isolated)
Approach:
- Each tenant gets dedicated Kubernetes namespace, database, Redis, etc.
- Complete infrastructure isolation
Pros:
- Maximum isolation and security
- Per-tenant scaling and customization
- Enterprise customers often require this
Cons:
- Extremely expensive (hundreds of dollars per tenant)
- Complex to manage (orchestration required)
- Overkill for most tenants
- Long onboarding time
Cost Estimate: 1000 tenants × $500/month = $500,000/month (prohibitive)
Decision
Chosen Option: Option 1 - Shared Database + tenant_id Column + Global Query Filter
Rationale:
- Cost-Effective: $200/month vs $15,000/month for database-per-tenant
- Scalable: PostgreSQL handles 10,000+ tenants with proper indexing
- Maintainable: One schema, one backup process, one monitoring dashboard
- Developer-Friendly: EF Core Global Query Filter ensures automatic filtering
- Performance: Composite indexes provide excellent query performance
- Proven Pattern: Used by GitHub, Slack, Heroku, and many successful SaaS products
Implementation Strategy:
- Add tenant_id column to all business tables
- Create composite indexes: (tenant_id, primary_key), (tenant_id, foreign_key)
- Configure EF Core Global Query Filter in OnModelCreating
- Create TenantContext service to inject current tenant
- Add database-level constraints: CHECK (tenant_id IS NOT NULL)
- Update unique constraints to be tenant-scoped: UNIQUE (tenant_id, email)
Migration Path:
- Create tenants table
- Create default tenant for existing data
- Add tenant_id columns (nullable initially)
- Migrate existing data to default tenant
- Set tenant_id as NOT NULL
- Add indexes and constraints
Consequences
Positive:
- Low infrastructure cost (1 database vs thousands)
- Easy to maintain and monitor
- Fast schema migrations (one database)
- Automatic tenant filtering (developer safety)
- Good query performance with indexes
- Per-tenant data export is straightforward SQL
Negative:
- Risk of data leak if developer bypasses Global Query Filter
- All tenants share database resources (monitoring required)
- Cannot isolate noisy neighbors at database level
- Database backup contains all tenants (larger backup size)
Neutral:
- Tenant onboarding is instant (no new database needed)
- Cross-tenant analytics require explicit filtering
- Database size monitoring required as tenant count grows
Mitigation Strategies:
- Data Leak Prevention:
  - Code review requirement for any .IgnoreQueryFilters() usage
  - Integration tests verify cross-tenant isolation
  - Automated security testing (attempt cross-tenant access)
- Performance Monitoring:
  - Alert on slow queries (>100ms)
  - Index usage monitoring (pg_stat_user_indexes)
  - Per-tenant query cost tracking
- Noisy Neighbor Protection:
  - Query timeout limits (5 seconds max)
  - Rate limiting per tenant
  - Connection pool limits
  - Option to migrate a large tenant to a dedicated database later
Upgrade Path: If a tenant grows too large or requires dedicated resources, we can migrate them to a separate database while keeping the shared model for other tenants.
Validation
Acceptance Criteria:
- All queries automatically filter by tenant
- Cross-tenant access attempts fail with 403 Forbidden
- Query performance <50ms for typical workloads (with 10,000 records per tenant)
- Integration tests verify tenant isolation
- Data export per tenant completes in <1 minute
Testing:
- Unit tests: Global Query Filter applied to all entities
- Integration tests: Create data in Tenant A, verify Tenant B cannot access
- Performance tests: Query time with 1 million total records (100 tenants × 10,000 records)
- Load tests: 10,000 concurrent requests across 100 tenants
References
- Architecture Doc: docs/architecture/multi-tenancy-architecture.md
- Migration Strategy: docs/architecture/migration-strategy.md
- Performance Benchmarks: docs/architecture/performance-benchmarks.md (TBD)
ADR-003: SSO Library Selection
Status
Accepted - 2025-11-03
Context
Enterprise customers require Single Sign-On (SSO) to integrate ColaFlow with their corporate identity providers (Azure AD, Google Workspace, Okta, etc.). We need to choose an SSO library/approach that balances functionality, cost, implementation speed, and maintainability.
Requirements:
- Support major identity providers: Azure AD, Google, Okta
- Support OIDC (OpenID Connect) protocol
- Support SAML 2.0 for generic enterprise IdPs
- User auto-provisioning (create user on first SSO login)
- Email domain restrictions (only allow @acme.com)
- Configurable per tenant (each tenant has own SSO config)
- Production-ready security standards
Decision Drivers
- Time-to-Market: Implement SSO in <1 week (M1 timeline constraint)
- Cost: Minimize licensing fees
- Coverage: Support 90% of enterprise SSO requirements
- Flexibility: Can upgrade later if complex requirements emerge
- Security: Follow OWASP and OIDC/SAML best practices
Options Considered
Option 1: ASP.NET Core Native OIDC/SAML (M1-M2)
Approach:
- Use built-in Microsoft.AspNetCore.Authentication.OpenIdConnect for OIDC
- Use Sustainsys.Saml2 library for SAML 2.0
- Custom implementation for multi-tenant SSO configuration
- Store SSO config in tenants table (JSONB column)
Pros:
- Free: No licensing costs
- Fast: Can implement OIDC in 2-3 days, SAML in 3-4 days
- Built-in to .NET 9: Mature, well-documented
- Flexible: Full control over implementation
- Covers 80-90% of enterprise SSO needs
Cons:
- Manual implementation: Need to handle user provisioning, domain restrictions
- Limited advanced features: No federation, no protocol switching
- SAML is more complex to implement
- Need to maintain our own SSO configuration UI
Implementation Complexity: Medium
Cost: $0/month
Coverage: OIDC (Azure, Google, Okta) + SAML 2.0 (80% of market)
Code Example:
services.AddAuthentication()
.AddOpenIdConnect("AzureAD", options =>
{
options.Authority = tenant.SsoConfig.AuthorityUrl;
options.ClientId = tenant.SsoConfig.ClientId;
options.ClientSecret = tenant.SsoConfig.ClientSecret;
options.ResponseType = "code";
options.SaveTokens = true;
options.Events = new OpenIdConnectEvents
{
OnTokenValidated = async context =>
{
await AutoProvisionUserAsync(context);
}
};
});
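The AutoProvisionUserAsync step above is where the email-domain restriction from the requirements is enforced. The check itself is simple; a TypeScript sketch with assumed names (the real implementation lives server-side in C#):

```typescript
// Illustrative email-domain restriction for SSO auto-provisioning:
// only provision users whose email domain is on the tenant's allowlist
// (e.g. only @acme.com for the acme tenant).

function isEmailDomainAllowed(email: string, allowedDomains: string[]): boolean {
  const at = email.lastIndexOf("@");
  if (at < 0) return false; // malformed address: reject
  const domain = email.slice(at + 1).toLowerCase();
  return allowedDomains.some((d) => d.toLowerCase() === domain);
}
```

Comparing case-insensitively matters in practice, since IdPs are not consistent about email casing.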
Option 2: Auth0
Approach:
- Use Auth0 as SSO broker
- Auth0 handles all identity providers
- Configure Auth0 via their dashboard
- Pay per monthly active user (MAU)
Pros:
- Fast setup: Implement in 1-2 days
- Comprehensive: Supports all identity providers out-of-the-box
- User management: Built-in user directory
- Advanced features: MFA, passwordless, anomaly detection
- Dashboard for SSO configuration
Cons:
- Expensive: $240/month (Professional) + $0.05/MAU (500 users = $25/month extra)
- Vendor lock-in: Difficult to migrate away
- Less control: Auth0 controls auth flow
- Overkill for MVP: Many features we don't need yet
Implementation Complexity: Low
Cost: $3,000-5,000/year (for 100 tenants with 5,000 total users)
Coverage: 100% (all protocols, all providers)
Option 3: Okta (Workforce Identity Cloud)
Approach:
- Use Okta as SSO broker
- Similar to Auth0 but more enterprise-focused
- Per-user pricing
Pros:
- Enterprise-grade: Trusted by Fortune 500
- Complete features: SSO, MFA, provisioning, directory
- Excellent support and documentation
Cons:
- Very expensive: $2/user/month minimum (100 users = $200/month)
- Enterprise sales process (slow, complex)
- Overkill for startup/SMB customers
- Vendor lock-in
Implementation Complexity: Low
Cost: $5,000-10,000/year (for 100 tenants)
Coverage: 100%
Option 4: IdentityServer4 / Duende IdentityServer
Approach:
- Use IdentityServer as self-hosted identity provider
- Implement Federation support (connect to external IdPs)
- Open-source (IdentityServer4) or licensed (Duende)
Pros:
- Self-hosted: Full control
- Comprehensive: OIDC, OAuth 2.0, SAML via plugins
- Flexible: Can customize extensively
- No per-user fees
Cons:
- Complex: Steep learning curve (2-3 weeks to implement)
- Maintenance burden: Need to maintain IdentityServer instance
- Duende licensing: $1,500/year for production use
- Overkill for MVP: We don't need an identity provider, just SSO
Implementation Complexity: High
Cost: $1,500/year (Duende license)
Coverage: 100%
Decision
Chosen Option: Option 1 - ASP.NET Core Native OIDC/SAML (M1-M2)
Rationale:
- Cost: $0/month vs $3,000-5,000/year for Auth0/Okta
- Speed: Can implement in <1 week (M1 timeline)
- Control: Full flexibility to customize
- Coverage: Supports 80% of enterprise SSO requirements (OIDC + SAML)
- Upgrade Path: Can migrate to Auth0/Okta later if complex requirements emerge
Decision: Start with native ASP.NET Core for M1-M2. Re-evaluate at M3 if we need:
- Complex federation (multiple IdPs per tenant)
- Advanced MFA flows
- More than 5 different SSO protocols
- Dedicated identity management features
Implementation Strategy:
- M1 (Week 1): OIDC implementation (Azure AD, Google, Okta)
- M2 (Week 2): SAML 2.0 implementation (generic enterprise IdPs)
- M2 (Week 3): User auto-provisioning and domain restrictions
- M2 (Week 4): SSO configuration UI for tenants
Consequences
Positive:
- Zero licensing costs for M1-M2
- Complete control over implementation
- Can customize for our specific needs
- Fast implementation (< 1 week)
- Covers 80% of enterprise SSO requirements
- Learning opportunity for team
Negative:
- Manual implementation required (more code to maintain)
- Limited to OIDC + SAML 2.0 (no exotic protocols)
- Need to build SSO configuration UI ourselves
- More testing required (vs using Auth0)
Neutral:
- Can migrate to Auth0/Okta later if needed
- SSO config stored in database (our control)
- Integration tests required for each IdP
Mitigation Strategies:
- Quality: Comprehensive testing with real IdPs (Azure AD, Google)
- Documentation: Detailed guides for each supported provider
- Security: Follow OIDC/SAML security best practices
- Upgrade Path: Design SSO config to be provider-agnostic (easy migration)
Validation
Acceptance Criteria:
- OIDC login works with Azure AD, Google, Okta
- SAML 2.0 login works with generic IdP
- Users auto-provisioned on first login
- Email domain restrictions enforced
- SSO configuration UI functional for admins
- Error handling for common SSO failures
Testing:
- Unit tests: OIDC token validation, SAML assertion parsing
- Integration tests: Full SSO flow with real IdPs (test tenants)
- Security tests: CSRF protection, replay attack prevention
- Usability tests: Admin can configure SSO without support
References
- Architecture Doc: docs/architecture/sso-integration-architecture.md
- Implementation Guide: docs/implementation/sso-implementation.md (TBD)
- Security Checklist: docs/security/sso-security-checklist.md (TBD)
ADR-004: MCP Token Format
Status
Accepted - 2025-11-03
Context
ColaFlow will expose an MCP (Model Context Protocol) server that allows AI agents (Claude, ChatGPT) to access project data, create tasks, and generate reports. We need a secure, revocable authentication mechanism for AI agents.
Requirements:
- Secure: Cannot be forged or tampered with
- Revocable: Admin can revoke token instantly
- Fine-Grained Permissions: Control read/write access per resource
- Audit Trail: Log all API operations performed with token
- Tenant-Scoped: Token only works for one tenant
- Long-Lived: Valid for days/weeks (not short-lived like JWT)
Decision Drivers
- Security: Token cannot be guessed or brute-forced
- Revocability: Instant revocation (no JWT blacklist complexity)
- Permissions: Resource-level + operation-level granularity
- Auditability: Complete log of all token operations
- Usability: Easy to copy/paste, recognizable format
Options Considered
Option 1: Opaque Tokens (mcp_<tenant_slug>_<random_32>)
Format: mcp_acme_7f3d8a9c4e1b2f5a6d8c9e0f1a2b3c4d
Approach:
- Token is a random string (cryptographically secure)
- Prefix:
mcp_(identifies as MCP token) - Tenant slug:
acme(for easy identification) - Random part: 32 hex characters (128 bits of entropy)
- Store token hash (SHA256) in database
- Store permissions in database alongside token
Token Storage:
CREATE TABLE mcp_tokens (
id UUID PRIMARY KEY,
tenant_id UUID NOT NULL,
user_id UUID NULL,
name VARCHAR(100) NOT NULL,
token_hash VARCHAR(255) NOT NULL UNIQUE, -- SHA256 of token
permissions JSONB NOT NULL, -- {"projects": ["read", "search"], ...}
status INT NOT NULL, -- Active/Revoked/Expired
created_at TIMESTAMP NOT NULL,
expires_at TIMESTAMP NULL,
last_used_at TIMESTAMP NULL
);
Validation Flow:
- Receive token: mcp_acme_xxx...
- Hash token with SHA256
- Lookup in database by token_hash
- Check status (Active/Revoked/Expired)
- Check expiration date
- Load permissions from JSONB column
- Inject tenant context and permissions into request
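The validation flow above can be condensed into a short sketch. This TypeScript version is illustrative (the actual service is C#); the store shape and status values are assumptions based on the schema shown earlier:

```typescript
import { createHash } from "node:crypto";

// Illustrative MCP-token validation: hash the presented token with
// SHA-256 and look the hash up in storage, then check status and expiry.
// Here a Map stands in for the mcp_tokens table.

interface StoredToken {
  tenantId: string;
  status: "Active" | "Revoked" | "Expired";
  expiresAt: Date | null;
  permissions: Record<string, string[]>;
}

function hashToken(token: string): string {
  return createHash("sha256").update(token).digest("hex");
}

function validateToken(
  token: string,
  store: Map<string, StoredToken>,
  now = new Date()
): StoredToken {
  const record = store.get(hashToken(token));
  if (!record) throw new Error("Unknown token");
  if (record.status !== "Active") throw new Error("Token revoked or expired");
  if (record.expiresAt && record.expiresAt <= now) throw new Error("Token expired");
  return record; // caller injects record.tenantId + permissions into the request
}
```

Because only the hash is stored, a database leak does not expose usable tokens; because status is checked on every lookup, revocation is immediate.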
Pros:
- Revocable: Update status = Revoked in database; takes effect immediately
- Secure: SHA256 hashed, never stored plain-text
- Flexible Permissions: Can update permissions without regenerating token
- Auditable: Every token use logged in database
- Tenant-Scoped: Token hash includes tenant context
- Long-Lived: Can be valid for months/years
- Easy to Identify: Prefix + tenant slug clearly identify token type
Cons:
- Database lookup required on every request (performance overhead)
- Larger tokens (50+ characters) vs API keys (32 characters)
- Need to manage token lifecycle (expiration, revocation)
Performance: ~5ms per token validation (including database lookup)
Option 2: JWT Tokens for MCP
Format: Long JWT string (200+ characters)
Approach:
- Generate JWT with tenant_id, user_id, permissions claims
- Sign with secret key
- No database lookup required (stateless)
- Validate signature on every request
Pros:
- Stateless: No database lookup required
- Fast validation: O(1) signature check
- Self-contained: All info in token
Cons:
- Cannot Revoke: Once issued, JWT is valid until expiration (unless using blacklist)
- Blacklist Required: Need Redis/database to store revoked JWTs (adds complexity)
- Permissions Fixed: Cannot update permissions without regenerating token
- Larger Tokens: 200-500 characters (difficult to copy/paste)
- Expiration Required: Must set short expiration for revocation to work
Revocation Problem:
User generates JWT token → Shares with AI agent → Admin wants to revoke
→ JWT is still valid for 30 days → Need to blacklist JWT ID
→ Now need Redis to store blacklist → Not truly stateless anymore
Option 3: API Keys (UUID Format)
Format: 550e8400-e29b-41d4-a716-446655440000
Approach:
- Generate random UUID
- Store in database with permissions
- Simple validation: lookup by UUID
Pros:
- Simple implementation
- Standard format (UUID)
- Revocable via database status update (same as opaque tokens)
Cons:
- No tenant context in token (need to lookup tenant)
- No token type identifier (could be confused with user IDs)
- No visual indication of purpose
- Less entropy: a UUIDv4 carries only 122 random bits, slightly less than the 128-bit random part in Option 1
Option 4: GitHub-Style Personal Access Tokens
Format: ghp_ABcdEF123456789012345678901234567890
Approach:
- Prefix identifies token type
- Random alphanumeric string
- Store hash in database
Pros:
- Industry standard (used by GitHub, GitLab)
- Easy to identify by prefix
- Secure
Cons:
- No tenant context in token itself
- Shorter random part (less entropy than our Option 1)
Decision
Chosen Option: Option 1 - Opaque Tokens (mcp_<tenant_slug>_<random_32>)
Rationale:
- Revocability: Instant revocation without blacklist complexity
- Flexibility: Permissions stored server-side, can update without new token
- Security: 128 bits of entropy + SHA256 hashing
- Usability: Tenant slug in token helps users identify which tenant it's for
- Auditability: Complete audit trail in database
Token Format:
mcp_<tenant_slug>_<random_32_hex_chars>
Example:
mcp_acme_7f3d8a9c4e1b2f5a6d8c9e0f1a2b3c4d
mcp_techcorp_9c4e1b2f5a6d8c9e0f1a2b3c4d7f3d8a
Components:
- mcp_: Identifies as MCP token (easy to filter in logs)
- acme: Tenant slug (helps user identify which tenant)
- 7f3d8a9c...: 32 hex characters (128 bits of entropy = 2^128 combinations)
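The format lends itself to strict validation before any database lookup, which cheaply rejects malformed input. A TypeScript sketch; the slug character set (lowercase alphanumerics and hyphens) is an assumption:

```typescript
// Illustrative parser for the mcp_<tenant_slug>_<random_32> format.
// Rejecting anything that doesn't match avoids needless hash-and-lookup
// work for garbage input.

const MCP_TOKEN_PATTERN = /^mcp_([a-z0-9-]+)_([0-9a-f]{32})$/;

function parseMcpToken(
  token: string
): { tenantSlug: string; random: string } | null {
  const match = MCP_TOKEN_PATTERN.exec(token);
  if (!match) return null;
  return { tenantSlug: match[1], random: match[2] };
}
```

Note the slug is informational only: authorization must come from the database record found via the token hash, never from the slug embedded in the token.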
Generation:
public string GenerateToken(string tenantSlug)
{
var randomBytes = new byte[16]; // 128 bits
using var rng = RandomNumberGenerator.Create();
rng.GetBytes(randomBytes);
var randomHex = Convert.ToHexString(randomBytes).ToLowerInvariant();
return $"mcp_{tenantSlug}_{randomHex}";
}
Storage:
public async Task<string> CreateTokenAsync(Tenant tenant, CreateMcpTokenCommand command)
{
    var token = _tokenGenerator.GenerateToken(tenant.Slug);
    var tokenHash = _tokenGenerator.HashToken(token); // SHA256
    var mcpToken = new McpToken
    {
        TenantId = tenant.Id,
        TokenHash = tokenHash, // Never store plain-text
        Permissions = command.Permissions,
        ExpiresAt = command.ExpiresAt
    };
    await _repository.AddAsync(mcpToken);
    return token; // Return plain-text ONLY ONCE
}
Consequences
Positive:
- Instant revocation (update database status)
- Fine-grained permissions (stored server-side)
- Complete audit trail
- Tenant-scoped (slug in token)
- Secure (128-bit entropy + SHA256)
- User-friendly (tenant slug helps identification)
Negative:
- Database lookup required per request (~5ms overhead)
- Longer tokens (50 characters vs 32 for API keys)
- Need to manage token lifecycle (expiration, cleanup)
Neutral:
- Performance overhead acceptable for MCP use case (not high-frequency)
- Token length acceptable for copy/paste workflow
Mitigation Strategies:
- Performance: Cache token validation results (5-minute TTL)
- Token Length: Provide copy button and download option in UI
- Lifecycle Management: Automated cleanup job for expired tokens
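The 5-minute validation cache mentioned above is a plain TTL cache keyed by token hash. An in-memory TypeScript sketch for illustration; a real deployment would more likely use Redis, and the lazy-eviction strategy is an assumption:

```typescript
// Illustrative TTL cache for token-validation results: entries expire
// after ttlMs and are evicted lazily on read. Keying by token hash means
// the plain-text token never sits in the cache.

class TtlCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  get(key: string, now = Date.now()): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // lazily evict stale entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T, now = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```

One trade-off worth noting: with a 5-minute TTL, a revocation may take up to 5 minutes to propagate through the cache, so cache entries for revoked tokens should also be deleted eagerly on revocation.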
Validation
Acceptance Criteria:
- Token generation is cryptographically secure (CSPRNG)
- Token hash stored (SHA256), never plain-text
- Token validation <10ms (including database lookup)
- Revocation takes effect immediately
- Permissions enforced on every API call
- Audit log created for every token use
Testing:
- Unit tests: Token generation format, hashing, validation
- Integration tests: Token authentication flow, permission enforcement
- Security tests: Brute-force resistance, revocation effectiveness
- Performance tests: 1,000 req/s with token validation
References
- Architecture Doc: docs/architecture/mcp-authentication-architecture.md
- Token Management UI: docs/design/multi-tenant-ux-flows.md#mcp-token-management-flow
ADR-005: Frontend State Management
Status
Accepted - 2025-11-03
Context
ColaFlow frontend (Next.js 16 + React 19) needs a state management solution for authentication, user preferences, and server data. We need to choose libraries that are TypeScript-first, performant, and maintainable.
Requirements:
- Type-safe: Full TypeScript support
- Performant: Minimal re-renders
- Developer-friendly: Low boilerplate
- Server state caching: Avoid redundant API calls
- Optimistic updates: Immediate UI feedback
- Auth state persistence: Survive page refresh
Decision Drivers
- TypeScript Support: First-class TypeScript integration
- Performance: Minimal bundle size, fast renders
- DX (Developer Experience): Easy to learn, low boilerplate
- Ecosystem: Good documentation, active community
- Server State: Built-in caching and invalidation
Options Considered
Option 1: Zustand (Client State) + TanStack Query v5 (Server State)
Approach:
- Zustand: Lightweight state manager for auth, UI state
- TanStack Query: Server state caching, mutations, automatic refetching
Zustand Example:
// stores/useAuthStore.ts
import { create } from 'zustand';
import { persist } from 'zustand/middleware';
interface AuthState {
user: User | null;
tenant: Tenant | null;
accessToken: string | null;
login: (token: string, user: User, tenant: Tenant) => void;
logout: () => void;
}
export const useAuthStore = create<AuthState>()(
persist(
(set) => ({
user: null,
tenant: null,
accessToken: null,
login: (token, user, tenant) => set({ accessToken: token, user, tenant }),
logout: () => set({ accessToken: null, user: null, tenant: null }),
}),
{ name: 'auth-storage' }
)
);
TanStack Query Example:
// hooks/useMcpTokens.ts
import { useQuery } from '@tanstack/react-query';
import { mcpService } from '@/services/mcp.service';
export function useMcpTokens() {
return useQuery({
queryKey: ['mcp-tokens'],
queryFn: () => mcpService.listTokens(),
staleTime: 1000 * 60 * 5, // 5 minutes
});
}
Pros:
- Minimal Bundle Size: Zustand (3KB) + TanStack Query (15KB) = 18KB total
- TypeScript-First: Excellent type inference
- Low Boilerplate: No actions, reducers, or complex setup
- Performance: Zustand avoids unnecessary re-renders
- Caching: TanStack Query caches API responses automatically
- DevTools: Excellent debugging tools for both libraries
- Separation of Concerns: Client state in Zustand, server state in TanStack Query
Cons:
- Two libraries to learn (vs one all-in-one solution)
- Need to decide what goes in Zustand vs TanStack Query
Learning Curve: Low (Zustand is simpler than Redux, TanStack Query has great docs)
Option 2: Redux Toolkit + RTK Query
Approach:
- Redux Toolkit for all state
- RTK Query for API data fetching
Pros:
- All-in-one solution
- Mature ecosystem
- Excellent DevTools
Cons:
- More Boilerplate: Actions, slices, reducers
- Larger Bundle: Redux (10KB) + RTK Query (20KB) = 30KB
- Steeper Learning Curve: More concepts to learn
- Overkill for MVP: We don't need Redux's complexity yet
Option 3: React Context + SWR
Approach:
- React Context for auth state
- SWR for server data
Pros:
- Minimal dependencies (SWR only)
- Simple concept (React Context is built-in)
Cons:
- Performance Issues: Every React Context update re-renders all consuming components
- Boilerplate: Need to create context providers manually
- SWR vs TanStack Query: SWR is less feature-rich
Option 4: Jotai + TanStack Query
Approach:
- Jotai for atomic state management
- TanStack Query for server state
Pros:
- Atomic state model (like Recoil)
- Good TypeScript support
Cons:
- Less mature than Zustand
- Smaller community
- Atomic model can be overkill for simple auth state
Decision
Chosen Option: Option 1 - Zustand (Client State) + TanStack Query v5 (Server State)
Rationale:
- Bundle Size: 18KB total (vs 30KB for Redux Toolkit)
- Performance: Zustand selector-based re-renders, TanStack Query caching
- TypeScript: First-class support in both libraries
- Learning Curve: Simple APIs, great documentation
- Clear Separation: Auth/UI in Zustand, API data in TanStack Query
Usage Guidelines:
Zustand - Use For:
- Authentication state (user, tenant, accessToken)
- UI state (sidebar open/closed, theme)
- User preferences (language, timezone)
TanStack Query - Use For:
- API data (projects, issues, tokens)
- Mutations (create, update, delete)
- Caching and invalidation
Example Architecture:
```typescript
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query';

// Zustand (auth)
const { user, tenant, logout } = useAuthStore();

// TanStack Query (server data)
const { data: projects, isLoading } = useQuery({
  queryKey: ['projects'],
  queryFn: () => projectService.getAll()
});

// Mutation (invalidate the cached project list on success)
const queryClient = useQueryClient();
const createProject = useMutation({
  mutationFn: (data) => projectService.create(data),
  onSuccess: () => {
    queryClient.invalidateQueries({ queryKey: ['projects'] });
  }
});
```
Consequences
Positive:
- Lightweight and fast
- Easy to learn and use
- Great TypeScript experience
- Excellent caching and performance
- Clear separation of concerns
Negative:
- Two libraries to learn (instead of one)
- Need to decide where state lives (Zustand vs TanStack Query)
Neutral:
- Both libraries have excellent DevTools
- Both are actively maintained
Mitigation Strategies:
- Documentation: Create team guide for "What goes where"
- Code Reviews: Ensure consistent usage patterns
- Linting: Custom ESLint rules if needed
Validation
Acceptance Criteria:
- Auth state persists across page refresh
- API data cached appropriately (no redundant calls)
- Optimistic updates work (immediate UI feedback)
- TypeScript errors caught at compile time
- DevTools show state clearly
Performance Targets:
- Initial page load: <1.5s
- State updates: <16ms (60fps)
- Cache hit rate: >80%
References
- Zustand Docs: https://docs.pmnd.rs/zustand
- TanStack Query Docs: https://tanstack.com/query
- Implementation: docs/frontend/state-management-guide.md
ADR-006: Token Storage Strategy
Status
Accepted - 2025-11-03
Context
We need to securely store JWT access tokens and refresh tokens in the frontend. The storage mechanism must balance security, usability, and functionality.
Requirements:
- Secure: Protect against XSS and CSRF attacks
- Persistent: Survive page refresh
- Auto-refresh: Seamlessly refresh tokens before expiration
- Logout: Clear tokens on logout
- Cross-tab sync: Logout in one tab logs out all tabs
Decision Drivers
- Security: XSS protection (primary threat)
- CSRF Protection: For refresh tokens
- Usability: Seamless token refresh
- Persistence: User stays logged in across sessions
- Performance: Fast token access
Options Considered
Option 1: Access Token in Memory + Refresh Token in httpOnly Cookie
Approach:
- Access Token: Stored in Zustand state (memory only, not persisted)
- Refresh Token: Stored in httpOnly cookie (server-side managed)
- Flow:
- User logs in → Receive access + refresh tokens
- Access token stored in Zustand (memory)
- Refresh token stored in httpOnly cookie by backend
- Access token used for API calls (Authorization header)
- On 401 error → Call `/api/auth/refresh` (refresh token sent automatically via cookie)
- Receive new access token → Update Zustand state
Cookie Configuration (Backend):
```csharp
Response.Cookies.Append("refreshToken", refreshToken, new CookieOptions
{
    HttpOnly = true,                  // Cannot be accessed by JavaScript
    Secure = true,                    // HTTPS only
    SameSite = SameSiteMode.Strict,   // CSRF protection
    MaxAge = TimeSpan.FromDays(7)
});
```
Pros:
- XSS Protection (Access Token): Cannot be stolen via XSS (not in localStorage/cookies)
- CSRF Protection (Refresh Token): httpOnly + SameSite=Strict
- Short-Lived Access Token: Even if leaked, expires in 60 minutes
- Automatic Refresh: Cookie sent automatically on refresh endpoint
- No Manual Cookie Management: Backend sets/clears cookies
Cons:
- Access token lost on page refresh (need to call refresh immediately)
- Requires cookie support (some corporate proxies block cookies)
Security Score: 9/10 (Best practice)
Option 2: Both Tokens in localStorage
Approach:
- Store both access and refresh tokens in localStorage
- Read on page load
Pros:
- Simple implementation
- Tokens persist across page refresh
- No cookie management
Cons:
- Vulnerable to XSS: If attacker injects script, can steal both tokens
- No CSRF Protection: Tokens accessible to any script
- Not Recommended: Violates OWASP security guidelines
Security Score: 3/10 (Not secure)
Option 3: Both Tokens in httpOnly Cookies
Approach:
- Store both tokens in httpOnly cookies
- Backend sends cookies on every API response
Pros:
- XSS protection for both tokens
- Automatic token management
Cons:
- CSRF Vulnerability: Cookies sent automatically with every request
- Need CSRF Tokens: Additional complexity
- Cookie Size Limit: JWTs can be large (4KB cookie limit)
- Double-Submit Cookie Pattern Required: More complexity
Security Score: 6/10 (CSRF risk)
Option 4: Session-Based Authentication (No JWT)
Approach:
- Traditional session cookies
- Session stored server-side (Redis)
Pros:
- Simple
- Secure (session ID only)
Cons:
- Not stateless (requires Redis/database for sessions)
- Horizontal scaling complexity
- Not suitable for mobile apps
- Against our JWT strategy
Security Score: 7/10 (Secure but not stateless)
Decision
Chosen Option: Option 1 - Access Token in Memory + Refresh Token in httpOnly Cookie
Rationale:
- Best Security: Access token protected from XSS, refresh token protected from CSRF
- Industry Standard: Used by Auth0, Okta, and major SaaS apps
- Balances Security and UX: Short-lived access token, auto-refresh
- Stateless: No session storage required
- Mobile-Friendly: Can adapt for mobile (store refresh token securely)
Implementation:
```typescript
// stores/useAuthStore.ts
export const useAuthStore = create<AuthState>((set) => ({
  user: null,
  accessToken: null, // Stored in memory ONLY
  login: (token, user) => set({ accessToken: token, user }),
  updateToken: (token) => set({ accessToken: token }), // called by the refresh interceptor
  logout: () => set({ accessToken: null, user: null })
}));
// No persist middleware for accessToken!
```

```typescript
// lib/api-client.ts
import axios from 'axios';

export const apiClient = axios.create();

apiClient.interceptors.response.use(
  (response) => response,
  async (error) => {
    if (error.response?.status === 401 && !error.config._retry) {
      error.config._retry = true;

      // Call refresh endpoint (refresh token sent via cookie automatically)
      const { data } = await axios.post('/api/auth/refresh');

      // Update access token in memory
      useAuthStore.getState().updateToken(data.accessToken);

      // Retry original request with the new token
      error.config.headers.Authorization = `Bearer ${data.accessToken}`;
      return apiClient(error.config);
    }
    return Promise.reject(error);
  }
);
```
Token Refresh Strategy:
- Automatic: Intercept 401 errors, call refresh endpoint
- Preemptive (Optional): Refresh 5 minutes before expiration
- One-at-a-Time: Only one refresh call in flight (queue other requests)
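The "preemptive" and "one-at-a-time" strategies above can be sketched as framework-agnostic helpers. Names here (`refreshOnce`, `decodeExpiryMs`, `msUntilRefresh`) are illustrative, not from the codebase, and the caller is assumed to pass in the real refresh request; `Buffer` assumes a Node-flavored toolchain (a browser build would use `atob` instead).

```typescript
// One-at-a-time refresh: concurrent 401s share a single in-flight call
// instead of each firing its own POST /api/auth/refresh.
let refreshPromise: Promise<string> | null = null;

export function refreshOnce(doRefresh: () => Promise<string>): Promise<string> {
  if (!refreshPromise) {
    refreshPromise = doRefresh().finally(() => {
      refreshPromise = null; // allow the next refresh cycle
    });
  }
  return refreshPromise;
}

// Preemptive refresh: decode the JWT `exp` claim and compute how long to
// wait before refreshing, 5 minutes ahead of expiration.
export function decodeExpiryMs(jwt: string): number {
  const payload = JSON.parse(Buffer.from(jwt.split('.')[1], 'base64url').toString('utf8'));
  return payload.exp * 1000; // `exp` is seconds since the Unix epoch
}

export function msUntilRefresh(expiryMs: number, nowMs: number, leadMs = 5 * 60_000): number {
  return Math.max(0, expiryMs - leadMs - nowMs); // never negative
}
```

A scheduler would call `setTimeout(() => refreshOnce(...), msUntilRefresh(decodeExpiryMs(token), Date.now()))` after each successful login or refresh.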
Consequences
Positive:
- Maximum security (XSS + CSRF protected)
- Seamless user experience (auto-refresh)
- Stateless authentication
- Mobile-friendly (adapt for secure storage)
- Industry best practice
Negative:
- Access token lost on page refresh (need immediate refresh call)
- Requires cookie support (fails in some corporate environments)
- More complex implementation than localStorage
Neutral:
- Short-lived access token means more refresh calls (acceptable trade-off)
Mitigation Strategies:
- Page Load: Call refresh endpoint on app load if no access token in memory
- Cookie Fallback: If cookies blocked, fall back to re-login
- Error Handling: Clear UX if authentication fails (session expired)
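The "Page Load" and "Cookie Fallback" mitigations above can be sketched as one bootstrap function. The dependencies are injected so the logic stays testable; `bootstrapAuth` and its parameter names are illustrative, not from the codebase.

```typescript
// App-load bootstrap: the access token lives only in memory, so after a page
// refresh we try the refresh endpoint once (cookie sent automatically).
// If that fails — cookie blocked, expired, or cleared — fall back to re-login.
export async function bootstrapAuth(
  getToken: () => string | null,            // read access token from memory
  refresh: () => Promise<string | null>,    // POST /api/auth/refresh wrapper
  onSessionExpired: () => void,             // route to login / "session expired" UX
): Promise<string | null> {
  const existing = getToken();
  if (existing) return existing; // already authenticated in this tab

  try {
    const token = await refresh();
    if (token) return token;
  } catch {
    // Swallow: refresh failed (no cookie, corporate proxy, expired session)
  }
  onSessionExpired();
  return null;
}
```

In the app, this runs once on mount before rendering protected routes, storing a successful result via the auth store's token updater.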
Validation
Acceptance Criteria:
- Access token not visible in localStorage/sessionStorage/cookies (developer tools)
- Refresh token in httpOnly cookie with SameSite=Strict
- 401 errors trigger automatic token refresh
- Logout clears all tokens (memory + cookies)
- Cross-tab logout works (listen to storage events)
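The cross-tab logout criterion can be sketched as a small signal wiring. Only a "logout" event crosses tabs, never the token itself; the channel is abstracted so the same logic works over `BroadcastChannel`, a storage event, or a test double. The interface and names here are illustrative assumptions, not the project's actual API.

```typescript
// Abstraction over the cross-tab transport (BroadcastChannel, storage event, ...)
export interface LogoutChannel {
  post(msg: { type: string }): void;
  subscribe(handler: (msg: { type: string }) => void): void;
}

// Wires a tab to clear its in-memory auth state when another tab logs out.
// Returns a function the local logout flow calls to notify the other tabs
// (the initiating tab clears its own state directly).
export function wireCrossTabLogout(channel: LogoutChannel, clearAuth: () => void): () => void {
  channel.subscribe((msg) => {
    if (msg.type === 'logout') clearAuth();
  });
  return () => channel.post({ type: 'logout' });
}
```

With the browser's `BroadcastChannel('auth')`, `post` maps to `postMessage` and `subscribe` to `onmessage`; note that `BroadcastChannel` delivers only to *other* tabs, which is exactly the behavior wanted here.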
Security Tests:
- XSS attack simulation (cannot steal access token)
- CSRF attack simulation (refresh endpoint protected)
- Token expiration handled gracefully
- Logout clears all authentication state
References
- OWASP: https://cheatsheetseries.owasp.org/cheatsheets/JSON_Web_Token_for_Java_Cheat_Sheet.html
- Auth0 Best Practices: https://auth0.com/docs/secure/tokens/refresh-tokens/refresh-token-rotation
- Implementation: docs/frontend/api-integration-guide.md
Summary of Decisions
| Decision | Chosen Solution | Rationale |
|---|---|---|
| ADR-001: Tenant Identification | JWT Claims + Subdomain | Stateless, cross-platform, performant |
| ADR-002: Data Isolation | Shared DB + tenant_id + Global Query Filter | Cost-effective, scalable, maintainable |
| ADR-003: SSO Library | ASP.NET Core Native (OIDC + SAML) | Free, fast, covers 80% of needs |
| ADR-004: MCP Token Format | Opaque Tokens (mcp_<slug>_<random>) | Revocable, flexible, secure, auditable |
| ADR-005: Frontend State | Zustand + TanStack Query | Lightweight, TypeScript-first, performant |
| ADR-006: Token Storage | Access in Memory + Refresh in httpOnly Cookie | XSS + CSRF protected, industry standard |
Impact Assessment
Security Impact
- Overall Security Posture: Excellent (9/10)
- XSS Protection: Enforced (tokens in memory + httpOnly cookies)
- CSRF Protection: Enforced (SameSite=Strict cookies)
- Data Isolation: Enforced (Global Query Filter + composite indexes)
- Audit Trail: Complete (MCP tokens logged, SSO events tracked)
Performance Impact
- API Latency: +5ms (JWT validation + tenant filtering)
- Database Load: Minimal (composite indexes, Global Query Filter)
- Frontend Bundle Size: +18KB (Zustand + TanStack Query)
- Token Refresh: Transparent to user (<100ms)
Cost Impact
- Infrastructure: $200/month (1 database vs $15,000 for DB-per-tenant)
- Licensing: $0/month (native .NET libraries vs $3,000-5,000 for Auth0)
- Maintenance: Low (one schema, automated migrations)
- Total Savings: ~$18,000/year compared to Auth0 + DB-per-tenant
Development Impact
- Implementation Time: 10 days (vs 6 weeks for IdentityServer + DB-per-tenant)
- Learning Curve: Low (native libraries, clear architecture)
- Maintenance Burden: Low (well-documented, industry patterns)
- Testing Complexity: Medium (need tenant isolation tests)
Risks and Mitigation
| Risk | Mitigation |
|---|---|
| Data leak via Global Query Filter bypass | Code review for .IgnoreQueryFilters(), integration tests |
| SSO misconfiguration | Test connection UI, detailed error messages, documentation |
| MCP token brute-force | 128-bit entropy, rate limiting, IP whitelisting |
| Performance degradation | Composite indexes, query monitoring, slow query alerts |
| Frontend XSS attack | CSP headers, input sanitization, React auto-escaping |
Future Enhancements
Decisions are not permanent. We will revisit these at milestone reviews:
| Milestone | Potential Changes |
|---|---|
| M3 | Re-evaluate SSO (Auth0 if complex federation needed) |
| M4 | Re-evaluate data isolation (DB-per-tenant for enterprise customers) |
| M5 | Re-evaluate frontend state (Redux if complex state emerges) |
| M6 | Re-evaluate MCP tokens (consider JWT if performance critical) |
Document Status: Approved
Next Review: M3 Architecture Review (2025-12-15)
Approval Signatures:
- Architecture Team: [Approved]
- Product Manager: [Approved]
- Security Team: [Pending Review]
- Engineering Lead: [Approved]
End of Architecture Decision Record