Architecture Decision Record - ColaFlow Enterprise Multi-Tenancy

Document Type: ADR (Architecture Decision Record)
Date: 2025-11-03
Status: Accepted
Decision Makers: Architecture Team, Product Manager, Technical Leads
Project: ColaFlow - M1 Sprint 2 (Enterprise Multi-Tenant Upgrade)


Document Purpose

This Architecture Decision Record (ADR) documents the key architectural decisions made for ColaFlow's transition from a single-tenant to an enterprise-ready multi-tenant SaaS platform. It follows the ADR format to capture context, options considered, chosen solutions, and consequences.


Table of Contents

  1. ADR-001: Tenant Identification Strategy
  2. ADR-002: Data Isolation Strategy
  3. ADR-003: SSO Library Selection
  4. ADR-004: MCP Token Format
  5. ADR-005: Frontend State Management
  6. ADR-006: Token Storage Strategy
  7. Summary of Decisions

ADR-001: Tenant Identification Strategy

Status

Accepted - 2025-11-03

Context

ColaFlow is transitioning to a multi-tenant architecture where multiple companies (tenants) will share the same application instance. We need a reliable, performant, and secure method to identify which tenant a user or API request belongs to.

Requirements:

  • Must work across web, mobile, and API clients
  • Must be stateless (no session storage required)
  • Must be secure (prevent tenant spoofing)
  • Must be performant (no database lookup per request)
  • Must support both human users and AI agents (MCP tokens)
  • Must work with subdomain-based URLs (e.g., acme.colaflow.com)

Decision Drivers

  1. Performance: System must handle 10,000+ requests/second without database lookups
  2. Security: Tenant ID cannot be tampered with by malicious users
  3. Scalability: Solution must work for mobile apps, APIs, and web simultaneously
  4. Developer Experience: Easy to implement and maintain across all layers
  5. User Experience: Friendly tenant selection (via subdomain)

Options Considered

Option 1: JWT Claims (Primary) + Subdomain (Secondary)

Approach:

  • Store tenant_id and tenant_slug in JWT access token claims
  • Resolve tenant from subdomain on login/registration
  • Inject tenant context from JWT claims into all API requests
  • No database lookup required after authentication

Pros:

  • Stateless: No session storage or database lookup per request
  • Secure: JWT signature prevents tampering
  • Cross-platform: Works for web, mobile, API, MCP tokens
  • Fast: O(1) lookup from JWT claims
  • Tenant context available in middleware layer

Cons:

  • JWT cannot be updated until refresh (stale tenant info for up to 60 minutes)
  • Requires careful token expiration management
  • Subdomain only used for initial tenant resolution (login page)

Example JWT Payload:

{
  "sub": "user-id-123",
  "email": "john@acme.com",
  "tenant_id": "tenant-uuid-456",
  "tenant_slug": "acme",
  "tenant_plan": "Enterprise",
  "auth_provider": "AzureAD",
  "role": "User",
  "exp": 1730678400,
  "iat": 1730674800
}
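
A rough sketch of how such a token could be issued with System.IdentityModel.Tokens.Jwt; the signing key, lifetime, and variable names here are placeholders rather than the project's final code:

var claims = new List<Claim>
{
    new("sub", user.Id.ToString()),
    new("email", user.Email),
    new("tenant_id", tenant.Id.ToString()),
    new("tenant_slug", tenant.Slug),
    new("tenant_plan", tenant.Plan),
    new("role", user.Role)
};

var jwt = new JwtSecurityToken(
    claims: claims,
    expires: DateTime.UtcNow.AddMinutes(60), // short lifetime so tenant claims refresh quickly
    signingCredentials: new SigningCredentials(signingKey, SecurityAlgorithms.HmacSha256));

var accessToken = new JwtSecurityTokenHandler().WriteToken(jwt);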

Option 2: Session-Based Tenant Storage

Approach:

  • Store tenant ID in server-side session (Redis)
  • Lookup tenant on every request via session ID
  • Subdomain used for tenant resolution on login

Pros:

  • Can update tenant info without re-login
  • Works well for web applications
  • Session can store additional context

Cons:

  • Not stateless: Requires Redis/session storage infrastructure
  • Database/Redis lookup on every request (performance hit)
  • Difficult to scale horizontally (session affinity required)
  • Doesn't work well for mobile apps or API-only clients
  • MCP tokens would still need separate mechanism

Option 3: Subdomain-Only Identification

Approach:

  • Parse subdomain from HTTP Host header on every request
  • Lookup tenant by slug in database
  • No JWT claims for tenant

Pros:

  • Simple conceptual model
  • User-friendly (URL shows tenant)
  • Easy to test locally

Cons:

  • Database lookup on every request (performance bottleneck)
  • Doesn't work for API clients (no subdomain in API calls)
  • Doesn't work for mobile apps
  • Vulnerable to DNS spoofing
  • MCP tokens cannot carry subdomain context

Option 4: Tenant ID in URL Path

Approach:

  • Include tenant ID in every API route: /api/tenants/{tenantId}/projects
  • Frontend passes tenant ID explicitly

Pros:

  • Explicit tenant context in every request
  • Easy to debug
  • Works across all client types

Cons:

  • Poor user experience (ugly URLs)
  • Easy to make mistakes (wrong tenant ID)
  • Difficult to enforce (requires middleware validation)
  • Security risk (users could try other tenant IDs)
  • Requires frontend to manage tenant ID everywhere

Decision

Chosen Option: Option 1 - JWT Claims (Primary) + Subdomain (Secondary)

Rationale:

  1. Performance: No database lookup per request; O(1) from JWT claims
  2. Security: JWT signature prevents tampering; middleware validates on every request
  3. Scalability: Works for web, mobile, API, and MCP tokens uniformly
  4. Stateless: No session storage required; easy to scale horizontally
  5. Developer Experience: TenantContext injected automatically via middleware

Implementation Strategy:

  • Login Flow: User visits acme.colaflow.com/login → Tenant resolved from subdomain → JWT contains tenant_id and tenant_slug
  • API Requests: JWT extracted from Authorization header → tenant_id injected into TenantContext → EF Core Global Query Filter applies automatic filtering
  • MCP Tokens: Opaque tokens stored with tenant_id → Middleware validates token → Tenant context injected (same as JWT)
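
A minimal sketch of the API request step above, assuming a request-scoped TenantContext service with settable CurrentTenantId/CurrentTenantSlug properties (illustrative names, not the final API):

public class TenantResolutionMiddleware
{
    private readonly RequestDelegate _next;

    public TenantResolutionMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context, TenantContext tenantContext)
    {
        // The JWT signature has already been validated by the authentication middleware;
        // here we only copy the tenant claims into the request-scoped TenantContext.
        var tenantIdClaim = context.User.FindFirst("tenant_id")?.Value;
        if (tenantIdClaim is not null && Guid.TryParse(tenantIdClaim, out var tenantId))
        {
            tenantContext.CurrentTenantId = tenantId;
            tenantContext.CurrentTenantSlug = context.User.FindFirst("tenant_slug")?.Value;
        }

        await _next(context);
    }
}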

Consequences

Positive:

  • Fast authentication and authorization
  • No session storage infrastructure required
  • Uniform tenant resolution across all client types
  • Easy to test and debug (tenant visible in JWT payload)
  • Supports multi-tenant mobile apps

Negative:

  • Tenant changes require re-login (or wait for token refresh)
  • JWT size increases slightly (+50 bytes for tenant claims)
  • Middleware must validate JWT on every request (minor CPU cost)

Neutral:

  • Subdomain is only used for initial tenant selection (login page)
  • Tenant switching requires logout and login to different subdomain

Mitigation Strategies:

  • Keep JWT expiration short (60 minutes) to allow tenant updates on refresh
  • Implement automatic token refresh to minimize user disruption
  • Cache JWT validation results per request to avoid redundant checks

Validation

Acceptance Criteria:

  • JWT contains tenant_id, tenant_slug, and tenant_plan claims
  • Middleware extracts tenant from JWT and injects into TenantContext
  • All database queries automatically filter by tenant via Global Query Filter
  • Cross-tenant access attempts return 403 Forbidden
  • Performance: <5ms overhead for JWT validation per request

Testing:

  • Unit tests: TenantContext injection
  • Integration tests: Cross-tenant isolation
  • Performance tests: 10,000 req/s with JWT validation
  • Security tests: Attempt to access other tenant's data (should fail)

References

  • Architecture Doc: docs/architecture/multi-tenancy-architecture.md
  • JWT Implementation: docs/architecture/jwt-authentication-architecture.md
  • MCP Token Format: docs/architecture/mcp-authentication-architecture.md

ADR-002: Data Isolation Strategy

Status

Accepted - 2025-11-03

Context

In a multi-tenant system, data isolation is critical to ensure that one tenant cannot access another tenant's data. We need to choose an isolation strategy that balances security, performance, cost, and maintainability.

Requirements:

  • Strong data isolation (no cross-tenant leaks)
  • Good query performance (<50ms for typical queries)
  • Cost-effective (avoid database proliferation)
  • Easy to maintain and backup
  • Scalable to 10,000+ tenants
  • Support for per-tenant data export (GDPR compliance)

Decision Drivers

  1. Security: Absolute data isolation between tenants
  2. Cost: Minimize infrastructure costs (PostgreSQL instances, storage)
  3. Performance: Fast queries with proper indexing
  4. Scalability: Support thousands of tenants on shared infrastructure
  5. Maintainability: Easy schema migrations, backups, monitoring

Options Considered

Option 1: Shared Database + tenant_id Column + Global Query Filter

Approach:

  • All tenants share one PostgreSQL database
  • Every table has a tenant_id column (NOT NULL)
  • EF Core Global Query Filter automatically adds .Where(e => e.TenantId == currentTenantId) to all queries
  • Composite indexes: (tenant_id, other_columns)

Schema Example:

CREATE TABLE projects (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    name VARCHAR(200) NOT NULL,
    key VARCHAR(20) NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    CONSTRAINT uq_projects_tenant_key UNIQUE (tenant_id, key)
);

CREATE INDEX idx_projects_tenant_id ON projects(tenant_id);
CREATE INDEX idx_projects_tenant_key ON projects(tenant_id, key);

EF Core Configuration:

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Project>().HasQueryFilter(
        p => p.TenantId == _tenantContext.CurrentTenantId
    );
}

Pros:

  • Cost-effective: One database for all tenants
  • Easy to maintain: Single schema, one backup process
  • Good performance with proper indexing (composite indexes)
  • Easy to add new tenants (just insert into tenants table)
  • Per-tenant data export is a single SQL query: SELECT * FROM projects WHERE tenant_id = 'xxx'
  • Scales to 10,000+ tenants on one database
  • Automatic filtering via Global Query Filter (developer-friendly)

Cons:

  • Risk of data leak if Global Query Filter is bypassed (.IgnoreQueryFilters())
  • All tenants affected by database downtime
  • Cannot isolate noisy neighbors (one tenant's heavy queries affect others)
  • Database size grows with all tenants (monitoring required)

Cost Estimate: 1 database instance (~$100-200/month for medium workload)

Option 2: Database-per-Tenant

Approach:

  • Each tenant gets a dedicated PostgreSQL database
  • Connection string stored in tenants table
  • Middleware switches database context per request

Schema Example:

-- Shared management database
CREATE TABLE tenants (
    id UUID PRIMARY KEY,
    slug VARCHAR(50) UNIQUE NOT NULL,
    connection_string TEXT NOT NULL -- Encrypted
);

-- Tenant-specific database (one per tenant)
CREATE DATABASE tenant_acme;
CREATE DATABASE tenant_beta;

Pros:

  • Strong isolation: One tenant's database cannot access another
  • Tenant-specific customization (different schema versions)
  • Easy to back up per tenant
  • Noisy neighbors don't affect each other
  • Easy to migrate tenant to different database server

Cons:

  • Expensive: N databases for N tenants (~$10-20/month per tenant minimum)
  • Complex maintenance: Schema migrations across 1000s of databases
  • Connection pool exhaustion (need one pool per tenant)
  • Difficult to implement cross-tenant features (analytics, admin tools)
  • Onboarding delay (new database provisioning takes time)

Cost Estimate: 1000 tenants × $15/month = $15,000/month (vs $200 for shared)

Option 3: Schema-per-Tenant (PostgreSQL Schemas)

Approach:

  • One database with multiple PostgreSQL schemas
  • Each tenant gets a schema: tenant_acme.projects, tenant_beta.projects
  • Middleware switches search_path per request: SET search_path = tenant_acme;

Pros:

  • Better isolation than shared database
  • Lower cost than database-per-tenant
  • All tenants in one PostgreSQL instance (easier backups)
  • Can support ~1000 schemas per database

Cons:

  • PostgreSQL schema limit (~1000 schemas per database)
  • Schema creation overhead for new tenants
  • Complex schema migrations (run migration on each schema)
  • Search_path switching per request (performance overhead)
  • Difficult to enforce (easy to forget to set search_path)

Cost Estimate: Same as shared database, but limited scalability

Option 4: Separate Infrastructure per Tenant (Fully Isolated)

Approach:

  • Each tenant gets dedicated Kubernetes namespace, database, Redis, etc.
  • Complete infrastructure isolation

Pros:

  • Maximum isolation and security
  • Per-tenant scaling and customization
  • Enterprise customers often require this

Cons:

  • Extremely expensive (hundreds of dollars per tenant)
  • Complex to manage (orchestration required)
  • Overkill for most tenants
  • Long onboarding time

Cost Estimate: 1000 tenants × $500/month = $500,000/month (prohibitive)

Decision

Chosen Option: Option 1 - Shared Database + tenant_id Column + Global Query Filter

Rationale:

  1. Cost-Effective: $200/month vs $15,000/month for database-per-tenant
  2. Scalable: PostgreSQL handles 10,000+ tenants with proper indexing
  3. Maintainable: One schema, one backup process, one monitoring dashboard
  4. Developer-Friendly: EF Core Global Query Filter ensures automatic filtering
  5. Performance: Composite indexes provide excellent query performance
  6. Proven Pattern: Used by GitHub, Slack, Heroku, and many successful SaaS products

Implementation Strategy:

  • Add tenant_id column to all business tables
  • Create composite indexes: (tenant_id, primary_key), (tenant_id, foreign_key)
  • Configure EF Core Global Query Filter in OnModelCreating
  • Create TenantContext service to inject current tenant
  • Add database-level constraints: CHECK (tenant_id IS NOT NULL)
  • Update unique constraints to be tenant-scoped: UNIQUE (tenant_id, email)
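
A minimal sketch of the TenantContext service and the write side of the filter, assuming a marker interface ITenantEntity with a TenantId property (names are illustrative):

public interface ITenantContext
{
    Guid CurrentTenantId { get; set; }
}

public class ColaFlowDbContext : DbContext
{
    private readonly ITenantContext _tenantContext;

    public ColaFlowDbContext(DbContextOptions<ColaFlowDbContext> options, ITenantContext tenantContext)
        : base(options) => _tenantContext = tenantContext;

    public override Task<int> SaveChangesAsync(CancellationToken cancellationToken = default)
    {
        // Stamp the current tenant on new entities so inserts cannot cross tenants,
        // complementing the read-side Global Query Filter shown earlier.
        foreach (var entry in ChangeTracker.Entries<ITenantEntity>())
        {
            if (entry.State == EntityState.Added)
            {
                entry.Entity.TenantId = _tenantContext.CurrentTenantId;
            }
        }

        return base.SaveChangesAsync(cancellationToken);
    }
}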

Migration Path:

  • Create tenants table
  • Create default tenant for existing data
  • Add tenant_id columns (nullable initially)
  • Migrate existing data to default tenant
  • Set tenant_id as NOT NULL
  • Add indexes and constraints
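
The middle steps of this migration path could look roughly like the following EF Core migration; table and column names follow the schema example above, and the default tenant id is a placeholder:

public partial class AddTenantIdToProjects : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        // 1. Add tenant_id as nullable so existing rows stay valid.
        migrationBuilder.AddColumn<Guid>(name: "tenant_id", table: "projects", nullable: true);

        // 2. Backfill existing rows with the default tenant.
        migrationBuilder.Sql(
            "UPDATE projects SET tenant_id = '00000000-0000-0000-0000-000000000001' WHERE tenant_id IS NULL;");

        // 3. Tighten to NOT NULL and add the tenant-scoped composite index.
        migrationBuilder.AlterColumn<Guid>(name: "tenant_id", table: "projects", nullable: false);

        migrationBuilder.CreateIndex(
            name: "idx_projects_tenant_key",
            table: "projects",
            columns: new[] { "tenant_id", "key" },
            unique: true);
    }
}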

Consequences

Positive:

  • Low infrastructure cost (1 database vs thousands)
  • Easy to maintain and monitor
  • Fast schema migrations (one database)
  • Automatic tenant filtering (developer safety)
  • Good query performance with indexes
  • Per-tenant data export is straightforward SQL

Negative:

  • Risk of data leak if developer bypasses Global Query Filter
  • All tenants share database resources (monitoring required)
  • Cannot isolate noisy neighbors at database level
  • Database backup contains all tenants (larger backup size)

Neutral:

  • Tenant onboarding is instant (no new database needed)
  • Cross-tenant analytics require explicit filtering
  • Database size monitoring required as tenant count grows

Mitigation Strategies:

  • Data Leak Prevention:
    • Code review requirement for any .IgnoreQueryFilters() usage
    • Integration tests verify cross-tenant isolation
    • Automated security testing (attempt cross-tenant access)
  • Performance Monitoring:
    • Alert on slow queries (>100ms)
    • Index usage monitoring (pg_stat_user_indexes)
    • Per-tenant query cost tracking
  • Noisy Neighbor Protection:
    • Query timeout limits (5 seconds max)
    • Rate limiting per tenant
    • Connection pool limits
    • Option to migrate large tenant to dedicated database later
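
The cross-tenant isolation test called out in the mitigation list could look like this xUnit sketch; CreateAuthenticatedClientAsync and ProjectDto are hypothetical test fixtures, not existing helpers:

[Fact]
public async Task Project_created_in_tenant_A_is_not_visible_to_tenant_B()
{
    // Arrange: create a project while authenticated as tenant A.
    var clientA = await _factory.CreateAuthenticatedClientAsync("acme");
    var response = await clientA.PostAsJsonAsync("/api/projects", new { name = "Secret", key = "SEC" });
    response.EnsureSuccessStatusCode();
    var project = await response.Content.ReadFromJsonAsync<ProjectDto>();

    // Act: query the same data as tenant B.
    var clientB = await _factory.CreateAuthenticatedClientAsync("beta");
    var listB = await clientB.GetFromJsonAsync<List<ProjectDto>>("/api/projects");
    var direct = await clientB.GetAsync($"/api/projects/{project!.Id}");

    // Assert: tenant B sees nothing from tenant A.
    Assert.DoesNotContain(listB!, p => p.Key == "SEC");
    Assert.True(direct.StatusCode is HttpStatusCode.NotFound or HttpStatusCode.Forbidden);
}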

Upgrade Path: If a tenant grows too large or requires dedicated resources, we can migrate them to a separate database while keeping the shared model for other tenants.

Validation

Acceptance Criteria:

  • All queries automatically filter by tenant
  • Cross-tenant access attempts fail with 403 Forbidden
  • Query performance <50ms for typical workloads (with 10,000 records per tenant)
  • Integration tests verify tenant isolation
  • Data export per tenant completes in <1 minute

Testing:

  • Unit tests: Global Query Filter applied to all entities
  • Integration tests: Create data in Tenant A, verify Tenant B cannot access
  • Performance tests: Query time with 1 million total records (100 tenants × 10,000 records)
  • Load tests: 10,000 concurrent requests across 100 tenants

References

  • Architecture Doc: docs/architecture/multi-tenancy-architecture.md
  • Migration Strategy: docs/architecture/migration-strategy.md
  • Performance Benchmarks: docs/architecture/performance-benchmarks.md (TBD)

ADR-003: SSO Library Selection

Status

Accepted - 2025-11-03

Context

Enterprise customers require Single Sign-On (SSO) to integrate ColaFlow with their corporate identity providers (Azure AD, Google Workspace, Okta, etc.). We need to choose an SSO library/approach that balances functionality, cost, implementation speed, and maintainability.

Requirements:

  • Support major identity providers: Azure AD, Google, Okta
  • Support OIDC (OpenID Connect) protocol
  • Support SAML 2.0 for generic enterprise IdPs
  • User auto-provisioning (create user on first SSO login)
  • Email domain restrictions (only allow @acme.com)
  • Configurable per tenant (each tenant has own SSO config)
  • Production-ready security standards

Decision Drivers

  1. Time-to-Market: Implement SSO in <1 week (M1 timeline constraint)
  2. Cost: Minimize licensing fees
  3. Coverage: Support 90% of enterprise SSO requirements
  4. Flexibility: Can upgrade later if complex requirements emerge
  5. Security: Follow OWASP and OIDC/SAML best practices

Options Considered

Option 1: ASP.NET Core Native OIDC/SAML (M1-M2)

Approach:

  • Use built-in Microsoft.AspNetCore.Authentication.OpenIdConnect for OIDC
  • Use Sustainsys.Saml2 library for SAML 2.0
  • Custom implementation for multi-tenant SSO configuration
  • Store SSO config in tenants table (JSONB column)

Pros:

  • Free: No licensing costs
  • Fast: Can implement OIDC in 2-3 days, SAML in 3-4 days
  • Built into .NET 9: mature, well-documented
  • Flexible: Full control over implementation
  • Covers 80-90% of enterprise SSO needs

Cons:

  • Manual implementation: Need to handle user provisioning, domain restrictions
  • Limited advanced features: No federation, no protocol switching
  • SAML is more complex to implement
  • Need to maintain our own SSO configuration UI

Implementation Complexity: Medium
Cost: $0/month
Coverage: OIDC (Azure, Google, Okta) + SAML 2.0 (80% of market)

Code Example:

services.AddAuthentication()
    .AddOpenIdConnect("AzureAD", options =>
    {
        options.Authority = tenant.SsoConfig.AuthorityUrl;
        options.ClientId = tenant.SsoConfig.ClientId;
        options.ClientSecret = tenant.SsoConfig.ClientSecret;
        options.ResponseType = "code";
        options.SaveTokens = true;
        options.Events = new OpenIdConnectEvents
        {
            OnTokenValidated = async context =>
            {
                await AutoProvisionUserAsync(context);
            }
        };
    });
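
The per-tenant SSO configuration referenced as tenant.SsoConfig could be modeled roughly as below and serialized into the tenants JSONB column; property names are illustrative, and the client secret would be stored encrypted:

public class TenantSsoConfig
{
    public string Provider { get; set; } = "OIDC";        // "OIDC" or "SAML"
    public string AuthorityUrl { get; set; } = default!;  // e.g. the Azure AD tenant authority
    public string ClientId { get; set; } = default!;
    public string ClientSecret { get; set; } = default!;  // encrypted at rest
    public List<string> AllowedEmailDomains { get; set; } = new(); // e.g. ["acme.com"]
    public bool AutoProvisionUsers { get; set; } = true;
}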

Option 2: Auth0

Approach:

  • Use Auth0 as SSO broker
  • Auth0 handles all identity providers
  • Configure Auth0 via their dashboard
  • Pay per monthly active user (MAU)

Pros:

  • Fast setup: Implement in 1-2 days
  • Comprehensive: Supports all identity providers out-of-the-box
  • User management: Built-in user directory
  • Advanced features: MFA, passwordless, anomaly detection
  • Dashboard for SSO configuration

Cons:

  • Expensive: $240/month (Professional) + $0.05/MAU (500 users = $25/month extra)
  • Vendor lock-in: Difficult to migrate away
  • Less control: Auth0 controls auth flow
  • Overkill for MVP: Many features we don't need yet

Implementation Complexity: Low
Cost: $3,000-5,000/year (for 100 tenants with 5,000 total users)
Coverage: 100% (all protocols, all providers)

Option 3: Okta (Workforce Identity Cloud)

Approach:

  • Use Okta as SSO broker
  • Similar to Auth0 but more enterprise-focused
  • Per-user pricing

Pros:

  • Enterprise-grade: Trusted by Fortune 500
  • Complete features: SSO, MFA, provisioning, directory
  • Excellent support and documentation

Cons:

  • Very expensive: $2/user/month minimum (100 users = $200/month)
  • Enterprise sales process (slow, complex)
  • Overkill for startup/SMB customers
  • Vendor lock-in

Implementation Complexity: Low
Cost: $5,000-10,000/year (for 100 tenants)
Coverage: 100%

Option 4: IdentityServer4 / Duende IdentityServer

Approach:

  • Use IdentityServer as self-hosted identity provider
  • Implement Federation support (connect to external IdPs)
  • Open-source (IdentityServer4) or licensed (Duende)

Pros:

  • Self-hosted: Full control
  • Comprehensive: OIDC, OAuth 2.0, SAML via plugins
  • Flexible: Can customize extensively
  • No per-user fees

Cons:

  • Complex: Steep learning curve (2-3 weeks to implement)
  • Maintenance burden: Need to maintain IdentityServer instance
  • Duende licensing: $1,500/year for production use
  • Overkill for MVP: We don't need an identity provider, just SSO

Implementation Complexity: High
Cost: $1,500/year (Duende license)
Coverage: 100%

Decision

Chosen Option: Option 1 - ASP.NET Core Native OIDC/SAML (M1-M2)

Rationale:

  1. Cost: $0/month vs $3,000-5,000/year for Auth0/Okta
  2. Speed: Can implement in <1 week (M1 timeline)
  3. Control: Full flexibility to customize
  4. Coverage: Supports 80% of enterprise SSO requirements (OIDC + SAML)
  5. Upgrade Path: Can migrate to Auth0/Okta later if complex requirements emerge

Decision: Start with native ASP.NET Core for M1-M2. Re-evaluate at M3 if we need:

  • Complex federation (multiple IdPs per tenant)
  • Advanced MFA flows
  • More than 5 different SSO protocols
  • Dedicated identity management features

Implementation Strategy:

  • M1 (Week 1): OIDC implementation (Azure AD, Google, Okta)
  • M2 (Week 2): SAML 2.0 implementation (generic enterprise IdPs)
  • M2 (Week 3): User auto-provisioning and domain restrictions
  • M2 (Week 4): SSO configuration UI for tenants
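
The auto-provisioning hook wired into OnTokenValidated in the Option 1 example could look like this sketch; the _tenant field, _userRepository, and User.CreateFromSso factory are assumptions about surrounding code, not the final API:

private async Task AutoProvisionUserAsync(TokenValidatedContext context)
{
    var email = context.Principal?.FindFirst(ClaimTypes.Email)?.Value
                ?? context.Principal?.FindFirst("email")?.Value;
    if (email is null)
    {
        context.Fail("SSO token did not contain an email claim.");
        return;
    }

    // Enforce the tenant's email domain restriction (e.g. only @acme.com).
    var domain = email.Split('@').Last();
    if (!_tenant.SsoConfig.AllowedEmailDomains.Contains(domain, StringComparer.OrdinalIgnoreCase))
    {
        context.Fail($"Email domain '{domain}' is not allowed for this tenant.");
        return;
    }

    // Create the user on first SSO login if it does not exist yet.
    var user = await _userRepository.FindByEmailAsync(_tenant.Id, email);
    if (user is null)
    {
        user = User.CreateFromSso(_tenant.Id, email, authProvider: context.Scheme.Name);
        await _userRepository.AddAsync(user);
    }
}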

Consequences

Positive:

  • Zero licensing costs for M1-M2
  • Complete control over implementation
  • Can customize for our specific needs
  • Fast implementation (< 1 week)
  • Covers 80% of enterprise SSO requirements
  • Learning opportunity for team

Negative:

  • Manual implementation required (more code to maintain)
  • Limited to OIDC + SAML 2.0 (no exotic protocols)
  • Need to build SSO configuration UI ourselves
  • More testing required (vs using Auth0)

Neutral:

  • Can migrate to Auth0/Okta later if needed
  • SSO config stored in database (our control)
  • Integration tests required for each IdP

Mitigation Strategies:

  • Quality: Comprehensive testing with real IdPs (Azure AD, Google)
  • Documentation: Detailed guides for each supported provider
  • Security: Follow OIDC/SAML security best practices
  • Upgrade Path: Design SSO config to be provider-agnostic (easy migration)

Validation

Acceptance Criteria:

  • OIDC login works with Azure AD, Google, Okta
  • SAML 2.0 login works with generic IdP
  • Users auto-provisioned on first login
  • Email domain restrictions enforced
  • SSO configuration UI functional for admins
  • Error handling for common SSO failures

Testing:

  • Unit tests: OIDC token validation, SAML assertion parsing
  • Integration tests: Full SSO flow with real IdPs (test tenants)
  • Security tests: CSRF protection, replay attack prevention
  • Usability tests: Admin can configure SSO without support

References

  • Architecture Doc: docs/architecture/sso-integration-architecture.md
  • Implementation Guide: docs/implementation/sso-implementation.md (TBD)
  • Security Checklist: docs/security/sso-security-checklist.md (TBD)

ADR-004: MCP Token Format

Status

Accepted - 2025-11-03

Context

ColaFlow will expose an MCP (Model Context Protocol) server that allows AI agents (Claude, ChatGPT) to access project data, create tasks, and generate reports. We need a secure, revocable authentication mechanism for AI agents.

Requirements:

  • Secure: Cannot be forged or tampered with
  • Revocable: Admin can revoke token instantly
  • Fine-Grained Permissions: Control read/write access per resource
  • Audit Trail: Log all API operations performed with token
  • Tenant-Scoped: Token only works for one tenant
  • Long-Lived: Valid for days/weeks (not short-lived like JWT)

Decision Drivers

  1. Security: Token cannot be guessed or brute-forced
  2. Revocability: Instant revocation (no JWT blacklist complexity)
  3. Permissions: Resource-level + operation-level granularity
  4. Auditability: Complete log of all token operations
  5. Usability: Easy to copy/paste, recognizable format

Options Considered

Option 1: Opaque Tokens (mcp_<tenant_slug>_<random_32>)

Format: mcp_acme_7f3d8a9c4e1b2f5a6d8c9e0f1a2b3c4d

Approach:

  • Token is a random string (cryptographically secure)
  • Prefix: mcp_ (identifies as MCP token)
  • Tenant slug: acme (for easy identification)
  • Random part: 32 hex characters (128 bits of entropy)
  • Store token hash (SHA256) in database
  • Store permissions in database alongside token

Token Storage:

CREATE TABLE mcp_tokens (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL,
    user_id UUID NULL,
    name VARCHAR(100) NOT NULL,
    token_hash VARCHAR(255) NOT NULL UNIQUE, -- SHA256 of token
    permissions JSONB NOT NULL, -- {"projects": ["read", "search"], ...}
    status INT NOT NULL, -- Active/Revoked/Expired
    created_at TIMESTAMP NOT NULL,
    expires_at TIMESTAMP NULL,
    last_used_at TIMESTAMP NULL
);

Validation Flow:

  1. Receive token: mcp_acme_xxx...
  2. Hash token with SHA256
  3. Lookup in database by token_hash
  4. Check status (Active/Revoked/Expired)
  5. Check expiration date
  6. Load permissions from JSONB column
  7. Inject tenant context and permissions into request
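
A condensed sketch of that flow as authentication middleware, assuming the token is presented as a Bearer credential and that IMcpTokenRepository and TenantContext are the surrounding services (names are illustrative):

public class McpTokenAuthenticationMiddleware
{
    private readonly RequestDelegate _next;

    public McpTokenAuthenticationMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context, IMcpTokenRepository tokens, TenantContext tenantContext)
    {
        var header = context.Request.Headers.Authorization.ToString();
        if (!header.StartsWith("Bearer mcp_"))
        {
            await _next(context); // not an MCP token; let other auth handlers run
            return;
        }

        // Steps 1-2: extract the token and hash it (same SHA256 as at creation time).
        var presented = header["Bearer ".Length..];
        var hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(presented))).ToLowerInvariant();

        // Steps 3-5: lookup by hash, then check status and expiration.
        var token = await tokens.FindByHashAsync(hash);
        if (token is null || token.Status != McpTokenStatus.Active ||
            (token.ExpiresAt is not null && token.ExpiresAt < DateTime.UtcNow))
        {
            context.Response.StatusCode = StatusCodes.Status401Unauthorized;
            return;
        }

        // Steps 6-7: inject tenant context and permissions for downstream handlers.
        tenantContext.CurrentTenantId = token.TenantId;
        context.Items["McpPermissions"] = token.Permissions;

        await _next(context);
    }
}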

Pros:

  • Revocable: Update status = Revoked in database, takes effect immediately
  • Secure: SHA256 hashed, never stored plain-text
  • Flexible Permissions: Can update permissions without regenerating token
  • Auditable: Every token use logged in database
  • Tenant-Scoped: tenant_id stored alongside the token hash, and the tenant slug is embedded in the token itself
  • Long-Lived: Can be valid for months/years
  • Easy to Identify: Prefix + tenant slug clearly identify token type

Cons:

  • Database lookup required on every request (performance overhead)
  • Larger tokens (50+ characters) vs API keys (32 characters)
  • Need to manage token lifecycle (expiration, revocation)

Performance: ~5ms per token validation (including database lookup)

Option 2: JWT Tokens for MCP

Format: Long JWT string (200+ characters)

Approach:

  • Generate JWT with tenant_id, user_id, permissions claims
  • Sign with secret key
  • No database lookup required (stateless)
  • Validate signature on every request

Pros:

  • Stateless: No database lookup required
  • Fast validation: O(1) signature check
  • Self-contained: All info in token

Cons:

  • Cannot Revoke: Once issued, JWT is valid until expiration (unless using blacklist)
  • Blacklist Required: Need Redis/database to store revoked JWTs (adds complexity)
  • Permissions Fixed: Cannot update permissions without regenerating token
  • Larger Tokens: 200-500 characters (difficult to copy/paste)
  • Expiration Required: Must set short expiration for revocation to work

Revocation Problem:

User generates JWT token → Shares with AI agent → Admin wants to revoke
→ JWT is still valid for 30 days → Need to blacklist JWT ID
→ Now need Redis to store blacklist → Not truly stateless anymore

Option 3: API Keys (UUID Format)

Format: 550e8400-e29b-41d4-a716-446655440000

Approach:

  • Generate random UUID
  • Store in database with permissions
  • Simple validation: lookup by UUID

Pros:

  • Simple implementation
  • Standard format (UUID)
  • Revocable via database lookup (same mechanism as Option 1)

Cons:

  • No tenant context in token (need to lookup tenant)
  • No token type identifier (could be confused with user IDs)
  • No visual indication of purpose
  • Lower entropy (a UUIDv4 carries only 122 random bits, less than Option 1's 128-bit random part)

Option 4: GitHub-Style Personal Access Tokens

Format: ghp_ABcdEF123456789012345678901234567890

Approach:

  • Prefix identifies token type
  • Random alphanumeric string
  • Store hash in database

Pros:

  • Industry standard (used by GitHub, GitLab)
  • Easy to identify by prefix
  • Secure

Cons:

  • No tenant context in token itself
  • Otherwise essentially Option 1 without the tenant slug, so it adds nothing for our multi-tenant use case

Decision

Chosen Option: Option 1 - Opaque Tokens (mcp_<tenant_slug>_<random_32>)

Rationale:

  1. Revocability: Instant revocation without blacklist complexity
  2. Flexibility: Permissions stored server-side, can update without new token
  3. Security: 128 bits of entropy + SHA256 hashing
  4. Usability: Tenant slug in token helps users identify which tenant it's for
  5. Auditability: Complete audit trail in database

Token Format:

mcp_<tenant_slug>_<random_32_hex_chars>

Example:

mcp_acme_7f3d8a9c4e1b2f5a6d8c9e0f1a2b3c4d
mcp_techcorp_a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6

Components:

  • mcp_: Identifies as MCP token (easy to filter in logs)
  • acme: Tenant slug (helps user identify which tenant)
  • 7f3d8a9c...: 32 hex characters (128 bits entropy = 2^128 combinations)

Generation:

public string GenerateToken(string tenantSlug)
{
    var randomBytes = new byte[16]; // 128 bits
    using var rng = RandomNumberGenerator.Create();
    rng.GetBytes(randomBytes);
    var randomHex = Convert.ToHexString(randomBytes).ToLowerInvariant();
    return $"mcp_{tenantSlug}_{randomHex}";
}
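
The HashToken helper used in the storage example below could be this simple; SHA256 alone is adequate because the input already carries 128 bits of entropy, so no salt or slow KDF is needed:

public string HashToken(string token)
{
    var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(token));
    return Convert.ToHexString(bytes).ToLowerInvariant(); // matches the token_hash column
}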

Storage:

public async Task<string> CreateTokenAsync(CreateMcpTokenCommand command)
{
    // tenant comes from the current TenantContext (resolved by middleware)
    var token = _tokenGenerator.GenerateToken(tenant.Slug);
    var tokenHash = _tokenGenerator.HashToken(token); // SHA256

    var mcpToken = new McpToken
    {
        TenantId = tenant.Id,
        TokenHash = tokenHash, // Never store plain-text
        Permissions = command.Permissions,
        ExpiresAt = command.ExpiresAt
    };

    await _repository.AddAsync(mcpToken);
    return token; // Return the plain-text token ONLY ONCE; only the hash is persisted
}

Consequences

Positive:

  • Instant revocation (update database status)
  • Fine-grained permissions (stored server-side)
  • Complete audit trail
  • Tenant-scoped (slug in token)
  • Secure (128-bit entropy + SHA256)
  • User-friendly (tenant slug helps identification)

Negative:

  • Database lookup required per request (~5ms overhead)
  • Longer tokens (50 characters vs 32 for API keys)
  • Need to manage token lifecycle (expiration, cleanup)

Neutral:

  • Performance overhead acceptable for MCP use case (not high-frequency)
  • Token length acceptable for copy/paste workflow

Mitigation Strategies:

  • Performance: Cache token validation results (5-minute TTL)
  • Token Length: Provide copy button and download option in UI
  • Lifecycle Management: Automated cleanup job for expired tokens
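
The automated cleanup job could be a small hosted service; a sketch assuming a daily PeriodicTimer and a hypothetical MarkExpiredAsync repository method:

public class ExpiredMcpTokenCleanupService : BackgroundService
{
    private readonly IServiceScopeFactory _scopeFactory;

    public ExpiredMcpTokenCleanupService(IServiceScopeFactory scopeFactory) => _scopeFactory = scopeFactory;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(TimeSpan.FromHours(24));
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            using var scope = _scopeFactory.CreateScope();
            var repository = scope.ServiceProvider.GetRequiredService<IMcpTokenRepository>();

            // Flip tokens past their expiration date to Expired; rows are kept for the audit trail.
            await repository.MarkExpiredAsync(DateTime.UtcNow, stoppingToken);
        }
    }
}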

Validation

Acceptance Criteria:

  • Token generation is cryptographically secure (CSPRNG)
  • Token hash stored (SHA256), never plain-text
  • Token validation <10ms (including database lookup)
  • Revocation takes effect immediately
  • Permissions enforced on every API call
  • Audit log created for every token use

Testing:

  • Unit tests: Token generation format, hashing, validation
  • Integration tests: Token authentication flow, permission enforcement
  • Security tests: Brute-force resistance, revocation effectiveness
  • Performance tests: 1,000 req/s with token validation

References

  • Architecture Doc: docs/architecture/mcp-authentication-architecture.md
  • Token Management UI: docs/design/multi-tenant-ux-flows.md#mcp-token-management-flow

ADR-005: Frontend State Management

Status

Accepted - 2025-11-03

Context

ColaFlow frontend (Next.js 16 + React 19) needs a state management solution for authentication, user preferences, and server data. We need to choose libraries that are TypeScript-first, performant, and maintainable.

Requirements:

  • Type-safe: Full TypeScript support
  • Performant: Minimal re-renders
  • Developer-friendly: Low boilerplate
  • Server state caching: Avoid redundant API calls
  • Optimistic updates: Immediate UI feedback
  • Auth state persistence: Survive page refresh

Decision Drivers

  1. TypeScript Support: First-class TypeScript integration
  2. Performance: Minimal bundle size, fast renders
  3. DX (Developer Experience): Easy to learn, low boilerplate
  4. Ecosystem: Good documentation, active community
  5. Server State: Built-in caching and invalidation

Options Considered

Option 1: Zustand (Client State) + TanStack Query v5 (Server State)

Approach:

  • Zustand: Lightweight state manager for auth, UI state
  • TanStack Query: Server state caching, mutations, automatic refetching

Zustand Example:

// stores/useAuthStore.ts
import { create } from 'zustand';
import { persist } from 'zustand/middleware';

interface AuthState {
  user: User | null;
  tenant: Tenant | null;
  accessToken: string | null;
  login: (token: string, user: User, tenant: Tenant) => void;
  logout: () => void;
}

export const useAuthStore = create<AuthState>()(
  persist(
    (set) => ({
      user: null,
      tenant: null,
      accessToken: null,
      login: (token, user, tenant) => set({ accessToken: token, user, tenant }),
      logout: () => set({ accessToken: null, user: null, tenant: null }),
    }),
    { name: 'auth-storage' }
  )
);

TanStack Query Example:

// hooks/useMcpTokens.ts
import { useQuery } from '@tanstack/react-query';
import { mcpService } from '@/services/mcp.service';

export function useMcpTokens() {
  return useQuery({
    queryKey: ['mcp-tokens'],
    queryFn: () => mcpService.listTokens(),
    staleTime: 1000 * 60 * 5, // 5 minutes
  });
}

Pros:

  • Minimal Bundle Size: Zustand (3KB) + TanStack Query (15KB) = 18KB total
  • TypeScript-First: Excellent type inference
  • Low Boilerplate: No actions, reducers, or complex setup
  • Performance: Zustand avoids unnecessary re-renders
  • Caching: TanStack Query caches API responses automatically
  • DevTools: Excellent debugging tools for both libraries
  • Separation of Concerns: Client state in Zustand, server state in TanStack Query

Cons:

  • Two libraries to learn (vs one all-in-one solution)
  • Need to decide what goes in Zustand vs TanStack Query

Learning Curve: Low (Zustand is simpler than Redux, TanStack Query has great docs)

Option 2: Redux Toolkit + RTK Query

Approach:

  • Redux Toolkit for all state
  • RTK Query for API data fetching

Pros:

  • All-in-one solution
  • Mature ecosystem
  • Excellent DevTools

Cons:

  • More Boilerplate: Actions, slices, reducers
  • Larger Bundle: Redux (10KB) + RTK Query (20KB) = 30KB
  • Steeper Learning Curve: More concepts to learn
  • Overkill for MVP: We don't need Redux's complexity yet

Option 3: React Context + SWR

Approach:

  • React Context for auth state
  • SWR for server data

Pros:

  • Minimal dependencies (SWR only)
  • Simple concept (React Context is built-in)

Cons:

  • Performance Issues: React Context causes re-renders on every update
  • Boilerplate: Need to create context providers manually
  • SWR vs TanStack Query: SWR is less feature-rich

Option 4: Jotai + TanStack Query

Approach:

  • Jotai for atomic state management
  • TanStack Query for server state

Pros:

  • Atomic state model (like Recoil)
  • Good TypeScript support

Cons:

  • Less mature than Zustand
  • Smaller community
  • Atomic model can be overkill for simple auth state

Decision

Chosen Option: Option 1 - Zustand (Client State) + TanStack Query v5 (Server State)

Rationale:

  1. Bundle Size: 18KB total (vs 30KB for Redux Toolkit)
  2. Performance: Zustand selector-based re-renders, TanStack Query caching
  3. TypeScript: First-class support in both libraries
  4. Learning Curve: Simple APIs, great documentation
  5. Clear Separation: Auth/UI in Zustand, API data in TanStack Query

Usage Guidelines:

Zustand - Use For:

  • Authentication state (user, tenant, accessToken)
  • UI state (sidebar open/closed, theme)
  • User preferences (language, timezone)

TanStack Query - Use For:

  • API data (projects, issues, tokens)
  • Mutations (create, update, delete)
  • Caching and invalidation

Example Architecture:

// Zustand (auth)
const { user, tenant, logout } = useAuthStore();

// TanStack Query (server data)
const { data: projects, isLoading } = useQuery({
  queryKey: ['projects'],
  queryFn: () => projectService.getAll()
});

// Mutation (queryClient obtained via useQueryClient())
const queryClient = useQueryClient();
const createProject = useMutation({
  mutationFn: (data) => projectService.create(data),
  onSuccess: () => {
    queryClient.invalidateQueries({ queryKey: ['projects'] });
  }
});

Consequences

Positive:

  • Lightweight and fast
  • Easy to learn and use
  • Great TypeScript experience
  • Excellent caching and performance
  • Clear separation of concerns

Negative:

  • Two libraries to learn (instead of one)
  • Need to decide where state lives (Zustand vs TanStack Query)

Neutral:

  • Both libraries have excellent DevTools
  • Both are actively maintained

Mitigation Strategies:

  • Documentation: Create team guide for "What goes where"
  • Code Reviews: Ensure consistent usage patterns
  • Linting: Custom ESLint rules if needed

Validation

Acceptance Criteria:

  • Auth state persists across page refresh
  • API data cached appropriately (no redundant calls)
  • Optimistic updates work (immediate UI feedback)
  • TypeScript errors caught at compile time
  • DevTools show state clearly

Performance Targets:

  • Initial page load: <1.5s
  • State updates: <16ms (60fps)
  • Cache hit rate: >80%

References


ADR-006: Token Storage Strategy

Status

Accepted - 2025-11-03

Context

We need to securely store JWT access tokens and refresh tokens in the frontend. The storage mechanism must balance security, usability, and functionality.

Requirements:

  • Secure: Protect against XSS and CSRF attacks
  • Persistent: Survive page refresh
  • Auto-refresh: Seamlessly refresh tokens before expiration
  • Logout: Clear tokens on logout
  • Cross-tab sync: Logout in one tab logs out all tabs

Decision Drivers

  1. Security: XSS protection (primary threat)
  2. CSRF Protection: For refresh tokens
  3. Usability: Seamless token refresh
  4. Persistence: User stays logged in across sessions
  5. Performance: Fast token access

Options Considered

Option 1: Access Token in Memory + Refresh Token in httpOnly Cookie

Approach:

  • Access Token: Stored in Zustand state (memory only, not persisted)
  • Refresh Token: Stored in httpOnly cookie (server-side managed)
  • Flow:
    1. User logs in → Receive access + refresh tokens
    2. Access token stored in Zustand (memory)
    3. Refresh token stored in httpOnly cookie by backend
    4. Access token used for API calls (Authorization header)
    5. On 401 error → Call /api/auth/refresh (refresh token sent automatically via cookie)
    6. Receive new access token → Update Zustand state

Cookie Configuration (Backend):

Response.Cookies.Append("refreshToken", refreshToken, new CookieOptions
{
    HttpOnly = true, // Cannot be accessed by JavaScript
    Secure = true,   // HTTPS only
    SameSite = SameSiteMode.Strict, // CSRF protection
    MaxAge = TimeSpan.FromDays(7)
});

Pros:

  • XSS Protection (Access Token): Cannot be stolen via XSS (not in localStorage/cookies)
  • CSRF Protection (Refresh Token): httpOnly + SameSite=Strict
  • Short-Lived Access Token: Even if leaked, expires in 60 minutes
  • Automatic Refresh: Cookie sent automatically on refresh endpoint
  • No Manual Cookie Management: Backend sets/clears cookies

Cons:

  • Access token lost on page refresh (need to call refresh immediately)
  • Requires cookie support (some corporate proxies block cookies)

Security Score: 9/10 (Best practice)

Option 2: Both Tokens in localStorage

Approach:

  • Store both access and refresh tokens in localStorage
  • Read on page load

Pros:

  • Simple implementation
  • Tokens persist across page refresh
  • No cookie management

Cons:

  • Vulnerable to XSS: If attacker injects script, can steal both tokens
  • No CSRF Protection: Tokens accessible to any script
  • Not Recommended: Violates OWASP security guidelines

Security Score: 3/10 (Not secure)

Option 3: Both Tokens in httpOnly Cookies

Approach:

  • Store both tokens in httpOnly cookies
  • Backend sends cookies on every API response

Pros:

  • XSS protection for both tokens
  • Automatic token management

Cons:

  • CSRF Vulnerability: Cookies sent automatically with every request
  • Need CSRF Tokens: Additional complexity
  • Cookie Size Limit: JWTs can be large (4KB cookie limit)
  • Double-Submit Cookie Pattern Required: More complexity

Security Score: 6/10 (CSRF risk)

Option 4: Session-Based Authentication (No JWT)

Approach:

  • Traditional session cookies
  • Session stored server-side (Redis)

Pros:

  • Simple
  • Secure (session ID only)

Cons:

  • Not stateless (requires Redis/database for sessions)
  • Horizontal scaling complexity
  • Not suitable for mobile apps
  • Against our JWT strategy

Security Score: 7/10 (Secure but not stateless)

Decision

Chosen Option: Option 1 - Access Token in Memory + Refresh Token in httpOnly Cookie

Rationale:

  1. Best Security: Access token protected from XSS, refresh token protected from CSRF
  2. Industry Standard: Used by Auth0, Okta, and major SaaS apps
  3. Balances Security and UX: Short-lived access token, auto-refresh
  4. Stateless: No session storage required
  5. Mobile-Friendly: Can adapt for mobile (store refresh token securely)

Implementation:

// stores/useAuthStore.ts
// Note: no persist middleware here; the access token lives in memory ONLY
export const useAuthStore = create<AuthState>((set) => ({
  user: null,
  accessToken: null,
  login: (token, user) => set({ accessToken: token, user }),
  updateToken: (token) => set({ accessToken: token }),
  logout: () => set({ accessToken: null, user: null })
}));

// lib/api-client.ts
apiClient.interceptors.response.use(
  (response) => response,
  async (error) => {
    if (error.response?.status === 401 && !error.config._retry) {
      error.config._retry = true;

      // Call refresh endpoint (refresh token sent via cookie automatically)
      const { data } = await axios.post('/api/auth/refresh');

      // Update access token in memory
      useAuthStore.getState().updateToken(data.accessToken);

      // Retry original request
      error.config.headers.Authorization = `Bearer ${data.accessToken}`;
      return apiClient(error.config);
    }

    return Promise.reject(error);
  }
);

Token Refresh Strategy:

  • Automatic: Intercept 401 errors, call refresh endpoint
  • Preemptive (Optional): Refresh 5 minutes before expiration
  • One-at-a-Time: Only one refresh call in flight (queue other requests)
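
On the backend, the refresh endpoint reads the httpOnly cookie and rotates the refresh token; a hedged sketch in which _tokenService and RotateRefreshTokenAsync are placeholders for the real token service:

[HttpPost("api/auth/refresh")]
public async Task<IActionResult> Refresh()
{
    // The refresh token arrives automatically via the httpOnly cookie.
    if (!Request.Cookies.TryGetValue("refreshToken", out var refreshToken))
        return Unauthorized();

    var result = await _tokenService.RotateRefreshTokenAsync(refreshToken);
    if (result is null)
        return Unauthorized();

    // Rotation: set a new refresh cookie alongside the new access token.
    Response.Cookies.Append("refreshToken", result.NewRefreshToken, new CookieOptions
    {
        HttpOnly = true,
        Secure = true,
        SameSite = SameSiteMode.Strict,
        MaxAge = TimeSpan.FromDays(7)
    });

    return Ok(new { accessToken = result.AccessToken });
}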

Consequences

Positive:

  • Maximum security (XSS + CSRF protected)
  • Seamless user experience (auto-refresh)
  • Stateless authentication
  • Mobile-friendly (adapt for secure storage)
  • Industry best practice

Negative:

  • Access token lost on page refresh (need immediate refresh call)
  • Requires cookie support (fails in some corporate environments)
  • More complex implementation than localStorage

Neutral:

  • Short-lived access token means more refresh calls (acceptable trade-off)

Mitigation Strategies:

  • Page Load: Call refresh endpoint on app load if no access token in memory
  • Cookie Fallback: If cookies blocked, fall back to re-login
  • Error Handling: Clear UX if authentication fails (session expired)

Validation

Acceptance Criteria:

  • Access token not visible in localStorage/sessionStorage/cookies (developer tools)
  • Refresh token in httpOnly cookie with SameSite=Strict
  • 401 errors trigger automatic token refresh
  • Logout clears all tokens (memory + cookies)
  • Cross-tab logout works (listen to storage events)

Security Tests:

  • XSS attack simulation (cannot steal access token)
  • CSRF attack simulation (refresh endpoint protected)
  • Token expiration handled gracefully
  • Logout clears all authentication state

References


Summary of Decisions

Decision | Chosen Solution | Rationale
ADR-001: Tenant Identification | JWT Claims + Subdomain | Stateless, cross-platform, performant
ADR-002: Data Isolation | Shared DB + tenant_id + Global Query Filter | Cost-effective, scalable, maintainable
ADR-003: SSO Library | ASP.NET Core Native (OIDC + SAML) | Free, fast, covers 80% of needs
ADR-004: MCP Token Format | Opaque Tokens (mcp_<slug>_<random>) | Revocable, flexible, secure, auditable
ADR-005: Frontend State | Zustand + TanStack Query | Lightweight, TypeScript-first, performant
ADR-006: Token Storage | Access in Memory + Refresh in httpOnly Cookie | XSS + CSRF protected, industry standard

Impact Assessment

Security Impact

  • Overall Security Posture: Excellent (9/10)
  • XSS Protection: Enforced (tokens in memory + httpOnly cookies)
  • CSRF Protection: Enforced (SameSite=Strict cookies)
  • Data Isolation: Enforced (Global Query Filter + composite indexes)
  • Audit Trail: Complete (MCP tokens logged, SSO events tracked)

Performance Impact

  • API Latency: +5ms (JWT validation + tenant filtering)
  • Database Load: Minimal (composite indexes, Global Query Filter)
  • Frontend Bundle Size: +18KB (Zustand + TanStack Query)
  • Token Refresh: Transparent to user (<100ms)

Cost Impact

  • Infrastructure: $200/month (1 database vs $15,000 for DB-per-tenant)
  • Licensing: $0/month (native .NET libraries vs $3,000-5,000 for Auth0)
  • Maintenance: Low (one schema, automated migrations)
  • Total Savings: ~$18,000/year compared to Auth0 + DB-per-tenant

Development Impact

  • Implementation Time: 10 days (vs 6 weeks for IdentityServer + DB-per-tenant)
  • Learning Curve: Low (native libraries, clear architecture)
  • Maintenance Burden: Low (well-documented, industry patterns)
  • Testing Complexity: Medium (need tenant isolation tests)

Risks and Mitigation

Risk | Mitigation
Data leak via Global Query Filter bypass | Code review for .IgnoreQueryFilters(), integration tests
SSO misconfiguration | Test connection UI, detailed error messages, documentation
MCP token brute-force | 128-bit entropy, rate limiting, IP whitelisting
Performance degradation | Composite indexes, query monitoring, slow query alerts
Frontend XSS attack | CSP headers, input sanitization, React auto-escaping

Future Enhancements

Decisions are not permanent. We will revisit these at milestone reviews:

Milestone | Potential Changes
M3 | Re-evaluate SSO (Auth0 if complex federation needed)
M4 | Re-evaluate data isolation (DB-per-tenant for enterprise customers)
M5 | Re-evaluate frontend state (Redux if complex state emerges)
M6 | Re-evaluate MCP tokens (consider JWT if performance critical)

Document Status: Approved
Next Review: M3 Architecture Review (2025-12-15)
Approval Signatures:

  • Architecture Team: [Approved]
  • Product Manager: [Approved]
  • Security Team: [Pending Review]
  • Engineering Lead: [Approved]

End of Architecture Decision Record