Architecture Decision Record - ColaFlow Enterprise Multi-Tenancy

Document Type: ADR (Architecture Decision Record)
Date: 2025-11-03
Status: Accepted
Decision Makers: Architecture Team, Product Manager, Technical Leads
Project: ColaFlow - M1 Sprint 2 (Enterprise Multi-Tenant Upgrade)


Document Purpose

This Architecture Decision Record (ADR) documents the key architectural decisions made for ColaFlow's transition from a single-tenant to an enterprise-ready multi-tenant SaaS platform. It follows the ADR format to capture context, options considered, chosen solutions, and consequences.


Table of Contents

  1. ADR-001: Tenant Identification Strategy
  2. ADR-002: Data Isolation Strategy
  3. ADR-003: SSO Library Selection
  4. ADR-004: MCP Token Format
  5. ADR-005: Frontend State Management
  6. ADR-006: Token Storage Strategy
  7. Summary of Decisions

ADR-001: Tenant Identification Strategy

Status

Accepted - 2025-11-03

Context

ColaFlow is transitioning to a multi-tenant architecture where multiple companies (tenants) will share the same application instance. We need a reliable, performant, and secure method to identify which tenant a user or API request belongs to.

Requirements:

  • Must work across web, mobile, and API clients
  • Must be stateless (no session storage required)
  • Must be secure (prevent tenant spoofing)
  • Must be performant (no database lookup per request)
  • Must support both human users and AI agents (MCP tokens)
  • Must work with subdomain-based URLs (e.g., acme.colaflow.com)

Decision Drivers

  1. Performance: System must handle 10,000+ requests/second without database lookups
  2. Security: Tenant ID cannot be tampered with by malicious users
  3. Scalability: Solution must work for mobile apps, APIs, and web simultaneously
  4. Developer Experience: Easy to implement and maintain across all layers
  5. User Experience: Friendly tenant selection (via subdomain)

Options Considered

Option 1: JWT Claims (Primary) + Subdomain (Secondary)

Approach:

  • Store tenant_id and tenant_slug in JWT access token claims
  • Resolve tenant from subdomain on login/registration
  • Inject tenant context from JWT claims into all API requests
  • No database lookup required after authentication

Pros:

  • Stateless: No session storage or database lookup per request
  • Secure: JWT signature prevents tampering
  • Cross-platform: Works for web, mobile, API, MCP tokens
  • Fast: O(1) lookup from JWT claims
  • Tenant context available in middleware layer

Cons:

  • JWT cannot be updated until refresh (stale tenant info for up to 60 minutes)
  • Requires careful token expiration management
  • Subdomain only used for initial tenant resolution (login page)

Example JWT Payload:

{
  "sub": "user-id-123",
  "email": "john@acme.com",
  "tenant_id": "tenant-uuid-456",
  "tenant_slug": "acme",
  "tenant_plan": "Enterprise",
  "auth_provider": "AzureAD",
  "role": "User",
  "exp": 1730678400,
  "iat": 1730674800
}
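
A rough sketch of how such a token could be issued with System.IdentityModel.Tokens.Jwt; the signing key, lifetime, and variable names here are placeholders rather than the project's final code:

var claims = new List<Claim>
{
    new("sub", user.Id.ToString()),
    new("email", user.Email),
    new("tenant_id", tenant.Id.ToString()),
    new("tenant_slug", tenant.Slug),
    new("tenant_plan", tenant.Plan),
    new("role", user.Role)
};

var jwt = new JwtSecurityToken(
    claims: claims,
    expires: DateTime.UtcNow.AddMinutes(60), // short lifetime so tenant claims refresh quickly
    signingCredentials: new SigningCredentials(signingKey, SecurityAlgorithms.HmacSha256));

var accessToken = new JwtSecurityTokenHandler().WriteToken(jwt);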

Option 2: Session-Based Tenant Storage

Approach:

  • Store tenant ID in server-side session (Redis)
  • Lookup tenant on every request via session ID
  • Subdomain used for tenant resolution on login

Pros:

  • Can update tenant info without re-login
  • Works well for web applications
  • Session can store additional context

Cons:

  • Not stateless: Requires Redis/session storage infrastructure
  • Database/Redis lookup on every request (performance hit)
  • Difficult to scale horizontally (session affinity required)
  • Doesn't work well for mobile apps or API-only clients
  • MCP tokens would still need separate mechanism

Option 3: Subdomain-Only Identification

Approach:

  • Parse subdomain from HTTP Host header on every request
  • Lookup tenant by slug in database
  • No JWT claims for tenant

Pros:

  • Simple conceptual model
  • User-friendly (URL shows tenant)
  • Easy to test locally

Cons:

  • Database lookup on every request (performance bottleneck)
  • Doesn't work for API clients (no subdomain in API calls)
  • Doesn't work for mobile apps
  • Vulnerable to DNS spoofing
  • MCP tokens cannot carry subdomain context

Option 4: Tenant ID in URL Path

Approach:

  • Include tenant ID in every API route: /api/tenants/{tenantId}/projects
  • Frontend passes tenant ID explicitly

Pros:

  • Explicit tenant context in every request
  • Easy to debug
  • Works across all client types

Cons:

  • Poor user experience (ugly URLs)
  • Easy to make mistakes (wrong tenant ID)
  • Difficult to enforce (requires middleware validation)
  • Security risk (users could try other tenant IDs)
  • Requires frontend to manage tenant ID everywhere

Decision

Chosen Option: Option 1 - JWT Claims (Primary) + Subdomain (Secondary)

Rationale:

  1. Performance: No database lookup per request; O(1) from JWT claims
  2. Security: JWT signature prevents tampering; middleware validates on every request
  3. Scalability: Works for web, mobile, API, and MCP tokens uniformly
  4. Stateless: No session storage required; easy to scale horizontally
  5. Developer Experience: TenantContext injected automatically via middleware

Implementation Strategy:

  • Login Flow: User visits acme.colaflow.com/login → Tenant resolved from subdomain → JWT contains tenant_id and tenant_slug
  • API Requests: JWT extracted from Authorization header → tenant_id injected into TenantContext → EF Core Global Query Filter applies automatic filtering
  • MCP Tokens: Opaque tokens stored with tenant_id → Middleware validates token → Tenant context injected (same as JWT)
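
A minimal sketch of the API request step above, assuming a request-scoped TenantContext service with settable CurrentTenantId/CurrentTenantSlug properties (illustrative names, not the final API):

public class TenantResolutionMiddleware
{
    private readonly RequestDelegate _next;

    public TenantResolutionMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context, TenantContext tenantContext)
    {
        // The JWT signature has already been validated by the authentication middleware;
        // here we only copy the tenant claims into the request-scoped TenantContext.
        var tenantIdClaim = context.User.FindFirst("tenant_id")?.Value;
        if (tenantIdClaim is not null && Guid.TryParse(tenantIdClaim, out var tenantId))
        {
            tenantContext.CurrentTenantId = tenantId;
            tenantContext.CurrentTenantSlug = context.User.FindFirst("tenant_slug")?.Value;
        }

        await _next(context);
    }
}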

Consequences

Positive:

  • Fast authentication and authorization
  • No session storage infrastructure required
  • Uniform tenant resolution across all client types
  • Easy to test and debug (tenant visible in JWT payload)
  • Supports multi-tenant mobile apps

Negative:

  • Tenant changes require re-login (or wait for token refresh)
  • JWT size increases slightly (+50 bytes for tenant claims)
  • Middleware must validate JWT on every request (minor CPU cost)

Neutral:

  • Subdomain is only used for initial tenant selection (login page)
  • Tenant switching requires logout and login to different subdomain

Mitigation Strategies:

  • Keep JWT expiration short (60 minutes) to allow tenant updates on refresh
  • Implement automatic token refresh to minimize user disruption
  • Cache JWT validation results per request to avoid redundant checks

Validation

Acceptance Criteria:

  • JWT contains tenant_id, tenant_slug, and tenant_plan claims
  • Middleware extracts tenant from JWT and injects into TenantContext
  • All database queries automatically filter by tenant via Global Query Filter
  • Cross-tenant access attempts return 403 Forbidden
  • Performance: <5ms overhead for JWT validation per request

Testing:

  • Unit tests: TenantContext injection
  • Integration tests: Cross-tenant isolation
  • Performance tests: 10,000 req/s with JWT validation
  • Security tests: Attempt to access other tenant's data (should fail)

References

  • Architecture Doc: docs/architecture/multi-tenancy-architecture.md
  • JWT Implementation: docs/architecture/jwt-authentication-architecture.md
  • MCP Token Format: docs/architecture/mcp-authentication-architecture.md

ADR-002: Data Isolation Strategy

Status

Accepted - 2025-11-03

Context

In a multi-tenant system, data isolation is critical to ensure that one tenant cannot access another tenant's data. We need to choose an isolation strategy that balances security, performance, cost, and maintainability.

Requirements:

  • Strong data isolation (no cross-tenant leaks)
  • Good query performance (<50ms for typical queries)
  • Cost-effective (avoid database proliferation)
  • Easy to maintain and backup
  • Scalable to 10,000+ tenants
  • Support for per-tenant data export (GDPR compliance)

Decision Drivers

  1. Security: Absolute data isolation between tenants
  2. Cost: Minimize infrastructure costs (PostgreSQL instances, storage)
  3. Performance: Fast queries with proper indexing
  4. Scalability: Support thousands of tenants on shared infrastructure
  5. Maintainability: Easy schema migrations, backups, monitoring

Options Considered

Option 1: Shared Database + tenant_id Column + Global Query Filter

Approach:

  • All tenants share one PostgreSQL database
  • Every table has a tenant_id column (NOT NULL)
  • EF Core Global Query Filter automatically adds .Where(e => e.TenantId == currentTenantId) to all queries
  • Composite indexes: (tenant_id, other_columns)

Schema Example:

CREATE TABLE projects (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    name VARCHAR(200) NOT NULL,
    key VARCHAR(20) NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    CONSTRAINT uq_projects_tenant_key UNIQUE (tenant_id, key)
);

CREATE INDEX idx_projects_tenant_id ON projects(tenant_id);
CREATE INDEX idx_projects_tenant_key ON projects(tenant_id, key);

EF Core Configuration:

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Project>().HasQueryFilter(
        p => p.TenantId == _tenantContext.CurrentTenantId
    );
}

Pros:

  • Cost-effective: One database for all tenants
  • Easy to maintain: Single schema, one backup process
  • Good performance with proper indexing (composite indexes)
  • Easy to add new tenants (just insert into tenants table)
  • Per-tenant data export is a single SQL query: SELECT * FROM projects WHERE tenant_id = 'xxx'
  • Scales to 10,000+ tenants on one database
  • Automatic filtering via Global Query Filter (developer-friendly)

Cons:

  • Risk of data leak if Global Query Filter is bypassed (.IgnoreQueryFilters())
  • All tenants affected by database downtime
  • Cannot isolate noisy neighbors (one tenant's heavy queries affect others)
  • Database size grows with all tenants (monitoring required)

Cost Estimate: 1 database instance (~$100-200/month for medium workload)

Option 2: Database-per-Tenant

Approach:

  • Each tenant gets a dedicated PostgreSQL database
  • Connection string stored in tenants table
  • Middleware switches database context per request

Schema Example:

-- Shared management database
CREATE TABLE tenants (
    id UUID PRIMARY KEY,
    slug VARCHAR(50) UNIQUE NOT NULL,
    connection_string TEXT NOT NULL -- Encrypted
);

-- Tenant-specific database (one per tenant)
CREATE DATABASE tenant_acme;
CREATE DATABASE tenant_beta;

Pros:

  • Strong isolation: One tenant's database cannot access another
  • Tenant-specific customization (different schema versions)
  • Easy to back up per tenant
  • Noisy neighbors don't affect each other
  • Easy to migrate tenant to different database server

Cons:

  • Expensive: N databases for N tenants (~$10-20/month per tenant minimum)
  • Complex maintenance: Schema migrations across 1000s of databases
  • Connection pool exhaustion (need one pool per tenant)
  • Difficult to implement cross-tenant features (analytics, admin tools)
  • Onboarding delay (new database provisioning takes time)

Cost Estimate: 1000 tenants × $15/month = $15,000/month (vs $200 for shared)

Option 3: Schema-per-Tenant (PostgreSQL Schemas)

Approach:

  • One database with multiple PostgreSQL schemas
  • Each tenant gets a schema: tenant_acme.projects, tenant_beta.projects
  • Middleware switches search_path per request: SET search_path = tenant_acme;

Pros:

  • Better isolation than shared database
  • Lower cost than database-per-tenant
  • All tenants in one PostgreSQL instance (easier backups)
  • Can support ~1000 schemas per database

Cons:

  • PostgreSQL schema limit (~1000 schemas per database)
  • Schema creation overhead for new tenants
  • Complex schema migrations (run migration on each schema)
  • Search_path switching per request (performance overhead)
  • Difficult to enforce (easy to forget to set search_path)

Cost Estimate: Same as shared database, but limited scalability

Option 4: Separate Infrastructure per Tenant (Fully Isolated)

Approach:

  • Each tenant gets dedicated Kubernetes namespace, database, Redis, etc.
  • Complete infrastructure isolation

Pros:

  • Maximum isolation and security
  • Per-tenant scaling and customization
  • Enterprise customers often require this

Cons:

  • Extremely expensive (hundreds of dollars per tenant)
  • Complex to manage (orchestration required)
  • Overkill for most tenants
  • Long onboarding time

Cost Estimate: 1000 tenants × $500/month = $500,000/month (prohibitive)

Decision

Chosen Option: Option 1 - Shared Database + tenant_id Column + Global Query Filter

Rationale:

  1. Cost-Effective: $200/month vs $15,000/month for database-per-tenant
  2. Scalable: PostgreSQL handles 10,000+ tenants with proper indexing
  3. Maintainable: One schema, one backup process, one monitoring dashboard
  4. Developer-Friendly: EF Core Global Query Filter ensures automatic filtering
  5. Performance: Composite indexes provide excellent query performance
  6. Proven Pattern: Used by GitHub, Slack, Heroku, and many successful SaaS products

Implementation Strategy:

  • Add tenant_id column to all business tables
  • Create composite indexes: (tenant_id, primary_key), (tenant_id, foreign_key)
  • Configure EF Core Global Query Filter in OnModelCreating
  • Create TenantContext service to inject current tenant
  • Add database-level constraints: CHECK (tenant_id IS NOT NULL)
  • Update unique constraints to be tenant-scoped: UNIQUE (tenant_id, email)
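
A minimal sketch of the TenantContext service and the write side of the filter, assuming a marker interface ITenantEntity with a TenantId property (names are illustrative):

public interface ITenantContext
{
    Guid CurrentTenantId { get; set; }
}

public class ColaFlowDbContext : DbContext
{
    private readonly ITenantContext _tenantContext;

    public ColaFlowDbContext(DbContextOptions<ColaFlowDbContext> options, ITenantContext tenantContext)
        : base(options) => _tenantContext = tenantContext;

    public override Task<int> SaveChangesAsync(CancellationToken cancellationToken = default)
    {
        // Stamp the current tenant on new entities so inserts cannot cross tenants,
        // complementing the read-side Global Query Filter shown earlier.
        foreach (var entry in ChangeTracker.Entries<ITenantEntity>())
        {
            if (entry.State == EntityState.Added)
            {
                entry.Entity.TenantId = _tenantContext.CurrentTenantId;
            }
        }

        return base.SaveChangesAsync(cancellationToken);
    }
}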

Migration Path:

  • Create tenants table
  • Create default tenant for existing data
  • Add tenant_id columns (nullable initially)
  • Migrate existing data to default tenant
  • Set tenant_id as NOT NULL
  • Add indexes and constraints
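
The middle steps of this migration path could look roughly like the following EF Core migration; table and column names follow the schema example above, and the default tenant id is a placeholder:

public partial class AddTenantIdToProjects : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        // 1. Add tenant_id as nullable so existing rows stay valid.
        migrationBuilder.AddColumn<Guid>(name: "tenant_id", table: "projects", nullable: true);

        // 2. Backfill existing rows with the default tenant.
        migrationBuilder.Sql(
            "UPDATE projects SET tenant_id = '00000000-0000-0000-0000-000000000001' WHERE tenant_id IS NULL;");

        // 3. Tighten to NOT NULL and add the tenant-scoped composite index.
        migrationBuilder.AlterColumn<Guid>(name: "tenant_id", table: "projects", nullable: false);

        migrationBuilder.CreateIndex(
            name: "idx_projects_tenant_key",
            table: "projects",
            columns: new[] { "tenant_id", "key" },
            unique: true);
    }
}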

Consequences

Positive:

  • Low infrastructure cost (1 database vs thousands)
  • Easy to maintain and monitor
  • Fast schema migrations (one database)
  • Automatic tenant filtering (developer safety)
  • Good query performance with indexes
  • Per-tenant data export is straightforward SQL

Negative:

  • Risk of data leak if developer bypasses Global Query Filter
  • All tenants share database resources (monitoring required)
  • Cannot isolate noisy neighbors at database level
  • Database backup contains all tenants (larger backup size)

Neutral:

  • Tenant onboarding is instant (no new database needed)
  • Cross-tenant analytics require explicit filtering
  • Database size monitoring required as tenant count grows

Mitigation Strategies:

  • Data Leak Prevention:
    • Code review requirement for any .IgnoreQueryFilters() usage
    • Integration tests verify cross-tenant isolation
    • Automated security testing (attempt cross-tenant access)
  • Performance Monitoring:
    • Alert on slow queries (>100ms)
    • Index usage monitoring (pg_stat_user_indexes)
    • Per-tenant query cost tracking
  • Noisy Neighbor Protection:
    • Query timeout limits (5 seconds max)
    • Rate limiting per tenant
    • Connection pool limits
    • Option to migrate large tenant to dedicated database later
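
The cross-tenant isolation test called out in the mitigation list could look like this xUnit sketch; CreateAuthenticatedClientAsync and ProjectDto are hypothetical test fixtures, not existing helpers:

[Fact]
public async Task Project_created_in_tenant_A_is_not_visible_to_tenant_B()
{
    // Arrange: create a project while authenticated as tenant A.
    var clientA = await _factory.CreateAuthenticatedClientAsync("acme");
    var response = await clientA.PostAsJsonAsync("/api/projects", new { name = "Secret", key = "SEC" });
    response.EnsureSuccessStatusCode();
    var project = await response.Content.ReadFromJsonAsync<ProjectDto>();

    // Act: query the same data as tenant B.
    var clientB = await _factory.CreateAuthenticatedClientAsync("beta");
    var listB = await clientB.GetFromJsonAsync<List<ProjectDto>>("/api/projects");
    var direct = await clientB.GetAsync($"/api/projects/{project!.Id}");

    // Assert: tenant B sees nothing from tenant A.
    Assert.DoesNotContain(listB!, p => p.Key == "SEC");
    Assert.True(direct.StatusCode is HttpStatusCode.NotFound or HttpStatusCode.Forbidden);
}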

Upgrade Path: If a tenant grows too large or requires dedicated resources, we can migrate them to a separate database while keeping the shared model for other tenants.

Validation

Acceptance Criteria:

  • All queries automatically filter by tenant
  • Cross-tenant access attempts fail with 403 Forbidden
  • Query performance <50ms for typical workloads (with 10,000 records per tenant)
  • Integration tests verify tenant isolation
  • Data export per tenant completes in <1 minute

Testing:

  • Unit tests: Global Query Filter applied to all entities
  • Integration tests: Create data in Tenant A, verify Tenant B cannot access
  • Performance tests: Query time with 1 million total records (100 tenants × 10,000 records)
  • Load tests: 10,000 concurrent requests across 100 tenants

References

  • Architecture Doc: docs/architecture/multi-tenancy-architecture.md
  • Migration Strategy: docs/architecture/migration-strategy.md
  • Performance Benchmarks: docs/architecture/performance-benchmarks.md (TBD)

ADR-003: SSO Library Selection

Status

Accepted - 2025-11-03

Context

Enterprise customers require Single Sign-On (SSO) to integrate ColaFlow with their corporate identity providers (Azure AD, Google Workspace, Okta, etc.). We need to choose an SSO library/approach that balances functionality, cost, implementation speed, and maintainability.

Requirements:

  • Support major identity providers: Azure AD, Google, Okta
  • Support OIDC (OpenID Connect) protocol
  • Support SAML 2.0 for generic enterprise IdPs
  • User auto-provisioning (create user on first SSO login)
  • Email domain restrictions (only allow @acme.com)
  • Configurable per tenant (each tenant has own SSO config)
  • Production-ready security standards

Decision Drivers

  1. Time-to-Market: Implement SSO in <1 week (M1 timeline constraint)
  2. Cost: Minimize licensing fees
  3. Coverage: Support 90% of enterprise SSO requirements
  4. Flexibility: Can upgrade later if complex requirements emerge
  5. Security: Follow OWASP and OIDC/SAML best practices

Options Considered

Option 1: ASP.NET Core Native OIDC/SAML (M1-M2)

Approach:

  • Use built-in Microsoft.AspNetCore.Authentication.OpenIdConnect for OIDC
  • Use Sustainsys.Saml2 library for SAML 2.0
  • Custom implementation for multi-tenant SSO configuration
  • Store SSO config in tenants table (JSONB column)

Pros:

  • Free: No licensing costs
  • Fast: Can implement OIDC in 2-3 days, SAML in 3-4 days
  • Built into .NET 9: mature, well-documented
  • Flexible: Full control over implementation
  • Covers 80-90% of enterprise SSO needs

Cons:

  • Manual implementation: Need to handle user provisioning, domain restrictions
  • Limited advanced features: No federation, no protocol switching
  • SAML is more complex to implement
  • Need to maintain our own SSO configuration UI

Implementation Complexity: Medium
Cost: $0/month
Coverage: OIDC (Azure, Google, Okta) + SAML 2.0 (80% of market)

Code Example:

services.AddAuthentication()
    .AddOpenIdConnect("AzureAD", options =>
    {
        options.Authority = tenant.SsoConfig.AuthorityUrl;
        options.ClientId = tenant.SsoConfig.ClientId;
        options.ClientSecret = tenant.SsoConfig.ClientSecret;
        options.ResponseType = "code";
        options.SaveTokens = true;
        options.Events = new OpenIdConnectEvents
        {
            OnTokenValidated = async context =>
            {
                await AutoProvisionUserAsync(context);
            }
        };
    });
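
The per-tenant SSO configuration referenced as tenant.SsoConfig could be modeled roughly as below and serialized into the tenants JSONB column; property names are illustrative, and the client secret would be stored encrypted:

public class TenantSsoConfig
{
    public string Provider { get; set; } = "OIDC";        // "OIDC" or "SAML"
    public string AuthorityUrl { get; set; } = default!;  // e.g. the Azure AD tenant authority
    public string ClientId { get; set; } = default!;
    public string ClientSecret { get; set; } = default!;  // encrypted at rest
    public List<string> AllowedEmailDomains { get; set; } = new(); // e.g. ["acme.com"]
    public bool AutoProvisionUsers { get; set; } = true;
}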

Option 2: Auth0

Approach:

  • Use Auth0 as SSO broker
  • Auth0 handles all identity providers
  • Configure Auth0 via their dashboard
  • Pay per monthly active user (MAU)

Pros:

  • Fast setup: Implement in 1-2 days
  • Comprehensive: Supports all identity providers out-of-the-box
  • User management: Built-in user directory
  • Advanced features: MFA, passwordless, anomaly detection
  • Dashboard for SSO configuration

Cons:

  • Expensive: $240/month (Professional) + $0.05/MAU (500 users = $25/month extra)
  • Vendor lock-in: Difficult to migrate away
  • Less control: Auth0 controls auth flow
  • Overkill for MVP: Many features we don't need yet

Implementation Complexity: Low
Cost: $3,000-5,000/year (for 100 tenants with 5,000 total users)
Coverage: 100% (all protocols, all providers)

Option 3: Okta (Workforce Identity Cloud)

Approach:

  • Use Okta as SSO broker
  • Similar to Auth0 but more enterprise-focused
  • Per-user pricing

Pros:

  • Enterprise-grade: Trusted by Fortune 500
  • Complete features: SSO, MFA, provisioning, directory
  • Excellent support and documentation

Cons:

  • Very expensive: $2/user/month minimum (100 users = $200/month)
  • Enterprise sales process (slow, complex)
  • Overkill for startup/SMB customers
  • Vendor lock-in

Implementation Complexity: Low
Cost: $5,000-10,000/year (for 100 tenants)
Coverage: 100%

Option 4: IdentityServer4 / Duende IdentityServer

Approach:

  • Use IdentityServer as self-hosted identity provider
  • Implement Federation support (connect to external IdPs)
  • Open-source (IdentityServer4) or licensed (Duende)

Pros:

  • Self-hosted: Full control
  • Comprehensive: OIDC, OAuth 2.0, SAML via plugins
  • Flexible: Can customize extensively
  • No per-user fees

Cons:

  • Complex: Steep learning curve (2-3 weeks to implement)
  • Maintenance burden: Need to maintain IdentityServer instance
  • Duende licensing: $1,500/year for production use
  • Overkill for MVP: We don't need an identity provider, just SSO

Implementation Complexity: High
Cost: $1,500/year (Duende license)
Coverage: 100%

Decision

Chosen Option: Option 1 - ASP.NET Core Native OIDC/SAML (M1-M2)

Rationale:

  1. Cost: $0/month vs $3,000-5,000/year for Auth0/Okta
  2. Speed: Can implement in <1 week (M1 timeline)
  3. Control: Full flexibility to customize
  4. Coverage: Supports 80% of enterprise SSO requirements (OIDC + SAML)
  5. Upgrade Path: Can migrate to Auth0/Okta later if complex requirements emerge

Decision: Start with native ASP.NET Core for M1-M2. Re-evaluate at M3 if we need:

  • Complex federation (multiple IdPs per tenant)
  • Advanced MFA flows
  • More than 5 different SSO protocols
  • Dedicated identity management features

Implementation Strategy:

  • M1 (Week 1): OIDC implementation (Azure AD, Google, Okta)
  • M2 (Week 2): SAML 2.0 implementation (generic enterprise IdPs)
  • M2 (Week 3): User auto-provisioning and domain restrictions
  • M2 (Week 4): SSO configuration UI for tenants
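
The auto-provisioning hook wired into OnTokenValidated in the Option 1 example could look like this sketch; the _tenant field, _userRepository, and User.CreateFromSso factory are assumptions about surrounding code, not the final API:

private async Task AutoProvisionUserAsync(TokenValidatedContext context)
{
    var email = context.Principal?.FindFirst(ClaimTypes.Email)?.Value
                ?? context.Principal?.FindFirst("email")?.Value;
    if (email is null)
    {
        context.Fail("SSO token did not contain an email claim.");
        return;
    }

    // Enforce the tenant's email domain restriction (e.g. only @acme.com).
    var domain = email.Split('@').Last();
    if (!_tenant.SsoConfig.AllowedEmailDomains.Contains(domain, StringComparer.OrdinalIgnoreCase))
    {
        context.Fail($"Email domain '{domain}' is not allowed for this tenant.");
        return;
    }

    // Create the user on first SSO login if it does not exist yet.
    var user = await _userRepository.FindByEmailAsync(_tenant.Id, email);
    if (user is null)
    {
        user = User.CreateFromSso(_tenant.Id, email, authProvider: context.Scheme.Name);
        await _userRepository.AddAsync(user);
    }
}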

Consequences

Positive:

  • Zero licensing costs for M1-M2
  • Complete control over implementation
  • Can customize for our specific needs
  • Fast implementation (< 1 week)
  • Covers 80% of enterprise SSO requirements
  • Learning opportunity for team

Negative:

  • Manual implementation required (more code to maintain)
  • Limited to OIDC + SAML 2.0 (no exotic protocols)
  • Need to build SSO configuration UI ourselves
  • More testing required (vs using Auth0)

Neutral:

  • Can migrate to Auth0/Okta later if needed
  • SSO config stored in database (our control)
  • Integration tests required for each IdP

Mitigation Strategies:

  • Quality: Comprehensive testing with real IdPs (Azure AD, Google)
  • Documentation: Detailed guides for each supported provider
  • Security: Follow OIDC/SAML security best practices
  • Upgrade Path: Design SSO config to be provider-agnostic (easy migration)

Validation

Acceptance Criteria:

  • OIDC login works with Azure AD, Google, Okta
  • SAML 2.0 login works with generic IdP
  • Users auto-provisioned on first login
  • Email domain restrictions enforced
  • SSO configuration UI functional for admins
  • Error handling for common SSO failures

Testing:

  • Unit tests: OIDC token validation, SAML assertion parsing
  • Integration tests: Full SSO flow with real IdPs (test tenants)
  • Security tests: CSRF protection, replay attack prevention
  • Usability tests: Admin can configure SSO without support

References

  • Architecture Doc: docs/architecture/sso-integration-architecture.md
  • Implementation Guide: docs/implementation/sso-implementation.md (TBD)
  • Security Checklist: docs/security/sso-security-checklist.md (TBD)

ADR-004: MCP Token Format

Status

Accepted - 2025-11-03

Context

ColaFlow will expose an MCP (Model Context Protocol) server that allows AI agents (Claude, ChatGPT) to access project data, create tasks, and generate reports. We need a secure, revocable authentication mechanism for AI agents.

Requirements:

  • Secure: Cannot be forged or tampered with
  • Revocable: Admin can revoke token instantly
  • Fine-Grained Permissions: Control read/write access per resource
  • Audit Trail: Log all API operations performed with token
  • Tenant-Scoped: Token only works for one tenant
  • Long-Lived: Valid for days/weeks (not short-lived like JWT)

Decision Drivers

  1. Security: Token cannot be guessed or brute-forced
  2. Revocability: Instant revocation (no JWT blacklist complexity)
  3. Permissions: Resource-level + operation-level granularity
  4. Auditability: Complete log of all token operations
  5. Usability: Easy to copy/paste, recognizable format

Options Considered

Option 1: Opaque Tokens (mcp_<tenant_slug>_<random_32>)

Format: mcp_acme_7f3d8a9c4e1b2f5a6d8c9e0f1a2b3c4d

Approach:

  • Token is a random string (cryptographically secure)
  • Prefix: mcp_ (identifies as MCP token)
  • Tenant slug: acme (for easy identification)
  • Random part: 32 hex characters (128 bits of entropy)
  • Store token hash (SHA256) in database
  • Store permissions in database alongside token

Token Storage:

CREATE TABLE mcp_tokens (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL,
    user_id UUID NULL,
    name VARCHAR(100) NOT NULL,
    token_hash VARCHAR(255) NOT NULL UNIQUE, -- SHA256 of token
    permissions JSONB NOT NULL, -- {"projects": ["read", "search"], ...}
    status INT NOT NULL, -- Active/Revoked/Expired
    created_at TIMESTAMP NOT NULL,
    expires_at TIMESTAMP NULL,
    last_used_at TIMESTAMP NULL
);

Validation Flow:

  1. Receive token: mcp_acme_xxx...
  2. Hash token with SHA256
  3. Lookup in database by token_hash
  4. Check status (Active/Revoked/Expired)
  5. Check expiration date
  6. Load permissions from JSONB column
  7. Inject tenant context and permissions into request
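
A condensed sketch of that flow as authentication middleware, assuming the token is presented as a Bearer credential and that IMcpTokenRepository and TenantContext are the surrounding services (names are illustrative):

public class McpTokenAuthenticationMiddleware
{
    private readonly RequestDelegate _next;

    public McpTokenAuthenticationMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context, IMcpTokenRepository tokens, TenantContext tenantContext)
    {
        var header = context.Request.Headers.Authorization.ToString();
        if (!header.StartsWith("Bearer mcp_"))
        {
            await _next(context); // not an MCP token; let other auth handlers run
            return;
        }

        // Steps 1-2: extract the token and hash it (same SHA256 as at creation time).
        var presented = header["Bearer ".Length..];
        var hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(presented))).ToLowerInvariant();

        // Steps 3-5: lookup by hash, then check status and expiration.
        var token = await tokens.FindByHashAsync(hash);
        if (token is null || token.Status != McpTokenStatus.Active ||
            (token.ExpiresAt is not null && token.ExpiresAt < DateTime.UtcNow))
        {
            context.Response.StatusCode = StatusCodes.Status401Unauthorized;
            return;
        }

        // Steps 6-7: inject tenant context and permissions for downstream handlers.
        tenantContext.CurrentTenantId = token.TenantId;
        context.Items["McpPermissions"] = token.Permissions;

        await _next(context);
    }
}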

Pros:

  • Revocable: Update status = Revoked in database, takes effect immediately
  • Secure: SHA256 hashed, never stored plain-text
  • Flexible Permissions: Can update permissions without regenerating token
  • Auditable: Every token use logged in database
  • Tenant-Scoped: tenant_id stored alongside the token hash, and the tenant slug is embedded in the token itself
  • Long-Lived: Can be valid for months/years
  • Easy to Identify: Prefix + tenant slug clearly identify token type

Cons:

  • Database lookup required on every request (performance overhead)
  • Larger tokens (50+ characters) vs API keys (32 characters)
  • Need to manage token lifecycle (expiration, revocation)

Performance: ~5ms per token validation (including database lookup)

Option 2: JWT Tokens for MCP

Format: Long JWT string (200+ characters)

Approach:

  • Generate JWT with tenant_id, user_id, permissions claims
  • Sign with secret key
  • No database lookup required (stateless)
  • Validate signature on every request

Pros:

  • Stateless: No database lookup required
  • Fast validation: O(1) signature check
  • Self-contained: All info in token

Cons:

  • Cannot Revoke: Once issued, JWT is valid until expiration (unless using blacklist)
  • Blacklist Required: Need Redis/database to store revoked JWTs (adds complexity)
  • Permissions Fixed: Cannot update permissions without regenerating token
  • Larger Tokens: 200-500 characters (difficult to copy/paste)
  • Expiration Required: Must set short expiration for revocation to work

Revocation Problem:

User generates JWT token → Shares with AI agent → Admin wants to revoke
→ JWT is still valid for 30 days → Need to blacklist JWT ID
→ Now need Redis to store blacklist → Not truly stateless anymore

Option 3: API Keys (UUID Format)

Format: 550e8400-e29b-41d4-a716-446655440000

Approach:

  • Generate random UUID
  • Store in database with permissions
  • Simple validation: lookup by UUID

Pros:

  • Simple implementation
  • Standard format (UUID)
  • Revocable via database lookup (same mechanism as Option 1)

Cons:

  • No tenant context in token (need to lookup tenant)
  • No token type identifier (could be confused with user IDs)
  • No visual indication of purpose
  • Lower entropy (a UUIDv4 carries only 122 random bits, less than Option 1's 128-bit random part)

Option 4: GitHub-Style Personal Access Tokens

Format: ghp_ABcdEF123456789012345678901234567890

Approach:

  • Prefix identifies token type
  • Random alphanumeric string
  • Store hash in database

Pros:

  • Industry standard (used by GitHub, GitLab)
  • Easy to identify by prefix
  • Secure

Cons:

  • No tenant context in token itself
  • Otherwise essentially Option 1 without the tenant slug, so it adds nothing for our multi-tenant use case

Decision

Chosen Option: Option 1 - Opaque Tokens (mcp_<tenant_slug>_<random_32>)

Rationale:

  1. Revocability: Instant revocation without blacklist complexity
  2. Flexibility: Permissions stored server-side, can update without new token
  3. Security: 128 bits of entropy + SHA256 hashing
  4. Usability: Tenant slug in token helps users identify which tenant it's for
  5. Auditability: Complete audit trail in database

Token Format:

mcp_<tenant_slug>_<random_32_hex_chars>

Example:

mcp_acme_7f3d8a9c4e1b2f5a6d8c9e0f1a2b3c4d
mcp_techcorp_a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6

Components:

  • mcp_: Identifies as MCP token (easy to filter in logs)
  • acme: Tenant slug (helps user identify which tenant)
  • 7f3d8a9c...: 32 hex characters (128 bits entropy = 2^128 combinations)

Generation:

public string GenerateToken(string tenantSlug)
{
    var randomBytes = new byte[16]; // 128 bits
    using var rng = RandomNumberGenerator.Create();
    rng.GetBytes(randomBytes);
    var randomHex = Convert.ToHexString(randomBytes).ToLowerInvariant();
    return $"mcp_{tenantSlug}_{randomHex}";
}
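
The HashToken helper used in the storage example below could be this simple; SHA256 alone is adequate because the input already carries 128 bits of entropy, so no salt or slow KDF is needed:

public string HashToken(string token)
{
    var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(token));
    return Convert.ToHexString(bytes).ToLowerInvariant(); // matches the token_hash column
}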

Storage:

public async Task<string> CreateTokenAsync(CreateMcpTokenCommand command)
{
    // tenant comes from the current TenantContext (resolved by middleware)
    var token = _tokenGenerator.GenerateToken(tenant.Slug);
    var tokenHash = _tokenGenerator.HashToken(token); // SHA256

    var mcpToken = new McpToken
    {
        TenantId = tenant.Id,
        TokenHash = tokenHash, // Never store plain-text
        Permissions = command.Permissions,
        ExpiresAt = command.ExpiresAt
    };

    await _repository.AddAsync(mcpToken);
    return token; // Return the plain-text token ONLY ONCE; only the hash is persisted
}

Consequences

Positive:

  • Instant revocation (update database status)
  • Fine-grained permissions (stored server-side)
  • Complete audit trail
  • Tenant-scoped (slug in token)
  • Secure (128-bit entropy + SHA256)
  • User-friendly (tenant slug helps identification)

Negative:

  • Database lookup required per request (~5ms overhead)
  • Longer tokens (50 characters vs 32 for API keys)
  • Need to manage token lifecycle (expiration, cleanup)

Neutral:

  • Performance overhead acceptable for MCP use case (not high-frequency)
  • Token length acceptable for copy/paste workflow

Mitigation Strategies:

  • Performance: Cache token validation results (5-minute TTL)
  • Token Length: Provide copy button and download option in UI
  • Lifecycle Management: Automated cleanup job for expired tokens
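
The automated cleanup job could be a small hosted service; a sketch assuming a daily PeriodicTimer and a hypothetical MarkExpiredAsync repository method:

public class ExpiredMcpTokenCleanupService : BackgroundService
{
    private readonly IServiceScopeFactory _scopeFactory;

    public ExpiredMcpTokenCleanupService(IServiceScopeFactory scopeFactory) => _scopeFactory = scopeFactory;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(TimeSpan.FromHours(24));
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            using var scope = _scopeFactory.CreateScope();
            var repository = scope.ServiceProvider.GetRequiredService<IMcpTokenRepository>();

            // Flip tokens past their expiration date to Expired; rows are kept for the audit trail.
            await repository.MarkExpiredAsync(DateTime.UtcNow, stoppingToken);
        }
    }
}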

Validation

Acceptance Criteria:

  • Token generation is cryptographically secure (CSPRNG)
  • Token hash stored (SHA256), never plain-text
  • Token validation <10ms (including database lookup)
  • Revocation takes effect immediately
  • Permissions enforced on every API call
  • Audit log created for every token use

Testing:

  • Unit tests: Token generation format, hashing, validation
  • Integration tests: Token authentication flow, permission enforcement
  • Security tests: Brute-force resistance, revocation effectiveness
  • Performance tests: 1,000 req/s with token validation

References

  • Architecture Doc: docs/architecture/mcp-authentication-architecture.md
  • Token Management UI: docs/design/multi-tenant-ux-flows.md#mcp-token-management-flow

ADR-005: Frontend State Management

Status

Accepted - 2025-11-03

Context

ColaFlow frontend (Next.js 16 + React 19) needs a state management solution for authentication, user preferences, and server data. We need to choose libraries that are TypeScript-first, performant, and maintainable.

Requirements:

  • Type-safe: Full TypeScript support
  • Performant: Minimal re-renders
  • Developer-friendly: Low boilerplate
  • Server state caching: Avoid redundant API calls
  • Optimistic updates: Immediate UI feedback
  • Auth state persistence: Survive page refresh

Decision Drivers

  1. TypeScript Support: First-class TypeScript integration
  2. Performance: Minimal bundle size, fast renders
  3. DX (Developer Experience): Easy to learn, low boilerplate
  4. Ecosystem: Good documentation, active community
  5. Server State: Built-in caching and invalidation

Options Considered

Option 1: Zustand (Client State) + TanStack Query v5 (Server State)

Approach:

  • Zustand: Lightweight state manager for auth, UI state
  • TanStack Query: Server state caching, mutations, automatic refetching

Zustand Example:

// stores/useAuthStore.ts
import { create } from 'zustand';
import { persist } from 'zustand/middleware';

interface AuthState {
  user: User | null;
  tenant: Tenant | null;
  accessToken: string | null;
  login: (token: string, user: User, tenant: Tenant) => void;
  logout: () => void;
}

export const useAuthStore = create<AuthState>()(
  persist(
    (set) => ({
      user: null,
      tenant: null,
      accessToken: null,
      login: (token, user, tenant) => set({ accessToken: token, user, tenant }),
      logout: () => set({ accessToken: null, user: null, tenant: null }),
    }),
    { name: 'auth-storage' }
  )
);

TanStack Query Example:

// hooks/useMcpTokens.ts
import { useQuery } from '@tanstack/react-query';
import { mcpService } from '@/services/mcp.service';

export function useMcpTokens() {
  return useQuery({
    queryKey: ['mcp-tokens'],
    queryFn: () => mcpService.listTokens(),
    staleTime: 1000 * 60 * 5, // 5 minutes
  });
}

Pros:

  • Minimal Bundle Size: Zustand (3KB) + TanStack Query (15KB) = 18KB total
  • TypeScript-First: Excellent type inference
  • Low Boilerplate: No actions, reducers, or complex setup
  • Performance: Zustand avoids unnecessary re-renders
  • Caching: TanStack Query caches API responses automatically
  • DevTools: Excellent debugging tools for both libraries
  • Separation of Concerns: Client state in Zustand, server state in TanStack Query

Cons:

  • Two libraries to learn (vs one all-in-one solution)
  • Need to decide what goes in Zustand vs TanStack Query

Learning Curve: Low (Zustand is simpler than Redux, TanStack Query has great docs)

Option 2: Redux Toolkit + RTK Query

Approach:

  • Redux Toolkit for all state
  • RTK Query for API data fetching

Pros:

  • All-in-one solution
  • Mature ecosystem
  • Excellent DevTools

Cons:

  • More Boilerplate: Actions, slices, reducers
  • Larger Bundle: Redux (10KB) + RTK Query (20KB) = 30KB
  • Steeper Learning Curve: More concepts to learn
  • Overkill for MVP: We don't need Redux's complexity yet

Option 3: React Context + SWR

Approach:

  • React Context for auth state
  • SWR for server data

Pros:

  • Minimal dependencies (SWR only)
  • Simple concept (React Context is built-in)

Cons:

  • Performance Issues: React Context causes re-renders on every update
  • Boilerplate: Need to create context providers manually
  • SWR vs TanStack Query: SWR is less feature-rich

Option 4: Jotai + TanStack Query

Approach:

  • Jotai for atomic state management
  • TanStack Query for server state

Pros:

  • Atomic state model (like Recoil)
  • Good TypeScript support

Cons:

  • Less mature than Zustand
  • Smaller community
  • Atomic model can be overkill for simple auth state

Decision

Chosen Option: Option 1 - Zustand (Client State) + TanStack Query v5 (Server State)

Rationale:

  1. Bundle Size: 18KB total (vs 30KB for Redux Toolkit)
  2. Performance: Zustand selector-based re-renders, TanStack Query caching
  3. TypeScript: First-class support in both libraries
  4. Learning Curve: Simple APIs, great documentation
  5. Clear Separation: Auth/UI in Zustand, API data in TanStack Query

Usage Guidelines:

Zustand - Use For:

  • Authentication state (user, tenant, accessToken)
  • UI state (sidebar open/closed, theme)
  • User preferences (language, timezone)

TanStack Query - Use For:

  • API data (projects, issues, tokens)
  • Mutations (create, update, delete)
  • Caching and invalidation

Example Architecture:

// Zustand (auth)
const { user, tenant, logout } = useAuthStore();

// TanStack Query (server data)
const { data: projects, isLoading } = useQuery({
  queryKey: ['projects'],
  queryFn: () => projectService.getAll()
});

// Mutation (queryClient obtained via useQueryClient())
const queryClient = useQueryClient();
const createProject = useMutation({
  mutationFn: (data) => projectService.create(data),
  onSuccess: () => {
    queryClient.invalidateQueries({ queryKey: ['projects'] });
  }
});

Consequences

Positive:

  • Lightweight and fast
  • Easy to learn and use
  • Great TypeScript experience
  • Excellent caching and performance
  • Clear separation of concerns

Negative:

  • Two libraries to learn (instead of one)
  • Need to decide where state lives (Zustand vs TanStack Query)

Neutral:

  • Both libraries have excellent DevTools
  • Both are actively maintained

Mitigation Strategies:

  • Documentation: Create team guide for "What goes where"
  • Code Reviews: Ensure consistent usage patterns
  • Linting: Custom ESLint rules if needed

Validation

Acceptance Criteria:

  • Auth state persists across page refresh
  • API data cached appropriately (no redundant calls)
  • Optimistic updates work (immediate UI feedback)
  • TypeScript errors caught at compile time
  • DevTools show state clearly

Performance Targets:

  • Initial page load: <1.5s
  • State updates: <16ms (60fps)
  • Cache hit rate: >80%

References


ADR-006: Token Storage Strategy

Status

Accepted - 2025-11-03

Context

We need to securely store JWT access tokens and refresh tokens in the frontend. The storage mechanism must balance security, usability, and functionality.

Requirements:

  • Secure: Protect against XSS and CSRF attacks
  • Persistent: Survive page refresh
  • Auto-refresh: Seamlessly refresh tokens before expiration
  • Logout: Clear tokens on logout
  • Cross-tab sync: Logout in one tab logs out all tabs

Decision Drivers

  1. Security: XSS protection (primary threat)
  2. CSRF Protection: For refresh tokens
  3. Usability: Seamless token refresh
  4. Persistence: User stays logged in across sessions
  5. Performance: Fast token access

Options Considered

Option 1: Access Token in Memory + Refresh Token in httpOnly Cookie

Approach:

  • Access Token: Stored in Zustand state (memory only, not persisted)
  • Refresh Token: Stored in httpOnly cookie (server-side managed)
  • Flow:
    1. User logs in → Receive access + refresh tokens
    2. Access token stored in Zustand (memory)
    3. Refresh token stored in httpOnly cookie by backend
    4. Access token used for API calls (Authorization header)
    5. On 401 error → Call /api/auth/refresh (refresh token sent automatically via cookie)
    6. Receive new access token → Update Zustand state

Cookie Configuration (Backend):

Response.Cookies.Append("refreshToken", refreshToken, new CookieOptions
{
    HttpOnly = true, // Cannot be accessed by JavaScript
    Secure = true,   // HTTPS only
    SameSite = SameSiteMode.Strict, // CSRF protection
    MaxAge = TimeSpan.FromDays(7)
});

Pros:

  • XSS Protection (Access Token): Cannot be stolen via XSS (not in localStorage/cookies)
  • CSRF Protection (Refresh Token): httpOnly + SameSite=Strict
  • Short-Lived Access Token: Even if leaked, expires in 60 minutes
  • Automatic Refresh: Cookie sent automatically on refresh endpoint
  • No Manual Cookie Management: Backend sets/clears cookies

Cons:

  • Access token lost on page refresh (need to call refresh immediately)
  • Requires cookie support (some corporate proxies block cookies)

Security Score: 9/10 (Best practice)

Option 2: Both Tokens in localStorage

Approach:

  • Store both access and refresh tokens in localStorage
  • Read on page load

Pros:

  • Simple implementation
  • Tokens persist across page refresh
  • No cookie management

Cons:

  • Vulnerable to XSS: If attacker injects script, can steal both tokens
  • No CSRF Protection: Tokens accessible to any script
  • Not Recommended: Violates OWASP security guidelines

Security Score: 3/10 (Not secure)

Option 3: Both Tokens in httpOnly Cookies

Approach:

  • Store both tokens in httpOnly cookies
  • Backend sends cookies on every API response

Pros:

  • XSS protection for both tokens
  • Automatic token management

Cons:

  • CSRF Vulnerability: Cookies sent automatically with every request
  • Need CSRF Tokens: Additional complexity
  • Cookie Size Limit: JWTs can be large (4KB cookie limit)
  • Double-Submit Cookie Pattern Required: More complexity

Security Score: 6/10 (CSRF risk)

Option 4: Session-Based Authentication (No JWT)

Approach:

  • Traditional session cookies
  • Session stored server-side (Redis)

Pros:

  • Simple
  • Secure (session ID only)

Cons:

  • Not stateless (requires Redis/database for sessions)
  • Horizontal scaling complexity
  • Not suitable for mobile apps
  • Against our JWT strategy

Security Score: 7/10 (Secure but not stateless)

Decision

Chosen Option: Option 1 - Access Token in Memory + Refresh Token in httpOnly Cookie

Rationale:

  1. Best Security: Access token protected from XSS, refresh token protected from CSRF
  2. Industry Standard: Used by Auth0, Okta, and major SaaS apps
  3. Balances Security and UX: Short-lived access token, auto-refresh
  4. Stateless: No session storage required
  5. Mobile-Friendly: Can adapt for mobile (store refresh token securely)

Implementation:

// stores/useAuthStore.ts
// Note: no persist middleware here; the access token lives in memory ONLY
export const useAuthStore = create<AuthState>((set) => ({
  user: null,
  accessToken: null,
  login: (token, user) => set({ accessToken: token, user }),
  updateToken: (token) => set({ accessToken: token }),
  logout: () => set({ accessToken: null, user: null })
}));

// lib/api-client.ts
apiClient.interceptors.response.use(
  (response) => response,
  async (error) => {
    if (error.response?.status === 401 && !error.config._retry) {
      error.config._retry = true;

      // Call refresh endpoint (refresh token sent via cookie automatically)
      const { data } = await axios.post('/api/auth/refresh');

      // Update access token in memory
      useAuthStore.getState().updateToken(data.accessToken);

      // Retry original request
      error.config.headers.Authorization = `Bearer ${data.accessToken}`;
      return apiClient(error.config);
    }

    return Promise.reject(error);
  }
);

Token Refresh Strategy:

  • Automatic: Intercept 401 errors, call refresh endpoint
  • Preemptive (Optional): Refresh 5 minutes before expiration
  • One-at-a-Time: Only one refresh call in flight (queue other requests)
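
On the backend, the refresh endpoint reads the httpOnly cookie and rotates the refresh token; a hedged sketch in which _tokenService and RotateRefreshTokenAsync are placeholders for the real token service:

[HttpPost("api/auth/refresh")]
public async Task<IActionResult> Refresh()
{
    // The refresh token arrives automatically via the httpOnly cookie.
    if (!Request.Cookies.TryGetValue("refreshToken", out var refreshToken))
        return Unauthorized();

    var result = await _tokenService.RotateRefreshTokenAsync(refreshToken);
    if (result is null)
        return Unauthorized();

    // Rotation: set a new refresh cookie alongside the new access token.
    Response.Cookies.Append("refreshToken", result.NewRefreshToken, new CookieOptions
    {
        HttpOnly = true,
        Secure = true,
        SameSite = SameSiteMode.Strict,
        MaxAge = TimeSpan.FromDays(7)
    });

    return Ok(new { accessToken = result.AccessToken });
}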

Consequences

Positive:

  • Maximum security (XSS + CSRF protected)
  • Seamless user experience (auto-refresh)
  • Stateless authentication
  • Mobile-friendly (adapt for secure storage)
  • Industry best practice

Negative:

  • Access token lost on page refresh (need immediate refresh call)
  • Requires cookie support (fails in some corporate environments)
  • More complex implementation than localStorage

Neutral:

  • Short-lived access token means more refresh calls (acceptable trade-off)

Mitigation Strategies:

  • Page Load: Call refresh endpoint on app load if no access token in memory
  • Cookie Fallback: If cookies blocked, fall back to re-login
  • Error Handling: Clear UX if authentication fails (session expired)

Validation

Acceptance Criteria:

  • Access token not visible in localStorage/sessionStorage/cookies (developer tools)
  • Refresh token in httpOnly cookie with SameSite=Strict
  • 401 errors trigger automatic token refresh
  • Logout clears all tokens (memory + cookies)
  • Cross-tab logout works (listen to storage events)

Security Tests:

  • XSS attack simulation (cannot steal access token)
  • CSRF attack simulation (refresh endpoint protected)
  • Token expiration handled gracefully
  • Logout clears all authentication state

References


Summary of Decisions

Decision | Chosen Solution | Rationale
ADR-001: Tenant Identification | JWT Claims + Subdomain | Stateless, cross-platform, performant
ADR-002: Data Isolation | Shared DB + tenant_id + Global Query Filter | Cost-effective, scalable, maintainable
ADR-003: SSO Library | ASP.NET Core Native (OIDC + SAML) | Free, fast, covers 80% of needs
ADR-004: MCP Token Format | Opaque Tokens (mcp_<slug>_<random>) | Revocable, flexible, secure, auditable
ADR-005: Frontend State | Zustand + TanStack Query | Lightweight, TypeScript-first, performant
ADR-006: Token Storage | Access in Memory + Refresh in httpOnly Cookie | XSS + CSRF protected, industry standard

Impact Assessment

Security Impact

  • Overall Security Posture: Excellent (9/10)
  • XSS Protection: Enforced (tokens in memory + httpOnly cookies)
  • CSRF Protection: Enforced (SameSite=Strict cookies)
  • Data Isolation: Enforced (Global Query Filter + composite indexes)
  • Audit Trail: Complete (MCP tokens logged, SSO events tracked)

Performance Impact

  • API Latency: +5ms (JWT validation + tenant filtering)
  • Database Load: Minimal (composite indexes, Global Query Filter)
  • Frontend Bundle Size: +18KB (Zustand + TanStack Query)
  • Token Refresh: Transparent to user (<100ms)

Cost Impact

  • Infrastructure: $200/month (1 database vs $15,000 for DB-per-tenant)
  • Licensing: $0/month (native .NET libraries vs $3,000-5,000 for Auth0)
  • Maintenance: Low (one schema, automated migrations)
  • Total Savings: ~$18,000/year compared to Auth0 + DB-per-tenant

Development Impact

  • Implementation Time: 10 days (vs 6 weeks for IdentityServer + DB-per-tenant)
  • Learning Curve: Low (native libraries, clear architecture)
  • Maintenance Burden: Low (well-documented, industry patterns)
  • Testing Complexity: Medium (need tenant isolation tests)

Risks and Mitigation

Risk | Mitigation
Data leak via Global Query Filter bypass | Code review for .IgnoreQueryFilters(), integration tests
SSO misconfiguration | Test connection UI, detailed error messages, documentation
MCP token brute-force | 128-bit entropy, rate limiting, IP whitelisting
Performance degradation | Composite indexes, query monitoring, slow query alerts
Frontend XSS attack | CSP headers, input sanitization, React auto-escaping

Future Enhancements

Decisions are not permanent. We will revisit these at milestone reviews:

Milestone | Potential Changes
M3 | Re-evaluate SSO (Auth0 if complex federation needed)
M4 | Re-evaluate data isolation (DB-per-tenant for enterprise customers)
M5 | Re-evaluate frontend state (Redux if complex state emerges)
M6 | Re-evaluate MCP tokens (consider JWT if performance critical)

Document Status: Approved
Next Review: M3 Architecture Review (2025-12-15)
Approval Signatures:

  • Architecture Team: [Approved]
  • Product Manager: [Approved]
  • Security Team: [Pending Review]
  • Engineering Lead: [Approved]

End of Architecture Decision Record