ColaFlow/reports/2025-11-03-Architecture-Decision-Record.md

# Architecture Decision Record - ColaFlow Enterprise Multi-Tenancy

**Document Type:** ADR (Architecture Decision Record)
**Date:** 2025-11-03
**Status:** Accepted
**Decision Makers:** Architecture Team, Product Manager, Technical Leads
**Project:** ColaFlow - M1 Sprint 2 (Enterprise Multi-Tenant Upgrade)

---

## Document Purpose

This Architecture Decision Record (ADR) documents the key architectural decisions made for ColaFlow's transition from a single-tenant to an enterprise-ready multi-tenant SaaS platform. It follows the ADR format to capture context, options considered, chosen solutions, and consequences.

---

## Table of Contents

1. [ADR-001: Tenant Identification Strategy](#adr-001-tenant-identification-strategy)
2. [ADR-002: Data Isolation Strategy](#adr-002-data-isolation-strategy)
3. [ADR-003: SSO Library Selection](#adr-003-sso-library-selection)
4. [ADR-004: MCP Token Format](#adr-004-mcp-token-format)
5. [ADR-005: Frontend State Management](#adr-005-frontend-state-management)
6. [ADR-006: Token Storage Strategy](#adr-006-token-storage-strategy)
7. [Summary of Decisions](#summary-of-decisions)

---

## ADR-001: Tenant Identification Strategy

### Status
**Accepted** - 2025-11-03

### Context

ColaFlow is transitioning to a multi-tenant architecture where multiple companies (tenants) will share the same application instance. We need a reliable, performant, and secure method to identify which tenant a user or API request belongs to.

**Requirements:**
- Must work across web, mobile, and API clients
- Must be stateless (no session storage required)
- Must be secure (prevent tenant spoofing)
- Must be performant (no database lookup per request)
- Must support both human users and AI agents (MCP tokens)
- Must work with subdomain-based URLs (e.g., `acme.colaflow.com`)

### Decision Drivers

1. **Performance:** System must handle 10,000+ requests/second without database lookups
2. **Security:** Tenant ID cannot be tampered with by malicious users
3. **Scalability:** Solution must work for mobile apps, APIs, and web simultaneously
4. **Developer Experience:** Easy to implement and maintain across all layers
5. **User Experience:** Friendly tenant selection (via subdomain)

### Options Considered

#### Option 1: JWT Claims (Primary) + Subdomain (Secondary)

**Approach:**
- Store `tenant_id` and `tenant_slug` in JWT access token claims
- Resolve tenant from subdomain on login/registration
- Inject tenant context from JWT claims into all API requests
- No database lookup required after authentication
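
A minimal sketch of the subdomain step, written in TypeScript to match the frontend examples in this document (the production middleware would live in the .NET backend). `resolveTenantSlug`, the `baseDomain` parameter, and the reserved-label list are hypothetical names introduced for illustration; they are not taken from the ColaFlow codebase.

```typescript
// Labels that should never be treated as tenant slugs (assumed list).
const RESERVED_LABELS = new Set(["www", "api", "app"]);

// Extracts the tenant slug from a Host header value such as "acme.colaflow.com".
// Returns null when the host is not a single-label subdomain of `baseDomain`.
function resolveTenantSlug(host: string, baseDomain: string): string | null {
  // Drop an optional port and normalise case.
  const hostname = (host.split(":")[0] ?? "").toLowerCase();
  const suffix = "." + baseDomain.toLowerCase();
  if (!hostname.endsWith(suffix)) return null; // not one of our subdomains

  const slug = hostname.slice(0, -suffix.length);
  // Accept exactly one DNS label: letters, digits, and inner hyphens.
  if (!/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/.test(slug)) return null;
  return RESERVED_LABELS.has(slug) ? null : slug;
}
```

The slug returned here would only select the tenant on the login page; after authentication the tenant comes from the signed JWT claims, as described above.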

**Pros:**
- Stateless: No session storage or database lookup per request
- Secure: JWT signature prevents tampering
- Cross-platform: Works for web, mobile, API, MCP tokens
- Fast: O(1) lookup from JWT claims
- Tenant context available in middleware layer

**Cons:**
- JWT cannot be updated until refresh (stale tenant info for up to 60 minutes)
- Requires careful token expiration management
- Subdomain only used for initial tenant resolution (login page)

**Example JWT Payload:**
```json
{
  "sub": "user-id-123",
  "email": "john@acme.com",
  "tenant_id": "tenant-uuid-456",
  "tenant_slug": "acme",
  "tenant_plan": "Enterprise",
  "auth_provider": "AzureAD",
  "role": "User",
  "exp": 1730678400,
  "iat": 1730674800
}
```

#### Option 2: Session-Based Tenant Storage

**Approach:**
- Store tenant ID in server-side session (Redis)
- Lookup tenant on every request via session ID
- Subdomain used for tenant resolution on login

**Pros:**
- Can update tenant info without re-login
- Works well for web applications
- Session can store additional context

**Cons:**
- Not stateless: Requires Redis/session storage infrastructure
- Database/Redis lookup on every request (performance hit)
- Difficult to scale horizontally (session affinity required)
- Doesn't work well for mobile apps or API-only clients
- MCP tokens would still need separate mechanism

#### Option 3: Subdomain-Only Identification

**Approach:**
- Parse subdomain from HTTP Host header on every request
- Lookup tenant by slug in database
- No JWT claims for tenant

**Pros:**
- Simple conceptual model
- User-friendly (URL shows tenant)
- Easy to test locally

**Cons:**
- Database lookup on every request (performance bottleneck)
- Doesn't work for API clients (no subdomain in API calls)
- Doesn't work for mobile apps
- Vulnerable to spoofing: the Host header is client-controlled, so it must be validated before it is trusted
- MCP tokens cannot carry subdomain context

#### Option 4: Tenant ID in URL Path

**Approach:**
- Include tenant ID in every API route: `/api/tenants/{tenantId}/projects`
- Frontend passes tenant ID explicitly

**Pros:**
- Explicit tenant context in every request
- Easy to debug
- Works across all client types

**Cons:**
- Poor user experience (ugly URLs)
- Easy to make mistakes (wrong tenant ID)
- Difficult to enforce (requires middleware validation)
- Security risk (users could try other tenant IDs)
- Requires frontend to manage tenant ID everywhere

### Decision

**Chosen Option: Option 1 - JWT Claims (Primary) + Subdomain (Secondary)**

**Rationale:**
1. **Performance:** No database lookup per request; O(1) from JWT claims
2. **Security:** JWT signature prevents tampering; middleware validates on every request
3. **Scalability:** Works for web, mobile, API, and MCP tokens uniformly
4. **Stateless:** No session storage required; easy to scale horizontally
5. **Developer Experience:** TenantContext injected automatically via middleware

**Implementation Strategy:**
- **Login Flow:** User visits `acme.colaflow.com/login` → Tenant resolved from subdomain → JWT contains `tenant_id` and `tenant_slug`
- **API Requests:** JWT extracted from Authorization header → `tenant_id` injected into TenantContext → EF Core Global Query Filter applies automatic filtering
- **MCP Tokens:** Opaque tokens stored with `tenant_id` → Middleware validates token → Tenant context injected (same as JWT)
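
The API request path above can be sketched as follows. This is an illustration under stated assumptions, not the project's middleware: it uses HS256 with a shared secret purely to stay self-contained (the ADR does not specify the signing algorithm), and `TenantContext`, `signJwt`, and `tenantContextFromJwt` are hypothetical names. It is written in TypeScript for brevity; the real implementation would be ASP.NET Core middleware.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface TenantContext {
  tenantId: string;
  tenantSlug: string;
  userId: string;
}

const encode = (value: object): string =>
  Buffer.from(JSON.stringify(value), "utf8").toString("base64url");

// Decodes one base64url JWT segment into an object, or null if it is malformed.
function decodeSegment(segment: string): Record<string, unknown> | null {
  try {
    const value: unknown = JSON.parse(Buffer.from(segment, "base64url").toString("utf8"));
    return typeof value === "object" && value !== null ? (value as Record<string, unknown>) : null;
  } catch {
    return null;
  }
}

// Signer included only so the sketch can be exercised end to end.
function signJwt(claims: object, secret: string): string {
  const signingInput = `${encode({ alg: "HS256", typ: "JWT" })}.${encode(claims)}`;
  const signature = createHmac("sha256", secret).update(signingInput).digest("base64url");
  return `${signingInput}.${signature}`;
}

// Verifies signature and expiry, then builds the tenant context from claims only.
function tenantContextFromJwt(token: string, secret: string, nowSeconds: number): TenantContext | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [head, body, signature] = parts as [string, string, string];

  // Pin the algorithm instead of trusting whatever the header claims.
  if (decodeSegment(head)?.["alg"] !== "HS256") return null;

  // Verify the signature before reading any claim.
  const expected = createHmac("sha256", secret).update(`${head}.${body}`).digest();
  const actual = Buffer.from(signature, "base64url");
  if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) return null;

  const claims = decodeSegment(body);
  if (claims === null) return null;
  const exp = claims["exp"];
  const tenantId = claims["tenant_id"];
  const tenantSlug = claims["tenant_slug"];
  const userId = claims["sub"];
  if (typeof exp !== "number" || exp <= nowSeconds) return null;
  if (typeof tenantId !== "string" || typeof tenantSlug !== "string" || typeof userId !== "string") return null;
  return { tenantId, tenantSlug, userId };
}
```

Because the tenant is read only after the signature check succeeds, a client that edits `tenant_id` in the payload invalidates the token rather than switching tenants.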

### Consequences

**Positive:**
- Fast authentication and authorization
- No session storage infrastructure required
- Uniform tenant resolution across all client types
- Easy to test and debug (tenant visible in JWT payload)
- Supports multi-tenant mobile apps

**Negative:**
- Tenant changes require re-login (or wait for token refresh)
- JWT size increases slightly (+50 bytes for tenant claims)
- Middleware must validate JWT on every request (minor CPU cost)

**Neutral:**
- Subdomain is only used for initial tenant selection (login page)
- Tenant switching requires logout and login to different subdomain

**Mitigation Strategies:**
- Keep JWT expiration short (60 minutes) to allow tenant updates on refresh
- Implement automatic token refresh to minimize user disruption
- Cache JWT validation results per request to avoid redundant checks

### Validation

**Acceptance Criteria:**
- JWT contains `tenant_id`, `tenant_slug`, and `tenant_plan` claims
- Middleware extracts tenant from JWT and injects into TenantContext
- All database queries automatically filter by tenant via Global Query Filter
- Cross-tenant access attempts return 403 Forbidden
- Performance: <5ms overhead for JWT validation per request

**Testing:**
- Unit tests: TenantContext injection
- Integration tests: Cross-tenant isolation
- Performance tests: 10,000 req/s with JWT validation
- Security tests: Attempt to access other tenant's data (should fail)

### References
- Architecture Doc: `docs/architecture/multi-tenancy-architecture.md`
- JWT Implementation: `docs/architecture/jwt-authentication-architecture.md`
- MCP Token Format: `docs/architecture/mcp-authentication-architecture.md`

---

## ADR-002: Data Isolation Strategy

### Status
**Accepted** - 2025-11-03

### Context

In a multi-tenant system, data isolation is critical to ensure that one tenant cannot access another tenant's data. We need to choose an isolation strategy that balances security, performance, cost, and maintainability.

**Requirements:**
- Strong data isolation (no cross-tenant leaks)
- Good query performance (<50ms for typical queries)
- Cost-effective (avoid database proliferation)
- Easy to maintain and backup
- Scalable to 10,000+ tenants
- Support for per-tenant data export (GDPR compliance)

### Decision Drivers

1. **Security:** Absolute data isolation between tenants
2. **Cost:** Minimize infrastructure costs (PostgreSQL instances, storage)
3. **Performance:** Fast queries with proper indexing
4. **Scalability:** Support thousands of tenants on shared infrastructure
5. **Maintainability:** Easy schema migrations, backups, monitoring

### Options Considered

#### Option 1: Shared Database + tenant_id Column + Global Query Filter

**Approach:**
- All tenants share one PostgreSQL database
- Every table has a `tenant_id` column (NOT NULL)
- EF Core Global Query Filter automatically adds `.Where(e => e.TenantId == currentTenantId)` to all queries
- Composite indexes: `(tenant_id, other_columns)`

**Schema Example:**
```sql
CREATE TABLE projects (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    name VARCHAR(200) NOT NULL,
    key VARCHAR(20) NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    CONSTRAINT uq_projects_tenant_key UNIQUE (tenant_id, key)
);

CREATE INDEX idx_projects_tenant_id ON projects(tenant_id);
-- No separate (tenant_id, key) index: the UNIQUE constraint above already creates one.
```

**EF Core Configuration:**
```csharp
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Project>().HasQueryFilter(
        p => p.TenantId == _tenantContext.CurrentTenantId
    );
}
```
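
To show what the global filter buys, here is a small in-memory analogue, offered as a sketch rather than as how EF Core works internally. `TenantScopedStore` and `TenantOwned` are hypothetical names; the point is that every read path passes through one filter and every write is stamped with the current tenant, so individual call sites cannot forget the `tenant_id` predicate.

```typescript
interface TenantOwned {
  id: string;
  tenantId: string;
}

// In-memory stand-in for a table guarded by a global tenant filter.
class TenantScopedStore<T extends TenantOwned> {
  private readonly rows: T[] = [];
  private readonly currentTenantId: () => string;

  constructor(currentTenantId: () => string) {
    this.currentTenantId = currentTenantId;
  }

  // Inserts are stamped with the current tenant, so callers cannot choose one.
  add(row: Omit<T, "tenantId">): T {
    const stamped = { ...row, tenantId: this.currentTenantId() } as unknown as T;
    this.rows.push(stamped);
    return stamped;
  }

  // Single choke point for reads: the analogue of the global query filter.
  private visible(): T[] {
    const tenantId = this.currentTenantId();
    return this.rows.filter((r) => r.tenantId === tenantId);
  }

  all(): T[] {
    return this.visible();
  }

  findById(id: string): T | undefined {
    return this.visible().find((r) => r.id === id);
  }
}
```

A lookup for another tenant's row returns `undefined` here, which is the in-memory equivalent of the row being invisible to the query.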

**Pros:**
- Cost-effective: One database for all tenants
- Easy to maintain: Single schema, one backup process
- Good performance with proper indexing (composite indexes)
- Easy to add new tenants (just insert into `tenants` table)
- Per-tenant data export is SQL query: `SELECT * FROM projects WHERE tenant_id = 'xxx'`
- Scales to 10,000+ tenants on one database
- Automatic filtering via Global Query Filter (developer-friendly)

**Cons:**
- Risk of data leak if Global Query Filter is bypassed (`.IgnoreQueryFilters()`)
- All tenants affected by database downtime
- Cannot isolate noisy neighbors (one tenant's heavy queries affect others)
- Database size grows with all tenants (monitoring required)

**Cost Estimate:** 1 database instance (~$100-200/month for medium workload)

#### Option 2: Database-per-Tenant

**Approach:**
- Each tenant gets a dedicated PostgreSQL database
- Connection string stored in `tenants` table
- Middleware switches database context per request

**Schema Example:**
```sql
-- Shared management database
CREATE TABLE tenants (
    id UUID PRIMARY KEY,
    slug VARCHAR(50) UNIQUE NOT NULL,
    connection_string TEXT NOT NULL -- Encrypted
);

-- Tenant-specific database (one per tenant)
CREATE DATABASE tenant_acme;
CREATE DATABASE tenant_beta;
```

**Pros:**
- Strong isolation: One tenant's database cannot access another
- Tenant-specific customization (different schema versions)
- Easy to back up per tenant
- Noisy neighbors don't affect each other
- Easy to migrate tenant to different database server

**Cons:**
- Expensive: N databases for N tenants (~$10-20/month per tenant minimum)
- Complex maintenance: Schema migrations across 1000s of databases
- Connection pool exhaustion (need one pool per tenant)
- Difficult to implement cross-tenant features (analytics, admin tools)
- Onboarding delay (new database provisioning takes time)

**Cost Estimate:** 1000 tenants × $15/month = $15,000/month (vs $200 for shared)

#### Option 3: Schema-per-Tenant (PostgreSQL Schemas)

**Approach:**
- One database with multiple PostgreSQL schemas
- Each tenant gets a schema: `tenant_acme.projects`, `tenant_beta.projects`
- Middleware switches search_path per request: `SET search_path = tenant_acme;`

**Pros:**
- Better isolation than shared database
- Lower cost than database-per-tenant
- All tenants in one PostgreSQL instance (easier backups)
- Can support ~1000 schemas per database

**Cons:**
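- Practical ceiling of roughly 1000 schemas per database (PostgreSQL enforces no hard limit, but catalog size, dumps, and migrations degrade as schemas multiply)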
- Schema creation overhead for new tenants
- Complex schema migrations (run migration on each schema)
- Search_path switching per request (performance overhead)
- Difficult to enforce (easy to forget to set search_path)

**Cost Estimate:** Same as shared database, but limited scalability

#### Option 4: Separate Infrastructure per Tenant (Fully Isolated)

**Approach:**
- Each tenant gets dedicated Kubernetes namespace, database, Redis, etc.
- Complete infrastructure isolation

**Pros:**
- Maximum isolation and security
- Per-tenant scaling and customization
- Enterprise customers often require this

**Cons:**
- Extremely expensive (hundreds of dollars per tenant)
- Complex to manage (orchestration required)
- Overkill for most tenants
- Long onboarding time

**Cost Estimate:** 1000 tenants × $500/month = $500,000/month (prohibitive)
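
The cost estimates quoted for the options can be reproduced with a simple linear model. This is a sketch using only the figures stated in this ADR; treating the shared database as a flat $200/month regardless of tenant count is a simplifying assumption, and `CostModel` and `monthlyCost` are hypothetical names.

```typescript
interface CostModel {
  fixedPerMonth: number;     // USD, independent of tenant count
  perTenantPerMonth: number; // USD per tenant
}

const costModels = {
  sharedDatabase: { fixedPerMonth: 200, perTenantPerMonth: 0 },   // upper end of $100-200
  databasePerTenant: { fixedPerMonth: 0, perTenantPerMonth: 15 }, // midpoint of $10-20
  dedicatedInfrastructure: { fixedPerMonth: 0, perTenantPerMonth: 500 },
};

function monthlyCost(model: CostModel, tenants: number): number {
  return model.fixedPerMonth + model.perTenantPerMonth * tenants;
}

// Cost of each option at the 1000-tenant scale used in the estimates.
const at1000Tenants = {
  shared: monthlyCost(costModels.sharedDatabase, 1000),
  perTenantDb: monthlyCost(costModels.databasePerTenant, 1000),
  dedicated: monthlyCost(costModels.dedicatedInfrastructure, 1000),
};
```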

### Decision

**Chosen Option: Option 1 - Shared Database + tenant_id Column + Global Query Filter**

**Rationale:**
1. **Cost-Effective:** $200/month vs $15,000/month for database-per-tenant
2. **Scalable:** PostgreSQL handles 10,000+ tenants with proper indexing
3. **Maintainable:** One schema, one backup process, one monitoring dashboard
4. **Developer-Friendly:** EF Core Global Query Filter ensures automatic filtering
5. **Performance:** Composite indexes provide excellent query performance
6. **Proven Pattern:** Used by GitHub, Slack, Heroku, and many successful SaaS products

**Implementation Strategy:**
- Add `tenant_id` column to all business tables
- Create composite indexes: `(tenant_id, primary_key)`, `(tenant_id, foreign_key)`
- Configure EF Core Global Query Filter in `OnModelCreating`
- Create TenantContext service to inject current tenant
- Enforce tenant ownership at the database level: `tenant_id UUID NOT NULL` with a foreign key to `tenants(id)`
- Update unique constraints to be tenant-scoped: `UNIQUE (tenant_id, email)`

**Migration Path:**
1. Create `tenants` table
2. Create default tenant for existing data
3. Add `tenant_id` columns (nullable initially)
4. Migrate existing data to default tenant
5. Set `tenant_id` as NOT NULL
6. Add indexes and constraints

### Consequences

**Positive:**
- Low infrastructure cost (1 database vs thousands)
- Easy to maintain and monitor
- Fast schema migrations (one database)
- Automatic tenant filtering (developer safety)
- Good query performance with indexes
- Per-tenant data export is straightforward SQL

**Negative:**
- Risk of data leak if developer bypasses Global Query Filter
- All tenants share database resources (monitoring required)
- Cannot isolate noisy neighbors at database level
- Database backup contains all tenants (larger backup size)

**Neutral:**
- Tenant onboarding is instant (no new database needed)
- Cross-tenant analytics require explicit filtering
- Database size monitoring required as tenant count grows

**Mitigation Strategies:**
- **Data Leak Prevention:**
  - Code review requirement for any `.IgnoreQueryFilters()` usage
  - Integration tests verify cross-tenant isolation
  - Automated security testing (attempt cross-tenant access)
- **Performance Monitoring:**
  - Alert on slow queries (>100ms)
  - Index usage monitoring (pg_stat_user_indexes)
  - Per-tenant query cost tracking
- **Noisy Neighbor Protection:**
  - Query timeout limits (5 seconds max)
  - Rate limiting per tenant
  - Connection pool limits
  - Option to migrate large tenant to dedicated database later
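
Per-tenant rate limiting could take the form of a token bucket keyed by tenant ID. The sketch below is one possible shape, not a decided design: the algorithm choice, the `TenantRateLimiter` name, and the capacity and refill parameters are all assumptions, and a multi-instance deployment would need shared state instead of an in-process map.

```typescript
interface Bucket {
  tokens: number;
  updatedAtMs: number;
}

class TenantRateLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  // Returns true if the request may proceed. `nowMs` is injected for testability.
  tryAcquire(tenantId: string, nowMs: number): boolean {
    const bucket = this.buckets.get(tenantId) ?? { tokens: this.capacity, updatedAtMs: nowMs };

    // Refill in proportion to elapsed time, capped at the bucket capacity.
    const elapsedSeconds = Math.max(0, nowMs - bucket.updatedAtMs) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSeconds * this.refillPerSecond);
    bucket.updatedAtMs = nowMs;
    this.buckets.set(tenantId, bucket);

    if (bucket.tokens < 1) return false; // this tenant is over its budget
    bucket.tokens -= 1;
    return true;
  }
}
```

Each tenant draws from its own bucket, so one tenant exhausting its budget does not reduce what other tenants may send.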

**Upgrade Path:**
If a tenant grows too large or requires dedicated resources, we can migrate them to a separate database while keeping the shared model for other tenants.

### Validation

**Acceptance Criteria:**
- All queries automatically filter by tenant
- Cross-tenant access attempts fail with 403 Forbidden
- Query performance <50ms for typical workloads (with 10,000 records per tenant)
- Integration tests verify tenant isolation
- Data export per tenant completes in <1 minute

**Testing:**
- Unit tests: Global Query Filter applied to all entities
- Integration tests: Create data in Tenant A, verify Tenant B cannot access
- Performance tests: Query time with 1 million total records (100 tenants × 10,000 records)
- Load tests: 10,000 concurrent requests across 100 tenants

### References
- Architecture Doc: `docs/architecture/multi-tenancy-architecture.md`
- Migration Strategy: `docs/architecture/migration-strategy.md`
- Performance Benchmarks: `docs/architecture/performance-benchmarks.md` (TBD)

---

## ADR-003: SSO Library Selection

### Status
**Accepted** - 2025-11-03

### Context

Enterprise customers require Single Sign-On (SSO) to integrate ColaFlow with their corporate identity providers (Azure AD, Google Workspace, Okta, etc.). We need to choose an SSO library/approach that balances functionality, cost, implementation speed, and maintainability.

**Requirements:**
- Support major identity providers: Azure AD, Google, Okta
- Support OIDC (OpenID Connect) protocol
- Support SAML 2.0 for generic enterprise IdPs
- User auto-provisioning (create user on first SSO login)
- Email domain restrictions (only allow @acme.com)
- Configurable per tenant (each tenant has own SSO config)
- Production-ready security standards

### Decision Drivers

1. **Time-to-Market:** Implement SSO in <1 week (M1 timeline constraint)
2. **Cost:** Minimize licensing fees
3. **Coverage:** Support 90% of enterprise SSO requirements
4. **Flexibility:** Can upgrade later if complex requirements emerge
5. **Security:** Follow OWASP and OIDC/SAML best practices

### Options Considered

#### Option 1: ASP.NET Core Native OIDC/SAML (M1-M2)

**Approach:**
- Use built-in `Microsoft.AspNetCore.Authentication.OpenIdConnect` for OIDC
- Use `Sustainsys.Saml2` library for SAML 2.0
- Custom implementation for multi-tenant SSO configuration
- Store SSO config in `tenants` table (JSONB column)

**Pros:**
- Free: No licensing costs
- Fast: Can implement OIDC in 2-3 days, SAML in 3-4 days
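- First-party OIDC support: `Microsoft.AspNetCore.Authentication.OpenIdConnect` is Microsoft-maintained for .NET 9, mature and well-documented (SAML relies on the third-party `Sustainsys.Saml2`)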
- Flexible: Full control over implementation
- Covers 80-90% of enterprise SSO needs

**Cons:**
- Manual implementation: Need to handle user provisioning, domain restrictions
- Limited advanced features: No federation, no protocol switching
- SAML is more complex to implement
- Need to maintain our own SSO configuration UI

**Implementation Complexity:** Medium
**Cost:** $0/month
**Coverage:** OIDC (Azure, Google, Okta) + SAML 2.0 (80% of market)

**Code Example:**
```csharp
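// Sketch: `tenant` stands for an already-loaded tenant record. Schemes
// registered here are fixed at startup, so per-tenant settings that change
// at runtime need a dynamic options provider on top of this.
using Microsoft.AspNetCore.Authentication.OpenIdConnect;

services.AddAuthentication()
    .AddOpenIdConnect("AzureAD", options =>
    {
        options.Authority = tenant.SsoConfig.AuthorityUrl;
        options.ClientId = tenant.SsoConfig.ClientId;
        options.ClientSecret = tenant.SsoConfig.ClientSecret;
        options.ResponseType = "code"; // Authorization code flow
        options.SaveTokens = true;
        options.Events = new OpenIdConnectEvents
        {
            // Create the local user on first successful SSO login.
            OnTokenValidated = async context =>
            {
                await AutoProvisionUserAsync(context);
            }
        };
    });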
```

#### Option 2: Auth0

**Approach:**
- Use Auth0 as SSO broker
- Auth0 handles all identity providers
- Configure Auth0 via their dashboard
- Pay per monthly active user (MAU)

**Pros:**
- Fast setup: Implement in 1-2 days
- Comprehensive: Supports all identity providers out-of-the-box
- User management: Built-in user directory
- Advanced features: MFA, passwordless, anomaly detection
- Dashboard for SSO configuration

**Cons:**
- Expensive: $240/month (Professional) + $0.05/MAU (500 users = $25/month extra)
- Vendor lock-in: Difficult to migrate away
- Less control: Auth0 controls auth flow
- Overkill for MVP: Many features we don't need yet

**Implementation Complexity:** Low
**Cost:** $3,000-5,000/year (for 100 tenants with 5,000 total users)
**Coverage:** 100% (all protocols, all providers)

#### Option 3: Okta (Workforce Identity Cloud)

**Approach:**
- Use Okta as SSO broker
- Similar to Auth0 but more enterprise-focused
- Per-user pricing

**Pros:**
- Enterprise-grade: Trusted by Fortune 500
- Complete features: SSO, MFA, provisioning, directory
- Excellent support and documentation

**Cons:**
- Very expensive: $2/user/month minimum (100 users = $200/month)
- Enterprise sales process (slow, complex)
- Overkill for startup/SMB customers
- Vendor lock-in

**Implementation Complexity:** Low
**Cost:** $5,000-10,000/year (for 100 tenants)
**Coverage:** 100%

#### Option 4: IdentityServer4 / Duende IdentityServer

**Approach:**
- Use IdentityServer as self-hosted identity provider
- Implement Federation support (connect to external IdPs)
- Open-source (IdentityServer4, no longer maintained) or licensed (Duende)

**Pros:**
- Self-hosted: Full control
- Comprehensive: OIDC, OAuth 2.0, SAML via plugins
- Flexible: Can customize extensively
- No per-user fees

**Cons:**
- Complex: Steep learning curve (2-3 weeks to implement)
- Maintenance burden: Need to maintain IdentityServer instance
- Duende licensing: $1,500/year for production use
- Overkill for MVP: We don't need an identity provider, just SSO

**Implementation Complexity:** High
**Cost:** $1,500/year (Duende license)
**Coverage:** 100%

### Decision

**Chosen Option: Option 1 - ASP.NET Core Native OIDC/SAML (M1-M2)**

**Rationale:**
1. **Cost:** $0/month vs $3,000-5,000/year for Auth0 or $5,000-10,000/year for Okta
2. **Speed:** OIDC can ship in <1 week (M1 timeline); SAML, auto-provisioning, and the configuration UI follow in M2
3. **Control:** Full flexibility to customize
4. **Coverage:** Supports 80% of enterprise SSO requirements (OIDC + SAML)
5. **Upgrade Path:** Can migrate to Auth0/Okta later if complex requirements emerge

**Decision:** Start with native ASP.NET Core for M1-M2. Re-evaluate at M3 if we need:
- Complex federation (multiple IdPs per tenant)
- Advanced MFA flows
- More than 5 different SSO protocols
- Dedicated identity management features

**Implementation Strategy:**
- **M1 (Week 1):** OIDC implementation (Azure AD, Google, Okta)
- **M2 (Week 2):** SAML 2.0 implementation (generic enterprise IdPs)
- **M2 (Week 3):** User auto-provisioning and domain restrictions
- **M2 (Week 4):** SSO configuration UI for tenants
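
The domain restriction and auto-provisioning rules can be sketched as two small functions. This is an illustrative outline under assumptions: `SsoConfig`, `UserRecord`, `isEmailDomainAllowed`, and `provisionOnFirstLogin` are hypothetical names, the user store is an in-memory map, and the real logic would run inside the C# `OnTokenValidated` handler.

```typescript
interface SsoConfig {
  allowedDomains: string[]; // e.g. ["acme.com"]
  autoProvision: boolean;
}

interface UserRecord {
  email: string;
  tenantId: string;
}

function isEmailDomainAllowed(email: string, allowedDomains: string[]): boolean {
  const at = email.lastIndexOf("@");
  if (at <= 0 || at === email.length - 1) return false; // no local part or no domain
  const domain = email.slice(at + 1).toLowerCase();
  // Exact match only, so "acme.com.evil.org" and "notacme.com" do not pass.
  return allowedDomains.some((d) => d.toLowerCase() === domain);
}

// Returns the existing or newly created user, or null when the login is rejected.
function provisionOnFirstLogin(
  users: Map<string, UserRecord>,
  tenantId: string,
  email: string,
  config: SsoConfig,
): UserRecord | null {
  if (!isEmailDomainAllowed(email, config.allowedDomains)) return null;
  const key = `${tenantId}:${email.toLowerCase()}`;
  const existing = users.get(key);
  if (existing) return existing;
  if (!config.autoProvision) return null; // unknown user and provisioning disabled
  const created: UserRecord = { email: email.toLowerCase(), tenantId };
  users.set(key, created);
  return created;
}
```

Suffix or substring matching on the domain is deliberately avoided, since it would admit look-alike domains.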

### Consequences

**Positive:**
- Zero licensing costs for M1-M2
- Complete control over implementation
- Can customize for our specific needs
- Fast initial delivery (OIDC in < 1 week; SAML and the configuration UI follow in M2)
- Covers 80% of enterprise SSO requirements
- Learning opportunity for team

**Negative:**
- Manual implementation required (more code to maintain)
- Limited to OIDC + SAML 2.0 (no exotic protocols)
- Need to build SSO configuration UI ourselves
- More testing required (vs using Auth0)

**Neutral:**
- Can migrate to Auth0/Okta later if needed
- SSO config stored in database (our control)
- Integration tests required for each IdP

**Mitigation Strategies:**
- **Quality:** Comprehensive testing with real IdPs (Azure AD, Google)
- **Documentation:** Detailed guides for each supported provider
- **Security:** Follow OIDC/SAML security best practices
- **Upgrade Path:** Design SSO config to be provider-agnostic (easy migration)

### Validation

**Acceptance Criteria:**
- OIDC login works with Azure AD, Google, Okta
- SAML 2.0 login works with generic IdP
- Users auto-provisioned on first login
- Email domain restrictions enforced
- SSO configuration UI functional for admins
- Error handling for common SSO failures

**Testing:**
- Unit tests: OIDC token validation, SAML assertion parsing
- Integration tests: Full SSO flow with real IdPs (test tenants)
- Security tests: CSRF protection, replay attack prevention
- Usability tests: Admin can configure SSO without support

### References
- Architecture Doc: `docs/architecture/sso-integration-architecture.md`
- Implementation Guide: `docs/implementation/sso-implementation.md` (TBD)
- Security Checklist: `docs/security/sso-security-checklist.md` (TBD)

---

## ADR-004: MCP Token Format

### Status
**Accepted** - 2025-11-03

### Context

ColaFlow will expose an MCP (Model Context Protocol) server that allows AI agents (Claude, ChatGPT) to access project data, create tasks, and generate reports. We need a secure, revocable authentication mechanism for AI agents.

**Requirements:**
- Secure: Cannot be forged or tampered with
- Revocable: Admin can revoke token instantly
- Fine-Grained Permissions: Control read/write access per resource
- Audit Trail: Log all API operations performed with token
- Tenant-Scoped: Token only works for one tenant
- Long-Lived: Valid for days/weeks (not short-lived like JWT)

### Decision Drivers

1. **Security:** Token cannot be guessed or brute-forced
2. **Revocability:** Instant revocation (no JWT blacklist complexity)
3. **Permissions:** Resource-level + operation-level granularity
4. **Auditability:** Complete log of all token operations
5. **Usability:** Easy to copy/paste, recognizable format

### Options Considered

#### Option 1: Opaque Tokens (`mcp_<tenant_slug>_<random_32>`)

**Format:** `mcp_acme_7f3d8a9c4e1b2f5a6d8c9e0f1a2b3c4d`

**Approach:**
- Token is a random string (cryptographically secure)
- Prefix: `mcp_` (identifies as MCP token)
- Tenant slug: `acme` (for easy identification)
- Random part: 32 hex characters (128 bits of entropy)
- Store token hash (SHA256) in database
- Store permissions in database alongside token

**Token Storage:**
```sql
CREATE TABLE mcp_tokens (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL,
    user_id UUID NULL,
    name VARCHAR(100) NOT NULL,
    token_hash VARCHAR(255) NOT NULL UNIQUE, -- SHA256 of token
    permissions JSONB NOT NULL, -- {"projects": ["read", "search"], ...}
    status INT NOT NULL, -- Active/Revoked/Expired
    created_at TIMESTAMP NOT NULL,
    expires_at TIMESTAMP NULL,
    last_used_at TIMESTAMP NULL
);
```
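
Given the `permissions` shape shown in the column comment (`{"projects": ["read", "search"], ...}`), a permission check reduces to a lookup by resource followed by a membership test. The sketch below assumes that shape and a deny-by-default policy; `PermissionMap` and `hasPermission` are hypothetical names.

```typescript
// Resource name -> list of allowed operations, mirroring the JSONB column.
type PermissionMap = Record<string, string[]>;

// Deny by default: unknown resources and unlisted operations are rejected.
function hasPermission(permissions: PermissionMap, resource: string, operation: string): boolean {
  const operations = permissions[resource];
  // Array.isArray also guards against inherited keys such as "constructor".
  return Array.isArray(operations) && operations.includes(operation);
}
```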

**Validation Flow:**
1. Receive token: `mcp_acme_xxx...`
2. Hash token with SHA256
3. Lookup in database by token_hash
4. Check status (Active/Revoked/Expired)
5. Check expiration date
6. Load permissions from JSONB column
7. Inject tenant context and permissions into request
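
The seven steps above can be sketched as a single function. This is a sketch under assumptions, not the production validator: the token table is replaced by an in-memory map keyed by hash, slugs are assumed to contain only lowercase letters, digits, and hyphens, and `StoredToken`, `McpPrincipal`, `hashToken`, `generateToken`, and `validateToken` are hypothetical names.

```typescript
import { createHash, randomBytes } from "node:crypto";

type TokenStatus = "Active" | "Revoked" | "Expired";

interface StoredToken {
  tenantId: string;
  tokenHash: string;
  permissions: Record<string, string[]>;
  status: TokenStatus;
  expiresAt: Date | null;
}

interface McpPrincipal {
  tenantId: string;
  permissions: Record<string, string[]>;
}

const hashToken = (token: string): string =>
  createHash("sha256").update(token, "utf8").digest("hex");

// 16 random bytes rendered as 32 lowercase hex characters (128 bits).
function generateToken(tenantSlug: string): string {
  return `mcp_${tenantSlug}_${randomBytes(16).toString("hex")}`;
}

function validateToken(token: string, byHash: Map<string, StoredToken>, now: Date): McpPrincipal | null {
  // Step 1: reject anything that does not look like an MCP token.
  if (!/^mcp_[a-z0-9-]+_[0-9a-f]{32}$/.test(token)) return null;
  // Steps 2-3: hash the token and look the hash up.
  const row = byHash.get(hashToken(token));
  if (!row) return null;
  // Step 4: only active tokens pass.
  if (row.status !== "Active") return null;
  // Step 5: a null expiry means the token does not expire.
  if (row.expiresAt !== null && row.expiresAt.getTime() <= now.getTime()) return null;
  // Steps 6-7: tenant and permissions come from the stored row,
  // never from the slug embedded in the token string.
  return { tenantId: row.tenantId, permissions: row.permissions };
}
```

Taking the tenant from the stored row rather than parsing it out of the token keeps the slug a purely cosmetic hint.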

**Pros:**
- **Revocable:** Update `status = Revoked` in database, takes effect immediately
- **Secure:** SHA256 hashed, never stored plain-text
- **Flexible Permissions:** Can update permissions without regenerating token
- **Auditable:** Every token use logged in database
- **Tenant-Scoped:** Each token row stores `tenant_id`, so tenant context comes from the database record (the slug in the token is only a hint for humans)
- **Long-Lived:** Can be valid for months/years
- **Easy to Identify:** Prefix + tenant slug clearly identify token type

**Cons:**
- Database lookup required on every request (performance overhead)
- Longer tokens (41 characters for a four-letter slug such as `acme`, growing with slug length) vs UUID API keys (36 characters)
- Need to manage token lifecycle (expiration, revocation)

**Performance:** ~5ms per token validation (including database lookup)

#### Option 2: JWT Tokens for MCP

**Format:** Long JWT string (200+ characters)

**Approach:**
- Generate JWT with `tenant_id`, `user_id`, `permissions` claims
- Sign with secret key
- No database lookup required (stateless)
- Validate signature on every request

**Pros:**
- Stateless: No database lookup required
- Fast validation: O(1) signature check
- Self-contained: All info in token

**Cons:**
- **Cannot Revoke:** Once issued, JWT is valid until expiration (unless using blacklist)
- **Blacklist Required:** Need Redis/database to store revoked JWTs (adds complexity)
- **Permissions Fixed:** Cannot update permissions without regenerating token
- **Larger Tokens:** 200-500 characters (difficult to copy/paste)
- **Expiration Required:** Must set short expiration for revocation to work

**Revocation Problem:**
```
User generates JWT token → Shares with AI agent → Admin wants to revoke
→ JWT is still valid for 30 days → Need to blacklist JWT ID
→ Now need Redis to store blacklist → Not truly stateless anymore
```

#### Option 3: API Keys (UUID Format)

**Format:** `550e8400-e29b-41d4-a716-446655440000`

**Approach:**
- Generate random UUID
- Store in database with permissions
- Simple validation: lookup by UUID

**Pros:**
- Simple implementation
- Standard format (UUID)
- Revocable: validated by database lookup, so a key can be disabled server-side

**Cons:**
- No tenant context in token (need to lookup tenant)
- No token type identifier (could be confused with user IDs)
- No visual indication of purpose
- Slightly less entropy: a version 4 UUID carries 122 random bits, versus 128 bits in Option 1

#### Option 4: GitHub-Style Personal Access Tokens

**Format:** `ghp_ABcdEF123456789012345678901234567890`

**Approach:**
- Prefix identifies token type
- Random alphanumeric string
- Store hash in database

**Pros:**
- Industry standard (used by GitHub, GitLab)
- Easy to identify by prefix
- Secure

**Cons:**
- No tenant context in token itself
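- Random part uses a different alphabet and length than Option 1 (36 alphanumeric characters carry more than 128 bits, so this is a format difference, not an entropy gap)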

### Decision

**Chosen Option: Option 1 - Opaque Tokens (`mcp_<tenant_slug>_<random_32>`)**

**Rationale:**
1. **Revocability:** Instant revocation without blacklist complexity
2. **Flexibility:** Permissions stored server-side, can update without new token
3. **Security:** 128 bits of entropy + SHA256 hashing
4. **Usability:** Tenant slug in token helps users identify which tenant it's for
5. **Auditability:** Complete audit trail in database

**Token Format:**
```
mcp_<tenant_slug>_<random_32_hex_chars>
```

**Example:**
```
mcp_acme_7f3d8a9c4e1b2f5a6d8c9e0f1a2b3c4d
mcp_techcorp_a1b2c3d4e5f60718293a4b5c6d7e8f90
```

**Components:**
- `mcp_`: Identifies as MCP token (easy to filter in logs)
- `acme`: Tenant slug (helps user identify which tenant)
- `7f3d8a9c...`: 32 hex characters (128 bits entropy = 2^128 combinations)

**Generation:**
```csharp
public string GenerateToken(string tenantSlug)
{
    var randomBytes = new byte[16]; // 128 bits
    using var rng = RandomNumberGenerator.Create();
    rng.GetBytes(randomBytes);
    var randomHex = Convert.ToHexString(randomBytes).ToLowerInvariant();
    return $"mcp_{tenantSlug}_{randomHex}";
}
```

**Storage:**
```csharp
public async Task<string> CreateTokenAsync(CreateMcpTokenCommand command)
{
    // Assumption: `_tenantContext` exposes the current tenant's id and slug.
    var token = _tokenGenerator.GenerateToken(_tenantContext.CurrentTenantSlug);
    var tokenHash = _tokenGenerator.HashToken(token); // SHA256

    var mcpToken = new McpToken
    {
        TenantId = _tenantContext.CurrentTenantId, // Scope the token to one tenant
        TokenHash = tokenHash, // Never store plain-text
        Permissions = command.Permissions,
        ExpiresAt = command.ExpiresAt
    };

    await _repository.AddAsync(mcpToken);
    return token; // Return plain-text ONLY ONCE
}
```

### Consequences

**Positive:**
- Instant revocation (update database status)
- Fine-grained permissions (stored server-side)
- Complete audit trail
- Tenant-scoped (`tenant_id` stored with the token; the slug in the token aids identification)
- Secure (128-bit entropy + SHA256)
- User-friendly (tenant slug helps identification)

**Negative:**
- Database lookup required per request (~5ms overhead)
- Longer tokens (41+ characters depending on slug length, vs 36 for a UUID API key)
- Need to manage token lifecycle (expiration, cleanup)

**Neutral:**
- Performance overhead acceptable for MCP use case (not high-frequency)
- Token length acceptable for copy/paste workflow

**Mitigation Strategies:**
- **Performance:** Cache token validation results (5-minute TTL) and evict the cached entry when a token is revoked, so revocation still takes effect immediately
- **Token Length:** Provide copy button and download option in UI
- **Lifecycle Management:** Automated cleanup job for expired tokens
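
A validation cache has to coexist with the requirement that revocation takes effect immediately. One way to reconcile the two, sketched here under assumptions, is a TTL cache keyed by token hash with an explicit eviction hook called from the revocation path. `TokenValidationCache` is a hypothetical name, and a deployment with several API instances would additionally need to broadcast evictions or use a shared cache.

```typescript
class TokenValidationCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAtMs: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns the cached value, or undefined on a miss or an expired entry.
  get(tokenHash: string, nowMs: number): V | undefined {
    const hit = this.entries.get(tokenHash);
    if (!hit) return undefined;
    if (hit.expiresAtMs <= nowMs) {
      this.entries.delete(tokenHash); // lazily drop stale entries
      return undefined;
    }
    return hit.value;
  }

  set(tokenHash: string, value: V, nowMs: number): void {
    this.entries.set(tokenHash, { value, expiresAtMs: nowMs + this.ttlMs });
  }

  // Call from the revocation path so a revoked token is never served from cache.
  evict(tokenHash: string): void {
    this.entries.delete(tokenHash);
  }
}
```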

### Validation

**Acceptance Criteria:**
- Token generation is cryptographically secure (CSPRNG)
- Token hash stored (SHA256), never plain-text
- Token validation <10ms (including database lookup)
- Revocation takes effect immediately
- Permissions enforced on every API call
- Audit log created for every token use

**Testing:**
- Unit tests: Token generation format, hashing, validation
- Integration tests: Token authentication flow, permission enforcement
- Security tests: Brute-force resistance, revocation effectiveness
- Performance tests: 1,000 req/s with token validation

### References
- Architecture Doc: `docs/architecture/mcp-authentication-architecture.md`
- Token Management UI: `docs/design/multi-tenant-ux-flows.md#mcp-token-management-flow`

---

## ADR-005: Frontend State Management

### Status
**Accepted** - 2025-11-03

### Context

ColaFlow frontend (Next.js 16 + React 19) needs a state management solution for authentication, user preferences, and server data. We need to choose libraries that are TypeScript-first, performant, and maintainable.

**Requirements:**
- Type-safe: Full TypeScript support
- Performant: Minimal re-renders
- Developer-friendly: Low boilerplate
- Server state caching: Avoid redundant API calls
- Optimistic updates: Immediate UI feedback
- Auth state persistence: Survive page refresh

### Decision Drivers

1. **TypeScript Support:** First-class TypeScript integration
2. **Performance:** Minimal bundle size, fast renders
3. **DX (Developer Experience):** Easy to learn, low boilerplate
4. **Ecosystem:** Good documentation, active community
5. **Server State:** Built-in caching and invalidation

### Options Considered

#### Option 1: Zustand (Client State) + TanStack Query v5 (Server State)

**Approach:**
- **Zustand:** Lightweight state manager for auth, UI state
- **TanStack Query:** Server state caching, mutations, automatic refetching

**Zustand Example:**
```typescript
// stores/useAuthStore.ts
import { create } from 'zustand';
import { persist } from 'zustand/middleware';

interface AuthState {
  user: User | null;
  tenant: Tenant | null;
  accessToken: string | null;
  login: (token: string, user: User, tenant: Tenant) => void;
  logout: () => void;
}

export const useAuthStore = create<AuthState>()(
  persist(
    (set) => ({
      user: null,
      tenant: null,
      accessToken: null,
      login: (token, user, tenant) => set({ accessToken: token, user, tenant }),
      logout: () => set({ accessToken: null, user: null, tenant: null }),
    }),
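    {
      name: 'auth-storage',
      // Persist only non-sensitive fields; accessToken stays in memory (see ADR-006)
      partialize: (state) => ({ user: state.user, tenant: state.tenant }),
    }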
  )
);
```

**TanStack Query Example:**
```typescript
// hooks/useMcpTokens.ts
import { useQuery } from '@tanstack/react-query';
import { mcpService } from '@/services/mcp.service';

export function useMcpTokens() {
  return useQuery({
    queryKey: ['mcp-tokens'],
    queryFn: () => mcpService.listTokens(),
    staleTime: 1000 * 60 * 5, // 5 minutes
  });
}
```

**Pros:**
- **Minimal Bundle Size:** Zustand (3KB) + TanStack Query (15KB) = 18KB total
- **TypeScript-First:** Excellent type inference
- **Low Boilerplate:** No actions, reducers, or complex setup
- **Performance:** Zustand avoids unnecessary re-renders
- **Caching:** TanStack Query caches API responses automatically
- **DevTools:** Excellent debugging tools for both libraries
- **Separation of Concerns:** Client state in Zustand, server state in TanStack Query

**Cons:**
- Two libraries to learn (vs one all-in-one solution)
- Need to decide what goes in Zustand vs TanStack Query

**Learning Curve:** Low (Zustand is simpler than Redux, TanStack Query has great docs)

#### Option 2: Redux Toolkit + RTK Query

**Approach:**
- Redux Toolkit for all state
- RTK Query for API data fetching

**Pros:**
- All-in-one solution
- Mature ecosystem
- Excellent DevTools

**Cons:**
- **More Boilerplate:** Actions, slices, reducers
- **Larger Bundle:** Redux (10KB) + RTK Query (20KB) = 30KB
- **Steeper Learning Curve:** More concepts to learn
- **Overkill for MVP:** We don't need Redux's complexity yet

#### Option 3: React Context + SWR

**Approach:**
- React Context for auth state
- SWR for server data

**Pros:**
- Minimal dependencies (SWR only)
- Simple concept (React Context is built-in)

**Cons:**
- **Performance Issues:** Every consumer of a context re-renders whenever the context value changes, unless contexts are split or values memoized by hand
- **Boilerplate:** Need to create context providers manually
- **SWR vs TanStack Query:** SWR is less feature-rich

#### Option 4: Jotai + TanStack Query

**Approach:**
- Jotai for atomic state management
- TanStack Query for server state

**Pros:**
- Atomic state model (like Recoil)
- Good TypeScript support

**Cons:**
- Less mature than Zustand
- Smaller community
- Atomic model can be overkill for simple auth state

### Decision

**Chosen Option: Option 1 - Zustand (Client State) + TanStack Query v5 (Server State)**

**Rationale:**
1. **Bundle Size:** 18KB total (vs 30KB for Redux Toolkit)
2. **Performance:** Zustand selector-based re-renders, TanStack Query caching
3. **TypeScript:** First-class support in both libraries
4. **Learning Curve:** Simple APIs, great documentation
5. **Clear Separation:** Auth/UI in Zustand, API data in TanStack Query
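
Rationale item 2 depends on selector-based subscriptions: a subscriber is notified only when the slice of state it selected actually changes. The sketch below illustrates the idea in plain TypeScript. It is not Zustand's implementation or API; `createStore` and `subscribeWithSelector` here are hypothetical helpers written for this example.

```typescript
type Listener = () => void;

// A tiny store: one state object plus a set of change listeners.
function createStore<T extends object>(initial: T) {
  let state = initial;
  const listeners = new Set<Listener>();

  return {
    getState: (): T => state,

    // Shallow-merge the partial update, then notify every listener.
    setState(partial: Partial<T>): void {
      state = { ...state, ...partial };
      listeners.forEach((listener) => listener());
    },

    // Invoke onChange only when the selected slice changes (Object.is).
    // Returns an unsubscribe function.
    subscribeWithSelector<S>(
      selector: (s: T) => S,
      onChange: (slice: S) => void,
    ): () => void {
      let previous = selector(state);
      const listener = () => {
        const next = selector(state);
        if (!Object.is(previous, next)) {
          previous = next;
          onChange(next);
        }
      };
      listeners.add(listener);
      return () => {
        listeners.delete(listener);
      };
    },
  };
}
```

A component that selects only `theme` would not be notified when `sidebarOpen` changes. A plain React Context holding the same object would re-render every consumer on each update.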

**Usage Guidelines:**

**Zustand - Use For:**
- Authentication state (user, tenant, accessToken)
- UI state (sidebar open/closed, theme)
- User preferences (language, timezone)

**TanStack Query - Use For:**
- API data (projects, issues, tokens)
- Mutations (create, update, delete)
- Caching and invalidation

**Example Architecture:**
```typescript
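import { useMutation, useQuery, useQueryClient } from '@tanstack/react-query';

// The hooks below must be called inside a React component or custom hook.

// Zustand (auth)
const { user, tenant, logout } = useAuthStore();

// TanStack Query (server data)
const queryClient = useQueryClient();
const { data: projects, isLoading } = useQuery({
  queryKey: ['projects'],
  queryFn: () => projectService.getAll(),
});

// Mutation: invalidate the cached list so it refetches after a create
const createProject = useMutation({
  mutationFn: (data: Parameters<typeof projectService.create>[0]) =>
    projectService.create(data),
  onSuccess: () => {
    queryClient.invalidateQueries({ queryKey: ['projects'] });
  },
});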
```
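
The mutation above refetches after the server responds, which does not by itself meet the "optimistic updates" requirement. A framework-agnostic sketch of the optimistic pattern follows: apply the change to the cache immediately, then roll back if the request fails. The `optimisticUpdate` helper and the `Map`-based cache are assumptions for illustration; in TanStack Query the same steps would typically live in the `onMutate`, `onError`, and `onSettled` callbacks of `useMutation`.

```typescript
// Apply `optimistic` to the cached list right away, run `request`, and
// restore the previous cache entry if the request rejects.
async function optimisticUpdate<T>(
  cache: Map<string, T[]>,
  key: string,
  optimistic: (current: T[]) => T[],
  request: () => Promise<T[]>,
): Promise<T[]> {
  const previous = cache.get(key); // snapshot for rollback
  cache.set(key, optimistic(previous ?? [])); // immediate UI feedback

  try {
    const confirmed = await request();
    cache.set(key, confirmed); // replace the guess with the server's answer
    return confirmed;
  } catch (error) {
    // Roll back to exactly what was there before the optimistic write.
    if (previous === undefined) {
      cache.delete(key);
    } else {
      cache.set(key, previous);
    }
    throw error;
  }
}
```

The caller still sees the rejection, so it can show an error message after the rollback.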

### Consequences

**Positive:**
- Lightweight and fast
- Easy to learn and use
- Great TypeScript experience
- Excellent caching and performance
- Clear separation of concerns

**Negative:**
- Two libraries to learn (instead of one)
- Need to decide where state lives (Zustand vs TanStack Query)

**Neutral:**
- Both libraries have excellent DevTools
- Both are actively maintained

**Mitigation Strategies:**
- **Documentation:** Create team guide for "What goes where"
- **Code Reviews:** Ensure consistent usage patterns
- **Linting:** Custom ESLint rules if needed

### Validation

**Acceptance Criteria:**
- Auth state persists across page refresh
- API data cached appropriately (no redundant calls)
- Optimistic updates work (immediate UI feedback)
- TypeScript errors caught at compile time
- DevTools show state clearly

**Performance Targets:**
- Initial page load: <1.5s
- State updates: <16ms (60fps)
- Cache hit rate: >80%

### References
- Zustand Docs: https://docs.pmnd.rs/zustand
- TanStack Query Docs: https://tanstack.com/query
- Implementation: `docs/frontend/state-management-guide.md`

---

## ADR-006: Token Storage Strategy

### Status
**Accepted** - 2025-11-03

### Context

We need to securely store JWT access tokens and refresh tokens in the frontend. The storage mechanism must balance security, usability, and functionality.

**Requirements:**
- Secure: Protect against XSS and CSRF attacks
- Persistent: Survive page refresh
- Auto-refresh: Seamlessly refresh tokens before expiration
- Logout: Clear tokens on logout
- Cross-tab sync: Logout in one tab logs out all tabs

### Decision Drivers

1. **Security:** XSS protection (primary threat)
2. **CSRF Protection:** For refresh tokens
3. **Usability:** Seamless token refresh
4. **Persistence:** User stays logged in across sessions
5. **Performance:** Fast token access

### Options Considered

#### Option 1: Access Token in Memory + Refresh Token in httpOnly Cookie

**Approach:**
- **Access Token:** Stored in Zustand state (memory only, not persisted)
- **Refresh Token:** Stored in httpOnly cookie (server-side managed)
- **Flow:**
  1. User logs in → Receive access + refresh tokens
  2. Access token stored in Zustand (memory)
  3. Refresh token stored in httpOnly cookie by backend
  4. Access token used for API calls (Authorization header)
  5. On 401 error → Call `/api/auth/refresh` (refresh token sent automatically via cookie)
  6. Receive new access token → Update Zustand state

**Cookie Configuration (Backend):**
```csharp
Response.Cookies.Append("refreshToken", refreshToken, new CookieOptions
{
    HttpOnly = true, // Cannot be accessed by JavaScript
    Secure = true,   // HTTPS only
    SameSite = SameSiteMode.Strict, // CSRF protection
    MaxAge = TimeSpan.FromDays(7)
});
```

**Pros:**
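- **Reduced XSS Exposure (Access Token):** Never written to localStorage, sessionStorage, or cookies, so an injected script has no persisted token to read
- **Refresh Token Protection:** httpOnly blocks JavaScript access (XSS), and SameSite=Strict blocks cross-site sends (CSRF)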
- **Short-Lived Access Token:** Even if leaked, expires in 60 minutes
- **Automatic Refresh:** Cookie sent automatically on refresh endpoint
- **No Manual Cookie Management:** Backend sets/clears cookies

**Cons:**
- Access token lost on page refresh (need to call refresh immediately)
- Requires cookie support (some corporate proxies block cookies)

**Security Score:** 9/10 (Best practice)

#### Option 2: Both Tokens in localStorage

**Approach:**
- Store both access and refresh tokens in localStorage
- Read on page load

**Pros:**
- Simple implementation
- Tokens persist across page refresh
- No cookie management

**Cons:**
- **Vulnerable to XSS:** If attacker injects script, can steal both tokens
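- **No Defense in Depth:** Tokens are readable by any script on the origin, including compromised third-party dependencies (CSRF is not the concern here, since localStorage values are never attached to requests automatically)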
- **Not Recommended:** Violates OWASP security guidelines

**Security Score:** 3/10 (Not secure)

#### Option 3: Both Tokens in httpOnly Cookies

**Approach:**
- Store both tokens in httpOnly cookies
- Backend sets the cookies, and the browser attaches them to every API request automatically

**Pros:**
- XSS protection for both tokens
- Automatic token management

**Cons:**
- **CSRF Vulnerability:** Cookies are sent automatically with every request, including the access token
- **Need CSRF Tokens:** Requires an anti-CSRF mechanism such as the double-submit cookie pattern, which adds complexity
- **Cookie Size Limit:** JWTs can be large (4KB cookie limit)

**Security Score:** 6/10 (CSRF risk)

#### Option 4: Session-Based Authentication (No JWT)

**Approach:**
- Traditional session cookies
- Session stored server-side (Redis)

**Pros:**
- Simple
- Secure (session ID only)

**Cons:**
- Not stateless (requires Redis/database for sessions)
- Horizontal scaling complexity
- Not suitable for mobile apps
- Against our JWT strategy

**Security Score:** 7/10 (Secure but not stateless)

### Decision

**Chosen Option: Option 1 - Access Token in Memory + Refresh Token in httpOnly Cookie**

**Rationale:**
1. **Best Security:** Access token is never persisted where an injected script can read it; refresh token is httpOnly (XSS) and SameSite=Strict (CSRF)
2. **Common Practice:** In-memory access tokens combined with a cookie-based refresh token are a widely recommended pattern for browser SPAs
3. **Balances Security and UX:** Short-lived access token, auto-refresh
4. **Stateless:** No session storage required
5. **Mobile-Friendly:** Can adapt for mobile (store refresh token securely)

**Implementation:**

```typescript
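// stores/useAuthStore.ts
import { create } from 'zustand';

// AuthState is the interface from ADR-005, extended with
// updateToken: (token: string) => void (used by the API client below).
export const useAuthStore = create<AuthState>()((set) => ({
  user: null,
  tenant: null,
  accessToken: null, // Stored in memory ONLY
  login: (token, user, tenant) => set({ accessToken: token, user, tenant }),
  logout: () => set({ accessToken: null, user: null, tenant: null }),
  updateToken: (token) => set({ accessToken: token }),
}));

// The access token must never reach persisted storage. If persist is used
// for user/tenant, exclude accessToken via the partialize option.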

```typescript
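// lib/api-client.ts
import axios from 'axios';
// Path assumes the '@/' alias used elsewhere in the frontend
import { useAuthStore } from '@/stores/useAuthStore';

// apiClient is the axios instance created earlier in this file
apiClient.interceptors.response.use(
  (response) => response,
  async (error) => {
    // Only handle the first 401 for a request; _retry prevents an infinite loop
    if (error.response?.status !== 401 || error.config._retry) {
      return Promise.reject(error);
    }
    error.config._retry = true;

    let accessToken: string;
    try {
      // Plain axios (not apiClient), so a failed refresh cannot re-enter this
      // interceptor. The refresh token is sent via the httpOnly cookie.
      const { data } = await axios.post('/api/auth/refresh');
      accessToken = data.accessToken;
    } catch (refreshError) {
      // Refresh failed (expired or missing cookie): clear state so the UI
      // can redirect to login
      useAuthStore.getState().logout();
      return Promise.reject(refreshError);
    }

    // Update access token in memory
    useAuthStore.getState().updateToken(accessToken);

    // Retry original request with the new token
    error.config.headers.Authorization = `Bearer ${accessToken}`;
    return apiClient(error.config);
  }
);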
```

**Token Refresh Strategy:**
- **Automatic:** Intercept 401 errors, call refresh endpoint
- **Preemptive (Optional):** Refresh 5 minutes before expiration
- **One-at-a-Time:** Only one refresh call in flight (queue other requests)
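
The "One-at-a-Time" rule can be sketched as a single-flight wrapper: while a refresh is pending, every caller receives the same promise, so several simultaneous 401 responses trigger only one refresh call. `createSingleFlightRefresh` is a hypothetical helper name, and the wrapped function is assumed to resolve with the new access token.

```typescript
type RefreshFn = () => Promise<string>;

// Wraps `refresh` so that concurrent callers share one in-flight promise.
function createSingleFlightRefresh(refresh: RefreshFn): RefreshFn {
  let inFlight: Promise<string> | null = null;

  return () => {
    if (inFlight === null) {
      inFlight = refresh().finally(() => {
        // Settled (success or failure): allow the next refresh to start.
        inFlight = null;
      });
    }
    return inFlight;
  };
}
```

In the response interceptor, the direct refresh call could be replaced by the wrapped function, so requests that fail while a refresh is pending simply await the shared promise and then retry.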

### Consequences

**Positive:**
- Strong security posture (reduced XSS exposure, CSRF-protected refresh token)
- Seamless user experience (auto-refresh)
- Stateless authentication
- Mobile-friendly (adapt for secure storage)
- Industry best practice

**Negative:**
- Access token lost on page refresh (need immediate refresh call)
- Requires cookie support (fails in some corporate environments)
- More complex implementation than localStorage

**Neutral:**
- Short-lived access token means more refresh calls (acceptable trade-off)

**Mitigation Strategies:**
- **Page Load:** Call refresh endpoint on app load if no access token in memory
- **Cookie Fallback:** If cookies blocked, fall back to re-login
- **Error Handling:** Clear UX if authentication fails (session expired)
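
The page-load and cookie-fallback mitigations can be combined into one bootstrap step, sketched below. `bootstrapSession` and the `SessionState` type are hypothetical names, and `refresh` is assumed to call the refresh endpoint and reject when the cookie is missing, expired, or blocked.

```typescript
type SessionState =
  | { status: 'authenticated'; accessToken: string }
  | { status: 'login-required' };

// Run once on app load, before rendering protected routes.
async function bootstrapSession(
  currentToken: string | null,
  refresh: () => Promise<string>,
): Promise<SessionState> {
  // A token is already in memory (e.g. client-side navigation): nothing to do.
  if (currentToken !== null) {
    return { status: 'authenticated', accessToken: currentToken };
  }

  try {
    // After a full page refresh the in-memory token is gone; the httpOnly
    // cookie, if present, lets the backend issue a new one.
    const accessToken = await refresh();
    return { status: 'authenticated', accessToken };
  } catch {
    // Expired session or blocked cookies: fall back to re-login.
    return { status: 'login-required' };
  }
}
```

The caller would store the returned token in the auth store, or redirect to the login page when the status is `login-required`.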

### Validation

**Acceptance Criteria:**
- Access token not visible in localStorage/sessionStorage/cookies (developer tools)
- Refresh token in httpOnly cookie with SameSite=Strict
- 401 errors trigger automatic token refresh
- Logout clears all tokens (memory + cookies)
- Cross-tab logout works (logout writes a marker to localStorage; other tabs react to the resulting `storage` event)
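
Because the access token lives only in memory, clearing it does not by itself produce a `storage` event. One way to meet the cross-tab criterion is to write a logout marker to localStorage and react to it in the other tabs. The sketch below assumes a hypothetical key name and uses small structural types so the helpers can be exercised outside a browser.

```typescript
// Hypothetical key name; any key unique to the app works.
const LOGOUT_KEY = 'colaflow:logout';

// Minimal structural types matching the parts of Storage and StorageEvent used here.
interface StorageLike {
  setItem(key: string, value: string): void;
}
interface StorageEventLike {
  key: string | null;
  newValue: string | null;
}

// Called by the tab in which the user logs out.
function announceLogout(storage: StorageLike, now: () => number = Date.now): void {
  // The event fires only when the stored value changes; a timestamp ensures that.
  storage.setItem(LOGOUT_KEY, String(now()));
}

// Called from a `storage` event listener in every tab.
// Returns true when the event was a logout announcement.
function handleStorageEvent(event: StorageEventLike, onLogout: () => void): boolean {
  if (event.key !== LOGOUT_KEY || event.newValue === null) return false;
  onLogout();
  return true;
}
```

In the browser, `announceLogout(window.localStorage)` would run as part of logout, and each tab would register `window.addEventListener('storage', (e) => handleStorageEvent(e, () => useAuthStore.getState().logout()))`. The browser delivers `storage` events only to other tabs on the same origin, not to the tab that made the change, so the originating tab must clear its own state directly.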

**Security Tests:**
- XSS attack simulation (cannot steal access token)
- CSRF attack simulation (refresh endpoint protected)
- Token expiration handled gracefully
- Logout clears all authentication state

### References
- OWASP: https://cheatsheetseries.owasp.org/cheatsheets/JSON_Web_Token_for_Java_Cheat_Sheet.html
- Auth0 Best Practices: https://auth0.com/docs/secure/tokens/refresh-tokens/refresh-token-rotation
- Implementation: `docs/frontend/api-integration-guide.md`

---

## Summary of Decisions

| Decision | Chosen Solution | Rationale |
|----------|----------------|-----------|
| **ADR-001: Tenant Identification** | JWT Claims + Subdomain | Stateless, cross-platform, performant |
| **ADR-002: Data Isolation** | Shared DB + tenant_id + Global Query Filter | Cost-effective, scalable, maintainable |
| **ADR-003: SSO Library** | ASP.NET Core Native (OIDC + SAML) | Free, fast, covers 80% of needs |
| **ADR-004: MCP Token Format** | Opaque Tokens (`mcp_<slug>_<random>`) | Revocable, flexible, secure, auditable |
| **ADR-005: Frontend State** | Zustand + TanStack Query | Lightweight, TypeScript-first, performant |
| **ADR-006: Token Storage** | Access in Memory + Refresh in httpOnly Cookie | XSS + CSRF protected, industry standard |

## Impact Assessment

### Security Impact
- **Overall Security Posture:** Excellent (9/10)
- **XSS Protection:** Enforced (tokens in memory + httpOnly cookies)
- **CSRF Protection:** Enforced (SameSite=Strict cookies)
- **Data Isolation:** Enforced (Global Query Filter + composite indexes)
- **Audit Trail:** Complete (MCP tokens logged, SSO events tracked)

### Performance Impact
- **API Latency:** +5ms (JWT validation + tenant filtering)
- **Database Load:** Minimal (composite indexes, Global Query Filter)
- **Frontend Bundle Size:** +18KB (Zustand + TanStack Query)
- **Token Refresh:** Transparent to user (<100ms)

### Cost Impact
- **Infrastructure:** $200/month (1 database vs $15,000 for DB-per-tenant)
- **Licensing:** $0/month (native .NET libraries vs $3,000-5,000 for Auth0)
- **Maintenance:** Low (one schema, automated migrations)
- **Total Savings:** ~$18,000/year compared to Auth0 + DB-per-tenant

### Development Impact
- **Implementation Time:** 10 days (vs 6 weeks for IdentityServer + DB-per-tenant)
- **Learning Curve:** Low (native libraries, clear architecture)
- **Maintenance Burden:** Low (well-documented, industry patterns)
- **Testing Complexity:** Medium (need tenant isolation tests)

## Risks and Mitigation

| Risk | Mitigation |
|------|------------|
| **Data leak via Global Query Filter bypass** | Code review for `.IgnoreQueryFilters()`, integration tests |
| **SSO misconfiguration** | Test connection UI, detailed error messages, documentation |
| **MCP token brute-force** | 128-bit entropy, rate limiting, IP whitelisting |
| **Performance degradation** | Composite indexes, query monitoring, slow query alerts |
| **Frontend XSS attack** | CSP headers, input sanitization, React auto-escaping |

## Future Enhancements

Decisions are not permanent. We will revisit these at milestone reviews:

| Milestone | Potential Changes |
|-----------|-------------------|
| **M3** | Re-evaluate SSO (Auth0 if complex federation needed) |
| **M4** | Re-evaluate data isolation (DB-per-tenant for enterprise customers) |
| **M5** | Re-evaluate frontend state (Redux if complex state emerges) |
| **M6** | Re-evaluate MCP tokens (consider JWT if performance critical) |

---

**Document Status:** Approved
**Next Review:** M3 Architecture Review (2025-12-15)
**Approval Signatures:**
- Architecture Team: [Approved]
- Product Manager: [Approved]
- Security Team: [Pending Review]
- Engineering Lead: [Approved]

---

**End of Architecture Decision Record**