# Architecture Decision Record - ColaFlow Enterprise Multi-Tenancy

**Document Type:** ADR (Architecture Decision Record)
**Date:** 2025-11-03
**Status:** Accepted
**Decision Makers:** Architecture Team, Product Manager, Technical Leads
**Project:** ColaFlow - M1 Sprint 2 (Enterprise Multi-Tenant Upgrade)

---

## Document Purpose

This Architecture Decision Record (ADR) documents the key architectural decisions made for ColaFlow's transition from a single-tenant to an enterprise-ready multi-tenant SaaS platform. It follows the ADR format to capture context, options considered, chosen solutions, and consequences.

---

## Table of Contents

1. [ADR-001: Tenant Identification Strategy](#adr-001-tenant-identification-strategy)
2. [ADR-002: Data Isolation Strategy](#adr-002-data-isolation-strategy)
3. [ADR-003: SSO Library Selection](#adr-003-sso-library-selection)
4. [ADR-004: MCP Token Format](#adr-004-mcp-token-format)
5. [ADR-005: Frontend State Management](#adr-005-frontend-state-management)
6. [ADR-006: Token Storage Strategy](#adr-006-token-storage-strategy)
7. [Summary of Decisions](#summary-of-decisions)

---

## ADR-001: Tenant Identification Strategy

### Status
**Accepted** - 2025-11-03

### Context

ColaFlow is transitioning to a multi-tenant architecture where multiple companies (tenants) will share the same application instance. We need a reliable, performant, and secure method to identify which tenant a user or API request belongs to.

**Requirements:**
- Must work across web, mobile, and API clients
- Must be stateless (no session storage required)
- Must be secure (prevent tenant spoofing)
- Must be performant (no database lookup per request)
- Must support both human users and AI agents (MCP tokens)
- Must work with subdomain-based URLs (e.g., `acme.colaflow.com`)

### Decision Drivers

1. **Performance:** System must handle 10,000+ requests/second without database lookups
2. **Security:** Tenant ID cannot be tampered with by malicious users
3. **Scalability:** Solution must work for mobile apps, APIs, and web simultaneously
4. **Developer Experience:** Easy to implement and maintain across all layers
5. **User Experience:** Friendly tenant selection (via subdomain)

### Options Considered

#### Option 1: JWT Claims (Primary) + Subdomain (Secondary)

**Approach:**
- Store `tenant_id` and `tenant_slug` in JWT access token claims
- Resolve tenant from subdomain on login/registration
- Inject tenant context from JWT claims into all API requests
- No database lookup required after authentication

**Pros:**
- Stateless: No session storage or database lookup per request
- Secure: JWT signature prevents tampering
- Cross-platform: Works for web, mobile, API, MCP tokens
- Fast: O(1) lookup from JWT claims
- Tenant context available in middleware layer

**Cons:**
- JWT cannot be updated until refresh (stale tenant info for up to 60 minutes)
- Requires careful token expiration management
- Subdomain only used for initial tenant resolution (login page)

**Example JWT Payload:**
```json
{
  "sub": "user-id-123",
  "email": "john@acme.com",
  "tenant_id": "tenant-uuid-456",
  "tenant_slug": "acme",
  "tenant_plan": "Enterprise",
  "auth_provider": "AzureAD",
  "role": "User",
  "exp": 1730678400,
  "iat": 1730674800
}
```

#### Option 2: Session-Based Tenant Storage

**Approach:**
- Store tenant ID in server-side session (Redis)
- Lookup tenant on every request via session ID
- Subdomain used for tenant resolution on login

**Pros:**
- Can update tenant info without re-login
- Works well for web applications
- Session can store additional context

**Cons:**
- Not stateless: Requires Redis/session storage infrastructure
- Database/Redis lookup on every request (performance hit)
- Harder to scale horizontally (every node depends on the shared session store, or session affinity is required)
- Doesn't work well for mobile apps or API-only clients
- MCP tokens would still need a separate mechanism

#### Option 3: Subdomain-Only Identification

**Approach:**
- Parse subdomain from HTTP Host header on every request
- Lookup tenant by slug in database
- No JWT claims for tenant

**Pros:**
- Simple conceptual model
- User-friendly (URL shows tenant)
- Easy to test locally

**Cons:**
- Database lookup on every request (performance bottleneck)
- Doesn't work for API clients (no subdomain in API calls)
- Doesn't work for mobile apps
- Vulnerable to Host header and DNS spoofing
- MCP tokens cannot carry subdomain context

#### Option 4: Tenant ID in URL Path

**Approach:**
- Include tenant ID in every API route: `/api/tenants/{tenantId}/projects`
- Frontend passes tenant ID explicitly

**Pros:**
- Explicit tenant context in every request
- Easy to debug
- Works across all client types

**Cons:**
- Poor user experience (ugly URLs)
- Easy to make mistakes (wrong tenant ID)
- Difficult to enforce (requires middleware validation)
- Security risk (users could try other tenant IDs)
- Requires frontend to manage tenant ID everywhere

### Decision

**Chosen Option: Option 1 - JWT Claims (Primary) + Subdomain (Secondary)**

**Rationale:**
1. **Performance:** No database lookup per request; O(1) from JWT claims
2. **Security:** JWT signature prevents tampering; middleware validates on every request
3. **Scalability:** Works for web, mobile, API, and MCP tokens uniformly
4. **Stateless:** No session storage required; easy to scale horizontally
5. **Developer Experience:** TenantContext injected automatically via middleware

**Implementation Strategy:**
- **Login Flow:** User visits `acme.colaflow.com/login` → Tenant resolved from subdomain → JWT contains `tenant_id` and `tenant_slug`
- **API Requests:** JWT extracted from Authorization header → `tenant_id` injected into TenantContext → EF Core Global Query Filter applies automatic filtering
- **MCP Tokens:** Opaque tokens stored with `tenant_id` → Middleware validates token → Tenant context injected (same as JWT)
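
The two resolution paths above (subdomain at login, claims on API requests) can be sketched with base-class-library types only. This is a minimal illustration, not ColaFlow's middleware: `TenantInfo`, `TenantResolver`, and the `baseDomain` parameter are hypothetical names, the principal is assumed to come from an already validated JWT, and `tenant_id` is assumed to be a GUID, matching the `UUID` columns used elsewhere in this document.

```csharp
using System;
using System.Security.Claims;

// Hypothetical types for illustration; not taken from the ColaFlow codebase.
public sealed record TenantInfo(Guid TenantId, string Slug);

public static class TenantResolver
{
    // Login flow: "acme.colaflow.com" -> "acme". Returns null when the host
    // is not a single-label subdomain of baseDomain (ports are not handled).
    public static string? SlugFromHost(string host, string baseDomain)
    {
        var suffix = "." + baseDomain;
        if (!host.EndsWith(suffix, StringComparison.OrdinalIgnoreCase)) return null;
        var slug = host[..^suffix.Length];
        if (slug.Length == 0 || slug.Contains('.')) return null;
        return slug.ToLowerInvariant();
    }

    // API requests: read the tenant claims from an already validated principal.
    // No database access is involved; missing or malformed claims yield null.
    public static TenantInfo? FromClaims(ClaimsPrincipal principal)
    {
        var rawId = principal.FindFirst("tenant_id")?.Value;
        var slug = principal.FindFirst("tenant_slug")?.Value;
        if (!Guid.TryParse(rawId, out var tenantId) || string.IsNullOrWhiteSpace(slug))
            return null;
        return new TenantInfo(tenantId, slug);
    }
}
```

In a real pipeline a `null` result would typically end the request with 401 or 403 before any tenant-scoped query runs.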

### Consequences

**Positive:**
- Fast authentication and authorization
- No session storage infrastructure required
- Uniform tenant resolution across all client types
- Easy to test and debug (tenant visible in JWT payload)
- Supports multi-tenant mobile apps

**Negative:**
- Tenant changes require re-login (or wait for token refresh)
- JWT size increases slightly (+50 bytes for tenant claims)
- Middleware must validate JWT on every request (minor CPU cost)

**Neutral:**
- Subdomain is only used for initial tenant selection (login page)
- Tenant switching requires logout and login to different subdomain

**Mitigation Strategies:**
- Keep JWT expiration short (60 minutes) to allow tenant updates on refresh
- Implement automatic token refresh to minimize user disruption
- Cache JWT validation results per request to avoid redundant checks

### Validation

**Acceptance Criteria:**
- JWT contains `tenant_id`, `tenant_slug`, and `tenant_plan` claims
- Middleware extracts tenant from JWT and injects into TenantContext
- All database queries automatically filter by tenant via Global Query Filter
- Cross-tenant access attempts return 403 Forbidden
- Performance: <5ms overhead for JWT validation per request

**Testing:**
- Unit tests: TenantContext injection
- Integration tests: Cross-tenant isolation
- Performance tests: 10,000 req/s with JWT validation
- Security tests: Attempt to access other tenant's data (should fail)

### References
- Architecture Doc: `docs/architecture/multi-tenancy-architecture.md`
- JWT Implementation: `docs/architecture/jwt-authentication-architecture.md`
- MCP Token Format: `docs/architecture/mcp-authentication-architecture.md`

---

## ADR-002: Data Isolation Strategy

### Status
**Accepted** - 2025-11-03

### Context

In a multi-tenant system, data isolation is critical to ensure that one tenant cannot access another tenant's data. We need to choose an isolation strategy that balances security, performance, cost, and maintainability.

**Requirements:**
- Strong data isolation (no cross-tenant leaks)
- Good query performance (<50ms for typical queries)
- Cost-effective (avoid database proliferation)
- Easy to maintain and backup
- Scalable to 10,000+ tenants
- Support for per-tenant data export (GDPR compliance)

### Decision Drivers

1. **Security:** Absolute data isolation between tenants
2. **Cost:** Minimize infrastructure costs (PostgreSQL instances, storage)
3. **Performance:** Fast queries with proper indexing
4. **Scalability:** Support thousands of tenants on shared infrastructure
5. **Maintainability:** Easy schema migrations, backups, monitoring

### Options Considered

#### Option 1: Shared Database + tenant_id Column + Global Query Filter

**Approach:**
- All tenants share one PostgreSQL database
- Every table has a `tenant_id` column (NOT NULL)
- EF Core Global Query Filter automatically adds `.Where(e => e.TenantId == currentTenantId)` to all queries
- Composite indexes: `(tenant_id, other_columns)`

**Schema Example:**
```sql
CREATE TABLE projects (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    name VARCHAR(200) NOT NULL,
    key VARCHAR(20) NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    -- Project keys are unique within a tenant, not globally
    CONSTRAINT uq_projects_tenant_key UNIQUE (tenant_id, key)
);

-- Note: the UNIQUE constraint above already creates an index on (tenant_id, key).
-- The second index below duplicates it, and the first is usually unnecessary.
CREATE INDEX idx_projects_tenant_id ON projects(tenant_id);
CREATE INDEX idx_projects_tenant_key ON projects(tenant_id, key);
```

**EF Core Configuration:**
```csharp
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    // Appended to every Project query unless IgnoreQueryFilters() is called
    modelBuilder.Entity<Project>().HasQueryFilter(
        p => p.TenantId == _tenantContext.CurrentTenantId
    );
}
```

**Pros:**
- Cost-effective: One database for all tenants
- Easy to maintain: Single schema, one backup process
- Good performance with proper indexing (composite indexes)
- Easy to add new tenants (just insert into `tenants` table)
- Per-tenant data export is a SQL query: `SELECT * FROM projects WHERE tenant_id = 'xxx'`
- Scales to 10,000+ tenants on one database
- Automatic filtering via Global Query Filter (developer-friendly)

**Cons:**
- Risk of data leak if Global Query Filter is bypassed (`.IgnoreQueryFilters()`)
- All tenants affected by database downtime
- Cannot isolate noisy neighbors (one tenant's heavy queries affect others)
- Database size grows with all tenants (monitoring required)

**Cost Estimate:** 1 database instance (~$100-200/month for medium workload)

#### Option 2: Database-per-Tenant

**Approach:**
- Each tenant gets a dedicated PostgreSQL database
- Connection string stored in `tenants` table
- Middleware switches database context per request

**Schema Example:**
```sql
-- Shared management database
CREATE TABLE tenants (
    id UUID PRIMARY KEY,
    slug VARCHAR(50) UNIQUE NOT NULL,
    connection_string TEXT NOT NULL -- Encrypted
);

-- Tenant-specific database (one per tenant)
CREATE DATABASE tenant_acme;
CREATE DATABASE tenant_beta;
```

**Pros:**
- Strong isolation: One tenant's database cannot access another
- Tenant-specific customization (different schema versions)
- Easy to back up per tenant
- Noisy neighbors don't affect each other
- Easy to migrate tenant to different database server

**Cons:**
- Expensive: N databases for N tenants (~$10-20/month per tenant minimum)
- Complex maintenance: Schema migrations across 1000s of databases
- Connection pool exhaustion (need one pool per tenant)
- Difficult to implement cross-tenant features (analytics, admin tools)
- Onboarding delay (new database provisioning takes time)

**Cost Estimate:** 1000 tenants × $15/month = $15,000/month (vs $200 for shared)

#### Option 3: Schema-per-Tenant (PostgreSQL Schemas)

**Approach:**
- One database with multiple PostgreSQL schemas
- Each tenant gets a schema: `tenant_acme.projects`, `tenant_beta.projects`
- Middleware switches search_path per request: `SET search_path = tenant_acme;`

**Pros:**
- Better isolation than shared database
- Lower cost than database-per-tenant
- All tenants in one PostgreSQL instance (easier backups)
- Comfortably supports ~1000 schemas per database

**Cons:**
- Practical ceiling of ~1000 schemas per database (PostgreSQL has no hard limit, but catalog size and migration time grow with every schema)
- Schema creation overhead for new tenants
- Complex schema migrations (run migration on each schema)
- Search_path switching per request (performance overhead)
- Difficult to enforce (easy to forget to set search_path)

**Cost Estimate:** Same as shared database, but limited scalability

#### Option 4: Separate Infrastructure per Tenant (Fully Isolated)

**Approach:**
- Each tenant gets dedicated Kubernetes namespace, database, Redis, etc.
- Complete infrastructure isolation

**Pros:**
- Maximum isolation and security
- Per-tenant scaling and customization
- Enterprise customers often require this

**Cons:**
- Extremely expensive (hundreds of dollars per tenant)
- Complex to manage (orchestration required)
- Overkill for most tenants
- Long onboarding time

**Cost Estimate:** 1000 tenants × $500/month = $500,000/month (prohibitive)

### Decision

**Chosen Option: Option 1 - Shared Database + tenant_id Column + Global Query Filter**

**Rationale:**
1. **Cost-Effective:** $200/month vs $15,000/month for database-per-tenant
2. **Scalable:** PostgreSQL handles 10,000+ tenants with proper indexing
3. **Maintainable:** One schema, one backup process, one monitoring dashboard
4. **Developer-Friendly:** EF Core Global Query Filter ensures automatic filtering
5. **Performance:** Composite indexes provide excellent query performance
6. **Proven Pattern:** Used by GitHub, Slack, Heroku, and many successful SaaS products

**Implementation Strategy:**
- Add `tenant_id` column to all business tables
- Create composite indexes: `(tenant_id, primary_key)`, `(tenant_id, foreign_key)`
- Configure EF Core Global Query Filter in `OnModelCreating`
- Create TenantContext service to inject current tenant
- Add database-level constraints: `tenant_id` declared `NOT NULL` with a foreign key to `tenants(id)`
- Update unique constraints to be tenant-scoped: `UNIQUE (tenant_id, email)`

**Migration Path:**
- Create `tenants` table
- Create default tenant for existing data
- Add `tenant_id` columns (nullable initially)
- Migrate existing data to default tenant
- Set `tenant_id` as NOT NULL
- Add indexes and constraints

### Consequences

**Positive:**
- Low infrastructure cost (1 database vs thousands)
- Easy to maintain and monitor
- Fast schema migrations (one database)
- Automatic tenant filtering (developer safety)
- Good query performance with indexes
- Per-tenant data export is straightforward SQL

**Negative:**
- Risk of data leak if developer bypasses Global Query Filter
- All tenants share database resources (monitoring required)
- Cannot isolate noisy neighbors at database level
- Database backup contains all tenants (larger backup size)

**Neutral:**
- Tenant onboarding is instant (no new database needed)
- Cross-tenant analytics require explicit filtering
- Database size monitoring required as tenant count grows

**Mitigation Strategies:**
- **Data Leak Prevention:**
  - Code review requirement for any `.IgnoreQueryFilters()` usage
  - Integration tests verify cross-tenant isolation
  - Automated security testing (attempt cross-tenant access)
- **Performance Monitoring:**
  - Alert on slow queries (>100ms)
  - Index usage monitoring (pg_stat_user_indexes)
  - Per-tenant query cost tracking
- **Noisy Neighbor Protection:**
  - Query timeout limits (5 seconds max)
  - Rate limiting per tenant
  - Connection pool limits
  - Option to migrate large tenant to dedicated database later
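
One way to back the code-review requirement for `.IgnoreQueryFilters()` is a small check that a CI job could run over the source tree. The sketch below is an assumption about how such a check might look, not an existing ColaFlow tool: `QueryFilterBypassScanner` and the `tenant-filter-bypass-approved` marker comment are hypothetical, and a plain text scan like this will miss calls split across lines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class QueryFilterBypassScanner
{
    // Returns "path:line" for every C# source line that calls IgnoreQueryFilters()
    // without the (hypothetical) approval marker on the same line.
    public static List<string> FindUnapproved(
        string rootDirectory,
        string approvalMarker = "tenant-filter-bypass-approved")
    {
        var findings = new List<string>();
        foreach (var path in Directory.EnumerateFiles(rootDirectory, "*.cs", SearchOption.AllDirectories))
        {
            var lineNumber = 0;
            foreach (var line in File.ReadLines(path))
            {
                lineNumber++;
                if (line.Contains("IgnoreQueryFilters(") && !line.Contains(approvalMarker))
                    findings.Add($"{path}:{lineNumber}");
            }
        }
        return findings;
    }
}
```

A build step could fail when the returned list is non-empty, which turns the review rule into something enforced rather than remembered.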

**Upgrade Path:**
If a tenant grows too large or requires dedicated resources, we can migrate them to a separate database while keeping the shared model for other tenants.

### Validation

**Acceptance Criteria:**
- All queries automatically filter by tenant
- Cross-tenant access attempts fail with 403 Forbidden
- Query performance <50ms for typical workloads (with 10,000 records per tenant)
- Integration tests verify tenant isolation
- Data export per tenant completes in <1 minute

**Testing:**
- Unit tests: Global Query Filter applied to all entities
- Integration tests: Create data in Tenant A, verify Tenant B cannot access
- Performance tests: Query time with 1 million total records (100 tenants × 10,000 records)
- Load tests: 10,000 concurrent requests across 100 tenants

### References
- Architecture Doc: `docs/architecture/multi-tenancy-architecture.md`
- Migration Strategy: `docs/architecture/migration-strategy.md`
- Performance Benchmarks: `docs/architecture/performance-benchmarks.md` (TBD)

---

## ADR-003: SSO Library Selection

### Status
**Accepted** - 2025-11-03

### Context

Enterprise customers require Single Sign-On (SSO) to integrate ColaFlow with their corporate identity providers (Azure AD, Google Workspace, Okta, etc.). We need to choose an SSO library/approach that balances functionality, cost, implementation speed, and maintainability.

**Requirements:**
- Support major identity providers: Azure AD, Google, Okta
- Support OIDC (OpenID Connect) protocol
- Support SAML 2.0 for generic enterprise IdPs
- User auto-provisioning (create user on first SSO login)
- Email domain restrictions (only allow @acme.com)
- Configurable per tenant (each tenant has own SSO config)
- Production-ready security standards

### Decision Drivers

1. **Time-to-Market:** Implement SSO in <1 week (M1 timeline constraint)
2. **Cost:** Minimize licensing fees
3. **Coverage:** Support 90% of enterprise SSO requirements
4. **Flexibility:** Can upgrade later if complex requirements emerge
5. **Security:** Follow OWASP and OIDC/SAML best practices

### Options Considered

#### Option 1: ASP.NET Core Native OIDC/SAML (M1-M2)

**Approach:**
- Use Microsoft's `Microsoft.AspNetCore.Authentication.OpenIdConnect` package for OIDC
- Use the third-party `Sustainsys.Saml2` library for SAML 2.0
- Custom implementation for multi-tenant SSO configuration
- Store SSO config in `tenants` table (JSONB column)

**Pros:**
- Free: No licensing costs
- Fast: Can implement OIDC in 2-3 days, SAML in 3-4 days
- First-party OIDC handler for .NET 9: Mature, well-documented
- Flexible: Full control over implementation
- Covers 80-90% of enterprise SSO needs

**Cons:**
- Manual implementation: Need to handle user provisioning, domain restrictions
- Limited advanced features: No federation, no protocol switching
- SAML is more complex to implement
- Need to maintain our own SSO configuration UI

**Implementation Complexity:** Medium
**Cost:** $0/month
**Coverage:** OIDC (Azure, Google, Okta) + SAML 2.0 (80-90% of enterprise SSO needs)

**Code Example:**
```csharp
// Sketch only: `tenant` stands for the tenant whose SSO settings are being registered
services.AddAuthentication()
    .AddOpenIdConnect("AzureAD", options =>
    {
        options.Authority = tenant.SsoConfig.AuthorityUrl;
        options.ClientId = tenant.SsoConfig.ClientId;
        options.ClientSecret = tenant.SsoConfig.ClientSecret;
        options.ResponseType = "code"; // Authorization code flow
        options.SaveTokens = true;
        options.Events = new OpenIdConnectEvents
        {
            // Create the local user on first SSO login
            OnTokenValidated = async context =>
            {
                await AutoProvisionUserAsync(context);
            }
        };
    });
```
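
The email domain restriction listed in the requirements would most naturally run inside auto-provisioning, before a user record is created. A minimal sketch of that check follows, assuming the tenant's allowed domains are available as a collection; `SsoDomainPolicy` and `IsEmailAllowed` are hypothetical names, and treating an empty allow-list as "no restriction" is a choice made for this example only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SsoDomainPolicy
{
    // True when the email's domain exactly matches one of the allowed domains
    // (case-insensitive). Subdomains and look-alike domains do not match.
    public static bool IsEmailAllowed(string? email, IReadOnlyCollection<string> allowedDomains)
    {
        if (string.IsNullOrWhiteSpace(email)) return false;

        var trimmed = email.Trim();
        var at = trimmed.LastIndexOf('@');
        if (at <= 0 || at == trimmed.Length - 1) return false; // no local part or no domain

        // Assumption for this sketch: an empty allow-list means no restriction is configured.
        if (allowedDomains.Count == 0) return true;

        var domain = trimmed[(at + 1)..];
        return allowedDomains.Any(d => string.Equals(d, domain, StringComparison.OrdinalIgnoreCase));
    }
}
```

Exact matching is deliberate: a suffix check such as `EndsWith("acme.com")` would also accept `evil-acme.com`.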

#### Option 2: Auth0

**Approach:**
- Use Auth0 as SSO broker
- Auth0 handles all identity providers
- Configure Auth0 via their dashboard
- Pay per monthly active user (MAU)

**Pros:**
- Fast setup: Implement in 1-2 days
- Comprehensive: Supports all identity providers out-of-the-box
- User management: Built-in user directory
- Advanced features: MFA, passwordless, anomaly detection
- Dashboard for SSO configuration

**Cons:**
- Expensive: $240/month (Professional) + $0.05/MAU (500 users = $25/month extra)
- Vendor lock-in: Difficult to migrate away
- Less control: Auth0 controls auth flow
- Overkill for MVP: Many features we don't need yet

**Implementation Complexity:** Low
**Cost:** $3,000-5,000/year (for 100 tenants with 5,000 total users)
**Coverage:** 100% (all protocols, all providers)

#### Option 3: Okta (Workforce Identity Cloud)

**Approach:**
- Use Okta as SSO broker
- Similar to Auth0 but more enterprise-focused
- Per-user pricing

**Pros:**
- Enterprise-grade: Trusted by Fortune 500
- Complete features: SSO, MFA, provisioning, directory
- Excellent support and documentation

**Cons:**
- Very expensive: $2/user/month minimum (100 users = $200/month)
- Enterprise sales process (slow, complex)
- Overkill for startup/SMB customers
- Vendor lock-in

**Implementation Complexity:** Low
**Cost:** $5,000-10,000/year (for 100 tenants)
**Coverage:** 100%

#### Option 4: IdentityServer4 / Duende IdentityServer

**Approach:**
- Use IdentityServer as self-hosted identity provider
- Implement Federation support (connect to external IdPs)
- Open-source (IdentityServer4, now end-of-life) or licensed (Duende)

**Pros:**
- Self-hosted: Full control
- Comprehensive: OIDC, OAuth 2.0, SAML via plugins
- Flexible: Can customize extensively
- No per-user fees

**Cons:**
- Complex: Steep learning curve (2-3 weeks to implement)
- Maintenance burden: Need to maintain IdentityServer instance
- Duende licensing: $1,500/year for production use
- Overkill for MVP: We don't need an identity provider, just SSO

**Implementation Complexity:** High
**Cost:** $1,500/year (Duende license)
**Coverage:** 100%

### Decision

**Chosen Option: Option 1 - ASP.NET Core Native OIDC/SAML (M1-M2)**

**Rationale:**
1. **Cost:** $0/month vs $3,000-5,000/year for Auth0 or $5,000-10,000/year for Okta
2. **Speed:** OIDC can be implemented in <1 week (M1 timeline); SAML and the configuration UI follow in M2
3. **Control:** Full flexibility to customize
4. **Coverage:** Supports 80-90% of enterprise SSO requirements (OIDC + SAML)
5. **Upgrade Path:** Can migrate to Auth0/Okta later if complex requirements emerge

**Decision:** Start with native ASP.NET Core for M1-M2. Re-evaluate at M3 if we need:
- Complex federation (multiple IdPs per tenant)
- Advanced MFA flows
- More than 5 different SSO protocols
- Dedicated identity management features

**Implementation Strategy:**
- **M1 (Week 1):** OIDC implementation (Azure AD, Google, Okta)
- **M2 (Week 2):** SAML 2.0 implementation (generic enterprise IdPs)
- **M2 (Week 3):** User auto-provisioning and domain restrictions
- **M2 (Week 4):** SSO configuration UI for tenants

### Consequences

**Positive:**
- Zero licensing costs for M1-M2
- Complete control over implementation
- Can customize for our specific needs
- Fast OIDC implementation (<1 week)
- Covers 80-90% of enterprise SSO requirements
- Learning opportunity for team

**Negative:**
- Manual implementation required (more code to maintain)
- Limited to OIDC + SAML 2.0 (no exotic protocols)
- Need to build SSO configuration UI ourselves
- More testing required (vs using Auth0)

**Neutral:**
- Can migrate to Auth0/Okta later if needed
- SSO config stored in database (our control)
- Integration tests required for each IdP

**Mitigation Strategies:**
- **Quality:** Comprehensive testing with real IdPs (Azure AD, Google)
- **Documentation:** Detailed guides for each supported provider
- **Security:** Follow OIDC/SAML security best practices
- **Upgrade Path:** Design SSO config to be provider-agnostic (easy migration)

### Validation

**Acceptance Criteria:**
- OIDC login works with Azure AD, Google, Okta
- SAML 2.0 login works with generic IdP
- Users auto-provisioned on first login
- Email domain restrictions enforced
- SSO configuration UI functional for admins
- Error handling for common SSO failures

**Testing:**
- Unit tests: OIDC token validation, SAML assertion parsing
- Integration tests: Full SSO flow with real IdPs (test tenants)
- Security tests: CSRF protection, replay attack prevention
- Usability tests: Admin can configure SSO without support

### References
- Architecture Doc: `docs/architecture/sso-integration-architecture.md`
- Implementation Guide: `docs/implementation/sso-implementation.md` (TBD)
- Security Checklist: `docs/security/sso-security-checklist.md` (TBD)

---

## ADR-004: MCP Token Format

### Status
**Accepted** - 2025-11-03

### Context

ColaFlow will expose an MCP (Model Context Protocol) server that allows AI agents (Claude, ChatGPT) to access project data, create tasks, and generate reports. We need a secure, revocable authentication mechanism for AI agents.

**Requirements:**
- Secure: Cannot be forged or tampered with
- Revocable: Admin can revoke token instantly
- Fine-Grained Permissions: Control read/write access per resource
- Audit Trail: Log all API operations performed with token
- Tenant-Scoped: Token only works for one tenant
- Long-Lived: Valid for days/weeks (not short-lived like JWT)

### Decision Drivers

1. **Security:** Token cannot be guessed or brute-forced
2. **Revocability:** Instant revocation (no JWT blacklist complexity)
3. **Permissions:** Resource-level + operation-level granularity
4. **Auditability:** Complete log of all token operations
5. **Usability:** Easy to copy/paste, recognizable format

### Options Considered

#### Option 1: Opaque Tokens (`mcp_<tenant_slug>_<random_32>`)

**Format:** `mcp_acme_7f3d8a9c4e1b2f5a6d8c9e0f1a2b3c4d`

**Approach:**
- Token is a random string (cryptographically secure)
- Prefix: `mcp_` (identifies as MCP token)
- Tenant slug: `acme` (for easy identification)
- Random part: 32 hex characters (128 bits of entropy)
- Store token hash (SHA256) in database
- Store permissions in database alongside token

**Token Storage:**
```sql
CREATE TABLE mcp_tokens (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL,
    user_id UUID NULL,
    name VARCHAR(100) NOT NULL,
    token_hash VARCHAR(255) NOT NULL UNIQUE, -- SHA256 of token
    permissions JSONB NOT NULL, -- {"projects": ["read", "search"], ...}
    status INT NOT NULL, -- Active/Revoked/Expired
    created_at TIMESTAMP NOT NULL,
    expires_at TIMESTAMP NULL,
    last_used_at TIMESTAMP NULL
);
```
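
The `permissions` column maps each resource to the operations a token may perform on it. A deny-by-default check over that shape could look like the sketch below, which assumes the JSONB value arrives as a JSON string; `McpPermissions`, `Parse`, and `IsAllowed` are hypothetical names, and only the `projects`/`read`/`search` values come from the schema comment above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

public static class McpPermissions
{
    // Parses a permissions document such as {"projects": ["read", "search"]}.
    public static Dictionary<string, string[]> Parse(string json) =>
        JsonSerializer.Deserialize<Dictionary<string, string[]>>(json)
        ?? new Dictionary<string, string[]>();

    // Deny by default: unknown resources and unlisted operations are rejected.
    public static bool IsAllowed(
        IReadOnlyDictionary<string, string[]> permissions,
        string resource,
        string operation) =>
        permissions.TryGetValue(resource, out var operations)
        && operations is not null
        && operations.Contains(operation, StringComparer.Ordinal);
}
```

Because the permissions live server-side, changing what a token may do is an update to this column and needs no new token.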

**Validation Flow:**
1. Receive token: `mcp_acme_xxx...`
2. Hash token with SHA256
3. Lookup in database by token_hash
4. Check status (Active/Revoked/Expired)
5. Check expiration date
6. Load permissions from JSONB column
7. Inject tenant context and permissions into request
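
Steps 1 to 5 of the flow above can be sketched as follows, with an in-memory dictionary standing in for the `mcp_tokens` lookup. `McpTokenRecord` and `McpTokenValidator` are hypothetical names for this illustration, and storing the hash as lowercase hex is an assumption; the document only states that a SHA256 hash is stored.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public enum McpTokenStatus { Active, Revoked, Expired }

// Hypothetical in-memory stand-in for one row of the mcp_tokens table.
public sealed record McpTokenRecord(Guid TenantId, McpTokenStatus Status, DateTime? ExpiresAtUtc);

public sealed class McpTokenValidator
{
    // Keyed by token hash; a real implementation would query the database instead.
    private readonly IReadOnlyDictionary<string, McpTokenRecord> _byHash;

    public McpTokenValidator(IReadOnlyDictionary<string, McpTokenRecord> byHash) => _byHash = byHash;

    // Step 2: SHA-256 of the UTF-8 token, encoded as lowercase hex (64 characters).
    public static string HashToken(string token) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(token))).ToLowerInvariant();

    // Steps 1-5: returns the stored record when the token is usable, otherwise null.
    public McpTokenRecord? Validate(string token, DateTime utcNow)
    {
        if (!token.StartsWith("mcp_", StringComparison.Ordinal)) return null;      // 1. shape check
        if (!_byHash.TryGetValue(HashToken(token), out var record)) return null;   // 2-3. hash + lookup
        if (record.Status != McpTokenStatus.Active) return null;                   // 4. status
        if (record.ExpiresAtUtc is { } expiry && expiry <= utcNow) return null;    // 5. expiration
        return record;
    }
}
```

The tenant is taken from the stored record rather than from the slug embedded in the token, so editing the slug in a token only produces a hash that matches nothing.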
|
||
|
||
**Pros:**
|
||
- **Revocable:** Update `status = Revoked` in database, takes effect immediately
|
||
- **Secure:** SHA256 hashed, never stored plain-text
|
||
- **Flexible Permissions:** Can update permissions without regenerating token
|
||
- **Auditable:** Every token use logged in database
|
||
- **Tenant-Scoped:** Token hash includes tenant context
|
||
- **Long-Lived:** Can be valid for months/years
|
||
- **Easy to Identify:** Prefix + tenant slug clearly identify token type
|
||
|
||
**Cons:**
|
||
- Database lookup required on every request (performance overhead)
|
||
- Larger tokens (50+ characters) vs API keys (32 characters)
|
||
- Need to manage token lifecycle (expiration, revocation)
|
||
|
||
**Performance:** ~5ms per token validation (including database lookup)
|
||
|
||
#### Option 2: JWT Tokens for MCP
|
||
|
||
**Format:** Long JWT string (200+ characters)
|
||
|
||
**Approach:**
|
||
- Generate JWT with `tenant_id`, `user_id`, `permissions` claims
|
||
- Sign with secret key
|
||
- No database lookup required (stateless)
|
||
- Validate signature on every request
|
||
|
||
**Pros:**
|
||
- Stateless: No database lookup required
|
||
- Fast validation: O(1) signature check
|
||
- Self-contained: All info in token
|
||
|
||
**Cons:**
|
||
- **Cannot Revoke:** Once issued, JWT is valid until expiration (unless using blacklist)
|
||
- **Blacklist Required:** Need Redis/database to store revoked JWTs (adds complexity)
|
||
- **Permissions Fixed:** Cannot update permissions without regenerating token
|
||
- **Larger Tokens:** 200-500 characters (difficult to copy/paste)
|
||
- **Expiration Required:** Must set short expiration for revocation to work
|
||
|
||
**Revocation Problem:**
|
||
```
|
||
User generates JWT token → Shares with AI agent → Admin wants to revoke
|
||
→ JWT is still valid for 30 days → Need to blacklist JWT ID
|
||
→ Now need Redis to store blacklist → Not truly stateless anymore
|
||
```
|
||
|
||
#### Option 3: API Keys (UUID Format)
|
||
|
||
**Format:** `550e8400-e29b-41d4-a716-446655440000`
|
||
|
||
**Approach:**
|
||
- Generate random UUID
|
||
- Store in database with permissions
|
||
- Simple validation: lookup by UUID
|
||
|
||
**Pros:**
|
||
- Simple implementation
|
||
- Standard format (UUID)
|
||
- Database lookup
|
||
|
||
**Cons:**
|
||
- No tenant context in token (need to lookup tenant)
|
||
- No token type identifier (could be confused with user IDs)
|
||
- No visual indication of purpose
|
||
- Less secure (UUIDs have less entropy than 256-bit random strings)
|
||
|
||
#### Option 4: GitHub-Style Personal Access Tokens

**Format:** `ghp_ABcdEF123456789012345678901234567890`

**Approach:**
- Prefix identifies token type
- Random alphanumeric string
- Store hash in database

**Pros:**
- Industry standard (used by GitHub, GitLab)
- Easy to identify by prefix
- Secure

**Cons:**
- No tenant context in token itself
- Entropy is not a differentiator: 36 alphanumeric characters carry at least as much entropy as the 32 hex characters (128 bits) in Option 1

### Decision

**Chosen Option: Option 1 - Opaque Tokens (`mcp_<tenant_slug>_<random_32>`)**

**Rationale:**
1. **Revocability:** Instant revocation without blacklist complexity
2. **Flexibility:** Permissions stored server-side, can update without new token
3. **Security:** 128 bits of entropy + SHA256 hashing
4. **Usability:** Tenant slug in token helps users identify which tenant it's for
5. **Auditability:** Complete audit trail in database

**Token Format:**
```
mcp_<tenant_slug>_<random_32_hex_chars>
```

**Example:**
```
mcp_acme_7f3d8a9c4e1b2f5a6d8c9e0f1a2b3c4d
mcp_techcorp_a1b2c3d4e5f60718293a4b5c6d7e8f90
```

**Components:**
- `mcp_`: Identifies as MCP token (easy to filter in logs)
- `acme`: Tenant slug (helps user identify which tenant)
- `7f3d8a9c...`: 32 hex characters (128 bits entropy = 2^128 combinations)
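
The three components above can be checked mechanically before any database lookup. The following is a minimal sketch in TypeScript, not the project's actual validator: `parseMcpToken` and `MCP_TOKEN_PATTERN` are hypothetical names, and the slug alphabet (lowercase letters, digits, hyphens) is an assumption made so that the underscore separators stay unambiguous.

```typescript
// Hypothetical pattern: "mcp_" prefix, tenant slug, then exactly 32 lowercase hex characters.
// Assumption: slugs contain only lowercase letters, digits, and hyphens.
const MCP_TOKEN_PATTERN = /^mcp_([a-z0-9-]+)_([0-9a-f]{32})$/;

interface ParsedMcpToken {
  tenantSlug: string;
  randomPart: string;
}

// Returns the parsed components, or null when the string is not a well-formed MCP token.
function parseMcpToken(token: string): ParsedMcpToken | null {
  const match = MCP_TOKEN_PATTERN.exec(token);
  if (match === null) {
    return null;
  }
  return { tenantSlug: match[1], randomPart: match[2] };
}
```

A format check like this only rejects malformed input cheaply; whether a token is actually valid still depends on the server-side lookup.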

**Generation:**
```csharp
using System.Security.Cryptography;

public string GenerateToken(string tenantSlug)
{
    // 16 random bytes = 128 bits of entropy from a CSPRNG
    var randomBytes = new byte[16];
    using var rng = RandomNumberGenerator.Create();
    rng.GetBytes(randomBytes);

    // 16 bytes encode to 32 lowercase hex characters
    var randomHex = Convert.ToHexString(randomBytes).ToLowerInvariant();
    return $"mcp_{tenantSlug}_{randomHex}";
}
```

**Storage:**
```csharp
// Returns the plain-text token, so the return type is Task<string>
public async Task<string> CreateTokenAsync(CreateMcpTokenCommand command)
{
    // `tenant` is assumed to be resolved from the current tenant context (lookup not shown)
    var token = _tokenGenerator.GenerateToken(tenant.Slug);
    var tokenHash = _tokenGenerator.HashToken(token); // SHA256

    var mcpToken = new McpToken
    {
        TokenHash = tokenHash, // Never store plain-text
        Permissions = command.Permissions,
        ExpiresAt = command.ExpiresAt
    };

    await _repository.AddAsync(mcpToken);
    return token; // Return plain-text ONLY ONCE
}
```
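
The validation side of this flow can be illustrated the same way: hash the presented token, look up the hash, and reject revoked or expired records. The sketch below is written in TypeScript for brevity (the backend itself is C#) and uses an in-memory `Map` as a stand-in for the token table. The record shape, function names, and the `projects:read` permission string in the usage note are assumptions for illustration only.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical record shape; field names are assumptions, not the real schema.
interface McpTokenRecord {
  tokenHash: string;
  permissions: string[];
  expiresAt: Date;
  revoked: boolean;
}

// SHA-256 of the plain-text token, hex-encoded. Only this value is ever stored.
function hashToken(token: string): string {
  return createHash('sha256').update(token, 'utf8').digest('hex');
}

// In-memory stand-in for the database table, keyed by token hash.
const tokenTable = new Map<string, McpTokenRecord>();

function storeToken(token: string, permissions: string[], expiresAt: Date): void {
  const tokenHash = hashToken(token);
  tokenTable.set(tokenHash, { tokenHash, permissions, expiresAt, revoked: false });
}

// Revocation is a plain status update on the stored record.
function revokeToken(token: string): void {
  const record = tokenTable.get(hashToken(token));
  if (record) {
    record.revoked = true;
  }
}

// Returns the record when the token is known, not revoked, and not expired; otherwise null.
function validateToken(token: string, now: Date = new Date()): McpTokenRecord | null {
  const record = tokenTable.get(hashToken(token));
  if (!record || record.revoked || record.expiresAt.getTime() <= now.getTime()) {
    return null;
  }
  return record;
}
```

For example, after `storeToken(token, ['projects:read'], expiry)` a call to `validateToken(token)` returns the record, and after `revokeToken(token)` the same call returns `null` on the very next request.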

### Consequences

**Positive:**
- Instant revocation (update database status)
- Fine-grained permissions (stored server-side)
- Complete audit trail
- Tenant-scoped (slug in token)
- Secure (128-bit entropy + SHA256)
- User-friendly (tenant slug helps identification)

**Negative:**
- Database lookup required per request (~5ms overhead)
- Longer tokens (37 characters plus the tenant slug, versus 36 for a hyphenated UUID API key)
- Need to manage token lifecycle (expiration, cleanup)

**Neutral:**
- Performance overhead acceptable for MCP use case (not high-frequency)
- Token length acceptable for copy/paste workflow

**Mitigation Strategies:**
- **Performance:** Cache token validation results (5-minute TTL), evicting the cached entry whenever a token is revoked so that revocation still takes effect immediately
- **Token Length:** Provide copy button and download option in UI
- **Lifecycle Management:** Automated cleanup job for expired tokens
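
The caching mitigation interacts with the immediate-revocation requirement, so a sketch may help. The class below is a hypothetical in-process cache (a real deployment might use a shared cache instead): `TokenValidationCache`, the injected `lookup` function, and the synchronous signature are all assumptions chosen to keep the example short.

```typescript
interface CachedValidation {
  valid: boolean;
  cachedAt: number; // epoch milliseconds
}

// Hypothetical TTL cache in front of the token lookup.
class TokenValidationCache {
  private readonly entries = new Map<string, CachedValidation>();
  private readonly lookup: (tokenHash: string) => boolean;
  private readonly ttlMs: number;

  // `lookup` stands in for the database query; the default TTL is 5 minutes.
  constructor(lookup: (tokenHash: string) => boolean, ttlMs: number = 5 * 60 * 1000) {
    this.lookup = lookup;
    this.ttlMs = ttlMs;
  }

  // Serves from cache while the entry is fresh; otherwise queries and re-caches.
  isValid(tokenHash: string, now: number = Date.now()): boolean {
    const cached = this.entries.get(tokenHash);
    if (cached && now - cached.cachedAt < this.ttlMs) {
      return cached.valid;
    }
    const valid = this.lookup(tokenHash);
    this.entries.set(tokenHash, { valid, cachedAt: now });
    return valid;
  }

  // Call on revocation so the next request goes back to the source of truth.
  evict(tokenHash: string): void {
    this.entries.delete(tokenHash);
  }
}
```

Without the `evict` call, a revoked token could keep validating from cache for up to one TTL, which would conflict with the "revocation takes effect immediately" criterion.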

### Validation

**Acceptance Criteria:**
- Token generation is cryptographically secure (CSPRNG)
- Token hash stored (SHA256), never plain-text
- Token validation <10ms (including database lookup)
- Revocation takes effect immediately
- Permissions enforced on every API call
- Audit log created for every token use

**Testing:**
- Unit tests: Token generation format, hashing, validation
- Integration tests: Token authentication flow, permission enforcement
- Security tests: Brute-force resistance, revocation effectiveness
- Performance tests: 1,000 req/s with token validation

### References
- Architecture Doc: `docs/architecture/mcp-authentication-architecture.md`
- Token Management UI: `docs/design/multi-tenant-ux-flows.md#mcp-token-management-flow`

---

## ADR-005: Frontend State Management

### Status
**Accepted** - 2025-11-03

### Context

ColaFlow frontend (Next.js 16 + React 19) needs a state management solution for authentication, user preferences, and server data. We need to choose libraries that are TypeScript-first, performant, and maintainable.

**Requirements:**
- Type-safe: Full TypeScript support
- Performant: Minimal re-renders
- Developer-friendly: Low boilerplate
- Server state caching: Avoid redundant API calls
- Optimistic updates: Immediate UI feedback
- Auth state persistence: Survive page refresh

### Decision Drivers

1. **TypeScript Support:** First-class TypeScript integration
2. **Performance:** Minimal bundle size, fast renders
3. **DX (Developer Experience):** Easy to learn, low boilerplate
4. **Ecosystem:** Good documentation, active community
5. **Server State:** Built-in caching and invalidation

### Options Considered

#### Option 1: Zustand (Client State) + TanStack Query v5 (Server State)

**Approach:**
- **Zustand:** Lightweight state manager for auth, UI state
- **TanStack Query:** Server state caching, mutations, automatic refetching

**Zustand Example:**
```typescript
// stores/useAuthStore.ts
import { create } from 'zustand';
import { persist } from 'zustand/middleware';

// `User` and `Tenant` are assumed to be defined elsewhere in the app
interface AuthState {
  user: User | null;
  tenant: Tenant | null;
  accessToken: string | null;
  login: (token: string, user: User, tenant: Tenant) => void;
  logout: () => void;
}

export const useAuthStore = create<AuthState>()(
  persist(
    (set) => ({
      user: null,
      tenant: null,
      accessToken: null,
      login: (token, user, tenant) => set({ accessToken: token, user, tenant }),
      logout: () => set({ accessToken: null, user: null, tenant: null }),
    }),
    {
      name: 'auth-storage',
      // Persist only non-sensitive fields; the access token stays in memory (see ADR-006)
      partialize: (state) => ({ user: state.user, tenant: state.tenant }),
    }
  )
);
```

**TanStack Query Example:**
```typescript
// hooks/useMcpTokens.ts
import { useQuery } from '@tanstack/react-query';
import { mcpService } from '@/services/mcp.service';

export function useMcpTokens() {
  return useQuery({
    queryKey: ['mcp-tokens'],
    queryFn: () => mcpService.listTokens(),
    staleTime: 1000 * 60 * 5, // 5 minutes
  });
}
```

**Pros:**
- **Minimal Bundle Size:** Zustand (3KB) + TanStack Query (15KB) = 18KB total
- **TypeScript-First:** Excellent type inference
- **Low Boilerplate:** No actions, reducers, or complex setup
- **Performance:** Zustand avoids unnecessary re-renders
- **Caching:** TanStack Query caches API responses automatically
- **DevTools:** Excellent debugging tools for both libraries
- **Separation of Concerns:** Client state in Zustand, server state in TanStack Query

**Cons:**
- Two libraries to learn (vs one all-in-one solution)
- Need to decide what goes in Zustand vs TanStack Query

**Learning Curve:** Low (Zustand is simpler than Redux, TanStack Query has great docs)

#### Option 2: Redux Toolkit + RTK Query

**Approach:**
- Redux Toolkit for all state
- RTK Query for API data fetching

**Pros:**
- All-in-one solution
- Mature ecosystem
- Excellent DevTools

**Cons:**
- **More Boilerplate:** Actions, slices, reducers
- **Larger Bundle:** Redux (10KB) + RTK Query (20KB) = 30KB
- **Steeper Learning Curve:** More concepts to learn
- **Overkill for MVP:** We don't need Redux's complexity yet

#### Option 3: React Context + SWR

**Approach:**
- React Context for auth state
- SWR for server data

**Pros:**
- Minimal dependencies (SWR only)
- Simple concept (React Context is built-in)

**Cons:**
- **Performance Issues:** React Context re-renders every consuming component whenever the context value changes
- **Boilerplate:** Need to create context providers manually
- **SWR vs TanStack Query:** SWR is less feature-rich

#### Option 4: Jotai + TanStack Query

**Approach:**
- Jotai for atomic state management
- TanStack Query for server state

**Pros:**
- Atomic state model (like Recoil)
- Good TypeScript support

**Cons:**
- Less mature than Zustand
- Smaller community
- Atomic model can be overkill for simple auth state

### Decision

**Chosen Option: Option 1 - Zustand (Client State) + TanStack Query v5 (Server State)**

**Rationale:**
1. **Bundle Size:** 18KB total (vs 30KB for Redux Toolkit)
2. **Performance:** Zustand selector-based re-renders, TanStack Query caching
3. **TypeScript:** First-class support in both libraries
4. **Learning Curve:** Simple APIs, great documentation
5. **Clear Separation:** Auth/UI in Zustand, API data in TanStack Query

**Usage Guidelines:**

**Zustand - Use For:**
- Authentication state (user, tenant, accessToken)
- UI state (sidebar open/closed, theme)
- User preferences (language, timezone)

**TanStack Query - Use For:**
- API data (projects, issues, tokens)
- Mutations (create, update, delete)
- Caching and invalidation

**Example Architecture:**
```typescript
import { useMutation, useQuery, useQueryClient } from '@tanstack/react-query';

// The calls below run inside a React component or custom hook

// Zustand (auth)
const { user, tenant, logout } = useAuthStore();

// TanStack Query (server data)
const queryClient = useQueryClient();
const { data: projects, isLoading } = useQuery({
  queryKey: ['projects'],
  queryFn: () => projectService.getAll()
});

// Mutation: refetch the project list after a successful create
const createProject = useMutation({
  mutationFn: (data) => projectService.create(data),
  onSuccess: () => {
    queryClient.invalidateQueries({ queryKey: ['projects'] });
  }
});
```

### Consequences

**Positive:**
- Lightweight and fast
- Easy to learn and use
- Great TypeScript experience
- Excellent caching and performance
- Clear separation of concerns

**Negative:**
- Two libraries to learn (instead of one)
- Need to decide where state lives (Zustand vs TanStack Query)

**Neutral:**
- Both libraries have excellent DevTools
- Both are actively maintained

**Mitigation Strategies:**
- **Documentation:** Create team guide for "What goes where"
- **Code Reviews:** Ensure consistent usage patterns
- **Linting:** Custom ESLint rules if needed

### Validation

**Acceptance Criteria:**
- Auth state persists across page refresh
- API data cached appropriately (no redundant calls)
- Optimistic updates work (immediate UI feedback)
- TypeScript errors caught at compile time
- DevTools show state clearly
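
The optimistic-update criterion follows a fixed pattern: snapshot the cached list, render the change immediately, then either confirm with the server's response or roll back. The sketch below shows that pattern without any library so it can run standalone; `Project`, `optimisticAdd`, `saveProject`, and `render` are hypothetical names. In TanStack Query the same steps would typically live in the `onMutate`, `onError`, and `onSettled` callbacks of `useMutation`.

```typescript
interface Project {
  id: string;
  name: string;
}

// Library-independent sketch of an optimistic insert with rollback.
async function optimisticAdd(
  cache: Project[],
  draft: Project,
  saveProject: (project: Project) => Promise<Project>, // assumed API call
  render: (projects: Project[]) => void,               // assumed UI update hook
): Promise<Project[]> {
  const snapshot = cache;                // previous list, kept for rollback
  const optimistic = [...cache, draft];
  render(optimistic);                    // immediate UI feedback

  try {
    const saved = await saveProject(draft);
    // Swap the draft for the server's version (which may carry a new id)
    const confirmed = optimistic.map((p) => (p.id === draft.id ? saved : p));
    render(confirmed);
    return confirmed;
  } catch {
    render(snapshot);                    // server rejected the change: roll back
    return snapshot;
  }
}
```

The key design point is that the snapshot is taken before the optimistic write, so a failed request restores exactly what the user saw beforehand.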

**Performance Targets:**
- Initial page load: <1.5s
- State updates: <16ms (60fps)
- Cache hit rate: >80%

### References
- Zustand Docs: https://docs.pmnd.rs/zustand
- TanStack Query Docs: https://tanstack.com/query
- Implementation: `docs/frontend/state-management-guide.md`

---

## ADR-006: Token Storage Strategy

### Status
**Accepted** - 2025-11-03

### Context

We need to securely store JWT access tokens and refresh tokens in the frontend. The storage mechanism must balance security, usability, and functionality.

**Requirements:**
- Secure: Protect against XSS and CSRF attacks
- Persistent: Survive page refresh
- Auto-refresh: Seamlessly refresh tokens before expiration
- Logout: Clear tokens on logout
- Cross-tab sync: Logout in one tab logs out all tabs

### Decision Drivers

1. **Security:** XSS protection (primary threat)
2. **CSRF Protection:** For refresh tokens
3. **Usability:** Seamless token refresh
4. **Persistence:** User stays logged in across sessions
5. **Performance:** Fast token access

### Options Considered

#### Option 1: Access Token in Memory + Refresh Token in httpOnly Cookie

**Approach:**
- **Access Token:** Stored in Zustand state (memory only, not persisted)
- **Refresh Token:** Stored in httpOnly cookie (server-side managed)
- **Flow:**
  1. User logs in → Receive access + refresh tokens
  2. Access token stored in Zustand (memory)
  3. Refresh token stored in httpOnly cookie by backend
  4. Access token used for API calls (Authorization header)
  5. On 401 error → Call `/api/auth/refresh` (refresh token sent automatically via cookie)
  6. Receive new access token → Update Zustand state

**Cookie Configuration (Backend):**
```csharp
Response.Cookies.Append("refreshToken", refreshToken, new CookieOptions
{
    HttpOnly = true,                // Cannot be accessed by JavaScript
    Secure = true,                  // HTTPS only
    SameSite = SameSiteMode.Strict, // CSRF protection
    MaxAge = TimeSpan.FromDays(7)
});
```

**Pros:**
- **XSS Protection (Access Token):** Not stored in localStorage, sessionStorage, or cookies, so an injected script cannot simply read it from storage
- **CSRF Protection (Refresh Token):** SameSite=Strict blocks cross-site sends; httpOnly keeps the cookie unreadable by JavaScript
- **Short-Lived Access Token:** Even if leaked, expires in 60 minutes
- **Automatic Refresh:** Cookie sent automatically on refresh endpoint
- **No Manual Cookie Management:** Backend sets/clears cookies

**Cons:**
- Access token lost on page refresh (need to call refresh immediately)
- Requires cookie support (some corporate proxies block cookies)

**Security Score:** 9/10 (Best practice)

#### Option 2: Both Tokens in localStorage

**Approach:**
- Store both access and refresh tokens in localStorage
- Read on page load

**Pros:**
- Simple implementation
- Tokens persist across page refresh
- No cookie management

**Cons:**
- **Vulnerable to XSS:** If attacker injects script, can steal both tokens
- **Readable by Any Script:** localStorage has no equivalent of httpOnly, so any script running on the origin can read the tokens
- **Not Recommended:** Violates OWASP security guidelines

**Security Score:** 3/10 (Not secure)

#### Option 3: Both Tokens in httpOnly Cookies

**Approach:**
- Store both tokens in httpOnly cookies
- Backend sends cookies on every API response

**Pros:**
- XSS protection for both tokens
- Automatic token management

**Cons:**
- **CSRF Vulnerability:** Cookies sent automatically with every request
- **Need CSRF Tokens:** Additional complexity
- **Cookie Size Limit:** JWTs can be large (4KB cookie limit)
- **Double-Submit Cookie Pattern Required:** More complexity

**Security Score:** 6/10 (CSRF risk)

#### Option 4: Session-Based Authentication (No JWT)

**Approach:**
- Traditional session cookies
- Session stored server-side (Redis)

**Pros:**
- Simple
- Secure (session ID only)

**Cons:**
- Not stateless (requires Redis/database for sessions)
- Horizontal scaling complexity
- Not suitable for mobile apps
- Against our JWT strategy

**Security Score:** 7/10 (Secure but not stateless)

### Decision

**Chosen Option: Option 1 - Access Token in Memory + Refresh Token in httpOnly Cookie**

**Rationale:**
1. **Best Security:** Access token protected from XSS, refresh token protected from CSRF
2. **Industry Standard:** Used by Auth0, Okta, and major SaaS apps
3. **Balances Security and UX:** Short-lived access token, auto-refresh
4. **Stateless:** No session storage required
5. **Mobile-Friendly:** Can adapt for mobile (store refresh token securely)

**Implementation:**

```typescript
// stores/useAuthStore.ts
// AuthState must also declare `updateToken: (token: string) => void`
export const useAuthStore = create<AuthState>((set) => ({
  user: null,
  accessToken: null, // Stored in memory ONLY
  login: (token, user) => set({ accessToken: token, user }),
  // Called by the API client after a successful refresh
  updateToken: (token) => set({ accessToken: token }),
  logout: () => set({ accessToken: null, user: null })
}));

// No persist middleware for accessToken!
```

```typescript
// lib/api-client.ts
import axios from 'axios';
import { useAuthStore } from '@/stores/useAuthStore';

// `apiClient` is the app's axios instance, assumed to be created earlier in this file
apiClient.interceptors.response.use(
  (response) => response,
  async (error) => {
    if (error.response?.status === 401 && !error.config._retry) {
      error.config._retry = true;

      try {
        // Call refresh endpoint (refresh token sent via cookie automatically)
        const { data } = await axios.post('/api/auth/refresh', null, { withCredentials: true });

        // Update access token in memory
        useAuthStore.getState().updateToken(data.accessToken);

        // Retry original request
        error.config.headers.Authorization = `Bearer ${data.accessToken}`;
        return apiClient(error.config);
      } catch (refreshError) {
        // Refresh failed (expired or missing cookie): drop local auth state
        useAuthStore.getState().logout();
        return Promise.reject(refreshError);
      }
    }

    return Promise.reject(error);
  }
);
```

**Token Refresh Strategy:**
- **Automatic:** Intercept 401 errors, call refresh endpoint
- **Preemptive (Optional):** Refresh 5 minutes before expiration
- **One-at-a-Time:** Only one refresh call in flight (queue other requests)
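
The one-at-a-time rule matters because several requests can receive a 401 at the same moment, and each would otherwise start its own refresh. One way to enforce it, sketched here under assumptions, is to share a single in-flight promise between callers. `createSingleFlightRefresh` is a hypothetical helper, and the `refresh` argument stands in for whatever function calls the refresh endpoint and returns the new access token.

```typescript
// Wraps a refresh function so that concurrent callers share one in-flight request.
function createSingleFlightRefresh(
  refresh: () => Promise<string>, // assumed: calls the refresh endpoint, resolves to a new access token
): () => Promise<string> {
  let inFlight: Promise<string> | null = null;

  return () => {
    // A refresh is already running: queue behind it instead of starting another
    if (inFlight !== null) {
      return inFlight;
    }

    const pending = refresh();
    const clear = () => {
      inFlight = null; // allow a fresh attempt once this one settles
    };
    // Clear on both success and failure; callers still observe the original outcome
    pending.then(clear, clear);
    inFlight = pending;
    return pending;
  };
}
```

In the interceptor shown earlier, the direct call to the refresh endpoint could be replaced by such a wrapper, so that every request that hit a 401 awaits the same promise and then retries with the same new token.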

### Consequences

**Positive:**
- Maximum security (XSS + CSRF protected)
- Seamless user experience (auto-refresh)
- Stateless authentication
- Mobile-friendly (adapt for secure storage)
- Industry best practice

**Negative:**
- Access token lost on page refresh (need immediate refresh call)
- Requires cookie support (fails in some corporate environments)
- More complex implementation than localStorage

**Neutral:**
- Short-lived access token means more refresh calls (acceptable trade-off)

**Mitigation Strategies:**
- **Page Load:** Call refresh endpoint on app load if no access token in memory
- **Cookie Fallback:** If cookies blocked, fall back to re-login
- **Error Handling:** Clear UX if authentication fails (session expired)
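
The page-load and cookie-fallback mitigations can be combined into one bootstrap step that runs before the app renders protected routes. This is a sketch under assumptions rather than the actual startup code: `AuthSession`, `restoreSession`, and both callbacks are hypothetical, with `refresh` standing in for the call to the refresh endpoint.

```typescript
// Hypothetical in-memory session holder (the real app keeps this in the auth store).
interface AuthSession {
  accessToken: string | null;
}

// Tries to restore the session on app load; returns whether the user is authenticated.
async function restoreSession(
  session: AuthSession,
  refresh: () => Promise<string>, // assumed: calls the refresh endpoint using the httpOnly cookie
  onUnauthenticated: () => void,  // assumed: for example, redirect to the login page
): Promise<boolean> {
  if (session.accessToken !== null) {
    return true; // token already in memory, nothing to do
  }
  try {
    session.accessToken = await refresh();
    return true;
  } catch {
    // Cookie missing, expired, or blocked: fall back to re-login
    onUnauthenticated();
    return false;
  }
}
```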

### Validation

**Acceptance Criteria:**
- Access token not visible in localStorage/sessionStorage/cookies (developer tools)
- Refresh token in httpOnly cookie with SameSite=Strict
- 401 errors trigger automatic token refresh
- Logout clears all tokens (memory + cookies)
- Cross-tab logout works (a logout signal is broadcast to other tabs, for example through a `storage` event or `BroadcastChannel`, since the access token itself is never written to storage)
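
Because the access token lives only in memory, each tab has to be told explicitly when another tab logs out. The sketch below separates that wiring from the transport so it can run outside a browser; `LogoutTransport` and `wireCrossTabLogout` are hypothetical names. In a browser the transport could be backed by a `BroadcastChannel`, which delivers messages to other contexts on the same origin but not to the sender itself.

```typescript
// Minimal transport abstraction (assumption): how a logout signal reaches other tabs.
interface LogoutTransport {
  publish(): void;                      // notify the other tabs
  subscribe(handler: () => void): void; // run handler when another tab logs out
}

// Wires one tab: clears local auth when others log out, and returns this tab's logout function.
function wireCrossTabLogout(
  transport: LogoutTransport,
  clearLocalAuth: () => void, // assumed: for example, the store's logout action
): () => void {
  transport.subscribe(clearLocalAuth);
  return () => {
    clearLocalAuth();
    transport.publish();
  };
}

// A browser-backed transport might look like this (not executed here):
//   const channel = new BroadcastChannel('auth');
//   const transport: LogoutTransport = {
//     publish: () => channel.postMessage('logout'),
//     subscribe: (handler) =>
//       channel.addEventListener('message', (e) => { if (e.data === 'logout') handler(); }),
//   };
```

Clearing the httpOnly refresh cookie still requires a call to the backend; this sketch only covers the in-memory state held by each tab.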

**Security Tests:**
- XSS attack simulation (cannot steal access token)
- CSRF attack simulation (refresh endpoint protected)
- Token expiration handled gracefully
- Logout clears all authentication state

### References
- OWASP: https://cheatsheetseries.owasp.org/cheatsheets/JSON_Web_Token_for_Java_Cheat_Sheet.html
- Auth0 Best Practices: https://auth0.com/docs/secure/tokens/refresh-tokens/refresh-token-rotation
- Implementation: `docs/frontend/api-integration-guide.md`

---

## Summary of Decisions

| Decision | Chosen Solution | Rationale |
|----------|----------------|-----------|
| **ADR-001: Tenant Identification** | JWT Claims + Subdomain | Stateless, cross-platform, performant |
| **ADR-002: Data Isolation** | Shared DB + tenant_id + Global Query Filter | Cost-effective, scalable, maintainable |
| **ADR-003: SSO Library** | ASP.NET Core Native (OIDC + SAML) | Free, fast, covers 80% of needs |
| **ADR-004: MCP Token Format** | Opaque Tokens (`mcp_<slug>_<random>`) | Revocable, flexible, secure, auditable |
| **ADR-005: Frontend State** | Zustand + TanStack Query | Lightweight, TypeScript-first, performant |
| **ADR-006: Token Storage** | Access in Memory + Refresh in httpOnly Cookie | XSS + CSRF protected, industry standard |

## Impact Assessment

### Security Impact
- **Overall Security Posture:** Excellent (9/10)
- **XSS Protection:** Enforced (tokens in memory + httpOnly cookies)
- **CSRF Protection:** Enforced (SameSite=Strict cookies)
- **Data Isolation:** Enforced (Global Query Filter + composite indexes)
- **Audit Trail:** Complete (MCP tokens logged, SSO events tracked)

### Performance Impact
- **API Latency:** +5ms (JWT validation + tenant filtering)
- **Database Load:** Minimal (composite indexes, Global Query Filter)
- **Frontend Bundle Size:** +18KB (Zustand + TanStack Query)
- **Token Refresh:** Transparent to user (<100ms)

### Cost Impact
- **Infrastructure:** $200/month (1 database vs $15,000 for DB-per-tenant)
- **Licensing:** $0/month (native .NET libraries vs $3,000-5,000 for Auth0)
- **Maintenance:** Low (one schema, automated migrations)
- **Total Savings:** ~$18,000/year compared to Auth0 + DB-per-tenant

### Development Impact
- **Implementation Time:** 10 days (vs 6 weeks for IdentityServer + DB-per-tenant)
- **Learning Curve:** Low (native libraries, clear architecture)
- **Maintenance Burden:** Low (well-documented, industry patterns)
- **Testing Complexity:** Medium (need tenant isolation tests)

## Risks and Mitigation

| Risk | Mitigation |
|------|------------|
| **Data leak via Global Query Filter bypass** | Code review for `.IgnoreQueryFilters()`, integration tests |
| **SSO misconfiguration** | Test connection UI, detailed error messages, documentation |
| **MCP token brute-force** | 128-bit entropy, rate limiting, IP whitelisting |
| **Performance degradation** | Composite indexes, query monitoring, slow query alerts |
| **Frontend XSS attack** | CSP headers, input sanitization, React auto-escaping |

## Future Enhancements

Decisions are not permanent. We will revisit these at milestone reviews:

| Milestone | Potential Changes |
|-----------|-------------------|
| **M3** | Re-evaluate SSO (Auth0 if complex federation needed) |
| **M4** | Re-evaluate data isolation (DB-per-tenant for enterprise customers) |
| **M5** | Re-evaluate frontend state (Redux if complex state emerges) |
| **M6** | Re-evaluate MCP tokens (consider JWT if performance critical) |

---

**Document Status:** Approved
**Next Review:** M3 Architecture Review (2025-12-15)
**Approval Signatures:**
- Architecture Team: [Approved]
- Product Manager: [Approved]
- Security Team: [Pending Review]
- Engineering Lead: [Approved]

---

**End of Architecture Decision Record**