---
name: prod-error-triage
description: End-to-end production error triage workflow - search logs, diagnose root cause, fix code, create Jira ticket, create branch, commit, and create PR. Use when investigating production errors, log messages, or exceptions.
---

# Production Error Triage

End-to-end workflow for investigating production errors and shipping fixes.

## When to Use

Trigger when the user:

- Pastes a log message or error and asks to investigate
- Asks "why is X failing in prod"
- Wants to trace a production exception

## Defaults

- **Jira project_key**: `ALLPOST`
- **Jira component**: `BE`
- **Azure DevOps org**: `https://dev.azure.com/billodev`
- **Azure DevOps project**: `Billo App Platform`

## Workflow

Execute these phases in order. Report findings to the user after each phase before proceeding.

### Phase 1: Log Search & Context Gathering

1. **Search for the error** using `mcp__billo-es-logs__search_logs` with the error message or keywords
2. **Expand the time window** if no results (start with `now-1h`, widen to `now-24h`, `now-7d`)
3. **Get surrounding logs** by searching with the same `Correlation-ID` and a narrow time window around the error
4. **Quantify impact** using `count_only: true` to understand if this is isolated or widespread
5. **Check for patterns** - compare error logs with success logs using `sample: true` to find what differs

Key questions to answer:

- How many errors in the last 24h?
- Is it intermittent or constant?
- Which application/service is affected?
- Is there a Correlation-ID to trace the full request?

### Phase 2: Root Cause Analysis

1. **Read the stack trace** - identify the exact file and line number
2. **Read the source code** at the error location using the file path from the stack trace
3. **Trace upstream** - read the calling code to understand the full flow
4. **Identify the real error** - the logged exception may wrap the actual cause. Look for inner exceptions and upstream error logs with the same Correlation-ID
5. **Compare success vs failure** - if intermittent, determine what condition causes the divergence

Present findings to the user:

- Error chain (what calls what)
- Root cause (the actual bug, not the symptom)
- Why it is intermittent (if applicable)
- Impact scope

### Phase 3: Code Fix

1. **Implement the minimal fix** addressing the root cause
2. **Consider idempotency** - if the error is caused by retries, add guards to make the operation safe to retry
3. **Consider edge cases** - identify scenarios the fix might not cover (e.g. partial completion) and flag them to the user
4. **Show the diff** to the user and get confirmation before proceeding

#### Multi-Repo Changes

If the fix spans multiple repos (e.g. Infrastructure + Payment):

1. Fix the upstream repo first (e.g. shared library)
2. Merge and publish a new NuGet package version
3. Update the downstream repo to reference the new version
4. **Check dependency compatibility before updating**:
   - `Microsoft.Extensions.*` major version must match the downstream project's TFM (net9.0 = 9.x)
   - `AWSSDK.*` major version must not conflict with other transitive dependencies (e.g. MongoDB.Driver requires AWSSDK.Core < 4.0)
   - Run `dotnet restore` to verify before committing

### Phase 4: Jira Ticket

Create a ticket using `mcp__billo-es-logs__create_bug_ticket` with:

- **project_key**: `ALLPOST` (default, ask user if different)
- **component**: `BE`
- **priority**: Based on impact (2300+ errors/day = `Highest`)
- **summary**: Short, searchable - include error type and affected component
- **description**: Uses lightweight formatting that converts to Jira ADF:
  - Lines ending with `:` become **h3 headings** (e.g. `Problem:`)
  - Lines starting with `- ` become **bullet lists**
  - Text wrapped in `**` becomes **bold**
  - Everything else is a paragraph

```
Problem:
DownloadAndSendInvoiceCommandHandler fails with 409 BlobAlreadyExists

Impact:
- 2300+ errors in the last 24 hours
- Affects both regular and **reminder** invoices

Root Cause:
- AzureStorage.StoreFileAsync calls blobClient.UploadAsync() without overwrite flag
- No idempotency check in the handler

Fix:
Add idempotency guard to check **InvoiceTransaction** status before uploading

Files:
- Billo.Platform.Payment.Business/Commands/Handlers/DownloadAndSendInvoiceCommandHandler.cs
```

If the API returns 400, likely causes:

- Missing required field (e.g. `component`)
- Invalid `priority` value
- Wrong `project_key`

Use `mcp__billo-es-logs__search_tickets` with an existing ticket key to discover required fields.

### Phase 5: Branch & Commit

1. **Create branch** using the naming convention `{prefix}/{TICKET_ID}_{description}`:

   ```
   bug/ALLPOST-4228_fix-invoice-upload-blob-already-exists
   fix/ALLPOST-4230_crash
   feature/ALLPOST-4028_login-page
   feat/ALLPOST-4028_login-page
   chore/ALLPOST-4031_cleanup
   ```

   Choose the prefix that best matches the work type. Any prefix is valid.
2. **Stage only the changed files** - never `git add .`
3. **Commit** with conventional commit format:

   ```
   fix: {description} ({TICKET_KEY})

   {Brief explanation of what and why}
   ```

4. **Ask before pushing** - do not push without user confirmation

### Phase 6: Create PR

Create PR using Azure DevOps CLI:

```bash
az repos pr create \
  --org "https://dev.azure.com/billodev" \
  --project "Billo App Platform" \
  --detect false \
  --repository "{REPO_NAME}" \
  --source-branch "{BRANCH}" \
  --target-branch "develop" \
  --title "{type}: {description} ({TICKET_KEY})" \
  --description "{summary of changes}"
```

Notes:

- `--project` is required; the command errors without it
- `--detect false` avoids auto-detection issues
- Return the PR URL to the user when done

## Tools Reference

| Phase | Tool | Purpose |
|-------|------|---------|
| Log search | `mcp__billo-es-logs__search_logs` | Search with query, time range, level, application |
| Impact | `mcp__billo-es-logs__search_logs` with `count_only: true` | Count matching errors |
| Patterns | `mcp__billo-es-logs__search_logs` with `sample: true` | Random sample from large result sets |
| Source code | `Read`, `Glob`, `Grep` | Find and read source files |
| Ticket lookup | `mcp__billo-es-logs__search_tickets` | Find existing tickets or discover field requirements |
| Ticket create | `mcp__billo-es-logs__create_bug_ticket` | Create Jira bug ticket |
| Git | `Bash` | Branch, commit, push |
| PR | `az repos pr create` | Create Azure DevOps pull request |

## Tips

- Always search logs before reading code - the logs tell you where to look
- Use `Correlation-ID` to trace a single request across services
- When errors are intermittent, the root cause is often in retry/concurrency behavior, not in the happy path
- When updating shared NuGet packages, always verify transitive dependency compatibility with downstream projects before publishing
- Flag edge cases to the user rather than silently ignoring them
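
The Phase 5 branch naming convention, `{prefix}/{TICKET_ID}_{description}`, can be sketched as a small shell helper. This is a minimal illustration, not part of the workflow's tooling: the function name `make_branch_name` is hypothetical, and the slug rules (lowercase, runs of non-alphanumeric characters collapsed to a single hyphen) are an assumption inferred from the example branch names above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a branch name as {prefix}/{TICKET_ID}_{description}.
# Assumption: the description is slugified to lowercase, hyphen-separated words,
# matching the style of the example branch names in Phase 5.
make_branch_name() {
  local prefix="$1" ticket="$2" description="$3"
  local slug
  # Lowercase, collapse each run of non-alphanumerics to "-", trim edge hyphens.
  slug=$(printf '%s' "$description" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')
  printf '%s/%s_%s\n' "$prefix" "$ticket" "$slug"
}
```

For example, `make_branch_name bug ALLPOST-4228 "Fix invoice upload: blob already exists"` yields `bug/ALLPOST-4228_fix-invoice-upload-blob-already-exists`, which could then be passed to `git checkout -b`.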