---
name: prod-error-triage
description: End-to-end production error triage workflow - search logs, diagnose root cause, fix code, create Jira ticket, create branch, commit, and create PR. Use when investigating production errors, log messages, or exceptions.
---

# Production Error Triage

End-to-end workflow for investigating production errors and shipping fixes.

## When to Use

Trigger when the user:
- Pastes a log message or error and asks to investigate
- Asks "why is X failing in prod"
- Wants to trace a production exception

## Defaults

- **Jira project_key**: `ALLPOST`
- **Jira component**: `BE`
- **Azure DevOps org**: `https://dev.azure.com/billodev`
- **Azure DevOps project**: `Billo App Platform`

## Workflow

Execute these phases in order. Report findings to the user after each phase before proceeding.

### Phase 1: Log Search & Context Gathering

1. **Search for the error** using `mcp__billo-es-logs__search_logs` with the error message or keywords
2. **Expand the time window** if no results (start with `now-1h`, widen to `now-24h`, `now-7d`)
3. **Get surrounding logs** by searching with the same `Correlation-ID` and a narrow time window around the error
4. **Quantify impact** using `count_only: true` to understand if this is isolated or widespread
5. **Check for patterns** - compare error logs with success logs using `sample: true` to find what differs

Key questions to answer:
- How many errors in the last 24h?
- Is it intermittent or constant?
- Which application/service is affected?
- Is there a Correlation-ID to trace the full request?

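The widening strategy in step 2 can be sketched as a small loop. This is a minimal Python sketch, not the MCP tool itself: `search` is a hypothetical callable standing in for `mcp__billo-es-logs__search_logs`, and its `(query, time_from)` signature and list return value are assumptions made for illustration.

```python
from typing import Callable, Optional

# Windows from step 2, narrowest first.
TIME_WINDOWS = ["now-1h", "now-24h", "now-7d"]


def search_with_widening(
    search: Callable[[str, str], list],
    query: str,
    windows: Optional[list] = None,
) -> tuple:
    """Try each window in order and stop at the first one that returns hits.

    `search(query, time_from)` is a hypothetical stand-in for the log search
    tool; it is assumed to return a list of matching log entries.
    Returns (window, hits), or (None, []) if every window came back empty.
    """
    for window in windows or TIME_WINDOWS:
        hits = search(query, window)
        if hits:
            return window, hits
    return None, []
```

Stopping at the first non-empty window keeps the result set small and tells you roughly how recent the error is, which feeds the "intermittent or constant" question.
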
### Phase 2: Root Cause Analysis

1. **Read the stack trace** - identify the exact file and line number
2. **Read the source code** at the error location using the file path from the stack trace
3. **Trace upstream** - read the calling code to understand the full flow
4. **Identify the real error** - the logged exception may wrap the actual cause. Look for inner exceptions and upstream error logs with the same Correlation-ID
5. **Compare success vs failure** - if intermittent, determine what condition causes the divergence

Present findings to the user:
- Error chain (what calls what)
- Root cause (the actual bug, not the symptom)
- Why it is intermittent (if applicable)
- Impact scope

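Step 4 amounts to walking a chain of wrapped exceptions down to the innermost one. The services in this workflow are .NET, where the chain is exposed through inner exceptions; the sketch below shows the same idea in Python using `__cause__`, purely as an illustration of the technique. The function name is hypothetical.

```python
def innermost_cause(exc: BaseException) -> BaseException:
    """Follow explicit `raise ... from ...` links to the deepest exception.

    Tracks visited exceptions so a (rare) cyclic chain cannot loop forever.
    """
    seen = set()
    current = exc
    while current.__cause__ is not None and id(current.__cause__) not in seen:
        seen.add(id(current))
        current = current.__cause__
    return current
```

The outermost exception is usually what gets logged; the innermost one is usually what belongs in the "Root cause" line of the findings.
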
### Phase 3: Code Fix

1. **Implement the minimal fix** addressing the root cause
2. **Consider idempotency** - if the error is caused by retries, add guards to make the operation safe to retry
3. **Consider edge cases** - identify scenarios the fix might not cover (e.g. partial completion) and flag them to the user
4. **Show the diff** to the user and get confirmation before proceeding

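An idempotency guard, as in step 2, can be as small as a "was this already done?" check in front of the side effect. The following is a minimal Python sketch under stated assumptions: `InMemoryStore`, `BlobAlreadyExists`, and `upload_once` are hypothetical names invented for illustration and do not correspond to any class in the actual codebase.

```python
class BlobAlreadyExists(Exception):
    """Stand-in for a storage layer's conflict error (hypothetical)."""


class InMemoryStore:
    """Toy blob store that rejects overwrites."""

    def __init__(self) -> None:
        self.blobs = {}

    def upload(self, name: str, data: bytes) -> None:
        if name in self.blobs:
            raise BlobAlreadyExists(name)
        self.blobs[name] = data


def upload_once(store: InMemoryStore, completed: set, name: str, data: bytes) -> bool:
    """Upload unless this unit of work is already recorded as done.

    Returns True if an upload happened, False if the guard skipped it.
    """
    if name in completed:
        return False  # retry of finished work: do nothing
    store.upload(name, data)
    completed.add(name)
    return True
```

Note the gap this leaves: if the process dies after `store.upload` but before `completed.add`, a retry still hits the conflict. That is the kind of partial-completion case step 3 asks you to flag.
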
#### Multi-Repo Changes

If the fix spans multiple repos (e.g. Infrastructure + Payment):
1. Fix the upstream repo first (e.g. shared library)
2. Merge and publish a new NuGet package version
3. Update the downstream repo to reference the new version
4. **Check dependency compatibility before updating**:
   - `Microsoft.Extensions.*` major version must match the downstream project's TFM (net9.0 = 9.x)
   - `AWSSDK.*` major version must not conflict with other transitive dependencies (e.g. MongoDB.Driver requires AWSSDK.Core < 4.0)
   - Run `dotnet restore` to verify before committing

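The `Microsoft.Extensions.*` rule above (net9.0 = 9.x) is simple enough to check mechanically before touching a project file. This is a rough Python sketch with hypothetical function names; it only understands TFMs of the plain `netX.Y` form and does not replace running `dotnet restore`.

```python
import re


def tfm_major(tfm: str) -> int:
    """Extract the major version from a TFM such as 'net9.0'."""
    match = re.fullmatch(r"net(\d+)\.\d+", tfm)
    if match is None:
        raise ValueError(f"unsupported TFM: {tfm!r}")
    return int(match.group(1))


def extensions_version_matches(tfm: str, package_version: str) -> bool:
    """True if the package's major version equals the TFM's major version."""
    return int(package_version.split(".")[0]) == tfm_major(tfm)
```

For example, `extensions_version_matches("net8.0", "9.0.0")` is false, which is the mismatch the first bullet warns about.
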
### Phase 4: Jira Ticket

Create a ticket using `mcp__billo-es-logs__create_bug_ticket` with:

- **project_key**: `ALLPOST` (default, ask user if different)
- **component**: `BE`
- **priority**: Based on impact (2300+ errors/day = `Highest`)
- **summary**: Short, searchable - include error type and affected component
- **description**: Uses lightweight formatting that converts to Jira ADF:
  - Lines ending with `:` become **h3 headings** (e.g. `Problem:`)
  - Lines starting with `- ` become **bullet lists**
  - Text wrapped in `**` becomes **bold**
  - Everything else is a paragraph

```
Problem:
DownloadAndSendInvoiceCommandHandler fails with 409 BlobAlreadyExists

Impact:
- 2300+ errors in the last 24 hours
- Affects both regular and **reminder** invoices

Root Cause:
- AzureStorage.StoreFileAsync calls blobClient.UploadAsync() without overwrite flag
- No idempotency check in the handler

Fix:
Add idempotency guard to check **InvoiceTransaction** status before uploading

Files:
- Billo.Platform.Payment.Business/Commands/Handlers/DownloadAndSendInvoiceCommandHandler.cs
```

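To sanity-check a description before sending it, the line rules above can be mirrored in a few lines of Python. This is only a sketch of the documented rules, not the tool's actual converter: `classify_line` is a hypothetical helper, and checking the bullet rule before the heading rule (for a line that matches both) is an assumption.

```python
def classify_line(line: str) -> str:
    """Classify one description line according to the formatting rules."""
    text = line.strip()
    if not text:
        return "blank"
    # Assumption: a bullet that happens to end with ':' stays a bullet.
    if text.startswith("- "):
        return "bullet"
    if text.endswith(":"):
        return "heading"
    return "paragraph"
```

Running it over a draft makes accidental headings easy to spot, for instance a paragraph sentence that happens to end with a colon.
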
If the API returns 400, likely causes:
- Missing required field (e.g. `component`)
- Invalid `priority` value
- Wrong `project_key`

Use `mcp__billo-es-logs__search_tickets` with an existing ticket key to discover required fields.

### Phase 5: Branch & Commit

1. **Create branch** using the naming convention `{prefix}/{TICKET_ID}_{description}`:
   ```
   bug/ALLPOST-4228_fix-invoice-upload-blob-already-exists
   fix/ALLPOST-4230_crash
   feature/ALLPOST-4028_login-page
   feat/ALLPOST-4028_login-page
   chore/ALLPOST-4031_cleanup
   ```
   Choose the prefix that best matches the work type. Any prefix is valid.
2. **Stage only the changed files** - never `git add .`
3. **Commit** with conventional commit format:
   ```
   fix: {description} ({TICKET_KEY})

   {Brief explanation of what and why}
   ```
4. **Ask before pushing** - do not push without user confirmation

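The branch and commit conventions above can be generated rather than typed by hand. The sketch below is illustrative Python with hypothetical helper names; the lowercase, hyphen-separated slug is inferred from the example branch names and is an assumption, not a documented rule.

```python
import re


def branch_name(prefix: str, ticket_id: str, description: str) -> str:
    """Build `{prefix}/{TICKET_ID}_{description}` with a hyphenated slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{prefix}/{ticket_id}_{slug}"


def commit_message(description: str, ticket_key: str, explanation: str) -> str:
    """Build the subject line, a blank line, then the explanation."""
    return f"fix: {description} ({ticket_key})\n\n{explanation}"
```

The blank line between subject and body matters: git treats the first line as the subject only when it is followed by an empty line.
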
### Phase 6: Create PR

Create PR using Azure DevOps CLI:

```bash
az repos pr create \
  --org "https://dev.azure.com/billodev" \
  --project "Billo App Platform" \
  --detect false \
  --repository "{REPO_NAME}" \
  --source-branch "{BRANCH}" \
  --target-branch "develop" \
  --title "{type}: {description} ({TICKET_KEY})" \
  --description "{summary of changes}"
```

Notes:
- `--project` is required; the command errors without it
- `--detect false` avoids auto-detection issues
- Return the PR URL to the user when done

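When the command is assembled programmatically, building it as an argument list sidesteps shell quoting problems with values like `Billo App Platform`. This is a minimal Python sketch: `pr_create_args` is a hypothetical helper, and the flags and default values are exactly those from the command above.

```python
def pr_create_args(repo: str, branch: str, title: str, description: str) -> list:
    """Return the `az repos pr create` invocation as an argument list."""
    return [
        "az", "repos", "pr", "create",
        "--org", "https://dev.azure.com/billodev",
        "--project", "Billo App Platform",
        "--detect", "false",
        "--repository", repo,
        "--source-branch", branch,
        "--target-branch", "develop",
        "--title", title,
        "--description", description,
    ]
```

The list can be passed directly to `subprocess.run(..., check=True)`, so no value is ever re-parsed by a shell.
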
## Tools Reference

| Phase | Tool | Purpose |
|-------|------|---------|
| Log search | `mcp__billo-es-logs__search_logs` | Search with query, time range, level, application |
| Impact | `mcp__billo-es-logs__search_logs` with `count_only: true` | Count matching errors |
| Patterns | `mcp__billo-es-logs__search_logs` with `sample: true` | Random sample from large result sets |
| Source code | `Read`, `Glob`, `Grep` | Find and read source files |
| Ticket lookup | `mcp__billo-es-logs__search_tickets` | Find existing tickets or discover field requirements |
| Ticket create | `mcp__billo-es-logs__create_bug_ticket` | Create Jira bug ticket |
| Git | `Bash` | Branch, commit, push |
| PR | `az repos pr create` | Create Azure DevOps pull request |

## Tips

- Always search logs before reading code - the logs tell you where to look
- Use `Correlation-ID` to trace a single request across services
- When errors are intermittent, the root cause is often in retry/concurrency behavior, not in the happy path
- When updating shared NuGet packages, always verify transitive dependency compatibility with downstream projects before publishing
- Flag edge cases to the user rather than silently ignoring them