chore: initial backup of Claude Code configuration
Includes: CLAUDE.md, settings.json, agents, commands, rules, skills, hooks, contexts, evals, get-shit-done, plugin configs (installed list and marketplace sources). Excludes credentials, runtime caches, telemetry, session data, and plugin binary cache.
174
skills/prod-error-triage/SKILL.md
Normal file
@@ -0,0 +1,174 @@
---
name: prod-error-triage
description: End-to-end production error triage workflow - search logs, diagnose root cause, fix code, create Jira ticket, create branch, commit, and create PR. Use when investigating production errors, log messages, or exceptions.
---

# Production Error Triage

End-to-end workflow for investigating production errors and shipping fixes.

## When to Use

Trigger when the user:
- Pastes a log message or error and asks to investigate
- Asks "why is X failing in prod"
- Wants to trace a production exception

## Defaults

- **Jira project_key**: `ALLPOST`
- **Jira component**: `BE`
- **Azure DevOps org**: `https://dev.azure.com/billodev`
- **Azure DevOps project**: `Billo App Platform`

## Workflow

Execute these phases in order. Report findings to the user after each phase before proceeding.

### Phase 1: Log Search & Context Gathering

1. **Search for the error** using `mcp__billo-es-logs__search_logs` with the error message or keywords
2. **Expand the time window** if there are no results (start with `now-1h`, then widen to `now-24h`, then `now-7d`)
3. **Get surrounding logs** by searching with the same `Correlation-ID` and a narrow time window around the error
4. **Quantify impact** using `count_only: true` to determine whether the error is isolated or widespread
5. **Check for patterns** - compare error logs with success logs using `sample: true` to find what differs

Key questions to answer:
- How many errors in the last 24h?
- Is it intermittent or constant?
- Which application/service is affected?
- Is there a Correlation-ID to trace the full request?

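
The widening strategy in step 2 can be sketched as a simple loop. This is only an illustration in shell: `count_errors` is a hypothetical stand-in for a `mcp__billo-es-logs__search_logs` call with `count_only: true` (the real tool is invoked through MCP, not from a shell), and it is stubbed here with made-up counts so the sketch runs on its own.

```bash
#!/usr/bin/env bash
# Hypothetical stub: stands in for search_logs with count_only: true.
# It pretends the error only shows up once the window is wider than 1h.
count_errors() {
  local window="$1"
  case "$window" in
    now-1h) echo 0 ;;
    *)      echo 17 ;;
  esac
}

# Try each window in order and stop at the first one that returns hits.
# Prints "<window> <count>", or returns 1 if nothing matched.
first_window_with_hits() {
  local window count
  for window in now-1h now-24h now-7d; do
    count="$(count_errors "$window")"
    if [ "$count" -gt 0 ]; then
      echo "$window $count"
      return 0
    fi
  done
  return 1
}

first_window_with_hits || echo "no matches within 7 days"
```

Stopping at the narrowest window that has hits keeps the result set small, which makes the follow-up `Correlation-ID` search easier to read.
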
### Phase 2: Root Cause Analysis

1. **Read the stack trace** - identify the exact file and line number
2. **Read the source code** at the error location using the file path from the stack trace
3. **Trace upstream** - read the calling code to understand the full flow
4. **Identify the real error** - the logged exception may wrap the actual cause. Look for inner exceptions and upstream error logs with the same Correlation-ID
5. **Compare success vs failure** - if intermittent, determine what condition causes the divergence

Present findings to the user:
- Error chain (what calls what)
- Root cause (the actual bug, not the symptom)
- Why it is intermittent (if applicable)
- Impact scope

### Phase 3: Code Fix

1. **Implement the minimal fix** addressing the root cause
2. **Consider idempotency** - if the error is caused by retries, add guards to make the operation safe to retry
3. **Consider edge cases** - identify scenarios the fix might not cover (e.g. partial completion) and flag them to the user
4. **Show the diff** to the user and get confirmation before proceeding

#### Multi-Repo Changes

If the fix spans multiple repos (e.g. Infrastructure + Payment):
1. Fix the upstream repo first (e.g. shared library)
2. Merge and publish a new NuGet package version
3. Update the downstream repo to reference the new version
4. **Check dependency compatibility before updating**:
   - `Microsoft.Extensions.*` major version must match the downstream project's TFM (net9.0 = 9.x)
   - `AWSSDK.*` major version must not conflict with other transitive dependencies (e.g. MongoDB.Driver requires AWSSDK.Core < 4.0)
   - Run `dotnet restore` to verify before committing

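
The `Microsoft.Extensions.*` rule above (net9.0 = 9.x) amounts to comparing two major version numbers. A minimal sketch follows; `major_matches_tfm` is a hypothetical helper, not part of any tooling named in this skill, and it assumes plain `netN.M` target framework monikers.

```bash
#!/usr/bin/env bash
# Hypothetical helper: returns 0 when the package's major version equals the
# TFM's major version, e.g. net9.0 and 9.0.4.
# Assumption: the TFM has the plain "netN.M" form (not netstandard or netcoreapp).
major_matches_tfm() {
  local tfm="$1" version="$2"
  local tfm_major="${tfm#net}"      # "net9.0" -> "9.0"
  tfm_major="${tfm_major%%.*}"      # "9.0"    -> "9"
  local pkg_major="${version%%.*}"  # "9.0.4"  -> "9"
  [ "$tfm_major" = "$pkg_major" ]
}

if major_matches_tfm "net9.0" "9.0.4"; then
  echo "ok: 9.x package on net9.0"
fi
if ! major_matches_tfm "net9.0" "10.0.0"; then
  echo "mismatch: 10.x package on net9.0"
fi
```

After updating the reference, running `dotnet restore` and then `dotnet list package --include-transitive` in the downstream project shows the resolved transitive versions, which is where an `AWSSDK.Core` conflict would become visible.
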
### Phase 4: Jira Ticket

Create a ticket using `mcp__billo-es-logs__create_bug_ticket` with:

- **project_key**: `ALLPOST` (default; ask the user if it should be different)
- **component**: `BE`
- **priority**: Based on impact (e.g. 2300+ errors/day = `Highest`)
- **summary**: Short and searchable - include the error type and affected component
- **description**: Uses lightweight formatting that converts to Jira ADF:
  - Lines ending with `:` become **h3 headings** (e.g. `Problem:`)
  - Lines starting with `- ` become **bullet lists**
  - Text wrapped in `**` becomes **bold**
  - Everything else is a paragraph

```
Problem:
DownloadAndSendInvoiceCommandHandler fails with 409 BlobAlreadyExists

Impact:
- 2300+ errors in the last 24 hours
- Affects both regular and **reminder** invoices

Root Cause:
- AzureStorage.StoreFileAsync calls blobClient.UploadAsync() without overwrite flag
- No idempotency check in the handler

Fix:
Add idempotency guard to check **InvoiceTransaction** status before uploading

Files:
- Billo.Platform.Payment.Business/Commands/Handlers/DownloadAndSendInvoiceCommandHandler.cs
```

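
To make the line-level formatting rules concrete, here is a small sketch that classifies a description line the way the rules above describe. `classify_line` is a hypothetical helper written for illustration; the real conversion happens inside the ticket tool, which may apply the checks in a different order (here a line that starts with `- ` is treated as a bullet even if it also ends with `:`).

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify one description line by the rules above.
# Bold (**text**) is inline markup, so it is not a line-level category here.
classify_line() {
  local line="$1"
  case "$line" in
    "- "*) echo "bullet" ;;     # starts with "- "
    *:)    echo "heading" ;;    # ends with ":"
    *)     echo "paragraph" ;;  # everything else
  esac
}

classify_line "Problem:"
classify_line "- 2300+ errors in the last 24 hours"
classify_line "Add idempotency guard to check **InvoiceTransaction** status"
```

Running a draft description through a check like this before submitting can catch an accidental trailing colon that would turn a sentence into a heading.
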
If the API returns 400, likely causes:
- Missing required field (e.g. `component`)
- Invalid `priority` value
- Wrong `project_key`

Use `mcp__billo-es-logs__search_tickets` with an existing ticket key to discover required fields.

### Phase 5: Branch & Commit

1. **Create branch** using the naming convention `{prefix}/{TICKET_ID}_{description}`:
   ```
   bug/ALLPOST-4228_fix-invoice-upload-blob-already-exists
   fix/ALLPOST-4230_crash
   feature/ALLPOST-4028_login-page
   feat/ALLPOST-4028_login-page
   chore/ALLPOST-4031_cleanup
   ```
   Choose the prefix that best matches the work type. Any prefix is valid.
2. **Stage only the changed files** - never `git add .`
3. **Commit** with conventional commit format:
   ```
   fix: {description} ({TICKET_KEY})

   {Brief explanation of what and why}
   ```
4. **Ask before pushing** - do not push without user confirmation

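
The branch naming convention can be sketched as a small helper. `branch_name` is hypothetical, and the slug rule (lowercase, runs of other characters collapsed to a single hyphen) is an assumption inferred from the example names above rather than a documented requirement.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build "{prefix}/{TICKET_ID}_{description}".
# Assumption: the description is lowercased and every run of characters
# other than a-z and 0-9 becomes one hyphen, with edge hyphens trimmed.
branch_name() {
  local prefix="$1" ticket="$2" description="$3"
  local slug
  slug="$(printf '%s' "$description" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')"
  printf '%s/%s_%s\n' "$prefix" "$ticket" "$slug"
}

branch_name bug ALLPOST-4228 "Fix invoice upload: blob already exists"
```

The result can be passed to `git switch -c`. For the commit itself, `git commit` accepts `-m` twice, which produces the subject line and the explanation as separate paragraphs, matching the format in step 3.
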
### Phase 6: Create PR

Create PR using Azure DevOps CLI:

```bash
az repos pr create \
  --org "https://dev.azure.com/billodev" \
  --project "Billo App Platform" \
  --detect false \
  --repository "{REPO_NAME}" \
  --source-branch "{BRANCH}" \
  --target-branch "develop" \
  --title "{type}: {description} ({TICKET_KEY})" \
  --description "{summary of changes}"
```

Notes:
- `--project` is required; the command errors without it
- `--detect false` avoids auto-detection issues
- Return the PR URL to the user when done

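
For the last note, one way to produce the PR URL is to build it from the pull request ID. The sketch below assumes the usual Azure DevOps web URL layout (`{org}/{project}/_git/{repo}/pullrequest/{id}`) and that the ID was obtained by adding `--query pullRequestId --output tsv` to the command above; `pr_url`, the repository name `Payment`, and the ID `1234` are hypothetical placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the web URL of a pull request.
# Assumption: Azure DevOps serves PRs at {org}/{project}/_git/{repo}/pullrequest/{id}.
pr_url() {
  local org="$1" project="$2" repo="$3" pr_id="$4"
  # Percent-encode spaces only; other reserved characters are not handled.
  local encoded_project="${project// /%20}"
  printf '%s/%s/_git/%s/pullrequest/%s\n' \
    "${org%/}" "$encoded_project" "$repo" "$pr_id"
}

# Placeholder repository name and PR ID, for illustration only.
pr_url "https://dev.azure.com/billodev" "Billo App Platform" "Payment" 1234
```

Verify the generated link once against a real pull request before relying on it, since the layout is assumed here rather than taken from this skill.
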
## Tools Reference

| Phase | Tool | Purpose |
|-------|------|---------|
| Log search | `mcp__billo-es-logs__search_logs` | Search with query, time range, level, application |
| Impact | `mcp__billo-es-logs__search_logs` with `count_only: true` | Count matching errors |
| Patterns | `mcp__billo-es-logs__search_logs` with `sample: true` | Random sample from large result sets |
| Source code | `Read`, `Glob`, `Grep` | Find and read source files |
| Ticket lookup | `mcp__billo-es-logs__search_tickets` | Find existing tickets or discover field requirements |
| Ticket create | `mcp__billo-es-logs__create_bug_ticket` | Create Jira bug ticket |
| Git | `Bash` | Branch, commit, push |
| PR | `az repos pr create` | Create Azure DevOps pull request |

## Tips

- Always search logs before reading code - the logs tell you where to look
- Use `Correlation-ID` to trace a single request across services
- When errors are intermittent, the root cause is often in retry/concurrency behavior, not in the happy path
- When updating shared NuGet packages, always verify transitive dependency compatibility with downstream projects before publishing
- Flag edge cases to the user rather than silently ignoring them