Yaojia Wang 2876cca8fe chore: initial backup of Claude Code configuration
Includes: CLAUDE.md, settings.json, agents, commands, rules, skills,
hooks, contexts, evals, get-shit-done, plugin configs (installed list
and marketplace sources). Excludes credentials, runtime caches,
telemetry, session data, and plugin binary cache.
2026-03-24 22:26:05 +01:00


name: prod-error-triage
description: End-to-end production error triage workflow - search logs, diagnose root cause, fix code, create Jira ticket, create branch, commit, and create PR. Use when investigating production errors, log messages, or exceptions.

Production Error Triage

End-to-end workflow for investigating production errors and shipping fixes.

When to Use

Trigger when the user:

  • Pastes a log message or error and asks to investigate
  • Asks "why is X failing in prod"
  • Wants to trace a production exception

Defaults

  • Jira project_key: ALLPOST
  • Jira component: BE
  • Azure DevOps org: https://dev.azure.com/billodev
  • Azure DevOps project: Billo App Platform

Workflow

Execute these phases in order. Report findings to the user after each phase before proceeding.

Phase 1: Log Search & Context Gathering

  1. Search for the error using mcp__billo-es-logs__search_logs with the error message or keywords
  2. Expand the time window if there are no results (start with now-1h, then widen to now-24h and now-7d)
  3. Get surrounding logs by searching with the same Correlation-ID and a narrow time window around the error
  4. Quantify impact using count_only: true to determine whether the error is isolated or widespread
  5. Check for patterns - compare error logs with success logs using sample: true to find what differs
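
The window-widening step can be sketched as a small loop. This is only an illustration: `search` is a hypothetical stand-in for the log search tool (its real parameters are not documented here), and `fake_search` exists purely to make the example runnable.

```python
from typing import Callable

# Time windows from the workflow above, narrowest first.
WINDOWS = ["now-1h", "now-24h", "now-7d"]


def search_with_widening(search: Callable[[str, str], list], query: str,
                         windows: list = WINDOWS) -> tuple:
    """Return (window, hits) for the first window that yields results.

    `search` is a hypothetical callable taking a query and a start time
    and returning a list of matching log entries.
    """
    for window in windows:
        hits = search(query, window)
        if hits:
            return window, hits
    # Nothing found even in the widest window.
    return None, []


def fake_search(query: str, start: str) -> list:
    # Made-up behaviour: the error only shows up once the window reaches 24h.
    return [{"message": query}] if start == "now-24h" else []


window, hits = search_with_widening(fake_search, "BlobAlreadyExists")
```

Stopping at the first window with results keeps the result set small, which makes the later Correlation-ID search easier to scope.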

Key questions to answer:

  • How many errors in the last 24h?
  • Is it intermittent or constant?
  • Which application/service is affected?
  • Is there a Correlation-ID to trace the full request?

Phase 2: Root Cause Analysis

  1. Read the stack trace - identify the exact file and line number
  2. Read the source code at the error location using the file path from the stack trace
  3. Trace upstream - read the calling code to understand the full flow
  4. Identify the real error - the logged exception may wrap the actual cause. Look for inner exceptions and upstream error logs with the same Correlation-ID
  5. Compare success vs failure - if intermittent, determine what condition causes the divergence
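
One way to approach the success-versus-failure comparison is to look for log fields whose values never overlap between the two samples. The sketch below assumes each log entry is a flat dict with hashable values; the field names in the usage example (`app`, `retry`) are invented for illustration.

```python
from collections import Counter


def diverging_fields(failures: list, successes: list) -> dict:
    """Find fields whose values never overlap between the two samples."""
    result = {}
    fields = set().union(*(entry.keys() for entry in failures + successes))
    for field in sorted(fields):
        fail_values = Counter(entry.get(field) for entry in failures)
        ok_values = Counter(entry.get(field) for entry in successes)
        # A field is interesting when no failure value also appears in a success.
        if fail_values and ok_values and not set(fail_values) & set(ok_values):
            result[field] = {"failures": dict(fail_values),
                             "successes": dict(ok_values)}
    return result


# Made-up samples: only the retry counter differs between the two groups.
failures = [{"app": "payment", "retry": 1}, {"app": "payment", "retry": 2}]
successes = [{"app": "payment", "retry": 0}, {"app": "payment", "retry": 0}]
diff = diverging_fields(failures, successes)
```

A field that diverges this cleanly is a candidate for the condition behind an intermittent failure, not proof of it. The source code still has to confirm the link.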

Present findings to the user:

  • Error chain (what calls what)
  • Root cause (the actual bug, not the symptom)
  • Why it is intermittent (if applicable)
  • Impact scope

Phase 3: Code Fix

  1. Implement the minimal fix addressing the root cause
  2. Consider idempotency - if the error is caused by retries, add guards to make the operation safe to retry
  3. Consider edge cases - identify scenarios the fix might not cover (e.g. partial completion) and flag them to the user
  4. Show the diff to the user and get confirmation before proceeding
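
A minimal sketch of an idempotency guard, assuming an in-memory store that rejects overwrites. `InMemoryStore`, `store_invoice`, and the `completed` set are hypothetical names used only to show the shape of the guard, not code from any affected repo.

```python
class BlobAlreadyExists(Exception):
    """Raised by the in-memory store when a blob name is reused."""


class InMemoryStore:
    """Hypothetical stand-in for a blob store that rejects overwrites."""

    def __init__(self):
        self.blobs = {}

    def upload(self, name: str, data: bytes) -> None:
        if name in self.blobs:
            raise BlobAlreadyExists(name)
        self.blobs[name] = data


def store_invoice(store: InMemoryStore, completed: set,
                  invoice_id: str, data: bytes) -> bool:
    """Upload once; return False when a retry finds the work already done."""
    if invoice_id in completed:
        # Guard: a previous attempt finished, so the retry is a no-op.
        return False
    store.upload(f"{invoice_id}.pdf", data)
    completed.add(invoice_id)
    return True


store, completed = InMemoryStore(), set()
first = store_invoice(store, completed, "INV-1", b"pdf")
second = store_invoice(store, completed, "INV-1", b"pdf")  # simulated retry
```

The guard does not cover partial completion: if a run uploads the blob but stops before recording completion, the next retry still hits `BlobAlreadyExists`. That is the kind of edge case to flag to the user.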

Multi-Repo Changes

If the fix spans multiple repos (e.g. Infrastructure + Payment):

  1. Fix the upstream repo first (e.g. shared library)
  2. Merge and publish a new NuGet package version
  3. Update the downstream repo to reference the new version
  4. Check dependency compatibility before updating:
    • Microsoft.Extensions.* major version must match the downstream project's TFM (net9.0 = 9.x)
    • AWSSDK.* major version must not conflict with other transitive dependencies (e.g. MongoDB.Driver requires AWSSDK.Core < 4.0)
    • Run dotnet restore to verify before committing
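
The two compatibility rules above can be expressed as a quick pre-check before running dotnet restore. This is a sketch under assumptions: it only understands modern TFMs of the form `netN.M`, takes a plain `{package_id: version}` mapping rather than reading project files, and the version numbers in the usage example are made up.

```python
import re


def tfm_major(tfm: str) -> int:
    """Extract the major version from a modern TFM such as 'net9.0'."""
    match = re.fullmatch(r"net(\d+)\.\d+", tfm)
    if not match:
        raise ValueError(f"unsupported TFM: {tfm}")
    return int(match.group(1))


def compatibility_problems(tfm: str, packages: dict) -> list:
    """Apply the two rules above to a {package_id: version} mapping."""
    problems = []
    expected = tfm_major(tfm)
    for name, version in sorted(packages.items()):
        major = int(version.split(".")[0])
        if name.startswith("Microsoft.Extensions.") and major != expected:
            problems.append(f"{name} {version}: expected {expected}.x for {tfm}")
        # Constraint quoted in the text: MongoDB.Driver requires AWSSDK.Core < 4.0.
        if name == "AWSSDK.Core" and "MongoDB.Driver" in packages and major >= 4:
            problems.append(f"{name} {version}: MongoDB.Driver requires < 4.0")
    return problems


# Illustrative versions only, not taken from any real project file.
problems = compatibility_problems("net9.0", {
    "Microsoft.Extensions.Logging": "10.0.0",
    "AWSSDK.Core": "4.0.0.1",
    "MongoDB.Driver": "2.28.0",
})
```

A check like this only catches the rules it encodes. dotnet restore remains the authoritative verification.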

Phase 4: Jira Ticket

Create a ticket using mcp__billo-es-logs__create_bug_ticket with:

  • project_key: ALLPOST (default, ask user if different)
  • component: BE
  • priority: Based on impact (e.g. 2300+ errors/day = Highest)
  • summary: Short, searchable - include error type and affected component
  • description: Uses lightweight formatting that converts to Jira ADF:
    • Lines ending with : become h3 headings (e.g. Problem:)
    • Lines starting with - become bullet lists
    • Text wrapped in ** becomes bold
    • Everything else is a paragraph

Example description:
Problem:
DownloadAndSendInvoiceCommandHandler fails with 409 BlobAlreadyExists

Impact:
- 2300+ errors in the last 24 hours
- Affects both regular and **reminder** invoices

Root Cause:
- AzureStorage.StoreFileAsync calls blobClient.UploadAsync() without overwrite flag
- No idempotency check in the handler

Fix:
Add idempotency guard to check **InvoiceTransaction** status before uploading

Files:
- Billo.Platform.Payment.Business/Commands/Handlers/DownloadAndSendInvoiceCommandHandler.cs
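
To make the formatting rules concrete, here is a rough sketch of how such text could map onto Atlassian Document Format nodes. It is not the ticket tool's actual converter. Dropping the trailing colon from headings and merging consecutive bullets into one list are assumptions.

```python
import re


def _inline(text: str) -> list:
    """Split text into ADF text nodes, marking **bold** spans as strong."""
    nodes = []
    for i, part in enumerate(re.split(r"\*\*(.+?)\*\*", text)):
        if not part:
            continue
        node = {"type": "text", "text": part}
        if i % 2 == 1:  # odd indexes are the captured bold spans
            node["marks"] = [{"type": "strong"}]
        nodes.append(node)
    return nodes


def to_adf(description: str) -> dict:
    """Convert the lightweight description format into an ADF document."""
    content = []
    for line in description.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("- "):
            item = {"type": "listItem", "content": [
                {"type": "paragraph", "content": _inline(line[2:])}]}
            # Consecutive bullets share one bulletList node.
            if content and content[-1]["type"] == "bulletList":
                content[-1]["content"].append(item)
            else:
                content.append({"type": "bulletList", "content": [item]})
        elif line.endswith(":"):
            # Assumption: the trailing colon is not part of the heading text.
            content.append({"type": "heading", "attrs": {"level": 3},
                            "content": _inline(line[:-1])})
        else:
            content.append({"type": "paragraph", "content": _inline(line)})
    return {"version": 1, "type": "doc", "content": content}


doc = to_adf("Problem:\nUpload fails with 409\n\n"
             "Impact:\n- 2300+ errors\n- Affects **reminder** invoices")
node_types = [node["type"] for node in doc["content"]]
```

Here `node_types` is `["heading", "paragraph", "heading", "bulletList"]`, which mirrors how the example description above is expected to render in Jira.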

If the API returns 400, likely causes:

  • Missing required field (e.g. component)
  • Invalid priority value
  • Wrong project_key

Use mcp__billo-es-logs__search_tickets with an existing ticket key to discover required fields.
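
A local pre-flight check can catch two of the 400 causes listed above before the ticket is submitted. Both constants are assumptions: the required-field list simply mirrors the fields named in this phase, and the priority names are Jira's stock scheme, which a given instance may have customised. A wrong project_key cannot be detected locally.

```python
# Assumption: the fields named in this phase are the required ones.
REQUIRED_FIELDS = ("project_key", "component", "priority", "summary", "description")
# Assumption: Jira's stock priority scheme; an instance may use other names.
KNOWN_PRIORITIES = {"Highest", "High", "Medium", "Low", "Lowest"}


def preflight(ticket: dict) -> list:
    """Return a list of likely problems; an empty list means none were found."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if not ticket.get(name)]
    priority = ticket.get("priority")
    if priority and priority not in KNOWN_PRIORITIES:
        problems.append(f"unknown priority: {priority}")
    return problems


issues = preflight({
    "project_key": "ALLPOST",
    "priority": "Urgent",  # deliberately invalid for the demo
    "summary": "409 BlobAlreadyExists in invoice upload",
    "description": "Problem:\nUpload fails",
})
```

An empty result does not guarantee that the API accepts the ticket. Looking up an existing ticket remains the reliable way to discover field requirements.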

Phase 5: Branch & Commit

  1. Create branch using the naming convention {prefix}/{TICKET_KEY}_{description}:
    bug/ALLPOST-4228_fix-invoice-upload-blob-already-exists
    fix/ALLPOST-4230_crash
    feature/ALLPOST-4028_login-page
    feat/ALLPOST-4028_login-page
    chore/ALLPOST-4031_cleanup
    
    Choose the prefix that best matches the work type. Any prefix is valid.
  2. Stage only the changed files - never git add .
  3. Commit with conventional commit format:
    fix: {description} ({TICKET_KEY})
    
    {Brief explanation of what and why}
    
  4. Ask before pushing - do not push without user confirmation
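
The branch and commit conventions can be captured in two small helpers. This is a sketch: the slug rule (lower case, hyphen-separated) is inferred from the example branch names rather than stated anywhere, and the function names are hypothetical.

```python
import re


def branch_name(prefix: str, ticket_key: str, description: str) -> str:
    """Build a branch name with a lower-case, hyphen-separated description."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{prefix}/{ticket_key}_{slug}"


def commit_message(description: str, ticket_key: str, body: str,
                   kind: str = "fix") -> str:
    """Build a conventional commit subject followed by a short body."""
    return f"{kind}: {description} ({ticket_key})\n\n{body}\n"


branch = branch_name("bug", "ALLPOST-4228",
                     "Fix invoice upload: blob already exists")
message = commit_message("guard invoice upload against retries", "ALLPOST-4228",
                         "Skip the upload when the transaction is already stored.")
```

With these inputs, `branch` is `bug/ALLPOST-4228_fix-invoice-upload-blob-already-exists`, matching the first example above.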

Phase 6: Create PR

Create PR using Azure DevOps CLI:

az repos pr create \
  --org "https://dev.azure.com/billodev" \
  --project "Billo App Platform" \
  --detect false \
  --repository "{REPO_NAME}" \
  --source-branch "{BRANCH}" \
  --target-branch "develop" \
  --title "{type}: {description} ({TICKET_KEY})" \
  --description "{summary of changes}"

Notes:

  • --project is required; the command fails without it
  • --detect false avoids auto-detection issues
  • Return the PR URL to the user when done
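
When the command is assembled programmatically, building it as an argument list avoids quoting mistakes around the project name, which contains spaces. The sketch below only constructs and prints the command and does not run it. `pr_create_argv` is a hypothetical helper, and the repository and branch values in the usage example are placeholders.

```python
import shlex


def pr_create_argv(repo: str, branch: str, title: str, description: str,
                   target: str = "develop") -> list:
    """Assemble the az repos pr create call as an argv list."""
    return [
        "az", "repos", "pr", "create",
        "--org", "https://dev.azure.com/billodev",
        "--project", "Billo App Platform",
        "--detect", "false",
        "--repository", repo,
        "--source-branch", branch,
        "--target-branch", target,
        "--title", title,
        "--description", description,
    ]


argv = pr_create_argv(
    repo="example-repo",  # placeholder, not a real repository name
    branch="bug/ALLPOST-4228_fix-invoice-upload-blob-already-exists",
    title="fix: guard invoice upload against retries (ALLPOST-4228)",
    description="Add idempotency guard before upload",
)
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run` as a list, without a shell, hands each value to the CLI as a single argument regardless of spaces.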

Tools Reference

| Phase | Tool | Purpose |
|---|---|---|
| Log search | mcp__billo-es-logs__search_logs | Search with query, time range, level, application |
| Impact | mcp__billo-es-logs__search_logs with count_only: true | Count matching errors |
| Patterns | mcp__billo-es-logs__search_logs with sample: true | Random sample from large result sets |
| Source code | Read, Glob, Grep | Find and read source files |
| Ticket lookup | mcp__billo-es-logs__search_tickets | Find existing tickets or discover field requirements |
| Ticket create | mcp__billo-es-logs__create_bug_ticket | Create Jira bug ticket |
| Git | Bash | Branch, commit, push |
| PR | az repos pr create | Create Azure DevOps pull request |

Tips

  • Always search logs before reading code - the logs tell you where to look
  • Use Correlation-ID to trace a single request across services
  • When errors are intermittent, the root cause is often in retry/concurrency behavior, not in the happy path
  • When updating shared NuGet packages, always verify transitive dependency compatibility with downstream projects before publishing
  • Flag edge cases to the user rather than silently ignoring them