# Eval Harness
Evaluations as unit tests for AI development.
## Structure

```
evals/
  capability/   # Test that new functionality works
  regression/   # Ensure existing features stay intact
```
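A minimal sketch of discovering evals in that layout. The `evals` root and per-suite subdirectories come from the tree above; the function name is illustrative, not part of the harness.

```python
from pathlib import Path

def discover_evals(root: str = "evals") -> list[Path]:
    """Collect every markdown eval file under the suite subdirectories
    (capability/, regression/, or any future suite)."""
    return sorted(Path(root).glob("*/*.md"))
```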
## Eval File Format

Each eval is a markdown file:

```markdown
# Eval: [name]

## Task
[Clear, unambiguous task description]

## Success Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Grader
Type: code | model | human
Method: [how to verify]

## Baseline
pass@3 target: >90%
```
## Metrics

- **pass@k**: at least 1 of k attempts succeeds (use when it "just needs to work")
- **pass^k**: all k attempts must succeed (use when consistency is essential)
## Workflow
- Define eval BEFORE writing code
- Run eval after implementation
- Fix failures before proceeding
- Add regression evals for each bug fix
## Getting Started
- Start with 20-50 real-world test cases from actual failures
- Convert user-reported bugs into eval cases
- Build balanced sets (test should AND should-not behaviors)
- Each trial starts from a clean environment
- Grade output, not the path taken
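"Grade output, not the path taken" can be made concrete with a `code`-type grader that inspects only the final artifact, never the steps that produced it. A sketch under assumptions: the output is a JSON file and the criterion is a set of required keys; `grade_output` is a hypothetical helper.

```python
import json

def grade_output(output_path: str, expected_keys: set[str]) -> bool:
    """Pass if the produced JSON file contains every required key.
    Nothing about the agent's intermediate steps is examined."""
    with open(output_path) as f:
        result = json.load(f)
    return expected_keys <= result.keys()
```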