Yaojia Wang 2876cca8fe chore: initial backup of Claude Code configuration
Includes: CLAUDE.md, settings.json, agents, commands, rules, skills,
hooks, contexts, evals, get-shit-done, plugin configs (installed list
and marketplace sources). Excludes credentials, runtime caches,
telemetry, session data, and plugin binary cache.
2026-03-24 22:26:05 +01:00


# Eval Harness

Evaluations as unit tests for AI development.

## Structure

```
evals/
  capability/    # Test that new functionality works
  regression/    # Ensure existing features stay intact
```

## Eval File Format

Each eval is a markdown file:

```markdown
# Eval: [name]

## Task
[Clear, unambiguous task description]

## Success Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Grader
Type: code | model | human
Method: [how to verify]

## Baseline
pass@3 target: >90%
```
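For graders of type `code`, verification is just a script that checks each success criterion and reports pass/fail. A minimal sketch, where the two criterion checks are hypothetical placeholders for whatever a real eval's `## Success Criteria` section lists:

```python
def grade(output: str) -> dict[str, bool]:
    """Check each success criterion against the model's output.

    The two criteria below are illustrative placeholders, not
    checks defined by this harness.
    """
    return {
        "non-empty output": bool(output.strip()),
        "mentions eval name": "Eval" in output,
    }

result = grade("# Eval: demo run")
print(all(result.values()))  # True only if every criterion passed
```

Returning a per-criterion dict rather than a single boolean makes failure reports actionable: the runner can show exactly which checkbox failed.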

## Metrics

- **pass@k**: at least one of k attempts succeeds (use when the task just needs to work)
- **pass^k**: all k attempts must succeed (use when consistency is essential)
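Both metrics can be computed from the same set of boolean trial outcomes; a minimal sketch (function names are illustrative, not part of the harness):

```python
def pass_at_k(trials: list[bool]) -> bool:
    """pass@k: succeeds if at least one of the k trials passed."""
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    """pass^k: succeeds only if all k trials passed."""
    return all(trials)

# Three attempts at the same eval: two passed, one failed.
trials = [True, False, True]
print(pass_at_k(trials))   # True  -- enough when it just needs to work
print(pass_hat_k(trials))  # False -- not enough when consistency matters
```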

## Workflow

1. Define the eval BEFORE writing code
2. Run the eval after implementation
3. Fix failures before proceeding
4. Add a regression eval for each bug fix

## Getting Started

1. Start with 20-50 real-world test cases drawn from actual failures
2. Convert user-reported bugs into eval cases
3. Build balanced sets (test both should and should-not behaviors)
4. Start each trial from a clean environment
5. Grade the output, not the path taken
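The last two points can be sketched as a trial loop that creates a fresh temporary directory per attempt and inspects only the final artifact. Everything here (`run_trial`, `grade`, `demo_task`) is a hypothetical stand-in for an actual task runner and grader:

```python
import tempfile
from pathlib import Path

def grade(output: str) -> bool:
    # Placeholder grader: only the final output is inspected,
    # never the intermediate steps the attempt took.
    return "done" in output

def run_trial(task) -> bool:
    # Each trial starts from a clean, throwaway working directory,
    # so no state leaks between attempts.
    with tempfile.TemporaryDirectory() as workdir:
        output = task(Path(workdir))
        return grade(output)

def demo_task(workdir: Path) -> str:
    # Toy stand-in for an AI-executed task.
    (workdir / "result.txt").write_text("done")
    return (workdir / "result.txt").read_text()

print(all(run_trial(demo_task) for _ in range(3)))  # pass^3 on this toy task
```

Grading only the returned output keeps evals robust to changes in how the task is accomplished, which is exactly what "grade output, not the path taken" asks for.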