# Eval Harness
Evaluations as unit tests for AI development.
## Structure

```
evals/
  capability/   # Test that new functionality works
  regression/   # Ensure existing features stay intact
```
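A minimal sketch of discovering evals in that layout. The `evals` root and per-suite subdirectories come from the tree above; the function name is illustrative, not part of the harness.

```python
from pathlib import Path

def discover_evals(root: str = "evals") -> list[Path]:
    """Collect every markdown eval file under the suite subdirectories
    (capability/, regression/, or any future suite)."""
    return sorted(Path(root).glob("*/*.md"))
```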
## Eval File Format

Each eval is a markdown file:

```markdown
# Eval: [name]

## Task
[Clear, unambiguous task description]

## Success Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Grader
Type: code | model | human
Method: [how to verify]

## Baseline
pass@3 target: >90%
```
## Metrics

- **pass@k**: at least 1 of k attempts succeeds (use when it "just needs to work")
- **pass^k**: all k attempts must succeed (use when consistency is essential)
## Workflow
- Define eval BEFORE writing code
- Run eval after implementation
- Fix failures before proceeding
- Add regression evals for each bug fix
## Getting Started
- Start with 20-50 real-world test cases from actual failures
- Convert user-reported bugs into eval cases
- Build balanced sets (test should AND should-not behaviors)
- Each trial starts from a clean environment
- Grade output, not the path taken
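"Grade output, not the path taken" can be made concrete with a `code`-type grader that inspects only the final artifact, never the steps that produced it. A sketch under assumptions: the output is a JSON file and the criterion is a set of required keys; `grade_output` is a hypothetical helper.

```python
import json

def grade_output(output_path: str, expected_keys: set[str]) -> bool:
    """Pass if the produced JSON file contains every required key.
    Nothing about the agent's intermediate steps is examined."""
    with open(output_path) as f:
        result = json.load(f)
    return expected_keys <= result.keys()
```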