Includes: CLAUDE.md, settings.json, agents, commands, rules, skills, hooks, contexts, evals, get-shit-done, plugin configs (installed list and marketplace sources). Excludes credentials, runtime caches, telemetry, session data, and plugin binary cache.
## Eval Harness
Evaluations as unit tests for AI development.
### Structure
```
evals/
  capability/    # Test new functionality works
  regression/    # Ensure existing features stay intact
```
### Eval File Format
Each eval is a markdown file:
```
# Eval: [name]

## Task
[Clear, unambiguous task description]

## Success Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Grader
Type: code | model | human
Method: [how to verify]

## Baseline
pass@3 target: >90%
```
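For illustration, a filled-in eval might look like this (the feature, file paths, and test command are hypothetical):

```
# Eval: slugify-handles-unicode

## Task
Add a `slugify(title)` helper that lowercases, strips accents, and joins words with hyphens.

## Success Criteria
- [ ] `slugify("Héllo Wörld")` returns `"hello-world"`
- [ ] Existing string utilities keep their current behavior

## Grader
Type: code
Method: run `pytest tests/test_slugify.py`

## Baseline
pass@3 target: >90%
```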
### Metrics
- `pass@k`: At least 1 of k attempts succeeds (use when it "just needs to work")
- `pass^k`: All k attempts must succeed (use when consistency is essential)
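A minimal sketch of how the two aggregate, assuming each of the k trials is recorded as a boolean pass/fail (the function names are just for illustration):

```python
def pass_at_k(trials: list[bool]) -> bool:
    """pass@k: the eval passes if at least one of the k trials succeeded."""
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    """pass^k: the eval passes only if every one of the k trials succeeded."""
    return all(trials)

# Example: 3 independent trials of the same eval.
trials = [True, False, True]
print(pass_at_k(trials))   # True  -- good enough when it "just needs to work"
print(pass_hat_k(trials))  # False -- fails when consistency is essential
```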
### Workflow
- Define eval BEFORE writing code
- Run eval after implementation
- Fix failures before proceeding
- Add regression evals for each bug fix
### Getting Started
- Start with 20-50 real-world test cases from actual failures
- Convert user-reported bugs into eval cases
- Build balanced sets (test both should and should-not behaviors)
- Each trial starts from a clean environment (see the runner sketch below)
- Grade output, not the path taken
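As a sketch of those last two points, a runner can give each trial its own temporary working directory and grade only what the trial leaves behind. `run_agent` and `grade` below are hypothetical placeholders for however you execute the task and check the success criteria:

```python
import shutil
import tempfile
from pathlib import Path

def run_agent(task: str, workdir: Path) -> None:
    """Placeholder: hand the task to whatever performs it (agent, script, person)."""
    raise NotImplementedError

def grade(eval_file: Path, workdir: Path) -> bool:
    """Placeholder: check the eval's success criteria against what is now in workdir."""
    raise NotImplementedError

def run_trials(eval_file: Path, k: int = 3) -> list[bool]:
    """Run one eval k times, each trial in a fresh temp dir, grading only the output."""
    task = eval_file.read_text()
    results = []
    for _ in range(k):
        workdir = Path(tempfile.mkdtemp(prefix="eval-"))  # clean environment per trial
        try:
            run_agent(task, workdir)
            results.append(grade(eval_file, workdir))     # grade output, not the path taken
        finally:
            shutil.rmtree(workdir, ignore_errors=True)    # no state carries over
    return results

if __name__ == "__main__":
    results = run_trials(Path("evals/capability/example.md"))
    print(f"pass@{len(results)}: {any(results)}  pass^{len(results)}: {all(results)}")
```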