# Eval Harness

Evaluations as unit tests for AI development.

## Structure

```
evals/
  capability/   # Test new functionality works
  regression/   # Ensure existing features stay intact
```

## Eval File Format

Each eval is a markdown file:

```markdown
# Eval: [name]

## Task
[Clear, unambiguous task description]

## Success Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Grader
Type: code | model | human
Method: [how to verify]

## Baseline
pass@3 target: >90%
```

## Metrics

- `pass@k`: At least 1 of k attempts succeeds (use when "just needs to work")
- `pass^k`: All k attempts must succeed (use when consistency is essential)

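Over a single eval's trials, the two metrics reduce to `any` and `all`; aggregated across many evals they become pass rates. A short sketch (function names are illustrative):

```python
def pass_at_k(trials: list[bool]) -> bool:
    """pass@k: at least 1 of the k attempts succeeded."""
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    """pass^k: all k attempts succeeded."""
    return all(trials)

def pass_rate(per_eval_trials: list[list[bool]], metric) -> float:
    """Fraction of evals whose k trials satisfy the given metric."""
    return sum(metric(t) for t in per_eval_trials) / len(per_eval_trials)
```

For example, with trials `[[True, False, False], [True, True, True]]`, the pass@k rate is 1.0 while the pass^k rate is 0.5, which is why pass^k is the stricter target for consistency-critical behavior.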
## Workflow

1. Define eval BEFORE writing code
2. Run eval after implementation
3. Fix failures before proceeding
4. Add regression evals for each bug fix

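Steps 2–3 can be automated with a small runner that executes every eval under both directories and refuses to proceed on any failure. A sketch, where `run_one` is a hypothetical callable (eval path → bool) that executes that eval's grader:

```python
from pathlib import Path

def run_all(evals_dir: str, run_one) -> bool:
    """Run every eval under capability/ and regression/; True iff all pass."""
    ok = True
    for sub in ("capability", "regression"):
        for path in sorted(Path(evals_dir, sub).glob("*.md")):
            passed = run_one(path)
            print(f"{'PASS' if passed else 'FAIL'}: {sub}/{path.name}")
            ok = ok and passed
    return ok

# In CI, gate the next step on the result, e.g.:
#   sys.exit(0 if run_all("evals", grade_eval) else 1)
```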
## Getting Started

1. Start with 20-50 real-world test cases from actual failures
2. Convert user-reported bugs into eval cases
3. Build balanced sets (test should AND should-not behaviors)
4. Each trial starts from a clean environment
5. Grade output, not the path taken
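
Points 4 and 5 above can be sketched as a trial runner that copies a fixture into a fresh temporary directory, so no state leaks between attempts, and returns only the command's output for grading (`run_trial` and its parameters are illustrative assumptions):

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def run_trial(fixture_dir: str, command: list[str]) -> str:
    """Run one trial from a clean copy of the fixture; return only the output."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "work"
        # Fresh copy per trial: the fixture itself is never mutated.
        shutil.copytree(fixture_dir, work)
        result = subprocess.run(command, cwd=work, capture_output=True, text=True)
        # Grade stdout, not the steps taken inside the working directory.
        return result.stdout
```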