chore: initial backup of Claude Code configuration
Includes: CLAUDE.md, settings.json, agents, commands, rules, skills, hooks, contexts, evals, get-shit-done, plugin configs (installed list and marketplace sources). Excludes credentials, runtime caches, telemetry, session data, and plugin binary cache.
This commit is contained in:
52
evals/README.md
Normal file
52
evals/README.md
Normal file
@@ -0,0 +1,52 @@
|
||||
# Eval Harness
|
||||
|
||||
Evaluations as unit tests for AI development.
|
||||
|
||||
## Structure
|
||||
|
||||
```
|
||||
evals/
|
||||
capability/ # Test new functionality works
|
||||
regression/ # Ensure existing features stay intact
|
||||
```
|
||||
|
||||
## Eval File Format
|
||||
|
||||
Each eval is a markdown file:
|
||||
|
||||
```markdown
|
||||
# Eval: [name]
|
||||
## Task
|
||||
[Clear, unambiguous task description]
|
||||
|
||||
## Success Criteria
|
||||
- [ ] Criterion 1
|
||||
- [ ] Criterion 2
|
||||
|
||||
## Grader
|
||||
Type: code | model | human
|
||||
Method: [how to verify]
|
||||
|
||||
## Baseline
|
||||
pass@3 target: >90%
|
||||
```
|
||||
|
||||
## Metrics
|
||||
|
||||
- `pass@k`: At least 1 of k attempts succeeds (use when "just needs to work")
|
||||
- `pass^k`: All k attempts must succeed (use when consistency is essential)
|
||||
|
||||
## Workflow
|
||||
|
||||
1. Define eval BEFORE writing code
|
||||
2. Run eval after implementation
|
||||
3. Fix failures before proceeding
|
||||
4. Add regression evals for each bug fix
|
||||
|
||||
## Getting Started
|
||||
|
||||
1. Start with 20-50 real-world test cases from actual failures
|
||||
2. Convert user-reported bugs into eval cases
|
||||
3. Build balanced sets (test should AND should-not behaviors)
|
||||
4. Each trial starts from clean environment
|
||||
5. Grade output, not the path taken
|
||||
Reference in New Issue
Block a user