chore: initial backup of Claude Code configuration

Includes: CLAUDE.md, settings.json, agents, commands, rules, skills,
hooks, contexts, evals, get-shit-done, plugin configs (installed list
and marketplace sources). Excludes credentials, runtime caches,
telemetry, session data, and plugin binary cache.
This commit is contained in:
Yaojia Wang
2026-03-24 22:26:05 +01:00
commit 2876cca8fe
245 changed files with 54437 additions and 0 deletions

52
evals/README.md Normal file
View File

@@ -0,0 +1,52 @@
# Eval Harness
Evaluations as unit tests for AI development.
## Structure
```
evals/
capability/ # Test new functionality works
regression/ # Ensure existing features stay intact
```
## Eval File Format
Each eval is a markdown file:
```markdown
# Eval: [name]
## Task
[Clear, unambiguous task description]
## Success Criteria
- [ ] Criterion 1
- [ ] Criterion 2
## Grader
Type: code | model | human
Method: [how to verify]
## Baseline
pass@3 target: >90%
```
## Metrics
- `pass@k`: At least 1 of k attempts succeeds (use when "just needs to work")
- `pass^k`: All k attempts must succeed (use when consistency is essential)
## Workflow
1. Define eval BEFORE writing code
2. Run eval after implementation
3. Fix failures before proceeding
4. Add regression evals for each bug fix
## Getting Started
1. Start with 20-50 real-world test cases from actual failures
2. Convert user-reported bugs into eval cases
3. Build balanced sets (test should AND should-not behaviors)
4. Each trial starts from clean environment
5. Grade output, not the path taken