chore: initial backup of Claude Code configuration

Includes: CLAUDE.md, settings.json, agents, commands, rules, skills, hooks, contexts, evals, get-shit-done, plugin configs (installed list and marketplace sources). Excludes credentials, runtime caches, telemetry, session data, and plugin binary cache.
2026-03-24 22:26:05 +01:00
commit 2876cca8fe
245 changed files with 54437 additions and 0 deletions
--- a/evals/README.md
+++ b/evals/README.md
@@ -0,0 +1,52 @@
+# Eval Harness
+
+Evaluations as unit tests for AI development.
+
+## Structure
+
+```
+evals/
+  capability/    # Test new functionality works
+  regression/    # Ensure existing features stay intact
+```
+
+## Eval File Format
+
+Each eval is a markdown file:
+
+```markdown
+# Eval: [name]
+## Task
+[Clear, unambiguous task description]
+
+## Success Criteria
+- [ ] Criterion 1
+- [ ] Criterion 2
+
+## Grader
+Type: code | model | human
+Method: [how to verify]
+
+## Baseline
+pass@3 target: >90%
+```
+
+## Metrics
+
+- `pass@k`: At least 1 of k attempts succeeds (use when "just needs to work")
+- `pass^k`: All k attempts must succeed (use when consistency is essential)
+
+## Workflow
+
+1. Define eval BEFORE writing code
+2. Run eval after implementation
+3. Fix failures before proceeding
+4. Add regression evals for each bug fix
+
+## Getting Started
+
+1. Start with 20-50 real-world test cases from actual failures
+2. Convert user-reported bugs into eval cases
+3. Build balanced sets (test should AND should-not behaviors)
+4. Each trial starts from clean environment
+5. Grade output, not the path taken