Add claude config
This commit is contained in:
221
.claude/skills/eval-harness/SKILL.md
Normal file
221
.claude/skills/eval-harness/SKILL.md
Normal file
@@ -0,0 +1,221 @@
|
||||
# Eval Harness Skill
|
||||
|
||||
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
|
||||
|
||||
## Philosophy
|
||||
|
||||
Eval-Driven Development treats evals as the "unit tests of AI development":
|
||||
- Define expected behavior BEFORE implementation
|
||||
- Run evals continuously during development
|
||||
- Track regressions with each change
|
||||
- Use pass@k metrics for reliability measurement
|
||||
|
||||
## Eval Types
|
||||
|
||||
### Capability Evals
|
||||
Test if Claude can do something it couldn't before:
|
||||
```markdown
|
||||
[CAPABILITY EVAL: feature-name]
|
||||
Task: Description of what Claude should accomplish
|
||||
Success Criteria:
|
||||
- [ ] Criterion 1
|
||||
- [ ] Criterion 2
|
||||
- [ ] Criterion 3
|
||||
Expected Output: Description of expected result
|
||||
```
|
||||
|
||||
### Regression Evals
|
||||
Ensure changes don't break existing functionality:
|
||||
```markdown
|
||||
[REGRESSION EVAL: feature-name]
|
||||
Baseline: SHA or checkpoint name
|
||||
Tests:
|
||||
- existing-test-1: PASS/FAIL
|
||||
- existing-test-2: PASS/FAIL
|
||||
- existing-test-3: PASS/FAIL
|
||||
Result: X/Y passed (previously Y/Y)
|
||||
```
|
||||
|
||||
## Grader Types
|
||||
|
||||
### 1. Code-Based Grader
|
||||
Deterministic checks using code:
|
||||
```bash
|
||||
# Check if file contains expected pattern
|
||||
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
|
||||
|
||||
# Check if tests pass
|
||||
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
|
||||
|
||||
# Check if build succeeds
|
||||
npm run build && echo "PASS" || echo "FAIL"
|
||||
```
|
||||
|
||||
### 2. Model-Based Grader
|
||||
Use Claude to evaluate open-ended outputs:
|
||||
```markdown
|
||||
[MODEL GRADER PROMPT]
|
||||
Evaluate the following code change:
|
||||
1. Does it solve the stated problem?
|
||||
2. Is it well-structured?
|
||||
3. Are edge cases handled?
|
||||
4. Is error handling appropriate?
|
||||
|
||||
Score: 1-5 (1=poor, 5=excellent)
|
||||
Reasoning: [explanation]
|
||||
```
|
||||
|
||||
### 3. Human Grader
|
||||
Flag for manual review:
|
||||
```markdown
|
||||
[HUMAN REVIEW REQUIRED]
|
||||
Change: Description of what changed
|
||||
Reason: Why human review is needed
|
||||
Risk Level: LOW/MEDIUM/HIGH
|
||||
```
|
||||
|
||||
## Metrics
|
||||
|
||||
### pass@k
|
||||
"At least one success in k attempts"
|
||||
- pass@1: First attempt success rate
|
||||
- pass@3: Success within 3 attempts
|
||||
- Typical target: pass@3 > 90%
|
||||
|
||||
### pass^k
|
||||
"All k trials succeed"
|
||||
- Higher bar for reliability
|
||||
- pass^3: 3 consecutive successes
|
||||
- Use for critical paths
|
||||
|
||||
## Eval Workflow
|
||||
|
||||
### 1. Define (Before Coding)
|
||||
```markdown
|
||||
## EVAL DEFINITION: feature-xyz
|
||||
|
||||
### Capability Evals
|
||||
1. Can create new user account
|
||||
2. Can validate email format
|
||||
3. Can hash password securely
|
||||
|
||||
### Regression Evals
|
||||
1. Existing login still works
|
||||
2. Session management unchanged
|
||||
3. Logout flow intact
|
||||
|
||||
### Success Metrics
|
||||
- pass@3 > 90% for capability evals
|
||||
- pass^3 = 100% for regression evals
|
||||
```
|
||||
|
||||
### 2. Implement
|
||||
Write code to pass the defined evals.
|
||||
|
||||
### 3. Evaluate
|
||||
```bash
|
||||
# Run capability evals
|
||||
[Run each capability eval, record PASS/FAIL]
|
||||
|
||||
# Run regression evals
|
||||
npm test -- --testPathPattern="existing"
|
||||
|
||||
# Generate report
|
||||
```
|
||||
|
||||
### 4. Report
|
||||
```markdown
|
||||
EVAL REPORT: feature-xyz
|
||||
========================
|
||||
|
||||
Capability Evals:
|
||||
create-user: PASS (pass@1)
|
||||
validate-email: PASS (pass@2)
|
||||
hash-password: PASS (pass@1)
|
||||
Overall: 3/3 passed
|
||||
|
||||
Regression Evals:
|
||||
login-flow: PASS
|
||||
session-mgmt: PASS
|
||||
logout-flow: PASS
|
||||
Overall: 3/3 passed
|
||||
|
||||
Metrics:
|
||||
pass@1: 67% (2/3)
|
||||
pass@3: 100% (3/3)
|
||||
|
||||
Status: READY FOR REVIEW
|
||||
```
|
||||
|
||||
## Integration Patterns
|
||||
|
||||
### Pre-Implementation
|
||||
```
|
||||
/eval define feature-name
|
||||
```
|
||||
Creates eval definition file at `.claude/evals/feature-name.md`
|
||||
|
||||
### During Implementation
|
||||
```
|
||||
/eval check feature-name
|
||||
```
|
||||
Runs current evals and reports status
|
||||
|
||||
### Post-Implementation
|
||||
```
|
||||
/eval report feature-name
|
||||
```
|
||||
Generates full eval report
|
||||
|
||||
## Eval Storage
|
||||
|
||||
Store evals in project:
|
||||
```
|
||||
.claude/
|
||||
evals/
|
||||
feature-xyz.md # Eval definition
|
||||
feature-xyz.log # Eval run history
|
||||
baseline.json # Regression baselines
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Define evals BEFORE coding** - Forces clear thinking about success criteria
|
||||
2. **Run evals frequently** - Catch regressions early
|
||||
3. **Track pass@k over time** - Monitor reliability trends
|
||||
4. **Use code graders when possible** - Deterministic > probabilistic
|
||||
5. **Human review for security** - Never fully automate security checks
|
||||
6. **Keep evals fast** - Slow evals don't get run
|
||||
7. **Version evals with code** - Evals are first-class artifacts
|
||||
|
||||
## Example: Adding Authentication
|
||||
|
||||
```markdown
|
||||
## EVAL: add-authentication
|
||||
|
||||
### Phase 1: Define (10 min)
|
||||
Capability Evals:
|
||||
- [ ] User can register with email/password
|
||||
- [ ] User can login with valid credentials
|
||||
- [ ] Invalid credentials rejected with proper error
|
||||
- [ ] Sessions persist across page reloads
|
||||
- [ ] Logout clears session
|
||||
|
||||
Regression Evals:
|
||||
- [ ] Public routes still accessible
|
||||
- [ ] API responses unchanged
|
||||
- [ ] Database schema compatible
|
||||
|
||||
### Phase 2: Implement (varies)
|
||||
[Write code]
|
||||
|
||||
### Phase 3: Evaluate
|
||||
Run: /eval check add-authentication
|
||||
|
||||
### Phase 4: Report
|
||||
EVAL REPORT: add-authentication
|
||||
==============================
|
||||
Capability: 5/5 passed (pass@3: 100%)
|
||||
Regression: 3/3 passed (pass^3: 100%)
|
||||
Status: SHIP IT
|
||||
```
|
||||
Reference in New Issue
Block a user