# Eval Harness

Evaluations as unit tests for AI development.

## Structure

```
evals/
  capability/   # Test new functionality works
  regression/   # Ensure existing features stay intact
```

## Eval File Format

Each eval is a markdown file:

```markdown
# Eval: [name]

## Task
[Clear, unambiguous task description]

## Success Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Grader
Type: code | model | human
Method: [how to verify]

## Baseline
pass@3 target: >90%
```

## Metrics

- `pass@k`: at least 1 of k attempts succeeds (use when the task just needs to work)
- `pass^k`: all k attempts must succeed (use when consistency is essential)

## Workflow

1. Define the eval BEFORE writing code
2. Run the eval after implementation
3. Fix failures before proceeding
4. Add a regression eval for each bug fix

## Getting Started

1. Start with 20-50 real-world test cases drawn from actual failures
2. Convert user-reported bugs into eval cases
3. Build balanced sets (test should AND should-not behaviors)
4. Start each trial from a clean environment
5. Grade the output, not the path taken
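
The pass@k and pass^k metrics above can be sketched as simple predicates over a list of per-attempt pass/fail results. This is a minimal illustration, not part of the harness; the function names and the sample `trials` data are hypothetical:

```python
def pass_at_k(results, k):
    """pass@k: True if at least one of the first k attempts succeeded."""
    return any(results[:k])

def pass_all_k(results, k):
    """pass^k: True only if all of the first k attempts succeeded."""
    return all(results[:k])

# Hypothetical outcome of three independent trials of one eval:
# True = the attempt met all success criteria.
trials = [False, True, True]

print(pass_at_k(trials, 3))   # "just needs to work" framing → True
print(pass_all_k(trials, 3))  # consistency framing → False
```

The same trial data passes under pass@3 but fails under pass^3, which is why the file format records which metric a baseline target refers to.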