# Eval Harness

Evaluations as unit tests for AI development.

## Structure

```
evals/
  capability/   # Test new functionality works
  regression/   # Ensure existing features stay intact
```

## Eval File Format

Each eval is a markdown file:

```markdown
# Eval: [name]

## Task
[Clear, unambiguous task description]

## Success Criteria
- [ ] Criterion 1
- [ ] Criterion 2

## Grader
Type: code | model | human
Method: [how to verify]

## Baseline
pass@3 target: >90%
```

## Metrics

- `pass@k`: at least 1 of k attempts succeeds (use when the task just needs to work)
- `pass^k`: all k attempts must succeed (use when consistency is essential)

## Workflow

1. Define the eval BEFORE writing code
2. Run the eval after implementation
3. Fix failures before proceeding
4. Add a regression eval for each bug fix

## Getting Started

1. Start with 20-50 real-world test cases drawn from actual failures
2. Convert user-reported bugs into eval cases
3. Build balanced sets (test should AND should-not behaviors)
4. Start each trial from a clean environment
5. Grade the output, not the path taken
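
The pass@k and pass^k metrics above can be sketched as simple predicates over a list of per-attempt pass/fail results. This is a minimal illustration, not part of the harness; the function names and the sample `trials` data are hypothetical:

```python
def pass_at_k(results, k):
    """pass@k: True if at least one of the first k attempts succeeded."""
    return any(results[:k])

def pass_all_k(results, k):
    """pass^k: True only if all of the first k attempts succeeded."""
    return all(results[:k])

# Hypothetical outcome of three independent trials of one eval:
# True = the attempt met all success criteria.
trials = [False, True, True]

print(pass_at_k(trials, 3))   # "just needs to work" framing → True
print(pass_all_k(trials, 3))  # consistency framing → False
```

The same trial data passes under pass@3 but fails under pass^3, which is why the file format records which metric a baseline target refers to.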