Skip to content

Commit 0c769f0

Browse files
WIP: Refactor + golden/integration + different models
1 parent 4ea8b69 commit 0c769f0

File tree

11 files changed

+295
-103
lines changed

11 files changed

+295
-103
lines changed

tests/eval/DESIGN.md

Lines changed: 107 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -13,15 +13,20 @@ tests/eval/
1313
├── eval_test.go # Main Ginkgo test suite
1414
├── DESIGN.md # This file
1515
├── testdata/
16-
│ ├── missing-optional-doc/
17-
│ │ ├── patch.diff # Git patch to apply
18-
│ │ └── expected.txt # Expected issues (one per line)
19-
│ ├── undocumented-enum/
20-
│ │ ├── patch.diff
21-
│ │ └── expected.txt
22-
│ └── valid-api-change/
23-
│ ├── patch.diff
24-
│ └── expected.txt # Empty file = no issues expected
16+
│ ├── golden/ # Single-issue tests
17+
│ │ ├── missing-optional-doc/
18+
│ │ │ ├── patch.diff
19+
│ │ │ └── expected.txt
20+
│ │ ├── undocumented-enum/
21+
│ │ │ ├── patch.diff
22+
│ │ │ └── expected.txt
23+
│ │ └── valid-api-change/
24+
│ │ ├── patch.diff
25+
│ │ └── expected.txt # Empty file = no issues expected
26+
│ └── integration/ # Multi-issue tests
27+
│ └── new-field-multiple-issues/
28+
│ ├── patch.diff
29+
│ └── expected.txt
2530
```
2631

2732
## Test Case Format
@@ -148,18 +153,105 @@ These are copied into the temp clone so that any local modifications to the revi
148153

149154
---
150155

151-
## Phase 2 (Future Work)
156+
## Phase 2
157+
158+
### Cost Tracking
159+
160+
Use `--output-format json` to capture `total_cost_usd` from each Claude invocation. Accumulate across all calls (review + judge) and print the total in `AfterSuite`.
161+
162+
### Test Structure Reorganization ✅ IMPLEMENTED
163+
164+
Reorganize `testdata/` into two categories:
165+
166+
```
167+
tests/eval/testdata/
168+
├── golden/ # Base truth tests - single isolated issues
169+
│ ├── missing-optional-doc/
170+
│ │ ├── patch.diff # Triggers ONLY missing-optional-doc
171+
│ │ └── expected.txt
172+
│ ├── undocumented-enum/
173+
│ │ ├── patch.diff # Triggers ONLY undocumented-enum
174+
│ │ └── expected.txt
175+
│ ├── missing-featuregate/
176+
│ │ ├── patch.diff # Triggers ONLY missing-featuregate
177+
│ │ └── expected.txt
178+
│ └── valid-api-change/
179+
│ ├── patch.diff # Triggers NO issues
180+
│ └── expected.txt
181+
└── integration/ # Complex scenarios - multiple issues
182+
├── new-field-all-issues/
183+
│ ├── patch.diff # Triggers multiple issues together
184+
│ └── expected.txt
185+
└── partial-documentation/
186+
├── patch.diff
187+
└── expected.txt
188+
```
189+
190+
**Golden tests**: Each patch is carefully crafted to trigger exactly one issue type. These validate that the review command correctly identifies individual issue categories in isolation.
191+
192+
**Integration tests**: Patches that trigger multiple issues, testing the review command's ability to identify combinations of problems in realistic scenarios.
193+
194+
### Model Selection ✅ IMPLEMENTED
195+
196+
Each test tier has a default model, overridable via environment variable:
197+
198+
| Test Type | Default Model | Override Env Var |
199+
|-----------|---------------|------------------|
200+
| Golden tests | Sonnet | `EVAL_GOLDEN_MODEL` |
201+
| Integration tests | Opus | `EVAL_INTEGRATION_MODEL` |
202+
| Judge LLM | Haiku | `EVAL_JUDGE_MODEL` |
203+
204+
The test suite reads these at startup and applies per-tier:
205+
206+
```go
207+
goldenModel := getEnvOrDefault("EVAL_GOLDEN_MODEL", "claude-sonnet-4-5@20250929")
208+
integrationModel := getEnvOrDefault("EVAL_INTEGRATION_MODEL", "claude-opus-4-5@20251101")
209+
judgeModel := getEnvOrDefault("EVAL_JUDGE_MODEL", "claude-haiku-4-5-20251001")
210+
```
211+
212+
Usage:
213+
```bash
214+
# Use defaults
215+
go test ./tests/eval/...
216+
217+
# Override golden tests to use Haiku
218+
EVAL_GOLDEN_MODEL=claude-3-haiku-20240307 go test ./tests/eval/...
219+
220+
# Override all models
221+
EVAL_GOLDEN_MODEL=claude-3-haiku-20240307 \
222+
EVAL_INTEGRATION_MODEL=claude-sonnet-4-20250514 \
223+
go test ./tests/eval/...
224+
```
152225

153226
### Patch Stability
154227

155-
Patches may fail to apply as `origin/master` evolves over time. Need a strategy to handle this (e.g., pinning to a specific commit).
228+
Patches may fail to apply as `origin/master` evolves over time. Strategies:
229+
230+
- Pin to a specific commit SHA in the clone step
231+
- Use `git apply --3way` for better conflict handling
232+
- Periodic patch refresh CI job
156233

157234
### Error Handling
158235

159236
Current design does not address failure scenarios:
160237

161238
- Patch application failures
162-
- Claude CLI timeouts or crashes
163-
- Empty or malformed output from Claude
164-
- Authentication failures
165-
- Resource cleanup on test failures
239+
- Resource cleanup on test failures
240+
241+
Using `--output-format json` also enables better error handling in future phases:
242+
243+
- Claude CLI timeouts or crashes (detect via JSON parse failure or missing fields)
244+
- Empty or malformed output (validate JSON structure)
245+
- Authentication failures (check for error fields in JSON response)
246+
247+
### Performance Optimizations
248+
249+
The API review step is the slowest part of the eval suite. Options to improve:
250+
251+
1. **Skip linting by default** - Update api-review command to skip `make lint` unless explicitly requested. Linting adds significant time.
252+
253+
2. **Cache review outputs** - For development, cache the review output keyed by patch hash. Skip re-running if cached result exists. Clear cache on command changes.
254+
255+
3. **Parallel test execution** - Run golden tests in parallel (requires separate repo clones per test).
256+
257+
4. **Smaller/faster model for development** - Use Haiku for rapid iteration, Sonnet/Opus for CI validation.

0 commit comments

Comments
 (0)