@@ -148,18 +153,105 @@ These are copied into the temp clone so that any local modifications to the revi

 ---

-## Phase 2 (Future Work)
+## Phase 2
+
+### Cost Tracking
+
+Use `--output-format json` to capture `total_cost_usd` from each Claude invocation. Accumulate across all calls (review + judge) and print the total in `AfterSuite`.
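
The accumulation described above could look roughly like the following sketch. Only the `total_cost_usd` field comes from the design; the `costTracker` type and its methods are hypothetical names, and the sample payloads in `main` are made-up values standing in for real CLI output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// claudeResult holds the single field this sketch reads from the JSON
// output; every other field is ignored.
type claudeResult struct {
	TotalCostUSD float64 `json:"total_cost_usd"`
}

// costTracker (hypothetical) sums the cost of review and judge calls.
// The mutex keeps it safe if specs ever run concurrently.
type costTracker struct {
	mu    sync.Mutex
	total float64
}

// Record parses one invocation's JSON output and adds its cost to the total.
func (c *costTracker) Record(output []byte) error {
	var r claudeResult
	if err := json.Unmarshal(output, &r); err != nil {
		return fmt.Errorf("parse claude output: %w", err)
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.total += r.TotalCostUSD
	return nil
}

// Total returns the cost accumulated so far.
func (c *costTracker) Total() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.total
}

func main() {
	var tracker costTracker
	// Made-up payloads: one review call and one judge call.
	for _, out := range []string{
		`{"total_cost_usd": 0.25}`,
		`{"total_cost_usd": 0.5}`,
	} {
		if err := tracker.Record([]byte(out)); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Printf("Total eval cost: $%.4f\n", tracker.Total())
}
```

In the suite itself, the final `Printf` would presumably move into `AfterSuite`, with `Record` called after every review and judge invocation.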
+
+### Test Structure Reorganization ✅ IMPLEMENTED
+
+Reorganize `testdata/` into two categories:
+
+```
+tests/eval/testdata/
+├── golden/  # Base truth tests - single isolated issues
+│   ├── missing-optional-doc/
+│   │   ├── patch.diff  # Triggers ONLY missing-optional-doc
+│   │   └── expected.txt
+│   ├── undocumented-enum/
+│   │   ├── patch.diff  # Triggers ONLY undocumented-enum
+│   │   └── expected.txt
+│   ├── missing-featuregate/
+│   │   ├── patch.diff  # Triggers ONLY missing-featuregate
+    │   ├── patch.diff  # Triggers multiple issues together
+    │   └── expected.txt
+    └── partial-documentation/
+        ├── patch.diff
+        └── expected.txt
+```
+
+**Golden tests**: Each patch is carefully crafted to trigger exactly one issue type. These validate that the review command correctly identifies individual issue categories in isolation.
+
+**Integration tests**: Patches that trigger multiple issues, testing the review command's ability to identify combinations of problems in realistic scenarios.
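
One way the suite might enumerate cases from this layout is sketched below. The `evalCase` type and `discoverCases` function are hypothetical, and the sketch assumes every case directory holds exactly `patch.diff` and `expected.txt`. It walks an `fs.FS` so the example runs against an in-memory tree; a real suite would more likely pass `os.DirFS("tests/eval/testdata")`.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// evalCase (hypothetical) describes one test directory.
type evalCase struct {
	Tier     string // "golden" or "integration"
	Name     string // directory name, e.g. "missing-optional-doc"
	Patch    string // path to patch.diff relative to the testdata root
	Expected string // path to expected.txt relative to the testdata root
}

// discoverCases lists every case directory under both tiers and checks
// that each one contains the two files the layout requires.
func discoverCases(fsys fs.FS) ([]evalCase, error) {
	var cases []evalCase
	for _, tier := range []string{"golden", "integration"} {
		entries, err := fs.ReadDir(fsys, tier)
		if err != nil {
			return nil, fmt.Errorf("read tier %q: %w", tier, err)
		}
		for _, e := range entries {
			if !e.IsDir() {
				continue
			}
			c := evalCase{
				Tier:     tier,
				Name:     e.Name(),
				Patch:    path.Join(tier, e.Name(), "patch.diff"),
				Expected: path.Join(tier, e.Name(), "expected.txt"),
			}
			for _, p := range []string{c.Patch, c.Expected} {
				if _, err := fs.Stat(fsys, p); err != nil {
					return nil, fmt.Errorf("incomplete case %s/%s: %w", tier, e.Name(), err)
				}
			}
			cases = append(cases, c)
		}
	}
	return cases, nil
}

// sampleFS is an in-memory stand-in for tests/eval/testdata.
var sampleFS = fstest.MapFS{
	"golden/missing-optional-doc/patch.diff":         {Data: []byte("diff")},
	"golden/missing-optional-doc/expected.txt":       {Data: []byte("expected")},
	"integration/partial-documentation/patch.diff":   {Data: []byte("diff")},
	"integration/partial-documentation/expected.txt": {Data: []byte("expected")},
}

func main() {
	cases, err := discoverCases(sampleFS)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range cases {
		fmt.Printf("%s/%s\n", c.Tier, c.Name)
	}
}
```

Failing fast on an incomplete directory is a design choice of this sketch; skipping such directories with a warning would be an equally reasonable alternative.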
+
+### Model Selection ✅ IMPLEMENTED
+
+Each test tier has a default model, overridable via environment variable:
+
+| Test Type | Default Model | Override Env Var |
+|-----------|---------------|------------------|
+| Golden tests | Sonnet | `EVAL_GOLDEN_MODEL` |
+| Integration tests | Opus | `EVAL_INTEGRATION_MODEL` |
+| Judge LLM | Haiku | `EVAL_JUDGE_MODEL` |
+
+The test suite reads these at startup and applies them per tier:
+EVAL_GOLDEN_MODEL=claude-3-haiku-20240307 go test ./tests/eval/...
+
+# Override the golden and integration models
+EVAL_GOLDEN_MODEL=claude-3-haiku-20240307 \
+EVAL_INTEGRATION_MODEL=claude-sonnet-4-20250514 \
+go test ./tests/eval/...
+```
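
As a rough illustration of the per-tier lookup, the sketch below resolves each tier's model from its environment variable and falls back to a default. The environment variable names come from the table above; the `tier` type, the `modelFor` function, and the lowercase default strings (`sonnet`, `opus`, `haiku`) are assumptions, since the exact default model identifiers are not shown here.

```go
package main

import (
	"fmt"
	"os"
)

// tier (hypothetical) pairs a test tier with its override variable and default.
type tier struct {
	name         string
	envVar       string
	defaultModel string // placeholder alias, not a confirmed model ID
}

var tiers = []tier{
	{"golden", "EVAL_GOLDEN_MODEL", "sonnet"},
	{"integration", "EVAL_INTEGRATION_MODEL", "opus"},
	{"judge", "EVAL_JUDGE_MODEL", "haiku"},
}

// modelFor returns the override when the variable is set and non-empty,
// otherwise the tier's default. The lookup function is injected so the
// logic can be exercised without touching the real environment.
func modelFor(t tier, lookup func(string) (string, bool)) string {
	if v, ok := lookup(t.envVar); ok && v != "" {
		return v
	}
	return t.defaultModel
}

func main() {
	for _, t := range tiers {
		fmt.Printf("%-12s %s\n", t.name, modelFor(t, os.LookupEnv))
	}
}
```

Treating an empty variable as unset is a choice made here so that `EVAL_GOLDEN_MODEL= go test ...` still uses the default.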

 ### Patch Stability

-Patches may fail to apply as `origin/master` evolves over time. Need a strategy to handle this (e.g., pinning to a specific commit).
+Patches may fail to apply as `origin/master` evolves over time. Strategies:
+
+- Pin to a specific commit SHA in the clone step
+- Use `git apply --3way` for better conflict handling
+- Periodic patch refresh CI job
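
The first two strategies can be combined in the clone step. The sketch below is one possible shape, not the suite's actual code: `gitSteps` and `pinAndApply` are hypothetical helpers, and the commit SHA and patch path are supplied by the caller. It checks out the pinned commit in an existing clone, then applies the patch with a three-way merge fallback.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// gitSteps returns the git argument lists for pinning the clone to a
// commit and applying a patch with --3way.
func gitSteps(sha, patchPath string) [][]string {
	return [][]string{
		{"checkout", "--detach", sha},
		{"apply", "--3way", patchPath},
	}
}

// pinAndApply runs each step inside repoDir via `git -C`. Because -C
// changes directory first, patchPath should be absolute (or relative
// to repoDir).
func pinAndApply(repoDir, sha, patchPath string) error {
	for _, args := range gitSteps(sha, patchPath) {
		full := append([]string{"-C", repoDir}, args...)
		out, err := exec.Command("git", full...).CombinedOutput()
		if err != nil {
			return fmt.Errorf("git %s: %w\n%s", strings.Join(args, " "), err, out)
		}
	}
	return nil
}

func main() {
	if len(os.Args) != 4 {
		// Without arguments, only show what would be executed.
		fmt.Println("usage: pinapply <repo-dir> <commit-sha> <patch-path>")
		for _, args := range gitSteps("<commit-sha>", "<patch-path>") {
			fmt.Println("would run: git", strings.Join(args, " "))
		}
		return
	}
	if err := pinAndApply(os.Args[1], os.Args[2], os.Args[3]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Pinning makes results reproducible but lets the patches drift from `origin/master`, which is where a periodic refresh job would come in.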

 ### Error Handling

 Current design does not address failure scenarios:

 - Patch application failures
-- Claude CLI timeouts or crashes
-- Empty or malformed output from Claude
-- Authentication failures
-- Resource cleanup on test failures
+- Resource cleanup on test failures
+
+Using `--output-format json` also enables better error handling in future phases:
+
+- Claude CLI timeouts or crashes (detect via JSON parse failure or missing fields)
+- Empty or malformed output (validate JSON structure)
+- Authentication failures (check for error fields in JSON response)
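
A validation step along these lines might look like the sketch below. The field names `is_error` and `result` are assumptions about the JSON shape and should be checked against real CLI output before relying on them; `validate` and the sentinel errors are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

var (
	errEmptyOutput = errors.New("claude produced no output")
	errReported    = errors.New("claude reported an error")
	errNoResult    = errors.New("response has no result text")
)

// response models only the fields this sketch inspects.
// Both field names are assumptions, not confirmed by the design.
type response struct {
	IsError bool   `json:"is_error"`
	Result  string `json:"result"`
}

// validate maps raw CLI output to either the result text or an error
// that distinguishes the failure scenarios listed above.
func validate(output []byte) (string, error) {
	if len(bytes.TrimSpace(output)) == 0 {
		return "", errEmptyOutput
	}
	var r response
	if err := json.Unmarshal(output, &r); err != nil {
		// Truncated JSON often points to a crash or timeout mid-write.
		return "", fmt.Errorf("malformed JSON: %w", err)
	}
	if r.IsError {
		// Return the text too, so callers can log what the CLI said.
		return r.Result, errReported
	}
	if r.Result == "" {
		return "", errNoResult
	}
	return r.Result, nil
}

func main() {
	samples := []string{
		``,
		`{"result": "trunc`,
		`{"is_error": true, "result": "authentication failed"}`,
		`{"is_error": false, "result": "review text"}`,
	}
	for _, s := range samples {
		text, err := validate([]byte(s))
		fmt.Printf("input=%q text=%q err=%v\n", s, text, err)
	}
}
```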
+
+### Performance Optimizations
+
+The API review step is the slowest part of the eval suite. Options to improve:
+
+1. **Skip linting by default** - Update api-review command to skip `make lint` unless explicitly requested. Linting adds significant time.
+
+2. **Cache review outputs** - For development, cache the review output keyed by patch hash. Skip re-running if cached result exists. Clear cache on command changes.
+
+3. **Parallel test execution** - Run golden tests in parallel (requires separate repo clones per test).
+
+4. **Smaller/faster model for development** - Use Haiku for rapid iteration, Sonnet/Opus for CI validation.
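
For option 2, a minimal file-based cache could be sketched as follows. The names `cacheKey` and `cachedReview` are hypothetical. Note one deliberate variation: instead of clearing the cache when the command changes, this sketch mixes the command definition into the hash, so a changed command simply misses the old entries.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// cacheKey hashes the command definition and the patch together, with a
// zero byte between them so the two inputs cannot blur into each other.
func cacheKey(patch, command []byte) string {
	h := sha256.New()
	h.Write(command)
	h.Write([]byte{0})
	h.Write(patch)
	return hex.EncodeToString(h.Sum(nil))
}

// cachedReview returns the cached output for this patch/command pair, or
// calls run and stores its output when no cache entry exists.
func cachedReview(dir string, patch, command []byte, run func() (string, error)) (string, error) {
	entry := filepath.Join(dir, cacheKey(patch, command)+".txt")
	if data, err := os.ReadFile(entry); err == nil {
		return string(data), nil
	}
	out, err := run()
	if err != nil {
		return "", err
	}
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", fmt.Errorf("create cache dir: %w", err)
	}
	if err := os.WriteFile(entry, []byte(out), 0o644); err != nil {
		return "", fmt.Errorf("write cache entry: %w", err)
	}
	return out, nil
}

func main() {
	dir, err := os.MkdirTemp("", "review-cache")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	calls := 0
	// Stand-in for the slow review invocation.
	run := func() (string, error) {
		calls++
		return "review output", nil
	}
	patch, command := []byte("patch contents"), []byte("command definition v1")
	for i := 0; i < 2; i++ {
		if _, err := cachedReview(dir, patch, command, run); err != nil {
			panic(err)
		}
	}
	fmt.Println("review invocations:", calls)
}
```

Because model output is not deterministic, a cache like this suits development iteration on the judge or test harness rather than CI validation.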