Skip to content

Commit 06bca4c

Browse files
WIP: evals for api review command
1 parent 0e3c378 commit 06bca4c

File tree

8 files changed

+473
-0
lines changed

8 files changed

+473
-0
lines changed

tests/eval/DESIGN.md

Lines changed: 165 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,165 @@
1+
# API Review Command Eval Test Suite
2+
3+
Design document for a Go/Ginkgo-based evaluation framework to test the `/api-review` Claude command against known API review scenarios.
4+
5+
## Overview
6+
7+
This test suite validates that the `/api-review` Claude command correctly identifies API documentation issues. Each test case consists of a patch file and expected issues. The suite applies patches to a clean clone of the repository, runs the API review command, and verifies the output matches expectations exactly (no missing issues, no hallucinated issues).
8+
9+
## Directory Structure
10+
11+
```
12+
tests/eval/
13+
├── eval_test.go # Main Ginkgo test suite
14+
├── DESIGN.md # This file
15+
├── testdata/
16+
│ ├── missing-optional-doc/
17+
│ │ ├── patch.diff # Git patch to apply
18+
│ │ └── expected.txt # Expected issues (one per line)
19+
│ ├── undocumented-enum/
20+
│ │ ├── patch.diff
21+
│ │ └── expected.txt
22+
│ └── valid-api-change/
23+
│ ├── patch.diff
24+
│ └── expected.txt # Empty file = no issues expected
25+
```
26+
27+
## Test Case Format
28+
29+
### patch.diff
30+
31+
Standard git diff format:
32+
33+
```diff
34+
diff --git a/config/v1/types.go b/config/v1/types.go
35+
--- a/config/v1/types.go
36+
+++ b/config/v1/types.go
37+
@@ -10,0 +11,5 @@
38+
+// MyField does something
39+
+// +optional
40+
+// +kubebuilder:validation:Enum=Foo;Bar
41+
+MyField string `json:"myField"`
42+
```
43+
44+
### expected.txt
45+
46+
One expected issue per line:
47+
48+
```
49+
enum values Foo and Bar not documented in comment
50+
optional field does not explain behavior when omitted
51+
```
52+
53+
Empty file means the API change should pass review with no issues.
54+
55+
**Note**: Order of issues in `expected.txt` does not matter. Comparison uses semantic matching, not exact string matching.
56+
57+
## Test Flow
58+
59+
```
60+
┌─────────────────────────────────────────────────────────────┐
61+
│ 1. Pre-flight: │
62+
│ a. Verify local AGENTS.md and │
63+
│ .claude/commands/api-review.md exist │
64+
│ b. These will be copied to temp dir after clone │
65+
│ (ensures local changes are tested) │
66+
├─────────────────────────────────────────────────────────────┤
67+
│ 2. Setup (once): │
68+
│ a. Shallow clone openshift/api to temp dir │
69+
│ b. Copy local AGENTS.md and .claude/ to temp dir │
70+
├─────────────────────────────────────────────────────────────┤
71+
│ 3. For each test case (sequential): │
72+
│ a. Reset repo to clean state │
73+
│ b. Apply patch.diff │
74+
│ c. Run claude api-review on changed files │
75+
│ d. Run claude to compare output vs expected.txt │
76+
│ e. Parse true/false response, assert │
77+
├─────────────────────────────────────────────────────────────┤
78+
│ 4. Teardown: Remove temp dir │
79+
└─────────────────────────────────────────────────────────────┘
80+
```
81+
82+
## Reset Between Tests
83+
84+
```bash
85+
git reset --hard origin/master && git clean -fd
86+
```
87+
88+
- `git reset --hard origin/master`: Resets all tracked files to match the remote master branch, discarding any local commits and staged/unstaged changes
89+
- `git clean -f`: Force remove untracked files (files not in git)
90+
- `git clean -d`: Also remove untracked directories
91+
92+
After reset, re-copy `AGENTS.md` and `.claude/` from local source to preserve local modifications being tested.
93+
94+
**Remote origin handling**: The shallow clone creates `origin` automatically. Before reset, verify remote exists:
95+
96+
```bash
97+
git remote get-url origin || git remote add origin https://github.com/openshift/api.git
98+
```
99+
100+
## Claude Invocations
101+
102+
### Step 1 - Run API Review
103+
104+
```bash
105+
claude --print -p "/api-review" --allowedTools "Bash,Read,Grep,Glob,Task" <files>
106+
```
107+
108+
### Step 2 - Compare Results
109+
110+
```bash
111+
claude --print -p "Given this API review output:
112+
<output>
113+
114+
Expected issues (one per line):
115+
<contents of expected.txt>
116+
117+
Compare the review output against the expected issues list.
118+
The result is 'true' ONLY if:
119+
1. ALL expected issues are identified in the output
120+
2. NO additional issues are reported beyond what is expected
121+
122+
If any expected issue is missing, reply 'false'.
123+
If any issue is reported that is NOT in the expected list, reply 'false'.
124+
125+
Reply with exactly 'true' or 'false' (no other text)."
126+
```
127+
128+
Parse response, trim whitespace, check for `true` or `false`.
129+
130+
## Pre-flight Check
131+
132+
Before cloning, verify these local files exist in the source repo:
133+
134+
- `AGENTS.md`
135+
- `.claude/commands/api-review.md`
136+
137+
These are copied into the temp clone so that any local modifications to the review command are tested, not the remote versions.
138+
139+
## Configuration
140+
141+
| Setting | Value |
142+
|---------|-------|
143+
| Timeout per Claude call | 5 minutes |
144+
| Execution mode | Sequential |
145+
| Clone depth | 1 (shallow) |
146+
| Clone source | `https://github.com/openshift/api.git` |
147+
| Reset between tests | Verify origin remote exists, `git reset --hard origin/master && git clean -fd`, re-copy local AGENTS.md and .claude/ |
148+
149+
---
150+
151+
## Phase 2 (Future Work)
152+
153+
### Patch Stability
154+
155+
Patches may fail to apply as `origin/master` evolves over time. Need a strategy to handle this (e.g., pinning to a specific commit).
156+
157+
### Error Handling
158+
159+
Current design does not address failure scenarios:
160+
161+
- Patch application failures
162+
- Claude CLI timeouts or crashes
163+
- Empty or malformed output from Claude
164+
- Authentication failures
165+
- Resource cleanup on test failures

0 commit comments

Comments
 (0)