Build: Improve eval batch run script#34720
Conversation
📝 WalkthroughWalkthroughAdds GitHub label truncation for trial publishing and significantly extends the eval batch runner: supports multiple prompt variants and optional per-batch project filtering, updates CLI parsing/validation, changes batch output formatting, and adds tests and README examples for the new behaviors. ChangesLabel Truncation for GitHub PR Labels
Batch Runner Multi-Prompt & Project Filtering
Estimated code review effort🎯 4 (Complex) | ⏱️ ~55 minutes Possibly related PRs
✨ Finishing Touches📝 Generate docstrings
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
scripts/eval/run-batch.ts (2)
748-759: 💤 Low valueMinor: Label says "prompt" even with multiple prompts.
Line 751 uses the singular label
prompt:regardless of how many prompts are being run. For consistency with the matrix summary line (which usesagent(s),effort(s)), consider usingprompt(s):or dynamically choosing the label.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/eval/run-batch.ts` around lines 748 - 759, The summary output currently prints a singular label "prompt:" even when multiple prompts exist; update the string in the array returned where batch summary is built (the expression using batchTimestamp, descriptors, prompts, agents, models, efforts, projects, concurrency, logsDir) to use either a static plural label like "prompt(s):" or dynamically pluralize based on prompts.length (e.g., choose "prompt:" when prompts.length === 1 and "prompt(s):" otherwise); modify the template that currently contains ` prompt: ${prompts.join(', ')}` to the chosen pluralized label so the summary is consistent with the other "(s)" labels.
412-433: 💤 Low valueConsider case-insensitive matching for consistency with prompt resolution.
resolveBatchProjectsuses exact case matching whileresolveBatchPrompts(lines 387-400) uses case-insensitive matching. This inconsistency could confuse users who type--projects MealDropexpecting it to work like--prompts Pattern-Copy-Play.Given this is a developer tool and project names are well-documented, this is a minor UX concern rather than a bug.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/eval/run-batch.ts` around lines 412 - 433, resolveBatchProjects currently does exact-case matching while resolveBatchPrompts is case-insensitive; update resolveBatchProjects to match case-insensitively by comparing lowercased input against a lowercased allowed map built from BATCH_PROJECT_NAMES, reject unknowns using that map, perform deduplication in a case-insensitive way (using lowercased seen set), and return the canonical-cased project names from BATCH_PROJECT_NAMES (not the lowercased strings). Locate function resolveBatchProjects and use BATCH_PROJECT_NAMES to build the lowercase->canonical map for validation, ordering, and final return.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@scripts/eval/run-batch.ts`:
- Around line 748-759: The summary output currently prints a singular label
"prompt:" even when multiple prompts exist; update the string in the array
returned where batch summary is built (the expression using batchTimestamp,
descriptors, prompts, agents, models, efforts, projects, concurrency, logsDir)
to use either a static plural label like "prompt(s):" or dynamically pluralize
based on prompts.length (e.g., choose "prompt:" when prompts.length === 1 and
"prompt(s):" otherwise); modify the template that currently contains ` prompt:
${prompts.join(', ')}` to the chosen pluralized label so the summary is
consistent with the other "(s)" labels.
- Around line 412-433: resolveBatchProjects currently does exact-case matching
while resolveBatchPrompts is case-insensitive; update resolveBatchProjects to
match case-insensitively by comparing lowercased input against a lowercased
allowed map built from BATCH_PROJECT_NAMES, reject unknowns using that map,
perform deduplication in a case-insensitive way (using lowercased seen set), and
return the canonical-cased project names from BATCH_PROJECT_NAMES (not the
lowercased strings). Locate function resolveBatchProjects and use
BATCH_PROJECT_NAMES to build the lowercase->canonical map for validation,
ordering, and final return.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 08d4b309-63dc-4ade-99cb-e8f17d17dce6
📒 Files selected for processing (5)
scripts/eval/README.mdscripts/eval/lib/publish-trial.test.tsscripts/eval/lib/publish-trial.tsscripts/eval/run-batch.test.tsscripts/eval/run-batch.ts
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/eval/run-batch.ts`:
- Around line 740-745: The computation of reps using const reps =
Math.max(...descriptors.map((d) => d.repetition)) can produce -Infinity for an
empty descriptors array; change the logic in the reps calculation (where reps is
declared) to guard empty descriptor batches by returning a sensible default
(e.g., 0) when descriptors.length === 0 or by using Math.max(0,
...descriptors.map(...)) so the header prints a valid repetition count; update
the single declaration of reps in run-batch.ts accordingly.
- Around line 772-774: The median calculation uses only the upper-middle element
for even counts, so update the logic around sortedDurations and median (computed
from projectRuns.map(...)) to handle even-sized arrays: after sorting, compute
length n, if n is odd use sortedDurations[Math.floor(n/2)], otherwise compute
the average of the two middle values (sortedDurations[n/2 - 1] and
sortedDurations[n/2]) and use that as the median so even-sized run sets return
the true median.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4773b121-d7bd-48e3-a10f-0f3316569e8e
📒 Files selected for processing (2)
scripts/eval/run-batch.test.tsscripts/eval/run-batch.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/eval/run-batch.test.ts
| const reps = Math.max(...descriptors.map((d) => d.repetition)); | ||
|
|
||
| return [ | ||
| `Eval batch ${batchTimestamp}`, | ||
| ` runs: ${descriptors.length} (${projects.length} projects × ${agents.length} agent(s) × ${efforts.length} effort(s) × ${reps} rep(s))`, | ||
| ` prompt: ${prompts.join(', ')}`, |
There was a problem hiding this comment.
Guard empty descriptor batches in header math.
On Line 740, Math.max(...descriptors.map(...)) yields -Infinity when descriptors is empty, which leads to invalid header output (-Infinity rep(s)).
Suggested patch
- const reps = Math.max(...descriptors.map((d) => d.repetition));
+ const reps =
+ descriptors.length === 0 ? 0 : Math.max(...descriptors.map((d) => d.repetition));📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const reps = Math.max(...descriptors.map((d) => d.repetition)); | |
| return [ | |
| `Eval batch ${batchTimestamp}`, | |
| ` runs: ${descriptors.length} (${projects.length} projects × ${agents.length} agent(s) × ${efforts.length} effort(s) × ${reps} rep(s))`, | |
| ` prompt: ${prompts.join(', ')}`, | |
| const reps = | |
| descriptors.length === 0 ? 0 : Math.max(...descriptors.map((d) => d.repetition)); | |
| return [ | |
| `Eval batch ${batchTimestamp}`, | |
| ` runs: ${descriptors.length} (${projects.length} projects × ${agents.length} agent(s) × ${efforts.length} effort(s) × ${reps} rep(s))`, | |
| ` prompt: ${prompts.join(', ')}`, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/eval/run-batch.ts` around lines 740 - 745, The computation of reps
using const reps = Math.max(...descriptors.map((d) => d.repetition)) can produce
-Infinity for an empty descriptors array; change the logic in the reps
calculation (where reps is declared) to guard empty descriptor batches by
returning a sensible default (e.g., 0) when descriptors.length === 0 or by using
Math.max(0, ...descriptors.map(...)) so the header prints a valid repetition
count; update the single declaration of reps in run-batch.ts accordingly.
| const sortedDurations = projectRuns.map((r) => r.durationMs).sort((a, b) => a - b); | ||
| const median = sortedDurations[Math.floor(sortedDurations.length / 2)]; | ||
| return [ |
There was a problem hiding this comment.
Median is incorrect for even-sized run sets.
On Line 773, the median picks only the upper middle element. For even counts (common with multiple repetitions), this misreports the median duration.
Suggested patch
- const median = sortedDurations[Math.floor(sortedDurations.length / 2)];
+ const mid = Math.floor(sortedDurations.length / 2);
+ const median =
+ sortedDurations.length % 2 === 0
+ ? (sortedDurations[mid - 1] + sortedDurations[mid]) / 2
+ : sortedDurations[mid];📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const sortedDurations = projectRuns.map((r) => r.durationMs).sort((a, b) => a - b); | |
| const median = sortedDurations[Math.floor(sortedDurations.length / 2)]; | |
| return [ | |
| const sortedDurations = projectRuns.map((r) => r.durationMs).sort((a, b) => a - b); | |
| const mid = Math.floor(sortedDurations.length / 2); | |
| const median = | |
| sortedDurations.length % 2 === 0 | |
| ? (sortedDurations[mid - 1] + sortedDurations[mid]) / 2 | |
| : sortedDurations[mid]; | |
| return [ |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/eval/run-batch.ts` around lines 772 - 774, The median calculation
uses only the upper-middle element for even counts, so update the logic around
sortedDurations and median (computed from projectRuns.map(...)) to handle
even-sized arrays: after sorting, compute length n, if n is odd use
sortedDurations[Math.floor(n/2)], otherwise compute the average of the two
middle values (sortedDurations[n/2 - 1] and sortedDurations[n/2]) and use that
as the median so even-sized run sets return the true median.
Package BenchmarksCommit: The following packages have significant changes to their size or dependencies:
|
| Before | After | Difference | |
|---|---|---|---|
| Dependency count | 536 | 536 | 0 |
| Self size | 651 KB | 651 KB | 0 B |
| Dependency size | 60.98 MB | 61.04 MB | 🚨 +59 KB 🚨 |
| Bundle Size Analyzer | Link | Link |
@storybook/vue3-vite
| Before | After | Difference | |
|---|---|---|---|
| Dependency count | 108 | 108 | 0 |
| Self size | 36 KB | 36 KB | 🚨 +24 B 🚨 |
| Dependency size | 43.74 MB | 43.75 MB | 🚨 +12 KB 🚨 |
| Bundle Size Analyzer | Link | Link |
15575ff
into
sidnioulz/prompt-with-allowed-failure
Closes #
What I did
Fix label generation in evals for long names, add fine grained options in batch runs, improved logs
It's now possible to batch run a combination of projects, prompts and effort levels e.g.
That would result in: 2 prompts × 2 efforts × 3 projects × 3 reps = 36 trials.
Checklist for Contributors
Testing
The changes in this PR are covered in the following automated tests:
Manual testing
Caution
This section is mandatory for all contributions. If you believe no manual test is necessary, please state so explicitly. Thanks!
Documentation
MIGRATION.MD
Checklist for Maintainers
When this PR is ready for testing, make sure to add
ci:normal,ci:mergedorci:dailyGH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found incode/lib/cli-storybook/src/sandbox-templates.tsMake sure this PR contains one of the labels below:
Available labels
bug: Internal changes that fixes incorrect behavior.maintenance: User-facing maintenance tasks.dependencies: Upgrading (sometimes downgrading) dependencies.build: Internal-facing build tooling & test updates. Will not show up in release changelog.cleanup: Minor cleanup style change. Will not show up in release changelog.documentation: Documentation only changes. Will not show up in release changelog.feature request: Introducing a new feature.BREAKING CHANGE: Changes that break compatibility in some way with current major version.other: Changes that don't fit in the above categories.🦋 Canary release
This PR does not have a canary release associated. You can request a canary release of this pull request by mentioning the
@storybookjs/coreteam here.core team members can create a canary release here or locally with
gh workflow run --repo storybookjs/storybook publish.yml --field pr=<PR_NUMBER>Summary by CodeRabbit
Documentation
New Features
Bug Fixes
Tests