Build: Improve eval batch run script by yannbf · Pull Request #34720 · storybookjs/storybook

yannbf · 2026-05-05T15:53:49Z

Closes #

What I did

Fixed label generation in evals for long names, added fine-grained options to batch runs, and improved logs.

It's now possible to batch run a combination of projects, prompts, and effort levels, e.g.:

node eval/run-batch.ts \
  --prompts monorepo-optimized-tests-relaxed-limits-no-story-deletion,pattern-copy-play \
  --agents claude --claude-efforts medium,high \
  --projects wikitok,evergreen-ci,mealdrop \
  --repetitions 3

That would result in: 2 prompts × 2 efforts × 3 projects × 3 reps = 36 trials.
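
The trial count is the size of the Cartesian product of the option lists. A minimal sketch of that arithmetic, assuming a hypothetical `BatchMatrix` shape and `countTrials` helper (neither name is taken from `run-batch.ts`); the single `claude` agent contributes a factor of 1 and is left out:

```typescript
// Hypothetical shape for illustration; not the actual RunBatchOptions type.
interface BatchMatrix {
  prompts: string[];
  efforts: string[];
  projects: string[];
  repetitions: number;
}

// Each trial is one (prompt, effort, project, repetition) combination.
function countTrials(matrix: BatchMatrix): number {
  return (
    matrix.prompts.length *
    matrix.efforts.length *
    matrix.projects.length *
    matrix.repetitions
  );
}

const example: BatchMatrix = {
  prompts: ['monorepo-optimized-tests-relaxed-limits-no-story-deletion', 'pattern-copy-play'],
  efforts: ['medium', 'high'],
  projects: ['wikitok', 'evergreen-ci', 'mealdrop'],
  repetitions: 3,
};

console.log(countTrials(example)); // 2 × 2 × 3 × 3 = 36
```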

Checklist for Contributors

Testing

The changes in this PR are covered in the following automated tests:

stories
unit tests
integration tests
end-to-end tests

Manual testing

Caution

This section is mandatory for all contributions. If you believe no manual test is necessary, please state so explicitly. Thanks!

Documentation

Add or update documentation reflecting your changes
If you are deprecating/removing a feature, make sure to update MIGRATION.MD

Checklist for Maintainers

When this PR is ready for testing, make sure to add ci:normal, ci:merged or ci:daily GH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found in code/lib/cli-storybook/src/sandbox-templates.ts
Make sure this PR contains one of the labels below:
Available labels
- bug: Internal changes that fix incorrect behavior.
- maintenance: User-facing maintenance tasks.
- dependencies: Upgrading (sometimes downgrading) dependencies.
- build: Internal-facing build tooling & test updates. Will not show up in release changelog.
- cleanup: Minor cleanup style change. Will not show up in release changelog.
- documentation: Documentation only changes. Will not show up in release changelog.
- feature request: Introducing a new feature.
- BREAKING CHANGE: Changes that break compatibility in some way with current major version.
- other: Changes that don't fit in the above categories.

🦋 Canary release

This PR does not have a canary release associated. You can request a canary release of this pull request by mentioning the @storybookjs/core team here.

core team members can create a canary release here or locally with gh workflow run --repo storybookjs/storybook publish.yml --field pr=<PR_NUMBER>

Summary by CodeRabbit

Documentation
- Added example commands for eval batch runs with project restrictions, multi-prompt variants, and targeted effort/project matrices

New Features
- Run batches across multiple prompt variants at once
- Restrict batch runs to specific projects
- Improved progress display and enhanced batch reporting with per-project summaries

Bug Fixes
- Truncate GitHub PR labels to 50 characters

Tests
- Added tests covering multi-prompt/project batch behavior and output formatting

coderabbitai · 2026-05-05T15:57:31Z

📝 Walkthrough

Walkthrough

Adds GitHub label truncation for trial publishing and significantly extends the eval batch runner: supports multiple prompt variants and optional per-batch project filtering, updates CLI parsing/validation, changes batch output formatting, and adds tests and README examples for the new behaviors.

Changes

Label Truncation for GitHub PR Labels

| Layer / File(s) | Summary |
| --- | --- |
| Constant & Implementation `scripts/eval/lib/publish-trial.ts` | Adds `GITHUB_LABEL_MAX_LENGTH = 50` and `truncateLabel(label)`; `buildTrialLabels` now applies truncation to every generated label. |
| Tests `scripts/eval/lib/publish-trial.test.ts` | New test asserts `prompt:` label is truncated when prompt name >50 chars and that all labels have length <= 50. |
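
A minimal sketch of what such a truncation helper might look like. The constant name, the 50-character limit, and the `prompt:` label prefix come from the summary above; the function body (a plain `slice`) is an assumption, and the real `truncateLabel` may behave differently, for example by appending an ellipsis. `buildLabelsSketch` and the `project:` prefix are hypothetical stand-ins for illustration only.

```typescript
const GITHUB_LABEL_MAX_LENGTH = 50;

// Assumed behavior: keep a label within the limit by cutting off its tail.
function truncateLabel(label: string): string {
  return label.length <= GITHUB_LABEL_MAX_LENGTH
    ? label
    : label.slice(0, GITHUB_LABEL_MAX_LENGTH);
}

// Hypothetical stand-in for buildTrialLabels: every generated label is truncated.
function buildLabelsSketch(prompt: string, project: string): string[] {
  return [`prompt:${prompt}`, `project:${project}`].map(truncateLabel);
}

const labels = buildLabelsSketch(
  'monorepo-optimized-tests-relaxed-limits-no-story-deletion',
  'mealdrop'
);
console.log(labels.every((label) => label.length <= GITHUB_LABEL_MAX_LENGTH)); // true
```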

Batch Runner Multi-Prompt & Project Filtering

| Layer / File(s) | Summary |
| --- | --- |
| Options & Type Definitions `scripts/eval/run-batch.ts` | `RunBatchOptions` gains `prompts?: string[]` and `projects?: (typeof BATCH_PROJECT_NAMES)[number][]`; `prompt` becomes optional. |
| CLI Parsing & Validation `scripts/eval/run-batch.ts` | Adds `--prompts` and `--projects` CLI options; introduces `parseList` and `parseProjects`; `runBatchArgsSchema` refined to require at least one of `prompt`/`prompts`. |
| Core Resolution Logic `scripts/eval/run-batch.ts` | Adds `resolveBatchPrompts` (merge/trim/validate/dedupe, case-insensitive) and `resolveBatchProjects` (validate/dedupe); `buildBatchRunDescriptors` iterates resolved prompts, variants, and projects to produce descriptors. |
| Run Loop & Output Formatting `scripts/eval/run-batch.ts` | Per-run logs now use padded counters and shortened labels; adds `formatDuration`, `formatBatchHeader`, and `formatPerProjectSummary`; final output prints concise completion line plus per-project summary and failures section. |
| Tests & Documentation `scripts/eval/run-batch.test.ts`, `scripts/eval/README.md` | Tests added/extended for `--projects` parsing/validation, `--prompts` fan-out, descriptor generation rules, and formatting helpers; README examples updated with `--projects`, `--prompts`, and targeted matrices. |
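
To make the descriptor fan-out concrete, here is a rough sketch of how iterating prompts, variants, and projects could expand into one descriptor per run. Everything here is an assumption for illustration: `RunDescriptorSketch`, `buildDescriptorsSketch`, the loop nesting order, and the 1-based `repetition` counter are hypothetical, and only the `repetition` field name appears in the reviewed diff.

```typescript
// Hypothetical descriptor shape; only `repetition` is confirmed by the review diff.
interface RunDescriptorSketch {
  prompt: string;
  effort: string;
  project: string;
  repetition: number; // assumed to be 1-based
}

function buildDescriptorsSketch(
  prompts: string[],
  efforts: string[],
  projects: string[],
  repetitions: number
): RunDescriptorSketch[] {
  const descriptors: RunDescriptorSketch[] = [];
  // Nesting order is an assumption; the real runner may order runs differently.
  for (const prompt of prompts) {
    for (const effort of efforts) {
      for (const project of projects) {
        for (let repetition = 1; repetition <= repetitions; repetition++) {
          descriptors.push({ prompt, effort, project, repetition });
        }
      }
    }
  }
  return descriptors;
}

const descriptors = buildDescriptorsSketch(
  ['prompt-a', 'prompt-b'],
  ['high'],
  ['wikitok', 'mealdrop'],
  2
);
console.log(descriptors.length); // 2 × 1 × 2 × 2 = 8
```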

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs

storybookjs/storybook#34297: Directly modifies eval harness code in scripts/eval/ including run-batch and publish-trial utilities.


coderabbitai

🧹 Nitpick comments (2)

scripts/eval/run-batch.ts (2)
748-759: 💤 Low value

Minor: Label says "prompt" even with multiple prompts.

Line 751 uses the singular label prompt: regardless of how many prompts are being run. For consistency with the matrix summary line (which uses agent(s), effort(s)), consider using prompt(s): or dynamically choosing the label.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/eval/run-batch.ts` around lines 748 - 759, The summary output
currently prints a singular label "prompt:" even when multiple prompts exist;
update the string in the array returned where batch summary is built (the
expression using batchTimestamp, descriptors, prompts, agents, models, efforts,
projects, concurrency, logsDir) to use either a static plural label like
"prompt(s):" or dynamically pluralize based on prompts.length (e.g., choose
"prompt:" when prompts.length === 1 and "prompt(s):" otherwise); modify the
template that currently contains `  prompt:      ${prompts.join(', ')}` to the
chosen pluralized label so the summary is consistent with the other "(s)"
labels.
412-433: 💤 Low value

Consider case-insensitive matching for consistency with prompt resolution.

resolveBatchProjects uses exact case matching while resolveBatchPrompts (lines 387-400) uses case-insensitive matching. This inconsistency could confuse users who type --projects MealDrop expecting it to work like --prompts Pattern-Copy-Play.

Given this is a developer tool and project names are well-documented, this is a minor UX concern rather than a bug.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/eval/run-batch.ts` around lines 412 - 433, resolveBatchProjects
currently does exact-case matching while resolveBatchPrompts is
case-insensitive; update resolveBatchProjects to match case-insensitively by
comparing lowercased input against a lowercased allowed map built from
BATCH_PROJECT_NAMES, reject unknowns using that map, perform deduplication in a
case-insensitive way (using lowercased seen set), and return the canonical-cased
project names from BATCH_PROJECT_NAMES (not the lowercased strings). Locate
function resolveBatchProjects and use BATCH_PROJECT_NAMES to build the
lowercase->canonical map for validation, ordering, and final return.
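
A sketch of the case-insensitive resolution described in this comment, under stated assumptions: the `BATCH_PROJECT_NAMES` values below are only the three project names that appear in this PR's example command (the real list is not shown here), `resolveBatchProjectsSketch` is a hypothetical name, and the choice to preserve input order is a guess rather than the reviewed behavior.

```typescript
// Assumption: only the three project names from the PR's example command.
const BATCH_PROJECT_NAMES = ['wikitok', 'evergreen-ci', 'mealdrop'] as const;
type BatchProjectName = (typeof BATCH_PROJECT_NAMES)[number];

function resolveBatchProjectsSketch(input: string[]): BatchProjectName[] {
  // Map each lowercased name back to its canonical spelling.
  const canonicalByLower = new Map<string, BatchProjectName>();
  for (const name of BATCH_PROJECT_NAMES) {
    canonicalByLower.set(name.toLowerCase(), name);
  }

  const seen = new Set<string>();
  const resolved: BatchProjectName[] = [];
  for (const raw of input) {
    const key = raw.trim().toLowerCase();
    const canonical = canonicalByLower.get(key);
    if (canonical === undefined) {
      throw new Error(`Unknown project: ${raw}`);
    }
    // Deduplicate case-insensitively, keeping the first occurrence.
    if (!seen.has(key)) {
      seen.add(key);
      resolved.push(canonical);
    }
  }
  return resolved;
}

console.log(resolveBatchProjectsSketch(['MealDrop', 'mealdrop', 'WikiTok']));
// [ 'mealdrop', 'wikitok' ]
```

Returning the canonical names rather than the lowercased input keeps downstream code typed against `BATCH_PROJECT_NAMES`.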


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 08d4b309-63dc-4ade-99cb-e8f17d17dce6

📥 Commits

Reviewing files that changed from the base of the PR and between e8cfc75 and d75268d.

📒 Files selected for processing (5)

scripts/eval/README.md
scripts/eval/lib/publish-trial.test.ts
scripts/eval/lib/publish-trial.ts
scripts/eval/run-batch.test.ts
scripts/eval/run-batch.ts

coderabbitai

Actionable comments posted: 2


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4773b121-d7bd-48e3-a10f-0f3316569e8e

📥 Commits

Reviewing files that changed from the base of the PR and between d75268d and 0be16ce.

📒 Files selected for processing (2)

scripts/eval/run-batch.test.ts
scripts/eval/run-batch.ts

🚧 Files skipped from review as they are similar to previous changes (1)

scripts/eval/run-batch.test.ts

coderabbitai · 2026-05-06T09:48:46Z

+  const reps = Math.max(...descriptors.map((d) => d.repetition));
+
+  return [
+    `Eval batch ${batchTimestamp}`,
+    `  runs:        ${descriptors.length} (${projects.length} projects × ${agents.length} agent(s) × ${efforts.length} effort(s) × ${reps} rep(s))`,
+    `  prompt:      ${prompts.join(', ')}`,


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard empty descriptor batches in header math.

On Line 740, Math.max(...descriptors.map(...)) yields -Infinity when descriptors is empty, which leads to invalid header output (-Infinity rep(s)).

Suggested patch

- const reps = Math.max(...descriptors.map((d) => d.repetition));
+ const reps =
+   descriptors.length === 0 ? 0 : Math.max(...descriptors.map((d) => d.repetition));

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-  const reps = Math.max(...descriptors.map((d) => d.repetition));
-
-  return [
-    `Eval batch ${batchTimestamp}`,
-    `  runs:        ${descriptors.length} (${projects.length} projects × ${agents.length} agent(s) × ${efforts.length} effort(s) × ${reps} rep(s))`,
-    `  prompt:      ${prompts.join(', ')}`,
+  const reps =
+    descriptors.length === 0 ? 0 : Math.max(...descriptors.map((d) => d.repetition));
+
+  return [
+    `Eval batch ${batchTimestamp}`,
+    `  runs:        ${descriptors.length} (${projects.length} projects × ${agents.length} agent(s) × ${efforts.length} effort(s) × ${reps} rep(s))`,
+    `  prompt:      ${prompts.join(', ')}`,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/eval/run-batch.ts` around lines 740 - 745, The computation of reps using const reps = Math.max(...descriptors.map((d) => d.repetition)) can produce -Infinity for an empty descriptors array; change the logic in the reps calculation (where reps is declared) to guard empty descriptor batches by returning a sensible default (e.g., 0) when descriptors.length === 0 or by using Math.max(0, ...descriptors.map(...)) so the header prints a valid repetition count; update the single declaration of reps in run-batch.ts accordingly.
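
The `-Infinity` result is standard JavaScript behavior: `Math.max()` called with no arguments returns `-Infinity`. A small standalone sketch of the `Math.max(0, ...)` variant mentioned in the comment; `maxRepetition` is a hypothetical helper name, not a function from `run-batch.ts`:

```typescript
// Seeding Math.max with 0 yields a usable count even for an empty batch.
function maxRepetition(repetitions: number[]): number {
  return Math.max(0, ...repetitions);
}

const none: number[] = [];
console.log(Math.max(...none)); // -Infinity
console.log(maxRepetition(none)); // 0
console.log(maxRepetition([1, 2, 3])); // 3
```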

coderabbitai · 2026-05-06T09:48:46Z

+      const sortedDurations = projectRuns.map((r) => r.durationMs).sort((a, b) => a - b);
+      const median = sortedDurations[Math.floor(sortedDurations.length / 2)];
+      return [


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Median is incorrect for even-sized run sets.

On Line 773, the median picks only the upper middle element. For even counts (common with multiple repetitions), this misreports the median duration.

Suggested patch

- const median = sortedDurations[Math.floor(sortedDurations.length / 2)];
+ const mid = Math.floor(sortedDurations.length / 2);
+ const median =
+   sortedDurations.length % 2 === 0
+     ? (sortedDurations[mid - 1] + sortedDurations[mid]) / 2
+     : sortedDurations[mid];

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-      const sortedDurations = projectRuns.map((r) => r.durationMs).sort((a, b) => a - b);
-      const median = sortedDurations[Math.floor(sortedDurations.length / 2)];
-      return [
+      const sortedDurations = projectRuns.map((r) => r.durationMs).sort((a, b) => a - b);
+      const mid = Math.floor(sortedDurations.length / 2);
+      const median =
+        sortedDurations.length % 2 === 0
+          ? (sortedDurations[mid - 1] + sortedDurations[mid]) / 2
+          : sortedDurations[mid];
+      return [

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/eval/run-batch.ts` around lines 772 - 774, The median calculation uses only the upper-middle element for even counts, so update the logic around sortedDurations and median (computed from projectRuns.map(...)) to handle even-sized arrays: after sorting, compute length n, if n is odd use sortedDurations[Math.floor(n/2)], otherwise compute the average of the two middle values (sortedDurations[n/2 - 1] and sortedDurations[n/2]) and use that as the median so even-sized run sets return the true median.
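
As a standalone illustration of the difference, here is a sketch of a median helper that averages the two middle values for even-sized inputs. `medianSketch` is a hypothetical name, and returning 0 for an empty input is an assumption not discussed in the review:

```typescript
function medianSketch(values: number[]): number {
  // Assumption: treat "no runs" as a median of 0.
  if (values.length === 0) return 0;
  // Copy before sorting so the caller's array is not mutated.
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

const durationsMs = [10, 20, 30, 40];
// The upper-middle pick reports 30; the true median is 25.
console.log(durationsMs[Math.floor(durationsMs.length / 2)]); // 30
console.log(medianSketch(durationsMs)); // 25
console.log(medianSketch([30, 10, 20])); // 20
```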

storybook-app-bot · 2026-05-06T09:56:13Z

Package Benchmarks

Commit: 0be16ce, ran on 6 May 2026 at 09:56:09 UTC

The following packages have significant changes to their size or dependencies:

`@storybook/nextjs`

| | Before | After | Difference |
| --- | --- | --- | --- |
| Dependency count | 536 | 536 | 0 |
| Self size | 651 KB | 651 KB | 0 B |
| Dependency size | 60.98 MB | 61.04 MB | 🚨 +59 KB 🚨 |
| Bundle Size Analyzer | Link | Link | |

`@storybook/vue3-vite`

| | Before | After | Difference |
| --- | --- | --- | --- |
| Dependency count | 108 | 108 | 0 |
| Self size | 36 KB | 36 KB | 🚨 +24 B 🚨 |
| Dependency size | 43.74 MB | 43.75 MB | 🚨 +12 KB 🚨 |
| Bundle Size Analyzer | Link | Link | |

yannbf added 2 commits May 5, 2026 17:41

fix label generation in evals, add fine grained options in batch runs, improved logs

1da36b7

allow to run multiple prompts in batch run

d75268d

yannbf requested a review from Sidnioulz May 5, 2026 15:53

yannbf self-assigned this May 5, 2026

yannbf added build Internal-facing build tooling & test updates ci:normal labels May 5, 2026

coderabbitai Bot reviewed May 5, 2026

Sidnioulz approved these changes May 6, 2026

reformat

0be16ce

coderabbitai Bot reviewed May 6, 2026

Sidnioulz merged commit 15575ff into sidnioulz/prompt-with-allowed-failure May 6, 2026
141 checks passed

Sidnioulz deleted the yann/improved-batch-eval branch May 6, 2026 10:07

coderabbitai Bot mentioned this pull request May 6, 2026

Agentic Setup: Allow failed stories to persist #34717

Merged

2 tasks

Build: Improve eval batch run script #34720
Sidnioulz merged 3 commits into sidnioulz/prompt-with-allowed-failure from yann/improved-batch-eval
