Skip to content

Build: Improve eval batch run script#34720

Merged
Sidnioulz merged 3 commits into
sidnioulz/prompt-with-allowed-failurefrom
yann/improved-batch-eval
May 6, 2026
Merged

Build: Improve eval batch run script#34720
Sidnioulz merged 3 commits into
sidnioulz/prompt-with-allowed-failurefrom
yann/improved-batch-eval

Conversation

@yannbf
Copy link
Copy Markdown
Member

@yannbf yannbf commented May 5, 2026

Closes #

What I did

Fix label generation in evals for long names, add fine grained options in batch runs, improved logs

It's now possible to batch run a combination of projects, prompts and effort levels e.g.

node eval/run-batch.ts \
  --prompts monorepo-optimized-tests-relaxed-limits-no-story-deletion,pattern-copy-play \
  --agents claude --claude-efforts medium,high \
  --projects wikitok,evergreen-ci,mealdrop \
  --repetitions 3

That would result in: 2 prompts × 2 efforts × 3 projects × 3 reps = 36 trials.

Checklist for Contributors

Testing

The changes in this PR are covered in the following automated tests:

  • stories
  • unit tests
  • integration tests
  • end-to-end tests

Manual testing

Caution

This section is mandatory for all contributions. If you believe no manual test is necessary, please state so explicitly. Thanks!

Documentation

  • Add or update documentation reflecting your changes
  • If you are deprecating/removing a feature, make sure to update
    MIGRATION.MD

Checklist for Maintainers

  • When this PR is ready for testing, make sure to add ci:normal, ci:merged or ci:daily GH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found in code/lib/cli-storybook/src/sandbox-templates.ts

  • Make sure this PR contains one of the labels below:

    Available labels
    • bug: Internal changes that fixes incorrect behavior.
    • maintenance: User-facing maintenance tasks.
    • dependencies: Upgrading (sometimes downgrading) dependencies.
    • build: Internal-facing build tooling & test updates. Will not show up in release changelog.
    • cleanup: Minor cleanup style change. Will not show up in release changelog.
    • documentation: Documentation only changes. Will not show up in release changelog.
    • feature request: Introducing a new feature.
    • BREAKING CHANGE: Changes that break compatibility in some way with current major version.
    • other: Changes that don't fit in the above categories.

🦋 Canary release

This PR does not have a canary release associated. You can request a canary release of this pull request by mentioning the @storybookjs/core team here.

core team members can create a canary release here or locally with gh workflow run --repo storybookjs/storybook publish.yml --field pr=<PR_NUMBER>

Summary by CodeRabbit

  • Documentation

    • Added example commands for eval batch runs with project restrictions, multi-prompt variants, and targeted effort/project matrices
  • New Features

    • Run batches across multiple prompt variants at once
    • Restrict batch runs to specific projects
    • Improved progress display and enhanced batch reporting with per-project summaries
  • Bug Fixes

    • Truncate GitHub PR labels to 50 characters
  • Tests

    • Added tests covering multi-prompt/project batch behavior and output formatting

@yannbf yannbf requested a review from Sidnioulz May 5, 2026 15:53
@yannbf yannbf self-assigned this May 5, 2026
@yannbf yannbf added build Internal-facing build tooling & test updates ci:normal labels May 5, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 5, 2026

📝 Walkthrough

Walkthrough

Adds GitHub label truncation for trial publishing and significantly extends the eval batch runner: supports multiple prompt variants and optional per-batch project filtering, updates CLI parsing/validation, changes batch output formatting, and adds tests and README examples for the new behaviors.

Changes

Label Truncation for GitHub PR Labels

Layer / File(s) Summary
Constant & Implementation
scripts/eval/lib/publish-trial.ts
Adds GITHUB_LABEL_MAX_LENGTH = 50 and truncateLabel(label); buildTrialLabels now applies truncation to every generated label.
Tests
scripts/eval/lib/publish-trial.test.ts
New test asserts prompt: label is truncated when prompt name >50 chars and that all labels have length <= 50.

Batch Runner Multi-Prompt & Project Filtering

Layer / File(s) Summary
Options & Type Definitions
scripts/eval/run-batch.ts
RunBatchOptions gains prompts?: string[] and projects?: (typeof BATCH_PROJECT_NAMES)[number][]; prompt becomes optional.
CLI Parsing & Validation
scripts/eval/run-batch.ts
Adds --prompts and --projects CLI options; introduces parseList and parseProjects; runBatchArgsSchema refined to require at least one of prompt/prompts.
Core Resolution Logic
scripts/eval/run-batch.ts
Adds resolveBatchPrompts (merge/trim/validate/dedupe, case-insensitive) and resolveBatchProjects (validate/dedupe); buildBatchRunDescriptors iterates resolved prompts, variants, and projects to produce descriptors.
Run Loop & Output Formatting
scripts/eval/run-batch.ts
Per-run logs now use padded counters and shortened labels; adds formatDuration, formatBatchHeader, and formatPerProjectSummary; final output prints concise completion line plus per-project summary and failures section.
Tests & Documentation
scripts/eval/run-batch.test.ts, scripts/eval/README.md
Tests added/extended for --projects parsing/validation, --prompts fan-out, descriptor generation rules, and formatting helpers; README examples updated with --projects, --prompts, and targeted matrices.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
scripts/eval/run-batch.ts (2)

748-759: 💤 Low value

Minor: Label says "prompt" even with multiple prompts.

Line 751 uses the singular label prompt: regardless of how many prompts are being run. For consistency with the matrix summary line (which uses agent(s), effort(s)), consider using prompt(s): or dynamically choosing the label.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/eval/run-batch.ts` around lines 748 - 759, The summary output
currently prints a singular label "prompt:" even when multiple prompts exist;
update the string in the array returned where batch summary is built (the
expression using batchTimestamp, descriptors, prompts, agents, models, efforts,
projects, concurrency, logsDir) to use either a static plural label like
"prompt(s):" or dynamically pluralize based on prompts.length (e.g., choose
"prompt:" when prompts.length === 1 and "prompt(s):" otherwise); modify the
template that currently contains `  prompt:      ${prompts.join(', ')}` to the
chosen pluralized label so the summary is consistent with the other "(s)"
labels.

412-433: 💤 Low value

Consider case-insensitive matching for consistency with prompt resolution.

resolveBatchProjects uses exact case matching while resolveBatchPrompts (lines 387-400) uses case-insensitive matching. This inconsistency could confuse users who type --projects MealDrop expecting it to work like --prompts Pattern-Copy-Play.

Given this is a developer tool and project names are well-documented, this is a minor UX concern rather than a bug.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/eval/run-batch.ts` around lines 412 - 433, resolveBatchProjects
currently does exact-case matching while resolveBatchPrompts is
case-insensitive; update resolveBatchProjects to match case-insensitively by
comparing lowercased input against a lowercased allowed map built from
BATCH_PROJECT_NAMES, reject unknowns using that map, perform deduplication in a
case-insensitive way (using lowercased seen set), and return the canonical-cased
project names from BATCH_PROJECT_NAMES (not the lowercased strings). Locate
function resolveBatchProjects and use BATCH_PROJECT_NAMES to build the
lowercase->canonical map for validation, ordering, and final return.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@scripts/eval/run-batch.ts`:
- Around line 748-759: The summary output currently prints a singular label
"prompt:" even when multiple prompts exist; update the string in the array
returned where batch summary is built (the expression using batchTimestamp,
descriptors, prompts, agents, models, efforts, projects, concurrency, logsDir)
to use either a static plural label like "prompt(s):" or dynamically pluralize
based on prompts.length (e.g., choose "prompt:" when prompts.length === 1 and
"prompt(s):" otherwise); modify the template that currently contains `  prompt: 
${prompts.join(', ')}` to the chosen pluralized label so the summary is
consistent with the other "(s)" labels.
- Around line 412-433: resolveBatchProjects currently does exact-case matching
while resolveBatchPrompts is case-insensitive; update resolveBatchProjects to
match case-insensitively by comparing lowercased input against a lowercased
allowed map built from BATCH_PROJECT_NAMES, reject unknowns using that map,
perform deduplication in a case-insensitive way (using lowercased seen set), and
return the canonical-cased project names from BATCH_PROJECT_NAMES (not the
lowercased strings). Locate function resolveBatchProjects and use
BATCH_PROJECT_NAMES to build the lowercase->canonical map for validation,
ordering, and final return.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 08d4b309-63dc-4ade-99cb-e8f17d17dce6

📥 Commits

Reviewing files that changed from the base of the PR and between e8cfc75 and d75268d.

📒 Files selected for processing (5)
  • scripts/eval/README.md
  • scripts/eval/lib/publish-trial.test.ts
  • scripts/eval/lib/publish-trial.ts
  • scripts/eval/run-batch.test.ts
  • scripts/eval/run-batch.ts

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/eval/run-batch.ts`:
- Around line 740-745: The computation of reps using const reps =
Math.max(...descriptors.map((d) => d.repetition)) can produce -Infinity for an
empty descriptors array; change the logic in the reps calculation (where reps is
declared) to guard empty descriptor batches by returning a sensible default
(e.g., 0) when descriptors.length === 0 or by using Math.max(0,
...descriptors.map(...)) so the header prints a valid repetition count; update
the single declaration of reps in run-batch.ts accordingly.
- Around line 772-774: The median calculation uses only the upper-middle element
for even counts, so update the logic around sortedDurations and median (computed
from projectRuns.map(...)) to handle even-sized arrays: after sorting, compute
length n, if n is odd use sortedDurations[Math.floor(n/2)], otherwise compute
the average of the two middle values (sortedDurations[n/2 - 1] and
sortedDurations[n/2]) and use that as the median so even-sized run sets return
the true median.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4773b121-d7bd-48e3-a10f-0f3316569e8e

📥 Commits

Reviewing files that changed from the base of the PR and between d75268d and 0be16ce.

📒 Files selected for processing (2)
  • scripts/eval/run-batch.test.ts
  • scripts/eval/run-batch.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/eval/run-batch.test.ts

Comment thread scripts/eval/run-batch.ts
Comment on lines +740 to +745
const reps = Math.max(...descriptors.map((d) => d.repetition));

return [
`Eval batch ${batchTimestamp}`,
` runs: ${descriptors.length} (${projects.length} projects × ${agents.length} agent(s) × ${efforts.length} effort(s) × ${reps} rep(s))`,
` prompt: ${prompts.join(', ')}`,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard empty descriptor batches in header math.

On Line 740, Math.max(...descriptors.map(...)) yields -Infinity when descriptors is empty, which leads to invalid header output (-Infinity rep(s)).

Suggested patch
-  const reps = Math.max(...descriptors.map((d) => d.repetition));
+  const reps =
+    descriptors.length === 0 ? 0 : Math.max(...descriptors.map((d) => d.repetition));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const reps = Math.max(...descriptors.map((d) => d.repetition));
return [
`Eval batch ${batchTimestamp}`,
` runs: ${descriptors.length} (${projects.length} projects × ${agents.length} agent(s) × ${efforts.length} effort(s) × ${reps} rep(s))`,
` prompt: ${prompts.join(', ')}`,
const reps =
descriptors.length === 0 ? 0 : Math.max(...descriptors.map((d) => d.repetition));
return [
`Eval batch ${batchTimestamp}`,
` runs: ${descriptors.length} (${projects.length} projects × ${agents.length} agent(s) × ${efforts.length} effort(s) × ${reps} rep(s))`,
` prompt: ${prompts.join(', ')}`,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/eval/run-batch.ts` around lines 740 - 745, The computation of reps
using const reps = Math.max(...descriptors.map((d) => d.repetition)) can produce
-Infinity for an empty descriptors array; change the logic in the reps
calculation (where reps is declared) to guard empty descriptor batches by
returning a sensible default (e.g., 0) when descriptors.length === 0 or by using
Math.max(0, ...descriptors.map(...)) so the header prints a valid repetition
count; update the single declaration of reps in run-batch.ts accordingly.

Comment thread scripts/eval/run-batch.ts
Comment on lines +772 to +774
const sortedDurations = projectRuns.map((r) => r.durationMs).sort((a, b) => a - b);
const median = sortedDurations[Math.floor(sortedDurations.length / 2)];
return [
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Median is incorrect for even-sized run sets.

On Line 773, the median picks only the upper middle element. For even counts (common with multiple repetitions), this misreports the median duration.

Suggested patch
-      const median = sortedDurations[Math.floor(sortedDurations.length / 2)];
+      const mid = Math.floor(sortedDurations.length / 2);
+      const median =
+        sortedDurations.length % 2 === 0
+          ? (sortedDurations[mid - 1] + sortedDurations[mid]) / 2
+          : sortedDurations[mid];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const sortedDurations = projectRuns.map((r) => r.durationMs).sort((a, b) => a - b);
const median = sortedDurations[Math.floor(sortedDurations.length / 2)];
return [
const sortedDurations = projectRuns.map((r) => r.durationMs).sort((a, b) => a - b);
const mid = Math.floor(sortedDurations.length / 2);
const median =
sortedDurations.length % 2 === 0
? (sortedDurations[mid - 1] + sortedDurations[mid]) / 2
: sortedDurations[mid];
return [
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/eval/run-batch.ts` around lines 772 - 774, The median calculation
uses only the upper-middle element for even counts, so update the logic around
sortedDurations and median (computed from projectRuns.map(...)) to handle
even-sized arrays: after sorting, compute length n, if n is odd use
sortedDurations[Math.floor(n/2)], otherwise compute the average of the two
middle values (sortedDurations[n/2 - 1] and sortedDurations[n/2]) and use that
as the median so even-sized run sets return the true median.

@storybook-app-bot
Copy link
Copy Markdown

Package Benchmarks

Commit: 0be16ce, ran on 6 May 2026 at 09:56:09 UTC

The following packages have significant changes to their size or dependencies:

@storybook/nextjs

Before After Difference
Dependency count 536 536 0
Self size 651 KB 651 KB 0 B
Dependency size 60.98 MB 61.04 MB 🚨 +59 KB 🚨
Bundle Size Analyzer Link Link

@storybook/vue3-vite

Before After Difference
Dependency count 108 108 0
Self size 36 KB 36 KB 🚨 +24 B 🚨
Dependency size 43.74 MB 43.75 MB 🚨 +12 KB 🚨
Bundle Size Analyzer Link Link

@Sidnioulz Sidnioulz merged commit 15575ff into sidnioulz/prompt-with-allowed-failure May 6, 2026
141 checks passed
@Sidnioulz Sidnioulz deleted the yann/improved-batch-eval branch May 6, 2026 10:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build Internal-facing build tooling & test updates ci:normal

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants