Eval: Add worktree prep and PR artifacts by kasperpeulen · Pull Request #34421 · storybookjs/storybook

kasperpeulen · 2026-03-31T17:31:45Z

Closes #

What I did

Expanded the eval harness into a fuller end-to-end trial pipeline.

This PR now focuses on the scripts/eval workflow and related artifacts:

prepare eval trials from persistent source clones plus fresh per-trial worktrees
run Claude/Codex setup evals and write schema v4 trial artifacts
grade runs using generated-story preview gain while still recording before/after ghost-story coverage
publish draft eval PRs with structured result data and transcript documents
collect eval PR data into SQLite for later analysis
add batch execution and shared .storybook baseline sync tooling for benchmark repos
use the play-driven pattern-copy-play prompt as the setup-eval prompt path

Note: This PR targets project/sb-agentic-setup (a project branch), not next or main.

Detailed walkthrough

Pipeline architecture

Two entry points (eval.ts for single trials, run-batch.ts for batch) feed the core pipeline: prepare a worktree → run the agent → grade the output → compute a score → publish a draft PR. Supporting modules supply configuration and templates; post-pipeline tools collect data into SQLite or sync baselines to benchmark repos.

Worktree isolation strategy

Instead of cloning fresh every time, the pipeline maintains one persistent source clone per project and spawns lightweight git worktrees for each trial:

// scripts/eval/lib/prepare-trial.ts

export async function prepareTrial(project, trialId, logger) {
  const sourceDir = join(REPOS_DIR, project.name);
  const trialDir = join(TRIALS_DIR, trialId);
  const repoRoot = join(trialDir, 'project');
  const trialBranch = `trial/${trialId}`;

  await ensureSourceClone(project, sourceDir, logger);            // clone once
  const baselineCommit = await syncSourceClone(project, sourceDir, logger); // fetch + reset
  await createTrialWorktree({ sourceDir, trialBranch, repoRoot, baseBranch: project.branch, logger });

  const projectPath = getProjectPath(repoRoot, project.projectDir);
  await installDeps(projectPath, logger, undefined, { stopAt: repoRoot });
  await syncLocalBuildsToTrial(projectPath, logger);              // optional: inject local storybook dist

  return { trialDir, sourceDir, repoRoot, projectPath, resultsDir, baselineCommit, trialBranch };
}

The worktree is created with a single git command:

// scripts/eval/lib/prepare-trial.ts — createTrialWorktree()

await x('git', ['worktree', 'add', '-b', trialBranch, repoRoot, baseBranch], {
  timeout: 120_000,
  nodeOptions: { cwd: sourceDir },
});

Seven benchmark repos cover different tech stacks:

// scripts/eval/lib/projects.ts

export const PROJECTS: Project[] = [
  { name: 'mealdrop',     description: 'Styled components, Redux, React Router' },
  { name: 'edgy',         description: 'Tailwind, HeadlessUI, React Router' },
  { name: 'wikitok',      description: 'Simple project with Tailwind', projectDir: 'frontend' },
  { name: 'baklava',      description: 'Component library with Zustand' },
  { name: 'echarts',      description: 'ECharts React wrapper' },
  { name: 'evergreen-ci', description: 'GraphQL', projectDir: 'packages/lib' },
  { name: 'excalidraw',   description: 'Monorepo with canvas based drawing app', projectDir: 'excalidraw-app' },
];

Trial lifecycle (8 steps)

run-trial.ts orchestrates the full sequence. The critical detail is the double-write of data.json — a provisional version before grading (so MDX imports don't crash the build), then the final version after:

// scripts/eval/lib/run-trial.ts — runTrial()

// Step 6a: Write provisional data so baseline-owned MDX files can resolve during grading
const provisionalData = buildEvalData({ id: trialId, /* ... */ grade: { buildSuccess: false, /* ... */ } });
await writeFile(join(workspace.resultsDir, 'data.json'), JSON.stringify(provisionalData, null, 2));

// Step 6b: Grade the results
const { grade: trialGrade, score } = await grade(workspace, log, baselineGhostStories);

// Step 7: Rewrite with final grade
const reportForCommit = buildEvalData({ ...provisionalData, grade: trialGrade, score, /* ... */ });
await writeFile(join(workspace.resultsDir, 'data.json'), JSON.stringify(reportForCommit, null, 2));

// Step 8: Commit, push, open PR
const publish = await publishTrialBranch({ data: reportForCommit, workspace, logger: log });

The agent execution uses a pluggable driver interface:

// scripts/eval/lib/agents/config.ts

export interface AgentDriver {
  name: AgentId;
  execute(params: AgentExecuteParams): Promise<AgentExecutionResult>;
}

// Supported agents + models
claude: { models: ['sonnet-4.6', 'opus-4.6', 'haiku-4.5'], defaultModel: 'sonnet-4.6' }
codex:  { models: ['gpt-5.4'], defaultModel: 'gpt-5.4' }

Grading: four-dimensional evaluation

Build and TypeScript checks run in parallel:

// scripts/eval/lib/grade.ts

const [build, tsc] = await Promise.all([
  x('npx', ['storybook', 'build', '--quiet'], { throwOnError: false, timeout: 300_000, /* ... */ }),
  x('npx', ['tsc', '--noEmit'], { throwOnError: false, timeout: 120_000, /* ... */ }),
]);

The story-render evaluation runs Vitest on the agent's new story files:

// scripts/eval/lib/story-render.ts — runStoryRenderPass()

const result = await x('npx', [
  'vitest', 'run', '--project=storybook', '--reporter=json',
  `--outputFile=${reportPath}`, ...runnableStoryFiles,
], { throwOnError: false, timeout: 300_000 });

The baseline comparison temporarily swaps preview config back to the original, reruns the same tests, then restores the agent's files:

// scripts/eval/lib/story-render.ts — withBaselinePreviewEnvironment()

export async function withBaselinePreviewEnvironment<T>(opts) {
  const previewFiles = getPreviewEnvironmentFiles(opts.fileChanges);
  const snapshots = await snapshotFiles(opts.repoRoot, previewFiles);

  try {
    await restoreFilesFromCommit(opts.repoRoot, opts.baselineCommit, previewFiles);
    return await opts.fn();  // run Vitest with baseline preview
  } finally {
    await restoreSnapshots(opts.repoRoot, snapshots);  // put agent's files back
  }
}

The final score is a normalized preview gain — how much of the remaining room-for-improvement did the agent capture?

// scripts/eval/lib/grade.ts

const gain = (afterRate - beforeRate) / (1 - beforeRate);
// e.g. baseline 60% → agent 80% → gain = (0.8 - 0.6) / (1 - 0.6) = 0.5

End-to-end data flow

The system forms a cycle:

sync-baselines.ts pushes canonical .storybook configs to 7 benchmark repos
eval.ts / run-batch.ts creates worktrees and runs agent trials
Results are published as draft PRs with labels (eval, project:X, agent:Y, model:Z)
collect-pr-data.ts scrapes those PRs into SQLite for analysis

Publishing commits the trial, pushes the branch, and opens a draft PR:

// scripts/eval/lib/publish-trial.ts

await x('git', ['commit', '--no-verify', '-m', `eval: ${opts.data.id}`], { /* ... */ });
await x('git', ['push', '--set-upstream', 'origin', opts.workspace.trialBranch], { /* ... */ });

const prUrl = (await x('gh', [
  'pr', 'create', '--repo', opts.data.project.githubSlug,
  '--base', opts.data.project.branch, '--head', opts.workspace.trialBranch,
  '--draft', '--title', title, '--body', prBody,
])).stdout.trim();

Batch runner fans out trials with concurrency control and round-robin interleaving:

// scripts/eval/run-batch.ts

export const BATCH_REPETITIONS = 10;
export const BATCH_CONCURRENCY = 8;
export const BATCH_DEFAULT_AGENT_IDS = ['claude'] as const;
export const BATCH_DEFAULT_CLAUDE_EFFORTS = ['max'] as const;

Checklist for Contributors

Testing

The changes in this PR are covered in the following automated tests:

stories
unit tests
integration tests
end-to-end tests

Manual testing

Caution

This section is mandatory for all contributions. If you believe no manual test is necessary, please state so explicitly. Thanks!

No manual end-to-end benchmark run was performed.

Validated with:

yarn vitest run scripts/eval/collect-pr-data.test.ts scripts/eval/lib/prepare-trial.test.ts scripts/eval/lib/run-trial.test.ts scripts/eval/lib/publish-trial.test.ts scripts/eval/lib/story-render.test.ts
yarn vitest run scripts/eval/lib/utils.test.ts scripts/eval/run-batch.test.ts scripts/eval/sync-baselines.test.ts
yarn vitest run code/addons/vitest/src/vitest-plugin/viewports.test.ts code/core/src/csf-tools/vitest-plugin/transformer.test.ts

Documentation

Add or update documentation reflecting your changes
If you are deprecating/removing a feature, make sure to update
MIGRATION.MD

Checklist for Maintainers

When this PR is ready for testing, make sure to add ci:normal, ci:merged or ci:daily GH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found in code/lib/cli-storybook/src/sandbox-templates.ts
Make sure this PR contains one of the labels below:
Available labels
- bug: Internal changes that fixes incorrect behavior.
- maintenance: User-facing maintenance tasks.
- dependencies: Upgrading (sometimes downgrading) dependencies.
- build: Internal-facing build tooling & test updates. Will not show up in release changelog.
- cleanup: Minor cleanup style change. Will not show up in release changelog.
- documentation: Documentation only changes. Will not show up in release changelog.
- feature request: Introducing a new feature.
- BREAKING CHANGE: Changes that break compatibility in some way with current major version.
- other: Changes that don't fit in the above categories.

🦋 Canary release

This pull request has been released as version 0.0.0-pr-34421-sha-8babe7bc. Try it out in a new sandbox by running npx storybook@0.0.0-pr-34421-sha-8babe7bc sandbox or in an existing project with npx storybook@0.0.0-pr-34421-sha-8babe7bc upgrade.

More information


Published version	`0.0.0-pr-34421-sha-8babe7bc`
Triggered by	@kasperpeulen
Repository	storybookjs/storybook
Branch	`kasper/eval-worktrees-chromatic-pr`
Commit	`8babe7bc`
Datetime	Fri Apr 3 08:48:19 UTC 2026 (`1775206099`)
Workflow run	23940393766

To request a new release of this pull request, mention the @storybookjs/core team.

core team members can create a new canary release here or locally with gh workflow run --repo storybookjs/storybook publish.yml --field pr=34421

Summary by CodeRabbit

New Features
- Added end-to-end evaluation tooling: trial runs with story-render checks, publishing eval PRs, batch runner, baseline Storybook syncing, and PR-data collection; outputs now use a unified data.json schema v4.
- New transcript/docs export and viewer templates for published results.
- Default evaluation prompt changed to "pattern-copy-play".
Tests
- Large suite of unit and integration tests covering agents, grading, story-render, publishing, syncing, batching, and utilities.
Documentation
- Added a detailed evaluation prompt and baseline Storybook templates.

nx-cloud · 2026-03-31T17:32:13Z

View your CI Pipeline Execution ↗ for commit 0045954

Command	Status	Duration	Result
`nx run-many -t compile,check,knip,test,lint,fmt...`	✅ Succeeded	24m 11s	View ↗

☁️ Nx Cloud last updated this comment at 2026-04-15 09:44:18 UTC

nx-cloud · 2026-03-31T17:32:13Z

View your CI Pipeline Execution ↗ for commit 940e76d

☁️ Nx Cloud last updated this comment at 2026-03-31 17:32:13 UTC

storybook-app-bot · 2026-03-31T17:44:21Z

Package Benchmarks

^{Commit: 0045954, ran on 15 April 2026 at 09:30:08 UTC}

The following packages have significant changes to their size or dependencies:

`storybook`

	Before	After	Difference
Dependency count	50	50	0
Self size	20.56 MB	20.52 MB	🎉 -37 KB 🎉
Dependency size	16.56 MB	16.56 MB	🎉 -93 B 🎉
Bundle Size Analyzer	Link	Link

`@storybook/cli`

	Before	After	Difference
Dependency count	184	184	0
Self size	819 KB	818 KB	🎉 -1 KB 🎉
Dependency size	68.20 MB	68.17 MB	🎉 -36 KB 🎉
Bundle Size Analyzer	Link	Link

`@storybook/codemod`

	Before	After	Difference
Dependency count	177	177	0
Self size	32 KB	32 KB	🎉 -36 B 🎉
Dependency size	66.72 MB	66.68 MB	🎉 -37 KB 🎉
Bundle Size Analyzer	Link	Link

`create-storybook`

	Before	After	Difference
Dependency count	51	51	0
Self size	1.05 MB	1.05 MB	🚨 +985 B 🚨
Dependency size	37.11 MB	37.08 MB	🎉 -37 KB 🎉
Bundle Size Analyzer	node	node

Co-authored-by: Codex <noreply@openai.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

scripts/eval/lib/publish-trial.test.ts (2)
133-217: Extract a shared trial-data factory to reduce drift across tests.

The schema payload is repeated three times with minor overrides. A createTrialData(overrides) helper would make scenarios clearer and reduce maintenance risk when schema fields change.

Also applies to: 311-373, 430-489
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/publish-trial.test.ts` around lines 133 - 217, Tests repeat
the same trial payload across multiple publishTrialBranch calls; extract a
createTrialData(overrides) factory that returns the full default schema object
and merges in any overrides, then replace direct inline payloads (e.g., the data
argument passed to publishTrialBranch) with data: createTrialData({...}) so each
test only specifies the differing fields; update all other occurrences (the
other publishTrialBranch/test cases referenced) to use the same factory to keep
a single source of truth for the schema.
1-3: Consider mocking FS dependencies in this unit test suite.

This file heavily uses real node:fs side effects for setup/assertions. For unit-test isolation and consistency with repo rules, prefer mocking filesystem access (or keep a dedicated integration-style test and mock the rest).

As per coding guidelines, “Mock external dependencies like file system access and loggers in tests”.

Also applies to: 38-66, 130-131, 302-307, 427-427
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/publish-trial.test.ts` around lines 1 - 3, The test file
currently uses real filesystem functions (existsSync, mkdirSync, mkdtempSync,
readFileSync, rmSync, writeFileSync) which causes side effects; replace these
real fs calls by mocking node:fs (e.g., with jest.mock('node:fs') or an
in-memory fs like memfs) so setup/assertions run against the mock, or
alternatively convert the file to an explicit integration test and isolate it;
update tests that call
mkdtempSync/mkdirSync/readFileSync/writeFileSync/existsSync/rmSync to use the
mock API, ensure any cleanup uses the mock too, and adjust assertions to read
from the mocked filesystem rather than the real disk.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/lib/publish-trial.test.ts`:
- Around line 102-122: Replace the inline vi.doMock usage for 'tinyexec' with a
top-level vi.mock('tinyexec', ... , { spy: true }) and move all per-test
implementations into beforeEach using vi.mocked(...) to get a typed spy;
specifically, create the mock factory that exposes x as vi.fn() (so tests can
inspect calls array and use createExecResult), then in beforeEach set
vi.mocked(x).mockImplementation(...) to return the different createExecResult
scenarios (empty, exit 1, PR URL, default) and reset mocks between tests; update
tests that currently call vi.doMock('tinyexec') to assume the top-level mock and
only configure behavior in beforeEach and use vi.mocked(x) when asserting call
arguments.

---

Nitpick comments:
In `@scripts/eval/lib/publish-trial.test.ts`:
- Around line 133-217: Tests repeat the same trial payload across multiple
publishTrialBranch calls; extract a createTrialData(overrides) factory that
returns the full default schema object and merges in any overrides, then replace
direct inline payloads (e.g., the data argument passed to publishTrialBranch)
with data: createTrialData({...}) so each test only specifies the differing
fields; update all other occurrences (the other publishTrialBranch/test cases
referenced) to use the same factory to keep a single source of truth for the
schema.
- Around line 1-3: The test file currently uses real filesystem functions
(existsSync, mkdirSync, mkdtempSync, readFileSync, rmSync, writeFileSync) which
causes side effects; replace these real fs calls by mocking node:fs (e.g., with
jest.mock('node:fs') or an in-memory fs like memfs) so setup/assertions run
against the mock, or alternatively convert the file to an explicit integration
test and isolate it; update tests that call
mkdtempSync/mkdirSync/readFileSync/writeFileSync/existsSync/rmSync to use the
mock API, ensure any cleanup uses the mock too, and adjust assertions to read
from the mocked filesystem rather than the real disk.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 47f875fe-3040-4c28-b902-2e099dcfd202

📥 Commits

Reviewing files that changed from the base of the PR and between 07cb30b and de01e41.

📒 Files selected for processing (2)

scripts/eval/lib/publish-trial.test.ts
scripts/eval/lib/publish-trial.ts

✅ Files skipped from review due to trivial changes (1)

scripts/eval/lib/publish-trial.ts

Move sync-baselines to repo-local string templates, simplify the CLI, and add excalidraw to the benchmark project list. Co-authored-by: Codex <noreply@openai.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (5)

scripts/eval/sync-baselines.ts (2)

218-226: Consider using dirname() for clarity.

Using dirname(targetPath) is semantically clearer than join(targetPath, '..') for getting the parent directory.

Proposed fix

+import { dirname, join, relative, resolve } from 'node:path';
+
   for (const [name, contents] of sourceFiles) {
     const targetPath = join(targetDir, name);
-    await mkdir(join(targetPath, '..'), { recursive: true });
+    await mkdir(dirname(targetPath), { recursive: true });
     const rewritten = contents.replace(

Note: dirname is already imported at line 3, so this is just a code clarity improvement.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/sync-baselines.ts` around lines 218 - 226, Replace the use of
join(targetPath, '..') with dirname(targetPath) for clarity when creating the
parent directory: in the loop over sourceFiles where targetPath is computed,
call mkdir(dirname(targetPath), { recursive: true }) instead of
mkdir(join(targetPath, '..'), { recursive: true }); keep the rest (rewritten
replacement and writeFile(targetPath, rewritten)) unchanged — dirname is already
imported so no new imports are required.

303-305: Redundant regex pattern in normalizeRepoPath.

The second replace pattern /^\.(?=\/)/ will never match because the first pattern /^\.\// already removes any leading ./. The second pattern attempts to match a dot followed by a slash without consuming the slash, but after the first replace, no such pattern can exist.

Proposed fix

 function normalizeRepoPath(value: string) {
-  return value.replace(/^\.\//, '').replace(/^\.(?=\/)/, '');
+  return value.replace(/^\.\//, '');
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/sync-baselines.ts` around lines 303 - 305, The normalizeRepoPath
function contains a redundant second replace; simplify it by replacing both
patterns with a single regex that removes a leading "." or "./". Update
normalizeRepoPath to return value.replace(/^\.\/?/, '') so you only strip a
leading dot or dot+slash in one operation (look for the normalizeRepoPath
function to apply this change).

scripts/eval/lib/baseline-template-files.ts (2)

83-84: Remove internal development path reference from template.

The comment references a local filesystem path (~/code/github/mcp/eval/templates/result-docs/) which is an internal development location and not useful to end users or other contributors.

Proposed fix

 const EVAL_SUPPORT_TRANSCRIPT_MDX = template(String.raw`
-{/* Transcript renderer copied directly from ~/code/github/mcp/eval/templates/result-docs/transcript.tsx and transcript.types.ts */}
+{/* Transcript renderer for eval results visualization */}
 import data from '../eval-results/data.json';

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/baseline-template-files.ts` around lines 83 - 84, The
template string assigned to EVAL_SUPPORT_TRANSCRIPT_MDX contains a comment with
an internal dev path; remove or replace that path-only reference and use a
generic, public-facing description instead (e.g., "Transcript renderer copied
from templates/result-docs/transcript.tsx and transcript.types.ts" or simply
"Transcript renderer"). Update the comment inside the template literal for
EVAL_SUPPORT_TRANSCRIPT_MDX to avoid any local filesystem paths while keeping
the pointer to the original module names.

551-570: Consider guarding cleanup against already-removed elements.

The cleanup function calls removeChild unconditionally. If any element was already removed (e.g., by page navigation or Storybook hot reload), this could throw. Using optional checks would make it more robust.

Proposed fix

     return () => {
-      document.head.removeChild(script);
-      document.head.removeChild(style);
-      document.head.removeChild(codeStyle);
+      if (script.parentNode) document.head.removeChild(script);
+      if (style.parentNode) document.head.removeChild(style);
+      if (codeStyle.parentNode) document.head.removeChild(codeStyle);
     };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/baseline-template-files.ts` around lines 551 - 570, The
cleanup function returned from useEffect unconditionally calls
document.head.removeChild(script|style|codeStyle) which can throw if elements
were already removed; update the cleanup in the useEffect to safely remove those
nodes (e.g., check parentNode or use element.remove with optional chaining) for
the three variables script, style, and codeStyle so removal is guarded against
already-removed elements.

scripts/eval/sync-baselines.test.ts (1)

462-464: Prefer Node.js built-in mkdirSync over shell command.

Using Node's fs.mkdirSync with recursive: true is more portable and consistent with the other fs operations in the file. The shell mkdir command may behave differently on Windows CI runners.
Proposed fix
+import { mkdirSync } from 'node:fs';
+
 function mkdirSyncRecursive(path: string) {
-  execFileSync('mkdir', ['-p', path]);
+  mkdirSync(path, { recursive: true });
 }
Note: mkdirSync is already imported at line 2 as part of the destructured imports from node:fs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/sync-baselines.test.ts` around lines 462 - 464, The helper
mkdirSyncRecursive uses a shell call; replace its body to call the Node fs API
instead: use the existing mkdirSync function (already imported) with the option
{ recursive: true } inside mkdirSyncRecursive so it creates the directory tree
in a cross-platform, portable way (keep the function name mkdirSyncRecursive).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/lib/baseline-template-files.ts`:
- Around line 687-693: The percentage prop calculation on the Turn component can
divide by zero when totalMessageTokens is 0; update the expression where
percentage is passed (the Turn element rendering User Prompt) to guard against
zero by computing percentage as zero (or a safe fallback) when
totalMessageTokens is falsy, e.g., use a conditional or Math.max fallback around
(promptTokenCount / totalMessageTokens) * 100 so you never pass Infinity to the
percentage prop.

---

Nitpick comments:
In `@scripts/eval/lib/baseline-template-files.ts`:
- Around line 83-84: The template string assigned to EVAL_SUPPORT_TRANSCRIPT_MDX
contains a comment with an internal dev path; remove or replace that path-only
reference and use a generic, public-facing description instead (e.g.,
"Transcript renderer copied from templates/result-docs/transcript.tsx and
transcript.types.ts" or simply "Transcript renderer"). Update the comment inside
the template literal for EVAL_SUPPORT_TRANSCRIPT_MDX to avoid any local
filesystem paths while keeping the pointer to the original module names.
- Around line 551-570: The cleanup function returned from useEffect
unconditionally calls document.head.removeChild(script|style|codeStyle) which
can throw if elements were already removed; update the cleanup in the useEffect
to safely remove those nodes (e.g., check parentNode or use element.remove with
optional chaining) for the three variables script, style, and codeStyle so
removal is guarded against already-removed elements.

In `@scripts/eval/sync-baselines.test.ts`:
- Around line 462-464: The helper mkdirSyncRecursive uses a shell call; replace
its body to call the Node fs API instead: use the existing mkdirSync function
(already imported) with the option { recursive: true } inside mkdirSyncRecursive
so it creates the directory tree in a cross-platform, portable way (keep the
function name mkdirSyncRecursive).

In `@scripts/eval/sync-baselines.ts`:
- Around line 218-226: Replace the use of join(targetPath, '..') with
dirname(targetPath) for clarity when creating the parent directory: in the loop
over sourceFiles where targetPath is computed, call mkdir(dirname(targetPath), {
recursive: true }) instead of mkdir(join(targetPath, '..'), { recursive: true
}); keep the rest (rewritten replacement and writeFile(targetPath, rewritten))
unchanged — dirname is already imported so no new imports are required.
- Around line 303-305: The normalizeRepoPath function contains a redundant
second replace; simplify it by replacing both patterns with a single regex that
removes a leading "." or "./". Update normalizeRepoPath to return
value.replace(/^\.\/?/, '') so you only strip a leading dot or dot+slash in one
operation (look for the normalizeRepoPath function to apply this change).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 915b8a56-ca00-4fad-88a8-f13e7c2d29bf

📥 Commits

Reviewing files that changed from the base of the PR and between de01e41 and 5685647.

📒 Files selected for processing (6)

scripts/eval/lib/baseline-template-files.ts
scripts/eval/lib/projects.ts
scripts/eval/lib/result-docs.ts
scripts/eval/lib/transcript-types.ts
scripts/eval/sync-baselines.test.ts
scripts/eval/sync-baselines.ts

✅ Files skipped from review due to trivial changes (1)

scripts/eval/lib/transcript-types.ts

🚧 Files skipped from review as they are similar to previous changes (2)

scripts/eval/lib/projects.ts
scripts/eval/lib/result-docs.ts

Keep the string-based baseline template source byte-exact while switching it back to readable backtick literals. Co-authored-by: Codex <noreply@openai.com>

coderabbitai

♻️ Duplicate comments (1)

scripts/eval/lib/baseline-template-files.ts (1)
674-680: ⚠️ Potential issue | 🟡 Minor

Guard against division by zero when calculating percentage.

If totalMessageTokens is zero (e.g., empty messages array or all zero token counts), the percentage calculation will produce Infinity. Other percentage calculations in this file (lines 807, 842-843) already have this guard.
Proposed fix
       <Turn
         type="prompt"
         title="User Prompt"
         tokenCount={\`\${promptTokenCount.toLocaleString()} tokens\`}
-        percentage={(promptTokenCount / totalMessageTokens) * 100}
+        percentage={totalMessageTokens > 0 ? (promptTokenCount / totalMessageTokens) * 100 : 0}
       >
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/baseline-template-files.ts` around lines 674 - 680, The
percentage expression for the Turn component can divide by zero when
totalMessageTokens is 0; update the JSX where percentage is computed (the Turn
element rendering with props tokenCount and percentage) to guard against
division by zero by checking totalMessageTokens before computing (e.g., use a
conditional or fallback to 0) so percentage uses promptTokenCount and
totalMessageTokens safely.

🧹 Nitpick comments (2)

scripts/eval/lib/baseline-template-files.ts (2)

83-85: Remove local development path from comment.

Same issue as the MDX template - this local path (~/code/github/mcp/eval/templates/result-docs/transcript.tsx) isn't meaningful outside the original development environment.

Suggested fix

 const EVAL_SUPPORT_TRANSCRIPT_TSX = `/*
- * Keep the baseline copies exact to ~/code/github/mcp/eval/templates/result-docs/transcript.tsx.
- * This repo-local template keeps only the minimum lint shims required by Storybook's monorepo.
+ * Transcript component for rendering eval results.
+ * This baseline template keeps only the minimum lint shims required by Storybook's monorepo.
  */

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/baseline-template-files.ts` around lines 83 - 85, Update the
comment in baseline-template-files.ts to remove the hard-coded local development
path; replace the string
"~/code/github/mcp/eval/templates/result-docs/transcript.tsx" with a neutral,
repo-relative or generic path (for example
"templates/result-docs/transcript.tsx" or a description like "the repo's
transcript.tsx template") so the comment remains meaningful outside the original
developer's environment.

73-73: Remove local development path from comment.

The comment references a local filesystem path (~/code/github/mcp/eval/templates/result-docs/transcript.tsx) which exposes internal development structure and isn't meaningful in the committed codebase.

Suggested fix

-const EVAL_SUPPORT_TRANSCRIPT_MDX = `{/* Transcript renderer copied directly from ~/code/github/mcp/eval/templates/result-docs/transcript.tsx and transcript.types.ts */}
+const EVAL_SUPPORT_TRANSCRIPT_MDX = `{/* Transcript renderer for eval results */}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/baseline-template-files.ts` at line 73, The string constant
EVAL_SUPPORT_TRANSCRIPT_MDX contains a comment that exposes a local development
path; update the comment inside EVAL_SUPPORT_TRANSCRIPT_MDX to remove the
filesystem path and replace it with a neutral description (e.g., "Transcript
renderer copied from templates/result-docs/transcript") or delete the path
entirely so no local directory structure is committed, ensuring the comment
still documents origin without leaking local paths.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@scripts/eval/lib/baseline-template-files.ts`:
- Around line 674-680: The percentage expression for the Turn component can
divide by zero when totalMessageTokens is 0; update the JSX where percentage is
computed (the Turn element rendering with props tokenCount and percentage) to
guard against division by zero by checking totalMessageTokens before computing
(e.g., use a conditional or fallback to 0) so percentage uses promptTokenCount
and totalMessageTokens safely.

---

Nitpick comments:
In `@scripts/eval/lib/baseline-template-files.ts`:
- Around line 83-85: Update the comment in baseline-template-files.ts to remove
the hard-coded local development path; replace the string
"~/code/github/mcp/eval/templates/result-docs/transcript.tsx" with a neutral,
repo-relative or generic path (for example
"templates/result-docs/transcript.tsx" or a description like "the repo's
transcript.tsx template") so the comment remains meaningful outside the original
developer's environment.
- Line 73: The string constant EVAL_SUPPORT_TRANSCRIPT_MDX contains a comment
that exposes a local development path; update the comment inside
EVAL_SUPPORT_TRANSCRIPT_MDX to remove the filesystem path and replace it with a
neutral description (e.g., "Transcript renderer copied from
templates/result-docs/transcript") or delete the path entirely so no local
directory structure is committed, ensuring the comment still documents origin
without leaking local paths.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ca5fb2f4-cf24-4324-902a-b709289a3378

📥 Commits

Reviewing files that changed from the base of the PR and between 5685647 and 3e26b59.

📒 Files selected for processing (2)

scripts/eval/lib/baseline-template-files.ts
scripts/eval/lib/transcript-types.ts

✅ Files skipped from review due to trivial changes (1)

scripts/eval/lib/transcript-types.ts

Avoid top-level const declarations in the synced MDX files so Storybook can index them, while keeping empty data.json handling in the consumers. Co-authored-by: Codex <noreply@openai.com>

Co-authored-by: Codex <noreply@openai.com>

- sync-baselines.test.ts: remove unused resolve import and type BASELINE_STORYBOOK_FILES as Record to allow dynamic key lookup - run-batch.test.ts: include excalidraw in expected project list

…n file check The script checked for .storybook/main.* before git pull and failed on fresh clones. Removed that unnecessary check (syncStorybookDir recreates the directory from scratch) and reuse ensureSourceClone to auto-clone repos that haven't been cloned yet.

Sidnioulz

The PR is big so I can't wrap my head around it without taking more time to review. The only thing I spotted is a potential logic issue in the gain compute function.

…high effort - Add formatScorePercent for CLI, PR body, and eval summary (data.json stays 0-1) - Default run-batch: both agents, opus + gpt-5.4, high effort; update tests - Refresh scripts/eval README

Eval: Add worktree prep and PR artifacts

940e76d

kasperpeulen added build Internal-facing build tooling & test updates ci:normal Run our default set of CI jobs (choose this for most PRs). labels Mar 31, 2026

kasperpeulen added 23 commits April 1, 2026 09:19

Eval: Improve transcript docs rendering

20eeecb

Eval: copy transcript docs from MCP

a45d092

Eval: keep transcript template as TSX

0c5db20

Eval: link transcript template to GitHub source

fbf4ad7

Fix eval formatting drift

6428925

Fix CI type errors

68f7fef

Eval: stop injecting Chromatic workflow into trials

29f5f8c

Eval: use repo-backed Storybook support

fe23d3f

Eval: fix script formatting

b157f85

Vitest: enable retina screenshot captures

c64e42a

Vitest: gate retina screenshots on env

d9e763a

Vitest: use screenshot flag for retina mode

e9a5868

Eval: unify trial data and baseline docs

6dc4d72

Eval: format data pipeline files

dd058e9

Eval: fix screenshot typing and project list

0f73e3a

Eval: write provisional data before grading

88fe49e

Eval: re-add evergreen project

1013961

Eval: handle empty Claude tool results

061acb0

Vitest: refresh config update snapshots

92d8e3f

Eval: unify .storybook baseline syncing

7c3821b

Eval: score preview gain on generated stories

8babe7b

Eval: refine trial prep and preview setup

4248a42

Eval: interleave batch runs across projects

2dda826

Sidnioulz force-pushed the project/sb-agentic-setup branch from cc92a8c to 11bc8f0 Compare April 9, 2026 07:49