Skip to content

Eval: Add worktree prep and PR artifacts#34421

Merged
yannbf merged 53 commits into
project/sb-agentic-setupfrom
kasper/eval-worktrees-chromatic-pr
Apr 15, 2026
Merged

Eval: Add worktree prep and PR artifacts#34421
yannbf merged 53 commits into
project/sb-agentic-setupfrom
kasper/eval-worktrees-chromatic-pr

Conversation

@kasperpeulen
Copy link
Copy Markdown
Member

@kasperpeulen kasperpeulen commented Mar 31, 2026

Closes #

What I did

Expanded the eval harness into a fuller end-to-end trial pipeline.

This PR now focuses on the scripts/eval workflow and related artifacts:

  • prepare eval trials from persistent source clones plus fresh per-trial worktrees
  • run Claude/Codex setup evals and write schema v4 trial artifacts
  • grade runs using generated-story preview gain while still recording before/after ghost-story coverage
  • publish draft eval PRs with structured result data and transcript documents
  • collect eval PR data into SQLite for later analysis
  • add batch execution and shared .storybook baseline sync tooling for benchmark repos
  • use the play-driven pattern-copy-play prompt as the setup-eval prompt path

Note: This PR targets project/sb-agentic-setup (a project branch), not next or main.

Detailed walkthrough

Pipeline architecture

Pipeline Architecture

Two entry points (eval.ts for single trials, run-batch.ts for batch) feed the core pipeline: prepare a worktree → run the agentgrade the output → compute a scorepublish a draft PR. Supporting modules supply configuration and templates; post-pipeline tools collect data into SQLite or sync baselines to benchmark repos.


Worktree isolation strategy

Worktree Strategy

Instead of cloning fresh every time, the pipeline maintains one persistent source clone per project and spawns lightweight git worktrees for each trial:

// scripts/eval/lib/prepare-trial.ts

export async function prepareTrial(project, trialId, logger) {
  const sourceDir = join(REPOS_DIR, project.name);
  const trialDir = join(TRIALS_DIR, trialId);
  const repoRoot = join(trialDir, 'project');
  const trialBranch = `trial/${trialId}`;

  await ensureSourceClone(project, sourceDir, logger);            // clone once
  const baselineCommit = await syncSourceClone(project, sourceDir, logger); // fetch + reset
  await createTrialWorktree({ sourceDir, trialBranch, repoRoot, baseBranch: project.branch, logger });

  const projectPath = getProjectPath(repoRoot, project.projectDir);
  await installDeps(projectPath, logger, undefined, { stopAt: repoRoot });
  await syncLocalBuildsToTrial(projectPath, logger);              // optional: inject local storybook dist

  return { trialDir, sourceDir, repoRoot, projectPath, resultsDir, baselineCommit, trialBranch };
}

The worktree is created with a single git command:

// scripts/eval/lib/prepare-trial.ts — createTrialWorktree()

await x('git', ['worktree', 'add', '-b', trialBranch, repoRoot, baseBranch], {
  timeout: 120_000,
  nodeOptions: { cwd: sourceDir },
});

Seven benchmark repos cover different tech stacks:

// scripts/eval/lib/projects.ts

export const PROJECTS: Project[] = [
  { name: 'mealdrop',     description: 'Styled components, Redux, React Router' },
  { name: 'edgy',         description: 'Tailwind, HeadlessUI, React Router' },
  { name: 'wikitok',      description: 'Simple project with Tailwind', projectDir: 'frontend' },
  { name: 'baklava',      description: 'Component library with Zustand' },
  { name: 'echarts',      description: 'ECharts React wrapper' },
  { name: 'evergreen-ci', description: 'GraphQL', projectDir: 'packages/lib' },
  { name: 'excalidraw',   description: 'Monorepo with canvas based drawing app', projectDir: 'excalidraw-app' },
];

Trial lifecycle (8 steps)

Trial Steps

run-trial.ts orchestrates the full sequence. The critical detail is the double-write of data.json — a provisional version before grading (so MDX imports don't crash the build), then the final version after:

// scripts/eval/lib/run-trial.ts — runTrial()

// Step 6a: Write provisional data so baseline-owned MDX files can resolve during grading
const provisionalData = buildEvalData({ id: trialId, /* ... */ grade: { buildSuccess: false, /* ... */ } });
await writeFile(join(workspace.resultsDir, 'data.json'), JSON.stringify(provisionalData, null, 2));

// Step 6b: Grade the results
const { grade: trialGrade, score } = await grade(workspace, log, baselineGhostStories);

// Step 7: Rewrite with final grade
const reportForCommit = buildEvalData({ ...provisionalData, grade: trialGrade, score, /* ... */ });
await writeFile(join(workspace.resultsDir, 'data.json'), JSON.stringify(reportForCommit, null, 2));

// Step 8: Commit, push, open PR
const publish = await publishTrialBranch({ data: reportForCommit, workspace, logger: log });

The agent execution uses a pluggable driver interface:

// scripts/eval/lib/agents/config.ts

export interface AgentDriver {
  name: AgentId;
  execute(params: AgentExecuteParams): Promise<AgentExecutionResult>;
}

// Supported agents + models
claude: { models: ['sonnet-4.6', 'opus-4.6', 'haiku-4.5'], defaultModel: 'sonnet-4.6' }
codex:  { models: ['gpt-5.4'], defaultModel: 'gpt-5.4' }

Grading: four-dimensional evaluation

Grading Process

Build and TypeScript checks run in parallel:

// scripts/eval/lib/grade.ts

const [build, tsc] = await Promise.all([
  x('npx', ['storybook', 'build', '--quiet'], { throwOnError: false, timeout: 300_000, /* ... */ }),
  x('npx', ['tsc', '--noEmit'], { throwOnError: false, timeout: 120_000, /* ... */ }),
]);

The story-render evaluation runs Vitest on the agent's new story files:

// scripts/eval/lib/story-render.ts — runStoryRenderPass()

const result = await x('npx', [
  'vitest', 'run', '--project=storybook', '--reporter=json',
  `--outputFile=${reportPath}`, ...runnableStoryFiles,
], { throwOnError: false, timeout: 300_000 });

The baseline comparison temporarily swaps preview config back to the original, reruns the same tests, then restores the agent's files:

// scripts/eval/lib/story-render.ts — withBaselinePreviewEnvironment()

export async function withBaselinePreviewEnvironment<T>(opts) {
  const previewFiles = getPreviewEnvironmentFiles(opts.fileChanges);
  const snapshots = await snapshotFiles(opts.repoRoot, previewFiles);

  try {
    await restoreFilesFromCommit(opts.repoRoot, opts.baselineCommit, previewFiles);
    return await opts.fn();  // run Vitest with baseline preview
  } finally {
    await restoreSnapshots(opts.repoRoot, snapshots);  // put agent's files back
  }
}

The final score is a normalized preview gain — how much of the remaining room-for-improvement did the agent capture?

// scripts/eval/lib/grade.ts

const gain = (afterRate - beforeRate) / (1 - beforeRate);
// e.g. baseline 60% → agent 80% → gain = (0.8 - 0.6) / (1 - 0.6) = 0.5

End-to-end data flow

E2E Data Flow

The system forms a cycle:

  1. sync-baselines.ts pushes canonical .storybook configs to 7 benchmark repos
  2. eval.ts / run-batch.ts creates worktrees and runs agent trials
  3. Results are published as draft PRs with labels (eval, project:X, agent:Y, model:Z)
  4. collect-pr-data.ts scrapes those PRs into SQLite for analysis

Publishing commits the trial, pushes the branch, and opens a draft PR:

// scripts/eval/lib/publish-trial.ts

await x('git', ['commit', '--no-verify', '-m', `eval: ${opts.data.id}`], { /* ... */ });
await x('git', ['push', '--set-upstream', 'origin', opts.workspace.trialBranch], { /* ... */ });

const prUrl = (await x('gh', [
  'pr', 'create', '--repo', opts.data.project.githubSlug,
  '--base', opts.data.project.branch, '--head', opts.workspace.trialBranch,
  '--draft', '--title', title, '--body', prBody,
])).stdout.trim();

Batch runner fans out trials with concurrency control and round-robin interleaving:

// scripts/eval/run-batch.ts

export const BATCH_REPETITIONS = 10;
export const BATCH_CONCURRENCY = 8;
export const BATCH_DEFAULT_AGENT_IDS = ['claude'] as const;
export const BATCH_DEFAULT_CLAUDE_EFFORTS = ['max'] as const;

Checklist for Contributors

Testing

The changes in this PR are covered in the following automated tests:

  • stories
  • unit tests
  • integration tests
  • end-to-end tests

Manual testing

Caution

This section is mandatory for all contributions. If you believe no manual test is necessary, please state so explicitly. Thanks!

No manual end-to-end benchmark run was performed.

Validated with:

  1. yarn vitest run scripts/eval/collect-pr-data.test.ts scripts/eval/lib/prepare-trial.test.ts scripts/eval/lib/run-trial.test.ts scripts/eval/lib/publish-trial.test.ts scripts/eval/lib/story-render.test.ts
  2. yarn vitest run scripts/eval/lib/utils.test.ts scripts/eval/run-batch.test.ts scripts/eval/sync-baselines.test.ts
  3. yarn vitest run code/addons/vitest/src/vitest-plugin/viewports.test.ts code/core/src/csf-tools/vitest-plugin/transformer.test.ts

Documentation

  • Add or update documentation reflecting your changes
  • If you are deprecating/removing a feature, make sure to update
    MIGRATION.MD

Checklist for Maintainers

  • When this PR is ready for testing, make sure to add ci:normal, ci:merged or ci:daily GH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found in code/lib/cli-storybook/src/sandbox-templates.ts

  • Make sure this PR contains one of the labels below:

    Available labels
    • bug: Internal changes that fixes incorrect behavior.
    • maintenance: User-facing maintenance tasks.
    • dependencies: Upgrading (sometimes downgrading) dependencies.
    • build: Internal-facing build tooling & test updates. Will not show up in release changelog.
    • cleanup: Minor cleanup style change. Will not show up in release changelog.
    • documentation: Documentation only changes. Will not show up in release changelog.
    • feature request: Introducing a new feature.
    • BREAKING CHANGE: Changes that break compatibility in some way with current major version.
    • other: Changes that don't fit in the above categories.

🦋 Canary release

This pull request has been released as version 0.0.0-pr-34421-sha-8babe7bc. Try it out in a new sandbox by running npx storybook@0.0.0-pr-34421-sha-8babe7bc sandbox or in an existing project with npx storybook@0.0.0-pr-34421-sha-8babe7bc upgrade.

More information
Published version 0.0.0-pr-34421-sha-8babe7bc
Triggered by @kasperpeulen
Repository storybookjs/storybook
Branch kasper/eval-worktrees-chromatic-pr
Commit 8babe7bc
Datetime Fri Apr 3 08:48:19 UTC 2026 (1775206099)
Workflow run 23940393766

To request a new release of this pull request, mention the @storybookjs/core team.

core team members can create a new canary release here or locally with gh workflow run --repo storybookjs/storybook publish.yml --field pr=34421

Summary by CodeRabbit

  • New Features

    • Added end-to-end evaluation tooling: trial runs with story-render checks, publishing eval PRs, batch runner, baseline Storybook syncing, and PR-data collection; outputs now use a unified data.json schema v4.
    • New transcript/docs export and viewer templates for published results.
    • Default evaluation prompt changed to "pattern-copy-play".
  • Tests

    • Large suite of unit and integration tests covering agents, grading, story-render, publishing, syncing, batching, and utilities.
  • Documentation

    • Added a detailed evaluation prompt and baseline Storybook templates.

@kasperpeulen kasperpeulen added build Internal-facing build tooling & test updates ci:normal Run our default set of CI jobs (choose this for most PRs). labels Mar 31, 2026
@nx-cloud
Copy link
Copy Markdown

nx-cloud Bot commented Mar 31, 2026

View your CI Pipeline Execution ↗ for commit 0045954

Command Status Duration Result
nx run-many -t compile,check,knip,test,lint,fmt... ✅ Succeeded 24m 11s View ↗

☁️ Nx Cloud last updated this comment at 2026-04-15 09:44:18 UTC

@nx-cloud
Copy link
Copy Markdown

nx-cloud Bot commented Mar 31, 2026

View your CI Pipeline Execution ↗ for commit 940e76d


☁️ Nx Cloud last updated this comment at 2026-03-31 17:32:13 UTC

@storybook-app-bot
Copy link
Copy Markdown

storybook-app-bot Bot commented Mar 31, 2026

Package Benchmarks

Commit: 0045954, ran on 15 April 2026 at 09:30:08 UTC

The following packages have significant changes to their size or dependencies:

storybook

Before After Difference
Dependency count 50 50 0
Self size 20.56 MB 20.52 MB 🎉 -37 KB 🎉
Dependency size 16.56 MB 16.56 MB 🎉 -93 B 🎉
Bundle Size Analyzer Link Link

@storybook/cli

Before After Difference
Dependency count 184 184 0
Self size 819 KB 818 KB 🎉 -1 KB 🎉
Dependency size 68.20 MB 68.17 MB 🎉 -36 KB 🎉
Bundle Size Analyzer Link Link

@storybook/codemod

Before After Difference
Dependency count 177 177 0
Self size 32 KB 32 KB 🎉 -36 B 🎉
Dependency size 66.72 MB 66.68 MB 🎉 -37 KB 🎉
Bundle Size Analyzer Link Link

create-storybook

Before After Difference
Dependency count 51 51 0
Self size 1.05 MB 1.05 MB 🚨 +985 B 🚨
Dependency size 37.11 MB 37.08 MB 🎉 -37 KB 🎉
Bundle Size Analyzer node node

@Sidnioulz Sidnioulz force-pushed the project/sb-agentic-setup branch from cc92a8c to 11bc8f0 Compare April 9, 2026 07:49
Comment thread scripts/eval/prompts/pattern-copy-play.md
Comment thread scripts/eval/run-batch.test.ts
Comment thread scripts/eval/run-batch.ts
Comment thread scripts/eval/sync-baselines.test.ts
Comment thread scripts/eval/sync-baselines.ts
Comment thread scripts/package.json
Co-authored-by: Codex <noreply@openai.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
scripts/eval/lib/publish-trial.test.ts (2)

133-217: Extract a shared trial-data factory to reduce drift across tests.

The schema payload is repeated three times with minor overrides. A createTrialData(overrides) helper would make scenarios clearer and reduce maintenance risk when schema fields change.

Also applies to: 311-373, 430-489

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/publish-trial.test.ts` around lines 133 - 217, Tests repeat
the same trial payload across multiple publishTrialBranch calls; extract a
createTrialData(overrides) factory that returns the full default schema object
and merges in any overrides, then replace direct inline payloads (e.g., the data
argument passed to publishTrialBranch) with data: createTrialData({...}) so each
test only specifies the differing fields; update all other occurrences (the
other publishTrialBranch/test cases referenced) to use the same factory to keep
a single source of truth for the schema.

1-3: Consider mocking FS dependencies in this unit test suite.

This file heavily uses real node:fs side effects for setup/assertions. For unit-test isolation and consistency with repo rules, prefer mocking filesystem access (or keep a dedicated integration-style test and mock the rest).

As per coding guidelines, “Mock external dependencies like file system access and loggers in tests”.

Also applies to: 38-66, 130-131, 302-307, 427-427

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/publish-trial.test.ts` around lines 1 - 3, The test file
currently uses real filesystem functions (existsSync, mkdirSync, mkdtempSync,
readFileSync, rmSync, writeFileSync) which causes side effects; replace these
real fs calls by mocking node:fs (e.g., with jest.mock('node:fs') or an
in-memory fs like memfs) so setup/assertions run against the mock, or
alternatively convert the file to an explicit integration test and isolate it;
update tests that call
mkdtempSync/mkdirSync/readFileSync/writeFileSync/existsSync/rmSync to use the
mock API, ensure any cleanup uses the mock too, and adjust assertions to read
from the mocked filesystem rather than the real disk.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/lib/publish-trial.test.ts`:
- Around line 102-122: Replace the inline vi.doMock usage for 'tinyexec' with a
top-level vi.mock('tinyexec', ... , { spy: true }) and move all per-test
implementations into beforeEach using vi.mocked(...) to get a typed spy;
specifically, create the mock factory that exposes x as vi.fn() (so tests can
inspect calls array and use createExecResult), then in beforeEach set
vi.mocked(x).mockImplementation(...) to return the different createExecResult
scenarios (empty, exit 1, PR URL, default) and reset mocks between tests; update
tests that currently call vi.doMock('tinyexec') to assume the top-level mock and
only configure behavior in beforeEach and use vi.mocked(x) when asserting call
arguments.

---

Nitpick comments:
In `@scripts/eval/lib/publish-trial.test.ts`:
- Around line 133-217: Tests repeat the same trial payload across multiple
publishTrialBranch calls; extract a createTrialData(overrides) factory that
returns the full default schema object and merges in any overrides, then replace
direct inline payloads (e.g., the data argument passed to publishTrialBranch)
with data: createTrialData({...}) so each test only specifies the differing
fields; update all other occurrences (the other publishTrialBranch/test cases
referenced) to use the same factory to keep a single source of truth for the
schema.
- Around line 1-3: The test file currently uses real filesystem functions
(existsSync, mkdirSync, mkdtempSync, readFileSync, rmSync, writeFileSync) which
causes side effects; replace these real fs calls by mocking node:fs (e.g., with
jest.mock('node:fs') or an in-memory fs like memfs) so setup/assertions run
against the mock, or alternatively convert the file to an explicit integration
test and isolate it; update tests that call
mkdtempSync/mkdirSync/readFileSync/writeFileSync/existsSync/rmSync to use the
mock API, ensure any cleanup uses the mock too, and adjust assertions to read
from the mocked filesystem rather than the real disk.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 47f875fe-3040-4c28-b902-2e099dcfd202

📥 Commits

Reviewing files that changed from the base of the PR and between 07cb30b and de01e41.

📒 Files selected for processing (2)
  • scripts/eval/lib/publish-trial.test.ts
  • scripts/eval/lib/publish-trial.ts
✅ Files skipped from review due to trivial changes (1)
  • scripts/eval/lib/publish-trial.ts

Comment thread scripts/eval/lib/publish-trial.test.ts
Move sync-baselines to repo-local string templates, simplify the CLI, and add excalidraw to the benchmark project list.

Co-authored-by: Codex <noreply@openai.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
scripts/eval/sync-baselines.ts (2)

218-226: Consider using dirname() for clarity.

Using dirname(targetPath) is semantically clearer than join(targetPath, '..') for getting the parent directory.

Proposed fix
+import { dirname, join, relative, resolve } from 'node:path';
+
   for (const [name, contents] of sourceFiles) {
     const targetPath = join(targetDir, name);
-    await mkdir(join(targetPath, '..'), { recursive: true });
+    await mkdir(dirname(targetPath), { recursive: true });
     const rewritten = contents.replace(

Note: dirname is already imported at line 3, so this is just a code clarity improvement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/sync-baselines.ts` around lines 218 - 226, Replace the use of
join(targetPath, '..') with dirname(targetPath) for clarity when creating the
parent directory: in the loop over sourceFiles where targetPath is computed,
call mkdir(dirname(targetPath), { recursive: true }) instead of
mkdir(join(targetPath, '..'), { recursive: true }); keep the rest (rewritten
replacement and writeFile(targetPath, rewritten)) unchanged — dirname is already
imported so no new imports are required.

303-305: Redundant regex pattern in normalizeRepoPath.

The second replace pattern /^\.(?=\/)/ will never match because the first pattern /^\.\// already removes any leading ./. The second pattern attempts to match a dot followed by a slash without consuming the slash, but after the first replace, no such pattern can exist.

Proposed fix
 function normalizeRepoPath(value: string) {
-  return value.replace(/^\.\//, '').replace(/^\.(?=\/)/, '');
+  return value.replace(/^\.\//, '');
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/sync-baselines.ts` around lines 303 - 305, The normalizeRepoPath
function contains a redundant second replace; simplify it by replacing both
patterns with a single regex that removes a leading "." or "./". Update
normalizeRepoPath to return value.replace(/^\.\/?/, '') so you only strip a
leading dot or dot+slash in one operation (look for the normalizeRepoPath
function to apply this change).
scripts/eval/lib/baseline-template-files.ts (2)

83-84: Remove internal development path reference from template.

The comment references a local filesystem path (~/code/github/mcp/eval/templates/result-docs/) which is an internal development location and not useful to end users or other contributors.

Proposed fix
 const EVAL_SUPPORT_TRANSCRIPT_MDX = template(String.raw`
-{/* Transcript renderer copied directly from ~/code/github/mcp/eval/templates/result-docs/transcript.tsx and transcript.types.ts */}
+{/* Transcript renderer for eval results visualization */}
 import data from '../eval-results/data.json';
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/baseline-template-files.ts` around lines 83 - 84, The
template string assigned to EVAL_SUPPORT_TRANSCRIPT_MDX contains a comment with
an internal dev path; remove or replace that path-only reference and use a
generic, public-facing description instead (e.g., "Transcript renderer copied
from templates/result-docs/transcript.tsx and transcript.types.ts" or simply
"Transcript renderer"). Update the comment inside the template literal for
EVAL_SUPPORT_TRANSCRIPT_MDX to avoid any local filesystem paths while keeping
the pointer to the original module names.

551-570: Consider guarding cleanup against already-removed elements.

The cleanup function calls removeChild unconditionally. If any element was already removed (e.g., by page navigation or Storybook hot reload), this could throw. Using optional checks would make it more robust.

Proposed fix
     return () => {
-      document.head.removeChild(script);
-      document.head.removeChild(style);
-      document.head.removeChild(codeStyle);
+      if (script.parentNode) document.head.removeChild(script);
+      if (style.parentNode) document.head.removeChild(style);
+      if (codeStyle.parentNode) document.head.removeChild(codeStyle);
     };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/baseline-template-files.ts` around lines 551 - 570, The
cleanup function returned from useEffect unconditionally calls
document.head.removeChild(script|style|codeStyle) which can throw if elements
were already removed; update the cleanup in the useEffect to safely remove those
nodes (e.g., check parentNode or use element.remove with optional chaining) for
the three variables script, style, and codeStyle so removal is guarded against
already-removed elements.
scripts/eval/sync-baselines.test.ts (1)

462-464: Prefer Node.js built-in mkdirSync over shell command.

Using Node's fs.mkdirSync with recursive: true is more portable and consistent with the other fs operations in the file. The shell mkdir command may behave differently on Windows CI runners.

Proposed fix
+import { mkdirSync } from 'node:fs';
+
 function mkdirSyncRecursive(path: string) {
-  execFileSync('mkdir', ['-p', path]);
+  mkdirSync(path, { recursive: true });
 }

Note: mkdirSync is already imported at line 2 as part of the destructured imports from node:fs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/sync-baselines.test.ts` around lines 462 - 464, The helper
mkdirSyncRecursive uses a shell call; replace its body to call the Node fs API
instead: use the existing mkdirSync function (already imported) with the option
{ recursive: true } inside mkdirSyncRecursive so it creates the directory tree
in a cross-platform, portable way (keep the function name mkdirSyncRecursive).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/lib/baseline-template-files.ts`:
- Around line 687-693: The percentage prop calculation on the Turn component can
divide by zero when totalMessageTokens is 0; update the expression where
percentage is passed (the Turn element rendering User Prompt) to guard against
zero by computing percentage as zero (or a safe fallback) when
totalMessageTokens is falsy, e.g., use a conditional or Math.max fallback around
(promptTokenCount / totalMessageTokens) * 100 so you never pass Infinity to the
percentage prop.

---

Nitpick comments:
In `@scripts/eval/lib/baseline-template-files.ts`:
- Around line 83-84: The template string assigned to EVAL_SUPPORT_TRANSCRIPT_MDX
contains a comment with an internal dev path; remove or replace that path-only
reference and use a generic, public-facing description instead (e.g.,
"Transcript renderer copied from templates/result-docs/transcript.tsx and
transcript.types.ts" or simply "Transcript renderer"). Update the comment inside
the template literal for EVAL_SUPPORT_TRANSCRIPT_MDX to avoid any local
filesystem paths while keeping the pointer to the original module names.
- Around line 551-570: The cleanup function returned from useEffect
unconditionally calls document.head.removeChild(script|style|codeStyle) which
can throw if elements were already removed; update the cleanup in the useEffect
to safely remove those nodes (e.g., check parentNode or use element.remove with
optional chaining) for the three variables script, style, and codeStyle so
removal is guarded against already-removed elements.

In `@scripts/eval/sync-baselines.test.ts`:
- Around line 462-464: The helper mkdirSyncRecursive uses a shell call; replace
its body to call the Node fs API instead: use the existing mkdirSync function
(already imported) with the option { recursive: true } inside mkdirSyncRecursive
so it creates the directory tree in a cross-platform, portable way (keep the
function name mkdirSyncRecursive).

In `@scripts/eval/sync-baselines.ts`:
- Around line 218-226: Replace the use of join(targetPath, '..') with
dirname(targetPath) for clarity when creating the parent directory: in the loop
over sourceFiles where targetPath is computed, call mkdir(dirname(targetPath), {
recursive: true }) instead of mkdir(join(targetPath, '..'), { recursive: true
}); keep the rest (rewritten replacement and writeFile(targetPath, rewritten))
unchanged — dirname is already imported so no new imports are required.
- Around line 303-305: The normalizeRepoPath function contains a redundant
second replace; simplify it by replacing both patterns with a single regex that
removes a leading "." or "./". Update normalizeRepoPath to return
value.replace(/^\.\/?/, '') so you only strip a leading dot or dot+slash in one
operation (look for the normalizeRepoPath function to apply this change).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 915b8a56-ca00-4fad-88a8-f13e7c2d29bf

📥 Commits

Reviewing files that changed from the base of the PR and between de01e41 and 5685647.

📒 Files selected for processing (6)
  • scripts/eval/lib/baseline-template-files.ts
  • scripts/eval/lib/projects.ts
  • scripts/eval/lib/result-docs.ts
  • scripts/eval/lib/transcript-types.ts
  • scripts/eval/sync-baselines.test.ts
  • scripts/eval/sync-baselines.ts
✅ Files skipped from review due to trivial changes (1)
  • scripts/eval/lib/transcript-types.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • scripts/eval/lib/projects.ts
  • scripts/eval/lib/result-docs.ts

Comment thread scripts/eval/lib/baseline-template-files.ts
Keep the string-based baseline template source byte-exact while switching it back to readable backtick literals.

Co-authored-by: Codex <noreply@openai.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
scripts/eval/lib/baseline-template-files.ts (1)

674-680: ⚠️ Potential issue | 🟡 Minor

Guard against division by zero when calculating percentage.

If totalMessageTokens is zero (e.g., empty messages array or all zero token counts), the percentage calculation will produce Infinity. Other percentage calculations in this file (lines 807, 842-843) already have this guard.

Proposed fix
       <Turn
         type="prompt"
         title="User Prompt"
         tokenCount={\`\${promptTokenCount.toLocaleString()} tokens\`}
-        percentage={(promptTokenCount / totalMessageTokens) * 100}
+        percentage={totalMessageTokens > 0 ? (promptTokenCount / totalMessageTokens) * 100 : 0}
       >
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/baseline-template-files.ts` around lines 674 - 680, The
percentage expression for the Turn component can divide by zero when
totalMessageTokens is 0; update the JSX where percentage is computed (the Turn
element rendering with props tokenCount and percentage) to guard against
division by zero by checking totalMessageTokens before computing (e.g., use a
conditional or fallback to 0) so percentage uses promptTokenCount and
totalMessageTokens safely.
🧹 Nitpick comments (2)
scripts/eval/lib/baseline-template-files.ts (2)

83-85: Remove local development path from comment.

Same issue as the MDX template - this local path (~/code/github/mcp/eval/templates/result-docs/transcript.tsx) isn't meaningful outside the original development environment.

Suggested fix
 const EVAL_SUPPORT_TRANSCRIPT_TSX = `/*
- * Keep the baseline copies exact to ~/code/github/mcp/eval/templates/result-docs/transcript.tsx.
- * This repo-local template keeps only the minimum lint shims required by Storybook's monorepo.
+ * Transcript component for rendering eval results.
+ * This baseline template keeps only the minimum lint shims required by Storybook's monorepo.
  */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/baseline-template-files.ts` around lines 83 - 85, Update the
comment in baseline-template-files.ts to remove the hard-coded local development
path; replace the string
"~/code/github/mcp/eval/templates/result-docs/transcript.tsx" with a neutral,
repo-relative or generic path (for example
"templates/result-docs/transcript.tsx" or a description like "the repo's
transcript.tsx template") so the comment remains meaningful outside the original
developer's environment.

73-73: Remove local development path from comment.

The comment references a local filesystem path (~/code/github/mcp/eval/templates/result-docs/transcript.tsx) which exposes internal development structure and isn't meaningful in the committed codebase.

Suggested fix
-const EVAL_SUPPORT_TRANSCRIPT_MDX = `{/* Transcript renderer copied directly from ~/code/github/mcp/eval/templates/result-docs/transcript.tsx and transcript.types.ts */}
+const EVAL_SUPPORT_TRANSCRIPT_MDX = `{/* Transcript renderer for eval results */}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/baseline-template-files.ts` at line 73, The string constant
EVAL_SUPPORT_TRANSCRIPT_MDX contains a comment that exposes a local development
path; update the comment inside EVAL_SUPPORT_TRANSCRIPT_MDX to remove the
filesystem path and replace it with a neutral description (e.g., "Transcript
renderer copied from templates/result-docs/transcript") or delete the path
entirely so no local directory structure is committed, ensuring the comment
still documents origin without leaking local paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@scripts/eval/lib/baseline-template-files.ts`:
- Around line 674-680: The percentage expression for the Turn component can
divide by zero when totalMessageTokens is 0; update the JSX where percentage is
computed (the Turn element rendering with props tokenCount and percentage) to
guard against division by zero by checking totalMessageTokens before computing
(e.g., use a conditional or fallback to 0) so percentage uses promptTokenCount
and totalMessageTokens safely.

---

Nitpick comments:
In `@scripts/eval/lib/baseline-template-files.ts`:
- Around line 83-85: Update the comment in baseline-template-files.ts to remove
the hard-coded local development path; replace the string
"~/code/github/mcp/eval/templates/result-docs/transcript.tsx" with a neutral,
repo-relative or generic path (for example
"templates/result-docs/transcript.tsx" or a description like "the repo's
transcript.tsx template") so the comment remains meaningful outside the original
developer's environment.
- Line 73: The string constant EVAL_SUPPORT_TRANSCRIPT_MDX contains a comment
that exposes a local development path; update the comment inside
EVAL_SUPPORT_TRANSCRIPT_MDX to remove the filesystem path and replace it with a
neutral description (e.g., "Transcript renderer copied from
templates/result-docs/transcript") or delete the path entirely so no local
directory structure is committed, ensuring the comment still documents origin
without leaking local paths.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ca5fb2f4-cf24-4324-902a-b709289a3378

📥 Commits

Reviewing files that changed from the base of the PR and between 5685647 and 3e26b59.

📒 Files selected for processing (2)
  • scripts/eval/lib/baseline-template-files.ts
  • scripts/eval/lib/transcript-types.ts
✅ Files skipped from review due to trivial changes (1)
  • scripts/eval/lib/transcript-types.ts

kasperpeulen and others added 3 commits April 15, 2026 00:06
Avoid top-level const declarations in the synced MDX files so Storybook can index them, while keeping empty data.json handling in the consumers.

Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Codex <noreply@openai.com>
- sync-baselines.test.ts: remove unused resolve import and type
  BASELINE_STORYBOOK_FILES as Record to allow dynamic key lookup
- run-batch.test.ts: include excalidraw in expected project list
@kasperpeulen kasperpeulen force-pushed the kasper/eval-worktrees-chromatic-pr branch from 81de85f to 8929a4c Compare April 15, 2026 06:30
…n file check

The script checked for .storybook/main.* before git pull and failed on
fresh clones. Removed that unnecessary check (syncStorybookDir recreates
the directory from scratch) and reuse ensureSourceClone to auto-clone
repos that haven't been cloned yet.
Copy link
Copy Markdown
Contributor

@Sidnioulz Sidnioulz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is big so I can't wrap my head around it without taking more time to review. The only thing I spotted is a potential logic issue in the gain compute function.

Comment thread scripts/eval/lib/baseline-template-files.ts
Comment thread scripts/eval/lib/grade.ts
Comment thread scripts/eval/lib/grade.ts
Comment thread scripts/eval/lib/grade.ts
Comment thread scripts/eval/lib/grade.ts
Comment thread scripts/eval/sync-baselines.ts
Comment thread scripts/build/utils/generate-types.ts Outdated
Comment thread scripts/eval/lib/publish-trial.ts
Comment thread scripts/eval/lib/result-docs.ts
Comment thread scripts/eval/lib/result-docs.ts
Comment thread scripts/eval/lib/run-trial.ts
Comment thread scripts/eval/run-batch.ts
Comment thread scripts/eval/run-batch.ts
Comment thread scripts/eval/run-batch.ts
Comment thread scripts/eval/sync-baselines.ts
Comment thread scripts/package.json
…high effort

- Add formatScorePercent for CLI, PR body, and eval summary (data.json stays 0-1)
- Default run-batch: both agents, opus + gpt-5.4, high effort; update tests
- Refresh scripts/eval README
@yannbf yannbf merged commit 907c256 into project/sb-agentic-setup Apr 15, 2026
124 checks passed
@yannbf yannbf deleted the kasper/eval-worktrees-chromatic-pr branch April 15, 2026 09:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build Internal-facing build tooling & test updates ci:normal Run our default set of CI jobs (choose this for most PRs).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants