diff --git a/.agents/skills/review-pr/SKILL.md b/.agents/skills/review-pr/SKILL.md
new file mode 100644
index 000000000000..253cdfb8c3c0
--- /dev/null
+++ b/.agents/skills/review-pr/SKILL.md
@@ -0,0 +1,439 @@
+---
+name: review-pr
+description: "Generate a scrollable single-page PR review. Use when the user says 'review pr', 'review this PR', 'pr review', or wants to review PR changes in a narrative format."
+allowed-tools: Bash, Read, Write, Edit, Agent, Grep, Glob
+---
+
+# PR Review — Scrollable Single-Page
+
+Generate a scrollable single-page HTML document that reviews a PR as a readable narrative.
+
+**Always generate the page immediately.** Never block on cleanup or fix discussions.
+
+## Principles
+
+The purpose of this page is to help the **human reviewer** understand and review a PR quickly. The page is a reading aid — it presents the code clearly so the reviewer can form their own opinion.
+
+### Optimize for the reviewer's time
+
+The reviewer should be able to:
+- **Skim** the page and grasp what the PR does in 30 seconds (big picture section).
+- **Read** any area and understand what that code does without opening their editor.
+- **Zoom in** to full files or diffs when they want to inspect details.
+
+### Two layers per area
+
+Each area has two layers:
+- **Layer 1 (always visible):** A curated walkthrough — prose explanation with cherry-picked code snippets. Only the parts that matter for understanding.
+- **Layer 2 (collapsed):** Full file contents or diffs in `<details>` blocks. The reviewer expands these to zoom in.
+
+### High-level to low-level
+
+Order areas following the **call graph** from entry points down. The reviewer understands the big picture before details. For example: CLI entry point → orchestration → each pipeline step → helpers → utilities → types.
+
+### Within each area: Explain → Contract → Tests → Implementation
+
+Structure each area's walkthrough in this order:
+
+1. **Explanation** — Plain prose first. What does this module do? Why does it exist? How does it fit into the bigger picture? The reviewer should understand the *purpose* before seeing any code.
+2. **Functions & data structures** — Show function signatures and the key types/interfaces they use. This is the contract — what goes in, what comes out. Show full interface bodies inline where they're first relevant. Don't defer to "see types.ts".
+3. **Tests** — Cherry-pick the test cases that make the behavior concrete. Tests are executable documentation — they turn abstract descriptions into specific examples.
+4. **Implementation** — The interesting parts of *how* it works. Skip boilerplate, show the core logic.
+
+Use narrative `<p>` tags between snippets to guide the reviewer through each transition.
+
+### Flag obvious issues, but don't force opinions
+
+If you notice something clearly wrong (bug, missing error handling, naming mismatch), flag it with a smell-box. If something is notably well done, use a note-box. But don't manufacture opinions — if the code is fine, just present it clearly. The reviewer will decide what matters.
+
+### Cover everything
+
+Every changed file appears somewhere — either in a walkthrough snippet or in a collapsed full-file block.
+
+## Step 1 — Gather PR data
+
+```bash
+gh pr view --json number,title,author,headRefName,baseRefName,body,additions,deletions,changedFiles
+gh pr diff --name-only
+gh pr diff
+```
+
+If a PR number or URL is given as an argument, pass it to `gh pr view <arg>` and `gh pr diff <arg>`.
+
+## Step 2 — Read all changed files
+
+Read the full file content of every changed file with the `Read` tool. Also read the full diff. Classify each file as test, implementation, config, or docs.
+
+## Step 3 — Generate the page
+
+For each area, write two layers:
+
+### Layer 1: Readable walkthrough (always visible)
+
+A curated narrative that mixes prose with **short code snippets**. Structure it in the order defined under Principles:
+
+1. **Explanation** — plain prose describing what this area does, why it exists, and how it fits.
+2. **Functions & data structures** — key function signatures and the types they use. Show the contract.
+3. **Tests** — cherry-picked test cases that make the behavior concrete with specific examples.
+4. **Implementation** — the core logic. Skip boilerplate, show the interesting parts.
+
+Use narrative `<p>` tags between snippets to guide the reviewer through each transition. Add smell-boxes or note-boxes only when something genuinely stands out.
+
+### Layer 2: Full files (always collapsed)
+
+Below the walkthrough, include every file in the area as a collapsed `<details>` block with the complete file content (or diff for modified files). The reader expands these for reference.
+
+First create the output directory:
+
+```bash
+mkdir -p .pr-review/pr-<number>
+```
+
+Write to `.pr-review/pr-<number>/index.html` (relative to the repo root).
+
+**Verify every file from `gh pr diff --name-only` appears in the page.**
+
+### HTML structure
+
+```
+Sticky topbar (nav links)
+Header (title, author, stats)
+Big picture section
+Area 1
+  Readable walkthrough (Explain → Contract → Tests → Implementation)
+  Full files (collapsed)
+Area 2
+  ...
+Supporting changes
+```
+
+### Complete HTML template
+
+```html
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>PR #{{NUMBER}}: {{TITLE}}</title>
+  <link rel="preconnect" href="https://fonts.googleapis.com">
+  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+  <link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Lexend:wght@300;400;500;600;700&family=JetBrains+Mono:wght@400;500&display=swap">
+  <link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.11.1/build/styles/github.min.css" media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
+  <link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.11.1/build/styles/github-dark.min.css" media="(prefers-color-scheme: dark)">
+  <style>
+    :root {
+      --bg: #fff; --fg: #1f2328; --muted: #656d76; --border: #d0d7de;
+      --surface: #f6f8fa; --card: #fff; --card-border: #d0d7de;
+      --add-bg: #dafbe1; --add-fg: #116329; --del-bg: #ffebe9; --del-fg: #82071e;
+      --hunk-bg: #ddf4ff; --hunk-fg: #0969da;
+      --note-bg: #ddf4ff; --note-border: #0969da;
+      --smell-bg: #fff8c5; --smell-border: #9a6700;
+      --green: #1a7f37; --blue: #0969da; --amber: #9a6700;
+      --toc-hover: #eaeef2; --shadow: 0 1px 3px rgba(0,0,0,0.06);
+    }
+    @media (prefers-color-scheme: dark) {
+      :root {
+        --bg: #0d1117; --fg: #e6edf3; --muted: #8b949e; --border: #30363d;
+        --surface: #161b22; --card: #161b22; --card-border: #30363d;
+        --add-bg: #1a2e1a; --add-fg: #3fb950; --del-bg: #2e1a1a; --del-fg: #f85149;
+        --hunk-bg: #0d1f2d; --hunk-fg: #58a6ff;
+        --note-bg: #0d1f2d; --note-border: #58a6ff;
+        --smell-bg: #2a1f0a; --smell-border: #d29922;
+        --green: #3fb950; --blue: #58a6ff; --amber: #d29922;
+        --toc-hover: #21262d; --shadow: 0 1px 3px rgba(0,0,0,0.3);
+      }
+    }
+    * { margin: 0; padding: 0; box-sizing: border-box; }
+    html { scroll-behavior: smooth; scroll-padding-top: 56px; -webkit-text-size-adjust: 100%; text-size-adjust: 100%; }
+    body { background: var(--bg); color: var(--fg); font-family: 'Lexend', sans-serif; font-size: 15px; line-height: 1.6; }
+    .page { max-width: 960px; margin: 0 auto; padding: 0 2px 100px; }
+    .topbar { position: sticky; top: 0; z-index: 100; background: var(--bg); border-bottom: 1px solid var(--border); padding: 10px 0; }
+    .topbar-inner { max-width: 960px; margin: 0 auto; padding: 0 4px; display: flex; align-items: center; gap: 10px; overflow-x: auto; }
+    .topbar a { color: var(--muted); text-decoration: none; font-size: 13px; font-weight: 500; white-space: nowrap; padding: 3px 7px; border-radius: 6px; }
+    .topbar a:hover { background: var(--toc-hover); color: var(--fg); }
+    .topbar .pr-tag { color: var(--fg); font-weight: 600; font-size: 14px; }
+    .header { padding: 40px 0 28px; border-bottom: 1px solid var(--border); margin-bottom: 36px; }
+    .header h1 { font-size: 26px; font-weight: 700; margin-bottom: 6px; }
+    .header .meta { color: var(--muted); font-size: 14px; }
+    .header .stats { display: flex; gap: 20px; margin-top: 10px; font-size: 14px; font-weight: 500; }
+    .header .stats .add { color: var(--green); }
+    .header .stats .del { color: var(--del-fg); }
+    .header .stats .files { color: var(--blue); }
+    .section { margin-bottom: 44px; }
+    .section-head { font-size: 21px; font-weight: 700; margin-bottom: 6px; padding-top: 12px; }
+    .section-desc { color: var(--muted); margin-bottom: 20px; font-size: 14px; line-height: 1.6; }
+    .section-desc code { font-family: 'JetBrains Mono', monospace; font-size: 12px; background: var(--surface); padding: 2px 5px; border-radius: 3px; }
+    .file-card { border: 1px solid var(--card-border); border-radius: 8px; margin-bottom: 18px; overflow: hidden; background: var(--card); box-shadow: var(--shadow); }
+    .file-card-header { display: flex; align-items: center; gap: 6px; padding: 7px 10px; background: var(--surface); border-bottom: 1px solid var(--card-border); font-family: 'JetBrains Mono', monospace; font-size: 11px; color: var(--fg); font-weight: 500; flex-wrap: wrap; }
+    .badge { font-size: 10px; font-weight: 600; padding: 2px 7px; border-radius: 10px; text-transform: uppercase; letter-spacing: 0.5px; }
+    .badge-test { background: var(--add-bg); color: var(--add-fg); }
+    .badge-impl { background: var(--hunk-bg); color: var(--hunk-fg); }
+    .badge-config { background: var(--smell-bg); color: var(--amber); }
+    .badge-new { background: var(--add-bg); color: var(--add-fg); }
+    .badge-modified { background: var(--smell-bg); color: var(--amber); }
+    .file-card pre { margin: 0; border-radius: 0; }
+    .file-card pre code, .file-card pre code.hljs { display: block; padding: 10px; overflow-x: auto; font-family: 'JetBrains Mono', monospace !important; font-size: 13px !important; line-height: 1.4; background: var(--surface) !important; }
+    @media (hover: none), (pointer: coarse) {
+      .file-card pre code, .file-card pre code.hljs { font-size: 12px !important; padding: 8px 6px; }
+    }
+    .narrative { padding: 12px 16px; font-size: 13.5px; line-height: 1.6; }
+    .narrative code { font-family: 'JetBrains Mono', monospace; font-size: 12px; background: var(--surface); padding: 1px 5px; border-radius: 3px; }
+    .note-box { background: var(--note-bg); border-left: 3px solid var(--note-border); padding: 10px 14px; margin: 0 16px 12px; border-radius: 4px; font-size: 13px; }
+    .smell-box { background: var(--smell-bg); border-left: 3px solid var(--smell-border); padding: 10px 14px; margin: 0 16px 12px; border-radius: 4px; font-size: 13px; }
+    details > summary { cursor: pointer; padding: 9px 14px; font-family: 'JetBrains Mono', monospace; font-size: 12px; color: var(--muted); background: var(--surface); border-top: 1px solid var(--card-border); user-select: none; }
+    details > summary:hover { color: var(--fg); }
+    details[open] > summary { border-bottom: 1px solid var(--card-border); }
+    .diff-line-add { display: block; background: var(--add-bg); margin: 0 -10px; padding: 0 10px; }
+    .diff-line-del { display: block; background: var(--del-bg); margin: 0 -10px; padding: 0 10px; }
+    .area-divider { border: none; border-top: 2px solid var(--border); margin: 48px 0 40px; }
+    @media (max-width: 768px), (max-height: 500px) {
+      body { font-size: 14px; }
+      .page { padding: 0 2px 60px; }
+      .topbar-inner { padding: 0 2px; gap: 4px; }
+      .topbar a { font-size: 11px; padding: 2px 5px; }
+      .header { padding: 20px 0 16px; margin-bottom: 20px; }
+      .header h1 { font-size: 18px; }
+      .section-head { font-size: 17px; }
+      .section-desc { font-size: 13px; }
+      .section { margin-bottom: 28px; }
+      .area-divider { margin: 32px 0 28px; }
+      .narrative { padding: 8px 10px; font-size: 12.5px; }
+      .note-box, .smell-box { margin: 0 6px 8px; padding: 8px 10px; font-size: 12px; }
+      .file-card-header { padding: 6px 8px; font-size: 10.5px; }
+      details > summary { padding: 6px 8px; font-size: 10.5px; }
+    }
+  </style>
+</head>
+<body>
+
+<div class="topbar"><div class="topbar-inner">
+  <span class="pr-tag">#{{NUMBER}}</span>
+  <a href="#big-picture">Overview</a>
+  <!-- one <a> per area -->
+</div></div>
+
+<div class="page">
+
+<div class="header">
+  <h1>{{TITLE}}</h1>
+  <div class="meta">by {{AUTHOR}} &middot; {{BRANCH}} &rarr; {{BASE}}</div>
+  <div class="stats">
+    <span class="files">{{FILES}} files</span>
+    <span class="add">+{{ADDITIONS}}</span>
+    <span class="del">&minus;{{DELETIONS}}</span>
+  </div>
+</div>
+
+<!-- Big picture -->
+<div class="section" id="big-picture">
+  <h2 class="section-head">What this PR does</h2>
+  <p class="section-desc">{{SUMMARY}}</p>
+</div>
+<hr class="area-divider">
+
+<!-- Repeat per area -->
+<div class="section" id="area-{{id}}">
+  <h2 class="section-head">{{N}}. {{Area Name}}</h2>
+  <p class="section-desc">{{What this area does}}</p>
+
+  <!-- Layer 1: readable walkthrough with curated snippets -->
+  <!-- Layer 2: full files collapsed -->
+</div>
+<hr class="area-divider">
+
+</div>
+
+<script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.11.1/build/highlight.min.js"></script>
+<script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.11.1/build/languages/typescript.min.js"></script>
+<script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.11.1/build/languages/json.min.js"></script>
+<script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.11.1/build/languages/markdown.min.js"></script>
+<script>
+hljs.highlightAll();
+// Post-process: apply line-level diff backgrounds on top of syntax highlighting
+document.querySelectorAll('code[data-diff]').forEach(block => {
+  block.innerHTML = block.innerHTML.split('\n').map(line => {
+    const stripped = line.replace(/<[^>]*>/g, '');
+    if (stripped.startsWith('+')) return '<span class="diff-line-add">' + line + '</span>';
+    if (stripped.startsWith('-')) return '<span class="diff-line-del">' + line + '</span>';
+    return line;
+  }).join('\n');
+});
+</script>
+</body>
+</html>
+```
+
+### Building blocks
+
+**Layer 1 — Readable walkthrough snippet** (curated excerpt with prose):
+
+Start with explanation, then show the contract (function + types), then tests, then implementation. Show full interface bodies inline — not just names:
+
+```html
+<div class="file-card">
+  <div class="narrative">
+    <p><code>processStory</code> is the core rendering pipeline. It takes a story config, prepares the
+    rendering context, mounts the component, and returns a result with status and timing.</p>
+    <p>The function signature and the types it uses:</p>
+  </div>
+  <pre><code class="language-typescript">export async function processStory(config: StoryConfig): Promise&lt;StoryResult&gt;
+
+export interface StoryConfig {
+  id: string;
+  title: string;
+  component: ComponentType;
+  args: Record&lt;string, unknown&gt;;
+  parameters: Parameters;
+}
+
+export interface StoryResult {
+  status: 'success' | 'error';
+  rendered: boolean;
+  duration: number;
+  errors: string[];
+}</code></pre>
+  <div class="narrative">
+    <p>The happy-path test shows the expected flow concretely:</p>
+  </div>
+  <pre><code class="language-typescript">const result = await processStory(baseConfig);
+expect(result.status).toBe('success');
+expect(result.rendered).toBe(true);</code></pre>
+  <div class="narrative">
+    <p>The implementation is a sequential pipeline — rendering depends on preparation:</p>
+  </div>
+  <pre><code class="language-typescript">const context = await prepare(config);
+const canvas = await render(context);
+return summarize(canvas, config);</code></pre>
+</div>
+```
+
+**Layer 2 — Full file (collapsed, for new files):**
+```html
+<div class="file-card">
+  <div class="file-card-header">
+    <span class="badge badge-impl">impl</span>
+    <span class="badge badge-new">new</span>
+    path/to/file.ts
+  </div>
+  <details>
+    <summary>Full file ({{N}} lines)</summary>
+    <pre><code class="language-typescript">{{FULL FILE CONTENT, HTML-ESCAPED}}</code></pre>
+  </details>
+</div>
+```
+
+**Layer 2 — Full file (collapsed, for modified files with diff):**
+
+Use the `language-typescript` class together with the `data-diff` attribute: this gives TypeScript syntax highlighting plus line-level add/remove backgrounds via the post-processing script. Lines starting with `+` get a green background; lines starting with `-` get a red one.
+
+```html
+<div class="file-card">
+  <div class="file-card-header">
+    <span class="badge badge-modified">modified</span>
+    path/to/file.ts
+  </div>
+  <details>
+    <summary>Diff</summary>
+    <pre><code class="language-typescript" data-diff>-old line
++new line</code></pre>
+  </details>
+</div>
+```
+
+**Supporting change — no code needed:**
+```html
+<div class="file-card">
+  <div class="file-card-header">
+    <span class="badge badge-config">config</span>
+    <span class="badge badge-modified">modified</span>
+    yarn.lock
+  </div>
+  <div class="narrative"><p>Lockfile updated for new dependencies.</p></div>
+</div>
+```
+
+**Inline issue:**
+```html
+<div class="smell-box">No unit tests for this file.</div>
+```
+
+**Positive note:**
+```html
+<div class="note-box">These test names read like a specification — good documentation.</div>
+```
+
+
+### Badge reference
+
+| Badge | Class | Use for |
+|-------|-------|---------|
+| `test` | `badge-test` | Test files |
+| `impl` | `badge-impl` | Implementation files |
+| `config` | `badge-config` | Config, docs, prompts, lockfiles |
+| `new` | `badge-new` | New files (combine with test/impl/config) |
+| `modified` | `badge-modified` | Modified files |
+
+### Syntax highlighting
+
+| Class | Use for |
+|-------|---------|
+| `language-typescript` | `.ts`, `.tsx`, `.js`, `.jsx` (new files) |
+| `language-typescript` + `data-diff` attribute | Modified file diffs — gets TS highlighting plus line-level add/remove backgrounds |
+| `language-json` | `.json` files |
+| `language-markdown` | `.md` files |
+
+**Important:** Do NOT use `language-diff` — it only does `+`/`-` coloring without syntax highlighting. Instead use `language-typescript` with the `data-diff` attribute for diffs. The post-processing script handles line backgrounds.
+
+### HTML escaping
+
+All code inside `<code>` blocks must be escaped: `&` → `&amp;`, `<` → `&lt;`, `>` → `&gt;`.
+
+## Step 4 — Serve the page
+
+Kill any process already listening on port 3000, write a static server, then start it:
+
+```bash
+lsof -ti:3000 | xargs kill -9 2>/dev/null || true
+```
+
+Write to `.pr-review/pr-<number>/server.mjs`:
+
+```javascript
+import { createServer } from 'node:http';
+import { readFileSync } from 'node:fs';
+import { join, extname } from 'node:path';
+
+const dir = import.meta.dirname; // real filesystem path; URL.pathname percent-encodes spaces and is malformed on Windows
+const port = 3000;
+
+createServer((req, res) => {
+  try {
+    const filePath = join(dir, req.url === '/' ? 'index.html' : req.url);
+    const content = readFileSync(filePath);
+    const ext = extname(filePath);
+    const types = {
+      '.html': 'text/html', '.js': 'text/javascript',
+      '.css': 'text/css', '.json': 'application/json',
+    };
+    res.writeHead(200, { 'Content-Type': types[ext] || 'application/octet-stream' });
+    res.end(content);
+  } catch {
+    res.writeHead(404).end('Not found');
+  }
+}).listen(port, () => {
+  console.log(`\n  PR Review: http://localhost:${port}\n`);
+});
+```
+
+```bash
+node .pr-review/pr-<number>/server.mjs &   # run_in_background: true
+open http://localhost:3000   # macOS; use xdg-open on Linux
+```
+
+## Step 5 — Iterate
+
+Tell the user:
+- The page is live at http://localhost:3000
+- They can ask to update specific sections
+- Refresh the browser after updates
diff --git a/.circleci/config.yml b/.circleci/config.yml
index 59dd245f7676..d2798c319911 100644
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -32,7 +32,7 @@ jobs:
   generate-and-run-config:
     executor: 
       name: node/default
-      resource_class: small
+      resource_class: large
     steps:
       - node/install:
           install-yarn: true
diff --git a/.gitignore b/.gitignore
index 43107a4f3e07..ecb034fa9189 100644
--- a/.gitignore
+++ b/.gitignore
@@ -79,4 +79,11 @@ CLAUDE.local.md
 .cursor/mcp.json
 .vscode/mcp.json
 .mcp.json
-.nx/polygraph
\ No newline at end of file
+.nx/polygraph
+
+# Eval system
+scripts/eval/.cache
+scripts/eval/results
+
+# review-pr skill output
+.pr-review
diff --git a/AGENTS.md b/AGENTS.md
index 7c99c9041a9e..a538c6cdb6c0 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -9,10 +9,11 @@ This file is the canonical instruction source for coding agents. Files like `CLA
 Storybook is a large TypeScript monorepo. The git root is the repo root, the main code lives in `code/`, and build tooling lives in `scripts/`. The default branch is `next`.
 
 - **Base branch**: `next` (all PRs should target `next`, not `main`)
-- **Node.js**: `22.21.1` (see `.nvmrc`)
+- **Node.js**: `22.22.1` (see `.nvmrc`) — supports `.ts` natively via type stripping (no loader needed)
 - **Package Manager**: Yarn Berry
 - **Task orchestration**: NX plus the custom `yarn task` runner
 - **CI environment**: Linux and Windows
+- **TS execution**: Migrating from `jiti` to native `node` for running `.ts` files. New scripts should use `node ./path/file.ts` with explicit `.ts` import extensions (enabled by `allowImportingTsExtensions` in tsconfig). Legacy scripts still use `jiti` but should be migrated over time.
 
 ## Repository Structure
 
@@ -234,7 +235,7 @@ When writing tests:
 
 After changing files:
 
-1. Format with `cd code && oxfmt`
+1. Format with `yarn fmt:write` (run from the repo root)
 2. Lint with `yarn --cwd code lint:js:cmd <file-relative-to-code-folder> --fix` or `cd code && yarn lint:js:cmd <file-relative-to-code-folder>`
 3. Run relevant tests before submitting a PR
 
diff --git a/code/core/src/core-server/index.ts b/code/core/src/core-server/index.ts
index f475fa6166ca..b1669cb685c1 100644
--- a/code/core/src/core-server/index.ts
+++ b/code/core/src/core-server/index.ts
@@ -32,3 +32,6 @@ export {
 } from './stores/test-provider';
 
 export { getServerPort } from './utils/server-address';
+
+export { getComponentCandidates } from './utils/ghost-stories/get-candidates';
+export { runGhostStories } from './utils/ghost-stories/run-story-tests';
diff --git a/code/core/src/core-server/server-channel/ghost-stories-channel.ts b/code/core/src/core-server/server-channel/ghost-stories-channel.ts
index 2b334865556e..3076d7b293d9 100644
--- a/code/core/src/core-server/server-channel/ghost-stories-channel.ts
+++ b/code/core/src/core-server/server-channel/ghost-stories-channel.ts
@@ -9,7 +9,7 @@ import {
 import type { CoreConfig, Options } from 'storybook/internal/types';
 
 import { getComponentCandidates } from '../utils/ghost-stories/get-candidates';
-import { runStoryTests } from '../utils/ghost-stories/run-story-tests';
+import { runGhostStories } from '../utils/ghost-stories/run-story-tests';
 
 export function initGhostStoriesChannel(
   channel: Channel,
@@ -91,7 +91,7 @@ export function initGhostStoriesChannel(
 
       // Phase 2: Run tests on those candidates Vitest. The components will be transformed directly to tests
       // If they pass, it means that creating a story file for them would succeed.
-      const testRunResult = await runStoryTests(candidatesResult.candidates);
+      const testRunResult = await runGhostStories(candidatesResult.candidates);
       stats.totalRunDuration = Date.now() - ghostRunStart;
       stats.testRunDuration = testRunResult.duration;
       if (testRunResult.runError) {
diff --git a/code/core/src/core-server/utils/ghost-stories/get-candidates.ts b/code/core/src/core-server/utils/ghost-stories/get-candidates.ts
index 8c7d7a113cb3..661196a3ebea 100644
--- a/code/core/src/core-server/utils/ghost-stories/get-candidates.ts
+++ b/code/core/src/core-server/utils/ghost-stories/get-candidates.ts
@@ -1,12 +1,11 @@
 import { readFile } from 'node:fs/promises';
 
 import { babelParse, traverse } from 'storybook/internal/babel';
-import { logger } from 'storybook/internal/node-logger';
 
 // eslint-disable-next-line depend/ban-dependencies
 import { glob } from 'glob';
 
-import { getComponentComplexity } from './component-analyzer';
+import { getComponentComplexity } from './component-analyzer.ts';
 
 // A valid candidate includes React code and at least one export
 function isValidCandidate(source: string): boolean {
@@ -128,9 +127,12 @@ export async function getCandidatesForStorybook(
 export async function getComponentCandidates({
   sampleSize = 20,
   globPattern = '**/*.{tsx,jsx}',
+  cwd = process.cwd(),
 }: {
   sampleSize?: number;
   globPattern?: string;
+  /** Working directory for glob. Defaults to process.cwd(). */
+  cwd?: string;
 } = {}): Promise<{
   candidates: string[];
   error?: string;
@@ -145,7 +147,7 @@ export async function getComponentCandidates({
 
     // Find files matching the glob pattern
     files = await glob(globPattern, {
-      cwd: process.cwd(),
+      cwd,
       absolute: true,
       ignore: [
         '**/node_modules/**',
diff --git a/code/core/src/core-server/utils/ghost-stories/parse-vitest-report.ts b/code/core/src/core-server/utils/ghost-stories/parse-vitest-report.ts
index 8c783abdccbe..e0bd41cc53a6 100644
--- a/code/core/src/core-server/utils/ghost-stories/parse-vitest-report.ts
+++ b/code/core/src/core-server/utils/ghost-stories/parse-vitest-report.ts
@@ -1,6 +1,10 @@
-import type { ErrorCategory } from '../../../shared/utils/categorize-render-errors';
-import { categorizeError } from '../../../shared/utils/categorize-render-errors';
-import { type ErrorCategorizationResult, type StoryTestResult, type TestRunSummary } from './types';
+import type { ErrorCategory } from '../../../shared/utils/categorize-render-errors.ts';
+import { categorizeError } from '../../../shared/utils/categorize-render-errors.ts';
+import {
+  type ErrorCategorizationResult,
+  type StoryTestResult,
+  type TestRunSummary,
+} from './types.ts';
 
 /**
  * For a given list of test results:
diff --git a/code/core/src/core-server/utils/ghost-stories/run-story-tests.ts b/code/core/src/core-server/utils/ghost-stories/run-story-tests.ts
index 42ab270ee58a..c934eb385ecd 100644
--- a/code/core/src/core-server/utils/ghost-stories/run-story-tests.ts
+++ b/code/core/src/core-server/utils/ghost-stories/run-story-tests.ts
@@ -5,10 +5,21 @@ import { executeCommand, resolvePathInStorybookCache } from 'storybook/internal/
 
 import { join } from 'pathe';
 
-import { parseVitestResults } from './parse-vitest-report';
-import type { TestRunSummary } from './types';
+import { parseVitestResults } from './parse-vitest-report.ts';
+import type { TestRunSummary } from './types.ts';
 
-export async function runStoryTests(componentFilePaths: string[]): Promise<TestRunSummary> {
+/**
+ * Run ghost stories: execute vitest on component file paths to auto-generate
+ * and test stories that don't exist on disk.
+ *
+ * @param componentFilePaths - Absolute paths to component files to test.
+ * @param options.cwd - Working directory for vitest. Defaults to process.cwd().
+ */
+export async function runGhostStories(
+  componentFilePaths: string[],
+  options?: { cwd?: string }
+): Promise<TestRunSummary> {
+  const cwd = options?.cwd;
   try {
     // Create the cache directory for story discovery tests
     const cacheDir = resolvePathInStorybookCache('ghost-stories-tests');
@@ -34,6 +45,7 @@ export async function runStoryTests(componentFilePaths: string[]): Promise<TestR
           `--outputFile=${outputFile}`,
           ...componentFilePaths,
         ],
+        cwd,
         stdio: 'pipe',
         env: {
           STORYBOOK_COMPONENT_PATHS: componentFilePaths.join(';'),
diff --git a/code/core/src/shared/utils/categorize-render-errors.ts b/code/core/src/shared/utils/categorize-render-errors.ts
index 68e9653139cf..2bf36b1086a3 100644
--- a/code/core/src/shared/utils/categorize-render-errors.ts
+++ b/code/core/src/shared/utils/categorize-render-errors.ts
@@ -3,7 +3,7 @@ import {
   isRouterPackage,
   isStateManagementPackage,
   isStylingPackage,
-} from './ecosystem-identifier';
+} from './ecosystem-identifier.ts';
 
 export const ERROR_CATEGORIES = {
   MISSING_PROVIDER: 'MISSING_PROVIDER',
diff --git a/code/tsconfig.json b/code/tsconfig.json
index a0979540cdc8..870835c74b2a 100644
--- a/code/tsconfig.json
+++ b/code/tsconfig.json
@@ -13,6 +13,8 @@
     "lib": ["dom", "dom.iterable", "esnext"],
     "module": "Preserve",
     "moduleResolution": "bundler",
+    // Required for explicit .ts import extensions — migrating toward native Node TS execution
+    "allowImportingTsExtensions": true,
     "noImplicitAny": true,
     "noUnusedLocals": false,
     "skipLibCheck": true,
diff --git a/scripts/ci/common-jobs.ts b/scripts/ci/common-jobs.ts
index c47d03d26a4e..2b8a1f0c0f23 100644
--- a/scripts/ci/common-jobs.ts
+++ b/scripts/ci/common-jobs.ts
@@ -67,7 +67,7 @@ export const build_linux = defineJob('Build (linux)', (workflowName) => ({
 export const fmt = defineJob('Format check', () => ({
   executor: {
     name: 'sb_node_22_classic',
-    class: 'medium+',
+    class: 'xlarge',
   },
   steps: [
     git.checkout(),
diff --git a/scripts/eval/eval.ts b/scripts/eval/eval.ts
new file mode 100644
index 000000000000..048e5efb75ca
--- /dev/null
+++ b/scripts/eval/eval.ts
@@ -0,0 +1,201 @@
+/**
+ * Eval harness entry point.
+ *
+ * Runs with `node ./eval/eval.ts` (no jiti). Node 22+ supports .ts natively
+ * via type stripping. Import specifiers use explicit .ts extensions.
+ *
+ * Usage:
+ *   node eval/eval.ts -p mealdrop                       # claude defaults
+ *   node eval/eval.ts -p mealdrop -a codex             # codex defaults
+ *   node eval/eval.ts -p mealdrop -m gpt-5.4           # codex (inferred)
+ *   node eval/eval.ts -p mealdrop -a claude -e max     # claude with max effort
+ *   node eval/eval.ts -p mealdrop --manual             # prepare only, print instructions
+ *   node eval/eval.ts --list-projects
+ *   node eval/eval.ts --list-models
+ *   node eval/eval.ts --list-prompts
+ */
+import { writeFile } from 'node:fs/promises';
+import { join } from 'node:path';
+import { parseArgs } from 'node:util';
+import { z } from 'zod';
+import pc from 'picocolors';
+import {
+  AGENT_IDS,
+  AGENTS,
+  CLAUDE_EFFORTS,
+  CLAUDE_MODELS,
+  CODEX_EFFORTS,
+  CODEX_MODELS,
+  type AgentId,
+  type AgentVariant,
+} from './lib/agents/config.ts';
+import { prepareTrial } from './lib/prepare-trial.ts';
+import { PROJECTS } from './lib/projects.ts';
+import { runTrial, type TrialConfig } from './lib/run-trial.ts';
+import {
+  captureEnvironment,
+  createLogger,
+  formatCost,
+  formatDuration,
+  generateTrialId,
+  listPrompts,
+  loadPrompt,
+} from './lib/utils.ts';
+
+const PROJECT_NAMES = PROJECTS.map((p) => p.name) as [string, ...string[]];
+
+const base = {
+  project: z.enum(PROJECT_NAMES).optional(),
+  prompt: z.string().default('setup'),
+  verbose: z.boolean().default(false),
+  manual: z.boolean().default(false),
+  listProjects: z.boolean().default(false),
+  listModels: z.boolean().default(false),
+  listPrompts: z.boolean().default(false),
+};
+
+const argsSchema = z.discriminatedUnion('agent', [
+  z.object({
+    ...base,
+    agent: z.literal('claude'),
+    model: z.enum(CLAUDE_MODELS).default(AGENTS.claude.defaultModel),
+    effort: z.enum(CLAUDE_EFFORTS).default(AGENTS.claude.defaultEffort),
+  }),
+  z.object({
+    ...base,
+    agent: z.literal('codex'),
+    model: z.enum(CODEX_MODELS).default(AGENTS.codex.defaultModel),
+    effort: z.enum(CODEX_EFFORTS).default(AGENTS.codex.defaultEffort),
+  }),
+]);
+
+const { values } = parseArgs({
+  options: {
+    project: { type: 'string', short: 'p' },
+    agent: { type: 'string', short: 'a' },
+    model: { type: 'string', short: 'm' },
+    effort: { type: 'string', short: 'e' },
+    prompt: { type: 'string' },
+    verbose: { type: 'boolean', short: 'v' },
+    manual: { type: 'boolean' },
+    'list-projects': { type: 'boolean' },
+    'list-models': { type: 'boolean' },
+    'list-prompts': { type: 'boolean' },
+  },
+  args: process.argv.slice(2),
+  strict: true,
+});
+
+// Resolve the discriminator: explicit --agent, inferred from --model, or default to claude.
+const agent = values.agent ?? (values.model ? inferAgent(values.model) : 'claude');
+
+const parsed = argsSchema.safeParse({
+  ...values,
+  agent,
+  listProjects: values['list-projects'],
+  listModels: values['list-models'],
+  listPrompts: values['list-prompts'],
+});
+
+if (!parsed.success) {
+  for (const issue of parsed.error.issues) {
+    console.error(pc.red(`  ${issue.path.join('.')}: ${issue.message}`));
+  }
+  process.exit(1);
+}
+
+const args = parsed.data;
+const logger = createLogger();
+
+if (args.listProjects) {
+  for (const project of PROJECTS) {
+    logger.log(`  ${pc.bold(project.name)} — ${project.description}`);
+  }
+  process.exit(0);
+}
+if (args.listModels) {
+  for (const [name, { models }] of Object.entries(AGENTS)) {
+    logger.log(`\n  ${pc.bold(name)}`);
+    for (const model of models) logger.log(`    ${model}`);
+  }
+  process.exit(0);
+}
+if (args.listPrompts) {
+  for (const name of listPrompts()) logger.log(`  ${pc.bold(name)}`);
+  process.exit(0);
+}
+
+if (!args.project) {
+  logger.log(pc.red(`Specify a project with -p. Available: ${PROJECT_NAMES.join(', ')}`));
+  process.exit(1);
+}
+const project = PROJECTS.find((p) => p.name === args.project)!;
+const variant = toVariant(args);
+
+logger.log(pc.bold(`\nStorybook Setup Eval — ${project.name}`));
+logger.log(
+  `Agent: ${variant.agent} | Model: ${variant.model} | Effort: ${variant.effort} | Prompt: ${args.prompt}\n`
+);
+
+if (args.manual) {
+  const trialId = generateTrialId(project.name, variant.agent, variant.model, args.prompt);
+  const workspace = await prepareTrial(project, trialId, logger);
+  await captureEnvironment(workspace.resultsDir);
+
+  const prompt = loadPrompt(args.prompt);
+  const promptPath = join(workspace.resultsDir, 'prompt.md');
+  await writeFile(promptPath, prompt);
+
+  const cliCommand = buildManualCommand(variant, promptPath);
+
+  logger.log(pc.bold('\n── Manual mode ──'));
+  logger.log(`\n  Trial dir:    ${pc.cyan(workspace.trialDir)}`);
+  logger.log(`  Project dir:  ${pc.cyan(workspace.projectPath)}`);
+  logger.log(`  Prompt file:  ${pc.cyan(promptPath)}`);
+  logger.log(pc.bold('\nRun the agent yourself:\n'));
+  logger.log(`  ${pc.green('cd')} ${workspace.projectPath}`);
+  logger.log(`  ${pc.green(cliCommand)}\n`);
+} else {
+  const result = await runTrial(
+    { project, variant, prompt: args.prompt, verbose: args.verbose } satisfies TrialConfig,
+    logger
+  );
+
+  const ghost = result.grade.ghostStories;
+  const ghostStr = ghost
+    ? `${ghost.passed}/${ghost.total} (${Math.round(ghost.successRate * 100)}%)`
+    : '-';
+
+  logger.log(pc.bold('\nResult'));
+  logger.log(`  Build:   ${result.grade.buildSuccess ? pc.green('PASS') : pc.red('FAIL')}`);
+  logger.log(`  Ghost:   ${ghostStr}`);
+  logger.log(`  TS Err:  ${result.grade.typeCheckErrors}`);
+  logger.log(`  Score:   ${result.score.score}`);
+  logger.log(`  Cost:    ${formatCost(result.execution.cost)}`);
+  logger.log(`  Time:    ${formatDuration(result.execution.duration)}`);
+  logger.log(`  Turns:   ${result.execution.turns}`);
+
+  logger.log('\nDone.');
+}
+
+function inferAgent(model: string): AgentId {
+  for (const id of AGENT_IDS) {
+    if (AGENTS[id].models.some((candidate) => candidate === model)) return id;
+  }
+  throw new Error(`No agent found for model: ${model}`);
+}
+
+function buildManualCommand(variant: AgentVariant, promptPath: string): string {
+  const promptArg = `"$(cat "${promptPath}")"`;
+  if (variant.agent === 'claude') {
+    const sdkModel = AGENTS.claude.sdkModelIds[variant.model] ?? variant.model;
+    return `claude --model ${sdkModel} ${promptArg}`;
+  }
+  return `codex --model ${variant.model} --reasoning-effort ${variant.effort} ${promptArg}`;
+}
+
+function toVariant(args: z.infer<typeof argsSchema>): AgentVariant {
+  return args.agent === 'claude'
+    ? { agent: 'claude', model: args.model, effort: args.effort }
+    : { agent: 'codex', model: args.model, effort: args.effort };
+}
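Reviewer note on the agent resolution in eval.ts: the precedence is explicit `--agent`, then an agent inferred from `--model`, then `claude`. The sketch below restates that precedence as a standalone function. It is a minimal illustration, not the PR's code: `sketchModels`, `sketchInferAgent`, and `sketchResolveAgent` are hypothetical names, and unlike `inferAgent` the inference helper returns `undefined` for an unknown model instead of throwing.

```typescript
type SketchAgentId = 'claude' | 'codex';

// Model lists copied from config.ts in this diff.
const sketchModels: Record<SketchAgentId, readonly string[]> = {
  claude: ['sonnet-4.6', 'opus-4.6', 'haiku-4.5'],
  codex: ['gpt-5.4'],
};

// Hypothetical variant of inferAgent: returns undefined rather than throwing.
function sketchInferAgent(model: string): SketchAgentId | undefined {
  const ids = Object.keys(sketchModels) as SketchAgentId[];
  return ids.find((id) => sketchModels[id].includes(model));
}

// Precedence: explicit agent, then agent inferred from the model, then 'claude'.
function sketchResolveAgent(flags: { agent?: string; model?: string }): string | undefined {
  if (flags.agent !== undefined) return flags.agent;
  if (flags.model !== undefined) return sketchInferAgent(flags.model);
  return 'claude';
}

console.log(sketchResolveAgent({})); // claude
console.log(sketchResolveAgent({ model: 'gpt-5.4' })); // codex
console.log(sketchResolveAgent({ agent: 'codex', model: 'sonnet-4.6' })); // codex
console.log(sketchResolveAgent({ model: 'unknown-model' })); // undefined
```

In the third call the explicit agent wins even though the model belongs to Claude; in eval.ts that combination would then be rejected by the `codex` branch of `argsSchema`, since `model` must be one of `CODEX_MODELS`.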
diff --git a/scripts/eval/lib/agents/claude-code.ts b/scripts/eval/lib/agents/claude-code.ts
new file mode 100644
index 000000000000..1cd03f43b177
--- /dev/null
+++ b/scripts/eval/lib/agents/claude-code.ts
@@ -0,0 +1,142 @@
+import type { SDKMessage } from '@anthropic-ai/claude-agent-sdk';
+import { query } from '@anthropic-ai/claude-agent-sdk';
+import { writeFile } from 'node:fs/promises';
+import { join } from 'node:path';
+import { AGENTS, resolveClaudeSdkModel, type AgentDriver, type Execution } from './config.ts';
+import type { Logger } from '../utils.ts';
+
+export const claudeAgent: AgentDriver = {
+  name: 'claude',
+
+  async execute({ prompt, projectPath, variant, resultsDir, logger }): Promise<Execution> {
+    if (variant.agent !== 'claude') {
+      throw new Error(`Claude driver received unsupported variant: ${variant.agent}`);
+    }
+
+    const startTime = Date.now();
+    const settings = AGENTS.claude.execution;
+    const { model } = variant;
+    const effort = variant.effort as 'low' | 'medium' | 'high' | 'max';
+    const sdkModel = resolveClaudeSdkModel(model);
+
+    let cost: number | undefined;
+    let turns = 0;
+    let durationApi: number | undefined;
+    const messages: unknown[] = [];
+
+    try {
+      for await (const message of query({
+        prompt,
+        options: {
+          model: sdkModel,
+          cwd: projectPath,
+          allowedTools: [...settings.allowedTools],
+          maxTurns: settings.maxTurns,
+          effort,
+          debug: settings.debug,
+          systemPrompt: settings.systemPrompt,
+        },
+      })) {
+        logMessage(message, logger);
+        messages.push(message);
+
+        if (message.type === 'result' && message.subtype === 'success') {
+          cost = message.total_cost_usd as number | undefined;
+          turns = typeof message.num_turns === 'number' ? message.num_turns : 0;
+          durationApi =
+            typeof message.duration_api_ms === 'number'
+              ? message.duration_api_ms / 1000
+              : undefined;
+        }
+      }
+    } finally {
+      await writeTranscript(resultsDir, messages, logger);
+    }
+
+    const duration = (Date.now() - startTime) / 1000;
+
+    return {
+      cost,
+      duration,
+      durationApi,
+      turns,
+    };
+  },
+};
+
+function logMessage(message: SDKMessage, logger: Logger) {
+  switch (message.type) {
+    case 'assistant': {
+      for (const block of message.message.content) {
+        if (block.type === 'text') {
+          logger.log(`💬 ${block.text}`);
+        } else if (block.type === 'tool_use') {
+          logger.log(`🔧 ${block.name}(${JSON.stringify(block.input).slice(0, 200)})`);
+        }
+      }
+      if (message.error) {
+        logger.logError(`Assistant error: ${message.error}`);
+      }
+      break;
+    }
+    case 'user': {
+      const content = message.message.content;
+      if (!Array.isArray(content)) break;
+      for (const block of content) {
+        if (block.type === 'tool_result') {
+          const text =
+            typeof block.content === 'string'
+              ? block.content.slice(0, 200)
+              : Array.isArray(block.content)
+                ? block.content
+                    .map((b: { type: string; text?: string }) =>
+                      'text' in b ? b.text : `[${b.type}]`
+                    )
+                    .join('')
+                    .slice(0, 200)
+                : '[no content]';
+          logger.log(`📎 tool_result(${block.tool_use_id?.slice(-8)}): ${text}`);
+        }
+      }
+      break;
+    }
+    case 'result':
+      if (message.subtype === 'success') {
+        logger.logSuccess(
+          `Done — ${message.num_turns} turns, $${message.total_cost_usd?.toFixed(4)}`
+        );
+      } else {
+        logger.logError(`Error (${message.subtype}): ${message.errors?.join(', ')}`);
+      }
+      break;
+    case 'system':
+      if (message.subtype === 'init') {
+        logger.log(`🚀 Session started — model: ${message.model}`);
+      } else if (message.subtype === 'api_retry') {
+        logger.log(`🔄 API retry: attempt ${message.attempt}/${message.max_retries}`);
+      } else if (message.subtype === 'status') {
+        logger.log(`📊 status: ${message.status ?? 'unknown'}`);
+      }
+      break;
+    case 'tool_use_summary':
+      logger.log(`📋 ${message.summary.slice(0, 200)}`);
+      break;
+    case 'rate_limit_event':
+      logger.log(
+        `⏳ Rate limited — status: ${message.rate_limit_info?.status}, resets at: ${message.rate_limit_info?.resetsAt}`
+      );
+      break;
+    default:
+      break;
+  }
+}
+
+async function writeTranscript(resultsDir: string, messages: unknown[], logger: Logger) {
+  try {
+    await writeFile(join(resultsDir, 'transcript.json'), JSON.stringify(messages, null, 2));
+  } catch (error) {
+    logger.logError(
+      `Failed to persist transcript: ${error instanceof Error ? error.message : String(error)}`
+    );
+  }
+}
diff --git a/scripts/eval/lib/agents/codex.ts b/scripts/eval/lib/agents/codex.ts
new file mode 100644
index 000000000000..09cbdc00ee7b
--- /dev/null
+++ b/scripts/eval/lib/agents/codex.ts
@@ -0,0 +1,105 @@
+import { Codex, type ModelReasoningEffort } from '@openai/codex-sdk';
+import { writeFile } from 'node:fs/promises';
+import { join } from 'node:path';
+import { AGENTS, estimateCost, type AgentDriver, type Execution } from './config.ts';
+import type { Logger } from '../utils.ts';
+
+export const codexAgent: AgentDriver = {
+  name: 'codex',
+
+  async execute({ prompt, projectPath, variant, resultsDir, logger }): Promise<Execution> {
+    if (variant.agent !== 'codex') {
+      throw new Error(`Codex driver received unsupported variant: ${variant.agent}`);
+    }
+
+    const startTime = Date.now();
+    const settings = AGENTS.codex.execution;
+    const { model, effort } = variant;
+
+    const codex = new Codex();
+    const thread = codex.startThread({
+      model,
+      modelReasoningEffort: effort as ModelReasoningEffort,
+      workingDirectory: projectPath,
+      approvalPolicy: settings.approvalPolicy,
+    });
+
+    const items: unknown[] = [];
+    let totalInput = 0;
+    let totalCached = 0;
+    let totalOutput = 0;
+    let turns = 0;
+
+    try {
+      const { events } = await thread.runStreamed(prompt);
+      for await (const event of events) {
+        switch (event.type) {
+          case 'item.completed': {
+            const item = event.item;
+            items.push(item);
+            switch (item.type) {
+              case 'agent_message':
+                logger.log(`💬 ${item.text.slice(0, 300)}`);
+                break;
+              case 'command_execution':
+                logger.log(`🔧 $ ${item.command} → exit ${item.exit_code ?? '?'}`);
+                if (item.exit_code !== 0 && item.aggregated_output) {
+                  logger.log(`   ${item.aggregated_output.slice(-200)}`);
+                }
+                break;
+              case 'file_change':
+                for (const c of item.changes) logger.log(`📝 ${c.kind} ${c.path}`);
+                break;
+              case 'reasoning':
+                logger.log(`🧠 ${item.text.slice(0, 200)}`);
+                break;
+              case 'error':
+                logger.logError(item.message);
+                break;
+            }
+            break;
+          }
+          case 'turn.completed':
+            totalInput += event.usage.input_tokens;
+            totalCached += event.usage.cached_input_tokens;
+            totalOutput += event.usage.output_tokens;
+            turns++;
+            logger.log(
+              `📊 tokens: ${event.usage.input_tokens}in / ${event.usage.output_tokens}out (${event.usage.cached_input_tokens} cached)`
+            );
+            break;
+          case 'turn.failed':
+            logger.logError(`Turn failed: ${event.error.message}`);
+            break;
+          case 'error':
+            logger.logError(`Error: ${event.message}`);
+            break;
+        }
+      }
+    } finally {
+      await writeTranscript(resultsDir, items, logger);
+    }
+
+    const duration = (Date.now() - startTime) / 1000;
+    const cost = estimateCost('codex', model, {
+      inputTokens: totalInput,
+      cachedInputTokens: totalCached,
+      outputTokens: totalOutput,
+    });
+    logger.logSuccess(
+      `Done — ${turns} turns, ${Math.round(duration)}s, ${totalInput}in/${totalOutput}out tokens${cost != null ? `, $${cost.toFixed(4)}` : ''}`
+    );
+
+    return { cost, duration, turns };
+  },
+};
+
+async function writeTranscript(resultsDir: string, items: unknown[], logger: Logger) {
+  try {
+    await writeFile(join(resultsDir, 'transcript.json'), JSON.stringify(items, null, 2));
+  } catch (error) {
+    logger.logError(
+      `Failed to persist transcript: ${error instanceof Error ? error.message : String(error)}`
+    );
+  }
+}
diff --git a/scripts/eval/lib/agents/config.test.ts b/scripts/eval/lib/agents/config.test.ts
new file mode 100644
index 000000000000..1236689d05cd
--- /dev/null
+++ b/scripts/eval/lib/agents/config.test.ts
@@ -0,0 +1,62 @@
+import { describe, expect, it } from 'vitest';
+
+import { AGENTS, getDefaultVariant } from './config';
+
+describe('AGENTS', () => {
+  it('keeps each agent default inside its supported model and effort lists', () => {
+    for (const config of Object.values(AGENTS)) {
+      expect(config).toMatchObject({
+        defaultModel: expect.any(String),
+        defaultEffort: expect.any(String),
+      });
+      expect(config.models).toContain(config.defaultModel);
+      expect(config.efforts).toContain(config.defaultEffort);
+    }
+  });
+
+  it('keeps Claude models fully remappable to SDK model ids', () => {
+    expect(AGENTS.claude).toMatchObject({
+      defaultModel: 'sonnet-4.6',
+      defaultEffort: 'medium',
+      execution: {
+        maxTurns: 50,
+        allowedTools: ['Read', 'Write', 'Edit', 'Bash', 'Glob', 'Grep'],
+        permissionModel: 'tool-allowlist',
+      },
+      sdkModelIds: Object.fromEntries(
+        AGENTS.claude.models.map((model) => [model, expect.any(String)])
+      ),
+    });
+  });
+
+  it('keeps Codex models fully priceable from token usage', () => {
+    expect(AGENTS.codex).toMatchObject({
+      defaultModel: 'gpt-5.4',
+      defaultEffort: 'medium',
+      execution: {
+        approvalPolicy: 'never',
+        permissionModel: 'approval-policy-never',
+      },
+      pricing: {
+        'gpt-5.4': {
+          input: 2.5,
+          cachedInput: 0.25,
+          output: 15,
+        },
+      },
+    });
+  });
+
+  it('derives default variants from the central agent definitions', () => {
+    expect(getDefaultVariant('claude')).toEqual({
+      agent: 'claude',
+      model: 'sonnet-4.6',
+      effort: 'medium',
+    });
+    expect(getDefaultVariant('codex')).toEqual({
+      agent: 'codex',
+      model: 'gpt-5.4',
+      effort: 'medium',
+    });
+  });
+});
diff --git a/scripts/eval/lib/agents/config.ts b/scripts/eval/lib/agents/config.ts
new file mode 100644
index 000000000000..eb13a52686a9
--- /dev/null
+++ b/scripts/eval/lib/agents/config.ts
@@ -0,0 +1,166 @@
+/**
+ * Agent definitions, model mappings, pricing, and cost estimation.
+ */
+
+import type { Logger } from '../utils.ts';
+
+export const CLAUDE_MODELS = ['sonnet-4.6', 'opus-4.6', 'haiku-4.5'] as const;
+export const CODEX_MODELS = ['gpt-5.4'] as const;
+export const ALL_MODELS = [...CLAUDE_MODELS, ...CODEX_MODELS] as const;
+
+export const CLAUDE_EFFORTS = ['low', 'medium', 'high', 'max'] as const;
+export const CODEX_EFFORTS = ['low', 'medium', 'high', 'xhigh'] as const;
+export const ALL_EFFORTS = ['low', 'medium', 'high', 'max', 'xhigh'] as const;
+
+export const AGENT_IDS = ['claude', 'codex'] as const;
+
+export type ClaudeModel = (typeof CLAUDE_MODELS)[number];
+export type CodexModel = (typeof CODEX_MODELS)[number];
+export type ClaudeEffort = (typeof CLAUDE_EFFORTS)[number];
+export type CodexEffort = (typeof CODEX_EFFORTS)[number];
+
+/** Agent + model + effort — validated as a discriminated union at the CLI boundary. */
+export type AgentVariant =
+  | { agent: 'claude'; model: ClaudeModel; effort: ClaudeEffort }
+  | { agent: 'codex'; model: CodexModel; effort: CodexEffort };
+
+export type AgentId = AgentVariant['agent'];
+
+export interface Execution {
+  cost?: number;
+  duration: number;
+  durationApi?: number;
+  turns: number;
+}
+
+export interface AgentExecuteParams {
+  prompt: string;
+  projectPath: string;
+  variant: AgentVariant;
+  resultsDir: string;
+  logger: Logger;
+}
+
+export interface AgentDriver {
+  name: AgentId;
+  execute(params: AgentExecuteParams): Promise<Execution>;
+}
+
+export interface TokenPricing {
+  input: number;
+  cachedInput: number;
+  output: number;
+}
+
+export interface TokenUsage {
+  inputTokens: number;
+  cachedInputTokens: number;
+  outputTokens: number;
+}
+
+export type ClaudeTool = 'Read' | 'Write' | 'Edit' | 'Bash' | 'Glob' | 'Grep';
+
+export interface ClaudeExecutionConfig {
+  maxTurns: number;
+  /**
+   * Bash is enabled or disabled as a whole here. There is no per-command allowlist, so any
+   * shell command Claude issues runs through its Bash tool once Bash is listed.
+   */
+  allowedTools: readonly ClaudeTool[];
+  debug: boolean;
+  systemPrompt: { type: 'preset'; preset: 'claude_code' };
+  /** Claude access is controlled through the explicit tool allowlist above. */
+  permissionModel: 'tool-allowlist';
+}
+
+export interface CodexExecutionConfig {
+  /** Codex runs non-interactively so benchmark runs never block on approval prompts. */
+  approvalPolicy: 'never';
+  permissionModel: 'approval-policy-never';
+}
+
+export interface AgentDefinition<TModel extends string, TEffort extends string, TExecution> {
+  models: readonly TModel[];
+  defaultModel: TModel;
+  /** Map friendly model names to SDK-specific model IDs (e.g. "sonnet-4.6" → "claude-sonnet-4-6"). */
+  sdkModelIds: Partial<Record<TModel, string>>;
+  /** USD price per million tokens, used to estimate cost for agents that don't report it natively. */
+  pricing: Partial<Record<TModel, TokenPricing>>;
+  efforts: readonly TEffort[];
+  defaultEffort: TEffort;
+  execution: TExecution;
+}
+
+export type ClaudeDefinition = AgentDefinition<ClaudeModel, ClaudeEffort, ClaudeExecutionConfig>;
+export type CodexDefinition = AgentDefinition<CodexModel, CodexEffort, CodexExecutionConfig>;
+
+export interface AgentDefinitions {
+  claude: ClaudeDefinition;
+  codex: CodexDefinition;
+}
+
+export const AGENTS: AgentDefinitions = {
+  claude: {
+    models: CLAUDE_MODELS,
+    defaultModel: 'sonnet-4.6',
+    sdkModelIds: {
+      'sonnet-4.6': 'claude-sonnet-4-6',
+      'opus-4.6': 'claude-opus-4-6',
+      'haiku-4.5': 'claude-haiku-4-5',
+    },
+    pricing: {},
+    efforts: CLAUDE_EFFORTS,
+    defaultEffort: 'medium',
+    execution: {
+      maxTurns: 50,
+      allowedTools: ['Read', 'Write', 'Edit', 'Bash', 'Glob', 'Grep'],
+      debug: true,
+      systemPrompt: { type: 'preset', preset: 'claude_code' },
+      permissionModel: 'tool-allowlist',
+    },
+  },
+  codex: {
+    models: CODEX_MODELS,
+    defaultModel: 'gpt-5.4',
+    sdkModelIds: {},
+    pricing: {
+      'gpt-5.4': { input: 2.5, cachedInput: 0.25, output: 15.0 },
+    },
+    efforts: CODEX_EFFORTS,
+    defaultEffort: 'medium',
+    execution: {
+      approvalPolicy: 'never',
+      permissionModel: 'approval-policy-never',
+    },
+  },
+};
+
+export function getDefaultVariant<T extends AgentId>(
+  agent: T
+): Extract<AgentVariant, { agent: T }> {
+  const definition = AGENTS[agent];
+  return {
+    agent,
+    model: definition.defaultModel,
+    effort: definition.defaultEffort,
+  } as Extract<AgentVariant, { agent: T }>;
+}
+
+export function resolveClaudeSdkModel(model: ClaudeModel): string {
+  return AGENTS.claude.sdkModelIds[model] ?? model;
+}
+
+/** Estimate cost from token usage using the pricing table. */
+export function estimateCost(agent: AgentId, model: string, usage: TokenUsage): number | undefined {
+  const pricing =
+    agent === 'claude'
+      ? AGENTS.claude.pricing[model as ClaudeModel]
+      : AGENTS.codex.pricing[model as CodexModel];
+  if (!pricing) return undefined;
+  const freshInput = usage.inputTokens - usage.cachedInputTokens;
+  return (
+    (freshInput / 1_000_000) * pricing.input +
+    (usage.cachedInputTokens / 1_000_000) * pricing.cachedInput +
+    (usage.outputTokens / 1_000_000) * pricing.output
+  );
+}
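Reviewer note on `estimateCost`: the arithmetic can be checked by hand. The sketch below is a standalone copy of the formula with the `gpt-5.4` prices from the table above. The names `SketchPricing`, `SketchUsage`, and `sketchEstimateCost` are hypothetical, and the token counts are invented for illustration. Like the original, it assumes `inputTokens` already includes the cached tokens.

```typescript
// Hypothetical standalone copy of the pricing arithmetic; names are illustrative.
interface SketchPricing {
  input: number; // price per 1M fresh input tokens
  cachedInput: number; // price per 1M cached input tokens
  output: number; // price per 1M output tokens
}

interface SketchUsage {
  inputTokens: number;
  cachedInputTokens: number;
  outputTokens: number;
}

// Figures copied from the gpt-5.4 pricing entry in this diff.
const sketchPricing: SketchPricing = { input: 2.5, cachedInput: 0.25, output: 15 };

function sketchEstimateCost(pricing: SketchPricing, usage: SketchUsage): number {
  // Assumption: inputTokens includes cached tokens, so subtract them to get fresh input.
  const freshInput = usage.inputTokens - usage.cachedInputTokens;
  return (
    (freshInput / 1_000_000) * pricing.input +
    (usage.cachedInputTokens / 1_000_000) * pricing.cachedInput +
    (usage.outputTokens / 1_000_000) * pricing.output
  );
}

// Invented usage: 1M input tokens (400k of them cached) and 100k output tokens.
const sketchCost = sketchEstimateCost(sketchPricing, {
  inputTokens: 1_000_000,
  cachedInputTokens: 400_000,
  outputTokens: 100_000,
});

// 0.6 * 2.5 + 0.4 * 0.25 + 0.1 * 15 = 1.5 + 0.1 + 1.5 = 3.1
console.log(sketchCost.toFixed(4));
```

If an SDK ever reported `inputTokens` excluding cached tokens, the subtraction would undercount fresh input, so that assumption is worth confirming against the usage payload.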
diff --git a/scripts/eval/lib/grade.test.ts b/scripts/eval/lib/grade.test.ts
new file mode 100644
index 000000000000..adcf2d85667d
--- /dev/null
+++ b/scripts/eval/lib/grade.test.ts
@@ -0,0 +1,260 @@
+import { describe, expect, it, vi } from 'vitest';
+
+vi.mock('../../../code/core/src/core-server/utils/ghost-stories/get-candidates.ts', () => ({
+  getComponentCandidates: vi.fn(),
+}));
+
+vi.mock('../../../code/core/src/core-server/utils/ghost-stories/run-story-tests.ts', () => ({
+  runGhostStories: vi.fn(),
+}));
+
+import {
+  filterStorybookFiles,
+  computeQualityScore,
+  countTypeCheckErrors,
+  parseChangedFiles,
+} from './grade';
+import type { FileChange } from './grade';
+
+describe('filterStorybookFiles', () => {
+  it('matches files in .storybook/ directory', () => {
+    const files: FileChange[] = [
+      { path: '.storybook/main.ts', gitStatus: 'M' },
+      { path: '.storybook/preview.tsx', gitStatus: 'A' },
+      { path: 'src/App.tsx', gitStatus: 'M' },
+    ];
+    expect(filterStorybookFiles(files)).toMatchObject([
+      { path: '.storybook/main.ts', gitStatus: 'M' },
+      { path: '.storybook/preview.tsx', gitStatus: 'A' },
+    ]);
+  });
+
+  it('matches story files with various extensions', () => {
+    const files: FileChange[] = [
+      { path: 'src/Button.stories.tsx', gitStatus: 'A' },
+      { path: 'src/Header.stories.ts', gitStatus: 'A' },
+      { path: 'src/Page.story.jsx', gitStatus: 'A' },
+      { path: 'src/utils.stories.js', gitStatus: 'A' },
+      { path: 'src/Button.tsx', gitStatus: 'M' },
+      { path: 'src/Button.test.tsx', gitStatus: 'M' },
+    ];
+    expect(filterStorybookFiles(files)).toMatchObject(files.slice(0, 4));
+  });
+
+  it('returns empty for no storybook files', () => {
+    const files: FileChange[] = [
+      { path: 'src/App.tsx', gitStatus: 'M' },
+      { path: 'package.json', gitStatus: 'M' },
+    ];
+    expect(filterStorybookFiles(files)).toHaveLength(0);
+  });
+
+  it('handles empty input', () => {
+    expect(filterStorybookFiles([])).toHaveLength(0);
+  });
+
+  it('matches renamed files using either side of the rename', () => {
+    const files: FileChange[] = [
+      { path: 'src/Button.tsx', previousPath: 'src/Button.stories.tsx', gitStatus: 'R' },
+      { path: '.storybook/preview.tsx', previousPath: 'config/preview.tsx', gitStatus: 'R' },
+      { path: 'src/App.tsx', previousPath: 'src/Main.tsx', gitStatus: 'R' },
+    ];
+
+    expect(filterStorybookFiles(files)).toMatchObject(files.slice(0, 2));
+  });
+});
+
+describe('computeQualityScore', () => {
+  // Weights: 40% ghost, 25% build, 25% typecheck, 10% performance
+
+  it('returns 1.0 when everything passes and agent is fast', () => {
+    const result = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 0,
+      ghostSuccessRate: 1.0,
+      durationSeconds: 60,
+    });
+    expect(result.score).toBe(1);
+    expect(result.breakdown).toEqual({ build: 1, typecheck: 1, ghostStories: 1, performance: 1 });
+  });
+
+  it('ghost stories have 40% weight', () => {
+    const result = computeQualityScore({
+      buildSuccess: false,
+      typeCheckErrors: 20,
+      ghostSuccessRate: 1.0,
+      durationSeconds: 600,
+    });
+    expect(result.score).toBe(0.4);
+  });
+
+  it('build has 25% weight', () => {
+    const result = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 20,
+      ghostSuccessRate: 0,
+      durationSeconds: 600,
+    });
+    expect(result.score).toBe(0.25);
+  });
+
+  it('performance has 10% weight', () => {
+    const result = computeQualityScore({
+      buildSuccess: false,
+      typeCheckErrors: 20,
+      ghostSuccessRate: 0,
+      durationSeconds: 60,
+    });
+    expect(result.score).toBe(0.1);
+  });
+
+  it('returns 0 when everything fails', () => {
+    const result = computeQualityScore({
+      buildSuccess: false,
+      typeCheckErrors: 20,
+      ghostSuccessRate: 0,
+      durationSeconds: 600,
+    });
+    expect(result.score).toBe(0);
+  });
+
+  it('scales typecheck score linearly', () => {
+    const result = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 10,
+      ghostSuccessRate: 1.0,
+      durationSeconds: 60,
+    });
+    expect(result.breakdown.typecheck).toBe(0.5);
+  });
+
+  it('clamps typecheck score at 0 for >= 20 errors', () => {
+    const a = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 20,
+      ghostSuccessRate: 1.0,
+      durationSeconds: 60,
+    });
+    const b = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 50,
+      ghostSuccessRate: 1.0,
+      durationSeconds: 60,
+    });
+    expect(a.breakdown.typecheck).toBe(0);
+    expect(b.breakdown.typecheck).toBe(0);
+  });
+
+  it('treats undefined ghost stories as 0', () => {
+    const a = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 0,
+      ghostSuccessRate: 0,
+      durationSeconds: 60,
+    });
+    const b = computeQualityScore({ buildSuccess: true, typeCheckErrors: 0, durationSeconds: 60 });
+    expect(a.score).toBe(b.score);
+  });
+
+  it('performance: ≤120s scores 1.0', () => {
+    const a = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 0,
+      ghostSuccessRate: 1.0,
+      durationSeconds: 0,
+    });
+    const b = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 0,
+      ghostSuccessRate: 1.0,
+      durationSeconds: 120,
+    });
+    expect(a.breakdown.performance).toBe(1);
+    expect(b.breakdown.performance).toBe(1);
+  });
+
+  it('performance: 360s scores 0.5', () => {
+    const r = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 0,
+      ghostSuccessRate: 1.0,
+      durationSeconds: 360,
+    });
+    expect(r.breakdown.performance).toBe(0.5);
+  });
+
+  it('performance: ≥600s scores 0', () => {
+    const a = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 0,
+      ghostSuccessRate: 1.0,
+      durationSeconds: 600,
+    });
+    const b = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 0,
+      ghostSuccessRate: 1.0,
+      durationSeconds: 1000,
+    });
+    expect(a.breakdown.performance).toBe(0);
+    expect(b.breakdown.performance).toBe(0);
+  });
+
+  it('performance: undefined duration scores 0', () => {
+    const r = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 0,
+      ghostSuccessRate: 1.0,
+    });
+    expect(r.breakdown.performance).toBe(0);
+  });
+});
+
+describe('countTypeCheckErrors', () => {
+  it('counts zero for clean output', () => {
+    expect(countTypeCheckErrors('')).toBe(0);
+    expect(countTypeCheckErrors('All good\nNo issues')).toBe(0);
+  });
+
+  it('counts TypeScript error codes', () => {
+    const output = [
+      "src/App.tsx(3,1): error TS2304: Cannot find name 'foo'.",
+      "src/App.tsx(5,1): error TS2322: Type 'string' is not assignable.",
+      'Found 2 errors.',
+    ].join('\n');
+    expect(countTypeCheckErrors(output)).toBe(2);
+  });
+
+  it('counts multiple errors on the same line', () => {
+    expect(countTypeCheckErrors('error TS1234 and error TS5678 on same line')).toBe(2);
+  });
+
+  it('does not count non-error TS references', () => {
+    expect(countTypeCheckErrors('TS2304 without error prefix')).toBe(0);
+    expect(countTypeCheckErrors('warning TS1234')).toBe(0);
+  });
+});
+
+describe('parseChangedFiles', () => {
+  it('parses added, modified, deleted, and renamed files', () => {
+    const output =
+      'A\tsrc/new-file.ts\nM\tsrc/existing.ts\nD\tsrc/removed.ts\nR100\told.ts\tnew.ts';
+    expect(parseChangedFiles(output)).toMatchObject([
+      { path: 'src/new-file.ts', gitStatus: 'A' },
+      { path: 'src/existing.ts', gitStatus: 'M' },
+      { path: 'src/removed.ts', gitStatus: 'D' },
+      { path: 'new.ts', previousPath: 'old.ts', gitStatus: 'R' },
+    ]);
+  });
+
+  it('handles empty output', () => {
+    expect(parseChangedFiles('')).toEqual([]);
+    expect(parseChangedFiles('\n')).toEqual([]);
+  });
+
+  it('handles single file', () => {
+    expect(parseChangedFiles('M\tpackage.json')).toEqual([
+      { path: 'package.json', gitStatus: 'M' },
+    ]);
+  });
+});
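Reviewer note on the scoring tests above: each test isolates one weight, so a mixed case helps to see how the parts combine. The sketch below works one through, assuming the 40/25/25/10 weights and the thresholds the tests imply (20 type errors, 120s, 600s). `sketchWeights` and `sketchScore` are hypothetical names and the input values are invented.

```typescript
// Weights and thresholds as implied by the tests above (assumed, not imported).
const sketchWeights = { ghostStories: 0.4, build: 0.25, typecheck: 0.25, performance: 0.1 };

function sketchScore(input: {
  buildSuccess: boolean;
  typeCheckErrors: number;
  ghostSuccessRate: number;
  durationSeconds: number;
}): number {
  const build = input.buildSuccess ? 1 : 0;
  // Linear from 1 at 0 errors down to 0 at 20 or more errors.
  const typecheck = Math.max(0, 1 - input.typeCheckErrors / 20);
  // 1 at or below 120s, 0 at or above 600s, linear in between.
  const perf = Math.max(0, Math.min(1, 1 - (input.durationSeconds - 120) / (600 - 120)));
  const weighted =
    input.ghostSuccessRate * sketchWeights.ghostStories +
    build * sketchWeights.build +
    typecheck * sketchWeights.typecheck +
    perf * sketchWeights.performance;
  // Round to two decimals.
  return Math.round(weighted * 100) / 100;
}

// Build passes, 5 type errors (0.75), half the ghost stories pass, 240s (0.75):
// 0.5 * 0.4 + 1 * 0.25 + 0.75 * 0.25 + 0.75 * 0.1 = 0.7125, which rounds to 0.71.
const sketchMixed = sketchScore({
  buildSuccess: true,
  typeCheckErrors: 5,
  ghostSuccessRate: 0.5,
  durationSeconds: 240,
});
console.log(sketchMixed);
```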
diff --git a/scripts/eval/lib/grade.ts b/scripts/eval/lib/grade.ts
new file mode 100644
index 000000000000..0bf3259b56da
--- /dev/null
+++ b/scripts/eval/lib/grade.ts
@@ -0,0 +1,289 @@
+import { writeFile } from 'node:fs/promises';
+import { join } from 'node:path';
+import { x } from 'tinyexec';
+import { getComponentCandidates } from '../../../code/core/src/core-server/utils/ghost-stories/get-candidates.ts';
+import { runGhostStories } from '../../../code/core/src/core-server/utils/ghost-stories/run-story-tests.ts';
+import type { Logger } from './utils.ts';
+import type { TrialWorkspace } from './prepare-trial.ts';
+
+/** Git `--name-status` codes: A=added, M=modified, D=deleted, R=renamed. */
+export type GitDiffStatus = 'A' | 'M' | 'D' | 'R';
+
+export interface FileChange {
+  path: string;
+  gitStatus: GitDiffStatus;
+  /** For renames, the original path before the move. */
+  previousPath?: string;
+}
+
+export interface GhostStoryGrade {
+  candidateCount: number;
+  total: number;
+  passed: number;
+  successRate: number;
+}
+
+export interface ScoreWeights {
+  ghostStories: number;
+  build: number;
+  typecheck: number;
+  performance: number;
+}
+
+export const DEFAULT_SCORE_WEIGHTS: ScoreWeights = {
+  ghostStories: 0.4,
+  build: 0.25,
+  typecheck: 0.25,
+  performance: 0.1,
+};
+
+export interface QualityScore {
+  score: number;
+  breakdown: {
+    build: number;
+    typecheck: number;
+    ghostStories: number;
+    performance: number;
+  };
+}
+
+export interface Grade {
+  buildSuccess: boolean;
+  buildError?: string;
+  typeCheckErrors: number;
+  typeCheckOutput?: string;
+  fileChanges: FileChange[];
+  storybookChanges: FileChange[];
+  ghostStories?: GhostStoryGrade;
+}
+
+/** Maximum TypeScript errors before the typecheck score reaches 0. */
+const MAX_TYPECHECK_ERRORS = 20;
+/** Agent duration (seconds) at or below which performance scores 1.0. */
+const PERFECT_DURATION_S = 120;
+/** Agent duration (seconds) at or above which performance scores 0. */
+const ZERO_SCORE_DURATION_S = 600;
+
+/** Filter file changes to only storybook-related ones. */
+export function filterStorybookFiles(fileChanges: FileChange[]): FileChange[] {
+  const isStorybookPath = (path?: string) =>
+    path != null && (path.includes('.storybook/') || /\.(stories|story)\.[tj]sx?$/.test(path));
+
+  return fileChanges.filter((f) => isStorybookPath(f.path) || isStorybookPath(f.previousPath));
+}
+
+/**
+ * Compute quality score with configurable weights.
+ *
+ * Default weights: 40% ghost stories, 25% build, 25% typecheck, 10% performance.
+ *
+ * Performance is scored on a curve: <=120s -> 1.0, >=600s -> 0, linear between; missing -> 0.
+ */
+export function computeQualityScore(
+  opts: {
+    buildSuccess: boolean;
+    typeCheckErrors: number;
+    ghostSuccessRate?: number;
+    durationSeconds?: number;
+  },
+  weights: ScoreWeights = DEFAULT_SCORE_WEIGHTS
+): QualityScore {
+  const buildScore = opts.buildSuccess ? 1 : 0;
+  const tcScore = Math.max(0, 1 - opts.typeCheckErrors / MAX_TYPECHECK_ERRORS);
+  const ghostScore = opts.ghostSuccessRate ?? 0;
+  const d = opts.durationSeconds;
+  const perfScore =
+    d == null
+      ? 0
+      : Math.max(
+          0,
+          Math.min(1, 1 - (d - PERFECT_DURATION_S) / (ZERO_SCORE_DURATION_S - PERFECT_DURATION_S))
+        );
+  const score =
+    Math.round(
+      (ghostScore * weights.ghostStories +
+        buildScore * weights.build +
+        tcScore * weights.typecheck +
+        perfScore * weights.performance) *
+        100
+    ) / 100;
+  return {
+    score,
+    breakdown: {
+      build: buildScore,
+      typecheck: Math.round(tcScore * 100) / 100,
+      ghostStories: Math.round(ghostScore * 100) / 100,
+      performance: Math.round(perfScore * 100) / 100,
+    },
+  };
+}
+
+/** Count TypeScript errors from tsc output. */
+export function countTypeCheckErrors(tscOutput: string): number {
+  return (tscOutput.match(/error TS\d+/g) || []).length;
+}
+
+/** Parse git diff --name-status output into FileChange objects. */
+export function parseChangedFiles(gitOutput: string): FileChange[] {
+  return gitOutput
+    .trim()
+    .split('\n')
+    .filter(Boolean)
+    .map((line) => {
+      const [status, ...parts] = line.split('\t');
+      const gitStatus = parseGitDiffStatus(status);
+
+      if (gitStatus === 'R' && parts.length >= 2) {
+        const [previousPath, path] = parts;
+        return { path, previousPath, gitStatus };
+      }
+
+      return { path: parts.join('\t'), gitStatus };
+    });
+}
+
+export async function grade(
+  workspace: TrialWorkspace,
+  logger: Logger,
+  agentDuration?: number
+): Promise<{ grade: Grade; score: QualityScore }> {
+  const { repoRoot, projectPath, resultsDir, baselineCommit } = workspace;
+
+  // Changed files
+  logger.logStep('Collecting agent changes...');
+  const fileChanges = await getChangedFiles(repoRoot, baselineCommit);
+  const storybookChanges = filterStorybookFiles(fileChanges);
+  logger.logSuccess(
+    `${fileChanges.length} files changed (${storybookChanges.length} storybook-related)`
+  );
+
+  // Storybook build + TypeScript check in parallel
+  logger.logStep('Running storybook build + typecheck...');
+  const [build, tsc] = await Promise.all([
+    x('npx', ['storybook', 'build', '--quiet'], {
+      throwOnError: false,
+      timeout: 300_000,
+      nodeOptions: {
+        cwd: projectPath,
+        env: {
+          ...process.env,
+          STORYBOOK_DISABLE_TELEMETRY: '1',
+          NODE_OPTIONS: '--max_old_space_size=4096',
+        },
+      },
+    }),
+    x('npx', ['tsc', '--noEmit'], {
+      throwOnError: false,
+      timeout: 120_000,
+      nodeOptions: { cwd: projectPath },
+    }),
+  ]);
+
+  const buildSuccess = build.exitCode === 0;
+  const buildOutput = build.stdout + '\n' + build.stderr;
+  await writeFile(join(resultsDir, 'build-output.txt'), buildOutput);
+  if (buildSuccess) {
+    logger.logSuccess('Storybook build succeeded');
+  } else {
+    logger.logError(`Storybook build failed (exit ${build.exitCode})`);
+  }
+
+  const tscOutput = tsc.stdout + '\n' + tsc.stderr;
+  await writeFile(join(resultsDir, 'typecheck-output.txt'), tscOutput);
+  const typeCheckErrors = countTypeCheckErrors(tscOutput);
+  if (typeCheckErrors === 0) {
+    logger.logSuccess('No TypeScript errors');
+  } else {
+    logger.logError(`${typeCheckErrors} TypeScript error(s)`);
+  }
+
+  // Ghost stories (only if build passed)
+  const ghostStories = buildSuccess ? await gradeGhostStories(projectPath, logger) : undefined;
+
+  const trialGrade: Grade = {
+    buildSuccess,
+    buildError: buildSuccess ? undefined : truncateEnd(buildOutput, 2000),
+    typeCheckErrors,
+    typeCheckOutput: typeCheckErrors > 0 ? truncateEnd(tscOutput, 2000) : undefined,
+    fileChanges,
+    storybookChanges,
+    ghostStories,
+  };
+
+  const score = computeQualityScore({
+    buildSuccess,
+    typeCheckErrors,
+    ghostSuccessRate: ghostStories?.successRate,
+    durationSeconds: agentDuration,
+  });
+
+  return { grade: trialGrade, score };
+}
+
+async function getChangedFiles(repoRoot: string, baseline: string): Promise<FileChange[]> {
+  // Stage all files so `git diff --cached` picks up new files the agent created.
+  // Safe: this runs on an ephemeral trial copy, not the real repo.
+  await x('git', ['add', '-A'], { nodeOptions: { cwd: repoRoot } });
+  const { stdout } = await x('git', ['diff', '--cached', '--name-status', baseline], {
+    throwOnError: false,
+    nodeOptions: { cwd: repoRoot },
+  });
+  return parseChangedFiles(stdout);
+}
+
+async function gradeGhostStories(
+  projectPath: string,
+  logger: Logger
+): Promise<GhostStoryGrade | undefined> {
+  logger.logStep('Running ghost stories...');
+
+  try {
+    const { candidates } = await getComponentCandidates({ sampleSize: 20, cwd: projectPath });
+    if (candidates.length === 0) {
+      logger.logError('No candidate components found');
+      return undefined;
+    }
+    logger.logStep(`Found ${candidates.length} candidate component(s)`);
+
+    const result = await runGhostStories(candidates, { cwd: projectPath });
+
+    if (result.runError) {
+      logger.logError(`Ghost stories: ${result.runError}`);
+      return undefined;
+    }
+
+    const summary = 'summary' in result ? result.summary : undefined;
+
+    if (summary && summary.total > 0) {
+      const realPassed = summary.passed - summary.passedButEmptyRender;
+      logger.logSuccess(
+        `Ghost stories: ${realPassed}/${summary.total} passed (${Math.round(summary.successRateWithoutEmptyRender * 100)}%)${summary.passedButEmptyRender > 0 ? ` (${summary.passedButEmptyRender} empty renders excluded)` : ''}`
+      );
+    }
+
+    return {
+      candidateCount: candidates.length,
+      total: summary?.total ?? 0,
+      passed: (summary?.passed ?? 0) - (summary?.passedButEmptyRender ?? 0),
+      successRate: summary?.successRateWithoutEmptyRender ?? 0,
+    };
+  } catch (error) {
+    logger.logError(`Ghost stories: ${error instanceof Error ? error.message : String(error)}`);
+    return undefined;
+  }
+}
+
+/** Keep roughly the last maxChars characters of text, dropping the leading partial line. */
+function truncateEnd(text: string, maxChars: number): string {
+  if (text.length <= maxChars) return text;
+  const truncated = text.slice(-maxChars);
+  const firstNewline = truncated.indexOf('\n');
+  return firstNewline >= 0 ? truncated.slice(firstNewline + 1) : truncated;
+}
+
+function parseGitDiffStatus(rawStatus?: string): GitDiffStatus {
+  const firstChar = rawStatus?.charAt(0);
+  return firstChar === 'A' || firstChar === 'M' || firstChar === 'D' || firstChar === 'R'
+    ? firstChar
+    : 'M';
+}
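Reviewer note on the scoring in `grade.ts`: the weighted sum and the performance curve are easier to check with one worked case. The following is a minimal sketch that assumes the constants shown above (weights 0.4 / 0.25 / 0.25 / 0.1, 20 type errors, 120s to 600s). `SketchInput`, `SKETCH_WEIGHTS`, and `sketchScore` are hypothetical names used only for illustration and are not part of the PR.

```typescript
// Hypothetical restatement of the scoring formula, for illustration only.
interface SketchInput {
  buildSuccess: boolean;
  typeCheckErrors: number;
  ghostSuccessRate?: number;
  durationSeconds?: number;
}

// Assumed to mirror DEFAULT_SCORE_WEIGHTS from the diff above.
const SKETCH_WEIGHTS = { ghostStories: 0.4, build: 0.25, typecheck: 0.25, performance: 0.1 };

function sketchScore(input: SketchInput): number {
  const build = input.buildSuccess ? 1 : 0;
  // 20 or more errors bottom out at 0.
  const typecheck = Math.max(0, 1 - input.typeCheckErrors / 20);
  // A skipped ghost-story run counts as 0, as does a missing duration.
  const ghost = input.ghostSuccessRate ?? 0;
  const d = input.durationSeconds;
  const performance = d == null ? 0 : Math.max(0, Math.min(1, 1 - (d - 120) / (600 - 120)));

  const weighted =
    ghost * SKETCH_WEIGHTS.ghostStories +
    build * SKETCH_WEIGHTS.build +
    typecheck * SKETCH_WEIGHTS.typecheck +
    performance * SKETCH_WEIGHTS.performance;
  // Round to two decimals, as the graded score does.
  return Math.round(weighted * 100) / 100;
}

// Build passes, 5 type errors (0.75), 80% ghost stories, 360s (0.5 on the curve):
// 0.8 * 0.4 + 1 * 0.25 + 0.75 * 0.25 + 0.5 * 0.1 = 0.8075, which rounds to 0.81.
const midScore = sketchScore({
  buildSuccess: true,
  typeCheckErrors: 5,
  ghostSuccessRate: 0.8,
  durationSeconds: 360,
});
console.log(midScore);
```

One consequence worth keeping in mind while reviewing: because a failed build skips ghost stories, a failing build loses both the build weight and the ghost-story weight, so at most 0.35 remains reachable under the default weights.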
diff --git a/scripts/eval/lib/grading-helpers.test.ts b/scripts/eval/lib/grading-helpers.test.ts
new file mode 100644
index 000000000000..8d883da92f7a
--- /dev/null
+++ b/scripts/eval/lib/grading-helpers.test.ts
@@ -0,0 +1,177 @@
+import { mkdirSync, writeFileSync, rmSync } from 'node:fs';
+import { join } from 'node:path';
+import { tmpdir } from 'node:os';
+
+import { afterEach, beforeEach, describe, expect, it } from 'vitest';
+
+import { getComponentCandidates } from 'storybook/internal/core-server';
+import {
+  computeQualityScore,
+  countTypeCheckErrors,
+  filterStorybookFiles,
+  parseChangedFiles,
+} from './grade';
+/**
+ * Helper-level test: compose grading helpers on a fake project directory.
+ * This exercises candidate discovery, git-output parsing,
+ * and quality-score calculation without pretending to cover the full grade() flow.
+ */
+
+let TMP: string;
+
+beforeEach(() => {
+  TMP = join(tmpdir(), `eval-grading-helpers-${Date.now()}`);
+  mkdirSync(join(TMP, 'src', 'components'), { recursive: true });
+  mkdirSync(join(TMP, '.storybook'), { recursive: true });
+});
+
+afterEach(() => {
+  rmSync(TMP, { recursive: true, force: true });
+});
+
+describe('grading helpers', () => {
+  it('composes helper signals for a well-configured project', async () => {
+    // Set up a realistic project with components and storybook config
+    writeFile(
+      'src/components/Button.tsx',
+      [
+        `import React from 'react';`,
+        `export function Button({ label }: { label: string }) {`,
+        `  return (`,
+        `    <button className="btn">{label}</button>`,
+        `  );`,
+        `}`,
+      ].join('\n')
+    );
+    writeFile(
+      'src/components/Card.tsx',
+      [
+        `import React from 'react';`,
+        `export function Card({ title }: { title: string }) {`,
+        `  return (`,
+        `    <div className="card">{title}</div>`,
+        `  );`,
+        `}`,
+      ].join('\n')
+    );
+    writeFile(
+      '.storybook/preview.tsx',
+      [
+        `import '../src/styles/globals.css';`,
+        `import { ThemeProvider } from '@emotion/react';`,
+      ].join('\n')
+    );
+    writeFile(
+      '.storybook/main.ts',
+      `export default { staticDirs: ['../public'], stories: ['../src/**/*.stories.tsx'] };`
+    );
+
+    // Step 1: Find candidates — both components should be discovered
+    const candidates = await findCandidates(TMP);
+    expect(candidates).toHaveLength(2);
+
+    // Step 2: Simulate git output where the agent added storybook config + one
+    // story per discovered candidate, plus modified package.json
+    const gitLines = [
+      'A\t.storybook/preview.tsx',
+      'A\t.storybook/main.ts',
+      ...candidates.map((c) => `A\t${c.replace(/\.tsx$/, '.stories.tsx')}`),
+      'M\tpackage.json',
+    ];
+    const changedFiles = parseChangedFiles(gitLines.join('\n'));
+    const storybookFiles = filterStorybookFiles(changedFiles);
+
+    // 2 config files + 1 story per candidate = storybook-related
+    expect(storybookFiles).toHaveLength(2 + candidates.length);
+    // Total includes package.json
+    expect(changedFiles).toHaveLength(storybookFiles.length + 1);
+
+    // Step 3: Build passed, no TS errors, 100% ghost stories, fast agent → perfect score
+    const quality = computeQualityScore({
+      buildSuccess: true,
+      typeCheckErrors: 0,
+      ghostSuccessRate: 1.0,
+      durationSeconds: 60,
+    });
+    expect(quality.score).toBe(1);
+  });
+
+  it('composes helper signals for a broken project', async () => {
+    writeFile(
+      'src/components/Widget.tsx',
+      [
+        `import React from 'react';`,
+        `export function Widget() {`,
+        `  return <div>hello</div>;`,
+        `}`,
+      ].join('\n')
+    );
+
+    // Candidates still discoverable even when storybook setup is broken
+    const candidates = await findCandidates(TMP);
+    expect(candidates).toHaveLength(1);
+
+    // Simulate tsc output with errors proportional to candidate count
+    const tscLines = candidates.map(
+      (c, i) => `${c}(${i + 1},1): error TS2304: Cannot find name 'React'.`
+    );
+    tscLines.push('src/App.tsx(10,5): error TS2345: Argument not assignable.');
+    const errorCount = countTypeCheckErrors(tscLines.join('\n'));
+    expect(errorCount).toBe(candidates.length + 1);
+
+    // Build failed, no ghost stories, errors, slow → low quality
+    const quality = computeQualityScore({
+      buildSuccess: false,
+      typeCheckErrors: errorCount,
+      ghostSuccessRate: 0,
+      durationSeconds: 600,
+    });
+    expect(quality.score).toBeLessThan(0.3);
+    expect(quality.breakdown.build).toBe(0);
+  });
+
+  it('keeps helper output stable as candidate count grows', async () => {
+    // Rich project: many simple components
+    for (let i = 0; i < 5; i++) {
+      writeFile(
+        `src/components/Comp${i}.tsx`,
+        [
+          `import React from 'react';`,
+          `export function Comp${i}() {`,
+          `  return <div>Component ${i}</div>;`,
+          `}`,
+        ].join('\n')
+      );
+    }
+    writeFile('.storybook/preview.tsx', `import { MemoryRouter } from 'react-router-dom';`);
+
+    const candidates = await findCandidates(TMP);
+    expect(candidates).toHaveLength(5);
+
+    // Agent wrote one story per candidate — all storybook-related
+    const gitOutput = candidates.map((c) => `A\t${c.replace(/\.tsx$/, '.stories.tsx')}`).join('\n');
+    const storybookFiles = filterStorybookFiles(parseChangedFiles(gitOutput));
+    expect(storybookFiles).toHaveLength(candidates.length);
+
+    // Clean build + 100% ghost stories + fast → perfect
+    expect(
+      computeQualityScore({
+        buildSuccess: true,
+        typeCheckErrors: 0,
+        ghostSuccessRate: 1.0,
+        durationSeconds: 60,
+      }).score
+    ).toBe(1);
+  });
+});
+
+function writeFile(relativePath: string, content: string) {
+  const fullPath = join(TMP, relativePath);
+  mkdirSync(join(fullPath, '..'), { recursive: true });
+  writeFileSync(fullPath, content);
+}
+
+async function findCandidates(cwd: string) {
+  const { candidates } = await getComponentCandidates({ cwd, sampleSize: 20 });
+  return candidates.map((c) => c.replace(cwd + '/', ''));
+}
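Reviewer note: the helper tests above only feed `A` and `M` lines into `parseChangedFiles`, so the rename branch is not exercised here. As a rough sketch of what that branch handles, `git diff --name-status` reports a rename as a status with a similarity score followed by the old and new paths, separated by tabs. `SketchChange` and `sketchParseLine` below are hypothetical stand-ins written for this note, not the PR's exports.

```typescript
// Hypothetical, self-contained restatement of the per-line parsing logic.
type SketchStatus = 'A' | 'M' | 'D' | 'R';

interface SketchChange {
  path: string;
  gitStatus: SketchStatus;
  previousPath?: string;
}

function sketchParseLine(line: string): SketchChange {
  const [rawStatus = '', ...parts] = line.split('\t');
  // Rename statuses carry a similarity score (for example "R100"), so only the first char matters.
  const first = rawStatus.charAt(0);
  const gitStatus: SketchStatus =
    first === 'A' || first === 'M' || first === 'D' || first === 'R' ? first : 'M';

  if (gitStatus === 'R' && parts.length >= 2) {
    const [previousPath = '', path = ''] = parts;
    return { path, previousPath, gitStatus };
  }
  // Paths may in principle contain tabs, so rejoin whatever followed the status.
  return { path: parts.join('\t'), gitStatus };
}

const renamed = sketchParseLine('R100\tsrc/Old.stories.tsx\tsrc/New.stories.tsx');
const added = sketchParseLine('A\t.storybook/main.ts');
console.log(renamed, added);
```

A test case along these lines would also cover the `previousPath` check inside `filterStorybookFiles`, which matters when a story file is renamed to a non-story path.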
diff --git a/scripts/eval/lib/package-manager.test.ts b/scripts/eval/lib/package-manager.test.ts
new file mode 100644
index 000000000000..4d958198d3d7
--- /dev/null
+++ b/scripts/eval/lib/package-manager.test.ts
@@ -0,0 +1,71 @@
+import { mkdirSync, rmSync, writeFileSync } from 'node:fs';
+import { dirname, join } from 'node:path';
+import { tmpdir } from 'node:os';
+
+import { afterEach, describe, expect, it } from 'vitest';
+
+import { detectPackageManager, resolveInstallRoot } from './package-manager';
+
+const TEMP_DIRS: string[] = [];
+
+afterEach(() => {
+  for (const dir of TEMP_DIRS.splice(0)) {
+    rmSync(dir, { recursive: true, force: true });
+  }
+});
+
+describe('detectPackageManager', () => {
+  it('recognizes npm from package-lock files', () => {
+    const root = createTempDir('npm-lock');
+    writeFile('package-lock.json', root);
+
+    expect(detectPackageManager(root)).toBe('npm');
+  });
+});
+
+describe('resolveInstallRoot', () => {
+  it('keeps nested standalone apps on their own install root', () => {
+    const repoRoot = createTempDir('nested-bun');
+    const projectDir = join(repoRoot, 'frontend');
+    mkdirSync(projectDir, { recursive: true });
+    writeFile('frontend/bun.lock', repoRoot);
+
+    expect(resolveInstallRoot(projectDir, repoRoot)).toBe(projectDir);
+  });
+
+  it('walks up to the repo workspace root when lockfiles live above projectDir', () => {
+    const repoRoot = createTempDir('pnpm-workspace');
+    const projectDir = join(repoRoot, 'packages', 'lib');
+    mkdirSync(projectDir, { recursive: true });
+    writeFile('pnpm-lock.yaml', repoRoot);
+    writeFile('pnpm-workspace.yaml', repoRoot);
+
+    expect(resolveInstallRoot(projectDir, repoRoot)).toBe(repoRoot);
+  });
+
+  it('does not walk above the cloned repo root', () => {
+    const parent = createTempDir('parent-lock');
+    const repoRoot = join(parent, 'repo');
+    const projectDir = join(repoRoot, 'packages', 'lib');
+    mkdirSync(projectDir, { recursive: true });
+    writeFile('yarn.lock', parent);
+
+    expect(resolveInstallRoot(projectDir, repoRoot)).toBe(projectDir);
+  });
+});
+
+function createTempDir(name: string) {
+  const dir = join(
+    tmpdir(),
+    `storybook-eval-${name}-${Date.now()}-${Math.random().toString(16).slice(2)}`
+  );
+  mkdirSync(dir, { recursive: true });
+  TEMP_DIRS.push(dir);
+  return dir;
+}
+
+function writeFile(relativePath: string, root: string) {
+  const fullPath = join(root, relativePath);
+  mkdirSync(dirname(fullPath), { recursive: true });
+  writeFileSync(fullPath, '');
+}
diff --git a/scripts/eval/lib/package-manager.ts b/scripts/eval/lib/package-manager.ts
new file mode 100644
index 000000000000..ea61a5444e4f
--- /dev/null
+++ b/scripts/eval/lib/package-manager.ts
@@ -0,0 +1,99 @@
+/**
+ * Shared package manager detection and dependency installation.
+ *
+ * Used by trial preparation and any other eval flows that need a
+ * package-manager-aware install step.
+ */
+import { existsSync } from 'node:fs';
+import { dirname, join, resolve } from 'node:path';
+import { x } from 'tinyexec';
+import type { Logger } from './utils.ts';
+
+const PACKAGE_MANAGER_MARKERS = {
+  pnpm: ['pnpm-lock.yaml', 'pnpm-workspace.yaml'],
+  yarn: ['yarn.lock'],
+  bun: ['bun.lockb', 'bun.lock'],
+  npm: ['package-lock.json', 'npm-shrinkwrap.json'],
+} as const;
+
+/** Detect the package manager from marker files in a directory; defaults to npm. */
+export function detectPackageManager(dir: string): string {
+  if (PACKAGE_MANAGER_MARKERS.pnpm.some((file) => existsSync(join(dir, file)))) return 'pnpm';
+  if (PACKAGE_MANAGER_MARKERS.yarn.some((file) => existsSync(join(dir, file)))) return 'yarn';
+  if (PACKAGE_MANAGER_MARKERS.bun.some((file) => existsSync(join(dir, file)))) return 'bun';
+  if (PACKAGE_MANAGER_MARKERS.npm.some((file) => existsSync(join(dir, file)))) return 'npm';
+  return 'npm';
+}
+
+/**
+ * Resolve the directory where dependency installation should run.
+ *
+ * For nested projects inside a workspace, the lockfile often lives above `dir`.
+ * We walk upward until we find the closest package-manager marker, stopping at
+ * the cloned repo root so we do not accidentally use markers from outside the trial.
+ */
+export function resolveInstallRoot(dir: string, stopAt?: string): string {
+  const start = resolve(dir);
+  const boundary = stopAt ? resolve(stopAt) : undefined;
+
+  let current = start;
+  while (true) {
+    if (hasAnyMarker(current)) {
+      return current;
+    }
+
+    if (boundary && current === boundary) {
+      return start;
+    }
+
+    const parent = dirname(current);
+    if (parent === current) {
+      return start;
+    }
+
+    current = parent;
+  }
+}
+
+/** Install dependencies using the detected package manager. */
+export async function installDeps(
+  dir: string,
+  logger: Logger,
+  env?: Record<string, string>,
+  options?: { stopAt?: string }
+): Promise<void> {
+  const installRoot = resolveInstallRoot(dir, options?.stopAt);
+  const pm = detectPackageManager(installRoot);
+  const [cmd, args] = getInstallArgs(pm, installRoot);
+  logger.logStep(
+    installRoot === resolve(dir)
+      ? `Installing with ${pm}...`
+      : `Installing with ${pm} from ${installRoot}...`
+  );
+  await x(cmd, args, {
+    timeout: 300_000,
+    nodeOptions: { cwd: installRoot, ...(env && { env: env as NodeJS.ProcessEnv }) },
+  });
+}
+
+function hasAnyMarker(dir: string): boolean {
+  return Object.values(PACKAGE_MANAGER_MARKERS).some((files) =>
+    files.some((file) => existsSync(join(dir, file)))
+  );
+}
+
+function getInstallArgs(pm: string, dir: string): [string, string[]] {
+  switch (pm) {
+    case 'pnpm':
+      return ['pnpm', ['install', '--no-frozen-lockfile']];
+    case 'yarn':
+      return [
+        'yarn',
+        existsSync(join(dir, '.yarnrc.yml')) ? ['install', '--no-immutable'] : ['install'],
+      ];
+    case 'bun':
+      return ['bun', ['install']];
+    default:
+      return ['npm', ['install', '--ignore-scripts']];
+  }
+}
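Reviewer note on `resolveInstallRoot`: the upward walk has three exits (marker found, boundary reached, filesystem root reached), and the last two both fall back to the starting directory. The sketch below restates that walk with the filesystem check injected as a predicate so it can be reasoned about without touching disk. It assumes POSIX-style absolute paths. `parentOf` and `sketchInstallRoot` are hypothetical helpers for this note, whereas the PR itself uses `node:path` and `existsSync`.

```typescript
// Hypothetical simplification: parent directory of a POSIX-style absolute path.
function parentOf(path: string): string {
  const idx = path.lastIndexOf('/');
  return idx <= 0 ? '/' : path.slice(0, idx);
}

function sketchInstallRoot(
  start: string,
  boundary: string | undefined,
  hasMarker: (dir: string) => boolean
): string {
  let current = start;
  while (true) {
    // Closest marker wins, including one in the starting directory itself.
    if (hasMarker(current)) return current;
    // The boundary is checked after the marker test, so a marker in the boundary dir still counts.
    if (boundary !== undefined && current === boundary) return start;
    const parent = parentOf(current);
    // Reached the filesystem root without finding anything.
    if (parent === current) return start;
    current = parent;
  }
}

// The three scenarios from package-manager.test.ts, expressed with in-memory marker sets.
const nested = sketchInstallRoot('/repo/frontend', '/repo', (d) => d === '/repo/frontend');
const workspace = sketchInstallRoot('/repo/packages/lib', '/repo', (d) => d === '/repo');
const outside = sketchInstallRoot(
  '/parent/repo/packages/lib',
  '/parent/repo',
  (d) => d === '/parent'
);
console.log(nested, workspace, outside);
```

Note the ordering inside the loop: because the marker check runs before the boundary check, a lockfile in the repo root itself is still honored, while anything above it is ignored.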
diff --git a/scripts/eval/lib/prepare-trial.test.ts b/scripts/eval/lib/prepare-trial.test.ts
new file mode 100644
index 000000000000..af45783998b5
--- /dev/null
+++ b/scripts/eval/lib/prepare-trial.test.ts
@@ -0,0 +1,48 @@
+import { describe, expect, it } from 'vitest';
+
+import { getCacheRefreshReason, type TrialCacheInfo } from './prepare-trial';
+import type { Project } from './projects';
+
+const project: Project = {
+  name: 'mealdrop',
+  repo: 'https://github.com/example/mealdrop',
+  branch: 'eval-baseline',
+};
+
+const cacheInfo: TrialCacheInfo = {
+  repo: project.repo,
+  branch: project.branch,
+  baselineCommit: '0123456789abcdef',
+};
+
+describe('getCacheRefreshReason', () => {
+  it('keeps the cache when repo, branch, and baseline still match', () => {
+    expect(getCacheRefreshReason(project, cacheInfo, cacheInfo.baselineCommit)).toBeUndefined();
+  });
+
+  it('refreshes when the repo URL changes', () => {
+    expect(
+      getCacheRefreshReason(
+        { ...project, repo: 'https://github.com/example/mealdrop-fork' },
+        cacheInfo,
+        cacheInfo.baselineCommit
+      )
+    ).toContain('repo changed');
+  });
+
+  it('refreshes when the tracked branch changes', () => {
+    expect(
+      getCacheRefreshReason({ ...project, branch: 'next' }, cacheInfo, cacheInfo.baselineCommit)
+    ).toContain('branch changed');
+  });
+
+  it('refreshes when the remote branch head advances', () => {
+    expect(getCacheRefreshReason(project, cacheInfo, 'fedcba9876543210')).toContain(
+      'baseline branch advanced'
+    );
+  });
+
+  it('keeps the cache if the remote branch cannot be verified', () => {
+    expect(getCacheRefreshReason(project, cacheInfo)).toBeUndefined();
+  });
+});
diff --git a/scripts/eval/lib/prepare-trial.ts b/scripts/eval/lib/prepare-trial.ts
new file mode 100644
index 000000000000..a39eedd40f64
--- /dev/null
+++ b/scripts/eval/lib/prepare-trial.ts
@@ -0,0 +1,166 @@
+import { existsSync } from 'node:fs';
+import { cp, mkdir, readFile, rm, writeFile } from 'node:fs/promises';
+import { join } from 'node:path';
+import type { Logger } from './utils.ts';
+import type { Project } from './projects.ts';
+import { x } from 'tinyexec';
+import { installDeps } from './package-manager.ts';
+import { CACHE_DIR, TRIALS_DIR } from './utils.ts';
+
+const CACHE_INFO_SUFFIX = '.json';
+
+export interface TrialWorkspace {
+  trialDir: string;
+  repoRoot: string;
+  projectPath: string;
+  resultsDir: string;
+  baselineCommit: string;
+}
+
+export interface TrialCacheInfo {
+  repo: string;
+  branch: string;
+  baselineCommit: string;
+}
+
+/**
+ * First run: clone the project branch -> install deps -> cache it.
+ * Subsequent runs: copy from cache unless it is stale, so the agent starts immediately.
+ */
+export async function prepareTrial(
+  project: Project,
+  trialId: string,
+  logger: Logger
+): Promise<TrialWorkspace> {
+  const cacheDir = join(CACHE_DIR, project.name);
+  const cacheInfoPath = join(CACHE_DIR, `${project.name}${CACHE_INFO_SUFFIX}`);
+  const trialDir = join(TRIALS_DIR, trialId);
+  const repoRoot = join(trialDir, 'project');
+  await mkdir(trialDir, { recursive: true });
+
+  if (await canReuseCache(project, cacheDir, cacheInfoPath, logger)) {
+    logger.logStep('Copying from cache...');
+    await cp(cacheDir, repoRoot, { recursive: true });
+  } else {
+    logger.logStep(`Cloning ${project.repo}#${project.branch}...`);
+    await mkdir(CACHE_DIR, { recursive: true });
+    await x('git', ['clone', '--depth', '1', '--branch', project.branch, project.repo, repoRoot], {
+      timeout: 120_000,
+    });
+    const projectPath = project.projectDir ? join(repoRoot, project.projectDir) : repoRoot;
+    await installDeps(projectPath, logger, undefined, { stopAt: repoRoot });
+    logger.logSuccess('Dependencies installed');
+    logger.logStep('Caching for future runs...');
+    const baselineCommit = await getGitHead(repoRoot);
+    await persistCache(cacheDir, cacheInfoPath, repoRoot, {
+      repo: project.repo,
+      branch: project.branch,
+      baselineCommit,
+    });
+  }
+
+  const baselineCommit = await getGitHead(repoRoot);
+  const projectPath = project.projectDir ? join(repoRoot, project.projectDir) : repoRoot;
+  const resultsDir = join(trialDir, 'results');
+  await mkdir(resultsDir, { recursive: true });
+
+  logger.logSuccess('Trial ready');
+  return { trialDir, repoRoot, projectPath, resultsDir, baselineCommit };
+}
+
+export function getCacheRefreshReason(
+  project: Project,
+  cacheInfo: TrialCacheInfo,
+  remoteHead?: string
+): string | undefined {
+  if (cacheInfo.repo !== project.repo) {
+    return `repo changed (${cacheInfo.repo} → ${project.repo})`;
+  }
+  if (cacheInfo.branch !== project.branch) {
+    return `branch changed (${cacheInfo.branch} → ${project.branch})`;
+  }
+  if (remoteHead && cacheInfo.baselineCommit !== remoteHead) {
+    return `baseline branch advanced (${cacheInfo.baselineCommit.slice(0, 7)} → ${remoteHead.slice(0, 7)})`;
+  }
+  return undefined;
+}
+
+async function canReuseCache(
+  project: Project,
+  cacheDir: string,
+  cacheInfoPath: string,
+  logger: Logger
+): Promise<boolean> {
+  if (!existsSync(join(cacheDir, '.git'))) {
+    return false;
+  }
+
+  const cacheInfo = await readCacheInfo(cacheInfoPath);
+  if (!cacheInfo) {
+    logger.logStep('Refreshing cache (missing or invalid cache metadata)...');
+    await clearCache(cacheDir, cacheInfoPath);
+    return false;
+  }
+
+  const remoteHead = await getRemoteBranchHead(project.repo, project.branch, logger);
+  const refreshReason = getCacheRefreshReason(project, cacheInfo, remoteHead);
+  if (!refreshReason) {
+    return true;
+  }
+
+  logger.logStep(`Refreshing cache (${refreshReason})...`);
+  await clearCache(cacheDir, cacheInfoPath);
+  return false;
+}
+
+async function persistCache(
+  cacheDir: string,
+  cacheInfoPath: string,
+  repoRoot: string,
+  cacheInfo: TrialCacheInfo
+) {
+  await clearCache(cacheDir, cacheInfoPath);
+  await cp(repoRoot, cacheDir, { recursive: true });
+  await writeFile(cacheInfoPath, JSON.stringify(cacheInfo, null, 2));
+}
+
+async function readCacheInfo(cacheInfoPath: string): Promise<TrialCacheInfo | undefined> {
+  if (!existsSync(cacheInfoPath)) {
+    return undefined;
+  }
+
+  try {
+    return JSON.parse(await readFile(cacheInfoPath, 'utf-8')) as TrialCacheInfo;
+  } catch {
+    return undefined;
+  }
+}
+
+async function getGitHead(cwd: string): Promise<string> {
+  return (await x('git', ['rev-parse', 'HEAD'], { nodeOptions: { cwd } })).stdout.trim();
+}
+
+async function getRemoteBranchHead(
+  repo: string,
+  branch: string,
+  logger: Logger
+): Promise<string | undefined> {
+  const result = await x('git', ['ls-remote', repo, `refs/heads/${branch}`], {
+    throwOnError: false,
+    timeout: 120_000,
+  });
+  if (result.exitCode !== 0) {
+    logger.logStep(`Could not verify remote HEAD for ${repo}#${branch}; reusing cache as-is.`);
+    return undefined;
+  }
+
+  const line = result.stdout.trim().split('\n').find(Boolean);
+  return line?.split('\t')[0]?.trim() || undefined;
+}
+
+async function clearCache(cacheDir: string, cacheInfoPath: string) {
+  await Promise.all([
+    rm(cacheDir, { recursive: true, force: true }),
+    rm(cacheInfoPath, { force: true }),
+  ]);
+}
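Reviewer note on `getRemoteBranchHead`: `git ls-remote <repo> refs/heads/<branch>` prints one line per matching ref, with the commit SHA and the ref name separated by a tab. The sketch below isolates the parsing step so the empty-output case is easy to see. `sketchParseLsRemote` is a hypothetical name for this note and the SHA is made up.

```typescript
// Hypothetical extraction of the stdout-parsing step, for illustration only.
function sketchParseLsRemote(stdout: string): string | undefined {
  // Take the first non-empty line; a branch that does not exist yields empty output.
  const line = stdout.trim().split('\n').find(Boolean);
  // The SHA is the first tab-separated field; an empty string is normalized to undefined.
  return line?.split('\t')[0]?.trim() || undefined;
}

const head = sketchParseLsRemote('0123456789abcdef\trefs/heads/eval-baseline\n');
const missing = sketchParseLsRemote('');
console.log(head, missing);
```

Since `getCacheRefreshReason` treats an undefined remote head as "cannot verify", a branch that was deleted upstream (exit code 0, empty output) ends up reusing the cache rather than refreshing it. That is the same outcome as the network-failure path, though only the failure path logs a message.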
diff --git a/scripts/eval/lib/projects.test.ts b/scripts/eval/lib/projects.test.ts
new file mode 100644
index 000000000000..b80238500f8e
--- /dev/null
+++ b/scripts/eval/lib/projects.test.ts
@@ -0,0 +1,32 @@
+import { describe, expect, it } from 'vitest';
+
+import { PROJECTS } from './projects';
+
+const githubRepoUrl = /^https:\/\/github\.com\/[^/]+\/[^/]+$/;
+
+describe('PROJECTS', () => {
+  it('pins every benchmark project to a pre-initialized eval-baseline repo', () => {
+    expect(PROJECTS.length).toBeGreaterThan(0);
+
+    for (const project of PROJECTS) {
+      expect(project).toMatchObject({
+        branch: 'eval-baseline',
+        repo: expect.stringMatching(githubRepoUrl),
+        description: expect.any(String),
+      });
+    }
+  });
+
+  it('keeps benchmark project metadata unambiguous', () => {
+    const names = PROJECTS.map((p) => p.name);
+    const repos = PROJECTS.map((p) => p.repo);
+
+    expect(new Set(names).size).toBe(names.length);
+    expect(new Set(repos).size).toBe(repos.length);
+
+    for (const project of PROJECTS) {
+      if (!project.projectDir) continue;
+      expect(project.projectDir).toMatch(/^(?!\/)(?!\.\.?(?:\/|$)).+/);
+    }
+  });
+});
diff --git a/scripts/eval/lib/projects.ts b/scripts/eval/lib/projects.ts
new file mode 100644
index 000000000000..0046ed30bac4
--- /dev/null
+++ b/scripts/eval/lib/projects.ts
@@ -0,0 +1,48 @@
+export interface Project {
+  name: string;
+  repo: string;
+  branch: string;
+  projectDir?: string;
+  description?: string;
+}
+
+export const PROJECTS: Project[] = [
+  {
+    name: 'mealdrop',
+    repo: 'https://github.com/kasperpeulen/mealdrop',
+    branch: 'eval-baseline',
+    description: 'Styled components, Redux, React Router',
+  },
+  {
+    name: 'edgy',
+    repo: 'https://github.com/kasperpeulen/edgy',
+    branch: 'eval-baseline',
+    description: 'Tailwind, HeadlessUI, React Router',
+  },
+  {
+    name: 'wikitok',
+    repo: 'https://github.com/kasperpeulen/wikitok',
+    branch: 'eval-baseline',
+    projectDir: 'frontend',
+    description: 'Simple project with Tailwind',
+  },
+  {
+    name: 'baklava',
+    repo: 'https://github.com/kasperpeulen/baklava',
+    branch: 'eval-baseline',
+    description: 'Component library with Zustand',
+  },
+  {
+    name: 'echarts',
+    repo: 'https://github.com/kasperpeulen/echarts-react',
+    branch: 'eval-baseline',
+    description: 'ECharts React wrapper',
+  },
+  {
+    name: 'evergreen-ci',
+    repo: 'https://github.com/kasperpeulen/ui',
+    branch: 'eval-baseline',
+    projectDir: 'packages/lib',
+    description: 'GraphQL',
+  },
+];
diff --git a/scripts/eval/lib/run-trial.test.ts b/scripts/eval/lib/run-trial.test.ts
new file mode 100644
index 000000000000..8b7f79bd07c0
--- /dev/null
+++ b/scripts/eval/lib/run-trial.test.ts
@@ -0,0 +1,233 @@
+import { mkdirSync, readFileSync, rmSync } from 'node:fs';
+import { join } from 'node:path';
+import { tmpdir } from 'node:os';
+
+import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest';
+
+import type { TrialConfig, TrialReport } from './run-trial';
+
+// Mock external dependencies to avoid real git/storybook/vitest calls
+vi.mock('./prepare-trial', () => ({
+  prepareTrial: vi.fn(),
+}));
+vi.mock('./grade', () => ({
+  grade: vi.fn(),
+}));
+vi.mock('./utils', async (importOriginal) => {
+  const actual = await importOriginal<typeof import('./utils')>();
+  return {
+    ...actual,
+    captureEnvironment: vi.fn().mockResolvedValue({
+      nodeVersion: 'v22.21.1',
+      evalBranch: 'test-branch',
+      evalCommit: 'abc123',
+    }),
+  };
+});
+vi.mock('./agents/claude-code', () => ({
+  claudeAgent: { name: 'claude', execute: vi.fn() },
+}));
+vi.mock('./agents/codex', () => ({
+  codexAgent: { name: 'codex', execute: vi.fn() },
+}));
+
+import { claudeAgent } from './agents/claude-code';
+import { grade } from './grade';
+import { prepareTrial } from './prepare-trial';
+import { runTrial } from './run-trial';
+import { captureEnvironment } from './utils';
+
+let TMP: string;
+
+beforeEach(() => {
+  vi.clearAllMocks();
+  TMP = join(tmpdir(), `eval-run-trial-${Date.now()}`);
+  mkdirSync(join(TMP, 'results'), { recursive: true });
+});
+
+afterEach(() => {
+  rmSync(TMP, { recursive: true, force: true });
+});
+
+const baseConfig: TrialConfig = {
+  project: { name: 'test-project', repo: 'https://github.com/test/repo', branch: 'main' },
+  variant: { agent: 'claude', model: 'sonnet-4.6', effort: 'high' },
+  prompt: 'setup',
+};
+
+describe('runTrial pipeline', () => {
+  it('assembles a complete TrialReport from pipeline steps', async () => {
+    setupMocks();
+
+    const result = await runTrial(baseConfig);
+
+    expect(result).toMatchObject({
+      schemaVersion: 1,
+      project: { name: 'test-project', repo: 'https://github.com/test/repo', branch: 'main' },
+      variant: { agent: 'claude', model: 'sonnet-4.6', effort: 'high' },
+      prompt: 'setup',
+      baselineCommit: 'deadbeef',
+      execution: {
+        cost: 0.42,
+        duration: 45.2,
+        turns: 12,
+      },
+      grade: {
+        buildSuccess: true,
+      },
+      score: {
+        score: 1,
+      },
+    });
+    expect(result.timestamp).toMatch(/^\d{4}-\d{2}-\d{2}T/);
+  });
+
+  it('calls pipeline steps with correct arguments', async () => {
+    setupMocks();
+
+    const config: TrialConfig = {
+      ...baseConfig,
+      project: {
+        name: 'mealdrop',
+        repo: 'https://github.com/test/mealdrop',
+        branch: 'eval-baseline',
+      },
+    };
+
+    await runTrial(config);
+
+    expect(vi.mocked(prepareTrial).mock.calls[0][0]).toMatchObject({
+      name: 'mealdrop',
+      repo: 'https://github.com/test/mealdrop',
+      branch: 'eval-baseline',
+    });
+    expect(vi.mocked(prepareTrial).mock.calls[0][2]).toBeDefined();
+
+    expect(vi.mocked(captureEnvironment).mock.calls[0][0]).toBe(join(TMP, 'results'));
+
+    const params = vi.mocked(claudeAgent.execute).mock.calls[0][0];
+    expect(params).toMatchObject({
+      prompt: expect.stringContaining('set up Storybook'),
+      projectPath: TMP,
+      variant: { agent: 'claude', model: 'sonnet-4.6', effort: 'high' },
+      resultsDir: join(TMP, 'results'),
+    });
+    expect(params.logger).toBeDefined();
+
+    const gradeWorkspace = vi.mocked(grade).mock.calls[0][0];
+    expect(gradeWorkspace).toMatchObject({
+      baselineCommit: 'deadbeef',
+      projectPath: TMP,
+      resultsDir: join(TMP, 'results'),
+    });
+    expect(vi.mocked(grade).mock.calls[0][1]).toBeDefined();
+  });
+
+  it('writes summary.json and prompt.md to results dir', async () => {
+    setupMocks();
+
+    await runTrial(baseConfig);
+
+    const resultsDir = join(TMP, 'results');
+
+    const summary: TrialReport = JSON.parse(
+      readFileSync(join(resultsDir, 'summary.json'), 'utf-8')
+    );
+    expect(summary).toMatchObject({
+      schemaVersion: 1,
+      execution: { cost: 0.42 },
+      grade: { buildSuccess: true },
+    });
+
+    const promptContent = readFileSync(join(resultsDir, 'prompt.md'), 'utf-8');
+    expect(promptContent).toContain('set up Storybook');
+  });
+
+  it('propagates failed build into result', async () => {
+    setupMocks({ buildSuccess: false, typeCheckErrors: 5 });
+
+    await expect(runTrial(baseConfig)).resolves.toMatchObject({
+      grade: { buildSuccess: false, typeCheckErrors: 5 },
+      score: { score: 0.3 },
+    });
+  });
+
+  it('does not call grade before agent finishes', async () => {
+    // Use execution order tracking to verify sequencing
+    const callOrder: string[] = [];
+
+    vi.mocked(prepareTrial).mockImplementation(async () => {
+      callOrder.push('prepare');
+      return {
+        trialDir: TMP,
+        repoRoot: TMP,
+        projectPath: TMP,
+        resultsDir: join(TMP, 'results'),
+        baselineCommit: 'deadbeef',
+      };
+    });
+
+    vi.mocked(claudeAgent.execute).mockImplementation(async () => {
+      callOrder.push('agent');
+      return { cost: 0.1, duration: 10, turns: 3 };
+    });
+
+    vi.mocked(grade).mockImplementation(async () => {
+      callOrder.push('grade');
+      return {
+        grade: {
+          buildSuccess: true,
+          typeCheckErrors: 0,
+          fileChanges: [],
+          storybookChanges: [],
+        },
+        score: { score: 1, breakdown: { build: 1, typecheck: 1, ghostStories: 0, performance: 0 } },
+      };
+    });
+
+    await runTrial(baseConfig);
+
+    expect(callOrder).toEqual(['prepare', 'agent', 'grade']);
+  });
+});
+
+function setupMocks(overrides?: {
+  buildSuccess?: boolean;
+  typeCheckErrors?: number;
+  cost?: number;
+}) {
+  const { buildSuccess = true, typeCheckErrors = 0, cost = 0.42 } = overrides ?? {};
+
+  vi.mocked(prepareTrial).mockResolvedValue({
+    trialDir: TMP,
+    repoRoot: TMP,
+    projectPath: TMP,
+    resultsDir: join(TMP, 'results'),
+    baselineCommit: 'deadbeef',
+  });
+
+  vi.mocked(claudeAgent.execute).mockResolvedValue({
+    cost,
+    duration: 45.2,
+    turns: 12,
+  });
+
+  vi.mocked(grade).mockResolvedValue({
+    grade: {
+      buildSuccess,
+      typeCheckErrors,
+      fileChanges: [
+        { path: '.storybook/preview.tsx', gitStatus: 'A' },
+        { path: 'src/Button.stories.tsx', gitStatus: 'A' },
+      ],
+      storybookChanges: [
+        { path: '.storybook/preview.tsx', gitStatus: 'A' },
+        { path: 'src/Button.stories.tsx', gitStatus: 'A' },
+      ],
+    },
+    score: {
+      score: buildSuccess ? 1 : 0.3,
+      breakdown: { build: buildSuccess ? 1 : 0, typecheck: 1, ghostStories: 0, performance: 0 },
+    },
+  });
+}
diff --git a/scripts/eval/lib/run-trial.ts b/scripts/eval/lib/run-trial.ts
new file mode 100644
index 000000000000..fc8dde20fff8
--- /dev/null
+++ b/scripts/eval/lib/run-trial.ts
@@ -0,0 +1,96 @@
+import { writeFile } from 'node:fs/promises';
+import { join } from 'node:path';
+import type { Logger } from './utils.ts';
+import type { AgentId, AgentDriver, AgentVariant, Execution } from './agents/config.ts';
+import type { Project } from './projects.ts';
+import { grade, type Grade, type QualityScore } from './grade.ts';
+import { claudeAgent } from './agents/claude-code.ts';
+import { codexAgent } from './agents/codex.ts';
+import { prepareTrial } from './prepare-trial.ts';
+import { generateTrialId, loadPrompt, captureEnvironment, createLogger } from './utils.ts';
+
+export interface TrialConfig {
+  /** Which project to evaluate (cloned from its eval-baseline branch). */
+  project: Project;
+  /** Agent, model, and effort level. */
+  variant: AgentVariant;
+  /** Prompt name — maps to `prompts/{name}.md` (e.g. "setup"). */
+  prompt: string;
+  /** Log agent messages to stdout. */
+  verbose?: boolean;
+}
+
+export interface TrialReport {
+  schemaVersion: 1;
+  project: Project;
+  variant: AgentVariant;
+  prompt: string;
+  timestamp: string;
+  baselineCommit: string;
+  execution: Execution;
+  grade: Grade;
+  score: QualityScore;
+}
+
+const drivers: Record<AgentId, AgentDriver> = {
+  claude: claudeAgent,
+  codex: codexAgent,
+};
+
+/**
+ * Run a full eval trial: prepare -> execute agent -> grade -> save.
+ */
+export async function runTrial(config: TrialConfig, logger?: Logger): Promise<TrialReport> {
+  const { project, variant, prompt: promptName } = config;
+  const { agent: agentName, model } = variant;
+  const log = logger ?? createLogger();
+  const trialId = generateTrialId(project.name, agentName, model, promptName || 'setup');
+  const timestamp = new Date().toISOString();
+
+  log.log(`Preparing ${project.name}...`);
+
+  // 1. Prepare the trial
+  const workspace = await prepareTrial(project, trialId, log);
+
+  // 2. Capture environment
+  await captureEnvironment(workspace.resultsDir);
+
+  // 3. Load the prompt
+  const prompt = loadPrompt(promptName || 'setup');
+  await writeFile(join(workspace.resultsDir, 'prompt.md'), prompt);
+
+  // 4. Execute the agent
+  log.log(`  Running ${agentName} (${model}, effort=${variant.effort})...`);
+  const driver = drivers[agentName];
+  const execution = await driver.execute({
+    prompt,
+    projectPath: workspace.projectPath,
+    variant,
+    resultsDir: workspace.resultsDir,
+    logger: log,
+  });
+  log.logSuccess(
+    `Agent completed (${Math.round(execution.duration)}s, ${execution.cost != null ? `$${execution.cost.toFixed(2)}` : 'cost N/A'}, ${execution.turns} turns)`
+  );
+
+  // 5. Grade the results (pass agent duration for performance scoring)
+  const { grade: trialGrade, score } = await grade(workspace, log, execution.duration);
+
+  // 6. Assemble final report
+  const report: TrialReport = {
+    schemaVersion: 1,
+    project,
+    variant,
+    timestamp,
+    prompt: promptName || 'setup',
+    baselineCommit: workspace.baselineCommit,
+    execution,
+    grade: trialGrade,
+    score,
+  };
+
+  await writeFile(join(workspace.resultsDir, 'summary.json'), JSON.stringify(report, null, 2));
+  log.logSuccess(`Results saved to ${workspace.resultsDir}`);
+
+  return report;
+}
diff --git a/scripts/eval/lib/utils.test.ts b/scripts/eval/lib/utils.test.ts
new file mode 100644
index 000000000000..7b4ebe4e5024
--- /dev/null
+++ b/scripts/eval/lib/utils.test.ts
@@ -0,0 +1,144 @@
+import { describe, expect, it } from 'vitest';
+
+import {
+  formatDuration,
+  formatCost,
+  generateTrialId,
+  loadPrompt,
+  listPrompts,
+  formatTable,
+} from './utils';
+
+describe('formatDuration', () => {
+  it('formats seconds under a minute', () => {
+    expect(formatDuration(0)).toBe('0s');
+    expect(formatDuration(1)).toBe('1s');
+    expect(formatDuration(45)).toBe('45s');
+  });
+
+  it('rounds fractional seconds', () => {
+    expect(formatDuration(2.7)).toBe('3s');
+    expect(formatDuration(59.4)).toBe('59s');
+  });
+
+  it('formats minutes and seconds', () => {
+    expect(formatDuration(60)).toBe('1m0s');
+    expect(formatDuration(61)).toBe('1m1s');
+    expect(formatDuration(90)).toBe('1m30s');
+    expect(formatDuration(125)).toBe('2m5s');
+    expect(formatDuration(3661)).toBe('61m1s');
+  });
+});
+
+describe('formatCost', () => {
+  it('returns dash for undefined', () => {
+    expect(formatCost(undefined)).toBe('-');
+    expect(formatCost()).toBe('-');
+  });
+
+  it('formats dollar amounts', () => {
+    expect(formatCost(0)).toBe('$0.00');
+    expect(formatCost(1.5)).toBe('$1.50');
+  });
+});
+
+describe('generateTrialId', () => {
+  it('contains project, agent, model, and prompt', () => {
+    const id = generateTrialId('mealdrop', 'claude', 'sonnet-4.6', 'setup');
+    expect(id).toContain('mealdrop');
+    expect(id).toContain('claude');
+    expect(id).toContain('sonnet-4.6');
+    expect(id).toContain('setup');
+  });
+
+  it('starts with an ISO-like timestamp', () => {
+    const id = generateTrialId('proj', 'agent', 'model', 'prompt');
+    expect(id).toMatch(/^\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}/);
+  });
+
+  it('generates unique IDs', () => {
+    const a = generateTrialId('p', 'a', 'm', 'pr');
+    const b = generateTrialId('p', 'a', 'm', 'pr');
+    expect(a).not.toBe(b);
+  });
+});
+
+describe('listPrompts', () => {
+  it('lists available prompt names', () => {
+    const prompts = listPrompts();
+    expect(prompts).toContain('setup');
+  });
+
+  it('returns only names without .md extension', () => {
+    for (const name of listPrompts()) {
+      expect(name).not.toContain('.md');
+    }
+  });
+});
+
+describe('loadPrompt', () => {
+  it('loads setup prompt by default', () => {
+    const prompt = loadPrompt();
+    expect(prompt).toContain('Storybook');
+    expect(prompt.length).toBeGreaterThan(0);
+  });
+
+  it('loads setup prompt by name', () => {
+    const prompt = loadPrompt('setup');
+    expect(prompt).toContain('Storybook');
+    expect(prompt).toContain('### Step 1');
+  });
+
+  it('throws for unknown prompt', () => {
+    expect(() => loadPrompt('nonexistent-prompt-xyz')).toThrow('Prompt not found');
+  });
+
+  it('returns trimmed content', () => {
+    const prompt = loadPrompt('setup');
+    expect(prompt).toBe(prompt.trim());
+  });
+});
+
+describe('formatTable', () => {
+  it('formats a simple table with aligned columns', () => {
+    const result = formatTable(
+      ['Name', 'Score'],
+      [
+        ['Alice', '100'],
+        ['Bob', '95'],
+      ]
+    );
+    const lines = result.split('\n');
+    expect(lines).toHaveLength(4); // header + divider + 2 rows
+    expect(lines[0]).toContain('Name');
+    expect(lines[0]).toContain('Score');
+    expect(lines[1]).toMatch(/^-+\+-+$/);
+    expect(lines[2]).toContain('Alice');
+    expect(lines[3]).toContain('Bob');
+  });
+
+  it('auto-sizes columns to fit content', () => {
+    const result = formatTable(['X', 'Y'], [['short', 'a-much-longer-value']]);
+    const lines = result.split('\n');
+    // Header column for Y should be padded to match the data width
+    const headerCols = lines[0].split(' | ');
+    const dataCols = lines[2].split(' | ');
+    expect(headerCols[1].length).toBe(dataCols[1].length);
+  });
+
+  it('handles ANSI escape codes in cells', () => {
+    const green = '\x1b[32mPASS\x1b[39m';
+    const result = formatTable(['Status'], [[green], ['FAIL']]);
+    const lines = result.split('\n');
+    // ANSI codes are invisible, so both rows should have the same visible width
+    expect(lines[2].replace(/\x1b\[[0-9;]*m/g, '')).toHaveLength(lines[3].length);
+    expect(lines[2]).toContain('PASS');
+    expect(lines[3]).toContain('FAIL');
+  });
+
+  it('handles empty rows', () => {
+    const result = formatTable(['A', 'B'], []);
+    const lines = result.split('\n');
+    expect(lines).toHaveLength(2); // header + divider only
+  });
+});
diff --git a/scripts/eval/lib/utils.ts b/scripts/eval/lib/utils.ts
new file mode 100644
index 000000000000..79d24891f227
--- /dev/null
+++ b/scripts/eval/lib/utils.ts
@@ -0,0 +1,101 @@
+import { readFileSync, existsSync, readdirSync } from 'node:fs';
+import { writeFile } from 'node:fs/promises';
+import { resolve, basename, join } from 'node:path';
+import pc from 'picocolors';
+import { x } from 'tinyexec';
+
+export interface Logger {
+  log: (msg: string) => void;
+  logStep: (msg: string) => void;
+  logSuccess: (msg: string) => void;
+  logError: (msg: string) => void;
+}
+
+export const REPO_ROOT = resolve(import.meta.dirname, '..', '..', '..');
+export const EVAL_ROOT = resolve(REPO_ROOT, '..', 'storybook-eval');
+export const CACHE_DIR = resolve(EVAL_ROOT, '.cache', 'repos');
+export const TRIALS_DIR = resolve(EVAL_ROOT, 'trials');
+export const PROMPTS_DIR = resolve(import.meta.dirname, '..', 'prompts');
+
+export function createLogger(prefix?: string): Logger {
+  const p = prefix ? pc.dim(`[${prefix}]`) + ' ' : '';
+  return {
+    log: (msg: string) => console.log(`${p}${msg}`),
+    logStep: (msg: string) => console.log(`${p}  ${pc.cyan('>')} ${msg}`),
+    logSuccess: (msg: string) => console.log(`${p}  ${pc.green('✓')} ${msg}`),
+    logError: (msg: string) => console.log(`${p}  ${pc.red('✗')} ${msg}`),
+  };
+}
+
+export const formatDuration = (s: number) =>
+  ((t: number) => (t < 60 ? `${t}s` : `${Math.floor(t / 60)}m${t % 60}s`))(Math.round(s));
+
+export const formatCost = (cost?: number) => (cost == null ? '-' : `$${cost.toFixed(2)}`);
+
+export function generateTrialId(project: string, agent: string, model: string, prompt: string) {
+  const ts = new Date().toISOString().replace(/[:.]/g, '-').slice(0, 19);
+  return `${ts}-${project}-${agent}-${model}-${prompt}-${crypto.randomUUID().slice(0, 8)}`;
+}
+
+/** Format data as an aligned table with automatic column widths. */
+export function formatTable(headers: string[], rows: string[][]): string {
+  const widths = headers.map((h, i) =>
+    Math.max(h.length, ...rows.map((r) => stripAnsi(r[i] ?? '').length))
+  );
+
+  const pad = (str: string, width: number) => {
+    const visible = stripAnsi(str).length;
+    return str + ' '.repeat(Math.max(0, width - visible));
+  };
+
+  const sep = ' | ';
+  return [
+    headers.map((h, i) => pad(h, widths[i])).join(sep),
+    widths.map((w) => '-'.repeat(w)).join('-+-'),
+    ...rows.map((row) => row.map((cell, i) => pad(cell, widths[i])).join(sep)),
+  ].join('\n');
+}
+
+/** Load a prompt by name from prompts/{name}.md. */
+export function loadPrompt(name = 'setup'): string {
+  const file = resolve(PROMPTS_DIR, `${name}.md`);
+  if (!existsSync(file)) {
+    throw new Error(`Prompt not found: ${file}\nAvailable: ${listPrompts().join(', ')}`);
+  }
+  return readFileSync(file, 'utf-8').trim();
+}
+
+/** List available prompt names. */
+export function listPrompts(): string[] {
+  if (!existsSync(PROMPTS_DIR)) return [];
+  return readdirSync(PROMPTS_DIR)
+    .filter((f) => f.endsWith('.md'))
+    .map((f) => basename(f, '.md'));
+}
+
+export interface EvalEnvironment {
+  nodeVersion: string;
+  /** Git branch of the eval harness (storybook monorepo), not the evaluated project. */
+  evalBranch: string;
+  /** Git commit of the eval harness (storybook monorepo), not the evaluated project. */
+  evalCommit: string;
+}
+
+export async function captureEnvironment(resultsDir: string): Promise<EvalEnvironment> {
+  let evalBranch = 'unknown';
+  let evalCommit = 'unknown';
+  try {
+    evalBranch = (await x('git', ['rev-parse', '--abbrev-ref', 'HEAD'], { throwOnError: true })).stdout.trim();
+    evalCommit = (await x('git', ['rev-parse', 'HEAD'], { throwOnError: true })).stdout.trim();
+  } catch {
+    /* not in a git repo */
+  }
+  const env: EvalEnvironment = { nodeVersion: process.version, evalBranch, evalCommit };
+  await writeFile(join(resultsDir, 'environment.json'), JSON.stringify(env, null, 2));
+  return env;
+}
+
+/** Strip ANSI escape codes for accurate width calculation. */
+function stripAnsi(str: string) {
+  return str.replace(/\x1b\[[0-9;]*m/g, '');
+}
diff --git a/scripts/eval/prompts/setup.md b/scripts/eval/prompts/setup.md
new file mode 100644
index 000000000000..4021e67becde
--- /dev/null
+++ b/scripts/eval/prompts/setup.md
@@ -0,0 +1,196 @@
+Follow these instructions in sequence to set up Storybook in this project. Do not skip or reorder steps.
+
+Your goal is to make Storybook fully functional in this project by analyzing the codebase,
+configuring the preview with the right decorators, and writing example stories for 9 components.
+
+Work through these steps in order. After each story file, run Vitest to verify it renders.
+If the test fails, read the error, fix the issue, and re-run until it passes before moving on.
+
+### Step 1: Analyze the codebase
+
+Before writing any stories, understand what the components need to render:
+
+- Scan the project for context providers, theme systems, routers, stores, and i18n setups.
+  Look at the app's entry point (e.g. `App.tsx`, `main.tsx`, `layout.tsx`) to see what
+  providers wrap the component tree.
+- Identify global CSS or style imports required for components to look correct.
+- Note any path aliases configured in tsconfig or bundler config.
+- Read `.storybook/main.ts` (or `main.js`) to find the `stories` glob patterns.
+  Your story files must match those patterns to be picked up by Storybook.
+
+### Step 2: Configure `.storybook/preview.ts` with decorators
+
+Add decorators that wrap every story with the providers your components need.
+Without them, components that depend on context will fail to render. If a decorator uses JSX, name the file `preview.tsx`.
+
+If the project uses CSF Factory (look for `definePreview` in `.storybook/preview.ts`):
+```tsx
+// .storybook/preview.tsx (JSX requires .tsx); also import the providers used below
+import '../src/index.css'; // import global styles
+
+import { definePreview } from 'storybook/preview';
+
+export const config = definePreview({
+  decorators: [
+    (Story) => (
+      <ThemeProvider theme={theme}>
+        <MemoryRouter>
+          <Story />
+        </MemoryRouter>
+      </ThemeProvider>
+    ),
+  ],
+});
+```
+
+Otherwise:
+```tsx
+// .storybook/preview.tsx (JSX requires .tsx); also import the providers used below
+import '../src/index.css'; // import global styles
+
+const preview = {
+  decorators: [
+    (Story) => (
+      <ThemeProvider theme={theme}>
+        <MemoryRouter>
+          <Story />
+        </MemoryRouter>
+      </ThemeProvider>
+    ),
+  ],
+};
+export default preview;
+```
+
+Common decorators to add:
+- **Theme providers** (e.g. ThemeProvider, MUI ThemeProvider, styled-components, Tailwind)
+- **Router** (e.g. MemoryRouter, BrowserRouter mock)
+- **State stores** (e.g. Redux Provider, Zustand, Jotai)
+- **i18n** (e.g. IntlProvider, I18nextProvider)
+- **Global CSS** — import global stylesheets at the top of preview.ts
+
+### Step 3: Write stories for 9 components
+
+Pick 9 real components from the codebase, 3 of each complexity level.
+Use the title prefix `AI Generated/<Complexity>/<ComponentName>` so they are grouped
+together in the Storybook sidebar.
+
+**Simple (3 components)** — Presentational with few props, no internal state.
+Examples: Button, Badge, Avatar, Icon, Label, Chip.
+Title format: `AI Generated/Simple/<ComponentName>`
+
+**Medium (3 components)** — Multiple visual variants or composed from simpler components.
+Examples: Card, Alert, Input, Select, Tooltip, Tabs.
+Title format: `AI Generated/Medium/<ComponentName>`
+
+**Complex (3 components)** — Internal state, side effects, or deep composition.
+Examples: Modal, DataTable, Form, Dropdown, Accordion, Sidebar.
+Title format: `AI Generated/Complex/<ComponentName>`
+
+For each component, create a `<ComponentName>.stories.ts` file next to the component.
+Each file must have at least 2 story exports covering the component's main states.
+Make sure the file location and naming match the `stories` patterns in `.storybook/main.ts`.
+
+If the project uses CSF Factory (look for `definePreview` / `config.meta` patterns):
+
+Story format (CSF Factory):
+```ts
+import { config } from '#.storybook/preview';
+import { Button } from './Button';
+
+const meta = config.meta({
+  title: 'AI Generated/Simple/Button',
+  component: Button,
+});
+
+export const Default = meta.story({
+  args: {
+    label: 'Click me',
+  },
+});
+
+export const Disabled = meta.story({
+  args: {
+    label: 'Disabled',
+    disabled: true,
+  },
+});
+```
+
+Otherwise:
+
+Story format (CSF):
+```ts
+import type { Meta, StoryObj } from '@storybook/react';
+import { Button } from './Button';
+
+const meta = {
+  title: 'AI Generated/Simple/Button',
+  component: Button,
+} satisfies Meta<typeof Button>;
+
+export default meta;
+type Story = StoryObj<typeof meta>;
+
+export const Default: Story = {
+  args: {
+    label: 'Click me',
+  },
+};
+
+export const Disabled: Story = {
+  args: {
+    label: 'Disabled',
+    disabled: true,
+  },
+};
+```
+
+Rules:
+- Every named export is a story. Use `args` to set props.
+- Provide all required props via `args` — check the component's types.
+- If a component needs per-story decorators (beyond the global ones), add them in the meta.
+- Do NOT use `any` types. Use the component's prop types for type safety.
+
+Reference: https://storybook.js.org/docs/latest/writing-stories
+
+### Step 4: Verify each story with Vitest
+
+After writing each story file, immediately verify it:
+
+```bash
+npx vitest --project storybook <path-to-story-file>
+```
+
+**Self-healing loop — repeat for every story file:**
+1. Write/update the story file
+2. Run `npx vitest --project storybook <path-to-story-file>`
+3. If it fails: read the error output carefully
+   - Missing provider → add a decorator in `.storybook/preview.ts` or in the story meta
+   - Missing prop → add the required prop to `args`
+   - Import error → fix the import path
+   - CSS/asset error → add static dirs or import the stylesheet
+4. Fix the issue and go back to step 2
+5. Once the test passes, move to the next component
+
+After all 9 story files pass individually, run the full suite:
+```bash
+npx vitest --project storybook
+```
+
+### Checklist
+
+- [ ] Analyzed codebase for providers, global styles, and path aliases
+- [ ] Read story patterns from `.storybook/main.ts`
+- [ ] Configured `.storybook/preview.ts` with necessary decorators
+- [ ] Simple component 1: story written and passing
+- [ ] Simple component 2: story written and passing
+- [ ] Simple component 3: story written and passing
+- [ ] Medium component 1: story written and passing
+- [ ] Medium component 2: story written and passing
+- [ ] Medium component 3: story written and passing
+- [ ] Complex component 1: story written and passing
+- [ ] Complex component 2: story written and passing
+- [ ] Complex component 3: story written and passing
+- [ ] Full Vitest suite passes: `npx vitest --project storybook`
+- [ ] Run `npx storybook doctor` to check for common issues (version mismatches, duplicated deps, etc.)
diff --git a/scripts/package.json b/scripts/package.json
index 11ca41bd541d..48fbc54c8704 100644
--- a/scripts/package.json
+++ b/scripts/package.json
@@ -9,6 +9,7 @@
     "check": "jiti ./check/check-package.ts",
     "check-package": "jiti ./check-package.ts",
     "docs:codemod": "jiti ./snippets/codemod.ts",
+    "eval": "node ./eval/eval.ts",
     "generate-sandboxes": "jiti ./sandbox/generate.ts",
     "get-report-message": "jiti ./get-report-message.ts",
     "get-sandbox-dir": "jiti ./get-sandbox-dir.ts",
@@ -41,10 +42,12 @@
   },
   "dependencies": {
     "@actions/core": "^1.11.1",
+    "@anthropic-ai/claude-agent-sdk": "^0.2.85",
     "@fal-works/esbuild-plugin-global-externals": "^2.1.2",
     "@google-cloud/bigquery": "^6.2.1",
     "@octokit/graphql": "^5.0.6",
     "@octokit/request": "^8.4.1",
+    "@openai/codex-sdk": "^0.117.0",
     "@polka/parse": "^1.0.0-next.28",
     "@testing-library/dom": "^10.4.0",
     "@testing-library/jest-dom": "^6.9.1",
@@ -73,6 +76,7 @@
     "@vitest/coverage-v8": "^4.1.0",
     "ansi-regex": "^6.0.1",
     "chromatic": "^13.3.4",
+    "citty": "^0.2.1",
     "codecov": "^3.8.1",
     "commander": "^14.0.2",
     "cross-env": "^7.0.3",
diff --git a/scripts/tsconfig.json b/scripts/tsconfig.json
index c8082acb3897..9c5b78519b8b 100644
--- a/scripts/tsconfig.json
+++ b/scripts/tsconfig.json
@@ -1,6 +1,7 @@
 {
   "compileOnSave": false,
   "compilerOptions": {
+    "customConditions": ["code"],
     "baseUrl": ".",
     "noEmit": true,
     "incremental": false,
@@ -11,6 +12,8 @@
     "moduleResolution": "bundler",
     "target": "ESNext",
     "module": "Preserve",
+    // Required for native Node TS execution (node file.ts) — we are migrating from jiti to native node
+    "allowImportingTsExtensions": true,
     "skipLibCheck": true,
     "allowSyntheticDefaultImports": true,
     "esModuleInterop": true,
diff --git a/yarn.lock b/yarn.lock
index cee8c18346d7..aec95012dc21 100644
--- a/yarn.lock
+++ b/yarn.lock
@@ -436,6 +436,44 @@ __metadata:
   languageName: node
   linkType: hard
 
+"@anthropic-ai/claude-agent-sdk@npm:^0.2.85":
+  version: 0.2.85
+  resolution: "@anthropic-ai/claude-agent-sdk@npm:0.2.85"
+  dependencies:
+    "@img/sharp-darwin-arm64": "npm:^0.34.2"
+    "@img/sharp-darwin-x64": "npm:^0.34.2"
+    "@img/sharp-linux-arm": "npm:^0.34.2"
+    "@img/sharp-linux-arm64": "npm:^0.34.2"
+    "@img/sharp-linux-x64": "npm:^0.34.2"
+    "@img/sharp-linuxmusl-arm64": "npm:^0.34.2"
+    "@img/sharp-linuxmusl-x64": "npm:^0.34.2"
+    "@img/sharp-win32-arm64": "npm:^0.34.2"
+    "@img/sharp-win32-x64": "npm:^0.34.2"
+  peerDependencies:
+    zod: ^4.0.0
+  dependenciesMeta:
+    "@img/sharp-darwin-arm64":
+      optional: true
+    "@img/sharp-darwin-x64":
+      optional: true
+    "@img/sharp-linux-arm":
+      optional: true
+    "@img/sharp-linux-arm64":
+      optional: true
+    "@img/sharp-linux-x64":
+      optional: true
+    "@img/sharp-linuxmusl-arm64":
+      optional: true
+    "@img/sharp-linuxmusl-x64":
+      optional: true
+    "@img/sharp-win32-arm64":
+      optional: true
+    "@img/sharp-win32-x64":
+      optional: true
+  checksum: 10c0/5bb31712460b03b264b489c38a2ddcac62ba60aad50da8cd6d3cebdaf46fae84c37473f25b7a4e20a6bda6f2310b4cc9f3574bc3f2e8f73a4a6e6bd0e04bd827
+  languageName: node
+  linkType: hard
+
 "@aw-web-design/x-default-browser@npm:1.4.126":
   version: 1.4.126
   resolution: "@aw-web-design/x-default-browser@npm:1.4.126"
@@ -2972,7 +3010,7 @@ __metadata:
   languageName: node
   linkType: hard
 
-"@img/sharp-darwin-arm64@npm:0.34.5":
+"@img/sharp-darwin-arm64@npm:0.34.5, @img/sharp-darwin-arm64@npm:^0.34.2":
   version: 0.34.5
   resolution: "@img/sharp-darwin-arm64@npm:0.34.5"
   dependencies:
@@ -2984,7 +3022,7 @@ __metadata:
   languageName: node
   linkType: hard
 
-"@img/sharp-darwin-x64@npm:0.34.5":
+"@img/sharp-darwin-x64@npm:0.34.5, @img/sharp-darwin-x64@npm:^0.34.2":
   version: 0.34.5
   resolution: "@img/sharp-darwin-x64@npm:0.34.5"
   dependencies:
@@ -3066,7 +3104,7 @@ __metadata:
   languageName: node
   linkType: hard
 
-"@img/sharp-linux-arm64@npm:0.34.5":
+"@img/sharp-linux-arm64@npm:0.34.5, @img/sharp-linux-arm64@npm:^0.34.2":
   version: 0.34.5
   resolution: "@img/sharp-linux-arm64@npm:0.34.5"
   dependencies:
@@ -3078,7 +3116,7 @@ __metadata:
   languageName: node
   linkType: hard
 
-"@img/sharp-linux-arm@npm:0.34.5":
+"@img/sharp-linux-arm@npm:0.34.5, @img/sharp-linux-arm@npm:^0.34.2":
   version: 0.34.5
   resolution: "@img/sharp-linux-arm@npm:0.34.5"
   dependencies:
@@ -3126,7 +3164,7 @@ __metadata:
   languageName: node
   linkType: hard
 
-"@img/sharp-linux-x64@npm:0.34.5":
+"@img/sharp-linux-x64@npm:0.34.5, @img/sharp-linux-x64@npm:^0.34.2":
   version: 0.34.5
   resolution: "@img/sharp-linux-x64@npm:0.34.5"
   dependencies:
@@ -3138,7 +3176,7 @@ __metadata:
   languageName: node
   linkType: hard
 
-"@img/sharp-linuxmusl-arm64@npm:0.34.5":
+"@img/sharp-linuxmusl-arm64@npm:0.34.5, @img/sharp-linuxmusl-arm64@npm:^0.34.2":
   version: 0.34.5
   resolution: "@img/sharp-linuxmusl-arm64@npm:0.34.5"
   dependencies:
@@ -3150,7 +3188,7 @@ __metadata:
   languageName: node
   linkType: hard
 
-"@img/sharp-linuxmusl-x64@npm:0.34.5":
+"@img/sharp-linuxmusl-x64@npm:0.34.5, @img/sharp-linuxmusl-x64@npm:^0.34.2":
   version: 0.34.5
   resolution: "@img/sharp-linuxmusl-x64@npm:0.34.5"
   dependencies:
@@ -3171,7 +3209,7 @@ __metadata:
   languageName: node
   linkType: hard
 
-"@img/sharp-win32-arm64@npm:0.34.5":
+"@img/sharp-win32-arm64@npm:0.34.5, @img/sharp-win32-arm64@npm:^0.34.2":
   version: 0.34.5
   resolution: "@img/sharp-win32-arm64@npm:0.34.5"
   conditions: os=win32 & cpu=arm64
@@ -3185,7 +3223,7 @@ __metadata:
   languageName: node
   linkType: hard
 
-"@img/sharp-win32-x64@npm:0.34.5":
+"@img/sharp-win32-x64@npm:0.34.5, @img/sharp-win32-x64@npm:^0.34.2":
   version: 0.34.5
   resolution: "@img/sharp-win32-x64@npm:0.34.5"
   conditions: os=win32 & cpu=x64
@@ -4665,6 +4703,86 @@ __metadata:
   languageName: node
   linkType: hard
 
+"@openai/codex-darwin-arm64@npm:@openai/codex@0.117.0-darwin-arm64":
+  version: 0.117.0-darwin-arm64
+  resolution: "@openai/codex@npm:0.117.0-darwin-arm64"
+  conditions: os=darwin & cpu=arm64
+  languageName: node
+  linkType: hard
+
+"@openai/codex-darwin-x64@npm:@openai/codex@0.117.0-darwin-x64":
+  version: 0.117.0-darwin-x64
+  resolution: "@openai/codex@npm:0.117.0-darwin-x64"
+  conditions: os=darwin & cpu=x64
+  languageName: node
+  linkType: hard
+
+"@openai/codex-linux-arm64@npm:@openai/codex@0.117.0-linux-arm64":
+  version: 0.117.0-linux-arm64
+  resolution: "@openai/codex@npm:0.117.0-linux-arm64"
+  conditions: os=linux & cpu=arm64
+  languageName: node
+  linkType: hard
+
+"@openai/codex-linux-x64@npm:@openai/codex@0.117.0-linux-x64":
+  version: 0.117.0-linux-x64
+  resolution: "@openai/codex@npm:0.117.0-linux-x64"
+  conditions: os=linux & cpu=x64
+  languageName: node
+  linkType: hard
+
+"@openai/codex-sdk@npm:^0.117.0":
+  version: 0.117.0
+  resolution: "@openai/codex-sdk@npm:0.117.0"
+  dependencies:
+    "@openai/codex": "npm:0.117.0"
+  checksum: 10c0/96f86890fd45a4030a8e9b6f8466389a015d0ee534b1661b56463a1fd210c6fc3af0ea1f3ce57306a13a9b6ff6197d6409a4d5af7f6d7c90e672009eee15e3fd
+  languageName: node
+  linkType: hard
+
+"@openai/codex-win32-arm64@npm:@openai/codex@0.117.0-win32-arm64":
+  version: 0.117.0-win32-arm64
+  resolution: "@openai/codex@npm:0.117.0-win32-arm64"
+  conditions: os=win32 & cpu=arm64
+  languageName: node
+  linkType: hard
+
+"@openai/codex-win32-x64@npm:@openai/codex@0.117.0-win32-x64":
+  version: 0.117.0-win32-x64
+  resolution: "@openai/codex@npm:0.117.0-win32-x64"
+  conditions: os=win32 & cpu=x64
+  languageName: node
+  linkType: hard
+
+"@openai/codex@npm:0.117.0":
+  version: 0.117.0
+  resolution: "@openai/codex@npm:0.117.0"
+  dependencies:
+    "@openai/codex-darwin-arm64": "npm:@openai/codex@0.117.0-darwin-arm64"
+    "@openai/codex-darwin-x64": "npm:@openai/codex@0.117.0-darwin-x64"
+    "@openai/codex-linux-arm64": "npm:@openai/codex@0.117.0-linux-arm64"
+    "@openai/codex-linux-x64": "npm:@openai/codex@0.117.0-linux-x64"
+    "@openai/codex-win32-arm64": "npm:@openai/codex@0.117.0-win32-arm64"
+    "@openai/codex-win32-x64": "npm:@openai/codex@0.117.0-win32-x64"
+  dependenciesMeta:
+    "@openai/codex-darwin-arm64":
+      optional: true
+    "@openai/codex-darwin-x64":
+      optional: true
+    "@openai/codex-linux-arm64":
+      optional: true
+    "@openai/codex-linux-x64":
+      optional: true
+    "@openai/codex-win32-arm64":
+      optional: true
+    "@openai/codex-win32-x64":
+      optional: true
+  bin:
+    codex: bin/codex.js
+  checksum: 10c0/a5104a396f0f33558c9a402012bf2dd954f5d3465d3b0bb5fe780d265760a3c72b64af4a2d42a0012f661b7e4a274a42c5d4f5582de115613557f480dbec3b5b
+  languageName: node
+  linkType: hard
+
 "@oxc-project/runtime@npm:0.115.0":
   version: 0.115.0
   resolution: "@oxc-project/runtime@npm:0.115.0"
@@ -8717,10 +8835,12 @@ __metadata:
   resolution: "@storybook/scripts@workspace:scripts"
   dependencies:
     "@actions/core": "npm:^1.11.1"
+    "@anthropic-ai/claude-agent-sdk": "npm:^0.2.85"
     "@fal-works/esbuild-plugin-global-externals": "npm:^2.1.2"
     "@google-cloud/bigquery": "npm:^6.2.1"
     "@octokit/graphql": "npm:^5.0.6"
     "@octokit/request": "npm:^8.4.1"
+    "@openai/codex-sdk": "npm:^0.117.0"
     "@polka/parse": "npm:^1.0.0-next.28"
     "@testing-library/dom": "npm:^10.4.0"
     "@testing-library/jest-dom": "npm:^6.9.1"
@@ -8750,6 +8870,7 @@ __metadata:
     "@vitest/coverage-v8": "npm:^4.1.0"
     ansi-regex: "npm:^6.0.1"
     chromatic: "npm:^13.3.4"
+    citty: "npm:^0.2.1"
     codecov: "npm:^3.8.1"
     commander: "npm:^14.0.2"
     cross-env: "npm:^7.0.3"
@@ -13710,6 +13831,13 @@ __metadata:
   languageName: node
   linkType: hard
 
+"citty@npm:^0.2.1":
+  version: 0.2.1
+  resolution: "citty@npm:0.2.1"
+  checksum: 10c0/504ac5aeb076f750bf5f25d40c730083e8ed6112eac2f00dbe341a223c46ad16893ce73dfdb55b2d0da505100b9678968ee0443637c45b21917db48daa5a6977
+  languageName: node
+  linkType: hard
+
 "cjs-module-lexer@npm:^1.2.3":
   version: 1.4.3
   resolution: "cjs-module-lexer@npm:1.4.3"