diff --git a/.agents/skills/review-pr/SKILL.md b/.agents/skills/review-pr/SKILL.md new file mode 100644 index 000000000000..253cdfb8c3c0 --- /dev/null +++ b/.agents/skills/review-pr/SKILL.md @@ -0,0 +1,439 @@ +--- +name: review-pr +description: "Generate a scrollable single-page PR review. Use when the user says 'review pr', 'review this PR', 'pr review', or wants to review PR changes in a narrative format." +allowed-tools: Bash, Read, Write, Edit, Agent, Grep, Glob +--- + +# PR Review — Scrollable Single-Page + +Generate a scrollable single-page HTML document that reviews a PR as a readable narrative. + +**Always generate the page immediately.** Never block on cleanup or fix discussions. + +## Principles + +The purpose of this page is to help the **human reviewer** understand and review a PR quickly. The page is a reading aid — it presents the code clearly so the reviewer can form their own opinion. + +### Optimize for the reviewer's time + +The reviewer should be able to: +- **Skim** the page and grasp what the PR does in 30 seconds (big picture section). +- **Read** any area and understand what that code does without opening their editor. +- **Zoom in** to full files or diffs when they want to inspect details. + +### Two layers per area + +Each area has two layers: +- **Layer 1 (always visible):** A curated walkthrough — prose explanation with cherry-picked code snippets. Only the parts that matter for understanding. +- **Layer 2 (collapsed):** Full file contents or diffs in `
<details>` blocks. The reviewer expands these to zoom in. + +### High-level to low-level + +Order areas following the **call graph** from entry points down. The reviewer understands the big picture before details. For example: CLI entry point → orchestration → each pipeline step → helpers → utilities → types. + +### Within each area: Explain → Contract → Tests → Implementation + +Structure each area's walkthrough in this order: + +1. **Explanation** — Plain prose first. What does this module do? Why does it exist? How does it fit into the bigger picture? The reviewer should understand the *purpose* before seeing any code. +2. **Functions & data structures** — Show function signatures and the key types/interfaces they use. This is the contract — what goes in, what comes out. Show full interface bodies inline where they're first relevant. Don't defer to "see types.ts". +3. **Tests** — Cherry-pick the test cases that make the behavior concrete. Tests are executable documentation — they turn abstract descriptions into specific examples. +4. **Implementation** — The interesting parts of *how* it works. Skip boilerplate, show the core logic. + +Use narrative `<p>` tags between snippets to guide the reviewer through each transition. + +### Flag obvious issues, but don't force opinions + +If you notice something clearly wrong (bug, missing error handling, naming mismatch), flag it with a smell-box. If something is notably well done, use a note-box. But don't manufacture opinions — if the code is fine, just present it clearly. The reviewer will decide what matters. + +### Cover everything + +Every changed file appears somewhere — either in a walkthrough snippet or in a collapsed full-file block. + +## Step 1 — Gather PR data + +```bash +gh pr view --json number,title,author,headRefName,baseRefName,body,additions,deletions,changedFiles +gh pr diff --name-only +gh pr diff +``` + +If a PR number or URL is given as an argument, pass it to `gh pr view ` and `gh pr diff `. + +## Step 2 — Read all changed files + +Read the full file content of every changed file with the `Read` tool. Also read the full diff. Classify each file as test, implementation, config, or docs. + +## Step 3 — Generate the page + +For each area, write two layers: + +### Layer 1: Readable walkthrough (always visible) + +A curated narrative that mixes prose with **short code snippets**. Structure it following the principle order: + +1. **Explanation** — plain prose describing what this area does, why it exists, and how it fits. +2. **Functions & data structures** — key function signatures and the types they use. Show the contract. +3. **Tests** — cherry-picked test cases that make the behavior concrete with specific examples. +4. **Implementation** — the core logic. Skip boilerplate, show the interesting parts. + +Use narrative `<p>` tags between snippets to guide the reviewer through each transition. Add smell-boxes or note-boxes only when something genuinely stands out. + +### Layer 2: Full files (always collapsed) + +Below the walkthrough, include every file in the area as a collapsed `<details>
` block with the complete file content (or diff for modified files). The reader expands these for reference. + +First create the output directory: + +```bash +mkdir -p .pr-review/pr- +``` + +Write to `.pr-review/pr-/index.html` (relative to the repo root). + +**Verify every file from `gh pr diff --name-only` appears in the page.** + +### HTML structure + +``` +Sticky topbar (nav links) +Header (title, author, stats) +Big picture section +Area 1 + Readable walkthrough (Explain → Contract → Tests → Implementation) + Full files (collapsed) +Area 2 + ... +Supporting changes +``` + +### Complete HTML template + +```html + + + + + + PR #{{NUMBER}}: {{TITLE}} + + + + + + + + + +
+ #{{NUMBER}} + Overview + +
+ +
+ +
+

{{TITLE}}

+
by {{AUTHOR}} · {{BRANCH}} → {{BASE}}
+
+ {{FILES}} files + +{{ADDITIONS}} + −{{DELETIONS}} +
+
+ + +
+

What this PR does

+

{{SUMMARY}}

+
+
+ + +
+

{{N}}. {{Area Name}}

+

{{What this area does}}

+ + + +
+
+ +
+ + + + + + + + +``` + +### Building blocks + +**Layer 1 — Readable walkthrough snippet** (curated excerpt with prose): + +Start with explanation, then show the contract (function + types), then tests, then implementation. Show full interface bodies inline — not just names: + +```html +
+
+

processStory is the core rendering pipeline. It takes a story config, prepares the + rendering context, mounts the component, and returns a result with status and timing.

+

The function signature and the types it uses:

+
+
export async function processStory(config: StoryConfig): Promise<StoryResult>
+
+export interface StoryConfig {
+  id: string;
+  title: string;
+  component: ComponentType;
+  args: Record<string, unknown>;
+  parameters: Parameters;
+}
+
+export interface StoryResult {
+  status: 'success' | 'error';
+  rendered: boolean;
+  duration: number;
+  errors: string[];
+}
+
+

The happy-path test shows the expected flow concretely:

+
+
const result = await processStory(baseConfig);
+expect(result.status).toBe('success');
+expect(result.rendered).toBe(true);
+
+

The implementation is a sequential pipeline — rendering depends on preparation:

+
+
const context = await prepare(config);
+const canvas = await render(context);
+return summarize(canvas, config);
+
+``` + +**Layer 2 — Full file (collapsed, for new files):** +```html +
+
+ impl + new + path/to/file.ts +
+
+ Full file ({{N}} lines) +
{{FULL FILE CONTENT, HTML-ESCAPED}}
+
+
+``` + +**Layer 2 — Full file (collapsed, for modified files with diff):** + +Use `language-typescript data-diff` — this gives TypeScript syntax highlighting plus line-level add/remove backgrounds via the post-processing script. Lines starting with `+` get green background, `-` get red. + +```html +
+
+ modified + path/to/file.ts +
+
+ Diff +
-old line
++new line
+
+
+``` + +**Supporting change — no code needed:** +```html +
+
+ config + modified + yarn.lock +
+

Lockfile updated for new dependencies.

+
+``` + +**Inline issue:** +```html +
No unit tests for this file.
+``` + +**Positive note:** +```html +
These test names read like a specification — good documentation.
+``` + + +### Badge reference + +| Badge | Class | Use for | +|-------|-------|---------| +| `test` | `badge-test` | Test files | +| `impl` | `badge-impl` | Implementation files | +| `config` | `badge-config` | Config, docs, prompts, lockfiles | +| `new` | `badge-new` | New files (combine with test/impl/config) | +| `modified` | `badge-modified` | Modified files | + +### Syntax highlighting + +| Class | Use for | +|-------|---------| +| `language-typescript` | `.ts`, `.tsx`, `.js`, `.jsx` (new files) | +| `language-typescript` + `data-diff` attribute | Modified file diffs — gets TS highlighting plus line-level add/remove backgrounds | +| `language-json` | `.json` files | +| `language-markdown` | `.md` files | + +**Important:** Do NOT use `language-diff` — it only does `+`/`-` coloring without syntax highlighting. Instead use `language-typescript` with the `data-diff` attribute for diffs. The post-processing script handles line backgrounds. + +### HTML escaping + +All code inside `<code>` blocks must be escaped: `&` → `&amp;`, `<` → `&lt;`, `>` → `&gt;`. + +## Step 4 — Serve the page + +Kill any existing server, write a static server, start it: + +```bash +lsof -ti:3000 | xargs kill -9 2>/dev/null || true +``` + +Write to `.pr-review/pr-/server.mjs`: + +```javascript +import { createServer } from 'node:http'; +import { readFileSync } from 'node:fs'; +import { join, extname } from 'node:path'; + +const dir = new URL('.', import.meta.url).pathname; +const port = 3000; + +createServer((req, res) => { + try { + const filePath = join(dir, req.url === '/' ? 
'index.html' : req.url); + const content = readFileSync(filePath); + const ext = extname(filePath); + const types = { + '.html': 'text/html', '.js': 'text/javascript', + '.css': 'text/css', '.json': 'application/json', + }; + res.writeHead(200, { 'Content-Type': types[ext] || 'application/octet-stream' }); + res.end(content); + } catch { + res.writeHead(404).end('Not found'); + } +}).listen(port, () => { + console.log(`\n PR Review: http://localhost:${port}\n`); +}); +``` + +```bash +node .pr-review/pr-/server.mjs & # run_in_background: true +open http://localhost:3000 +``` + +## Step 5 — Iterate + +Tell the user: +- The page is live at http://localhost:3000 +- They can ask to update specific sections +- Refresh the browser after updates diff --git a/.circleci/config.yml b/.circleci/config.yml index 59dd245f7676..d2798c319911 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -32,7 +32,7 @@ jobs: generate-and-run-config: executor: name: node/default - resource_class: small + resource_class: large steps: - node/install: install-yarn: true diff --git a/.gitignore b/.gitignore index 43107a4f3e07..ecb034fa9189 100644 --- a/.gitignore +++ b/.gitignore @@ -79,4 +79,11 @@ CLAUDE.local.md .cursor/mcp.json .vscode/mcp.json .mcp.json -.nx/polygraph \ No newline at end of file +.nx/polygraph + +# Eval system +scripts/eval/.cache +scripts/eval/results + +# review-pr skill output +.pr-review diff --git a/AGENTS.md b/AGENTS.md index 7c99c9041a9e..a538c6cdb6c0 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -9,10 +9,11 @@ This file is the canonical instruction source for coding agents. Files like `CLA Storybook is a large TypeScript monorepo. The git root is the repo root, the main code lives in `code/`, and build tooling lives in `scripts/`. The default branch is `next`. 
- **Base branch**: `next` (all PRs should target `next`, not `main`) -- **Node.js**: `22.21.1` (see `.nvmrc`) +- **Node.js**: `22.22.1` (see `.nvmrc`) — supports `.ts` natively via type stripping (no loader needed) - **Package Manager**: Yarn Berry - **Task orchestration**: NX plus the custom `yarn task` runner - **CI environment**: Linux and Windows +- **TS execution**: Migrating from `jiti` to native `node` for running `.ts` files. New scripts should use `node ./path/file.ts` with explicit `.ts` import extensions (enabled by `allowImportingTsExtensions` in tsconfig). Legacy scripts still use `jiti` but should be migrated over time. ## Repository Structure @@ -234,7 +235,7 @@ When writing tests: After changing files: -1. Format with `cd code && oxfmt` +1. Format with `yarn fmt:write` (run from the repo root) 2. Lint with `yarn --cwd code lint:js:cmd --fix` or `cd code && yarn lint:js:cmd ` 3. Run relevant tests before submitting a PR diff --git a/code/core/src/core-server/index.ts b/code/core/src/core-server/index.ts index f475fa6166ca..b1669cb685c1 100644 --- a/code/core/src/core-server/index.ts +++ b/code/core/src/core-server/index.ts @@ -32,3 +32,6 @@ export { } from './stores/test-provider'; export { getServerPort } from './utils/server-address'; + +export { getComponentCandidates } from './utils/ghost-stories/get-candidates'; +export { runGhostStories } from './utils/ghost-stories/run-story-tests'; diff --git a/code/core/src/core-server/server-channel/ghost-stories-channel.ts b/code/core/src/core-server/server-channel/ghost-stories-channel.ts index 2b334865556e..3076d7b293d9 100644 --- a/code/core/src/core-server/server-channel/ghost-stories-channel.ts +++ b/code/core/src/core-server/server-channel/ghost-stories-channel.ts @@ -9,7 +9,7 @@ import { import type { CoreConfig, Options } from 'storybook/internal/types'; import { getComponentCandidates } from '../utils/ghost-stories/get-candidates'; -import { runStoryTests } from 
'../utils/ghost-stories/run-story-tests'; +import { runGhostStories } from '../utils/ghost-stories/run-story-tests'; export function initGhostStoriesChannel( channel: Channel, @@ -91,7 +91,7 @@ export function initGhostStoriesChannel( // Phase 2: Run tests on those candidates Vitest. The components will be transformed directly to tests // If they pass, it means that creating a story file for them would succeed. - const testRunResult = await runStoryTests(candidatesResult.candidates); + const testRunResult = await runGhostStories(candidatesResult.candidates); stats.totalRunDuration = Date.now() - ghostRunStart; stats.testRunDuration = testRunResult.duration; if (testRunResult.runError) { diff --git a/code/core/src/core-server/utils/ghost-stories/get-candidates.ts b/code/core/src/core-server/utils/ghost-stories/get-candidates.ts index 8c7d7a113cb3..661196a3ebea 100644 --- a/code/core/src/core-server/utils/ghost-stories/get-candidates.ts +++ b/code/core/src/core-server/utils/ghost-stories/get-candidates.ts @@ -1,12 +1,11 @@ import { readFile } from 'node:fs/promises'; import { babelParse, traverse } from 'storybook/internal/babel'; -import { logger } from 'storybook/internal/node-logger'; // eslint-disable-next-line depend/ban-dependencies import { glob } from 'glob'; -import { getComponentComplexity } from './component-analyzer'; +import { getComponentComplexity } from './component-analyzer.ts'; // A valid candidate includes React code and at least one export function isValidCandidate(source: string): boolean { @@ -128,9 +127,12 @@ export async function getCandidatesForStorybook( export async function getComponentCandidates({ sampleSize = 20, globPattern = '**/*.{tsx,jsx}', + cwd = process.cwd(), }: { sampleSize?: number; globPattern?: string; + /** Working directory for glob. Defaults to process.cwd(). 
*/ + cwd?: string; } = {}): Promise<{ candidates: string[]; error?: string; @@ -145,7 +147,7 @@ export async function getComponentCandidates({ // Find files matching the glob pattern files = await glob(globPattern, { - cwd: process.cwd(), + cwd, absolute: true, ignore: [ '**/node_modules/**', diff --git a/code/core/src/core-server/utils/ghost-stories/parse-vitest-report.ts b/code/core/src/core-server/utils/ghost-stories/parse-vitest-report.ts index 8c783abdccbe..e0bd41cc53a6 100644 --- a/code/core/src/core-server/utils/ghost-stories/parse-vitest-report.ts +++ b/code/core/src/core-server/utils/ghost-stories/parse-vitest-report.ts @@ -1,6 +1,10 @@ -import type { ErrorCategory } from '../../../shared/utils/categorize-render-errors'; -import { categorizeError } from '../../../shared/utils/categorize-render-errors'; -import { type ErrorCategorizationResult, type StoryTestResult, type TestRunSummary } from './types'; +import type { ErrorCategory } from '../../../shared/utils/categorize-render-errors.ts'; +import { categorizeError } from '../../../shared/utils/categorize-render-errors.ts'; +import { + type ErrorCategorizationResult, + type StoryTestResult, + type TestRunSummary, +} from './types.ts'; /** * For a given list of test results: diff --git a/code/core/src/core-server/utils/ghost-stories/run-story-tests.ts b/code/core/src/core-server/utils/ghost-stories/run-story-tests.ts index 42ab270ee58a..c934eb385ecd 100644 --- a/code/core/src/core-server/utils/ghost-stories/run-story-tests.ts +++ b/code/core/src/core-server/utils/ghost-stories/run-story-tests.ts @@ -5,10 +5,21 @@ import { executeCommand, resolvePathInStorybookCache } from 'storybook/internal/ import { join } from 'pathe'; -import { parseVitestResults } from './parse-vitest-report'; -import type { TestRunSummary } from './types'; +import { parseVitestResults } from './parse-vitest-report.ts'; +import type { TestRunSummary } from './types.ts'; -export async function runStoryTests(componentFilePaths: 
string[]): Promise<TestRunSummary> { +/** + * Run ghost stories: execute vitest on component file paths to auto-generate + * and test stories that don't exist on disk. + * + * @param componentFilePaths - Absolute paths to component files to test. + * @param options.cwd - Working directory for vitest. Defaults to process.cwd(). + */ +export async function runGhostStories( + componentFilePaths: string[], + options?: { cwd?: string } +): Promise<TestRunSummary> { + const cwd = options?.cwd; try { // Create the cache directory for story discovery tests const cacheDir = resolvePathInStorybookCache('ghost-stories-tests'); @@ -34,6 +45,7 @@ export async function runStoryTests(componentFilePaths: string[]): Promise ({ export const fmt = defineJob('Format check', () => ({ executor: { name: 'sb_node_22_classic', - class: 'medium+', + class: 'xlarge', }, steps: [ git.checkout(), diff --git a/scripts/eval/eval.ts b/scripts/eval/eval.ts new file mode 100644 index 000000000000..048e5efb75ca --- /dev/null +++ b/scripts/eval/eval.ts @@ -0,0 +1,201 @@ +/** + * Eval harness entry point. + * + * Runs with `node ./eval/eval.ts` (no jiti). Node 22+ supports .ts natively + * via type stripping. Import specifiers use explicit .ts extensions. 
+ * + * Usage: + * node eval/eval.ts -p mealdrop # claude defaults + * node eval/eval.ts -p mealdrop -a codex # codex defaults + * node eval/eval.ts -p mealdrop -m gpt-5.4 # codex (inferred) + * node eval/eval.ts -p mealdrop -a claude -e max # claude with max effort + * node eval/eval.ts -p mealdrop --manual # prepare only, print instructions + * node eval/eval.ts --list-projects + * node eval/eval.ts --list-models + * node eval/eval.ts --list-prompts + */ +import { writeFile } from 'node:fs/promises'; +import { join } from 'node:path'; +import { parseArgs } from 'node:util'; +import { z } from 'zod'; +import pc from 'picocolors'; +import { + AGENT_IDS, + AGENTS, + CLAUDE_EFFORTS, + CLAUDE_MODELS, + CODEX_EFFORTS, + CODEX_MODELS, + type AgentId, + type AgentVariant, +} from './lib/agents/config.ts'; +import { prepareTrial } from './lib/prepare-trial.ts'; +import { PROJECTS } from './lib/projects.ts'; +import { runTrial, type TrialConfig } from './lib/run-trial.ts'; +import { + captureEnvironment, + createLogger, + formatCost, + formatDuration, + generateTrialId, + listPrompts, + loadPrompt, +} from './lib/utils.ts'; + +const PROJECT_NAMES = PROJECTS.map((p) => p.name) as [string, ...string[]]; + +const base = { + project: z.enum(PROJECT_NAMES).optional(), + prompt: z.string().default('setup'), + verbose: z.boolean().default(false), + manual: z.boolean().default(false), + listProjects: z.boolean().default(false), + listModels: z.boolean().default(false), + listPrompts: z.boolean().default(false), +}; + +const argsSchema = z.discriminatedUnion('agent', [ + z.object({ + ...base, + agent: z.literal('claude'), + model: z.enum(CLAUDE_MODELS).default(AGENTS.claude.defaultModel), + effort: z.enum(CLAUDE_EFFORTS).default(AGENTS.claude.defaultEffort), + }), + z.object({ + ...base, + agent: z.literal('codex'), + model: z.enum(CODEX_MODELS).default(AGENTS.codex.defaultModel), + effort: z.enum(CODEX_EFFORTS).default(AGENTS.codex.defaultEffort), + }), +]); + +const { values } = 
parseArgs({ + options: { + project: { type: 'string', short: 'p' }, + agent: { type: 'string', short: 'a' }, + model: { type: 'string', short: 'm' }, + effort: { type: 'string', short: 'e' }, + prompt: { type: 'string' }, + verbose: { type: 'boolean', short: 'v' }, + manual: { type: 'boolean' }, + 'list-projects': { type: 'boolean' }, + 'list-models': { type: 'boolean' }, + 'list-prompts': { type: 'boolean' }, + }, + args: process.argv.slice(2), + strict: true, +}); + +// Resolve the discriminator: explicit --agent, inferred from --model, or default to claude. +const agent = values.agent ?? (values.model ? inferAgent(values.model) : 'claude'); + +const parsed = argsSchema.safeParse({ + ...values, + agent, + listProjects: values['list-projects'], + listModels: values['list-models'], + listPrompts: values['list-prompts'], +}); + +if (!parsed.success) { + for (const issue of parsed.error.issues) { + console.error(pc.red(` ${issue.path.join('.')}: ${issue.message}`)); + } + process.exit(1); +} + +const args = parsed.data; +const logger = createLogger(); + +if (args.listProjects) { + for (const project of PROJECTS) { + logger.log(` ${pc.bold(project.name)} — ${project.description}`); + } + process.exit(0); +} +if (args.listModels) { + for (const [name, { models }] of Object.entries(AGENTS)) { + logger.log(`\n ${pc.bold(name)}`); + for (const model of models) logger.log(` ${model}`); + } + process.exit(0); +} +if (args.listPrompts) { + for (const name of listPrompts()) logger.log(` ${pc.bold(name)}`); + process.exit(0); +} + +if (!args.project) { + logger.log(pc.red(`Specify a project with -p. 
Available: ${PROJECT_NAMES.join(', ')}`)); + process.exit(1); +} +const project = PROJECTS.find((p) => p.name === args.project)!; +const variant = toVariant(args); + +logger.log(pc.bold(`\nStorybook Setup Eval — ${project.name}`)); +logger.log( + `Agent: ${variant.agent} | Model: ${variant.model} | Effort: ${variant.effort} | Prompt: ${args.prompt}\n` +); + +if (args.manual) { + const trialId = generateTrialId(project.name, variant.agent, variant.model, args.prompt); + const workspace = await prepareTrial(project, trialId, logger); + await captureEnvironment(workspace.resultsDir); + + const prompt = loadPrompt(args.prompt); + const promptPath = join(workspace.resultsDir, 'prompt.md'); + await writeFile(promptPath, prompt); + + const cliCommand = buildManualCommand(variant, promptPath); + + logger.log(pc.bold('\n── Manual mode ──')); + logger.log(`\n Trial dir: ${pc.cyan(workspace.trialDir)}`); + logger.log(` Project dir: ${pc.cyan(workspace.projectPath)}`); + logger.log(` Prompt file: ${pc.cyan(promptPath)}`); + logger.log(pc.bold('\nRun the agent yourself:\n')); + logger.log(` ${pc.green('cd')} ${workspace.projectPath}`); + logger.log(` ${pc.green(cliCommand)}\n`); +} else { + const result = await runTrial( + { project, variant, prompt: args.prompt, verbose: args.verbose } satisfies TrialConfig, + logger + ); + + const ghost = result.grade.ghostStories; + const ghostStr = ghost + ? `${ghost.passed}/${ghost.total} (${Math.round(ghost.successRate * 100)}%)` + : '-'; + + logger.log(pc.bold('\nResult')); + logger.log(` Build: ${result.grade.buildSuccess ? 
pc.green('PASS') : pc.red('FAIL')}`); + logger.log(` Ghost: ${ghostStr}`); + logger.log(` TS Err: ${result.grade.typeCheckErrors}`); + logger.log(` Score: ${result.score.score}`); + logger.log(` Cost: ${formatCost(result.execution.cost)}`); + logger.log(` Time: ${formatDuration(result.execution.duration)}`); + logger.log(` Turns: ${result.execution.turns}`); + + logger.log('\nDone.'); +} + +function inferAgent(model: string): AgentId { + for (const id of AGENT_IDS) { + if (AGENTS[id].models.some((candidate) => candidate === model)) return id; + } + throw new Error(`No agent found for model: ${model}`); +} + +function buildManualCommand(variant: AgentVariant, promptPath: string): string { + const promptArg = `"$(cat ${promptPath})"`; + if (variant.agent === 'claude') { + const sdkModel = AGENTS.claude.sdkModelIds[variant.model] ?? variant.model; + return `claude --model ${sdkModel} ${promptArg}`; + } + return `codex --model ${variant.model} --reasoning-effort ${variant.effort} ${promptArg}`; +} + +function toVariant(args: z.infer<typeof argsSchema>): AgentVariant { + return args.agent === 'claude' + ? 
{ agent: 'claude', model: args.model, effort: args.effort } + : { agent: 'codex', model: args.model, effort: args.effort }; +} diff --git a/scripts/eval/lib/agents/claude-code.ts b/scripts/eval/lib/agents/claude-code.ts new file mode 100644 index 000000000000..1cd03f43b177 --- /dev/null +++ b/scripts/eval/lib/agents/claude-code.ts @@ -0,0 +1,142 @@ +import type { SDKMessage } from '@anthropic-ai/claude-agent-sdk'; +import { query } from '@anthropic-ai/claude-agent-sdk'; +import { writeFile } from 'node:fs/promises'; +import { join } from 'node:path'; +import { AGENTS, resolveClaudeSdkModel, type AgentDriver, type Execution } from './config.ts'; +import type { Logger } from '../utils.ts'; + +export const claudeAgent: AgentDriver = { + name: 'claude', + + async execute({ prompt, projectPath, variant, resultsDir, logger }): Promise<Execution> { + if (variant.agent !== 'claude') { + throw new Error(`Claude driver received unsupported variant: ${variant.agent}`); + } + + const startTime = Date.now(); + const settings = AGENTS.claude.execution; + const { model } = variant; + const effort = variant.effort as 'low' | 'medium' | 'high' | 'max'; + const sdkModel = resolveClaudeSdkModel(model); + + let cost: number | undefined; + let turns = 0; + let durationApi: number | undefined; + const messages: unknown[] = []; + + try { + for await (const message of query({ + prompt, + options: { + model: sdkModel, + cwd: projectPath, + allowedTools: [...settings.allowedTools], + maxTurns: settings.maxTurns, + effort, + debug: settings.debug, + systemPrompt: settings.systemPrompt, + }, + })) { + logMessage(message, logger); + messages.push(message); + + if (message.type === 'result' && message.subtype === 'success') { + cost = message.total_cost_usd as number | undefined; + turns = (message.num_turns as number) ?? 0; + durationApi = + typeof message.duration_api_ms === 'number' + ? 
message.duration_api_ms / 1000 + : undefined; + } + } + } finally { + await writeTranscript(resultsDir, messages, logger); + } + + const duration = (Date.now() - startTime) / 1000; + + return { + cost, + duration, + durationApi, + turns, + }; + }, +}; + +function logMessage(message: SDKMessage, logger: Logger) { + switch (message.type) { + case 'assistant': { + for (const block of message.message.content) { + if (block.type === 'text') { + logger.log(`💬 ${block.text}`); + } else if (block.type === 'tool_use') { + logger.log(`🔧 ${block.name}(${JSON.stringify(block.input).slice(0, 200)})`); + } + } + if (message.error) { + logger.logError(`Assistant error: ${message.error}`); + } + break; + } + case 'user': { + const content = message.message.content; + if (!Array.isArray(content)) break; + for (const block of content) { + if (block.type === 'tool_result') { + const text = + typeof block.content === 'string' + ? block.content.slice(0, 200) + : Array.isArray(block.content) + ? block.content + .map((b: { type: string; text?: string }) => + 'text' in b ? b.text : `[${b.type}]` + ) + .join('') + .slice(0, 200) + : '[no content]'; + logger.log(`📎 tool_result(${block.tool_use_id?.slice(-8)}): ${text}`); + } + } + break; + } + case 'result': + if (message.subtype === 'success') { + logger.logSuccess( + `Done — ${message.num_turns} turns, $${message.total_cost_usd?.toFixed(4)}` + ); + } else { + logger.logError(`Error (${message.subtype}): ${message.errors?.join(', ')}`); + } + break; + case 'system': + if (message.subtype === 'init') { + logger.log(`🚀 Session started — model: ${message.model}`); + } else if (message.subtype === 'api_retry') { + logger.log(`🔄 API retry: attempt ${message.attempt}/${message.max_retries}`); + } else if (message.subtype === 'status') { + logger.log(`📊 status: ${message.status ?? 
'unknown'}`); + } + break; + case 'tool_use_summary': + logger.log(`📋 ${message.summary.slice(0, 200)}`); + break; + case 'rate_limit_event': + logger.log( + `⏳ Rate limited — status: ${message.rate_limit_info?.status}, resets at: ${message.rate_limit_info?.resetsAt}` + ); + break; + default: + break; + } +} + +async function writeTranscript(resultsDir: string, messages: unknown[], logger: Logger) { + try { + await writeFile(join(resultsDir, 'transcript.json'), JSON.stringify(messages, null, 2)); + } catch (error) { + logger.logError( + `Failed to persist transcript: ${error instanceof Error ? error.message : String(error)}` + ); + } +} diff --git a/scripts/eval/lib/agents/codex.ts b/scripts/eval/lib/agents/codex.ts new file mode 100644 index 000000000000..09cbdc00ee7b --- /dev/null +++ b/scripts/eval/lib/agents/codex.ts @@ -0,0 +1,105 @@ +import { Codex, type ModelReasoningEffort } from '@openai/codex-sdk'; +import { writeFile } from 'node:fs/promises'; +import { join } from 'node:path'; +import { AGENTS, estimateCost, type AgentDriver, type Execution } from './config.ts'; +import type { Logger } from '../utils.ts'; + +export const codexAgent: AgentDriver = { + name: 'codex', + + async execute({ prompt, projectPath, variant, resultsDir, logger }): Promise<Execution> { + if (variant.agent !== 'codex') { + throw new Error(`Codex driver received unsupported variant: ${variant.agent}`); + } + + const startTime = Date.now(); + const settings = AGENTS.codex.execution; + const { model, effort } = variant; + + const codex = new Codex(); + const thread = codex.startThread({ + model, + modelReasoningEffort: effort as ModelReasoningEffort, + workingDirectory: projectPath, + approvalPolicy: settings.approvalPolicy, + }); + + const items: unknown[] = []; + let totalInput = 0; + let totalCached = 0; + let totalOutput = 0; + let turns = 0; + + try { + const { events } = await thread.runStreamed(prompt); + for await (const event of events) { + switch (event.type) { + case 'item.completed': 
{ + const item = event.item; + items.push(item); + switch (item.type) { + case 'agent_message': + logger.log(`💬 ${item.text.slice(0, 300)}`); + break; + case 'command_execution': + logger.log(`🔧 $ ${item.command} → exit ${item.exit_code ?? '?'}`); + if (item.exit_code !== 0 && item.aggregated_output) { + logger.log(` ${item.aggregated_output.slice(-200)}`); + } + break; + case 'file_change': + for (const c of item.changes) logger.log(`📝 ${c.kind} ${c.path}`); + break; + case 'reasoning': + logger.log(`🧠 ${item.text.slice(0, 200)}`); + break; + case 'error': + logger.logError(item.message); + break; + } + break; + } + case 'turn.completed': + totalInput += event.usage.input_tokens; + totalCached += event.usage.cached_input_tokens; + totalOutput += event.usage.output_tokens; + turns++; + logger.log( + `📊 tokens: ${event.usage.input_tokens}in / ${event.usage.output_tokens}out (${event.usage.cached_input_tokens} cached)` + ); + break; + case 'turn.failed': + logger.logError(`Turn failed: ${event.error.message}`); + break; + case 'error': + logger.logError(`Error: ${event.message}`); + break; + } + } + } finally { + await writeTranscript(resultsDir, items, logger); + } + + const duration = (Date.now() - startTime) / 1000; + const cost = estimateCost('codex', model, { + inputTokens: totalInput, + cachedInputTokens: totalCached, + outputTokens: totalOutput, + }); + logger.logSuccess( + `Done — ${turns} turns, ${Math.round(duration)}s, ${totalInput}in/${totalOutput}out tokens${cost != null ? `, $${cost.toFixed(4)}` : ''}` + ); + + return { cost, duration, turns }; + }, +}; + +async function writeTranscript(resultsDir: string, items: unknown[], logger: Logger) { + try { + await writeFile(join(resultsDir, 'transcript.json'), JSON.stringify(items, null, 2)); + } catch (error) { + logger.logError( + `Failed to persist transcript: ${error instanceof Error ? 
error.message : String(error)}` + ); + } +} diff --git a/scripts/eval/lib/agents/config.test.ts b/scripts/eval/lib/agents/config.test.ts new file mode 100644 index 000000000000..1236689d05cd --- /dev/null +++ b/scripts/eval/lib/agents/config.test.ts @@ -0,0 +1,62 @@ +import { describe, expect, it } from 'vitest'; + +import { AGENTS, getDefaultVariant } from './config'; + +describe('AGENTS', () => { + it('keeps each agent default inside its supported model and effort lists', () => { + for (const config of Object.values(AGENTS)) { + expect(config).toMatchObject({ + defaultModel: expect.any(String), + defaultEffort: expect.any(String), + }); + expect(config.models).toContain(config.defaultModel); + expect(config.efforts).toContain(config.defaultEffort); + } + }); + + it('keeps Claude models fully remappable to SDK model ids', () => { + expect(AGENTS.claude).toMatchObject({ + defaultModel: 'sonnet-4.6', + defaultEffort: 'medium', + execution: { + maxTurns: 50, + allowedTools: ['Read', 'Write', 'Edit', 'Bash', 'Glob', 'Grep'], + permissionModel: 'tool-allowlist', + }, + sdkModelIds: Object.fromEntries( + AGENTS.claude.models.map((model) => [model, expect.any(String)]) + ), + }); + }); + + it('keeps Codex models fully priceable from token usage', () => { + expect(AGENTS.codex).toMatchObject({ + defaultModel: 'gpt-5.4', + defaultEffort: 'medium', + execution: { + approvalPolicy: 'never', + permissionModel: 'approval-policy-never', + }, + pricing: { + 'gpt-5.4': { + input: 2.5, + cachedInput: 0.25, + output: 15, + }, + }, + }); + }); + + it('derives default variants from the central agent definitions', () => { + expect(getDefaultVariant('claude')).toEqual({ + agent: 'claude', + model: 'sonnet-4.6', + effort: 'medium', + }); + expect(getDefaultVariant('codex')).toEqual({ + agent: 'codex', + model: 'gpt-5.4', + effort: 'medium', + }); + }); +}); diff --git a/scripts/eval/lib/agents/config.ts b/scripts/eval/lib/agents/config.ts new file mode 100644 index 
000000000000..eb13a52686a9 --- /dev/null +++ b/scripts/eval/lib/agents/config.ts @@ -0,0 +1,166 @@ +/** + * Agent definitions, model mappings, pricing, and cost estimation. + */ + +import type { Logger } from '../utils.ts'; + +export const CLAUDE_MODELS = ['sonnet-4.6', 'opus-4.6', 'haiku-4.5'] as const; +export const CODEX_MODELS = ['gpt-5.4'] as const; +export const ALL_MODELS = [...CLAUDE_MODELS, ...CODEX_MODELS] as const; + +export const CLAUDE_EFFORTS = ['low', 'medium', 'high', 'max'] as const; +export const CODEX_EFFORTS = ['low', 'medium', 'high', 'xhigh'] as const; +export const ALL_EFFORTS = ['low', 'medium', 'high', 'max', 'xhigh'] as const; + +export const AGENT_IDS = ['claude', 'codex'] as const; + +export type ClaudeModel = (typeof CLAUDE_MODELS)[number]; +export type CodexModel = (typeof CODEX_MODELS)[number]; +export type ClaudeEffort = (typeof CLAUDE_EFFORTS)[number]; +export type CodexEffort = (typeof CODEX_EFFORTS)[number]; + +/** Agent + model + effort — validated as a discriminated union at the CLI boundary. 
*/ +export type AgentVariant = + | { agent: 'claude'; model: ClaudeModel; effort: ClaudeEffort } + | { agent: 'codex'; model: CodexModel; effort: CodexEffort }; + +export type AgentId = AgentVariant['agent']; + +export interface Execution { + cost?: number; + duration: number; + durationApi?: number; + turns: number; +} + +export interface AgentExecuteParams { + prompt: string; + projectPath: string; + variant: AgentVariant; + resultsDir: string; + logger: Logger; +} + +export interface AgentDriver { + name: AgentId; + execute(params: AgentExecuteParams): Promise<Execution>; +} + +export interface TokenPricing { + input: number; + cachedInput: number; + output: number; +} + +export interface TokenUsage { + inputTokens: number; + cachedInputTokens: number; + outputTokens: number; +} + +export type ClaudeTool = 'Read' | 'Write' | 'Edit' | 'Bash' | 'Glob' | 'Grep'; + +export interface ClaudeExecutionConfig { + maxTurns: number; + /** + * Bash is toggled here at the harness level, but individual shell commands still execute through + * Claude's Bash tool rather than through a separate command allowlist. + */ + allowedTools: readonly ClaudeTool[]; + debug: boolean; + systemPrompt: { type: 'preset'; preset: 'claude_code' }; + /** Claude access is controlled through the explicit tool allowlist above. */ + permissionModel: 'tool-allowlist'; +} + +export interface CodexExecutionConfig { + /** Codex runs non-interactively so benchmark runs never block on approval prompts. */ + approvalPolicy: 'never'; + permissionModel: 'approval-policy-never'; +} + +export interface AgentDefinition<TModel extends string, TEffort extends string, TExecution> { + models: readonly TModel[]; + defaultModel: TModel; + /** Map friendly model names to SDK-specific model IDs (e.g. "sonnet-4.6" → "claude-sonnet-4-6"). */ + sdkModelIds: Partial<Record<TModel, string>>; + /** Per-million-token pricing for manual cost estimation (agents that don't report cost natively).
*/ + pricing: Partial<Record<TModel, TokenPricing>>; + efforts: readonly TEffort[]; + defaultEffort: TEffort; + execution: TExecution; +} + +export type ClaudeDefinition = AgentDefinition<ClaudeModel, ClaudeEffort, ClaudeExecutionConfig>; +export type CodexDefinition = AgentDefinition<CodexModel, CodexEffort, CodexExecutionConfig>; + +export interface AgentDefinitions { + claude: ClaudeDefinition; + codex: CodexDefinition; +} + +export const AGENTS: AgentDefinitions = { + claude: { + models: CLAUDE_MODELS, + defaultModel: 'sonnet-4.6', + sdkModelIds: { + 'sonnet-4.6': 'claude-sonnet-4-6', + 'opus-4.6': 'claude-opus-4-6', + 'haiku-4.5': 'claude-haiku-4-5', + }, + pricing: {}, + efforts: CLAUDE_EFFORTS, + defaultEffort: 'medium', + execution: { + maxTurns: 50, + allowedTools: ['Read', 'Write', 'Edit', 'Bash', 'Glob', 'Grep'], + debug: true, + systemPrompt: { type: 'preset', preset: 'claude_code' }, + permissionModel: 'tool-allowlist', + }, + }, + codex: { + models: CODEX_MODELS, + defaultModel: 'gpt-5.4', + sdkModelIds: {}, + pricing: { + 'gpt-5.4': { input: 2.5, cachedInput: 0.25, output: 15.0 }, + }, + efforts: CODEX_EFFORTS, + defaultEffort: 'medium', + execution: { + approvalPolicy: 'never', + permissionModel: 'approval-policy-never', + }, + }, +}; + +export function getDefaultVariant<T extends AgentId>( + agent: T +): Extract<AgentVariant, { agent: T }> { + const definition = AGENTS[agent]; + return { + agent, + model: definition.defaultModel, + effort: definition.defaultEffort, + } as Extract<AgentVariant, { agent: T }>; +} + +export function resolveClaudeSdkModel(model: ClaudeModel): string { + return AGENTS.claude.sdkModelIds[model] ?? model; +} + +/** Estimate cost from token usage using the pricing table. */ +export function estimateCost(agent: AgentId, model: string, usage: TokenUsage): number | undefined { + const pricing = + agent === 'claude' + ?
AGENTS.claude.pricing[model as ClaudeModel] + : AGENTS.codex.pricing[model as CodexModel]; + if (!pricing) return undefined; + const freshInput = usage.inputTokens - usage.cachedInputTokens; + return ( + (freshInput / 1_000_000) * pricing.input + + (usage.cachedInputTokens / 1_000_000) * pricing.cachedInput + + (usage.outputTokens / 1_000_000) * pricing.output + ); +} diff --git a/scripts/eval/lib/grade.test.ts b/scripts/eval/lib/grade.test.ts new file mode 100644 index 000000000000..adcf2d85667d --- /dev/null +++ b/scripts/eval/lib/grade.test.ts @@ -0,0 +1,260 @@ +import { describe, expect, it, vi } from 'vitest'; + +vi.mock('../../../code/core/src/core-server/utils/ghost-stories/get-candidates.ts', () => ({ + getComponentCandidates: vi.fn(), +})); + +vi.mock('../../../code/core/src/core-server/utils/ghost-stories/run-story-tests.ts', () => ({ + runGhostStories: vi.fn(), +})); + +import { + filterStorybookFiles, + computeQualityScore, + countTypeCheckErrors, + parseChangedFiles, +} from './grade'; +import type { FileChange } from './grade'; + +describe('filterStorybookFiles', () => { + it('matches files in .storybook/ directory', () => { + const files: FileChange[] = [ + { path: '.storybook/main.ts', gitStatus: 'M' }, + { path: '.storybook/preview.tsx', gitStatus: 'A' }, + { path: 'src/App.tsx', gitStatus: 'M' }, + ]; + expect(filterStorybookFiles(files)).toMatchObject([ + { path: '.storybook/main.ts', gitStatus: 'M' }, + { path: '.storybook/preview.tsx', gitStatus: 'A' }, + ]); + }); + + it('matches story files with various extensions', () => { + const files: FileChange[] = [ + { path: 'src/Button.stories.tsx', gitStatus: 'A' }, + { path: 'src/Header.stories.ts', gitStatus: 'A' }, + { path: 'src/Page.story.jsx', gitStatus: 'A' }, + { path: 'src/utils.stories.js', gitStatus: 'A' }, + { path: 'src/Button.tsx', gitStatus: 'M' }, + { path: 'src/Button.test.tsx', gitStatus: 'M' }, + ]; + expect(filterStorybookFiles(files)).toMatchObject(files.slice(0, 4)); + }); + + 
it('returns empty for no storybook files', () => { + const files: FileChange[] = [ + { path: 'src/App.tsx', gitStatus: 'M' }, + { path: 'package.json', gitStatus: 'M' }, + ]; + expect(filterStorybookFiles(files)).toHaveLength(0); + }); + + it('handles empty input', () => { + expect(filterStorybookFiles([])).toHaveLength(0); + }); + + it('matches renamed files using either side of the rename', () => { + const files: FileChange[] = [ + { path: 'src/Button.tsx', previousPath: 'src/Button.stories.tsx', gitStatus: 'R' }, + { path: '.storybook/preview.tsx', previousPath: 'config/preview.tsx', gitStatus: 'R' }, + { path: 'src/App.tsx', previousPath: 'src/Main.tsx', gitStatus: 'R' }, + ]; + + expect(filterStorybookFiles(files)).toMatchObject(files.slice(0, 2)); + }); +}); + +describe('computeQualityScore', () => { + // Weights: 40% ghost, 25% build, 25% typecheck, 10% performance + + it('returns 1.0 when everything passes and agent is fast', () => { + const result = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 0, + ghostSuccessRate: 1.0, + durationSeconds: 60, + }); + expect(result.score).toBe(1); + expect(result.breakdown).toEqual({ build: 1, typecheck: 1, ghostStories: 1, performance: 1 }); + }); + + it('ghost stories have 40% weight', () => { + const result = computeQualityScore({ + buildSuccess: false, + typeCheckErrors: 20, + ghostSuccessRate: 1.0, + durationSeconds: 600, + }); + expect(result.score).toBe(0.4); + }); + + it('build has 25% weight', () => { + const result = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 20, + ghostSuccessRate: 0, + durationSeconds: 600, + }); + expect(result.score).toBe(0.25); + }); + + it('performance has 10% weight', () => { + const result = computeQualityScore({ + buildSuccess: false, + typeCheckErrors: 20, + ghostSuccessRate: 0, + durationSeconds: 60, + }); + expect(result.score).toBe(0.1); + }); + + it('returns 0 when everything fails', () => { + const result = computeQualityScore({ + buildSuccess: 
false, + typeCheckErrors: 20, + ghostSuccessRate: 0, + durationSeconds: 600, + }); + expect(result.score).toBe(0); + }); + + it('scales typecheck score linearly', () => { + const result = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 10, + ghostSuccessRate: 1.0, + durationSeconds: 60, + }); + expect(result.breakdown.typecheck).toBe(0.5); + }); + + it('clamps typecheck score at 0 for >= 20 errors', () => { + const a = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 20, + ghostSuccessRate: 1.0, + durationSeconds: 60, + }); + const b = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 50, + ghostSuccessRate: 1.0, + durationSeconds: 60, + }); + expect(a.breakdown.typecheck).toBe(0); + expect(b.breakdown.typecheck).toBe(0); + }); + + it('treats undefined ghost stories as 0', () => { + const a = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 0, + ghostSuccessRate: 0, + durationSeconds: 60, + }); + const b = computeQualityScore({ buildSuccess: true, typeCheckErrors: 0, durationSeconds: 60 }); + expect(a.score).toBe(b.score); + }); + + it('performance: ≤120s scores 1.0', () => { + const a = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 0, + ghostSuccessRate: 1.0, + durationSeconds: 0, + }); + const b = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 0, + ghostSuccessRate: 1.0, + durationSeconds: 120, + }); + expect(a.breakdown.performance).toBe(1); + expect(b.breakdown.performance).toBe(1); + }); + + it('performance: 360s scores 0.5', () => { + const r = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 0, + ghostSuccessRate: 1.0, + durationSeconds: 360, + }); + expect(r.breakdown.performance).toBe(0.5); + }); + + it('performance: ≥600s scores 0', () => { + const a = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 0, + ghostSuccessRate: 1.0, + durationSeconds: 600, + }); + const b = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 0, + 
ghostSuccessRate: 1.0, + durationSeconds: 1000, + }); + expect(a.breakdown.performance).toBe(0); + expect(b.breakdown.performance).toBe(0); + }); + + it('performance: undefined duration scores 0', () => { + const r = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 0, + ghostSuccessRate: 1.0, + }); + expect(r.breakdown.performance).toBe(0); + }); +}); + +describe('countTypeCheckErrors', () => { + it('counts zero for clean output', () => { + expect(countTypeCheckErrors('')).toBe(0); + expect(countTypeCheckErrors('All good\nNo issues')).toBe(0); + }); + + it('counts TypeScript error codes', () => { + const output = [ + "src/App.tsx(3,1): error TS2304: Cannot find name 'foo'.", + "src/App.tsx(5,1): error TS2322: Type 'string' is not assignable.", + 'Found 2 errors.', + ].join('\n'); + expect(countTypeCheckErrors(output)).toBe(2); + }); + + it('counts multiple errors on the same line', () => { + expect(countTypeCheckErrors('error TS1234 and error TS5678 on same line')).toBe(2); + }); + + it('does not count non-error TS references', () => { + expect(countTypeCheckErrors('TS2304 without error prefix')).toBe(0); + expect(countTypeCheckErrors('warning TS1234')).toBe(0); + }); +}); + +describe('parseChangedFiles', () => { + it('parses added, modified, deleted, and renamed files', () => { + const output = + 'A\tsrc/new-file.ts\nM\tsrc/existing.ts\nD\tsrc/removed.ts\nR100\told.ts\tnew.ts'; + expect(parseChangedFiles(output)).toMatchObject([ + { path: 'src/new-file.ts', gitStatus: 'A' }, + { path: 'src/existing.ts', gitStatus: 'M' }, + { path: 'src/removed.ts', gitStatus: 'D' }, + { path: 'new.ts', previousPath: 'old.ts', gitStatus: 'R' }, + ]); + }); + + it('handles empty output', () => { + expect(parseChangedFiles('')).toEqual([]); + expect(parseChangedFiles('\n')).toEqual([]); + }); + + it('handles single file', () => { + expect(parseChangedFiles('M\tpackage.json')).toEqual([ + { path: 'package.json', gitStatus: 'M' }, + ]); + }); +}); diff --git 
a/scripts/eval/lib/grade.ts b/scripts/eval/lib/grade.ts new file mode 100644 index 000000000000..0bf3259b56da --- /dev/null +++ b/scripts/eval/lib/grade.ts @@ -0,0 +1,289 @@ +import { writeFile } from 'node:fs/promises'; +import { join } from 'node:path'; +import { x } from 'tinyexec'; +import { getComponentCandidates } from '../../../code/core/src/core-server/utils/ghost-stories/get-candidates.ts'; +import { runGhostStories } from '../../../code/core/src/core-server/utils/ghost-stories/run-story-tests.ts'; +import type { Logger } from './utils.ts'; +import type { TrialWorkspace } from './prepare-trial.ts'; + +/** Git `--name-status` codes: A=added, M=modified, D=deleted, R=renamed. */ +export type GitDiffStatus = 'A' | 'M' | 'D' | 'R'; + +export interface FileChange { + path: string; + gitStatus: GitDiffStatus; + /** For renames, the original path before the move. */ + previousPath?: string; +} + +export interface GhostStoryGrade { + candidateCount: number; + total: number; + passed: number; + successRate: number; +} + +export interface ScoreWeights { + ghostStories: number; + build: number; + typecheck: number; + performance: number; +} + +export const DEFAULT_SCORE_WEIGHTS: ScoreWeights = { + ghostStories: 0.4, + build: 0.25, + typecheck: 0.25, + performance: 0.1, +}; + +export interface QualityScore { + score: number; + breakdown: { + build: number; + typecheck: number; + ghostStories: number; + performance: number; + }; +} + +export interface Grade { + buildSuccess: boolean; + buildError?: string; + typeCheckErrors: number; + typeCheckOutput?: string; + fileChanges: FileChange[]; + storybookChanges: FileChange[]; + ghostStories?: GhostStoryGrade; +} + +/** Maximum TypeScript errors before the typecheck score reaches 0. */ +const MAX_TYPECHECK_ERRORS = 20; +/** Agent duration (seconds) at or below which performance scores 1.0. */ +const PERFECT_DURATION_S = 120; +/** Agent duration (seconds) at or above which performance scores 0. 
*/ +const ZERO_SCORE_DURATION_S = 600; + +/** Filter file changes to only storybook-related ones. */ +export function filterStorybookFiles(fileChanges: FileChange[]): FileChange[] { + const isStorybookPath = (path?: string) => + path != null && (path.includes('.storybook/') || /\.(stories|story)\.[tj]sx?$/.test(path)); + + return fileChanges.filter((f) => isStorybookPath(f.path) || isStorybookPath(f.previousPath)); +} + +/** + * Compute quality score with configurable weights. + * + * Default weights: 40% ghost stories, 25% build, 25% typecheck, 10% performance. + * + * Performance is scored on a curve: <=120s -> 1.0, 600s -> 0, linear between. + */ +export function computeQualityScore( + opts: { + buildSuccess: boolean; + typeCheckErrors: number; + ghostSuccessRate?: number; + durationSeconds?: number; + }, + weights: ScoreWeights = DEFAULT_SCORE_WEIGHTS +): QualityScore { + const buildScore = opts.buildSuccess ? 1 : 0; + const tcScore = Math.max(0, 1 - opts.typeCheckErrors / MAX_TYPECHECK_ERRORS); + const ghostScore = opts.ghostSuccessRate ?? 0; + const d = opts.durationSeconds; + const perfScore = + d == null + ? 0 + : Math.max( + 0, + Math.min(1, 1 - (d - PERFECT_DURATION_S) / (ZERO_SCORE_DURATION_S - PERFECT_DURATION_S)) + ); + const score = + Math.round( + (ghostScore * weights.ghostStories + + buildScore * weights.build + + tcScore * weights.typecheck + + perfScore * weights.performance) * + 100 + ) / 100; + return { + score, + breakdown: { + build: buildScore, + typecheck: Math.round(tcScore * 100) / 100, + ghostStories: Math.round(ghostScore * 100) / 100, + performance: Math.round(perfScore * 100) / 100, + }, + }; +} + +/** Count TypeScript errors from tsc output. */ +export function countTypeCheckErrors(tscOutput: string): number { + return (tscOutput.match(/error TS\d+/g) || []).length; +} + +/** Parse git diff --name-status output into FileChange objects. 
*/ +export function parseChangedFiles(gitOutput: string): FileChange[] { + return gitOutput + .trim() + .split('\n') + .filter(Boolean) + .map((line) => { + const [status, ...parts] = line.split('\t'); + const gitStatus = parseGitDiffStatus(status); + + if (gitStatus === 'R' && parts.length >= 2) { + const [previousPath, path] = parts; + return { path, previousPath, gitStatus }; + } + + return { path: parts.join('\t'), gitStatus }; + }); +} + +export async function grade( + workspace: TrialWorkspace, + logger: Logger, + agentDuration?: number +): Promise<{ grade: Grade; score: QualityScore }> { + const { repoRoot, projectPath, resultsDir, baselineCommit } = workspace; + + // Changed files + logger.logStep('Collecting agent changes...'); + const fileChanges = await getChangedFiles(repoRoot, baselineCommit); + const storybookChanges = filterStorybookFiles(fileChanges); + logger.logSuccess( + `${fileChanges.length} files changed (${storybookChanges.length} storybook-related)` + ); + + // Storybook build + TypeScript check in parallel + logger.logStep('Running storybook build + typecheck...'); + const [build, tsc] = await Promise.all([ + x('npx', ['storybook', 'build', '--quiet'], { + throwOnError: false, + timeout: 300_000, + nodeOptions: { + cwd: projectPath, + env: { + ...process.env, + STORYBOOK_DISABLE_TELEMETRY: '1', + NODE_OPTIONS: '--max_old_space_size=4096', + }, + }, + }), + x('npx', ['tsc', '--noEmit'], { + throwOnError: false, + timeout: 120_000, + nodeOptions: { cwd: projectPath }, + }), + ]); + + const buildSuccess = build.exitCode === 0; + const buildOutput = build.stdout + '\n' + build.stderr; + await writeFile(join(resultsDir, 'build-output.txt'), buildOutput); + if (buildSuccess) { + logger.logSuccess('Storybook build succeeded'); + } else { + logger.logError(`Storybook build failed (exit ${build.exitCode})`); + } + + const tscOutput = tsc.stdout + '\n' + tsc.stderr; + await writeFile(join(resultsDir, 'typecheck-output.txt'), tscOutput); + const 
typeCheckErrors = countTypeCheckErrors(tscOutput); + if (typeCheckErrors === 0) { + logger.logSuccess('No TypeScript errors'); + } else { + logger.logError(`${typeCheckErrors} TypeScript error(s)`); + } + + // Ghost stories (only if build passed) + const ghostStories = buildSuccess ? await gradeGhostStories(projectPath, logger) : undefined; + + const trialGrade: Grade = { + buildSuccess, + buildError: buildSuccess ? undefined : truncateEnd(buildOutput, 2000), + typeCheckErrors, + typeCheckOutput: typeCheckErrors > 0 ? truncateEnd(tscOutput, 2000) : undefined, + fileChanges, + storybookChanges, + ghostStories, + }; + + const score = computeQualityScore({ + buildSuccess, + typeCheckErrors, + ghostSuccessRate: ghostStories?.successRate, + durationSeconds: agentDuration, + }); + + return { grade: trialGrade, score }; +} + +async function getChangedFiles(repoRoot: string, baseline: string): Promise<FileChange[]> { + // Stage all files so `git diff --cached` picks up new files the agent created. + // Safe: this runs on an ephemeral trial copy, not the real repo. + await x('git', ['add', '-A'], { nodeOptions: { cwd: repoRoot } }); + const { stdout } = await x('git', ['diff', '--cached', '--name-status', baseline], { + throwOnError: false, + nodeOptions: { cwd: repoRoot }, + }); + return parseChangedFiles(stdout); +} + +async function gradeGhostStories( + projectPath: string, + logger: Logger +): Promise<GhostStoryGrade | undefined> { + logger.logStep('Running ghost stories...'); + + try { + const { candidates } = await getComponentCandidates({ sampleSize: 20, cwd: projectPath }); + if (candidates.length === 0) { + logger.logError('No candidate components found'); + return undefined; + } + logger.logStep(`Found ${candidates.length} candidate component(s)`); + + const result = await runGhostStories(candidates, { cwd: projectPath }); + + if (result.runError) { + logger.logError(`Ghost stories: ${result.runError}`); + return undefined; + } + + const summary = 'summary' in result ?
result.summary : undefined; + + if (summary && summary.total > 0) { + const realPassed = summary.passed - summary.passedButEmptyRender; + logger.logSuccess( + `Ghost stories: ${realPassed}/${summary.total} passed (${Math.round(summary.successRateWithoutEmptyRender * 100)}%)${summary.passedButEmptyRender > 0 ? ` (${summary.passedButEmptyRender} empty renders excluded)` : ''}` + ); + } + + return { + candidateCount: candidates.length, + total: summary?.total ?? 0, + passed: (summary?.passed ?? 0) - (summary?.passedButEmptyRender ?? 0), + successRate: summary?.successRateWithoutEmptyRender ?? 0, + }; + } catch (error) { + logger.logError(`Ghost stories: ${error instanceof Error ? error.message : String(error)}`); + return undefined; + } +} + +/** Truncate text to approximately maxChars, snapping to a line boundary. */ +function truncateEnd(text: string, maxChars: number): string { + if (text.length <= maxChars) return text; + const truncated = text.slice(-maxChars); + const firstNewline = truncated.indexOf('\n'); + return firstNewline >= 0 ? truncated.slice(firstNewline + 1) : truncated; +} + +function parseGitDiffStatus(rawStatus?: string): GitDiffStatus { + const firstChar = rawStatus?.charAt(0); + return firstChar === 'A' || firstChar === 'M' || firstChar === 'D' || firstChar === 'R' + ? 
firstChar + : 'M'; +} diff --git a/scripts/eval/lib/grading-helpers.test.ts b/scripts/eval/lib/grading-helpers.test.ts new file mode 100644 index 000000000000..8d883da92f7a --- /dev/null +++ b/scripts/eval/lib/grading-helpers.test.ts @@ -0,0 +1,177 @@ +import { mkdirSync, writeFileSync, rmSync } from 'node:fs'; +import { join } from 'node:path'; +import { tmpdir } from 'node:os'; + +import { afterEach, beforeEach, describe, expect, it } from 'vitest'; + +import { getComponentCandidates } from 'storybook/internal/core-server'; +import { + computeQualityScore, + countTypeCheckErrors, + filterStorybookFiles, + parseChangedFiles, +} from './grade'; +/** + * Helper-level test: compose grading helpers on a fake project directory. + * This exercises candidate discovery, git-output parsing, + * and quality-score calculation without pretending to cover the full grade() flow. + */ + +let TMP: string; + +beforeEach(() => { + TMP = join(tmpdir(), `eval-grading-helpers-${Date.now()}`); + mkdirSync(join(TMP, 'src', 'components'), { recursive: true }); + mkdirSync(join(TMP, '.storybook'), { recursive: true }); +}); + +afterEach(() => { + rmSync(TMP, { recursive: true, force: true }); +}); + +describe('grading helpers', () => { + it('composes helper signals for a well-configured project', async () => { + // Set up a realistic project with components and storybook config + writeFile( + 'src/components/Button.tsx', + [ + `import React from 'react';`, + `export function Button({ label }: { label: string }) {`, + ` return (`, + ` `, + ` );`, + `}`, + ].join('\n') + ); + writeFile( + 'src/components/Card.tsx', + [ + `import React from 'react';`, + `export function Card({ title }: { title: string }) {`, + ` return (`, + `
{title}
`, + ` );`, + `}`, + ].join('\n') + ); + writeFile( + '.storybook/preview.tsx', + [ + `import '../src/styles/globals.css';`, + `import { ThemeProvider } from '@emotion/react';`, + ].join('\n') + ); + writeFile( + '.storybook/main.ts', + `export default { staticDirs: ['../public'], stories: ['../src/**/*.stories.tsx'] };` + ); + + // Step 1: Find candidates — both components should be discovered + const candidates = await findCandidates(TMP); + expect(candidates).toHaveLength(2); + + // Step 2: Simulate git output where the agent added storybook config + one + // story per discovered candidate, plus modified package.json + const gitLines = [ + 'A\t.storybook/preview.tsx', + 'A\t.storybook/main.ts', + ...candidates.map((c) => `A\t${c.replace(/\.tsx$/, '.stories.tsx')}`), + 'M\tpackage.json', + ]; + const changedFiles = parseChangedFiles(gitLines.join('\n')); + const storybookFiles = filterStorybookFiles(changedFiles); + + // 2 config files + 1 story per candidate = storybook-related + expect(storybookFiles).toHaveLength(2 + candidates.length); + // Total includes package.json + expect(changedFiles).toHaveLength(storybookFiles.length + 1); + + // Step 3: Build passed, no TS errors, 100% ghost stories, fast agent → perfect score + const quality = computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 0, + ghostSuccessRate: 1.0, + durationSeconds: 60, + }); + expect(quality.score).toBe(1); + }); + + it('composes helper signals for a broken project', async () => { + writeFile( + 'src/components/Widget.tsx', + [ + `import React from 'react';`, + `export function Widget() {`, + ` return
hello
;`, + `}`, + ].join('\n') + ); + + // Candidates still discoverable even when storybook setup is broken + const candidates = await findCandidates(TMP); + expect(candidates).toHaveLength(1); + + // Simulate tsc output with errors proportional to candidate count + const tscLines = candidates.map( + (c, i) => `${c}(${i + 1},1): error TS2304: Cannot find name 'React'.` + ); + tscLines.push('src/App.tsx(10,5): error TS2345: Argument not assignable.'); + const errorCount = countTypeCheckErrors(tscLines.join('\n')); + expect(errorCount).toBe(candidates.length + 1); + + // Build failed, no ghost stories, errors, slow → low quality + const quality = computeQualityScore({ + buildSuccess: false, + typeCheckErrors: errorCount, + ghostSuccessRate: 0, + durationSeconds: 600, + }); + expect(quality.score).toBeLessThan(0.3); + expect(quality.breakdown.build).toBe(0); + }); + + it('keeps helper output stable as candidate count grows', async () => { + // Rich project: many simple components + for (let i = 0; i < 5; i++) { + writeFile( + `src/components/Comp${i}.tsx`, + [ + `import React from 'react';`, + `export function Comp${i}() {`, + ` return
Component ${i}
;`, + `}`, + ].join('\n') + ); + } + writeFile('.storybook/preview.tsx', `import { MemoryRouter } from 'react-router-dom';`); + + const candidates = await findCandidates(TMP); + expect(candidates).toHaveLength(5); + + // Agent wrote one story per candidate — all storybook-related + const gitOutput = candidates.map((c) => `A\t${c.replace(/\.tsx$/, '.stories.tsx')}`).join('\n'); + const storybookFiles = filterStorybookFiles(parseChangedFiles(gitOutput)); + expect(storybookFiles).toHaveLength(candidates.length); + + // Clean build + 100% ghost stories + fast → perfect + expect( + computeQualityScore({ + buildSuccess: true, + typeCheckErrors: 0, + ghostSuccessRate: 1.0, + durationSeconds: 60, + }).score + ).toBe(1); + }); +}); + +function writeFile(relativePath: string, content: string) { + const fullPath = join(TMP, relativePath); + mkdirSync(join(fullPath, '..'), { recursive: true }); + writeFileSync(fullPath, content); +} + +async function findCandidates(cwd: string) { + const { candidates } = await getComponentCandidates({ cwd, sampleSize: 20 }); + return candidates.map((c) => c.replace(cwd + '/', '')); +} diff --git a/scripts/eval/lib/package-manager.test.ts b/scripts/eval/lib/package-manager.test.ts new file mode 100644 index 000000000000..4d958198d3d7 --- /dev/null +++ b/scripts/eval/lib/package-manager.test.ts @@ -0,0 +1,71 @@ +import { mkdirSync, rmSync, writeFileSync } from 'node:fs'; +import { dirname, join } from 'node:path'; +import { tmpdir } from 'node:os'; + +import { afterEach, describe, expect, it } from 'vitest'; + +import { detectPackageManager, resolveInstallRoot } from './package-manager'; + +const TEMP_DIRS: string[] = []; + +afterEach(() => { + for (const dir of TEMP_DIRS.splice(0)) { + rmSync(dir, { recursive: true, force: true }); + } +}); + +describe('detectPackageManager', () => { + it('recognizes npm from package-lock files', () => { + const root = createTempDir('npm-lock'); + writeFile('package-lock.json', root); + + 
expect(detectPackageManager(root)).toBe('npm'); + }); +}); + +describe('resolveInstallRoot', () => { + it('keeps nested standalone apps on their own install root', () => { + const repoRoot = createTempDir('nested-bun'); + const projectDir = join(repoRoot, 'frontend'); + mkdirSync(projectDir, { recursive: true }); + writeFile('frontend/bun.lock', repoRoot); + + expect(resolveInstallRoot(projectDir, repoRoot)).toBe(projectDir); + }); + + it('walks up to the repo workspace root when lockfiles live above projectDir', () => { + const repoRoot = createTempDir('pnpm-workspace'); + const projectDir = join(repoRoot, 'packages', 'lib'); + mkdirSync(projectDir, { recursive: true }); + writeFile('pnpm-lock.yaml', repoRoot); + writeFile('pnpm-workspace.yaml', repoRoot); + + expect(resolveInstallRoot(projectDir, repoRoot)).toBe(repoRoot); + }); + + it('does not walk above the cloned repo root', () => { + const parent = createTempDir('parent-lock'); + const repoRoot = join(parent, 'repo'); + const projectDir = join(repoRoot, 'packages', 'lib'); + mkdirSync(projectDir, { recursive: true }); + writeFile('yarn.lock', parent); + + expect(resolveInstallRoot(projectDir, repoRoot)).toBe(projectDir); + }); +}); + +function createTempDir(name: string) { + const dir = join( + tmpdir(), + `storybook-eval-${name}-${Date.now()}-${Math.random().toString(16).slice(2)}` + ); + mkdirSync(dir, { recursive: true }); + TEMP_DIRS.push(dir); + return dir; +} + +function writeFile(relativePath: string, root: string) { + const fullPath = join(root, relativePath); + mkdirSync(dirname(fullPath), { recursive: true }); + writeFileSync(fullPath, ''); +} diff --git a/scripts/eval/lib/package-manager.ts b/scripts/eval/lib/package-manager.ts new file mode 100644 index 000000000000..ea61a5444e4f --- /dev/null +++ b/scripts/eval/lib/package-manager.ts @@ -0,0 +1,99 @@ +/** + * Shared package manager detection and dependency installation. 
+ * + * Used by trial preparation and any other eval flows that need a + * package-manager-aware install step. + */ +import { existsSync } from 'node:fs'; +import { dirname, join, resolve } from 'node:path'; +import { x } from 'tinyexec'; +import type { Logger } from './utils.ts'; + +const PACKAGE_MANAGER_MARKERS = { + pnpm: ['pnpm-lock.yaml', 'pnpm-workspace.yaml'], + yarn: ['yarn.lock'], + bun: ['bun.lockb', 'bun.lock'], + npm: ['package-lock.json', 'npm-shrinkwrap.json'], +} as const; + +/** Detect the package manager from lock files in a directory. */ +export function detectPackageManager(dir: string): string { + if (PACKAGE_MANAGER_MARKERS.pnpm.some((file) => existsSync(join(dir, file)))) return 'pnpm'; + if (PACKAGE_MANAGER_MARKERS.yarn.some((file) => existsSync(join(dir, file)))) return 'yarn'; + if (PACKAGE_MANAGER_MARKERS.bun.some((file) => existsSync(join(dir, file)))) return 'bun'; + if (PACKAGE_MANAGER_MARKERS.npm.some((file) => existsSync(join(dir, file)))) return 'npm'; + return 'npm'; +} + +/** + * Resolve the directory where dependency installation should run. + * + * For nested projects inside a workspace, the lockfile often lives above `dir`. + * We walk upward until we find the closest package-manager marker, stopping at + * the cloned repo root so we do not accidentally use markers from outside the trial. + */ +export function resolveInstallRoot(dir: string, stopAt?: string): string { + const start = resolve(dir); + const boundary = stopAt ? resolve(stopAt) : undefined; + + let current = start; + while (true) { + if (hasAnyMarker(current)) { + return current; + } + + if (boundary && current === boundary) { + return start; + } + + const parent = dirname(current); + if (parent === current) { + return start; + } + + current = parent; + } +} + +/** Install dependencies using the detected package manager. 
*/ +export async function installDeps( + dir: string, + logger: Logger, + env?: Record, + options?: { stopAt?: string } +): Promise { + const installRoot = resolveInstallRoot(dir, options?.stopAt); + const pm = detectPackageManager(installRoot); + const [cmd, args] = getInstallArgs(pm, installRoot); + logger.logStep( + installRoot === resolve(dir) + ? `Installing with ${pm}...` + : `Installing with ${pm} from ${installRoot}...` + ); + await x(cmd, args, { + timeout: 300_000, + nodeOptions: { cwd: installRoot, ...(env && { env: env as NodeJS.ProcessEnv }) }, + }); +} + +function hasAnyMarker(dir: string): boolean { + return Object.values(PACKAGE_MANAGER_MARKERS).some((files) => + files.some((file) => existsSync(join(dir, file))) + ); +} + +function getInstallArgs(pm: string, dir: string): [string, string[]] { + switch (pm) { + case 'pnpm': + return ['pnpm', ['install', '--no-frozen-lockfile']]; + case 'yarn': + return [ + 'yarn', + existsSync(join(dir, '.yarnrc.yml')) ? ['install', '--no-immutable'] : ['install'], + ]; + case 'bun': + return ['bun', ['install']]; + default: + return ['npm', ['install', '--ignore-scripts']]; + } +} diff --git a/scripts/eval/lib/prepare-trial.test.ts b/scripts/eval/lib/prepare-trial.test.ts new file mode 100644 index 000000000000..af45783998b5 --- /dev/null +++ b/scripts/eval/lib/prepare-trial.test.ts @@ -0,0 +1,48 @@ +import { describe, expect, it } from 'vitest'; + +import { getCacheRefreshReason, type TrialCacheInfo } from './prepare-trial'; +import type { Project } from './projects'; + +const project: Project = { + name: 'mealdrop', + repo: 'https://github.com/example/mealdrop', + branch: 'eval-baseline', +}; + +const cacheInfo: TrialCacheInfo = { + repo: project.repo, + branch: project.branch, + baselineCommit: '0123456789abcdef', +}; + +describe('getCacheRefreshReason', () => { + it('keeps the cache when repo, branch, and baseline still match', () => { + expect(getCacheRefreshReason(project, cacheInfo, 
cacheInfo.baselineCommit)).toBeUndefined(); + }); + + it('refreshes when the repo URL changes', () => { + expect( + getCacheRefreshReason( + { ...project, repo: 'https://github.com/example/mealdrop-fork' }, + cacheInfo, + cacheInfo.baselineCommit + ) + ).toContain('repo changed'); + }); + + it('refreshes when the tracked branch changes', () => { + expect( + getCacheRefreshReason({ ...project, branch: 'next' }, cacheInfo, cacheInfo.baselineCommit) + ).toContain('branch changed'); + }); + + it('refreshes when the remote branch head advances', () => { + expect(getCacheRefreshReason(project, cacheInfo, 'fedcba9876543210')).toContain( + 'baseline branch advanced' + ); + }); + + it('keeps the cache if the remote branch cannot be verified', () => { + expect(getCacheRefreshReason(project, cacheInfo)).toBeUndefined(); + }); +}); diff --git a/scripts/eval/lib/prepare-trial.ts b/scripts/eval/lib/prepare-trial.ts new file mode 100644 index 000000000000..a39eedd40f64 --- /dev/null +++ b/scripts/eval/lib/prepare-trial.ts @@ -0,0 +1,166 @@ +import { existsSync } from 'node:fs'; +import { cp, mkdir, readFile, rm, writeFile } from 'node:fs/promises'; +import { join } from 'node:path'; +import type { Logger } from './utils.ts'; +import type { Project } from './projects.ts'; +import { x } from 'tinyexec'; +import { installDeps } from './package-manager.ts'; +import { CACHE_DIR, TRIALS_DIR } from './utils.ts'; + +const CACHE_INFO_SUFFIX = '.json'; + +export interface TrialWorkspace { + trialDir: string; + repoRoot: string; + projectPath: string; + resultsDir: string; + baselineCommit: string; +} + +export interface TrialCacheInfo { + repo: string; + branch: string; + baselineCommit: string; +} + +/** + * First run: clone eval-baseline -> install deps -> cache it. + * Subsequent runs: copy from cache. Agent starts immediately. 
+ */ +export async function prepareTrial( + project: Project, + trialId: string, + logger: Logger +): Promise<TrialWorkspace> { + const cacheDir = join(CACHE_DIR, project.name); + const cacheInfoPath = join(CACHE_DIR, `${project.name}${CACHE_INFO_SUFFIX}`); + const trialDir = join(TRIALS_DIR, trialId); + const repoRoot = join(trialDir, 'project'); + await mkdir(trialDir, { recursive: true }); + + if (await canReuseCache(project, cacheDir, cacheInfoPath, logger)) { + logger.logStep('Copying from cache...'); + await cp(cacheDir, repoRoot, { recursive: true }); + } else { + logger.logStep(`Cloning ${project.repo}#${project.branch}...`); + await mkdir(CACHE_DIR, { recursive: true }); + await x('git', ['clone', '--depth', '1', '--branch', project.branch, project.repo, repoRoot], { + timeout: 120_000, + }); + const projectPath = project.projectDir ? join(repoRoot, project.projectDir) : repoRoot; + await installDeps(projectPath, logger, undefined, { stopAt: repoRoot }); + logger.logSuccess('Dependencies installed'); + logger.logStep('Caching for future runs...'); + const baselineCommit = await getGitHead(repoRoot); + await persistCache(cacheDir, cacheInfoPath, repoRoot, { + repo: project.repo, + branch: project.branch, + baselineCommit, + }); + } + + const baselineCommit = await getGitHead(repoRoot); + const projectPath = project.projectDir ? 
join(repoRoot, project.projectDir) : repoRoot; + const resultsDir = join(trialDir, 'results'); + await mkdir(resultsDir, { recursive: true }); + + logger.logSuccess('Trial ready'); + return { trialDir, repoRoot, projectPath, resultsDir, baselineCommit }; +} + +export function getCacheRefreshReason( + project: Project, + cacheInfo: TrialCacheInfo, + remoteHead?: string +): string | undefined { + if (cacheInfo.repo !== project.repo) { + return `repo changed (${cacheInfo.repo} → ${project.repo})`; + } + if (cacheInfo.branch !== project.branch) { + return `branch changed (${cacheInfo.branch} → ${project.branch})`; + } + if (remoteHead && cacheInfo.baselineCommit !== remoteHead) { + return `baseline branch advanced (${cacheInfo.baselineCommit.slice(0, 7)} → ${remoteHead.slice(0, 7)})`; + } + return undefined; +} + +async function canReuseCache( + project: Project, + cacheDir: string, + cacheInfoPath: string, + logger: Logger +): Promise<boolean> { + if (!existsSync(join(cacheDir, '.git'))) { + return false; + } + + const cacheInfo = await readCacheInfo(cacheInfoPath); + if (!cacheInfo) { + logger.logStep('Refreshing cache (missing or invalid cache metadata)...'); + await clearCache(cacheDir, cacheInfoPath); + return false; + } + + const remoteHead = await getRemoteBranchHead(project.repo, project.branch, logger); + const refreshReason = getCacheRefreshReason(project, cacheInfo, remoteHead); + if (!refreshReason) { + return true; + } + + logger.logStep(`Refreshing cache (${refreshReason})...`); + await clearCache(cacheDir, cacheInfoPath); + return false; +} + +async function persistCache( + cacheDir: string, + cacheInfoPath: string, + repoRoot: string, + cacheInfo: TrialCacheInfo +) { + await clearCache(cacheDir, cacheInfoPath); + await cp(repoRoot, cacheDir, { recursive: true }); + await writeFile(cacheInfoPath, JSON.stringify(cacheInfo, null, 2)); +} + +async function readCacheInfo(cacheInfoPath: string): Promise<TrialCacheInfo | undefined> { + if (!existsSync(cacheInfoPath)) { + return undefined; + } + + 
try { + return JSON.parse(await readFile(cacheInfoPath, 'utf-8')) as TrialCacheInfo; + } catch { + return undefined; + } +} + +async function getGitHead(cwd: string): Promise<string> { + return (await x('git', ['rev-parse', 'HEAD'], { nodeOptions: { cwd } })).stdout.trim(); +} + +async function getRemoteBranchHead( + repo: string, + branch: string, + logger: Logger +): Promise<string | undefined> { + const result = await x('git', ['ls-remote', repo, `refs/heads/${branch}`], { + throwOnError: false, + timeout: 120_000, + }); + if (result.exitCode !== 0) { + logger.logStep(`Could not verify remote HEAD for ${repo}#${branch}; reusing cache as-is.`); + return undefined; + } + + const line = result.stdout.trim().split('\n').find(Boolean); + return line?.split('\t')[0]?.trim() || undefined; +} + +async function clearCache(cacheDir: string, cacheInfoPath: string) { + await Promise.all([ + rm(cacheDir, { recursive: true, force: true }), + rm(cacheInfoPath, { force: true }), + ]); +} diff --git a/scripts/eval/lib/projects.test.ts b/scripts/eval/lib/projects.test.ts new file mode 100644 index 000000000000..b80238500f8e --- /dev/null +++ b/scripts/eval/lib/projects.test.ts @@ -0,0 +1,32 @@ +import { describe, expect, it } from 'vitest'; + +import { PROJECTS } from './projects'; + +const githubRepoUrl = /^https:\/\/github\.com\/[^/]+\/[^/]+$/; + +describe('PROJECTS', () => { + it('pins every benchmark project to a pre-initialized eval-baseline repo', () => { + expect(PROJECTS.length).toBeGreaterThan(0); + + for (const project of PROJECTS) { + expect(project).toMatchObject({ + branch: 'eval-baseline', + repo: expect.stringMatching(githubRepoUrl), + description: expect.any(String), + }); + } + }); + + it('keeps benchmark project metadata unambiguous', () => { + const names = PROJECTS.map((p) => p.name); + const repos = PROJECTS.map((p) => p.repo); + + expect(new Set(names).size).toBe(names.length); + expect(new Set(repos).size).toBe(repos.length); + + for (const project of PROJECTS) { + if 
(!project.projectDir) continue; + expect(project.projectDir).toMatch(/^(?!\/)(?!\.\.?(?:\/|$)).+/); + } + }); +}); diff --git a/scripts/eval/lib/projects.ts b/scripts/eval/lib/projects.ts new file mode 100644 index 000000000000..0046ed30bac4 --- /dev/null +++ b/scripts/eval/lib/projects.ts @@ -0,0 +1,48 @@ +export interface Project { + name: string; + repo: string; + branch: string; + projectDir?: string; + description?: string; +} + +export const PROJECTS: Project[] = [ + { + name: 'mealdrop', + repo: 'https://github.com/kasperpeulen/mealdrop', + branch: 'eval-baseline', + description: 'Styled components, Redux, React Router', + }, + { + name: 'edgy', + repo: 'https://github.com/kasperpeulen/edgy', + branch: 'eval-baseline', + description: 'Tailwind, HeadlessUI, React Router', + }, + { + name: 'wikitok', + repo: 'https://github.com/kasperpeulen/wikitok', + branch: 'eval-baseline', + projectDir: 'frontend', + description: 'Simple project with Tailwind', + }, + { + name: 'baklava', + repo: 'https://github.com/kasperpeulen/baklava', + branch: 'eval-baseline', + description: 'Component library with Zustand', + }, + { + name: 'echarts', + repo: 'https://github.com/kasperpeulen/echarts-react', + branch: 'eval-baseline', + description: 'ECharts React wrapper', + }, + { + name: 'evergreen-ci', + repo: 'https://github.com/kasperpeulen/ui', + branch: 'eval-baseline', + projectDir: 'packages/lib', + description: 'GraphQL', + }, +]; diff --git a/scripts/eval/lib/run-trial.test.ts b/scripts/eval/lib/run-trial.test.ts new file mode 100644 index 000000000000..8b7f79bd07c0 --- /dev/null +++ b/scripts/eval/lib/run-trial.test.ts @@ -0,0 +1,233 @@ +import { mkdirSync, readFileSync, rmSync } from 'node:fs'; +import { join } from 'node:path'; +import { tmpdir } from 'node:os'; + +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; + +import type { TrialConfig, TrialReport } from './run-trial'; + +// Mock external dependencies to avoid real git/storybook/vitest 
calls +vi.mock('./prepare-trial', () => ({ + prepareTrial: vi.fn(), +})); +vi.mock('./grade', () => ({ + grade: vi.fn(), +})); +vi.mock('./utils', async (importOriginal) => { + const actual = await importOriginal(); + return { + ...actual, + captureEnvironment: vi.fn().mockResolvedValue({ + nodeVersion: 'v22.21.1', + evalBranch: 'test-branch', + evalCommit: 'abc123', + }), + }; +}); +vi.mock('./agents/claude-code', () => ({ + claudeAgent: { name: 'claude', execute: vi.fn() }, +})); +vi.mock('./agents/codex', () => ({ + codexAgent: { name: 'codex', execute: vi.fn() }, +})); + +import { claudeAgent } from './agents/claude-code'; +import { grade } from './grade'; +import { prepareTrial } from './prepare-trial'; +import { runTrial } from './run-trial'; +import { captureEnvironment } from './utils'; + +let TMP: string; + +beforeEach(() => { + vi.clearAllMocks(); + TMP = join(tmpdir(), `eval-run-trial-${Date.now()}`); + mkdirSync(join(TMP, 'results'), { recursive: true }); +}); + +afterEach(() => { + rmSync(TMP, { recursive: true, force: true }); +}); + +const baseConfig: TrialConfig = { + project: { name: 'test-project', repo: 'https://github.com/test/repo', branch: 'main' }, + variant: { agent: 'claude', model: 'sonnet-4.6', effort: 'high' }, + prompt: 'setup', +}; + +describe('runTrial pipeline', () => { + it('assembles a complete TrialReport from pipeline steps', async () => { + setupMocks(); + + const result = await runTrial(baseConfig); + + expect(result).toMatchObject({ + schemaVersion: 1, + project: { name: 'test-project', repo: 'https://github.com/test/repo', branch: 'main' }, + variant: { agent: 'claude', model: 'sonnet-4.6', effort: 'high' }, + prompt: 'setup', + baselineCommit: 'deadbeef', + execution: { + cost: 0.42, + duration: 45.2, + turns: 12, + }, + grade: { + buildSuccess: true, + }, + score: { + score: 1, + }, + }); + expect(result.timestamp).toMatch(/^\d{4}-\d{2}-\d{2}T/); + }); + + it('calls pipeline steps with correct arguments', async () => { + 
setupMocks(); + + const config: TrialConfig = { + ...baseConfig, + project: { + name: 'mealdrop', + repo: 'https://github.com/test/mealdrop', + branch: 'eval-baseline', + }, + }; + + await runTrial(config); + + expect(vi.mocked(prepareTrial).mock.calls[0][0]).toMatchObject({ + name: 'mealdrop', + repo: 'https://github.com/test/mealdrop', + branch: 'eval-baseline', + }); + expect(vi.mocked(prepareTrial).mock.calls[0][2]).toBeDefined(); + + expect(vi.mocked(captureEnvironment).mock.calls[0][0]).toBe(join(TMP, 'results')); + + const params = vi.mocked(claudeAgent.execute).mock.calls[0][0]; + expect(params).toMatchObject({ + prompt: expect.stringContaining('set up Storybook'), + projectPath: TMP, + variant: { agent: 'claude', model: 'sonnet-4.6', effort: 'high' }, + resultsDir: join(TMP, 'results'), + }); + expect(params.logger).toBeDefined(); + + const gradeWorkspace = vi.mocked(grade).mock.calls[0][0]; + expect(gradeWorkspace).toMatchObject({ + baselineCommit: 'deadbeef', + projectPath: TMP, + resultsDir: join(TMP, 'results'), + }); + expect(vi.mocked(grade).mock.calls[0][1]).toBeDefined(); + }); + + it('writes summary.json and prompt.md to results dir', async () => { + setupMocks(); + + await runTrial(baseConfig); + + const resultsDir = join(TMP, 'results'); + + const summary: TrialReport = JSON.parse( + readFileSync(join(resultsDir, 'summary.json'), 'utf-8') + ); + expect(summary).toMatchObject({ + schemaVersion: 1, + execution: { cost: 0.42 }, + grade: { buildSuccess: true }, + }); + + const promptContent = readFileSync(join(resultsDir, 'prompt.md'), 'utf-8'); + expect(promptContent).toContain('set up Storybook'); + }); + + it('propagates failed build into result', async () => { + setupMocks({ buildSuccess: false, typeCheckErrors: 5 }); + + await expect(runTrial(baseConfig)).resolves.toMatchObject({ + grade: { buildSuccess: false, typeCheckErrors: 5 }, + score: { score: 0.3 }, + }); + }); + + it('does not call grade before agent finishes', async () => { + // Use 
execution order tracking to verify sequencing + const callOrder: string[] = []; + + vi.mocked(prepareTrial).mockImplementation(async () => { + callOrder.push('prepare'); + return { + trialDir: TMP, + repoRoot: TMP, + projectPath: TMP, + resultsDir: join(TMP, 'results'), + baselineCommit: 'deadbeef', + }; + }); + + vi.mocked(claudeAgent.execute).mockImplementation(async () => { + callOrder.push('agent'); + return { cost: 0.1, duration: 10, turns: 3 }; + }); + + vi.mocked(grade).mockImplementation(async () => { + callOrder.push('grade'); + return { + grade: { + buildSuccess: true, + typeCheckErrors: 0, + fileChanges: [], + storybookChanges: [], + }, + score: { score: 1, breakdown: { build: 1, typecheck: 1, ghostStories: 0, performance: 0 } }, + }; + }); + + await runTrial(baseConfig); + + expect(callOrder).toEqual(['prepare', 'agent', 'grade']); + }); +}); + +function setupMocks(overrides?: { + buildSuccess?: boolean; + typeCheckErrors?: number; + cost?: number; +}) { + const { buildSuccess = true, typeCheckErrors = 0, cost = 0.42 } = overrides ?? {}; + + vi.mocked(prepareTrial).mockResolvedValue({ + trialDir: TMP, + repoRoot: TMP, + projectPath: TMP, + resultsDir: join(TMP, 'results'), + baselineCommit: 'deadbeef', + }); + + vi.mocked(claudeAgent.execute).mockResolvedValue({ + cost, + duration: 45.2, + turns: 12, + }); + + vi.mocked(grade).mockResolvedValue({ + grade: { + buildSuccess, + typeCheckErrors, + fileChanges: [ + { path: '.storybook/preview.tsx', gitStatus: 'A' }, + { path: 'src/Button.stories.tsx', gitStatus: 'A' }, + ], + storybookChanges: [ + { path: '.storybook/preview.tsx', gitStatus: 'A' }, + { path: 'src/Button.stories.tsx', gitStatus: 'A' }, + ], + }, + score: { + score: buildSuccess ? 1 : 0.3, + breakdown: { build: buildSuccess ? 
1 : 0, typecheck: 1, ghostStories: 0, performance: 0 }, + }, + }); +} diff --git a/scripts/eval/lib/run-trial.ts b/scripts/eval/lib/run-trial.ts new file mode 100644 index 000000000000..fc8dde20fff8 --- /dev/null +++ b/scripts/eval/lib/run-trial.ts @@ -0,0 +1,96 @@ +import { writeFile } from 'node:fs/promises'; +import { join } from 'node:path'; +import type { Logger } from './utils.ts'; +import type { AgentId, AgentDriver, AgentVariant, Execution } from './agents/config.ts'; +import type { Project } from './projects.ts'; +import { grade, type Grade, type QualityScore } from './grade.ts'; +import { claudeAgent } from './agents/claude-code.ts'; +import { codexAgent } from './agents/codex.ts'; +import { prepareTrial } from './prepare-trial.ts'; +import { generateTrialId, loadPrompt, captureEnvironment, createLogger } from './utils.ts'; + +export interface TrialConfig { + /** Which project to evaluate (cloned from its eval-baseline branch). */ + project: Project; + /** Agent, model, and effort level. */ + variant: AgentVariant; + /** Prompt name: maps to `prompts/{name}.md` (e.g. "setup"). */ + prompt: string; + /** Log agent messages to stdout. */ + verbose?: boolean; +} + +export interface TrialReport { + schemaVersion: 1; + project: Project; + variant: AgentVariant; + prompt: string; + timestamp: string; + baselineCommit: string; + execution: Execution; + grade: Grade; + score: QualityScore; +} + +const drivers: Record<AgentId, AgentDriver> = { + claude: claudeAgent, + codex: codexAgent, +}; + +/** + * Run a full eval trial: prepare -> execute agent -> grade -> save. + */ +export async function runTrial(config: TrialConfig, logger?: Logger): Promise<TrialReport> { + const { project, variant, prompt: promptName } = config; + const { agent: agentName, model } = variant; + const log = logger ?? createLogger(); + const trialId = generateTrialId(project.name, agentName, model, promptName || 'setup'); + const timestamp = new Date().toISOString(); + + log.log(`Preparing ${project.name}...`); + + // 1. 
Prepare the trial + const workspace = await prepareTrial(project, trialId, log); + + // 2. Capture environment + await captureEnvironment(workspace.resultsDir); + + // 3. Load the prompt + const prompt = loadPrompt(promptName); + await writeFile(join(workspace.resultsDir, 'prompt.md'), prompt); + + // 4. Execute the agent + log.log(` Running ${agentName} (${model}, effort=${variant.effort})...`); + const driver = drivers[agentName]; + const execution = await driver.execute({ + prompt, + projectPath: workspace.projectPath, + variant, + resultsDir: workspace.resultsDir, + logger: log, + }); + log.logSuccess( + `Agent completed (${Math.round(execution.duration)}s, ${execution.cost ? `$${execution.cost.toFixed(2)}` : 'cost N/A'}, ${execution.turns} turns)` + ); + + // 5. Grade the results (pass agent duration for performance scoring) + const { grade: trialGrade, score } = await grade(workspace, log, execution.duration); + + // 6. Assemble final report + const report: TrialReport = { + schemaVersion: 1, + project, + variant, + timestamp, + prompt: promptName || 'setup', + baselineCommit: workspace.baselineCommit, + execution, + grade: trialGrade, + score, + }; + + await writeFile(join(workspace.resultsDir, 'summary.json'), JSON.stringify(report, null, 2)); + log.logSuccess(`Results saved to ${workspace.resultsDir}`); + + return report; +} diff --git a/scripts/eval/lib/utils.test.ts b/scripts/eval/lib/utils.test.ts new file mode 100644 index 000000000000..7b4ebe4e5024 --- /dev/null +++ b/scripts/eval/lib/utils.test.ts @@ -0,0 +1,144 @@ +import { describe, expect, it } from 'vitest'; + +import { + formatDuration, + formatCost, + generateTrialId, + loadPrompt, + listPrompts, + formatTable, +} from './utils'; + +describe('formatDuration', () => { + it('formats seconds under a minute', () => { + expect(formatDuration(0)).toBe('0s'); + expect(formatDuration(1)).toBe('1s'); + expect(formatDuration(45)).toBe('45s'); + }); + + it('rounds fractional seconds', () => { + 
expect(formatDuration(2.7)).toBe('3s'); + expect(formatDuration(59.4)).toBe('59s'); + }); + + it('formats minutes and seconds', () => { + expect(formatDuration(60)).toBe('1m0s'); + expect(formatDuration(61)).toBe('1m1s'); + expect(formatDuration(90)).toBe('1m30s'); + expect(formatDuration(125)).toBe('2m5s'); + expect(formatDuration(3661)).toBe('61m1s'); + }); +}); + +describe('formatCost', () => { + it('returns dash for undefined', () => { + expect(formatCost(undefined)).toBe('-'); + expect(formatCost()).toBe('-'); + }); + + it('formats dollar amounts', () => { + expect(formatCost(0)).toBe('$0.00'); + expect(formatCost(1.5)).toBe('$1.50'); + }); +}); + +describe('generateTrialId', () => { + it('contains project, agent, model, and prompt', () => { + const id = generateTrialId('mealdrop', 'claude', 'sonnet-4.6', 'setup'); + expect(id).toContain('mealdrop'); + expect(id).toContain('claude'); + expect(id).toContain('sonnet-4.6'); + expect(id).toContain('setup'); + }); + + it('starts with an ISO-like timestamp', () => { + const id = generateTrialId('proj', 'agent', 'model', 'prompt'); + expect(id).toMatch(/^\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}/); + }); + + it('generates unique IDs', () => { + const a = generateTrialId('p', 'a', 'm', 'pr'); + const b = generateTrialId('p', 'a', 'm', 'pr'); + expect(a).not.toBe(b); + }); +}); + +describe('listPrompts', () => { + it('lists available prompt names', () => { + const prompts = listPrompts(); + expect(prompts).toContain('setup'); + }); + + it('returns only names without .md extension', () => { + for (const name of listPrompts()) { + expect(name).not.toContain('.md'); + } + }); +}); + +describe('loadPrompt', () => { + it('loads setup prompt by default', () => { + const prompt = loadPrompt(); + expect(prompt).toContain('Storybook'); + expect(prompt.length).toBeGreaterThan(0); + }); + + it('loads setup prompt by name', () => { + const prompt = loadPrompt('setup'); + expect(prompt).toContain('Storybook'); + 
expect(prompt).toContain('### Step 1'); + }); + + it('throws for unknown prompt', () => { + expect(() => loadPrompt('nonexistent-prompt-xyz')).toThrow('Prompt not found'); + }); + + it('returns trimmed content', () => { + const prompt = loadPrompt('setup'); + expect(prompt).toBe(prompt.trim()); + }); +}); + +describe('formatTable', () => { + it('formats a simple table with aligned columns', () => { + const result = formatTable( + ['Name', 'Score'], + [ + ['Alice', '100'], + ['Bob', '95'], + ] + ); + const lines = result.split('\n'); + expect(lines).toHaveLength(4); // header + divider + 2 rows + expect(lines[0]).toContain('Name'); + expect(lines[0]).toContain('Score'); + expect(lines[1]).toMatch(/^-+\+-+$/); + expect(lines[2]).toContain('Alice'); + expect(lines[3]).toContain('Bob'); + }); + + it('auto-sizes columns to fit content', () => { + const result = formatTable(['X', 'Y'], [['short', 'a-much-longer-value']]); + const lines = result.split('\n'); + // Header column for Y should be padded to match the data width + const headerCols = lines[0].split(' | '); + const dataCols = lines[2].split(' | '); + expect(headerCols[1].trim().length).toBeLessThanOrEqual(dataCols[1].trim().length); + }); + + it('handles ANSI escape codes in cells', () => { + const green = '\x1b[32mPASS\x1b[39m'; + const result = formatTable(['Status'], [[green], ['FAIL']]); + const lines = result.split('\n'); + // Both rows should be the same visible width + // The ANSI row has extra invisible chars but should still align + expect(lines[2]).toContain('PASS'); + expect(lines[3]).toContain('FAIL'); + }); + + it('handles empty rows', () => { + const result = formatTable(['A', 'B'], []); + const lines = result.split('\n'); + expect(lines).toHaveLength(2); // header + divider only + }); +}); diff --git a/scripts/eval/lib/utils.ts b/scripts/eval/lib/utils.ts new file mode 100644 index 000000000000..79d24891f227 --- /dev/null +++ b/scripts/eval/lib/utils.ts @@ -0,0 +1,101 @@ +import { readFileSync, 
existsSync, readdirSync } from 'node:fs'; +import { writeFile } from 'node:fs/promises'; +import { resolve, basename, join } from 'node:path'; +import pc from 'picocolors'; +import { x } from 'tinyexec'; + +export interface Logger { + log: (msg: string) => void; + logStep: (msg: string) => void; + logSuccess: (msg: string) => void; + logError: (msg: string) => void; +} + +export const REPO_ROOT = resolve(import.meta.dirname, '..', '..', '..'); +export const EVAL_ROOT = resolve(REPO_ROOT, '..', 'storybook-eval'); +export const CACHE_DIR = resolve(EVAL_ROOT, '.cache', 'repos'); +export const TRIALS_DIR = resolve(EVAL_ROOT, 'trials'); +export const PROMPTS_DIR = resolve(import.meta.dirname, '..', 'prompts'); + +export function createLogger(prefix?: string): Logger { + const p = prefix ? pc.dim(`[${prefix}]`) + ' ' : ''; + return { + log: (msg: string) => console.log(`${p}${msg}`), + logStep: (msg: string) => console.log(`${p} ${pc.cyan('>')} ${msg}`), + logSuccess: (msg: string) => console.log(`${p} ${pc.green('✓')} ${msg}`), + logError: (msg: string) => console.log(`${p} ${pc.red('✗')} ${msg}`), + }; +} + +export const formatDuration = (s: number) => + s < 60 ? `${Math.round(s)}s` : `${Math.floor(s / 60)}m${Math.round(s % 60)}s`; + +export const formatCost = (cost?: number) => (cost == null ? '-' : `$${cost.toFixed(2)}`); + +export function generateTrialId(project: string, agent: string, model: string, prompt: string) { + const ts = new Date().toISOString().replace(/[:.]/g, '-').slice(0, 19); + return `${ts}-${project}-${agent}-${model}-${prompt}-${crypto.randomUUID().slice(0, 8)}`; +} + +/** Format data as an aligned table with automatic column widths. */ +export function formatTable(headers: string[], rows: string[][]): string { + const widths = headers.map((h, i) => + Math.max(h.length, ...rows.map((r) => stripAnsi(r[i] ?? 
'').length)) + ); + + const pad = (str: string, width: number) => { + const visible = stripAnsi(str).length; + return str + ' '.repeat(Math.max(0, width - visible)); + }; + + const sep = ' | '; + return [ + headers.map((h, i) => pad(h, widths[i])).join(sep), + widths.map((w) => '-'.repeat(w)).join('-+-'), + ...rows.map((row) => row.map((cell, i) => pad(cell, widths[i])).join(sep)), + ].join('\n'); +} + +/** Load a prompt by name from prompts/{name}.md. */ +export function loadPrompt(name = 'setup'): string { + const file = resolve(PROMPTS_DIR, `${name}.md`); + if (!existsSync(file)) { + throw new Error(`Prompt not found: ${file}\nAvailable: ${listPrompts().join(', ')}`); + } + return readFileSync(file, 'utf-8').trim(); +} + +/** List available prompt names. */ +export function listPrompts(): string[] { + if (!existsSync(PROMPTS_DIR)) return []; + return readdirSync(PROMPTS_DIR) + .filter((f) => f.endsWith('.md')) + .map((f) => basename(f, '.md')); +} + +export interface EvalEnvironment { + nodeVersion: string; + /** Git branch of the eval harness (storybook monorepo), not the evaluated project. */ + evalBranch: string; + /** Git commit of the eval harness (storybook monorepo), not the evaluated project. */ + evalCommit: string; +} + +export async function captureEnvironment(resultsDir: string): Promise<EvalEnvironment> { + let evalBranch = 'unknown'; + let evalCommit = 'unknown'; + try { + evalBranch = (await x('git', ['rev-parse', '--abbrev-ref', 'HEAD'])).stdout.trim(); + evalCommit = (await x('git', ['rev-parse', 'HEAD'])).stdout.trim(); + } catch { + /* not in a git repo */ + } + const env: EvalEnvironment = { nodeVersion: process.version, evalBranch, evalCommit }; + await writeFile(join(resultsDir, 'environment.json'), JSON.stringify(env, null, 2)); + return env; +} + +/** Strip ANSI escape codes for accurate width calculation. 
*/ +function stripAnsi(str: string) { + return str.replace(/\x1b\[[0-9;]*m/g, ''); +} diff --git a/scripts/eval/prompts/setup.md b/scripts/eval/prompts/setup.md new file mode 100644 index 000000000000..4021e67becde --- /dev/null +++ b/scripts/eval/prompts/setup.md @@ -0,0 +1,196 @@ +Attention: The following instructions must be followed in order to successfully set up Storybook in this project. Do not skip steps or attempt to do them out of order. + +Your goal is to make Storybook fully functional in this project by analyzing the codebase, +configuring the preview with the right decorators, and writing example stories for 9 components. + +Work through these steps in order. After each story file, run Vitest to verify it renders. +If the test fails, read the error, fix the issue, and re-run until it passes before moving on. + +### Step 1: Analyze the codebase + +Before writing any stories, understand what the components need to render: + +- Scan the project for context providers, theme systems, routers, stores, and i18n setups. + Look at the app's entry point (e.g. `App.tsx`, `main.tsx`, `layout.tsx`) to see what + providers wrap the component tree. +- Identify global CSS or style imports required for components to look correct. +- Note any path aliases configured in tsconfig or bundler config. +- Read `.storybook/main.ts` (or `main.js`) to find the `stories` glob patterns. + Your story files must match those patterns to be picked up by Storybook. + +### Step 2: Configure `.storybook/preview.ts` with decorators + +Add decorators that wrap every story with the providers your components need. +Without this, most non-trivial components will crash. 
+ +If the project uses CSF Factory (look for `definePreview` in `.storybook/preview.ts`): +```ts +// .storybook/preview.ts +import '../src/index.css'; // import global styles + +import { definePreview } from 'storybook/preview'; + +export const config = definePreview({ + decorators: [ + (Story) => ( + + + + + + ), + ], +}); +``` + +Otherwise: +```ts +// .storybook/preview.ts +import '../src/index.css'; // import global styles + +const preview = { + decorators: [ + (Story) => ( + + + + + + ), + ], +}; +export default preview; +``` + +Common decorators to add: +- **Theme providers** (e.g. ThemeProvider, MUI ThemeProvider, styled-components, Tailwind) +- **Router** (e.g. MemoryRouter, BrowserRouter mock) +- **State stores** (e.g. Redux Provider, Zustand, Jotai) +- **i18n** (e.g. IntlProvider, I18nextProvider) +- **Global CSS**: import global stylesheets at the top of preview.ts + +### Step 3: Write stories for 9 components + +Pick 9 real components from the codebase, 3 of each complexity level. +Use the title prefix `AI Generated/<Complexity>/<ComponentName>` so they are grouped +together in the Storybook sidebar. + +**Simple (3 components)**: Presentational with few props, no internal state. +Examples: Button, Badge, Avatar, Icon, Label, Chip. +Title format: `AI Generated/Simple/<ComponentName>` + +**Medium (3 components)**: Multiple visual variants or composed from simpler components. +Examples: Card, Alert, Input, Select, Tooltip, Tabs. +Title format: `AI Generated/Medium/<ComponentName>` + +**Complex (3 components)**: Internal state, side effects, or deep composition. +Examples: Modal, DataTable, Form, Dropdown, Accordion, Sidebar. +Title format: `AI Generated/Complex/<ComponentName>` + +For each component, create a `<ComponentName>.stories.ts` file next to the component. +Each file must have at least 2 story exports covering the component's main states. +Make sure the file location and naming match the `stories` patterns in `.storybook/main.ts`. 
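A quick way to sanity-check the last requirement is to test a candidate story path against the `stories` glob before running anything. The sketch below is illustrative only: `globToRegExp` and `matchesStoriesGlob` are hypothetical helpers, not Storybook APIs, and they handle just the `**/`, `*`, and `@(a|b)` constructs. It also assumes patterns are written relative to the `.storybook` directory and that story paths are relative to the project root.

```typescript
// Illustrative sketch only: these helpers are hypothetical, not part of Storybook.
// Supported glob subset: `**/`, `*`, and `@(a|b|c)`. Everything else is literal.
import { posix } from 'node:path';

function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

export function globToRegExp(glob: string): RegExp {
  let source = '';
  let i = 0;
  while (i < glob.length) {
    // Position of the closing paren when an `@(` group starts here, else -1.
    const close = glob.startsWith('@(', i) ? glob.indexOf(')', i) : -1;
    if (glob.startsWith('**/', i)) {
      source += '(?:[^/]+/)*'; // zero or more nested directories
      i += 3;
    } else if (glob.charAt(i) === '*') {
      source += '[^/]*'; // any run of characters inside one path segment
      i += 1;
    } else if (close !== -1) {
      const alternatives = glob.slice(i + 2, close).split('|').map(escapeRegExp);
      source += `(?:${alternatives.join('|')})`; // exactly one listed alternative
      i = close + 1;
    } else {
      source += escapeRegExp(glob.charAt(i));
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

/** Assumption: `pattern` is relative to `configDir`, `storyPath` to the project root. */
export function matchesStoriesGlob(
  storyPath: string,
  pattern: string,
  configDir = '.storybook'
): boolean {
  const resolved = posix.join(configDir, pattern); // join also collapses `../`
  return globToRegExp(resolved).test(posix.normalize(storyPath));
}
```

For example, with a pattern such as `../src/**/*.stories.@(js|jsx|mjs|ts|tsx)`, the path `src/components/Button.stories.tsx` matches while `lib/Button.stories.tsx` does not. Real projects may use glob features outside this subset, so treat a negative result as a hint to re-read `main.ts`, not as proof.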
+ +If the project uses CSF Factory (look for `definePreview` / `config.meta` patterns): + +Story format (CSF Factory): +```ts +import { config } from '#.storybook/preview'; +import { Button } from './Button'; + +const meta = config.meta({ + title: 'AI Generated/Simple/Button', + component: Button, +}); + +export const Default = meta.story({ + args: { + label: 'Click me', + }, +}); + +export const Disabled = meta.story({ + args: { + label: 'Disabled', + disabled: true, + }, +}); +``` + +Otherwise: + +Story format (CSF): +```ts +import type { Meta, StoryObj } from '@storybook/react'; +import { Button } from './Button'; + +const meta = { + title: 'AI Generated/Simple/Button', + component: Button, +} satisfies Meta<typeof Button>; + +export default meta; +type Story = StoryObj<typeof meta>; + +export const Default: Story = { + args: { + label: 'Click me', + }, +}; + +export const Disabled: Story = { + args: { + label: 'Disabled', + disabled: true, + }, +}; +``` + +Rules: +- Every named export is a story. Use `args` to set props. +- Provide all required props via `args`; check the component's types to find them. +- If a component needs per-story decorators (beyond the global ones), add them in the meta. +- Do NOT use `any` types. Use the component's prop types for type safety. + +Reference: https://storybook.js.org/docs/latest/writing-stories + +### Step 4: Verify each story with Vitest + +After writing each story file, immediately verify it: + +```bash +npx vitest --project storybook +``` + +**Self-healing loop (repeat for every story file):** +1. Write/update the story file +2. Run `npx vitest --project storybook ` +3. If it fails: read the error output carefully + - Missing provider → add a decorator in `.storybook/preview.ts` or in the story meta + - Missing prop → add the required prop to `args` + - Import error → fix the import path + - CSS/asset error → add static dirs or import the stylesheet +4. Fix the issue and go back to step 2 +5.
Once the test passes, move to the next component + +After all 9 story files pass individually, run the full suite: +```bash +npx vitest --project storybook +``` + +### Checklist + +- [ ] Analyzed codebase for providers, global styles, and path aliases +- [ ] Read story patterns from `.storybook/main.ts` +- [ ] Configured `.storybook/preview.ts` with necessary decorators +- [ ] Simple component 1: story written and passing +- [ ] Simple component 2: story written and passing +- [ ] Simple component 3: story written and passing +- [ ] Medium component 1: story written and passing +- [ ] Medium component 2: story written and passing +- [ ] Medium component 3: story written and passing +- [ ] Complex component 1: story written and passing +- [ ] Complex component 2: story written and passing +- [ ] Complex component 3: story written and passing +- [ ] Full Vitest suite passes: `npx vitest --project storybook` +- [ ] Run `npx storybook doctor` to check for common issues (version mismatches, duplicated deps, etc.) 
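Reviewer note: the self-healing loop in Step 4 of `setup.md` can be sketched as a bounded retry driver. This is a minimal illustration and not part of the PR. `verifyWithRetries`, `Runner`, `Fixer`, and the `maxAttempts` cap are hypothetical names and choices; the prompt itself says to re-run until the test passes and sets no cap. The `stripAnsi` helper mirrors the one added elsewhere in this diff.

```typescript
/** Outcome of one verification run for a single story file. */
interface RunResult {
  passed: boolean;
  output: string;
}

/** Hypothetical callback that runs the test command for one story file. */
type Runner = (storyFile: string) => RunResult;

/** Hypothetical callback that applies a fix based on the cleaned error output. */
type Fixer = (storyFile: string, cleanedOutput: string) => void;

/** Removes ANSI color codes so error text is easier to inspect. */
function stripAnsi(str: string): string {
  return str.replace(/\x1b\[[0-9;]*m/g, '');
}

/**
 * Runs the verify/fix cycle for one story file.
 * Assumption: the loop is capped at `maxAttempts` runs so it cannot spin forever.
 */
function verifyWithRetries(
  storyFile: string,
  run: Runner,
  fix: Fixer,
  maxAttempts = 3
): { passed: boolean; attempts: number } {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = run(storyFile);
    if (result.passed) {
      return { passed: true, attempts: attempt };
    }
    // Only attempt a fix when another run will follow it.
    if (attempt < maxAttempts) {
      fix(storyFile, stripAnsi(result.output));
    }
  }
  return { passed: false, attempts: maxAttempts };
}
```

In practice the `run` callback would wrap the Vitest command from Step 4 and `fix` would be the agent's edit step. Both are injected here so the loop can be exercised without a Storybook project.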
diff --git a/scripts/package.json b/scripts/package.json index 11ca41bd541d..48fbc54c8704 100644 --- a/scripts/package.json +++ b/scripts/package.json @@ -9,6 +9,7 @@ "check": "jiti ./check/check-package.ts", "check-package": "jiti ./check-package.ts", "docs:codemod": "jiti ./snippets/codemod.ts", + "eval": "node ./eval/eval.ts", "generate-sandboxes": "jiti ./sandbox/generate.ts", "get-report-message": "jiti ./get-report-message.ts", "get-sandbox-dir": "jiti ./get-sandbox-dir.ts", @@ -41,10 +42,12 @@ }, "dependencies": { "@actions/core": "^1.11.1", + "@anthropic-ai/claude-agent-sdk": "^0.2.85", "@fal-works/esbuild-plugin-global-externals": "^2.1.2", "@google-cloud/bigquery": "^6.2.1", "@octokit/graphql": "^5.0.6", "@octokit/request": "^8.4.1", + "@openai/codex-sdk": "^0.117.0", "@polka/parse": "^1.0.0-next.28", "@testing-library/dom": "^10.4.0", "@testing-library/jest-dom": "^6.9.1", @@ -73,6 +76,7 @@ "@vitest/coverage-v8": "^4.1.0", "ansi-regex": "^6.0.1", "chromatic": "^13.3.4", + "citty": "^0.2.1", "codecov": "^3.8.1", "commander": "^14.0.2", "cross-env": "^7.0.3", diff --git a/scripts/tsconfig.json b/scripts/tsconfig.json index c8082acb3897..9c5b78519b8b 100644 --- a/scripts/tsconfig.json +++ b/scripts/tsconfig.json @@ -1,6 +1,7 @@ { "compileOnSave": false, "compilerOptions": { + "customConditions": ["code"], "baseUrl": ".", "noEmit": true, "incremental": false, @@ -11,6 +12,8 @@ "moduleResolution": "bundler", "target": "ESNext", "module": "Preserve", + // Required for native Node TS execution (node file.ts) — we are migrating from jiti to native node + "allowImportingTsExtensions": true, "skipLibCheck": true, "allowSyntheticDefaultImports": true, "esModuleInterop": true, diff --git a/yarn.lock b/yarn.lock index cee8c18346d7..aec95012dc21 100644 --- a/yarn.lock +++ b/yarn.lock @@ -436,6 +436,44 @@ __metadata: languageName: node linkType: hard +"@anthropic-ai/claude-agent-sdk@npm:^0.2.85": + version: 0.2.85 + resolution: 
"@anthropic-ai/claude-agent-sdk@npm:0.2.85" + dependencies: + "@img/sharp-darwin-arm64": "npm:^0.34.2" + "@img/sharp-darwin-x64": "npm:^0.34.2" + "@img/sharp-linux-arm": "npm:^0.34.2" + "@img/sharp-linux-arm64": "npm:^0.34.2" + "@img/sharp-linux-x64": "npm:^0.34.2" + "@img/sharp-linuxmusl-arm64": "npm:^0.34.2" + "@img/sharp-linuxmusl-x64": "npm:^0.34.2" + "@img/sharp-win32-arm64": "npm:^0.34.2" + "@img/sharp-win32-x64": "npm:^0.34.2" + peerDependencies: + zod: ^4.0.0 + dependenciesMeta: + "@img/sharp-darwin-arm64": + optional: true + "@img/sharp-darwin-x64": + optional: true + "@img/sharp-linux-arm": + optional: true + "@img/sharp-linux-arm64": + optional: true + "@img/sharp-linux-x64": + optional: true + "@img/sharp-linuxmusl-arm64": + optional: true + "@img/sharp-linuxmusl-x64": + optional: true + "@img/sharp-win32-arm64": + optional: true + "@img/sharp-win32-x64": + optional: true + checksum: 10c0/5bb31712460b03b264b489c38a2ddcac62ba60aad50da8cd6d3cebdaf46fae84c37473f25b7a4e20a6bda6f2310b4cc9f3574bc3f2e8f73a4a6e6bd0e04bd827 + languageName: node + linkType: hard + "@aw-web-design/x-default-browser@npm:1.4.126": version: 1.4.126 resolution: "@aw-web-design/x-default-browser@npm:1.4.126" @@ -2972,7 +3010,7 @@ __metadata: languageName: node linkType: hard -"@img/sharp-darwin-arm64@npm:0.34.5": +"@img/sharp-darwin-arm64@npm:0.34.5, @img/sharp-darwin-arm64@npm:^0.34.2": version: 0.34.5 resolution: "@img/sharp-darwin-arm64@npm:0.34.5" dependencies: @@ -2984,7 +3022,7 @@ __metadata: languageName: node linkType: hard -"@img/sharp-darwin-x64@npm:0.34.5": +"@img/sharp-darwin-x64@npm:0.34.5, @img/sharp-darwin-x64@npm:^0.34.2": version: 0.34.5 resolution: "@img/sharp-darwin-x64@npm:0.34.5" dependencies: @@ -3066,7 +3104,7 @@ __metadata: languageName: node linkType: hard -"@img/sharp-linux-arm64@npm:0.34.5": +"@img/sharp-linux-arm64@npm:0.34.5, @img/sharp-linux-arm64@npm:^0.34.2": version: 0.34.5 resolution: "@img/sharp-linux-arm64@npm:0.34.5" dependencies: @@ -3078,7 +3116,7 
@@ __metadata: languageName: node linkType: hard -"@img/sharp-linux-arm@npm:0.34.5": +"@img/sharp-linux-arm@npm:0.34.5, @img/sharp-linux-arm@npm:^0.34.2": version: 0.34.5 resolution: "@img/sharp-linux-arm@npm:0.34.5" dependencies: @@ -3126,7 +3164,7 @@ __metadata: languageName: node linkType: hard -"@img/sharp-linux-x64@npm:0.34.5": +"@img/sharp-linux-x64@npm:0.34.5, @img/sharp-linux-x64@npm:^0.34.2": version: 0.34.5 resolution: "@img/sharp-linux-x64@npm:0.34.5" dependencies: @@ -3138,7 +3176,7 @@ __metadata: languageName: node linkType: hard -"@img/sharp-linuxmusl-arm64@npm:0.34.5": +"@img/sharp-linuxmusl-arm64@npm:0.34.5, @img/sharp-linuxmusl-arm64@npm:^0.34.2": version: 0.34.5 resolution: "@img/sharp-linuxmusl-arm64@npm:0.34.5" dependencies: @@ -3150,7 +3188,7 @@ __metadata: languageName: node linkType: hard -"@img/sharp-linuxmusl-x64@npm:0.34.5": +"@img/sharp-linuxmusl-x64@npm:0.34.5, @img/sharp-linuxmusl-x64@npm:^0.34.2": version: 0.34.5 resolution: "@img/sharp-linuxmusl-x64@npm:0.34.5" dependencies: @@ -3171,7 +3209,7 @@ __metadata: languageName: node linkType: hard -"@img/sharp-win32-arm64@npm:0.34.5": +"@img/sharp-win32-arm64@npm:0.34.5, @img/sharp-win32-arm64@npm:^0.34.2": version: 0.34.5 resolution: "@img/sharp-win32-arm64@npm:0.34.5" conditions: os=win32 & cpu=arm64 @@ -3185,7 +3223,7 @@ __metadata: languageName: node linkType: hard -"@img/sharp-win32-x64@npm:0.34.5": +"@img/sharp-win32-x64@npm:0.34.5, @img/sharp-win32-x64@npm:^0.34.2": version: 0.34.5 resolution: "@img/sharp-win32-x64@npm:0.34.5" conditions: os=win32 & cpu=x64 @@ -4665,6 +4703,86 @@ __metadata: languageName: node linkType: hard +"@openai/codex-darwin-arm64@npm:@openai/codex@0.117.0-darwin-arm64": + version: 0.117.0-darwin-arm64 + resolution: "@openai/codex@npm:0.117.0-darwin-arm64" + conditions: os=darwin & cpu=arm64 + languageName: node + linkType: hard + +"@openai/codex-darwin-x64@npm:@openai/codex@0.117.0-darwin-x64": + version: 0.117.0-darwin-x64 + resolution: 
"@openai/codex@npm:0.117.0-darwin-x64" + conditions: os=darwin & cpu=x64 + languageName: node + linkType: hard + +"@openai/codex-linux-arm64@npm:@openai/codex@0.117.0-linux-arm64": + version: 0.117.0-linux-arm64 + resolution: "@openai/codex@npm:0.117.0-linux-arm64" + conditions: os=linux & cpu=arm64 + languageName: node + linkType: hard + +"@openai/codex-linux-x64@npm:@openai/codex@0.117.0-linux-x64": + version: 0.117.0-linux-x64 + resolution: "@openai/codex@npm:0.117.0-linux-x64" + conditions: os=linux & cpu=x64 + languageName: node + linkType: hard + +"@openai/codex-sdk@npm:^0.117.0": + version: 0.117.0 + resolution: "@openai/codex-sdk@npm:0.117.0" + dependencies: + "@openai/codex": "npm:0.117.0" + checksum: 10c0/96f86890fd45a4030a8e9b6f8466389a015d0ee534b1661b56463a1fd210c6fc3af0ea1f3ce57306a13a9b6ff6197d6409a4d5af7f6d7c90e672009eee15e3fd + languageName: node + linkType: hard + +"@openai/codex-win32-arm64@npm:@openai/codex@0.117.0-win32-arm64": + version: 0.117.0-win32-arm64 + resolution: "@openai/codex@npm:0.117.0-win32-arm64" + conditions: os=win32 & cpu=arm64 + languageName: node + linkType: hard + +"@openai/codex-win32-x64@npm:@openai/codex@0.117.0-win32-x64": + version: 0.117.0-win32-x64 + resolution: "@openai/codex@npm:0.117.0-win32-x64" + conditions: os=win32 & cpu=x64 + languageName: node + linkType: hard + +"@openai/codex@npm:0.117.0": + version: 0.117.0 + resolution: "@openai/codex@npm:0.117.0" + dependencies: + "@openai/codex-darwin-arm64": "npm:@openai/codex@0.117.0-darwin-arm64" + "@openai/codex-darwin-x64": "npm:@openai/codex@0.117.0-darwin-x64" + "@openai/codex-linux-arm64": "npm:@openai/codex@0.117.0-linux-arm64" + "@openai/codex-linux-x64": "npm:@openai/codex@0.117.0-linux-x64" + "@openai/codex-win32-arm64": "npm:@openai/codex@0.117.0-win32-arm64" + "@openai/codex-win32-x64": "npm:@openai/codex@0.117.0-win32-x64" + dependenciesMeta: + "@openai/codex-darwin-arm64": + optional: true + "@openai/codex-darwin-x64": + optional: true + 
"@openai/codex-linux-arm64": + optional: true + "@openai/codex-linux-x64": + optional: true + "@openai/codex-win32-arm64": + optional: true + "@openai/codex-win32-x64": + optional: true + bin: + codex: bin/codex.js + checksum: 10c0/a5104a396f0f33558c9a402012bf2dd954f5d3465d3b0bb5fe780d265760a3c72b64af4a2d42a0012f661b7e4a274a42c5d4f5582de115613557f480dbec3b5b + languageName: node + linkType: hard + "@oxc-project/runtime@npm:0.115.0": version: 0.115.0 resolution: "@oxc-project/runtime@npm:0.115.0" @@ -8717,10 +8835,12 @@ __metadata: resolution: "@storybook/scripts@workspace:scripts" dependencies: "@actions/core": "npm:^1.11.1" + "@anthropic-ai/claude-agent-sdk": "npm:^0.2.85" "@fal-works/esbuild-plugin-global-externals": "npm:^2.1.2" "@google-cloud/bigquery": "npm:^6.2.1" "@octokit/graphql": "npm:^5.0.6" "@octokit/request": "npm:^8.4.1" + "@openai/codex-sdk": "npm:^0.117.0" "@polka/parse": "npm:^1.0.0-next.28" "@testing-library/dom": "npm:^10.4.0" "@testing-library/jest-dom": "npm:^6.9.1" @@ -8750,6 +8870,7 @@ __metadata: "@vitest/coverage-v8": "npm:^4.1.0" ansi-regex: "npm:^6.0.1" chromatic: "npm:^13.3.4" + citty: "npm:^0.2.1" codecov: "npm:^3.8.1" commander: "npm:^14.0.2" cross-env: "npm:^7.0.3" @@ -13710,6 +13831,13 @@ __metadata: languageName: node linkType: hard +"citty@npm:^0.2.1": + version: 0.2.1 + resolution: "citty@npm:0.2.1" + checksum: 10c0/504ac5aeb076f750bf5f25d40c730083e8ed6112eac2f00dbe341a223c46ad16893ce73dfdb55b2d0da505100b9678968ee0443637c45b21917db48daa5a6977 + languageName: node + linkType: hard + "cjs-module-lexer@npm:^1.2.3": version: 1.4.3 resolution: "cjs-module-lexer@npm:1.4.3"