Merged
63 commits
1303e34
Add LLM eval system for Storybook agentic setup (M0)
kasperpeulen Mar 27, 2026
143577e
Migrate eval agents from CLI to SDK
kasperpeulen Mar 27, 2026
20cc6b9
Incorporate improvements from PR review + pre-prepared repos
kasperpeulen Mar 27, 2026
dce536d
Add Google Sheets upload, run IDs, and environment capture
kasperpeulen Mar 27, 2026
3e74467
Add composable prompt variants and vitest-based self-heal
kasperpeulen Mar 27, 2026
7a8d08b
Simplify eval codebase (-308 lines)
kasperpeulen Mar 27, 2026
6c3e716
Remove cleanEnv from grading — only needed for installDeps
kasperpeulen Mar 27, 2026
e11b9bd
Remove cleanEnv entirely — .npmrc is only in the monorepo, not in tri…
kasperpeulen Mar 27, 2026
2be54f4
Switch from jiti to native Node TS support, add .ts extensions to all…
kasperpeulen Mar 27, 2026
5aabbda
Update models: Sonnet 4.6, Opus 4.6, Haiku 4.5, GPT 5.4 Medium/High
kasperpeulen Mar 27, 2026
986988a
Decouple agent × model × effort as three independent axes
kasperpeulen Mar 27, 2026
1ee462d
Simplify prompt to single name, add per-agent default model
kasperpeulen Mar 27, 2026
06c5f9a
Split into eval.ts (single run) and eval-parallel.ts (8 runs)
kasperpeulen Mar 27, 2026
2336c46
Add prefixed logging for parallel runs
kasperpeulen Mar 27, 2026
ca03d7c
Spawn separate node processes in eval-parallel for multi-core CPU usage
kasperpeulen Mar 27, 2026
8629948
Live-stream prefixed logs from child processes, improve Codex agent l…
kasperpeulen Mar 27, 2026
1606025
Fix Codex agent logging to match actual SDK event/item types
kasperpeulen Mar 27, 2026
47e64e3
Decouple agent and model — choose agent then model independently
kasperpeulen Mar 27, 2026
f6671a1
Clean up names: claude-code→claude, claude-sonnet-4-6→sonnet-4.6
kasperpeulen Mar 27, 2026
5701e8d
Infer agent from model — node eval.ts -m gpt-5.4 auto-selects codex
kasperpeulen Mar 27, 2026
bdbae36
Fix parallel race condition: add prompt + random suffix to trial IDs
kasperpeulen Mar 27, 2026
8819ae2
Use crypto.randomUUID for unique trial IDs
kasperpeulen Mar 27, 2026
3caafda
Fix ghost stories to match core implementation: pass paths as args + …
kasperpeulen Mar 27, 2026
4e04c66
Add tests, import ghost stories utilities from core, switch to parseArgs
kasperpeulen Mar 28, 2026
5051a6d
Simplify eval harness: merge config, options objects, remove duplication
kasperpeulen Mar 28, 2026
f397085
Fix: stop importing parse-vitest-report from core (extensionless impo…
kasperpeulen Mar 28, 2026
9fb35ca
Add .ts extensions to core imports used by eval harness
kasperpeulen Mar 28, 2026
460fc5d
Fix ghost-stories comment to reflect inline vitest parsing approach
kasperpeulen Mar 28, 2026
3dd2246
Use parseVitestResults from core for ghost stories grading
kasperpeulen Mar 28, 2026
663b8e9
Refactor eval harness: injectable logger, Node IPC, parallel grading,…
kasperpeulen Mar 28, 2026
5452a10
Refactor ghost stories: rename runStoryTests→runGhostStories, export …
kasperpeulen Mar 28, 2026
45acc9b
Fix native Node TS execution: use direct file imports with .ts extens…
kasperpeulen Mar 28, 2026
bf5855b
Simplify eval harness: Promise.all, async fs, remove Google Sheets, d…
kasperpeulen Mar 28, 2026
da1f96e
Refactor eval system: Zod schemas, unified CLI, shared utilities
kasperpeulen Mar 28, 2026
9b6085b
Update AGENTS.md and tsconfig comments for native Node TS execution
kasperpeulen Mar 28, 2026
cabe15a
WIP: checkpoint current eval harness changes
kasperpeulen Mar 29, 2026
73d7415
Fix eval ghost-stories globbing lint
kasperpeulen Mar 29, 2026
98a2f74
Refine eval grading review fixes
kasperpeulen Mar 29, 2026
35d5699
Rewrite review-pr skill: scrollable single-page instead of Reveal.js …
kasperpeulen Mar 29, 2026
842ac28
Update review-pr skill: two-layer format, TS-highlighted diffs, reada…
kasperpeulen Mar 29, 2026
6e5fcf4
Update review-pr skill: show full interface bodies in walkthrough
kasperpeulen Mar 29, 2026
87abae4
Refactor: use composition for AgentRunConfig instead of extends
kasperpeulen Mar 29, 2026
c0720ee
Rename eval data structures for clarity
kasperpeulen Mar 29, 2026
920e6d3
Make Project.branch required and remove non-null assertion
kasperpeulen Mar 29, 2026
b5f3a7b
Refine eval API: discriminated AgentVariant union, rename runTask→run…
kasperpeulen Mar 30, 2026
b4bab02
Fix CI: format eval files, fix effort type narrowing in Claude agent
kasperpeulen Mar 30, 2026
35561d2
Eval: import ghost stories from core-server, grade without empty renders
kasperpeulen Mar 30, 2026
3011b46
Update AGENTS.md: use yarn fmt:write from repo root
kasperpeulen Mar 30, 2026
37192c0
Eval: add --manual flag to prepare trial without running the agent
kasperpeulen Mar 30, 2026
7887732
Make review-pr skill project-local and contributor-friendly
kasperpeulen Mar 30, 2026
a55f40a
Colocate eval types and config, remove setup-patterns, drop section c…
kasperpeulen Mar 30, 2026
a11a783
Update review-pr skill instructions
kasperpeulen Mar 30, 2026
b4e9cb8
Improve eval install detection and Codex pricing
kasperpeulen Mar 30, 2026
9a850af
Changes from Codex
kasperpeulen Mar 30, 2026
48a6e57
Changes from Codex
kasperpeulen Mar 30, 2026
db7a142
Restore helper ordering in modified PR files
kasperpeulen Mar 31, 2026
2bd9169
Tune eval defaults and CI resources
kasperpeulen Mar 31, 2026
0c5f06a
Increase CircleCI resources for config generation
kasperpeulen Mar 31, 2026
7a8e2d5
Refine eval harness execution and cache refresh
kasperpeulen Mar 31, 2026
eccfb78
Simplify eval CLI flow
kasperpeulen Mar 31, 2026
8de10d2
Remove eval script from tsconfig exclude list
kasperpeulen Mar 31, 2026
68d0d5a
Increase CircleCI memory for format check
kasperpeulen Mar 31, 2026
65893e9
Increase CircleCI memory for format check again
kasperpeulen Mar 31, 2026
439 changes: 439 additions & 0 deletions .agents/skills/review-pr/SKILL.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion .circleci/config.yml
Expand Up @@ -32,7 +32,7 @@ jobs:
generate-and-run-config:
executor:
name: node/default
-resource_class: small
+resource_class: large
steps:
- node/install:
install-yarn: true
Expand Down
9 changes: 8 additions & 1 deletion .gitignore
Expand Up @@ -79,4 +79,11 @@ CLAUDE.local.md
.cursor/mcp.json
.vscode/mcp.json
.mcp.json
-.nx/polygraph
+.nx/polygraph
+
+# Eval system
+scripts/eval/.cache
+scripts/eval/results
+
+# review-pr skill output
+.pr-review
5 changes: 3 additions & 2 deletions AGENTS.md
Expand Up @@ -9,10 +9,11 @@ This file is the canonical instruction source for coding agents. Files like `CLA
Storybook is a large TypeScript monorepo. The git root is the repo root, the main code lives in `code/`, and build tooling lives in `scripts/`. The default branch is `next`.

- **Base branch**: `next` (all PRs should target `next`, not `main`)
- **Node.js**: `22.21.1` (see `.nvmrc`)
- **Node.js**: `22.22.1` (see `.nvmrc`) — supports `.ts` natively via type stripping (no loader needed)
- **Package Manager**: Yarn Berry
- **Task orchestration**: NX plus the custom `yarn task` runner
- **CI environment**: Linux and Windows
- **TS execution**: Migrating from `jiti` to native `node` for running `.ts` files. New scripts should use `node ./path/file.ts` with explicit `.ts` import extensions (enabled by `allowImportingTsExtensions` in tsconfig). Legacy scripts still use `jiti` but should be migrated over time.

## Repository Structure

Expand Down Expand Up @@ -234,7 +235,7 @@ When writing tests:

After changing files:

1. Format with `cd code && oxfmt`
1. Format with `yarn fmt:write` (run from the repo root)
2. Lint with `yarn --cwd code lint:js:cmd <file-relative-to-code-folder> --fix` or `cd code && yarn lint:js:cmd <file-relative-to-code-folder>`
3. Run relevant tests before submitting a PR

Expand Down
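
The AGENTS.md note above says new scripts run as `node ./path/file.ts` with no loader. As a rough sketch of what that implies for script authors: Node's type stripping only erases type syntax, so constructs that emit runtime code (such as `enum` or constructor parameter properties) need an erasable alternative. The names below (`TrialSummary`, `EFFORT`, `Trial`, `successRate`) are illustrative and do not come from this PR.

```typescript
// Sketch: syntax that type stripping can erase without a transform step.
// Would be run as `node ./sketch.ts` (hypothetical file name).

// Interfaces and type aliases are erased entirely.
interface TrialSummary {
  project: string;
  passed: number;
  total: number;
}

// An `as const` object replaces `enum`, which would emit runtime code.
const EFFORT = { low: 'low', high: 'high' } as const;
type Effort = (typeof EFFORT)[keyof typeof EFFORT];

// Fields are declared explicitly instead of via constructor parameter properties.
class Trial {
  readonly effort: Effort;
  constructor(effort: Effort) {
    this.effort = effort;
  }
}

// Plain annotations on parameters and return types are erased as-is.
function successRate(summary: TrialSummary): number {
  return summary.total === 0 ? 0 : summary.passed / summary.total;
}

// In a multi-file script, a sibling module would be imported with its extension,
// e.g. `import { helper } from './helper.ts';` (hypothetical module).
```

This is also why the tsconfig change in this PR turns on `allowImportingTsExtensions`: without it, TypeScript rejects the explicit `.ts` specifiers that Node needs at runtime.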
3 changes: 3 additions & 0 deletions code/core/src/core-server/index.ts
Expand Up @@ -32,3 +32,6 @@ export {
} from './stores/test-provider';

export { getServerPort } from './utils/server-address';

export { getComponentCandidates } from './utils/ghost-stories/get-candidates';
export { runGhostStories } from './utils/ghost-stories/run-story-tests';
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ import {
import type { CoreConfig, Options } from 'storybook/internal/types';

import { getComponentCandidates } from '../utils/ghost-stories/get-candidates';
-import { runStoryTests } from '../utils/ghost-stories/run-story-tests';
+import { runGhostStories } from '../utils/ghost-stories/run-story-tests';

export function initGhostStoriesChannel(
channel: Channel,
Expand Down Expand Up @@ -91,7 +91,7 @@ export function initGhostStoriesChannel(

// Phase 2: Run tests on those candidates with Vitest. The components will be transformed directly to tests
// If they pass, it means that creating a story file for them would succeed.
-const testRunResult = await runStoryTests(candidatesResult.candidates);
+const testRunResult = await runGhostStories(candidatesResult.candidates);
stats.totalRunDuration = Date.now() - ghostRunStart;
stats.testRunDuration = testRunResult.duration;
if (testRunResult.runError) {
Expand Down
Original file line number Diff line number Diff line change
@@ -1,12 +1,11 @@
import { readFile } from 'node:fs/promises';

import { babelParse, traverse } from 'storybook/internal/babel';
import { logger } from 'storybook/internal/node-logger';

// eslint-disable-next-line depend/ban-dependencies
import { glob } from 'glob';

-import { getComponentComplexity } from './component-analyzer';
+import { getComponentComplexity } from './component-analyzer.ts';
Comment thread
kasperpeulen marked this conversation as resolved.

// A valid candidate includes React code and at least one export
function isValidCandidate(source: string): boolean {
Expand Down Expand Up @@ -128,9 +127,12 @@ export async function getCandidatesForStorybook(
export async function getComponentCandidates({
sampleSize = 20,
globPattern = '**/*.{tsx,jsx}',
cwd = process.cwd(),
}: {
sampleSize?: number;
globPattern?: string;
/** Working directory for glob. Defaults to process.cwd(). */
cwd?: string;
} = {}): Promise<{
candidates: string[];
error?: string;
Expand All @@ -145,7 +147,7 @@ export async function getComponentCandidates({

// Find files matching the glob pattern
files = await glob(globPattern, {
-cwd: process.cwd(),
+cwd,
absolute: true,
ignore: [
'**/node_modules/**',
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,10 @@
-import type { ErrorCategory } from '../../../shared/utils/categorize-render-errors';
-import { categorizeError } from '../../../shared/utils/categorize-render-errors';
-import { type ErrorCategorizationResult, type StoryTestResult, type TestRunSummary } from './types';
+import type { ErrorCategory } from '../../../shared/utils/categorize-render-errors.ts';
+import { categorizeError } from '../../../shared/utils/categorize-render-errors.ts';
+import {
+type ErrorCategorizationResult,
+type StoryTestResult,
+type TestRunSummary,
+} from './types.ts';
Comment thread
kasperpeulen marked this conversation as resolved.

/**
* For a given list of test results:
Expand Down
18 changes: 15 additions & 3 deletions code/core/src/core-server/utils/ghost-stories/run-story-tests.ts
Expand Up @@ -5,10 +5,21 @@ import { executeCommand, resolvePathInStorybookCache } from 'storybook/internal/

import { join } from 'pathe';

-import { parseVitestResults } from './parse-vitest-report';
-import type { TestRunSummary } from './types';
+import { parseVitestResults } from './parse-vitest-report.ts';
+import type { TestRunSummary } from './types.ts';

-export async function runStoryTests(componentFilePaths: string[]): Promise<TestRunSummary> {
+/**
+* Run ghost stories: execute vitest on component file paths to auto-generate
+* and test stories that don't exist on disk.
+*
+* @param componentFilePaths - Absolute paths to component files to test.
+* @param options.cwd - Working directory for vitest. Defaults to process.cwd().
+*/
+export async function runGhostStories(
+componentFilePaths: string[],
+options?: { cwd?: string }
+): Promise<TestRunSummary> {
+const cwd = options?.cwd;
try {
// Create the cache directory for story discovery tests
const cacheDir = resolvePathInStorybookCache('ghost-stories-tests');
Expand All @@ -34,6 +45,7 @@ export async function runStoryTests(componentFilePaths: string[]): Promise<TestR
`--outputFile=${outputFile}`,
...componentFilePaths,
],
cwd,
stdio: 'pipe',
env: {
STORYBOOK_COMPONENT_PATHS: componentFilePaths.join(';'),
Expand Down
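
The hunk above passes the component paths to the Vitest process through `STORYBOOK_COMPONENT_PATHS`, joined with `;`. The reading side is not part of this diff, so the following is only a sketch of how such a value could be round-tripped. `encodeComponentPaths` and `decodeComponentPaths` are hypothetical helpers, not functions from Storybook core.

```typescript
// Sketch only: the decoding side is an assumption, not code from this PR.
const PATH_SEPARATOR = ';';

/** Mirrors `componentFilePaths.join(';')` from runGhostStories. */
function encodeComponentPaths(paths: string[]): string {
  return paths.join(PATH_SEPARATOR);
}

/** Hypothetical reader for the STORYBOOK_COMPONENT_PATHS value. */
function decodeComponentPaths(value: string | undefined): string[] {
  if (!value) return [];
  // Drop empty segments so a trailing or doubled separator does not yield ''.
  return value.split(PATH_SEPARATOR).filter((segment) => segment.length > 0);
}

const encoded = encodeComponentPaths(['/app/src/Button.tsx', '/app/src/Card.tsx']);
const decoded = decodeComponentPaths(encoded);
```

One limitation of any separator-based encoding: a path that itself contains `;` would be split in the wrong place. If that ever matters, a JSON-encoded array would be a safer carrier.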
2 changes: 1 addition & 1 deletion code/core/src/shared/utils/categorize-render-errors.ts
Expand Up @@ -3,7 +3,7 @@ import {
isRouterPackage,
isStateManagementPackage,
isStylingPackage,
-} from './ecosystem-identifier';
+} from './ecosystem-identifier.ts';

export const ERROR_CATEGORIES = {
MISSING_PROVIDER: 'MISSING_PROVIDER',
Expand Down
2 changes: 2 additions & 0 deletions code/tsconfig.json
Expand Up @@ -13,6 +13,8 @@
"lib": ["dom", "dom.iterable", "esnext"],
"module": "Preserve",
"moduleResolution": "bundler",
// Required for explicit .ts import extensions — migrating toward native Node TS execution
"allowImportingTsExtensions": true,
"noImplicitAny": true,
"noUnusedLocals": false,
"skipLibCheck": true,
Expand Down
2 changes: 1 addition & 1 deletion scripts/ci/common-jobs.ts
Expand Up @@ -67,7 +67,7 @@ export const build_linux = defineJob('Build (linux)', (workflowName) => ({
export const fmt = defineJob('Format check', () => ({
executor: {
name: 'sb_node_22_classic',
-class: 'medium+',
+class: 'xlarge',
},
steps: [
git.checkout(),
Expand Down
201 changes: 201 additions & 0 deletions scripts/eval/eval.ts
@@ -0,0 +1,201 @@
/**
* Eval harness entry point.
*
* Runs with `node ./eval/eval.ts` (no jiti). Node 22+ supports .ts natively
* via type stripping. Import specifiers use explicit .ts extensions.
*
* Usage:
* node eval/eval.ts -p mealdrop # claude defaults
* node eval/eval.ts -p mealdrop -a codex # codex defaults
* node eval/eval.ts -p mealdrop -m gpt-5.4 # codex (inferred)
* node eval/eval.ts -p mealdrop -a claude -e max # claude with max effort
* node eval/eval.ts -p mealdrop --manual # prepare only, print instructions
* node eval/eval.ts --list-projects
* node eval/eval.ts --list-models
* node eval/eval.ts --list-prompts
*/
import { writeFile } from 'node:fs/promises';
import { join } from 'node:path';
import { parseArgs } from 'node:util';
import { z } from 'zod';
import pc from 'picocolors';
import {
AGENT_IDS,
AGENTS,
CLAUDE_EFFORTS,
CLAUDE_MODELS,
CODEX_EFFORTS,
CODEX_MODELS,
type AgentId,
type AgentVariant,
} from './lib/agents/config.ts';
import { prepareTrial } from './lib/prepare-trial.ts';
import { PROJECTS } from './lib/projects.ts';
import { runTrial, type TrialConfig } from './lib/run-trial.ts';
import {
captureEnvironment,
createLogger,
formatCost,
formatDuration,
generateTrialId,
listPrompts,
loadPrompt,
} from './lib/utils.ts';

const PROJECT_NAMES = PROJECTS.map((p) => p.name) as [string, ...string[]];

const base = {
project: z.enum(PROJECT_NAMES).optional(),
prompt: z.string().default('setup'),
verbose: z.boolean().default(false),
manual: z.boolean().default(false),
listProjects: z.boolean().default(false),
listModels: z.boolean().default(false),
listPrompts: z.boolean().default(false),
};

const argsSchema = z.discriminatedUnion('agent', [
z.object({
...base,
agent: z.literal('claude'),
model: z.enum(CLAUDE_MODELS).default(AGENTS.claude.defaultModel),
effort: z.enum(CLAUDE_EFFORTS).default(AGENTS.claude.defaultEffort),
}),
z.object({
...base,
agent: z.literal('codex'),
model: z.enum(CODEX_MODELS).default(AGENTS.codex.defaultModel),
effort: z.enum(CODEX_EFFORTS).default(AGENTS.codex.defaultEffort),
}),
]);

const { values } = parseArgs({
options: {
project: { type: 'string', short: 'p' },
agent: { type: 'string', short: 'a' },
model: { type: 'string', short: 'm' },
effort: { type: 'string', short: 'e' },
prompt: { type: 'string' },
verbose: { type: 'boolean', short: 'v' },
manual: { type: 'boolean' },
'list-projects': { type: 'boolean' },
'list-models': { type: 'boolean' },
'list-prompts': { type: 'boolean' },
},
args: process.argv.slice(2),
strict: true,
});

// Resolve the discriminator: explicit --agent, inferred from --model, or default to claude.
const agent = values.agent ?? (values.model ? inferAgent(values.model) : 'claude');

const parsed = argsSchema.safeParse({
...values,
agent,
listProjects: values['list-projects'],
listModels: values['list-models'],
listPrompts: values['list-prompts'],
});

if (!parsed.success) {
for (const issue of parsed.error.issues) {
console.error(pc.red(` ${issue.path.join('.')}: ${issue.message}`));
}
process.exit(1);
}

const args = parsed.data;
const logger = createLogger();

if (args.listProjects) {
for (const project of PROJECTS) {
logger.log(` ${pc.bold(project.name)} — ${project.description}`);
}
process.exit(0);
}
if (args.listModels) {
for (const [name, { models }] of Object.entries(AGENTS)) {
logger.log(`\n ${pc.bold(name)}`);
for (const model of models) logger.log(` ${model}`);
}
process.exit(0);
}
if (args.listPrompts) {
for (const name of listPrompts()) logger.log(` ${pc.bold(name)}`);
process.exit(0);
}

if (!args.project) {
logger.log(pc.red(`Specify a project with -p. Available: ${PROJECT_NAMES.join(', ')}`));
process.exit(1);
}
const project = PROJECTS.find((p) => p.name === args.project)!;
const variant = toVariant(args);

logger.log(pc.bold(`\nStorybook Setup Eval — ${project.name}`));
logger.log(
`Agent: ${variant.agent} | Model: ${variant.model} | Effort: ${variant.effort} | Prompt: ${args.prompt}\n`
);

if (args.manual) {
const trialId = generateTrialId(project.name, variant.agent, variant.model, args.prompt);
const workspace = await prepareTrial(project, trialId, logger);
await captureEnvironment(workspace.resultsDir);

const prompt = loadPrompt(args.prompt);
const promptPath = join(workspace.resultsDir, 'prompt.md');
await writeFile(promptPath, prompt);

const cliCommand = buildManualCommand(variant, promptPath);

logger.log(pc.bold('\n── Manual mode ──'));
logger.log(`\n Trial dir: ${pc.cyan(workspace.trialDir)}`);
logger.log(` Project dir: ${pc.cyan(workspace.projectPath)}`);
logger.log(` Prompt file: ${pc.cyan(promptPath)}`);
logger.log(pc.bold('\nRun the agent yourself:\n'));
logger.log(` ${pc.green('cd')} ${workspace.projectPath}`);
logger.log(` ${pc.green(cliCommand)}\n`);
} else {
const result = await runTrial(
{ project, variant, prompt: args.prompt, verbose: args.verbose } satisfies TrialConfig,
logger
);

const ghost = result.grade.ghostStories;
const ghostStr = ghost
? `${ghost.passed}/${ghost.total} (${Math.round(ghost.successRate * 100)}%)`
: '-';

logger.log(pc.bold('\nResult'));
logger.log(` Build: ${result.grade.buildSuccess ? pc.green('PASS') : pc.red('FAIL')}`);

@yannbf yannbf Mar 31, 2026

Member


Are there ways to add info related to:

  • model used
  • time spent on operations
  • context usage
  • ghost stories rate before/after <-- the before rate is quite important

These can be done later:

  • changes made to preview.js
  • setup patterns identified, set up patterns implemented in preview.js (so we can see whether it found something but didn't setup)
  • quality (renders without errors and not empty, has styles)

Member Author


Not yet in this pass. The main follow-up I still see here is expanding the saved report schema with model/time/context/before-after ghost metrics.
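
One possible shape for that follow-up, as a sketch only: the report could carry the model, per-operation timings, context usage, and a before/after pair of ghost story results, with the delta derived from the pair. Every type and field name below (`TrialReport`, `GhostStoriesSnapshot`, `durationsMs`, `contextTokens`) is hypothetical and not part of this PR.

```typescript
// Hypothetical report shape; every field name here is an assumption.
interface GhostStoriesSnapshot {
  passed: number;
  total: number;
}

interface TrialReport {
  model: string;
  /** Wall-clock duration per named operation, in milliseconds. */
  durationsMs: Record<string, number>;
  /** Tokens consumed by the agent, if the agent SDK reports it. */
  contextTokens?: number;
  ghostStories: { before: GhostStoriesSnapshot; after: GhostStoriesSnapshot };
}

function successRate({ passed, total }: GhostStoriesSnapshot): number {
  return total === 0 ? 0 : passed / total;
}

/** Percentage-point change in ghost story success rate, rounded to an integer. */
function ghostDeltaPoints(report: TrialReport): number {
  const { before, after } = report.ghostStories;
  return Math.round((successRate(after) - successRate(before)) * 100);
}

/** Formats the before/after pair in the same `passed/total` style as the console summary. */
function formatGhost(report: TrialReport): string {
  const { before, after } = report.ghostStories;
  const delta = ghostDeltaPoints(report);
  const sign = delta >= 0 ? '+' : '';
  return `${before.passed}/${before.total} -> ${after.passed}/${after.total} (${sign}${delta} pts)`;
}
```

Storing the raw before and after counts, rather than only the delta, keeps the baseline visible when runs are compared across projects with different numbers of candidates.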

logger.log(` Ghost: ${ghostStr}`);
logger.log(` TS Err: ${result.grade.typeCheckErrors}`);
logger.log(` Score: ${result.score.score}`);
logger.log(` Cost: ${formatCost(result.execution.cost)}`);
logger.log(` Time: ${formatDuration(result.execution.duration)}`);
logger.log(` Turns: ${result.execution.turns}`);

logger.log('\nDone.');
}

function inferAgent(model: string): AgentId {
for (const id of AGENT_IDS) {
if (AGENTS[id].models.some((candidate) => candidate === model)) return id;
}
throw new Error(`No agent found for model: ${model}`);
}

function buildManualCommand(variant: AgentVariant, promptPath: string): string {
const promptArg = `"$(cat ${promptPath})"`;
if (variant.agent === 'claude') {
const sdkModel = AGENTS.claude.sdkModelIds[variant.model] ?? variant.model;
return `claude --model ${sdkModel} ${promptArg}`;
}
return `codex --model ${variant.model} --reasoning-effort ${variant.effort} ${promptArg}`;
}

function toVariant(args: z.infer<typeof argsSchema>): AgentVariant {
return args.agent === 'claude'
? { agent: 'claude', model: args.model, effort: args.effort }
: { agent: 'codex', model: args.model, effort: args.effort };
}
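
The agent resolution in this file follows a fixed precedence: an explicit `--agent`, then an agent inferred from `--model`, then `claude`. The sketch below isolates that rule so it can be exercised without the CLI. The model lists are assumed for illustration (the real ones live in `lib/agents/config.ts`, which this diff does not show), and `resolveAgent` is a hypothetical wrapper around the inline expression.

```typescript
// Assumed agent table for illustration; not the contents of lib/agents/config.ts.
type AgentId = 'claude' | 'codex';

const AGENT_IDS: AgentId[] = ['claude', 'codex'];

const AGENTS: Record<AgentId, { models: readonly string[] }> = {
  claude: { models: ['sonnet-4.6', 'opus-4.6'] },
  codex: { models: ['gpt-5.4'] },
};

// Same lookup as inferAgent in eval.ts: first agent whose model list contains the model.
function inferAgent(model: string): AgentId {
  for (const id of AGENT_IDS) {
    if (AGENTS[id].models.some((candidate) => candidate === model)) return id;
  }
  throw new Error(`No agent found for model: ${model}`);
}

/** Precedence: explicit agent, then agent inferred from model, then 'claude'. */
function resolveAgent(flags: { agent?: string; model?: string }): string {
  return flags.agent ?? (flags.model ? inferAgent(flags.model) : 'claude');
}
```

Because `inferAgent` runs before `argsSchema.safeParse`, an unknown model passed with `-m` and no `-a` throws a plain `Error` instead of going through the formatted issue list. If that turns out to matter in practice, one option would be to catch the error and report it the same way as schema issues.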