Build: Add eval harness for Storybook agentic setup #34365

Merged
kasperpeulen merged 63 commits into project/sb-agentic-setup from kasper/eval-system
Mar 31, 2026

Member

@kasperpeulen kasperpeulen commented Mar 27, 2026

Closes #34295 (M0 milestone)

What I did

Eval harness to measure how well AI agents complete Storybook setup after npx storybook@latest init --yes on real-world projects. Primary metric: ghost stories (do rendered components actually work).

Architecture

eval.ts          — single run:     node scripts/eval/eval.ts -p mealdrop
eval-parallel.ts — all 8 combos:   node scripts/eval/eval-parallel.ts -p mealdrop

Pipeline per run: copy from cache → run agent → grade (build + typecheck + ghost stories + setup patterns + changed files) → save to Google Sheets

Agents (via SDK, not CLI)

Agent   SDK                             Models
claude  @anthropic-ai/claude-agent-sdk  sonnet-4.6, opus-4.6, haiku-4.5
codex   @openai/codex-sdk               gpt-5.4

Agent is inferred from model: -m gpt-5.4 auto-selects codex.
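A minimal sketch of how the agent could be inferred from the model name. The lookup table mirrors the Agents table above; the helper name `inferAgent` and its error behavior are assumptions for illustration, not the code in this PR.

```typescript
type Agent = "claude" | "codex";

// Model lists copied from the Agents table; the map itself is illustrative.
const MODELS_BY_AGENT: Record<Agent, readonly string[]> = {
  claude: ["sonnet-4.6", "opus-4.6", "haiku-4.5"],
  codex: ["gpt-5.4"],
};

// Hypothetical helper: returns the agent that owns the given model.
function inferAgent(model: string): Agent {
  for (const agent of Object.keys(MODELS_BY_AGENT) as Agent[]) {
    if (MODELS_BY_AGENT[agent].indexOf(model) !== -1) return agent;
  }
  throw new Error(`Unknown model: ${model}`);
}
```

With this shape, `-m gpt-5.4` resolves to `codex` without a separate agent flag, and an unknown model fails early instead of silently picking a default.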

Four independent axes

agent × model × effort × prompt
  • Effort: low, medium, high, max (maps to Claude effort / Codex model_reasoning_effort)
  • Prompts: setup (default), self-heal (vitest-based iteration loop)

Parallel execution

eval-parallel.ts spawns 8 separate node processes (4 models × 2 prompts) with live-streamed prefixed logs:

[sonnet-4.6+setup] 🚀 Session started
[gpt-5.4+self-heal] 🔧 $ npx vitest run...
[opus-4.6+setup] 💬 I'll analyze the project...
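The fan-out above can be sketched as a cartesian product of models and prompts plus a small line-prefixing helper. The model and prompt names come from this description; `buildCombos` and `prefixLines` are hypothetical names, and the actual process spawning is left out.

```typescript
// Model and prompt names come from the PR description; everything else is illustrative.
const MODELS = ["sonnet-4.6", "opus-4.6", "haiku-4.5", "gpt-5.4"];
const PROMPTS = ["setup", "self-heal"];

interface Combo {
  model: string;
  prompt: string;
  label: string; // log prefix such as "sonnet-4.6+setup"
}

// Cartesian product of models and prompts: 4 x 2 = 8 combos.
function buildCombos(models: string[], prompts: string[]): Combo[] {
  const combos: Combo[] = [];
  for (const model of models) {
    for (const prompt of prompts) {
      combos.push({ model, prompt, label: `${model}+${prompt}` });
    }
  }
  return combos;
}

// Prefix every non-empty line of a child's output chunk with its combo label.
function prefixLines(label: string, chunk: string): string[] {
  return chunk
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => `[${label}] ${line}`);
}
```

Each combo would then be handed to its own node process, with stdout piped through `prefixLines`. A real implementation would also need to buffer partial lines, since output chunks do not always end on a newline.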

Pre-prepared repos

6 forked repos with eval-baseline branches (Storybook already initialized):

  • mealdrop — styled-components, Redux, React Router
  • edgy — Tailwind, HeadlessUI, React Router
  • wikitok — simple Tailwind
  • baklava — component library, Zustand
  • echarts — ECharts React wrapper
  • evergreen-ci — GraphQL workspace

First run clones + installs → caches. Subsequent runs copy from cache — agent starts immediately.
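A rough sketch of the cache behavior described above. All names are hypothetical, and the file system and git operations are injected as an interface so that only the caching decision is shown.

```typescript
// All names here are hypothetical; real file system and git calls are injected
// so the caching decision itself stays easy to follow.
interface WorkspaceOps {
  exists(path: string): boolean;
  cloneAndInstall(repo: string, branch: string, dest: string): void;
  copy(src: string, dest: string): void;
}

function prepareWorkspace(
  project: { name: string; repo: string },
  cacheRoot: string,
  trialDir: string,
  ops: WorkspaceOps
): "cache-miss" | "cache-hit" {
  const cached = `${cacheRoot}/${project.name}`;
  const hit = ops.exists(cached);
  if (!hit) {
    // First run: clone the eval-baseline branch and install dependencies once.
    ops.cloneAndInstall(project.repo, "eval-baseline", cached);
  }
  // Every run: the agent works on a fresh copy, never on the cache itself.
  ops.copy(cached, trialDir);
  return hit ? "cache-hit" : "cache-miss";
}
```

Copying on every run, rather than running the agent inside the cache, keeps trials independent of each other.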

Grading

  • Ghost stories — discovers candidate components, runs vitest with STORYBOOK_COMPONENT_PATHS to auto-generate and test stories (mirrors core-server implementation)
  • Storybook build — pass/fail
  • TypeScript errors — count
  • Setup patterns — detects tailwind, redux, router, styled-components, etc.
  • Changed files — git diff from baseline

Results

  • Structured summary.json per trial
  • Google Sheets upload (set EVAL_GOOGLE_SHEETS_URL)
  • Run ID / Upload ID for grouping
  • Environment capture (node version, git info)
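As an illustration of the opt-in upload, the sketch below builds the webhook request from a trial summary and returns nothing when no URL is configured. The `TrialSummary` fields and the `buildSheetsRequest` helper are assumptions; the real `summary.json` schema is not shown in this description.

```typescript
// Hypothetical shapes: the real summary.json schema is richer than this.
interface TrialSummary {
  runId: string;
  uploadId: string;
  project: string;
  model: string;
  buildSuccess: boolean;
}

interface SheetsRequest {
  url: string;
  method: "POST";
  headers: { "Content-Type": string };
  body: string;
}

// Returns null when no webhook URL is configured, so uploads stay opt-in.
function buildSheetsRequest(
  webhookUrl: string | undefined,
  summary: TrialSummary
): SheetsRequest | null {
  if (!webhookUrl) return null;
  return {
    url: webhookUrl,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(summary),
  };
}
```

The caller would read `EVAL_GOOGLE_SHEETS_URL` from the environment, pass it in as `webhookUrl`, and send the resulting request with an HTTP client of its choice.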

Checklist for Contributors

Testing

The changes in this PR are covered in the following automated tests:

  • stories
  • unit tests
  • integration tests
  • end-to-end tests

Manual testing

Tested end-to-end on mealdrop and wikitok with multiple model/prompt combinations via eval-parallel.ts.

Documentation

  • Add or update documentation reflecting your changes
  • If you are deprecating/removing a feature, make sure to update
    MIGRATION.MD

Checklist for Maintainers

  • When this PR is ready for testing, make sure to add ci:normal, ci:merged or ci:daily GH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found in code/lib/cli-storybook/src/sandbox-templates.ts

  • Make sure this PR contains one of the labels below:

    Available labels
    • bug: Internal changes that fix incorrect behavior.
    • maintenance: User-facing maintenance tasks.
    • dependencies: Upgrading (sometimes downgrading) dependencies.
    • build: Internal-facing build tooling & test updates. Will not show up in release changelog.
    • cleanup: Minor cleanup style change. Will not show up in release changelog.
    • documentation: Documentation only changes. Will not show up in release changelog.
    • feature request: Introducing a new feature.
    • BREAKING CHANGE: Changes that break compatibility in some way with current major version.
    • other: Changes that don't fit in the above categories.

🦋 Canary release

This PR does not have a canary release associated. You can request a canary release of this pull request by mentioning the @storybookjs/core team here.

core team members can create a canary release here or locally with gh workflow run --repo storybookjs/storybook publish.yml --field pr=<PR_NUMBER>

Summary by CodeRabbit

  • New Features

    • Added evaluation framework with CLI for testing and benchmarking agent implementations against projects.
    • Introduced ghost stories testing utilities for component validation.
  • Improvements

    • Updated Node.js requirement to 22.22.1 for improved TypeScript support.
    • Enhanced TypeScript module resolution with explicit file extensions.
    • Upgraded CI resource allocation for faster job execution.
  • Chores

    • Updated repository tooling paths and configuration.

@kasperpeulen kasperpeulen added the build (Internal-facing build tooling & test updates) and ci:normal (Run our default set of CI jobs; choose this for most PRs) labels Mar 27, 2026

nx-cloud Bot commented Mar 27, 2026

View your CI Pipeline Execution ↗ for commit 65893e9

Command                                             Status        Duration  Result
nx run-many -t compile,check,knip,test,lint,fmt...  ✅ Succeeded  9m 31s    View ↗

☁️ Nx Cloud last updated this comment at 2026-03-31 13:46:00 UTC

@kasperpeulen kasperpeulen changed the title from "Build: Add LLM eval system for agentic setup" to "Build: Add eval harness for Storybook agentic setup" Mar 27, 2026
Eval system to test how well AI agents complete Storybook setup after
`npx storybook@latest init --yes` on real-world projects.

Features:
- Multi-LLM support: Claude Code (Opus/Sonnet/Haiku), GitHub Copilot CLI
  (Claude models + GPT-5.2-codex, GPT-5.2, GPT-5.1-codex-max)
- 6 test projects covering different tech stacks: styled-components/Redux,
  Tailwind/HeadlessUI, Zustand, ECharts, GraphQL
- Structured JSON output with execution metrics (cost, duration, turns)
  and grading (build success, TypeScript errors, quality score)
- CLI with project/model/agent selection, iterations, custom prompts

Usage: npx jiti scripts/eval/eval.ts --project wikitok --model claude-sonnet-4-6

Refs: #34295

Replace CLI process spawning with proper SDKs:
- Claude: @anthropic-ai/claude-agent-sdk with query() API
- Codex: @openai/codex-sdk with thread streaming API

Benefits: structured responses, proper cost tracking, no stream-json
parsing, no CLI installation dependency, full conversation transcript.

- Pre-prepared eval-baseline branches on forked repos (kasperpeulen/*)
  eliminates storybook init during trials
- Cache system: first run clones + installs, subsequent runs copy from
  cache — agent starts immediately
- Post-init baseline commit for clean git diffs
- Richer result schema: changed files, setup patterns, ghost stories
- Ghost stories grading via STORYBOOK_COMPONENT_PATHS + Vitest
- Setup pattern detection (tailwind, redux, router, etc.)
- Better prompt: allows story creation, focuses on real components
- Smarter cleanup: only removes starter stories, not project stories

Tested on wikitok: quality 1.0, build pass, 7/7 ghost stories, $0.78

- Google Sheets integration via Apps Script webhook (set EVAL_GOOGLE_SHEETS_URL)
- Run ID (per session) and upload ID (for grouping) like MCP eval
- Environment capture (node version, git branch/commit)
- Included google-apps-script.js for setting up the spreadsheet

Prompts are now composable: --prompt setup self-heal doctor
Each name maps to prompts/{name}.md, concatenated in order.

Available prompts:
- setup: base setup prompt (default)
- self-heal: iterative fix loop using vitest --project=storybook
- doctor: run diagnostics before large config changes

Updated verification to prefer vitest over storybook build since
storybook init creates the vitest integration automatically.
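The composable-prompt rule in the commit message above (each name maps to `prompts/{name}.md`, concatenated in order) could look roughly like this. The helper name `composePrompt`, the injected `readFile` function, and the blank-line separator are assumptions.

```typescript
// `readFile` is injected so the sketch does not depend on a particular file API;
// in Node it could be (p) => fs.readFileSync(p, "utf8").
function composePrompt(
  names: string[],
  readFile: (path: string) => string,
  promptsDir = "prompts"
): string {
  // Default to the base setup prompt when no names are given.
  const selected = names.length > 0 ? names : ["setup"];
  return selected
    .map((name) => readFile(`${promptsDir}/${name}.md`).trim())
    .join("\n\n"); // separator is an assumption
}
```

Under these assumptions, `--prompt setup self-heal doctor` would yield the contents of the three files in that order.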
- Move cleanEnv to utils (was duplicated in prepare-trial and grade)
- Replace fast-glob/glob with Node 22 built-in fs.globSync
- Compact setup-patterns rules into tuple array
- Remove manual file recursion in setup-patterns and ghost-stories
- Fix save.ts bug (relative(EVAL_ROOT, "") → removed trialPath)
- Remove unused logWarn, simplify logging helpers
- Tighten prepare-trial install detection into single expression
- Delete config.ts and generate-prompt.ts — merge PROJECTS into types.ts,
  prompts into utils.ts, inline agents map into run-task.ts
- computeQualityScore takes options object instead of 4 positional params
- Quality score now includes ghost stories (40%), build (25%),
  typecheck (25%), and performance (10%)
- exec() uses tinyexec native timeout instead of manual AbortController
- Codex agent tracks token usage and estimates cost from pricing table
- Environment fields renamed to evalBranch/evalCommit for clarity
- IPC sentinel shared as exported constant between eval.ts and eval-parallel.ts
- Summary tables now show quality score column
- setup-patterns uses object array instead of positional tuples
- prepare-repos.ts uses shared exec(), static imports, consistent quotes
- google-apps-script.js modernized to const/let + arrow functions
- Remove SupportedModel type alias (was just string)
- Fix .gitignore trailing newline, prompt no longer hardcodes React+Vite
- MAX_TURNS extracted as named constant in claude agent
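A sketch of the weighted quality score with the 40/25/25/10 split listed in the commit message above, taking an options object as that commit describes. How the typecheck and performance components are normalized is not stated here, so the all-or-nothing typecheck and the pre-normalized `performance` input are assumptions, as are the field names.

```typescript
interface QualityInput {
  ghostPassRate: number; // passed / total ghost stories, 0..1
  buildSuccess: boolean;
  typeErrorCount: number;
  performance: number; // assumed to be pre-normalized to 0..1
}

// Weights in percent, as listed in the commit message.
const WEIGHTS = { ghost: 40, build: 25, typecheck: 25, performance: 10 };

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

function computeQualityScore(input: QualityInput): number {
  const build = input.buildSuccess ? 1 : 0;
  // Assumption: typecheck is all-or-nothing; the real harness may scale by error count.
  const typecheck = input.typeErrorCount === 0 ? 1 : 0;
  const total =
    WEIGHTS.ghost * clamp01(input.ghostPassRate) +
    WEIGHTS.build * build +
    WEIGHTS.typecheck * typecheck +
    WEIGHTS.performance * clamp01(input.performance);
  return total / 100;
}
```

For example, a trial with half of its ghost stories passing, a successful build, three type errors, and a performance score of 0 would come out at (20 + 25 + 0 + 0) / 100 = 0.45 under these assumptions.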
…rts)

Core source files use extensionless import specifiers that fail under
Node's native TypeScript loader. Read numPassedTests/numTotalTests
directly from the vitest JSON report instead.
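Reading the counters directly from the report, as the commit message above describes, might look like the following. Only `numPassedTests` and `numTotalTests` are taken from the source; the helper name and the zero fallback for missing fields are assumptions.

```typescript
// Only the two counters named in the commit message are modeled here.
interface VitestJsonCounts {
  numPassedTests: number;
  numTotalTests: number;
}

function readGhostStoryCounts(
  reportJson: string
): { passed: number; total: number; passRate: number } {
  const report = JSON.parse(reportJson) as Partial<VitestJsonCounts>;
  const passed = typeof report.numPassedTests === "number" ? report.numPassedTests : 0;
  const total = typeof report.numTotalTests === "number" ? report.numTotalTests : 0;
  // Avoid dividing by zero when no candidate components were found.
  return { passed, total, passRate: total > 0 ? passed / total : 0 };
}
```

Parsing the JSON this way avoids importing the core-server report parser, which is the import the commit message says fails under Node's native TypeScript loader.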
…omments

Move types from centralized types.ts into their owning modules:
- Agent types/config → lib/agents/config.ts
- Project/PROJECTS → lib/projects.ts
- Logger → lib/utils.ts
- Grade/scoring types → lib/grade.ts
- TrialConfig/TrialReport → lib/run-trial.ts
- TrialWorkspace → lib/prepare-trial.ts

Remove setup-patterns (detectSetupPatterns, SetupPattern) entirely.
Strip all // --- section separator comments.
Comment thread scripts/eval/eval.ts
export const PROJECTS: Project[] = [
{
name: 'mealdrop',
repo: 'https://github.com/kasperpeulen/mealdrop',
Member Author

@kasperpeulen kasperpeulen Mar 31, 2026

Not yet in this branch.

Comment thread code/core/src/core-server/utils/ghost-stories/get-candidates.ts
Comment thread scripts/eval/lib/agents/claude-code.ts Outdated
Comment thread scripts/eval/lib/agents/claude-code.ts Outdated
Comment thread scripts/eval/lib/agents/claude-code.ts Outdated
export const PROJECTS: Project[] = [
{
name: 'mealdrop',
repo: 'https://github.com/kasperpeulen/mealdrop',
Contributor

@yannbf suggested moving these to storybookjs org. We could also make the repos private to the org and anonymise repo names that way if anyone requests it.

Member Author

@kasperpeulen kasperpeulen Mar 31, 2026

I haven't moved the benchmark repos in this branch.

Comment thread scripts/eval/lib/utils.ts
@@ -0,0 +1,196 @@
Attention: The following instructions must be followed in order to successfully set up Storybook in this project. Do not skip steps or attempt to do them out of order.
Contributor

We discussed in the peer review adding a small frontmatter to prompts where we can store metadata, e.g. monorepo: true, etc.

This will help us cross-analyse multiple prompts by traits to attribute score variance to specific traits and to have a better chance at computing statistical significance later on.

@yannbf do you already have a list of conditionals applied by the ai command to inject content into the prompt?

Member

No, the command is quite simple for now and we will be improving it once we start using the eval system. The one conditional right now is csf factories, but no project uses it by default yet

Member Author

I haven't added prompt frontmatter in this pass. Agreed it's a useful follow-up once we start comparing prompts more systematically.

Comment thread scripts/tsconfig.json Outdated
Comment thread .gitignore Outdated
Comment thread scripts/eval/eval.ts
: '-';

logger.log(pc.bold('\nResult'));
logger.log(` Build: ${result.grade.buildSuccess ? pc.green('PASS') : pc.red('FAIL')}`);
Member

@yannbf yannbf Mar 31, 2026

Are there ways to add info related to:

  • model used
  • time spent on operations
  • context usage
  • ghost stories rate before/after <-- the before rate is quite important

These can be done later:

  • changes made to preview.js
  • setup patterns identified, set up patterns implemented in preview.js (so we can see whether it found something but didn't setup)
  • quality (renders without errors and not empty, has styles)

Member Author

Not yet in this pass. The main follow-up I still see here is expanding the saved report schema with model/time/context/before-after ghost metrics.

Comment thread scripts/eval/lib/grade.ts Outdated
@@ -0,0 +1,99 @@
/**
Member

Is this file needed at all? Can't we use the JsPackageManager instance for this instead?


storybook-app-bot Bot commented Mar 31, 2026

Package Benchmarks

Commit: 65893e9, ran on 31 March 2026 at 13:50:06 UTC

The following packages have significant changes to their size or dependencies:

eslint-plugin-storybook

Metric                Before   After    Difference
Dependency count      20       20       0
Self size             131 KB   131 KB   0 B
Dependency size       3.41 MB  3.45 MB  🚨 +31 KB 🚨
Bundle Size Analyzer  Link     Link

@kasperpeulen kasperpeulen marked this pull request as ready for review March 31, 2026 14:17
@kasperpeulen kasperpeulen merged commit 5093a3e into project/sb-agentic-setup Mar 31, 2026
124 of 127 checks passed
@kasperpeulen kasperpeulen deleted the kasper/eval-system branch March 31, 2026 14:17
Contributor

coderabbitai Bot commented Mar 31, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 779aea2a-6142-421c-b14f-acf2925617e2

📥 Commits

Reviewing files that changed from the base of the PR and between d48f719 and 65893e9.

⛔ Files ignored due to path filters (1)
  • yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (33)
  • .agents/skills/review-pr/SKILL.md
  • .circleci/config.yml
  • .gitignore
  • AGENTS.md
  • code/core/src/core-server/index.ts
  • code/core/src/core-server/server-channel/ghost-stories-channel.ts
  • code/core/src/core-server/utils/ghost-stories/get-candidates.ts
  • code/core/src/core-server/utils/ghost-stories/parse-vitest-report.ts
  • code/core/src/core-server/utils/ghost-stories/run-story-tests.ts
  • code/core/src/shared/utils/categorize-render-errors.ts
  • code/tsconfig.json
  • scripts/ci/common-jobs.ts
  • scripts/eval/eval.ts
  • scripts/eval/lib/agents/claude-code.ts
  • scripts/eval/lib/agents/codex.ts
  • scripts/eval/lib/agents/config.test.ts
  • scripts/eval/lib/agents/config.ts
  • scripts/eval/lib/grade.test.ts
  • scripts/eval/lib/grade.ts
  • scripts/eval/lib/grading-helpers.test.ts
  • scripts/eval/lib/package-manager.test.ts
  • scripts/eval/lib/package-manager.ts
  • scripts/eval/lib/prepare-trial.test.ts
  • scripts/eval/lib/prepare-trial.ts
  • scripts/eval/lib/projects.test.ts
  • scripts/eval/lib/projects.ts
  • scripts/eval/lib/run-trial.test.ts
  • scripts/eval/lib/run-trial.ts
  • scripts/eval/lib/utils.test.ts
  • scripts/eval/lib/utils.ts
  • scripts/eval/prompts/setup.md
  • scripts/package.json
  • scripts/tsconfig.json

📝 Walkthrough

Walkthrough

Introduces a comprehensive evaluation framework for AI agents alongside TypeScript import extension support. Renames ghost-stories utilities (runStoryTests → runGhostStories), adds .ts extension support throughout, and creates an eval CLI with Claude/Codex agent drivers, project management, trial orchestration, and automated grading based on build success, type-checks, ghost-story tests, and performance.

Changes

Cohort / File(s) Summary
Configuration & Documentation
AGENTS.md, code/tsconfig.json, scripts/tsconfig.json, .gitignore, .circleci/config.yml, scripts/ci/common-jobs.ts, .agents/skills/review-pr/SKILL.md
Node version bump (22.21.1→22.22.1), TypeScript native execution guidance, format command migration (oxfmt→yarn fmt:write), tsconfig additions for allowImportingTsExtensions, CircleCI resource class adjustments (small→large, medium+→xlarge), .gitignore entries for eval artifacts and .pr-review, and new PR review skill documentation.
Ghost Stories API & Core Server
code/core/src/core-server/index.ts, code/core/src/core-server/server-channel/ghost-stories-channel.ts, code/core/src/core-server/utils/ghost-stories/...
Renamed runStoryTests → runGhostStories, added optional cwd parameter to both runGhostStories and getComponentCandidates, re-exported new utilities from core-server, updated imports to use explicit .ts extensions throughout ghost-stories modules (get-candidates, parse-vitest-report, run-story-tests), removed unused logger import.
TypeScript Extension Migration
code/core/src/shared/utils/categorize-render-errors.ts
Updated import path for ecosystem-identifier to include explicit .ts extension.
Evaluation Framework: Core Modules
scripts/eval/eval.ts, scripts/eval/lib/run-trial.ts, scripts/eval/lib/prepare-trial.ts, scripts/eval/lib/grade.ts
New CLI entry point with argument parsing (agent/model/project/effort selection, manual mode), trial orchestration (preparation, environment capture, execution, grading), workspace/trial workspace management, parallel build/typecheck execution, optional ghost-story grading, and weighted quality-score computation.
Evaluation Framework: Agent Drivers
scripts/eval/lib/agents/claude-code.ts, scripts/eval/lib/agents/codex.ts, scripts/eval/lib/agents/config.ts
New Claude agent driver with streaming SDK execution, message logging, transcript persistence, cost/duration extraction; Codex driver with event streaming, token aggregation, turn tracking, cost estimation; shared configuration (models/efforts, execution settings, token pricing, cost estimation utilities).
Evaluation Framework: Supporting Utilities
scripts/eval/lib/utils.ts, scripts/eval/lib/package-manager.ts, scripts/eval/lib/projects.ts
Logger creation, trial ID generation, formatting helpers (duration/cost/table), environment capture, prompt loading/listing; package-manager detection and dependency installation with workspace resolution; project definitions and metadata.
Evaluation Framework: Prompts & Tests
scripts/eval/prompts/setup.md, scripts/eval/lib/agents/config.test.ts, scripts/eval/lib/grade.test.ts, scripts/eval/lib/grading-helpers.test.ts, scripts/eval/lib/package-manager.test.ts, scripts/eval/lib/prepare-trial.test.ts, scripts/eval/lib/projects.test.ts, scripts/eval/lib/run-trial.test.ts, scripts/eval/lib/utils.test.ts
Storybook setup documentation with CSF/factory templates and verification steps; comprehensive test suites for agent configuration, grading logic, ghost-story integration, package-manager detection, trial preparation, project validation, end-to-end trial execution, and utility formatting/helpers.
Dependencies
scripts/package.json
Added Claude agent SDK, Codex SDK, citty CLI utility; added eval script entry point.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as Eval CLI
    participant Prep as prepareTrial
    participant Env as captureEnvironment
    participant Agent as Agent Driver
    participant Build as Build/Typecheck
    participant Ghost as Ghost Stories
    participant Grade as Grading

    CLI->>Prep: Initialize workspace
    Prep->>Prep: Clone/cache repo
    Prep->>Prep: Install dependencies
    Prep-->>CLI: Return workspace

    CLI->>Env: Capture environment
    Env-->>CLI: Return node/git info

    CLI->>Agent: Execute agent
    Agent->>Agent: Stream messages
    Agent->>Agent: Log/persist transcript
    Agent-->>CLI: Return execution (cost/duration/turns)

    CLI->>Build: Run Storybook + tsc
    Build->>Build: Parallel execution
    Build-->>CLI: Return build/typecheck results

    alt Build Success
        CLI->>Ghost: Run ghost-stories
        Ghost->>Ghost: Discover candidates
        Ghost->>Ghost: Run vitest tests
        Ghost-->>CLI: Return grade (pass rate)
    end

    CLI->>Grade: Compute quality score
    Grade->>Grade: Weighted calculation
    Grade-->>CLI: Return grade + score

    CLI->>CLI: Assemble TrialReport
    CLI->>CLI: Persist summary.json
    CLI-->>CLI: Return report

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Labels

build: Internal-facing build tooling & test updates
ci:normal: Run our default set of CI jobs (choose this for most PRs).

Development

Successfully merging this pull request may close these issues.

[Tracking]: SB Agentic Setup

3 participants