Build: Add eval harness for Storybook agentic setup #34365
Eval system to test how well AI agents complete Storybook setup after `npx storybook@latest init --yes` on real-world projects.
Features:
- Multi-LLM support: Claude Code (Opus/Sonnet/Haiku), GitHub Copilot CLI (Claude models + GPT-5.2-codex, GPT-5.2, GPT-5.1-codex-max)
- 6 test projects covering different tech stacks: styled-components/Redux, Tailwind/HeadlessUI, Zustand, ECharts, GraphQL
- Structured JSON output with execution metrics (cost, duration, turns) and grading (build success, TypeScript errors, quality score)
- CLI with project/model/agent selection, iterations, custom prompts
Usage: npx jiti scripts/eval/eval.ts --project wikitok --model claude-sonnet-4-6
Refs: #34295
Replace CLI process spawning with proper SDKs:
- Claude: @anthropic-ai/claude-agent-sdk with query() API
- Codex: @openai/codex-sdk with thread streaming API
Benefits: structured responses, proper cost tracking, no stream-json parsing, no CLI installation dependency, full conversation transcript.
- Pre-prepared eval-baseline branches on forked repos (kasperpeulen/*) eliminate storybook init during trials
- Cache system: first run clones + installs, subsequent runs copy from cache, so the agent starts immediately
- Post-init baseline commit for clean git diffs
- Richer result schema: changed files, setup patterns, ghost stories
- Ghost stories grading via STORYBOOK_COMPONENT_PATHS + Vitest
- Setup pattern detection (tailwind, redux, router, etc.)
- Better prompt: allows story creation, focuses on real components
- Smarter cleanup: only removes starter stories, not project stories
Tested on wikitok: quality 1.0, build pass, 7/7 ghost stories, $0.78
- Google Sheets integration via Apps Script webhook (set EVAL_GOOGLE_SHEETS_URL)
- Run ID (per session) and upload ID (for grouping) like MCP eval
- Environment capture (node version, git branch/commit)
- Included google-apps-script.js for setting up the spreadsheet
Prompts are now composable: --prompt setup self-heal doctor
Each name maps to prompts/{name}.md, concatenated in order.
Available prompts:
- setup: base setup prompt (default)
- self-heal: iterative fix loop using vitest --project=storybook
- doctor: run diagnostics before large config changes
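A minimal sketch of how the composition described above could work, assuming each prompt is a plain Markdown file and that fragments are joined with a blank line. The function name `composePrompt`, the separator, and the trimming are assumptions for illustration, not the harness's actual code.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper: reads prompts/{name}.md for each name and
// concatenates the contents in the order the names were given.
function composePrompt(promptsDir: string, names: string[]): string {
  return names
    .map((name) => readFileSync(join(promptsDir, `${name}.md`), 'utf8').trim())
    .join('\n\n'); // assumed separator: one blank line between fragments
}

// Demo against a throwaway directory so the sketch runs anywhere.
const demoPromptsDir = mkdtempSync(join(tmpdir(), 'eval-prompts-'));
writeFileSync(join(demoPromptsDir, 'setup.md'), 'Set up Storybook.\n');
writeFileSync(join(demoPromptsDir, 'self-heal.md'), 'Iterate until tests pass.\n');
const demoPrompt = composePrompt(demoPromptsDir, ['setup', 'self-heal']);
console.log(demoPrompt);
```

Because the order of names determines the order of the text, `--prompt setup self-heal` and `--prompt self-heal setup` would produce different prompts under this sketch.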
Updated verification to prefer vitest over storybook build since
storybook init creates the vitest integration automatically.
- Move cleanEnv to utils (was duplicated in prepare-trial and grade)
- Replace fast-glob/glob with Node 22 built-in fs.globSync
- Compact setup-patterns rules into tuple array
- Remove manual file recursion in setup-patterns and ghost-stories
- Fix save.ts bug (relative(EVAL_ROOT, "") → removed trialPath)
- Remove unused logWarn, simplify logging helpers
- Tighten prepare-trial install detection into single expression
…env, 1s timeout, no --project
- Delete config.ts and generate-prompt.ts: merge PROJECTS into types.ts, prompts into utils.ts, inline agents map into run-task.ts
- computeQualityScore takes options object instead of 4 positional params
- Quality score now includes ghost stories (40%), build (25%), typecheck (25%), and performance (10%)
- exec() uses tinyexec native timeout instead of manual AbortController
- Codex agent tracks token usage and estimates cost from pricing table
- Environment fields renamed to evalBranch/evalCommit for clarity
- IPC sentinel shared as exported constant between eval.ts and eval-parallel.ts
- Summary tables now show quality score column
- setup-patterns uses object array instead of positional tuples
- prepare-repos.ts uses shared exec(), static imports, consistent quotes
- google-apps-script.js modernized to const/let + arrow functions
- Remove SupportedModel type alias (was just string)
- Fix .gitignore trailing newline, prompt no longer hardcodes React+Vite
- MAX_TURNS extracted as named constant in claude agent
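To make the weighting above concrete, here is a hedged sketch of a weighted quality score that takes an options object. Only the function name `computeQualityScore` and the four weights come from the commit message. The option names, the normalization of each component to the range 0..1, and the rounding are assumptions.

```typescript
// Hypothetical option names; the real signature is not shown in this PR page.
interface QualityScoreOptions {
  ghostPassRate: number; // fraction of ghost stories that pass, 0..1
  buildSuccess: boolean;
  typecheckSuccess: boolean; // assumption: pass/fail rather than an error count
  performanceScore: number; // assumption: already normalized to 0..1
}

// Weights as stated in the commit message.
const WEIGHTS = { ghost: 0.4, build: 0.25, typecheck: 0.25, performance: 0.1 } as const;

const clamp01 = (n: number): number => Math.min(1, Math.max(0, n));

function computeQualityScore(options: QualityScoreOptions): number {
  const score =
    WEIGHTS.ghost * clamp01(options.ghostPassRate) +
    WEIGHTS.build * (options.buildSuccess ? 1 : 0) +
    WEIGHTS.typecheck * (options.typecheckSuccess ? 1 : 0) +
    WEIGHTS.performance * clamp01(options.performanceScore);
  // Round to two decimals to avoid floating-point noise in summary tables.
  return Math.round(score * 100) / 100;
}

const perfectScore = computeQualityScore({
  ghostPassRate: 1,
  buildSuccess: true,
  typecheckSuccess: true,
  performanceScore: 1,
});
console.log(perfectScore);
```

An options object keeps call sites readable when four numeric and boolean inputs would otherwise be easy to pass in the wrong order.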
…rts)
Core source files use extensionless import specifiers that fail under Node's native TypeScript loader. Read numPassedTests/numTotalTests directly from the vitest JSON report instead.
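As an illustration of reading the counts straight from the report, the sketch below parses a JSON report string and derives a pass rate. The field names `numPassedTests` and `numTotalTests` are the ones named in the commit message. The function name, the return shape, and the handling of missing fields are assumptions.

```typescript
// Only the two fields this sketch needs; a real report carries many more.
interface VitestJsonSummary {
  numTotalTests: number;
  numPassedTests: number;
}

interface GhostStoryCounts {
  passed: number;
  total: number;
  passRate: number; // 0 when no tests ran
}

// Hypothetical helper: takes the report file's text, not a path,
// so it can be exercised without running vitest.
function readGhostStoryCounts(reportJson: string): GhostStoryCounts {
  const report = JSON.parse(reportJson) as Partial<VitestJsonSummary>;
  const total = typeof report.numTotalTests === 'number' ? report.numTotalTests : 0;
  const passed = typeof report.numPassedTests === 'number' ? report.numPassedTests : 0;
  return { passed, total, passRate: total > 0 ? passed / total : 0 };
}

const sampleCounts = readGhostStoryCounts('{"numTotalTests":7,"numPassedTests":7}');
console.log(`${sampleCounts.passed}/${sampleCounts.total}`);
```

Reading two numbers from a JSON file avoids importing any core source at all, which sidesteps the loader problem described above.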
…omments
Move types from centralized types.ts into their owning modules:
- Agent types/config → lib/agents/config.ts
- Project/PROJECTS → lib/projects.ts
- Logger → lib/utils.ts
- Grade/scoring types → lib/grade.ts
- TrialConfig/TrialReport → lib/run-trial.ts
- TrialWorkspace → lib/prepare-trial.ts
Remove setup-patterns (detectSetupPatterns, SetupPattern) entirely. Strip all // --- section separator comments.
export const PROJECTS: Project[] = [
  {
    name: 'mealdrop',
    repo: 'https://github.com/kasperpeulen/mealdrop',
Can you move the projects to https://github.com/orgs/storybook-tmp/?
Not yet in this branch.
@yannbf suggested moving these to storybookjs org. We could also make the repos private to the org and anonymise repo names that way if anyone requests it.
I haven't moved the benchmark repos in this branch.
@@ -0,0 +1,196 @@
Attention: The following instructions must be followed in order to successfully set up Storybook in this project. Do not skip steps or attempt to do them out of order.
We discussed in the peer review adding a small frontmatter to prompts where we can store metadata, e.g. monorepo: true, etc.
This will help us cross-analyse multiple prompts by traits to attribute score variance to specific traits and to have a better chance at computing statistical significance later on.
@yannbf do you already have a list of conditionals applied by the ai command to inject content into the prompt?
No, the command is quite simple for now and we will be improving it once we start using the eval system. The one conditional right now is csf factories, but no project uses it by default yet
I haven't added prompt frontmatter in this pass. Agreed it's a useful follow-up once we start comparing prompts more systematically.
    : '-';

logger.log(pc.bold('\nResult'));
logger.log(`  Build: ${result.grade.buildSuccess ? pc.green('PASS') : pc.red('FAIL')}`);
Are there ways to add info related to:
- model used
- time spent on operations
- context usage
- ghost stories rate before/after <-- the before rate is quite important
These can be done later:
- changes made to preview.js
- setup patterns identified, setup patterns implemented in preview.js (so we can see whether it found something but didn't set it up)
- quality (renders without errors and not empty, has styles)
Not yet in this pass. The main follow-up I still see here is expanding the saved report schema with model/time/context/before-after ghost metrics.
@@ -0,0 +1,99 @@
/**
Is this file needed at all? Can't we use the JsPackageManager instance for this instead?
Package Benchmarks
Commit: The following packages have significant changes to their size or dependencies:
| | Before | After | Difference |
|---|---|---|---|
| Dependency count | 20 | 20 | 0 |
| Self size | 131 KB | 131 KB | 0 B |
| Dependency size | 3.41 MB | 3.45 MB | 🚨 +31 KB 🚨 |
| Bundle Size Analyzer | Link | Link | |
5093a3e into project/sb-agentic-setup
Walkthrough
Introduces a comprehensive evaluation framework for AI agents alongside TypeScript import extension support. Renames ghost-stories utilities (
Sequence Diagram(s)
sequenceDiagram
participant CLI as Eval CLI
participant Prep as prepareTrial
participant Env as captureEnvironment
participant Agent as Agent Driver
participant Build as Build/Typecheck
participant Ghost as Ghost Stories
participant Grade as Grading
CLI->>Prep: Initialize workspace
Prep->>Prep: Clone/cache repo
Prep->>Prep: Install dependencies
Prep-->>CLI: Return workspace
CLI->>Env: Capture environment
Env-->>CLI: Return node/git info
CLI->>Agent: Execute agent
Agent->>Agent: Stream messages
Agent->>Agent: Log/persist transcript
Agent-->>CLI: Return execution (cost/duration/turns)
CLI->>Build: Run Storybook + tsc
Build->>Build: Parallel execution
Build-->>CLI: Return build/typecheck results
alt Build Success
CLI->>Ghost: Run ghost-stories
Ghost->>Ghost: Discover candidates
Ghost->>Ghost: Run vitest tests
Ghost-->>CLI: Return grade (pass rate)
end
CLI->>Grade: Compute quality score
Grade->>Grade: Weighted calculation
Grade-->>CLI: Return grade + score
CLI->>CLI: Assemble TrialReport
CLI->>CLI: Persist summary.json
CLI-->>CLI: Return report
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Closes #34295 (M0 milestone)
What I did
Eval harness to measure how well AI agents complete Storybook setup after `npx storybook@latest init --yes` on real-world projects. Primary metric: ghost stories (do rendered components actually work).
Architecture
Pipeline per run: copy from cache → run agent → grade (build + typecheck + ghost stories + setup patterns + changed files) → save to Google Sheets
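The pipeline above can be sketched as a sequential orchestration over four injected steps. This is an illustration only: the interface names, field names, and the `runTrial` function below are hypothetical, and the real report carries more fields (changed files, setup patterns, environment).

```typescript
// Hypothetical shapes, reduced to what the sketch needs.
interface Execution { cost: number; durationMs: number; turns: number }
interface Grade { buildSuccess: boolean; typecheckSuccess: boolean; ghostPassRate: number }
interface TrialReport { project: string; execution: Execution; grade: Grade }

// Each stage is injected so the ordering can be tested without
// cloning repos, calling an agent SDK, or hitting Google Sheets.
interface TrialSteps {
  copyFromCache(project: string): Promise<string>; // resolves to the workspace path
  runAgent(workspace: string): Promise<Execution>;
  grade(workspace: string): Promise<Grade>;
  save(report: TrialReport): Promise<void>;
}

async function runTrial(project: string, steps: TrialSteps): Promise<TrialReport> {
  const workspace = await steps.copyFromCache(project); // copy from cache
  const execution = await steps.runAgent(workspace); // run agent
  const grade = await steps.grade(workspace); // build + typecheck + ghost stories
  const report: TrialReport = { project, execution, grade };
  await steps.save(report); // persist or upload the result
  return report;
}

// Stub steps that only record the order in which they were called.
const callOrder: string[] = [];
const stubSteps: TrialSteps = {
  copyFromCache: async (project) => { callOrder.push('copy'); return `/tmp/${project}`; },
  runAgent: async () => { callOrder.push('agent'); return { cost: 0, durationMs: 1, turns: 1 }; },
  grade: async () => { callOrder.push('grade'); return { buildSuccess: true, typecheckSuccess: true, ghostPassRate: 1 }; },
  save: async () => { callOrder.push('save'); },
};
const trialPromise = runTrial('wikitok', stubSteps);
```

Grading runs only after the agent has finished, so every metric reflects the workspace exactly as the agent left it.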
Agents (via SDK, not CLI)
`@anthropic-ai/claude-agent-sdk`
`@openai/codex-sdk`
Agent is inferred from model: `-m gpt-5.4` auto-selects codex.
Four independent axes
effort / Codex `model_reasoning_effort`)
`setup` (default), `self-heal` (vitest-based iteration loop)
Parallel execution
`eval-parallel.ts` spawns 8 separate node processes (4 models × 2 prompts) with live-streamed prefixed logs:
Pre-prepared repos
6 forked repos with `eval-baseline` branches (storybook already initialized):
First run clones + installs → caches. Subsequent runs copy from cache, so the agent starts immediately.
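The clone-once, copy-afterwards behaviour can be illustrated with a small cache helper. This is a sketch under assumptions: `prepareWorkspace` and the `populateCache` callback are hypothetical names, and the callback stands in for the real clone and install steps, which are not reproduced here.

```typescript
import { cpSync, existsSync, mkdirSync, mkdtempSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper. On a cache miss it fills the cache via the callback
// (standing in for git clone + dependency install); in both cases it then
// copies the cached tree into a fresh trial directory.
function prepareWorkspace(
  cacheDir: string,
  trialDir: string,
  populateCache: (dir: string) => void
): 'hit' | 'miss' {
  const hit = existsSync(cacheDir);
  if (!hit) {
    mkdirSync(cacheDir, { recursive: true });
    // Note: a failure here leaves a partial cache; a real harness would need to handle that.
    populateCache(cacheDir);
  }
  cpSync(cacheDir, trialDir, { recursive: true });
  return hit ? 'hit' : 'miss';
}

// Demo in a throwaway directory: the expensive step runs once.
const demoRoot = mkdtempSync(join(tmpdir(), 'eval-cache-'));
let populateCalls = 0;
const populate = (dir: string): void => {
  populateCalls += 1;
  writeFileSync(join(dir, 'package.json'), '{}');
};
const firstRun = prepareWorkspace(join(demoRoot, 'cache'), join(demoRoot, 'trial-1'), populate);
const secondRun = prepareWorkspace(join(demoRoot, 'cache'), join(demoRoot, 'trial-2'), populate);
console.log(firstRun, secondRun, populateCalls);
```

Copying into a separate trial directory keeps the cache pristine, so one trial's agent edits cannot leak into the next.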
Grading
`STORYBOOK_COMPONENT_PATHS` to auto-generate and test stories (mirrors core-server implementation)
Results
`summary.json` per trial
`EVAL_GOOGLE_SHEETS_URL`)
Checklist for Contributors
Testing
The changes in this PR are covered in the following automated tests:
Manual testing
Tested end-to-end on mealdrop and wikitok with multiple model/prompt combinations via `eval-parallel.ts`.
Documentation
MIGRATION.MD