Debug: Debug Dangerfile PR data content 2#34399
Eval system to test how well AI agents complete Storybook setup after `npx storybook@latest init --yes` on real-world projects.
Features:
- Multi-LLM support: Claude Code (Opus/Sonnet/Haiku), GitHub Copilot CLI (Claude models + GPT-5.2-codex, GPT-5.2, GPT-5.1-codex-max)
- 6 test projects covering different tech stacks: styled-components/Redux, Tailwind/HeadlessUI, Zustand, ECharts, GraphQL
- Structured JSON output with execution metrics (cost, duration, turns) and grading (build success, TypeScript errors, quality score)
- CLI with project/model/agent selection, iterations, custom prompts
Usage: npx jiti scripts/eval/eval.ts --project wikitok --model claude-sonnet-4-6
Refs: #34295
Replace CLI process spawning with proper SDKs:
- Claude: @anthropic-ai/claude-agent-sdk with query() API
- Codex: @openai/codex-sdk with thread streaming API
Benefits: structured responses, proper cost tracking, no stream-json parsing, no CLI installation dependency, full conversation transcript.
- Pre-prepared eval-baseline branches on forked repos (kasperpeulen/*) eliminate storybook init during trials
- Cache system: first run clones + installs, subsequent runs copy from cache, so the agent starts immediately
- Post-init baseline commit for clean git diffs
- Richer result schema: changed files, setup patterns, ghost stories
- Ghost stories grading via STORYBOOK_COMPONENT_PATHS + Vitest
- Setup pattern detection (tailwind, redux, router, etc.)
- Better prompt: allows story creation, focuses on real components
- Smarter cleanup: only removes starter stories, not project stories
Tested on wikitok: quality 1.0, build pass, 7/7 ghost stories, $0.78
- Google Sheets integration via Apps Script webhook (set EVAL_GOOGLE_SHEETS_URL)
- Run ID (per session) and upload ID (for grouping) like MCP eval
- Environment capture (node version, git branch/commit)
- Included google-apps-script.js for setting up the spreadsheet
Prompts are now composable: --prompt setup self-heal doctor
Each name maps to prompts/{name}.md, concatenated in order.
Available prompts:
- setup: base setup prompt (default)
- self-heal: iterative fix loop using vitest --project=storybook
- doctor: run diagnostics before large config changes
Updated verification to prefer vitest over storybook build since
storybook init creates the vitest integration automatically.
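A minimal sketch of how the composition described above could work, assuming each name resolves to `prompts/{name}.md` and the pieces are joined in the order given. The `composePrompt` name, the injected `PromptReader`, and the blank-line separator are illustrative assumptions, not the harness's actual API.

```typescript
// Hypothetical reader signature; the real harness presumably reads files from disk.
type PromptReader = (path: string) => string;

// Map each prompt name to `${dir}/${name}.md` and concatenate in the given order.
function composePrompt(names: string[], read: PromptReader, dir = 'prompts'): string {
  // "setup" is documented as the default prompt; falling back to it here is an assumption.
  const selected = names.length > 0 ? names : ['setup'];
  return selected.map((name) => read(`${dir}/${name}.md`).trim()).join('\n\n');
}

// In-memory stand-in for the prompts directory (contents are made up for the example).
const promptFiles: Record<string, string> = {
  'prompts/setup.md': 'Set up Storybook.\n',
  'prompts/self-heal.md': 'Run vitest and fix failures.\n',
};

const composed = composePrompt(['setup', 'self-heal'], (path) => promptFiles[path] ?? '');
console.log(composed);
```

Injecting the reader keeps the ordering logic testable without touching the file system.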
- Move cleanEnv to utils (was duplicated in prepare-trial and grade)
- Replace fast-glob/glob with Node 22 built-in fs.globSync
- Compact setup-patterns rules into tuple array
- Remove manual file recursion in setup-patterns and ghost-stories
- Fix save.ts bug (relative(EVAL_ROOT, "") → removed trialPath)
- Remove unused logWarn, simplify logging helpers
- Tighten prepare-trial install detection into single expression
…env, 1s timeout, no --project
- Delete config.ts and generate-prompt.ts: merge PROJECTS into types.ts, prompts into utils.ts, inline agents map into run-task.ts
- computeQualityScore takes options object instead of 4 positional params
- Quality score now includes ghost stories (40%), build (25%), typecheck (25%), and performance (10%)
- exec() uses tinyexec native timeout instead of manual AbortController
- Codex agent tracks token usage and estimates cost from pricing table
- Environment fields renamed to evalBranch/evalCommit for clarity
- IPC sentinel shared as exported constant between eval.ts and eval-parallel.ts
- Summary tables now show quality score column
- setup-patterns uses object array instead of positional tuples
- prepare-repos.ts uses shared exec(), static imports, consistent quotes
- google-apps-script.js modernized to const/let + arrow functions
- Remove SupportedModel type alias (was just string)
- Fix .gitignore trailing newline, prompt no longer hardcodes React+Vite
- MAX_TURNS extracted as named constant in claude agent
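The weighting above can be illustrated with a small sketch. Only the four weights (40/25/25/10) and the options-object shape come from the commit message; the field names, the all-or-nothing typecheck component, and the linear performance falloff against a hypothetical 600-second budget are assumptions made for the example.

```typescript
interface ScoreWeights {
  ghostStories: number;
  build: number;
  typecheck: number;
  performance: number;
}

// Weights as stated in the commit message.
const WEIGHTS: ScoreWeights = { ghostStories: 0.4, build: 0.25, typecheck: 0.25, performance: 0.1 };

// Hypothetical options object; the real field names may differ.
interface ScoreOptions {
  ghostStoriesPassed: number;
  ghostStoriesTotal: number;
  buildSuccess: boolean;
  typeErrors: number;
  durationSeconds: number;
  maxDurationSeconds?: number; // assumed time budget for the performance component
  weights?: ScoreWeights;
}

function computeQualityScore(options: ScoreOptions): number {
  const weights = options.weights ?? WEIGHTS;
  const budget = options.maxDurationSeconds ?? 600;
  // Each component is normalized to the range 0..1 before weighting.
  const ghost = options.ghostStoriesTotal > 0 ? options.ghostStoriesPassed / options.ghostStoriesTotal : 0;
  const build = options.buildSuccess ? 1 : 0;
  const typecheck = options.typeErrors === 0 ? 1 : 0; // assumption: any error zeroes this part
  const performance = 1 - Math.min(options.durationSeconds / budget, 1); // assumption: linear falloff
  const raw =
    weights.ghostStories * ghost +
    weights.build * build +
    weights.typecheck * typecheck +
    weights.performance * performance;
  return Math.round(raw * 100) / 100;
}

const perfect = computeQualityScore({
  ghostStoriesPassed: 7,
  ghostStoriesTotal: 7,
  buildSuccess: true,
  typeErrors: 0,
  durationSeconds: 0,
});
console.log(perfect);
```

An options object makes the call site self-describing, which is the stated motivation for moving away from four positional parameters.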
…rts)
Core source files use extensionless import specifiers that fail under Node's native TypeScript loader. Read numPassedTests/numTotalTests directly from the vitest JSON report instead.
Node's native TypeScript loader requires explicit .ts extensions. Add them to parse-vitest-report.ts and categorize-render-errors.ts so the eval can import parseVitestResults from core via relative path.
… tsconfig fixes
- Separate types from runtime config (types.ts + config.ts)
- Thread Logger through entire pipeline (fixes garbled parallel output)
- Replace fragile stdout sentinel IPC with Node fork/process.send
- Run storybook build + typecheck in parallel (saves ~60-120s/trial)
- Tighten Agent interface to single params object
- Add --agent/--model/--prompt filters to eval-parallel
- Make quality score weights configurable
- Add prompt template variable support
- Enable allowImportingTsExtensions in root and scripts tsconfigs
- Fix all pre-existing TS errors in eval files
…from core-server, inline into grade.ts
- Rename runStoryTests to runGhostStories in core (clearer name)
- Add cwd parameter to runGhostStories and getComponentCandidates
- Export getComponentCandidates, runGhostStories, TestRunSummary from core-server index
- Remove eval ghost-stories.ts wrapper; inline logic into grade.ts
- Remove eval ghost-stories.test.ts; core already has its own tests
- Revert speculative isCandidate/isValidCandidate export (unused)
- Remove unused logger import from get-candidates.ts
…ions
The core-server barrel index re-exports modules (build-static, etc.) that fail under native Node TS. Import ghost-stories utilities directly from their source files instead, and add .ts extensions to internal imports in the import chain.
…rop exec wrapper
- Replace fork/IPC parallel execution with direct Promise.allSettled + prefixed loggers
- Make blocking fs calls async (cpSync→cp, writeFileSync→writeFile, mkdirSync→mkdir)
- Remove Google Sheets upload, google-apps-script.js, and upload-id/run-id plumbing
- Drop custom exec wrapper; use tinyexec's x() directly at call sites
- Remove runId/uploadId from runTask signature and both CLI entry points
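The move from fork/IPC to `Promise.allSettled` with prefixed loggers can be sketched roughly as follows. `prefixedLogger`, `runAll`, and the `[label]` prefix format are hypothetical names and choices for illustration; only `Promise.allSettled` itself is taken from the commit message.

```typescript
type Logger = (message: string) => void;

// Wrap a sink so every line is tagged with the trial it belongs to.
function prefixedLogger(prefix: string, sink: Logger = console.log): Logger {
  return (message) => sink(`[${prefix}] ${message}`);
}

// Run all trials concurrently; a rejected trial does not cancel the others.
function runAll<T>(
  labels: string[],
  trial: (label: string, log: Logger) => Promise<T>,
  sink: Logger = console.log
): Promise<PromiseSettledResult<T>[]> {
  return Promise.allSettled(labels.map((label) => trial(label, prefixedLogger(label, sink))));
}

// Example: the second trial fails, the first still reports its result.
const logged: string[] = [];
const settled = runAll(
  ['wikitok/sonnet', 'wikitok/codex'],
  async (label, log) => {
    log('starting');
    if (label.endsWith('codex')) throw new Error('simulated failure');
    return label;
  },
  (line) => logged.push(line)
);
settled.then((results) => console.log(results.map((result) => result.status)));
```

Because every log line carries its own prefix, interleaved output from concurrent trials stays attributable without a separate process per trial.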
- Replace plain interfaces with Zod schemas for runtime validation (types.ts)
- Merge eval.ts + eval-parallel.ts into a single CLI with comma-separated args
- Fix deep core imports to use barrel export (core-server/index.ts)
- Extract shared package-manager detection and install (lib/package-manager.ts)
- Move pricing tables and model ID mappings into config.ts
- Make setup-patterns.ts fully async with fs/promises
- Add formatTable utility with ANSI-aware column alignment
- Integrate prepare-repos.ts with shared logger and PM utilities
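For readers unfamiliar with ANSI-aware alignment, the idea is to measure each cell by its visible width rather than its raw string length, so color codes do not skew the columns. The sketch below is an assumption about how such a utility might look; it handles only SGR color sequences and is not the `formatTable` from this PR.

```typescript
// Matches SGR (color/style) escape sequences such as "\u001b[32m" and "\u001b[0m".
const ANSI_SGR = /\u001b\[[0-9;]*m/g;

// Length of the text as it appears on screen, ignoring color codes.
function visibleLength(text: string): number {
  return text.replace(ANSI_SGR, '').length;
}

// Pad every cell to the widest visible cell in its column.
function alignColumns(rows: string[][], gap = '  '): string {
  const widths: number[] = [];
  for (const row of rows) {
    row.forEach((cell, index) => {
      widths[index] = Math.max(widths[index] ?? 0, visibleLength(cell));
    });
  }
  return rows
    .map((row) =>
      row
        .map((cell, index) => cell + ' '.repeat(widths[index] - visibleLength(cell)))
        .join(gap)
        .trimEnd()
    )
    .join('\n');
}

const table = alignColumns([
  ['a', 'bb'],
  ['\u001b[32mccc\u001b[0m', 'd'],
]);
console.log(table);
```

Padding with the raw `length` would count the nine escape characters around `ccc` and push the second column out of line.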
…slides
- Replace slideshow format with a scrollable HTML page using file cards
- Show complete file contents for new files, diffs for modified files
- Lexend + JetBrains Mono fonts, light/dark theme, mobile-responsive
- Static server on port 3000 (no live-reload)
- Issues shown inline as smell-boxes, never block page generation
- Simplified to 5 steps: gather → read → generate → serve → iterate
…bility review
- Two layers per area: curated walkthrough (API→Tests→Impl) + collapsed full files
- Use language-typescript with data-diff attribute instead of language-diff
- Post-processing script for line-level add/remove backgrounds on top of TS highlighting
- Add readability review guidance: logical order, clear names, comments, test quality
- Order areas high-level to low-level
Principle 3 now explicitly requires showing complete interface definitions where they're first relevant, not just type names.
Extract AgentRunConfig { agent, model, effort } and compose it as
a `run` field in TrialConfig, ExecutionResult, and TrialResult
instead of spreading via extends/inheritance.
- AgentRunConfig → AgentVariant (it's the experimental variant, not a "run config")
- Agent → AgentDriver, AgentConfig → AgentDefinition (disambiguate)
- ExecutionResult → Execution, GradingResult → Grade, QualityResult → QualityScore
- TrialResult → TrialReport, TrialPaths → TrialWorkspace
- ChangedFile → FileChange, Pricing → TokenPricing, Environment → EvalEnvironment
- GhostStoriesResult → GhostStoryGrade, GhostStoryRunResult → GhostStoryOutput
- QualityWeights → ScoreWeights, DEFAULT_QUALITY_WEIGHTS → DEFAULT_SCORE_WEIGHTS
- Field renames: run → variant, grading → grade, quality → score, changedFiles → fileChanges, storybookFiles → storybookChanges
- Extract AgentExecuteParams with variant: AgentVariant (reuses the model)
- Remove redundant run field from Execution (lives on TrialReport only)
Every project needs a branch for cloning. The type now reflects that, and the `branch!` assertion in prepareTrial is no longer needed.
…Trial, throw on ghost story errors
- Make AgentVariant a discriminated union on agent, with typed model/effort per agent
- Rename runTask→runTrial and run-task.ts→run-trial.ts for consistent domain naming
- Store full Project in TrialReport instead of just the name for reproducibility
- Replace error-object returns with GhostStoryError throws in ghost-stories.ts
- Fix successRate rounding to use Math.round(x*100)/100 consistently
- Extract scoring magic numbers into named constants
- Validate git status chars against known set instead of blind casting
- Truncate build/typecheck output at line boundaries
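To make the first and fifth bullets concrete, here is a rough sketch of a discriminated union keyed on `agent`, plus the two-decimal rounding. The agent identifiers, model names, and effort levels below are placeholders chosen for the example; the PR's actual unions are not shown on this page.

```typescript
// Hypothetical variants: each agent gets its own model and effort types.
type ExampleVariant =
  | { agent: 'claude'; model: 'opus' | 'sonnet' | 'haiku'; effort?: 'low' | 'high' }
  | { agent: 'codex'; model: 'gpt-5.2-codex' | 'gpt-5.2'; effort: 'medium' | 'high' };

// Switching on the discriminant narrows `model` and `effort` per agent.
function describeVariant(variant: ExampleVariant): string {
  switch (variant.agent) {
    case 'claude':
      return `claude/${variant.model}${variant.effort ? `/${variant.effort}` : ''}`;
    case 'codex':
      return `codex/${variant.model}/${variant.effort}`;
  }
}

// Round to two decimals with Math.round(x * 100) / 100, as the commit describes.
function successRate(passed: number, total: number): number {
  if (total === 0) return 0; // assumption: an empty run counts as 0
  return Math.round((passed / total) * 100) / 100;
}

console.log(describeVariant({ agent: 'codex', model: 'gpt-5.2', effort: 'high' }));
console.log(successRate(2, 3));
```

With this shape, pairing a Claude agent with a Codex model is rejected by the compiler instead of surfacing at runtime.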
Pull request overview
This PR adds a new “eval” harness under scripts/eval/ to benchmark/grade Storybook setup work using AI agents (Claude + Codex), while also advancing the repo’s move toward native Node execution of .ts files (explicit .ts import extensions). It also updates core “ghost stories” utilities and introduces some debug/CI changes.
Changes:
- Add a new `scripts/eval/` pipeline (prepare trial → run agent → grade results) with prompts, scoring, and Vitest coverage.
- Update TypeScript configs to support explicit `.ts` import extensions (native Node TS execution migration).
- Refactor/extend core ghost-stories utilities (renames `runStoryTests` → `runGhostStories`, adds optional `cwd`) and add a stub CLI command (`skill`).
Reviewed changes
Copilot reviewed 32 out of 35 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| yarn.lock | Lockfile updates for new agent/eval dependencies (Anthropic SDK, Codex SDK, citty, transitive deps). |
| scripts/tsconfig.json | Enables allowImportingTsExtensions; excludes one eval artifact file from typechecking. |
| scripts/package.json | Adds eval script entrypoint and new dependencies for agent SDKs + citty. |
| scripts/eval/types.ts | Defines core types for eval trials, grading, scoring, and reporting. |
| scripts/eval/types.test.ts | Validates AGENTS/PROJECTS config invariants (defaults, mappings, uniqueness). |
| scripts/eval/prompts/setup.md | Adds the “setup” prompt used to guide agents toward stable Storybook setup. |
| scripts/eval/prompts/self-heal.md | Adds a “self-heal” loop prompt focused on iterating via vitest --project=storybook. |
| scripts/eval/lib/utils.ts | Implements shared utilities: logging, formatting, prompt loading, environment capture, table formatting. |
| scripts/eval/lib/utils.test.ts | Unit tests for formatting helpers, prompt loading/listing, and table alignment (incl ANSI handling). |
| scripts/eval/lib/setup-patterns.ts | Detects common Storybook setup patterns by scanning .storybook/ files. |
| scripts/eval/lib/setup-patterns.test.ts | Tests setup-pattern detection against a temporary .storybook/ tree. |
| scripts/eval/lib/run-trial.ts | Orchestrates a full trial (prepare → capture env → prompt → agent → grade → summary.json). |
| scripts/eval/lib/run-trial.test.ts | Mocks pipeline dependencies and verifies report assembly, sequencing, and output files. |
| scripts/eval/lib/prepare-trial.ts | Clones/caches benchmark repos and installs deps before the agent runs. |
| scripts/eval/lib/package-manager.ts | Detects package manager via lockfiles and runs installs. |
| scripts/eval/lib/grading-helpers.test.ts | Integration-style tests composing candidate discovery, setup patterns, git parsing, and scoring. |
| scripts/eval/lib/grade.ts | Implements grading: changed files, setup patterns, storybook build, tsc, ghost stories, and scoring. |
| scripts/eval/lib/grade.test.ts | Unit tests for file filtering, scoring math, TS error counting, and git name-status parsing. |
| scripts/eval/lib/ghost-stories.ts | Eval-side ghost story runner (find candidates, run vitest JSON reporter, parse counts). |
| scripts/eval/lib/agents/codex.ts | Codex agent driver using @openai/codex-sdk, streaming events and estimating cost. |
| scripts/eval/lib/agents/claude-code.ts | Claude agent driver using @anthropic-ai/claude-agent-sdk with debug logging and transcript capture. |
| scripts/eval/eval.ts | CLI entrypoint for running one or many eval trials in parallel with zod-validated args. |
| scripts/eval/config.ts | Defines agent model/effort/pricing tables and benchmark projects (eval-baseline repos). |
| scripts/dangerfile.js | Adds debug printing and an unconditional fail() for non-team PRs in target-branch check. |
| foo | New file containing terminal escape sequences (appears accidental). |
| code/tsconfig.json | Enables allowImportingTsExtensions in the main code/ TS config. |
| code/lib/cli-storybook/src/bin/run.ts | Adds a new skill command (currently a stub implementation). |
| code/core/src/shared/utils/categorize-render-errors.ts | Switches relative import to explicit .ts extension. |
| code/core/src/core-server/utils/ghost-stories/run-story-tests.ts | Renames exported runner to runGhostStories and adds optional cwd; updates imports to .ts. |
| code/core/src/core-server/utils/ghost-stories/parse-vitest-report.ts | Updates imports to explicit .ts extensions. |
| code/core/src/core-server/utils/ghost-stories/get-candidates.ts | Adds configurable cwd for globbing and updates import to explicit .ts. |
| code/core/src/core-server/server-channel/ghost-stories-channel.ts | Updates channel to call renamed runGhostStories. |
| AGENTS.md | Updates Node version and documents native Node TS execution migration guidance. |
| .gitignore | Adds ignore entries for eval outputs (currently pointing under scripts/eval/). |
| .agents/skills/review-pr/SKILL.md | Adds a new “review-pr” agent skill document (HTML single-page PR review generator). |
| return ['pnpm', ['install', '--no-frozen-lockfile']]; | ||
| case 'yarn': | ||
| return [ | ||
| 'yarn', | ||
| existsSync(join(dir, '.yarnrc.yml')) ? ['install', '--no-immutable'] : ['install'], | ||
| ]; | ||
| case 'bun': | ||
| return ['bun', ['install']]; |
This eval harness clones external repos and installs dependencies; however only the npm path uses --ignore-scripts, while pnpm/yarn/bun will run lifecycle scripts by default. For safety and reproducibility, consider consistently disabling install scripts (or explicitly documenting/isolating why it’s safe to run them) across all package managers.
| return ['pnpm', ['install', '--no-frozen-lockfile']]; | |
| case 'yarn': | |
| return [ | |
| 'yarn', | |
| existsSync(join(dir, '.yarnrc.yml')) ? ['install', '--no-immutable'] : ['install'], | |
| ]; | |
| case 'bun': | |
| return ['bun', ['install']]; | |
| return ['pnpm', ['install', '--no-frozen-lockfile', '--ignore-scripts']]; | |
| case 'yarn': | |
| return [ | |
| 'yarn', | |
| existsSync(join(dir, '.yarnrc.yml')) | |
| ? ['install', '--no-immutable', '--ignore-scripts'] | |
| : ['install', '--ignore-scripts'], | |
| ]; | |
| case 'bun': | |
| return ['bun', ['install', '--ignore-scripts']]; |
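For context on the detection step that feeds this switch, a lockfile-based lookup might look like the sketch below. The precedence order, the npm fallback, and the injected `exists` callback are assumptions for illustration; the lockfile names themselves are the standard ones for each tool. Whether each package manager accepts a given script-disabling flag is worth checking against that tool's own documentation.

```typescript
type PackageManager = 'npm' | 'pnpm' | 'yarn' | 'bun';

// Standard lockfile names; the precedence order here is an assumption.
const LOCKFILES: Array<[string, PackageManager]> = [
  ['pnpm-lock.yaml', 'pnpm'],
  ['yarn.lock', 'yarn'],
  ['bun.lockb', 'bun'],
  ['bun.lock', 'bun'],
  ['package-lock.json', 'npm'],
];

// `exists` is injected so the lookup can be exercised without a real checkout.
function detectPackageManager(dir: string, exists: (path: string) => boolean): PackageManager {
  for (const [lockfile, manager] of LOCKFILES) {
    if (exists(`${dir}/${lockfile}`)) return manager;
  }
  return 'npm'; // assumption: default to npm when no lockfile is present
}

const checkout = new Set(['/tmp/trial/yarn.lock', '/tmp/trial/package.json']);
console.log(detectPackageManager('/tmp/trial', (path) => checkout.has(path)));
```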
| console.log('authorAssociation', authorAssociation); | ||
| console.log('author', author); | ||
| console.log(JSON.stringify(danger.github.pr, null, 2)); | ||
|
|
The added debug logging prints the full PR payload (including potentially large or sensitive metadata) to Danger’s output. Please remove these console.log calls (or guard them behind an explicit debug flag) to keep CI logs clean and avoid leaking data.
| console.log('authorAssociation', authorAssociation); | |
| console.log('author', author); | |
| console.log(JSON.stringify(danger.github.pr, null, 2)); |
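If the logging needs to stay temporarily, one option is to gate it behind an explicit flag and print only a few selected fields rather than the whole payload. The sketch below assumes a hypothetical `DANGER_DEBUG` environment variable and a hand-picked `PrSummary` shape; neither exists in the Dangerfile as shown.

```typescript
// Hypothetical subset of PR fields that is safe and small enough to log.
interface PrSummary {
  number: number;
  base: string;
  authorAssociation: string;
}

// Log only when the (assumed) DANGER_DEBUG flag is set; returns whether it logged.
function debugLog(
  env: Record<string, string | undefined>,
  summary: PrSummary,
  sink: (line: string) => void = console.log
): boolean {
  if (env.DANGER_DEBUG !== '1') return false;
  sink(JSON.stringify(summary));
  return true;
}

const captured: string[] = [];
const summary: PrSummary = { number: 1, base: 'next', authorAssociation: 'CONTRIBUTOR' };
debugLog({}, summary, (line) => captured.push(line)); // flag unset: nothing is logged
debugLog({ DANGER_DEBUG: '1' }, summary, (line) => captured.push(line)); // flag set: one line
console.log(captured.length);
```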
| fail(JSON.stringify(danger.github.pr, null, 2)); | ||
|
|
fail(JSON.stringify(danger.github.pr, null, 2)) will make Danger fail every non-team PR unconditionally, bypassing the actual target-branch logic below. This should be removed; only fail when the base branch is invalid.
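A conditional version of the check could be shaped like the sketch below: compute an error message only when the base branch is not allowed, and call `fail()` only in that case. The `targetBranchError` helper and the allowed-branch list are hypothetical; the Dangerfile's real target-branch rules are not shown here.

```typescript
// Returns an error message for a disallowed base branch, or undefined when it is fine.
function targetBranchError(baseRef: string, allowed: readonly string[]): string | undefined {
  if (allowed.includes(baseRef)) return undefined;
  return `PR targets "${baseRef}"; expected one of: ${allowed.join(', ')}`;
}

// In a Dangerfile this would feed Danger's fail(), for example:
//   const error = targetBranchError(danger.github.pr.base.ref, allowedBranches);
//   if (error) fail(error);
const allowedBranches = ['next']; // assumption for the example only
console.log(targetBranchError('main', allowedBranches));
console.log(targetBranchError('next', allowedBranches));
```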
| @@ -0,0 +1 @@ | |||
| (raw terminal escape sequences captured from a GNU nano 8.7.1 session editing the file foo; no newline at end of file) | |||
This file appears to contain raw terminal escape sequences (likely an accidentally committed editor buffer) and has no meaningful source content. It should be removed from the repository to avoid polluting diffs and tooling.
| "@anthropic-ai/claude-agent-sdk": "^0.2.85", | ||
| "@fal-works/esbuild-plugin-global-externals": "^2.1.2", | ||
| "@google-cloud/bigquery": "^6.2.1", | ||
| "@octokit/graphql": "^5.0.6", | ||
| "@octokit/request": "^8.4.1", | ||
| "@openai/codex-sdk": "^0.117.0", | ||
| "@polka/parse": "^1.0.0-next.28", |
@anthropic-ai/claude-agent-sdk declares a peer dependency on zod@^4, but this workspace currently depends on zod@^3.25.76. This will cause peer-dep warnings and can break at runtime if the SDK relies on Zod v4 APIs; either upgrade zod in scripts/ to a compatible major or use an SDK version compatible with Zod v3.
| "citty": "^0.2.1", | ||
| "codecov": "^3.8.1", |
citty is added as a dependency but there are no references to it in the scripts/ workspace. If it’s not used yet, please remove it to avoid unnecessary install surface area; otherwise, add the usage in this PR so the dependency is justified.
Caution: Review failed
Pull request was closed or merged during review
📝 Walkthrough
This PR introduces a comprehensive Storybook evaluation system for testing automated setup workflows. It adds PR review skill documentation, enables TypeScript's native
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client as Client CLI
participant Eval as eval.ts
participant Trial as runTrial()
participant Prep as prepareTrial()
participant Grade as grade()
participant Agent as AgentDriver
participant Vitest as Vitest<br/>(Ghost Stories)
participant Build as Build & TSC
Client->>Eval: npm run eval (with args)
Eval->>Eval: Parse arguments & derive<br/>trial configs
Eval->>Trial: Execute trial config<br/>(concurrent)
Trial->>Prep: Prepare workspace<br/>(clone/install)
Prep-->>Trial: TrialWorkspace
Trial->>Agent: execute({prompt,<br/>projectPath, ...})
Agent->>Agent: Stream & log<br/>agent output
Agent-->>Trial: Execution{cost,<br/>duration, turns}
Trial->>Grade: grade(workspace,<br/>logger, duration)
Grade->>Build: Run storybook build<br/>& tsc --noEmit
Build-->>Grade: Outputs & errors
Grade->>Vitest: runGhostStories<br/>(candidates)
Vitest-->>Grade: GhostStoryGrade
Grade-->>Trial: {grade,<br/>QualityScore}
Trial->>Trial: Assemble TrialReport<br/>& write summary.json
Trial-->>Eval: TrialReport
Eval->>Eval: Aggregate results &<br/>format output table
Eval-->>Client: Summary with cost,<br/>metrics, success rate
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~75 minutes
The PR spans 25+ new TypeScript files introducing a modular but substantial evaluation system. While most components are isolated and logic is relatively straightforward (no complex algorithms), understanding the orchestration flow, agent integrations, grading architecture, and ensuring type safety across the pipeline requires careful cross-file reasoning. The breadth of changes across agents, grading, utilities, and tests demands attention to architectural consistency and API contracts.
Closes #
What I did
Checklist for Contributors
Testing
The changes in this PR are covered in the following automated tests:
Manual testing
ribbit
Documentation
MIGRATION.MD
Checklist for Maintainers
When this PR is ready for testing, make sure to add `ci:normal`, `ci:merged` or `ci:daily` GH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found in `code/lib/cli-storybook/src/sandbox-templates.ts`
Make sure this PR contains one of the labels below:
Available labels
- bug: Internal changes that fixes incorrect behavior.
- maintenance: User-facing maintenance tasks.
- dependencies: Upgrading (sometimes downgrading) dependencies.
- build: Internal-facing build tooling & test updates. Will not show up in release changelog.
- cleanup: Minor cleanup style change. Will not show up in release changelog.
- documentation: Documentation only changes. Will not show up in release changelog.
- feature request: Introducing a new feature.
- BREAKING CHANGE: Changes that break compatibility in some way with current major version.
- other: Changes that don't fit in the above categories.
🦋 Canary release
This PR does not have a canary release associated. You can request a canary release of this pull request by mentioning the `@storybookjs/core` team here.
core team members can create a canary release here or locally with `gh workflow run --repo storybookjs/storybook publish.yml --field pr=<PR_NUMBER>`
Summary by CodeRabbit
Release Notes
New Features
`skill` CLI command for Storybook.
Updates
`.ts` file execution.