Debug: Debug Dangerfile PR data content 2 #34399

Closed
Sidnioulz wants to merge 53 commits into next from DEBUG-DANGER-2

Sidnioulz commented Mar 30, 2026

Closes #

What I did

Checklist for Contributors

Testing

The changes in this PR are covered in the following automated tests:

  • stories
  • unit tests
  • integration tests
  • end-to-end tests

Manual testing

ribbit

Documentation

  • Add or update documentation reflecting your changes
  • If you are deprecating/removing a feature, make sure to update
    MIGRATION.MD

Checklist for Maintainers

  • When this PR is ready for testing, make sure to add ci:normal, ci:merged or ci:daily GH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found in code/lib/cli-storybook/src/sandbox-templates.ts

  • Make sure this PR contains one of the labels below:

    Available labels
    • bug: Internal changes that fix incorrect behavior.
    • maintenance: User-facing maintenance tasks.
    • dependencies: Upgrading (sometimes downgrading) dependencies.
    • build: Internal-facing build tooling & test updates. Will not show up in release changelog.
    • cleanup: Minor cleanup style change. Will not show up in release changelog.
    • documentation: Documentation only changes. Will not show up in release changelog.
    • feature request: Introducing a new feature.
    • BREAKING CHANGE: Changes that break compatibility in some way with current major version.
    • other: Changes that don't fit in the above categories.

🦋 Canary release

This PR does not have a canary release associated. You can request a canary release of this pull request by mentioning the @storybookjs/core team here.

Core team members can create a canary release here or locally with gh workflow run --repo storybookjs/storybook publish.yml --field pr=<PR_NUMBER>

Summary by CodeRabbit

Release Notes

  • New Features

    • Added evaluation framework for benchmarking Storybook setup across projects with configurable agents and models.
    • Introduced skill CLI command for Storybook.
    • Added ghost stories testing workflow for automated component validation.
    • Added setup guidance prompts and self-healing iteration workflow.
  • Updates

    • Updated Node.js version guidance to 22.22.1.
    • Enhanced TypeScript support with native .ts file execution.
    • Improved component candidate discovery and grading system.

yannbf and others added 30 commits March 24, 2026 12:43
Eval system to test how well AI agents complete Storybook setup after
`npx storybook@latest init --yes` on real-world projects.

Features:
- Multi-LLM support: Claude Code (Opus/Sonnet/Haiku), GitHub Copilot CLI
  (Claude models + GPT-5.2-codex, GPT-5.2, GPT-5.1-codex-max)
- 6 test projects covering different tech stacks: styled-components/Redux,
  Tailwind/HeadlessUI, Zustand, ECharts, GraphQL
- Structured JSON output with execution metrics (cost, duration, turns)
  and grading (build success, TypeScript errors, quality score)
- CLI with project/model/agent selection, iterations, custom prompts

Usage: npx jiti scripts/eval/eval.ts --project wikitok --model claude-sonnet-4-6

Refs: #34295
Replace CLI process spawning with proper SDKs:
- Claude: @anthropic-ai/claude-agent-sdk with query() API
- Codex: @openai/codex-sdk with thread streaming API

Benefits: structured responses, proper cost tracking, no stream-json
parsing, no CLI installation dependency, full conversation transcript.
- Pre-prepared eval-baseline branches on forked repos (kasperpeulen/*)
  eliminates storybook init during trials
- Cache system: first run clones + installs, subsequent runs copy from
  cache — agent starts immediately
- Post-init baseline commit for clean git diffs
- Richer result schema: changed files, setup patterns, ghost stories
- Ghost stories grading via STORYBOOK_COMPONENT_PATHS + Vitest
- Setup pattern detection (tailwind, redux, router, etc.)
- Better prompt: allows story creation, focuses on real components
- Smarter cleanup: only removes starter stories, not project stories

Tested on wikitok: quality 1.0, build pass, 7/7 ghost stories, $0.78
- Google Sheets integration via Apps Script webhook (set EVAL_GOOGLE_SHEETS_URL)
- Run ID (per session) and upload ID (for grouping) like MCP eval
- Environment capture (node version, git branch/commit)
- Included google-apps-script.js for setting up the spreadsheet
Prompts are now composable: --prompt setup self-heal doctor
Each name maps to prompts/{name}.md, concatenated in order.

Available prompts:
- setup: base setup prompt (default)
- self-heal: iterative fix loop using vitest --project=storybook
- doctor: run diagnostics before large config changes

Updated verification to prefer vitest over storybook build since
storybook init creates the vitest integration automatically.
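
A minimal sketch of the prompt composition described in the commit above, assuming each name resolves to prompts/{name}.md and the file contents are joined in the order given. The names `composePrompt` and `ReadFile`, the blank-line separator, and the in-memory file map are illustrative assumptions, not the PR's actual code.

```typescript
// Hypothetical helper: maps prompt names to prompts/{name}.md and joins them in order.
type ReadFile = (path: string) => string;

function composePrompt(names: string[], readFile: ReadFile, dir = 'prompts'): string {
  // The commit notes list "setup" as the default prompt, so fall back to it.
  const selected = names.length > 0 ? names : ['setup'];
  return selected
    .map((name) => readFile(`${dir}/${name}.md`).trim())
    .join('\n\n'); // the separator is an assumption
}

// In-memory stand-in for the prompts directory, so the sketch runs without disk access.
const fakeFiles: Record<string, string> = {
  'prompts/setup.md': 'Set up Storybook.',
  'prompts/self-heal.md': 'Iterate with vitest --project=storybook.',
};

const readFake: ReadFile = (path) => {
  const content = fakeFiles[path];
  if (content === undefined) throw new Error(`Unknown prompt: ${path}`);
  return content;
};

const prompt = composePrompt(['setup', 'self-heal'], readFake);
console.log(prompt);
```

Passing the reader as a callback keeps the composition logic testable; a real CLI would presumably hand in a function backed by the file system.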
- Move cleanEnv to utils (was duplicated in prepare-trial and grade)
- Replace fast-glob/glob with Node 22 built-in fs.globSync
- Compact setup-patterns rules into tuple array
- Remove manual file recursion in setup-patterns and ghost-stories
- Fix save.ts bug (relative(EVAL_ROOT, "") → removed trialPath)
- Remove unused logWarn, simplify logging helpers
- Tighten prepare-trial install detection into single expression
- Delete config.ts and generate-prompt.ts — merge PROJECTS into types.ts,
  prompts into utils.ts, inline agents map into run-task.ts
- computeQualityScore takes options object instead of 4 positional params
- Quality score now includes ghost stories (40%), build (25%),
  typecheck (25%), and performance (10%)
- exec() uses tinyexec native timeout instead of manual AbortController
- Codex agent tracks token usage and estimates cost from pricing table
- Environment fields renamed to evalBranch/evalCommit for clarity
- IPC sentinel shared as exported constant between eval.ts and eval-parallel.ts
- Summary tables now show quality score column
- setup-patterns uses object array instead of positional tuples
- prepare-repos.ts uses shared exec(), static imports, consistent quotes
- google-apps-script.js modernized to const/let + arrow functions
- Remove SupportedModel type alias (was just string)
- Fix .gitignore trailing newline, prompt no longer hardcodes React+Vite
- MAX_TURNS extracted as named constant in claude agent
…rts)

Core source files use extensionless import specifiers that fail under
Node's native TypeScript loader. Read numPassedTests/numTotalTests
directly from the vitest JSON report instead.
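
As a rough illustration of the last commit above, the sketch below reads only the numPassedTests and numTotalTests fields from a Vitest JSON report and derives a success rate. The helper name `successRateFromReport`, the handling of an empty run, and the sample report are assumptions; the rounding uses the Math.round(x*100)/100 form mentioned in a later commit.

```typescript
// Only the two fields named in the commit message are read; everything else is ignored.
interface VitestJsonCounts {
  numPassedTests: number;
  numTotalTests: number;
}

// Hypothetical helper name, not taken from the PR.
function successRateFromReport(reportJson: string): number {
  const report = JSON.parse(reportJson) as Partial<VitestJsonCounts>;
  const passed = report.numPassedTests;
  const total = report.numTotalTests;
  if (typeof passed !== 'number' || typeof total !== 'number') {
    throw new Error('Report is missing numPassedTests or numTotalTests');
  }
  if (total === 0) return 0; // assumption: an empty run counts as 0 rather than NaN
  return Math.round((passed / total) * 100) / 100;
}

const sampleReport = JSON.stringify({ numPassedTests: 5, numTotalTests: 7 });
console.log(successRateFromReport(sampleReport));
```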
kasperpeulen and others added 23 commits March 30, 2026 15:23
Node's native TypeScript loader requires explicit .ts extensions.
Add them to parse-vitest-report.ts and categorize-render-errors.ts
so the eval can import parseVitestResults from core via relative path.
… tsconfig fixes

- Separate types from runtime config (types.ts + config.ts)
- Thread Logger through entire pipeline (fixes garbled parallel output)
- Replace fragile stdout sentinel IPC with Node fork/process.send
- Run storybook build + typecheck in parallel (saves ~60-120s/trial)
- Tighten Agent interface to single params object
- Add --agent/--model/--prompt filters to eval-parallel
- Make quality score weights configurable
- Add prompt template variable support
- Enable allowImportingTsExtensions in root and scripts tsconfigs
- Fix all pre-existing TS errors in eval files
…from core-server, inline into grade.ts

- Rename runStoryTests to runGhostStories in core (clearer name)
- Add cwd parameter to runGhostStories and getComponentCandidates
- Export getComponentCandidates, runGhostStories, TestRunSummary from core-server index
- Remove eval ghost-stories.ts wrapper — inline logic into grade.ts
- Remove eval ghost-stories.test.ts — core already has its own tests
- Revert speculative isCandidate/isValidCandidate export (unused)
- Remove unused logger import from get-candidates.ts
…ions

The core-server barrel index re-exports modules (build-static, etc.) that
fail under native Node TS. Import ghost-stories utilities directly from
their source files instead, and add .ts extensions to internal imports
in the import chain.
…rop exec wrapper

- Replace fork/IPC parallel execution with direct Promise.allSettled + prefixed loggers
- Make blocking fs calls async (cpSync→cp, writeFileSync→writeFile, mkdirSync→mkdir)
- Remove Google Sheets upload, google-apps-script.js, and upload-id/run-id plumbing
- Drop custom exec wrapper — use tinyexec's x() directly at call sites
- Remove runId/uploadId from runTask signature and both CLI entry points
- Replace plain interfaces with Zod schemas for runtime validation (types.ts)
- Merge eval.ts + eval-parallel.ts into a single CLI with comma-separated args
- Fix deep core imports to use barrel export (core-server/index.ts)
- Extract shared package-manager detection and install (lib/package-manager.ts)
- Move pricing tables and model ID mappings into config.ts
- Make setup-patterns.ts fully async with fs/promises
- Add formatTable utility with ANSI-aware column alignment
- Integrate prepare-repos.ts with shared logger and PM utilities
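
The first bullet of the commit above replaces fork/IPC with Promise.allSettled and prefixed loggers. A minimal sketch of that pattern follows; `prefixedLogger`, `runAll`, and `TrialOutcome` are hypothetical names and the trial bodies are stand-ins, so this shows the general technique rather than the PR's implementation.

```typescript
type Logger = (message: string) => void;

// Wraps a sink so every line carries the trial's label, keeping interleaved output readable.
function prefixedLogger(prefix: string, sink: Logger): Logger {
  return (message) => sink(`[${prefix}] ${message}`);
}

interface TrialOutcome {
  name: string;
  ok: boolean;
  detail: string;
}

// Starts all trials at once and waits for every one to settle,
// so a single rejected trial does not discard the others' results.
async function runAll(
  trials: { name: string; run: (log: Logger) => Promise<string> }[],
  sink: Logger
): Promise<TrialOutcome[]> {
  const settled = await Promise.allSettled(
    trials.map((trial) => trial.run(prefixedLogger(trial.name, sink)))
  );
  return settled.map((result, i) =>
    result.status === 'fulfilled'
      ? { name: trials[i].name, ok: true, detail: result.value }
      : { name: trials[i].name, ok: false, detail: String(result.reason) }
  );
}

// Demo with one passing and one failing stand-in trial.
const lines: string[] = [];
const demo = runAll(
  [
    { name: 'wikitok', run: async (log) => { log('starting'); return 'done'; } },
    { name: 'broken', run: async (log) => { log('starting'); throw new Error('install failed'); } },
  ],
  (line) => { lines.push(line); }
);
demo.then((outcomes) => console.log(outcomes, lines));
```

Unlike Promise.all, Promise.allSettled never rejects, which suits a benchmark where partial results are still worth reporting.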
…slides

- Replace slideshow format with a scrollable HTML page using file cards
- Show complete file contents for new files, diffs for modified files
- Lexend + JetBrains Mono fonts, light/dark theme, mobile-responsive
- Static server on port 3000 (no live-reload)
- Issues shown inline as smell-boxes, never block page generation
- Simplified to 5 steps: gather → read → generate → serve → iterate
…bility review

- Two layers per area: curated walkthrough (API→Tests→Impl) + collapsed full files
- Use language-typescript with data-diff attribute instead of language-diff
- Post-processing script for line-level add/remove backgrounds on top of TS highlighting
- Add readability review guidance: logical order, clear names, comments, test quality
- Order areas high-level to low-level
Principle 3 now explicitly requires showing complete interface definitions
where they're first relevant, not just type names.
Extract AgentRunConfig { agent, model, effort } and compose it as
a `run` field in TrialConfig, ExecutionResult, and TrialResult
instead of spreading via extends/inheritance.
- AgentRunConfig → AgentVariant (it's the experimental variant, not a "run config")
- Agent → AgentDriver, AgentConfig → AgentDefinition (disambiguate)
- ExecutionResult → Execution, GradingResult → Grade, QualityResult → QualityScore
- TrialResult → TrialReport, TrialPaths → TrialWorkspace
- ChangedFile → FileChange, Pricing → TokenPricing, Environment → EvalEnvironment
- GhostStoriesResult → GhostStoryGrade, GhostStoryRunResult → GhostStoryOutput
- QualityWeights → ScoreWeights, DEFAULT_QUALITY_WEIGHTS → DEFAULT_SCORE_WEIGHTS
- Field renames: run → variant, grading → grade, quality → score,
  changedFiles → fileChanges, storybookFiles → storybookChanges
- Extract AgentExecuteParams with variant: AgentVariant (reuses the model)
- Remove redundant run field from Execution (lives on TrialReport only)
Every project needs a branch for cloning. The type now reflects
that, and the `branch!` assertion in prepareTrial is no longer needed.
…Trial, throw on ghost story errors

- Make AgentVariant a discriminated union on agent, with typed model/effort per agent
- Rename runTask→runTrial and run-task.ts→run-trial.ts for consistent domain naming
- Store full Project in TrialReport instead of just the name for reproducibility
- Replace error-object returns with GhostStoryError throws in ghost-stories.ts
- Fix successRate rounding to use Math.round(x*100)/100 consistently
- Extract scoring magic numbers into named constants
- Validate git status chars against known set instead of blind casting
- Truncate build/typecheck output at line boundaries
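
The commit above makes AgentVariant a discriminated union on agent. The sketch below shows what such a union could look like; the agent ids mirror the driver file names (claude-code, codex), while the model and effort values are illustrative placeholders rather than the PR's actual lists, and `describeVariant` is a hypothetical helper.

```typescript
// Model and effort values are placeholders for illustration only.
type ClaudeVariant = {
  agent: 'claude-code';
  model: 'opus' | 'sonnet' | 'haiku';
  effort?: 'low' | 'high';
};

type CodexVariant = {
  agent: 'codex';
  model: 'gpt-5.2-codex' | 'gpt-5.2';
  effort?: 'low' | 'medium' | 'high';
};

// The shared `agent` field is the discriminant that lets TypeScript narrow the union.
type AgentVariant = ClaudeVariant | CodexVariant;

function describeVariant(variant: AgentVariant): string {
  switch (variant.agent) {
    case 'claude-code':
      return `Claude Code / ${variant.model}`;
    case 'codex':
      return `Codex / ${variant.model}`;
    default: {
      // Exhaustiveness check: adding a new agent without a case fails to compile here.
      const unreachable: never = variant;
      throw new Error(`Unknown agent: ${JSON.stringify(unreachable)}`);
    }
  }
}

const variant: AgentVariant = { agent: 'codex', model: 'gpt-5.2-codex', effort: 'high' };
console.log(describeVariant(variant));
```

With this shape, a mismatched pair such as { agent: 'codex', model: 'sonnet' } is rejected at compile time instead of surfacing as a runtime error mid-trial.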
Copilot AI review requested due to automatic review settings March 30, 2026 10:47
Sidnioulz added the build (Internal-facing build tooling & test updates) and ci:docs (Run the CI jobs for documentation checks only.) labels Mar 30, 2026
Sidnioulz closed this Mar 30, 2026

Copilot AI left a comment

Pull request overview

This PR adds a new “eval” harness under scripts/eval/ to benchmark/grade Storybook setup work using AI agents (Claude + Codex), while also advancing the repo’s move toward native Node execution of .ts files (explicit .ts import extensions). It also updates core “ghost stories” utilities and introduces some debug/CI changes.

Changes:

  • Add a new scripts/eval/ pipeline (prepare trial → run agent → grade results) with prompts, scoring, and Vitest coverage.
  • Update TypeScript configs to support explicit .ts import extensions (native Node TS execution migration).
  • Refactor/extend core ghost-stories utilities (renames runStoryTests → runGhostStories, adds optional cwd) and add a stub CLI command (skill).

Reviewed changes

Copilot reviewed 32 out of 35 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
yarn.lock Lockfile updates for new agent/eval dependencies (Anthropic SDK, Codex SDK, citty, transitive deps).
scripts/tsconfig.json Enables allowImportingTsExtensions; excludes one eval artifact file from typechecking.
scripts/package.json Adds eval script entrypoint and new dependencies for agent SDKs + citty.
scripts/eval/types.ts Defines core types for eval trials, grading, scoring, and reporting.
scripts/eval/types.test.ts Validates AGENTS/PROJECTS config invariants (defaults, mappings, uniqueness).
scripts/eval/prompts/setup.md Adds the “setup” prompt used to guide agents toward stable Storybook setup.
scripts/eval/prompts/self-heal.md Adds a “self-heal” loop prompt focused on iterating via vitest --project=storybook.
scripts/eval/lib/utils.ts Implements shared utilities: logging, formatting, prompt loading, environment capture, table formatting.
scripts/eval/lib/utils.test.ts Unit tests for formatting helpers, prompt loading/listing, and table alignment (incl ANSI handling).
scripts/eval/lib/setup-patterns.ts Detects common Storybook setup patterns by scanning .storybook/ files.
scripts/eval/lib/setup-patterns.test.ts Tests setup-pattern detection against a temporary .storybook/ tree.
scripts/eval/lib/run-trial.ts Orchestrates a full trial (prepare → capture env → prompt → agent → grade → summary.json).
scripts/eval/lib/run-trial.test.ts Mocks pipeline dependencies and verifies report assembly, sequencing, and output files.
scripts/eval/lib/prepare-trial.ts Clones/caches benchmark repos and installs deps before the agent runs.
scripts/eval/lib/package-manager.ts Detects package manager via lockfiles and runs installs.
scripts/eval/lib/grading-helpers.test.ts Integration-style tests composing candidate discovery, setup patterns, git parsing, and scoring.
scripts/eval/lib/grade.ts Implements grading: changed files, setup patterns, storybook build, tsc, ghost stories, and scoring.
scripts/eval/lib/grade.test.ts Unit tests for file filtering, scoring math, TS error counting, and git name-status parsing.
scripts/eval/lib/ghost-stories.ts Eval-side ghost story runner (find candidates, run vitest JSON reporter, parse counts).
scripts/eval/lib/agents/codex.ts Codex agent driver using @openai/codex-sdk, streaming events and estimating cost.
scripts/eval/lib/agents/claude-code.ts Claude agent driver using @anthropic-ai/claude-agent-sdk with debug logging and transcript capture.
scripts/eval/eval.ts CLI entrypoint for running one or many eval trials in parallel with zod-validated args.
scripts/eval/config.ts Defines agent model/effort/pricing tables and benchmark projects (eval-baseline repos).
scripts/dangerfile.js Adds debug printing and an unconditional fail() for non-team PRs in target-branch check.
foo New file containing terminal escape sequences (appears accidental).
code/tsconfig.json Enables allowImportingTsExtensions in the main code/ TS config.
code/lib/cli-storybook/src/bin/run.ts Adds a new skill command (currently a stub implementation).
code/core/src/shared/utils/categorize-render-errors.ts Switches relative import to explicit .ts extension.
code/core/src/core-server/utils/ghost-stories/run-story-tests.ts Renames exported runner to runGhostStories and adds optional cwd; updates imports to .ts.
code/core/src/core-server/utils/ghost-stories/parse-vitest-report.ts Updates imports to explicit .ts extensions.
code/core/src/core-server/utils/ghost-stories/get-candidates.ts Adds configurable cwd for globbing and updates import to explicit .ts.
code/core/src/core-server/server-channel/ghost-stories-channel.ts Updates channel to call renamed runGhostStories.
AGENTS.md Updates Node version and documents native Node TS execution migration guidance.
.gitignore Adds ignore entries for eval outputs (currently pointing under scripts/eval/).
.agents/skills/review-pr/SKILL.md Adds a new “review-pr” agent skill document (HTML single-page PR review generator).
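
The summary above lists git name-status parsing among the things grade.ts does, and the commit notes mention validating git status characters against a known set instead of casting. A minimal sketch of that idea, assuming tab-separated `git diff --name-status` output; `parseNameStatus`, `isGitStatus`, the chosen status set, and the decision to throw on unknown codes are assumptions.

```typescript
// Status letters accepted by this sketch; anything else is treated as unexpected.
const KNOWN_STATUSES = ['A', 'C', 'D', 'M', 'R', 'T', 'U'] as const;
type GitStatus = (typeof KNOWN_STATUSES)[number];

interface FileChange {
  status: GitStatus;
  path: string;
}

// Type guard that replaces a blind `as GitStatus` cast.
function isGitStatus(value: string): value is GitStatus {
  return (KNOWN_STATUSES as readonly string[]).includes(value);
}

function parseNameStatus(output: string): FileChange[] {
  const changes: FileChange[] = [];
  for (const line of output.split('\n')) {
    if (line.trim() === '') continue;
    const fields = line.split('\t');
    const status = fields[0].charAt(0); // "R100" -> "R"
    if (!isGitStatus(status)) {
      throw new Error(`Unexpected git status "${fields[0]}" in line: ${line}`);
    }
    // Renames and copies list old and new paths; keep the last field (the new path).
    changes.push({ status, path: fields[fields.length - 1] });
  }
  return changes;
}

const sampleOutput = 'M\t.storybook/preview.ts\nA\tsrc/Button.stories.tsx\nR100\told.ts\tnew.ts\n';
console.log(parseNameStatus(sampleOutput));
```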

Comment on lines +24 to +31
return ['pnpm', ['install', '--no-frozen-lockfile']];
case 'yarn':
return [
'yarn',
existsSync(join(dir, '.yarnrc.yml')) ? ['install', '--no-immutable'] : ['install'],
];
case 'bun':
return ['bun', ['install']];

Copilot AI Mar 30, 2026

This eval harness clones external repos and installs dependencies; however only the npm path uses --ignore-scripts, while pnpm/yarn/bun will run lifecycle scripts by default. For safety and reproducibility, consider consistently disabling install scripts (or explicitly documenting/isolating why it’s safe to run them) across all package managers.

Suggested change
- return ['pnpm', ['install', '--no-frozen-lockfile']];
- case 'yarn':
- return [
- 'yarn',
- existsSync(join(dir, '.yarnrc.yml')) ? ['install', '--no-immutable'] : ['install'],
- ];
- case 'bun':
- return ['bun', ['install']];
+ return ['pnpm', ['install', '--no-frozen-lockfile', '--ignore-scripts']];
+ case 'yarn':
+ return [
+ 'yarn',
+ existsSync(join(dir, '.yarnrc.yml'))
+ ? ['install', '--no-immutable', '--ignore-scripts']
+ : ['install', '--ignore-scripts'],
+ ];
+ case 'bun':
+ return ['bun', ['install', '--ignore-scripts']];
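
For context on the switch above: the file summary describes package-manager.ts as detecting the package manager via lockfiles. A hedged sketch of such detection is below. The lockfile names are the tools' standard ones, but the precedence order, the npm fallback, and the `detectPackageManager` name are assumptions rather than the PR's code.

```typescript
type PackageManager = 'pnpm' | 'yarn' | 'bun' | 'npm';

// Checked in order; the first lockfile found wins (the order is an assumption).
const LOCKFILES: [string, PackageManager][] = [
  ['pnpm-lock.yaml', 'pnpm'],
  ['yarn.lock', 'yarn'],
  ['bun.lockb', 'bun'],
  ['bun.lock', 'bun'],
  ['package-lock.json', 'npm'],
];

// Takes the file names found in the project root, so the logic stays free of disk access.
function detectPackageManager(fileNames: string[]): PackageManager {
  const present = new Set(fileNames);
  for (const [lockfile, pm] of LOCKFILES) {
    if (present.has(lockfile)) return pm;
  }
  return 'npm'; // fallback when no lockfile is found (assumption)
}

console.log(detectPackageManager(['package.json', 'yarn.lock']));
```

In a real script the file list would come from something like readdirSync(dir); keeping that outside the function makes the detection easy to unit test.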

Comment thread scripts/dangerfile.js
Comment on lines +169 to +172
console.log('authorAssociation', authorAssociation);
console.log('author', author);
console.log(JSON.stringify(danger.github.pr, null, 2));

Copilot AI Mar 30, 2026

The added debug logging prints the full PR payload (including potentially large or sensitive metadata) to Danger’s output. Please remove these console.log calls (or guard them behind an explicit debug flag) to keep CI logs clean and avoid leaking data.

Suggested change
- console.log('authorAssociation', authorAssociation);
- console.log('author', author);
- console.log(JSON.stringify(danger.github.pr, null, 2));
console.log(JSON.stringify(danger.github.pr, null, 2));

Comment thread scripts/dangerfile.js
Comment on lines +181 to +182
fail(JSON.stringify(danger.github.pr, null, 2));

Copilot AI Mar 30, 2026

fail(JSON.stringify(danger.github.pr, null, 2)) will make Danger fail every non-team PR unconditionally, bypassing the actual target-branch logic below. This should be removed; only fail when the base branch is invalid.
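
A sketch of the conditional check the comment asks for, under the assumption that non-team PRs must target next. The actual rule lives in scripts/dangerfile.js and is not shown on this page, so `ALLOWED_BASE`, `targetBranchFailure`, and the message text are all hypothetical.

```typescript
interface TargetCheckInput {
  baseRef: string;
  isTeamMember: boolean;
}

// Assumption for this sketch only: non-team PRs must target `next`.
const ALLOWED_BASE = 'next';

// Returns a failure message only when the base branch is invalid, otherwise null.
function targetBranchFailure({ baseRef, isTeamMember }: TargetCheckInput): string | null {
  if (isTeamMember) return null;
  if (baseRef === ALLOWED_BASE) return null;
  return `This PR targets "${baseRef}"; please retarget it to "${ALLOWED_BASE}".`;
}

const failure = targetBranchFailure({ baseRef: 'main', isTeamMember: false });
if (failure) console.log(failure); // in a Dangerfile this would be fail(failure)
```

Separating the decision from the call to fail() means a debugging session can log the decision without changing whether the check passes.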

Comment thread foo
@@ -0,0 +1 @@
[?2004h[?1049h(B[?7h[?1h=[?1h=[?25l(B(B[ Reading... ](B(B[ Read 1 line ](B(B GNU nano 8.7.1 foo (B(B^G(B Help(B^O(B Write Out(B^F(B Where Is(B^K(B Cut(B^T(B Execute(B^C(B Location(BM-U(B Undo(BM-A(B Set Mark(BM-](B To Bracket (BM-B(B Previous(B◂(B Back(B^◂(B Prev Word(B^A(B Home(B^P(B Prev Line(B^X(B Exit(B^R(B Read File(B^\(B Replace(B^U(B Paste(B^J(B Justify(B^/(B Go To Line (BM-E(B Redo(BM-6(B Copy(B^B(B Where Was(BM-F(B Next(B▸(B Forward(B^▸(B Next Word(B^E(B End(B^N(B Next Line^[[?2004h^[[?1049h^[[22;0;0t^[[1;58r^[(B^[[m^[[4l^[[?7h^[[39;49m^[[?1h^[=^[[?1h^[=^[[?25l^[[39;49m^[(B^[[m^[[H^[[2J^[[56;121H^[(B^[[0;7m[ Reading... ]^[(B^[[m[?12l[?25h[?25l(B[ line 1/2 (50%), col 1/159 ( 0%), char 0/135 ( 0%) ](B[?12l[?25h[?25l[?12l[?25h[?1049l[?1l>[?2004l No newline at end of file

Copilot AI Mar 30, 2026

This file appears to contain raw terminal escape sequences (likely an accidentally committed editor buffer) and has no meaningful source content. It should be removed from the repository to avoid polluting diffs and tooling.

Comment thread scripts/package.json
Comment on lines +45 to 51
"@anthropic-ai/claude-agent-sdk": "^0.2.85",
"@fal-works/esbuild-plugin-global-externals": "^2.1.2",
"@google-cloud/bigquery": "^6.2.1",
"@octokit/graphql": "^5.0.6",
"@octokit/request": "^8.4.1",
"@openai/codex-sdk": "^0.117.0",
"@polka/parse": "^1.0.0-next.28",

Copilot AI Mar 30, 2026

@anthropic-ai/claude-agent-sdk declares a peer dependency on zod@^4, but this workspace currently depends on zod@^3.25.76. This will cause peer-dep warnings and can break at runtime if the SDK relies on Zod v4 APIs; either upgrade zod in scripts/ to a compatible major or use an SDK version compatible with Zod v3.

Comment thread scripts/package.json
Comment on lines +79 to 80
"citty": "^0.2.1",
"codecov": "^3.8.1",

Copilot AI Mar 30, 2026

citty is added as a dependency but there are no references to it in the scripts/ workspace. If it’s not used yet, please remove it to avoid unnecessary install surface area; otherwise, add the usage in this PR so the dependency is justified.

coderabbitai Bot commented Mar 30, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

This PR introduces a comprehensive Storybook evaluation system for testing automated setup workflows. It adds PR review skill documentation, enables TypeScript's native .ts import resolution, refactors the ghost stories testing pipeline to support configurable working directories, introduces a new CLI "skill" command, and implements a complete evaluation harness with agent drivers (Claude, Codex), grading logic, and extensive test coverage.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| PR Review Skill Documentation | .agents/skills/review-pr/SKILL.md | New skill definition providing an HTML document template for summarizing pull requests with sticky navigation, API/test/implementation walktabs, and diff styling via Highlight.js and custom post-processing. |
| TypeScript & Runtime Configuration | code/tsconfig.json, scripts/tsconfig.json, AGENTS.md | Enable allowImportingTsExtensions for native Node.js TypeScript execution; update Node.js version guidance to 22.22.1 and document migration from jiti to direct .ts file imports. |
| Ghost Stories Test Refactoring | code/core/src/core-server/server-channel/ghost-stories-channel.ts, code/core/src/core-server/utils/ghost-stories/... | Rename runStoryTests to runGhostStories and add optional cwd parameter support; update imports to use explicit .ts extensions; remove unused logger dependency. |
| CLI Skill Command | code/lib/cli-storybook/src/bin/run.ts | Add new skill command with --package-manager and --config-dir options for executing Storybook skills (currently a stub with placeholder logging). |
| Evaluation System Type Definitions | scripts/eval/types.ts | Define core data models: Logger, AgentVariant, TrialConfig, TrialWorkspace, Execution, Grade, QualityScore, TrialReport, and related interfaces for the evaluation pipeline. |
| Evaluation System Configuration | scripts/eval/config.ts | Export agent configurations (AGENTS), project definitions (PROJECTS), token pricing tables, and estimateCost() function for computing trial execution costs across Claude and Codex models. |
| Evaluation Harness | scripts/eval/eval.ts | Main CLI entry point orchestrating the evaluation workflow: parses arguments, derives trial configurations, executes trials concurrently via runTrial, aggregates results, and outputs summary tables with cost/performance metrics. |
| Agent Implementations | scripts/eval/lib/agents/claude-code.ts, scripts/eval/lib/agents/codex.ts | Concrete AgentDriver implementations for Claude and Codex models; stream agent output, log execution details, compute costs, and write transcripts to results directories. |
| Grading & Scoring Logic | scripts/eval/lib/grade.ts, scripts/eval/lib/setup-patterns.ts | Implement weighted quality scoring (build success, TypeScript errors, ghost story pass rate, duration), run storybook build and tsc checks with timeouts, detect Storybook setup patterns via regex, and conditionally run ghost story evaluation. |
| Ghost Stories Runner | scripts/eval/lib/ghost-stories.ts | Discover component candidates via globbing, run Vitest with STORYBOOK_COMPONENT_PATHS env variable, parse JSON report, compute success rate, and provide standardized error reporting. |
| Trial Orchestration & Utilities | scripts/eval/lib/run-trial.ts, scripts/eval/lib/prepare-trial.ts, scripts/eval/lib/utils.ts, scripts/eval/lib/package-manager.ts | Coordinate single-trial execution; prepare and cache project repositories; provide logging, formatting, prompt loading, environment capture, and package manager detection utilities. |
| Evaluation System Tests | scripts/eval/lib/...test.ts, scripts/eval/types.test.ts | Add 1,200+ lines of Vitest suites validating agent configuration invariants, grading helper behavior, trial orchestration sequencing, setup pattern detection, file parsing, and utility functions. |
| Evaluation Prompts | scripts/eval/prompts/setup.md, scripts/eval/prompts/self-heal.md | Markdown guides instructing agents on Storybook setup procedures and iterative self-healing workflows using Vitest integration for story validation. |
| Dependencies & Infrastructure | .gitignore, scripts/package.json, scripts/dangerfile.js, foo | Add eval system cache/results directories to ignore patterns; introduce @anthropic-ai/claude-agent-sdk, @openai/codex-sdk, and citty dependencies; add debug logging to PR validation; include control character file (unclear purpose). |
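
The Grading & Scoring Logic entry above refers to weighted quality scoring, and the commit notes give the weights as ghost stories 40%, build 25%, typecheck 25%, and performance 10%. The sketch below works through that arithmetic. It assumes each component has already been normalized to the range 0 to 1; how the PR derives those component values is not shown on this page, and the function signature and final rounding are assumptions.

```typescript
interface ScoreWeights {
  ghostStories: number;
  build: number;
  typecheck: number;
  performance: number;
}

// Weights as stated in the commit notes: 40% / 25% / 25% / 10%.
const DEFAULT_SCORE_WEIGHTS: ScoreWeights = {
  ghostStories: 0.4,
  build: 0.25,
  typecheck: 0.25,
  performance: 0.1,
};

// Each component is assumed to be pre-normalized to 0..1 (assumption of this sketch).
type ScoreComponents = Record<keyof ScoreWeights, number>;

function computeQualityScore(
  components: ScoreComponents,
  weights: ScoreWeights = DEFAULT_SCORE_WEIGHTS
): number {
  const keys = Object.keys(weights) as (keyof ScoreWeights)[];
  const total = keys.reduce((sum, key) => sum + weights[key] * components[key], 0);
  return Math.round(total * 100) / 100; // two-decimal rounding is an assumption
}

// Everything passes: 0.4 + 0.25 + 0.25 + 0.1
const perfect = computeQualityScore({ ghostStories: 1, build: 1, typecheck: 1, performance: 1 });

// Half the ghost stories pass and typecheck fails: 0.2 + 0.25 + 0 + 0.1
const partial = computeQualityScore({ ghostStories: 0.5, build: 1, typecheck: 0, performance: 1 });

console.log(perfect, partial);
```

Because the weights sum to 1, a run that passes every component scores 1.0, which is consistent with the "quality 1.0" figure reported for wikitok in the commit notes.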

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client CLI
    participant Eval as eval.ts
    participant Trial as runTrial()
    participant Prep as prepareTrial()
    participant Grade as grade()
    participant Agent as AgentDriver
    participant Vitest as Vitest<br/>(Ghost Stories)
    participant Build as Build & TSC

    Client->>Eval: npm run eval (with args)
    Eval->>Eval: Parse arguments & derive<br/>trial configs
    Eval->>Trial: Execute trial config<br/>(concurrent)
    Trial->>Prep: Prepare workspace<br/>(clone/install)
    Prep-->>Trial: TrialWorkspace
    Trial->>Agent: execute({prompt,<br/>projectPath, ...})
    Agent->>Agent: Stream & log<br/>agent output
    Agent-->>Trial: Execution{cost,<br/>duration, turns}
    Trial->>Grade: grade(workspace,<br/>logger, duration)
    Grade->>Build: Run storybook build<br/>& tsc --noEmit
    Build-->>Grade: Outputs & errors
    Grade->>Vitest: runGhostStories<br/>(candidates)
    Vitest-->>Grade: GhostStoryGrade
    Grade-->>Trial: {grade,<br/>QualityScore}
    Trial->>Trial: Assemble TrialReport<br/>& write summary.json
    Trial-->>Eval: TrialReport
    Eval->>Eval: Aggregate results &<br/>format output table
    Eval-->>Client: Summary with cost,<br/>metrics, success rate

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

The PR spans 25+ new TypeScript files introducing a modular but substantial evaluation system. While most components are isolated and logic is relatively straightforward (no complex algorithms), understanding the orchestration flow, agent integrations, grading architecture, and ensuring type safety across the pipeline requires careful cross-file reasoning. The breadth of changes across agents, grading, utilities, and tests demands attention to architectural consistency and API contracts.

Labels

build Internal-facing build tooling & test updates ci:docs Run the CI jobs for documentation checks only.

4 participants