Debug: Debug Dangerfile PR data content #34398
Eval system to test how well AI agents complete Storybook setup after `npx storybook@latest init --yes` on real-world projects.

Features:
- Multi-LLM support: Claude Code (Opus/Sonnet/Haiku), GitHub Copilot CLI (Claude models + GPT-5.2-codex, GPT-5.2, GPT-5.1-codex-max)
- 6 test projects covering different tech stacks: styled-components/Redux, Tailwind/HeadlessUI, Zustand, ECharts, GraphQL
- Structured JSON output with execution metrics (cost, duration, turns) and grading (build success, TypeScript errors, quality score)
- CLI with project/model/agent selection, iterations, custom prompts

Usage:
  npx jiti scripts/eval/eval.ts --project wikitok --model claude-sonnet-4-6

Refs: #34295
Replace CLI process spawning with proper SDKs:
- Claude: @anthropic-ai/claude-agent-sdk with query() API
- Codex: @openai/codex-sdk with thread streaming API

Benefits: structured responses, proper cost tracking, no stream-json parsing, no CLI installation dependency, full conversation transcript.
- Pre-prepared eval-baseline branches on forked repos (kasperpeulen/*) eliminate storybook init during trials
- Cache system: first run clones + installs, subsequent runs copy from cache, so the agent starts immediately
- Post-init baseline commit for clean git diffs
- Richer result schema: changed files, setup patterns, ghost stories
- Ghost stories grading via STORYBOOK_COMPONENT_PATHS + Vitest
- Setup pattern detection (tailwind, redux, router, etc.)
- Better prompt: allows story creation, focuses on real components
- Smarter cleanup: only removes starter stories, not project stories

Tested on wikitok: quality 1.0, build pass, 7/7 ghost stories, $0.78
- Google Sheets integration via Apps Script webhook (set EVAL_GOOGLE_SHEETS_URL)
- Run ID (per session) and upload ID (for grouping) like MCP eval
- Environment capture (node version, git branch/commit)
- Included google-apps-script.js for setting up the spreadsheet
Prompts are now composable: --prompt setup self-heal doctor
Each name maps to prompts/{name}.md, concatenated in order.
Available prompts:
- setup: base setup prompt (default)
- self-heal: iterative fix loop using vitest --project=storybook
- doctor: run diagnostics before large config changes
Updated verification to prefer vitest over storybook build since
storybook init creates the vitest integration automatically.
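
The prompt composition described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: `composePrompt`, `PromptReader`, `readFromDisk`, the `prompts` directory default, and the blank-line separator are all assumptions introduced here.

```typescript
import { readFileSync } from "node:fs";
import { join } from "node:path";

/** Reads one prompt by name; injectable so the sketch can run without touching disk. */
type PromptReader = (name: string) => string;

// Hypothetical default reader: resolves {promptsDir}/{name}.md.
const readFromDisk =
  (promptsDir: string): PromptReader =>
  (name) =>
    readFileSync(join(promptsDir, `${name}.md`), "utf8");

/**
 * Concatenates the named prompts in the order given.
 * Falls back to the `setup` prompt when no names are passed (assumed default).
 */
export function composePrompt(
  names: string[],
  read: PromptReader = readFromDisk("prompts")
): string {
  const selected = names.length > 0 ? names : ["setup"];
  // The blank-line separator between prompts is an assumption of this sketch.
  return selected.map((name) => read(name).trim()).join("\n\n");
}
```

With this shape, `--prompt setup self-heal doctor` would correspond to `composePrompt(["setup", "self-heal", "doctor"])`.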
- Move cleanEnv to utils (was duplicated in prepare-trial and grade)
- Replace fast-glob/glob with Node 22 built-in fs.globSync
- Compact setup-patterns rules into tuple array
- Remove manual file recursion in setup-patterns and ghost-stories
- Fix save.ts bug (relative(EVAL_ROOT, "") → removed trialPath)
- Remove unused logWarn, simplify logging helpers
- Tighten prepare-trial install detection into single expression
…env, 1s timeout, no --project
- Delete config.ts and generate-prompt.ts: merge PROJECTS into types.ts, prompts into utils.ts, inline agents map into run-task.ts
- computeQualityScore takes options object instead of 4 positional params
- Quality score now includes ghost stories (40%), build (25%), typecheck (25%), and performance (10%)
- exec() uses tinyexec native timeout instead of manual AbortController
- Codex agent tracks token usage and estimates cost from pricing table
- Environment fields renamed to evalBranch/evalCommit for clarity
- IPC sentinel shared as exported constant between eval.ts and eval-parallel.ts
- Summary tables now show quality score column
- setup-patterns uses object array instead of positional tuples
- prepare-repos.ts uses shared exec(), static imports, consistent quotes
- google-apps-script.js modernized to const/let + arrow functions
- Remove SupportedModel type alias (was just string)
- Fix .gitignore trailing newline, prompt no longer hardcodes React+Vite
- MAX_TURNS extracted as named constant in claude agent
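
As a rough illustration of the weighting listed above, a score function taking an options object might look like this. Only the four percentages come from the commit message; the field names, the 0..1 normalization of the performance input, and the two-decimal rounding are assumptions of this sketch.

```typescript
/** Hypothetical inputs; the real option names are not shown in the commit message. */
interface QualityScoreOptions {
  ghostStoryRate: number; // fraction of ghost stories that passed, 0..1
  buildPassed: boolean;
  typecheckPassed: boolean;
  performance: number; // assumed to be pre-normalized to 0..1
}

// Weights as stated in the commit message: 40% / 25% / 25% / 10%.
const WEIGHTS = { ghostStories: 0.4, build: 0.25, typecheck: 0.25, performance: 0.1 } as const;

const clamp01 = (value: number): number => Math.min(1, Math.max(0, value));

export function computeQualityScore(options: QualityScoreOptions): number {
  const raw =
    WEIGHTS.ghostStories * clamp01(options.ghostStoryRate) +
    WEIGHTS.build * (options.buildPassed ? 1 : 0) +
    WEIGHTS.typecheck * (options.typecheckPassed ? 1 : 0) +
    WEIGHTS.performance * clamp01(options.performance);
  // Round to two decimals so floating-point noise does not leak into reports (assumption).
  return Math.round(raw * 100) / 100;
}
```

Named fields make call sites self-describing, which is presumably the motivation for moving away from four positional parameters.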
…rts)
Core source files use extensionless import specifiers that fail under Node's native TypeScript loader. Read numPassedTests/numTotalTests directly from the vitest JSON report instead.
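
Reading the two counters straight from the report could look something like the sketch below. The `numTotalTests` and `numPassedTests` field names come from the commit message; `summarizeVitestReport`, the returned shape, and the zero fallback for missing fields are assumptions introduced for illustration.

```typescript
/** The only two fields of the vitest JSON report that this sketch relies on. */
interface VitestReportCounts {
  numTotalTests: number;
  numPassedTests: number;
}

interface GhostStorySummary {
  total: number;
  passed: number;
  successRate: number;
}

export function summarizeVitestReport(json: string): GhostStorySummary {
  const report = JSON.parse(json) as Partial<VitestReportCounts>;
  // Treat missing counters as zero rather than failing (assumption).
  const total = report.numTotalTests ?? 0;
  const passed = report.numPassedTests ?? 0;
  // Avoid dividing by zero when no tests ran; round to two decimals.
  const successRate = total === 0 ? 0 : Math.round((passed / total) * 100) / 100;
  return { total, passed, successRate };
}
```

Because this touches nothing but `JSON.parse`, it avoids importing any core module whose extensionless specifiers would fail under the native loader.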
Node's native TypeScript loader requires explicit .ts extensions. Add them to parse-vitest-report.ts and categorize-render-errors.ts so the eval can import parseVitestResults from core via relative path.
… tsconfig fixes
- Separate types from runtime config (types.ts + config.ts)
- Thread Logger through entire pipeline (fixes garbled parallel output)
- Replace fragile stdout sentinel IPC with Node fork/process.send
- Run storybook build + typecheck in parallel (saves ~60-120s/trial)
- Add --agent/--model/--prompt filters to eval-parallel
- Tighten Agent interface to single params object
- Make quality score weights configurable
- Add prompt template variable support
- Enable allowImportingTsExtensions in root and scripts tsconfigs
- Fix all pre-existing TS errors in eval files
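
Running the build and the typecheck concurrently, as one of the items above describes, can be sketched with `Promise.all`. The `Check` and `CheckResult` types and the `runChecks` name are hypothetical; the real harness shells out to Storybook and `tsc`, which this sketch replaces with injected async functions.

```typescript
interface CheckResult {
  success: boolean;
  output: string;
}

/** A check is any async step that reports success instead of throwing. */
type Check = () => Promise<CheckResult>;

export async function runChecks(
  build: Check,
  typecheck: Check
): Promise<{ build: CheckResult; typecheck: CheckResult }> {
  // Both checks are started before either is awaited, so the wall-clock time
  // is roughly that of the slower check rather than the sum of the two.
  const [buildResult, typecheckResult] = await Promise.all([build(), typecheck()]);
  return { build: buildResult, typecheck: typecheckResult };
}
```

Note that `Promise.all` rejects as soon as either check rejects, so in this design each check is expected to resolve with `success: false` on failure rather than throw.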
…slides
- Replace slideshow format with a scrollable HTML page using file cards
- Show complete file contents for new files, diffs for modified files
- Lexend + JetBrains Mono fonts, light/dark theme, mobile-responsive
- Static server on port 3000 (no live-reload)
- Issues shown inline as smell-boxes, never block page generation
- Simplified to 5 steps: gather → read → generate → serve → iterate
…bility review
- Two layers per area: curated walkthrough (API→Tests→Impl) + collapsed full files
- Use language-typescript with data-diff attribute instead of language-diff
- Post-processing script for line-level add/remove backgrounds on top of TS highlighting
- Add readability review guidance: logical order, clear names, comments, test quality
- Order areas high-level to low-level
Principle 3 now explicitly requires showing complete interface definitions where they're first relevant, not just type names.
Extract AgentRunConfig { agent, model, effort } and compose it as
a `run` field in TrialConfig, ExecutionResult, and TrialResult
instead of spreading via extends/inheritance.
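
A minimal sketch of that composition, assuming plain string fields and a few illustrative extra properties (`project`, `duration`, `turns`, and the `describeRun` helper are hypothetical, as is the `"high"` effort value):

```typescript
/** The experimental knobs, grouped once instead of being repeated via `extends`. */
interface AgentRunConfig {
  agent: string;
  model: string;
  effort: string;
}

// Each record carries the config as a `run` field rather than inheriting its members.
interface TrialConfig {
  project: string;
  run: AgentRunConfig;
}

interface ExecutionResult {
  run: AgentRunConfig;
  duration: number;
  turns: number;
}

/** Works on the `run` field of any record that composes AgentRunConfig. */
export function describeRun({ agent, model, effort }: AgentRunConfig): string {
  return `${agent}/${model}/${effort}`;
}

const trial: TrialConfig = {
  project: "wikitok",
  run: { agent: "claude", model: "claude-sonnet-4-6", effort: "high" },
};

// The same object can be passed along unchanged, with no field-by-field copying.
const execution: ExecutionResult = { run: trial.run, duration: 120, turns: 14 };
```

Keeping the config as a single nested value means a later change to its shape touches one interface instead of every type that used to extend it.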
- AgentRunConfig → AgentVariant (it's the experimental variant, not a "run config")
- Agent → AgentDriver, AgentConfig → AgentDefinition (disambiguate)
- ExecutionResult → Execution, GradingResult → Grade, QualityResult → QualityScore
- TrialResult → TrialReport, TrialPaths → TrialWorkspace
- ChangedFile → FileChange, Pricing → TokenPricing, Environment → EvalEnvironment
- GhostStoriesResult → GhostStoryGrade, GhostStoryRunResult → GhostStoryOutput
- QualityWeights → ScoreWeights, DEFAULT_QUALITY_WEIGHTS → DEFAULT_SCORE_WEIGHTS
- Field renames: run → variant, grading → grade, quality → score, changedFiles → fileChanges, storybookFiles → storybookChanges
- Extract AgentExecuteParams with variant: AgentVariant (reuses the model)
- Remove redundant run field from Execution (lives on TrialReport only)
Every project needs a branch for cloning. The type now reflects that, and the `branch!` assertion in prepareTrial is no longer needed.
…Trial, throw on ghost story errors
- Make AgentVariant a discriminated union on agent, with typed model/effort per agent
- Rename runTask→runTrial and run-task.ts→run-trial.ts for consistent domain naming
- Store full Project in TrialReport instead of just the name for reproducibility
- Replace error-object returns with GhostStoryError throws in ghost-stories.ts
- Fix successRate rounding to use Math.round(x*100)/100 consistently
- Extract scoring magic numbers into named constants
- Validate git status chars against known set instead of blind casting
- Truncate build/typecheck output at line boundaries
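
A discriminated union on `agent`, as the first item above describes, might be shaped like this. The specific model and effort literals below are illustrative guesses, not the PR's actual lists, and `variantLabel` is a hypothetical helper.

```typescript
type Effort = "low" | "medium" | "high";

// Each member narrows `model` and `effort` to what that agent accepts (values assumed).
type AgentVariant =
  | { agent: "claude"; model: "claude-sonnet-4-6" | "claude-opus" | "claude-haiku"; effort?: Effort }
  | { agent: "codex"; model: "gpt-5.2-codex" | "gpt-5.2"; effort: Effort };

export function variantLabel(variant: AgentVariant): string {
  switch (variant.agent) {
    case "claude":
      // Within this branch, `variant.model` is narrowed to the Claude model literals.
      return `claude:${variant.model}`;
    case "codex":
      return `codex:${variant.model}:${variant.effort}`;
    default: {
      // Exhaustiveness check: adding a new agent without a case fails to compile here.
      const unreachable: never = variant;
      throw new Error(`Unknown agent: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

With this shape, pairing a Codex model with `agent: "claude"` is rejected by the compiler instead of surfacing at run time.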
Pull request overview
This PR introduces a new scripts/eval harness for running automated Storybook-setup trials (including agent execution + grading) and makes supporting updates to enable native Node execution of TypeScript with explicit .ts import specifiers. It also includes a few core “ghost stories” utility updates and a temporary Dangerfile debug print.
Changes:
- Add an eval CLI (`node scripts/eval/eval.ts`) with Claude/Codex agent drivers, trial orchestration, grading, prompts, and unit tests.
- Extend “ghost stories” utilities (core + eval harness) with `cwd` support and naming updates.
- Enable `allowImportingTsExtensions` in tsconfigs and add new script/dependencies (plus a stub `storybook skill` CLI command).
Reviewed changes
Copilot reviewed 31 out of 34 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `scripts/eval/eval.ts` | New eval CLI entrypoint (arg parsing, parallel trials, results output). |
| `scripts/eval/config.ts` | Agent/project registry + pricing/cost estimation. |
| `scripts/eval/types.ts` | Shared types for eval pipeline + scoring schema. |
| `scripts/eval/lib/run-trial.ts` | Orchestrates prepare → run agent → grade → write report artifacts. |
| `scripts/eval/lib/grade.ts` | Computes grade outputs + quality score (build, typecheck, ghost stories, perf). |
| `scripts/eval/lib/ghost-stories.ts` | Eval-side ghost stories discovery + vitest execution/report parsing. |
| `scripts/eval/lib/setup-patterns.ts` | Scans `.storybook/` configs for setup signals (CSS, providers, aliases, etc.). |
| `scripts/eval/lib/prepare-trial.ts` | Clones/caches benchmark repos and installs dependencies. |
| `scripts/eval/lib/package-manager.ts` | Detects PM and runs installs for prepared trials. |
| `scripts/eval/lib/agents/claude-code.ts` | Claude agent driver via `@anthropic-ai/claude-agent-sdk`. |
| `scripts/eval/lib/agents/codex.ts` | Codex agent driver via `@openai/codex-sdk`. |
| `scripts/eval/**.test.ts` | Vitest coverage for config/type invariants and eval utilities/pipeline. |
| `scripts/eval/prompts/*.md` | Prompt templates used by the eval harness. |
| `scripts/package.json` | Adds eval script + new dependencies for the eval system. |
| `scripts/tsconfig.json` | Enables `allowImportingTsExtensions`; adjusts excludes. |
| `code/tsconfig.json` | Enables `allowImportingTsExtensions` for `code/` TypeScript. |
| `code/core/src/core-server/utils/ghost-stories/run-story-tests.ts` | Renames export to `runGhostStories` and adds optional `cwd`. |
| `code/core/src/core-server/utils/ghost-stories/get-candidates.ts` | Adds `cwd` option for globbing candidate components. |
| `code/core/src/core-server/utils/ghost-stories/parse-vitest-report.ts` | Updates imports to use explicit `.ts` extensions. |
| `code/core/src/core-server/server-channel/ghost-stories-channel.ts` | Switches to `runGhostStories` export. |
| `code/core/src/shared/utils/categorize-render-errors.ts` | Updates import to explicit `.ts` extension. |
| `code/lib/cli-storybook/src/bin/run.ts` | Adds a new (currently stubbed) `skill` CLI command. |
| `scripts/dangerfile.js` | Adds debug logging of PR data (should be removed before merge). |
| `yarn.lock` | Lockfile updates for new eval dependencies. |
| `AGENTS.md` | Documents Node version and migration toward native Node TS execution. |
| `.gitignore` | Adds eval-related ignore entries. |
| `.agents/skills/review-pr/SKILL.md` | Adds an agent “skill” definition for narrative PR review output. |
Walkthrough
Adds a new PR-review agent skill and a comprehensive evaluation harness for Storybook setup: typed configs, CLI orchestration, agent drivers, grading/ghost-story execution, tests, utilities, tsconfig/import adjustments, package changes, and a modified dangerfile and .gitignore.
Sequence Diagram(s)
sequenceDiagram
participant CLI as Eval CLI
participant Runner as runTrial()
participant Prep as prepareTrial()
participant Agent as AgentDriver (claude/codex)
participant Grade as grade()
participant Ghost as runGhostStories()
CLI->>Runner: runTrial(config)
Runner->>Prep: prepareTrial(project, trialId)
Prep-->>Runner: TrialWorkspace (repoRoot, projectPath, resultsDir, baselineCommit)
Runner->>Agent: execute(prompt, projectPath, variant, resultsDir)
activate Agent
Agent->>Agent: stream events (messages, turns, token usage)
Agent-->>Runner: Execution (duration, cost?, turns)
deactivate Agent
Runner->>Grade: grade(workspace, execution.duration)
activate Grade
Grade->>Grade: git diff, storybook build, tsc typecheck
alt build success
Grade->>Ghost: runGhostStories(candidates, { cwd })
Ghost->>Ghost: npx vitest run (storybook) -> JSON report
Ghost-->>Grade: GhostStoryOutput (total, passed, successRate)
end
Grade-->>Runner: Grade + QualityScore
deactivate Grade
Runner->>Runner: write summary.json, prompt.md
Runner-->>CLI: TrialReport
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes