CLI: Source eval prompts from ai setup via EVAL_SETUP_PROMPT#34602
… loaded

Addresses #34594. Adds a prompt-level instruction and a grading flag that together catch "renders fine, but user CSS never loaded" failures.

- pattern-copy-play.md: new Step 7 requires exactly one story to assert a component-specific computed style via getComputedStyle.
- grade.ts: records hasComputedStyleAssertion based on whether the staged diff contains "getComputedStyle" (reuses the existing cached diff, no extra file reads).

Chose the prompt+diff approach over a runtime stylesheet heuristic (filtering document.styleSheets, isolated all:initial probe, etc.) because:

- The agent already knows what "styled correctly" means for a given component; a component-specific computed-style assertion catches the real failure ("bg-blue-600 did not apply") rather than a generic "something was applied" signal.
- No fragile filtering of vitest-browser / storybook / addon stylesheet sources. Addons keep shipping new sheets; that filter would bit-rot.
- Failures surface as normal Vitest assertion failures and already flow through pass/fail grading: no new counter, no new warning channel, no changes to render-analysis.
- Complementary to a future runtime heuristic if we want one: prompt-level catches "agent misconfigured the design system"; runtime catches "agent shipped a visibly unstyled story without the check".
'Render call' could read as 'you need a render: () => ... function', which is wrong — args stories have no render call and that's the preferred shape for prop-driven components. Softening to 'just rendering the component in the story is enough' keeps the intent without steering toward render().
Before: hasComputedStyleAssertion was a plain rawDiff.includes('getComputedStyle'),
which matched the prompt markdown (written to .storybook/eval-results/prompt.md
before grade runs) and the transcript JSON — both of which contain the token
verbatim because the new prompt Step 7 and the agent's own tool-output lines
include it. The flag was effectively tautological: true whenever the prompt was
staged, regardless of what the agent did.
After: parse the unified patch, track which file each hunk belongs to via the
'+++ b/<path>' headers, and only consider added lines (skipping the '+++' header
itself) that live in files also present in storybookChanges. Uses the existing
STORY_FILE_PATTERN from story-render.ts as the single source of truth for what
counts as a story file.
Exports diffAddsTokenInStoryFiles as a pure helper with unit tests covering the
false-positive paths (prompt.md / data.json), deleted lines, the +++ header,
and files not in storybookChanges.
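The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code from grade.ts: the function signature, the regular expression standing in for STORY_FILE_PATTERN, and the handling of `/dev/null` headers are all assumptions.

```typescript
// Hypothetical stand-in: the real STORY_FILE_PATTERN lives in story-render.ts.
const STORY_FILE_PATTERN = /\.stories\.(?:[cm]?[jt]sx?|svelte|vue)$/;

/**
 * Returns true when `token` appears on an added line of a story file that is
 * also listed in `storybookChanges`. The signature is assumed for illustration.
 */
function diffAddsTokenInStoryFiles(
  rawDiff: string,
  token: string,
  storybookChanges: readonly string[]
): boolean {
  // Only story files that the trial actually changed are eligible.
  const eligible = new Set(storybookChanges.filter((file) => STORY_FILE_PATTERN.test(file)));
  let currentFile: string | null = null;

  for (const line of rawDiff.split('\n')) {
    if (line.startsWith('+++ ')) {
      // '+++ b/<path>' names the file the following hunks belong to.
      // '+++ /dev/null' (a deleted file) leaves no current file.
      currentFile = line.startsWith('+++ b/') ? line.slice('+++ b/'.length).trim() : null;
      continue; // the header itself never counts as an added line
    }
    // Skip context lines, deleted lines, hunk headers, and anything before the first file.
    if (!line.startsWith('+') || currentFile === null) continue;
    if (eligible.has(currentFile) && line.includes(token)) return true;
  }
  return false;
}
```

One known limitation of this sketch: an added line whose own content starts with `++ ` would be mistaken for a file header. A stricter parser would track hunk line counts instead of relying on prefixes.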
Aligns the prompt + grade check with the Slack agreement: instead of
hoping the agent adds *some* `getComputedStyle` call somewhere, the
prompt now asks for one story explicitly named `CssCheck`. That
specific story name is what the AI-stories vitest run in core will
grep for to attribute the pass/fail result in the
`ai-setup-final-scoring` telemetry event.
- `pattern-copy-play.md` Step 7: heading + example updated to
`export const CssCheck: Story = { ... }`.
- `grade.ts`: `hasComputedStyleAssertion` -> `hasCssCheckStory`,
token matched in the diff changed from `getComputedStyle` to
`export const CssCheck`.
- `grade.test.ts`: added two tests locking in the new use case
(positive: story-file diff with the export; negative: prompt.md
false positive).
- Trial / publish / result-docs test mocks renamed to match.
Rationale (from Slack): giving the story a known name means
telemetry in core can report on the CSS check result directly,
without layering on a separate tag. The story also ends up being
educational — a visible example of how to verify CSS loaded. No
tag, no new telemetry field required on top of whatever core
adds in a follow-up PR.
Move the eval harness's prompt catalog into code/lib/cli-storybook/src/ai/prompts/ so trials exercise the exact prompt a real user gets from `npx storybook ai setup`. Each variant lives in its own fully isolated .ts file; the registry selects one at runtime via the internal EVAL_SETUP_PROMPT env var (unset for real users → always the default). The harness now hands the agent the AI_SETUP_PROMPT nudge and sets EVAL_SETUP_PROMPT on the agent's spawn, so the agent itself runs `ai setup` as a tool call — mirroring the real user flow instead of resolving the prompt upfront.
Bring the three prompt-content changes that were about to ship in #34596 onto the post-refactor layout. Applies to code/lib/cli-storybook/src/ai/prompts/pattern-copy-play.ts (previously getSetupInstructions in prompt.ts):

- New end-state paragraph in the intro clarifying that the shared preview should own all providers, CSS, browser state, and network mocks so rendering the component in the story is enough.
- New "#### Args vs render" subsection under Step 5 with two full examples (args-driven Button, render-based composition inside Card), via two new self-contained helpers getArgsStoryExample and getRenderCompositionExample.
- New Step 7 "Prove CSS is loaded in exactly one story named CssCheck" asserting a component-specific computed style via getComputedStyle to catch "renders but CSS never loaded" failures. Steps 8 and 9 renumbered accordingly.

Makes #34596 redundant against this branch.
…to cursor/eval-css-loaded-prompt-check

# Conflicts:
#	scripts/eval/prompts/pattern-copy-play.md
Trivial one-line signature reflow picked up by `oxfmt --check` after the merge of #34602 into this branch. No behavior change.
📝 Walkthrough

Replaces file-based markdown prompts with a TypeScript prompt registry and prompt-builder modules, threads selected prompt variant via EVAL_SETUP_PROMPT through the eval harness into agent execution, and conditions ai-setup telemetry snapshotting on a new disableTelemetry option.

Sequence Diagram(s)

sequenceDiagram
participant Runner as Eval Runner
participant Driver as Eval Driver / SDK
participant Agent as Agent (Claude/Codex)
participant SBCLI as Storybook CLI (npx storybook ai setup)
participant Registry as Prompt Registry
Runner->>Driver: execute(promptName, env={EVAL_SETUP_PROMPT: promptName})
Driver->>Agent: start agent with merged env
Agent->>SBCLI: run "npx storybook ai setup" (inherits env)
SBCLI->>Registry: resolve getPrompts(projectInfo) using EVAL_SETUP_PROMPT
Registry-->>SBCLI: return prompts/instructions
SBCLI-->>Agent: prints generated markdown to stdout
Agent-->>Driver: capture stdout
Driver-->>Runner: return captured markdown (stored in trial)
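
The first two arrows of the diagram (the runner passing `EVAL_SETUP_PROMPT` and the driver starting the agent "with merged env") could look roughly like the sketch below. The helper names `mergeAgentEnv` and `envForTrial` are hypothetical and do not come from the PR; only the env var name is taken from the source.

```typescript
type Env = Record<string, string | undefined>;

/** Merge per-trial overrides over the inherited environment; overrides win. */
function mergeAgentEnv(base: Env, overrides: Env): Env {
  return { ...base, ...overrides };
}

/**
 * Build the environment an agent is spawned with for a given prompt variant.
 * Any child process the agent starts (such as `npx storybook ai setup`)
 * inherits EVAL_SETUP_PROMPT from this environment.
 */
function envForTrial(base: Env, promptName: string): Env {
  return mergeAgentEnv(base, { EVAL_SETUP_PROMPT: promptName });
}
```

Because the variant travels in the environment rather than in the prompt text, the agent's instructions stay identical across variants and only the CLI output changes.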
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🧹 Nitpick comments (1)
scripts/eval/README.md (1)
254-260: Add a language specifier to the fenced code block.
The code block showing the env var flow lacks a language identifier. Since this is a text diagram rather than executable code, consider using `text` or `plaintext`.
📝 Suggested fix: tag the opening fence of the following block with `text`; the content stays unchanged.

    eval.ts --prompt setup
      → run-trial.ts calls driver.execute({ env: { EVAL_SETUP_PROMPT: 'setup' } })
      → agent spawns with that env
      → agent's `npx storybook ai setup` tool call inherits EVAL_SETUP_PROMPT
      → CLI's getPrompts() picks the 'setup' variant

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/README.md` around lines 254 - 260, update the fenced code block that contains the env var flow diagram so it includes a language specifier (e.g., add "text" or "plaintext" after the opening triple backticks); locate the block showing "eval.ts --prompt setup → run-trial.ts calls driver.execute({ env: { EVAL_SETUP_PROMPT: 'setup' } }) → agent spawns..." and tag its opening fence as "text" to mark it as a plain text diagram for proper rendering.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0a5e3e78-8cbe-4976-84b3-6f790fc90a20
📥 Commits
Reviewing files that changed from the base of the PR and between 8ed433230702e49101e26e08ee5a2713136044f5 and 9412695be9a13536ccce4778ed3e21e87ed8a936.
📒 Files selected for processing (15)
- `code/lib/cli-storybook/src/ai/prompt.ts`
- `code/lib/cli-storybook/src/ai/prompts/index.ts`
- `code/lib/cli-storybook/src/ai/prompts/pattern-copy-play.ts`
- `code/lib/cli-storybook/src/ai/prompts/setup.ts`
- `scripts/eval/README.md`
- `scripts/eval/eval.ts`
- `scripts/eval/lib/agents/claude-code.ts`
- `scripts/eval/lib/agents/codex.ts`
- `scripts/eval/lib/agents/config.ts`
- `scripts/eval/lib/run-trial.ts`
- `scripts/eval/lib/utils.test.ts`
- `scripts/eval/lib/utils.ts`
- `scripts/eval/prompts/pattern-copy-play.md`
- `scripts/eval/prompts/setup.md`
- `scripts/eval/run-batch.ts`
💤 Files with no reviewable changes (2)
- `scripts/eval/prompts/setup.md`
- `scripts/eval/prompts/pattern-copy-play.md`
Package Benchmarks

Commit:

The following packages have significant changes to their size or dependencies:

| | Before | After | Difference |
|---|---|---|---|
| Dependency count | 184 | 184 | 0 |
| Self size | 819 KB | 835 KB | 🚨 +16 KB 🚨 |
| Dependency size | 68.26 MB | 68.26 MB | 🚨 +281 B 🚨 |
| Bundle Size Analyzer | Link | Link | |
…ompt exports

- Spawn `npx storybook ai setup` inside the trial workspace with `EVAL_SETUP_PROMPT=<name>` and save its stdout as `prompt.content` so `data.json` and transcript docs carry the project-aware instructions instead of the one-line nudge.
- Rename every prompt variant's builder to `instructions` and use namespace imports in the prompts registry so all variant files share one export convention.
- Fix two stale `run-trial.test.ts` assertions that still expected the full markdown as the agent prompt; mock `tinyexec` and cover the new `prompt.content` field.
- Collapse `buildManualCommand` signature in `eval.ts` onto one line so `yarn fmt:check` passes.
- Gate ai-setup preview snapshot + ai-setup-pending cache write on !disableTelemetry: the record is only consumed by the ai-setup-evidence telemetry event, so it has no consumer when telemetry is off.
- Plumb disableTelemetry through AiSetupOptions and add a unit test covering the enabled / disabled / default paths.
- Swap `requested in PROMPT_BUILDERS` for `Object.hasOwn(...)` so prototype property names in EVAL_SETUP_PROMPT fall back to the default.
- Wrap captureAiSetupMarkdown in try/catch so spawn or timeout failures log and return an empty string instead of aborting the trial; set STORYBOOK_DISABLE_TELEMETRY=1 on the subprocess env.
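
To illustrate why the `in` check was swapped for `Object.hasOwn(...)`, here is a minimal sketch of a registry lookup. `PROMPT_BUILDERS`, `pattern-copy-play`, and `setup` are names from the PR; the builder bodies, the `DEFAULT_PROMPT_NAME` value shown, and the `resolvePromptName` helper are assumptions made for illustration.

```typescript
type PromptBuilder = (projectName: string) => string;

// Hypothetical registry contents; the real variants live in their own files under prompts/.
const PROMPT_BUILDERS: Record<string, PromptBuilder> = {
  'pattern-copy-play': (project) => `pattern-copy-play instructions for ${project}`,
  setup: (project) => `setup instructions for ${project}`,
};

const DEFAULT_PROMPT_NAME = 'pattern-copy-play';

/**
 * Resolve the variant named by EVAL_SETUP_PROMPT, falling back to the default
 * when the variable is unset or does not name an own key of the registry.
 */
function resolvePromptName(requested: string | undefined): string {
  // `requested in PROMPT_BUILDERS` would also accept inherited names such as
  // 'toString' or 'constructor'; Object.hasOwn only accepts own keys.
  return requested !== undefined && Object.hasOwn(PROMPT_BUILDERS, requested)
    ? requested
    : DEFAULT_PROMPT_NAME;
}
```

With the `in` operator, `EVAL_SETUP_PROMPT=toString` would select `Object.prototype.toString` as if it were a prompt builder; with the own-key check it falls back to the default variant.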
The single-to-double quote edit landed by accident in commit 4 of this PR and doesn't belong to the eval-prompts refactor. It also broke `yarn fmt:check` (oxfmt enforces single quotes). Restoring the base-branch state fixes CI and keeps this PR scoped to the CLI/eval work.
…to cursor/eval-css-loaded-prompt-check
…on' into kasper/eval-prompts-from-cli
…on' into cursor/eval-css-loaded-prompt-check
    // collect evidence of what the agent accomplished — but only via telemetry
    // (the `ai-setup-evidence` event). Skip the snapshot + cache write when
    // telemetry is disabled so there's nobody to read it.
    if (!disableTelemetry) {
      const resolvedConfigDir = resolve(projectInfo.configDir);
      const previewSnapshot = await snapshotPreviewFile(resolvedConfigDir);
      const sessionId = await getSessionId();
      const pendingRecord: AiSetupPendingRecord = {
        timestamp: Date.now(),
        sessionId,
        configDir: resolvedConfigDir,
        ...previewSnapshot,
      };
      await cache.set('ai-setup-pending', pendingRecord);
    }
@ValentinFunk what do you think of this? Did you plan a new API for these use cases or is this still the "canonical" way to do things?
I think Kasper is doing the right thing here as we wanna avoid useless compute, but this happens earlier than when we actually can use the telemetry higher order function.
I'm so sorry Valentin, I need to stop pinging you by accident 😭
Haven't had time to read everything and run the eval but the architecture seems sound to me. Good idea capturing the conditional prompt and instrumenting the served prompt with an env var. My only concern so far would be avoiding bloating the prod build if we end up having more prompts, and more ai commands with their own prompts.
    @@ -0,0 +1,283 @@
    import { dedent } from 'ts-dedent';
It would be nice to know when a specific prompt was created or experimented with, or which prompt it is (first ever? iteration number 5?), and even better to have at least one link to an eval for it so it's easy to refer to its results. WDYT?
    // collect evidence of what the agent accomplished — but only via telemetry
    // (the `ai-setup-evidence` event). Skip the snapshot + cache write when
    // telemetry is disabled so there's nobody to read it.
    if (!disableTelemetry) {
I think this now needs to be using isTelemetryModuleEnabled from storybook/internal/telemetry
Actionable comments posted: 1
♻️ Duplicate comments (2)
scripts/eval/README.md (2)
286-292: ⚠️ Potential issue | 🟡 Minor
Add a language tag to the fenced block to satisfy MD040.
Use an explicit fence like `text` or `console`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/README.md` around lines 286 - 292, The fenced code block showing the eval.ts → run-trial.ts → agent → tool → getPrompts flow lacks a language tag which triggers MD040; update the triple-backtick fence to include a language tag (e.g., ```text or ```console) so the block becomes ```text (or ```console) ... ``` to satisfy the linter while preserving the existing content.
278-278: ⚠️ Potential issue | 🟡 Minor
README flow is still inaccurate about who runs `ai setup`.
This line says the harness never spawns `ai setup`, but the harness also runs it for prompt capture (the `prompt.content` path).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/README.md` at line 278, The README incorrectly states the harness never runs `ai setup`; update the text to reflect that the harness does run `ai setup` in the prompt-capture path (i.e., when populating prompt.content) while still handing the task to the trial agent for normal execution; mention both behaviors and reference the harness's prompt capture flow (prompt.content) and the `ai setup` command so readers understand the distinction.
🧹 Nitpick comments (2)
scripts/eval/lib/utils.ts (2)
5-9: Keep the prompt registry out of this low-level utility.
Importing `PROMPT_NAMES` from `index.ts` also loads the prompt-builder modules and their instruction payloads, so every eval command that touches `scripts/eval/lib/utils.ts` now pays that cost even when it only needs names. A names-only registry module would keep startup and bundle size lower.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/lib/utils.ts` around lines 5 - 9, The utils module is importing PROMPT_NAMES (and DEFAULT_PROMPT_NAME) from the full prompt registry (index.ts), which triggers loading of prompt-builder modules and instruction payloads; replace that heavy import with a lightweight "names-only" registry export and update the import in scripts/eval/lib/utils.ts to pull PROMPT_NAMES and DEFAULT_PROMPT_NAME from the new names-only module (e.g., prompts/names or prompts/registry-names) so consumers get just the list of names without loading builders/payloads; create the names-only module to re-export only the minimal constants and ensure utils.ts references those symbols (PROMPT_NAMES, DEFAULT_PROMPT_NAME) from the new module.
149-167: Rename this helper or split validation from loading.
This function no longer loads a prompt variant; it only validates a name and returns `AI_SETUP_PROMPT`. The current API is easy to misread at future call sites, especially since the returned text is independent of `name`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/lib/utils.ts` around lines 149 - 167, The function loadPrompt no longer loads a variant by name (it only validates and returns the constant AI_SETUP_PROMPT), so split validation from retrieval: add a new function validatePromptName(name: string) that uses listPrompts() / PROMPT_NAMES and throws the same error on missing names, then change loadPrompt to be a no-arg function that simply returns AI_SETUP_PROMPT (or rename loadPrompt to getSetupPrompt and make it no-arg); update all call sites to call validatePromptName(name) where they currently pass a name, and call the no-arg getSetupPrompt()/loadPrompt() to obtain AI_SETUP_PROMPT.
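
The split suggested in this nitpick might be sketched as below. `validatePromptName`, `getSetupPrompt`, `PROMPT_NAMES`, and `AI_SETUP_PROMPT` are the names used in the review comment; the list contents mirror the two variants named in the PR description, while the error wording and the placeholder prompt text are assumptions.

```typescript
// Assumed contents, based on the two variants the PR description lists.
const PROMPT_NAMES: readonly string[] = ['pattern-copy-play', 'setup'];

// Placeholder text: the real AI_SETUP_PROMPT nudge is defined in the CLI package.
const AI_SETUP_PROMPT = '<one-line nudge asking the agent to run the ai setup command>';

/** Throws when `name` is not a known prompt variant; returns nothing otherwise. */
function validatePromptName(name: string): void {
  if (!PROMPT_NAMES.includes(name)) {
    throw new Error(`Unknown prompt "${name}". Available prompts: ${PROMPT_NAMES.join(', ')}`);
  }
}

/** Returns the agent-facing prompt, which is the same for every variant. */
function getSetupPrompt(): string {
  return AI_SETUP_PROMPT;
}
```

A call site would then read `validatePromptName(name)` followed by `getSetupPrompt()`, which makes it visible that the variant name selects nothing in the returned text and only has to be valid.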
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/eval/README.md`:
- Line 7: The README has malformed inline markdown like "`**gh` CLI** —
installed and authenticated (`gh auth login`)" (and another occurrence at the
other noted line); fix by removing the mixed backtick+bold markup and use
consistent markdown for commands/file names — e.g. use backticks for commands
(`gh`, `gh auth login`) or bold for emphasis (**) but not both, and update both
occurrences to the chosen correct form.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fb456c16-3186-4721-8c8a-60812aebadf8
📒 Files selected for processing (3)
- scripts/eval/README.md
- scripts/eval/lib/utils.ts
- scripts/eval/run-batch.ts
✅ Files skipped from review due to trivial changes (1)
- scripts/eval/run-batch.ts
…mpt-check Eval: Record hasCssCheckStory when the diff adds CssCheck
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/eval/lib/story-render.ts (1)
210-215: ⚠️ Potential issue | 🟡 Minor
Normalize `cssCheck` before returning summary.
At Line 214, `cssCheck` is copied directly from `parsed.cssCheck`. If parser output is missing/unknown, this violates the `StoryRenderGrade` runtime contract and leaks ambiguity downstream.
Suggested fix

     return {
       total: parsed.total,
       passed: parsed.passed,
       storyFiles,
    -  cssCheck: parsed.cssCheck,
    +  cssCheck:
    +    parsed.cssCheck === 'pass' || parsed.cssCheck === 'fail' ? parsed.cssCheck : 'not-run',
     } satisfies StoryRenderGrade;

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/lib/story-render.ts` around lines 210 - 215, The returned summary currently passes parsed.cssCheck directly which can be undefined and violate the StoryRenderGrade contract; update the return to normalize parsed.cssCheck first (e.g. validate type/value and fallback to a defined default) and return that normalized value instead of parsed.cssCheck. Locate the return in the same function where total/passed/storyFiles are assembled and replace the raw parsed.cssCheck with a normalized variable (or call a small helper like normalizeCssCheck) so the object always satisfies StoryRenderGrade.
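
The "small helper like normalizeCssCheck" mentioned in this comment could be as short as the sketch below. The three result values come from the suggested fix and the grade.ts comment; the `CssCheckResult` type name and the choice to accept `unknown` are assumptions.

```typescript
// Hypothetical type name; the values 'pass', 'fail' and 'not-run' appear in the review.
type CssCheckResult = 'pass' | 'fail' | 'not-run';

/**
 * Map whatever the parser produced onto the three known values, so anything
 * missing or unrecognized is reported as 'not-run'.
 */
function normalizeCssCheck(value: unknown): CssCheckResult {
  return value === 'pass' || value === 'fail' ? value : 'not-run';
}
```

Accepting `unknown` keeps the helper honest about parser output that may not match its declared type at runtime.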
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/eval/lib/grade.ts`:
- Around line 183-190: The current branch treating cssCheck === 'not-run' as an
error should be changed to an informational log: locate the code that reads
storyRenderRun.summary?.cssCheck into cssCheck and the three-branch logic that
calls logger.logSuccess / logger.logError; change the final else branch (the one
logging "CssCheck story missing or not run") to use an informational logging
method (e.g. logger.logInfo or logger.info) instead of logger.logError so
'not-run' is not reported as an error while keeping 'pass' as logSuccess and
'fail' as logError.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a5bf4df9-6c42-4773-99b7-18500b928dc4
📒 Files selected for processing (5)
- scripts/eval/lib/grade.ts
- scripts/eval/lib/publish-trial.test.ts
- scripts/eval/lib/publish-trial.ts
- scripts/eval/lib/run-trial.test.ts
- scripts/eval/lib/story-render.ts
✅ Files skipped from review due to trivial changes (1)
- scripts/eval/lib/publish-trial.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/eval/lib/run-trial.test.ts
Co-authored-by: Steve Dodier-Lazaro <Sidnioulz@users.noreply.github.com>
♻️ Duplicate comments (2)
scripts/eval/README.md (2)
16-16: ⚠️ Potential issue | 🟡 Minor
Fix malformed markdown: move bold outside backticks.
The current format `**sync-baselines.ts**` places bold markers inside code backticks, which renders them as literal asterisks instead of formatting. Bold formatting doesn't work inside inline code.
📝 Proposed fix

    -1. `**sync-baselines.ts**` pushes a canonical `.storybook` config to each benchmark repo so every trial starts from the same known-good baseline.
    +1. **`sync-baselines.ts`** pushes a canonical `.storybook` config to each benchmark repo so every trial starts from the same known-good baseline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/README.md` at line 16, Replace the malformed inline code with proper bold text: find the literal string `**sync-baselines.ts**` and change it to **sync-baselines.ts** so the file name is rendered in bold (remove the backticks around the asterisks); ensure no other inline code spans include bold markers inside backticks.
286-292: ⚠️ Potential issue | 🟡 Minor
Add language identifier to fenced code block.
The fenced block is missing a language identifier, triggering markdownlint MD040.
🔧 Proposed fix

    -```
    +```text
     eval.ts --prompt setup
       → run-trial.ts calls driver.execute({ env: { EVAL_SETUP_PROMPT: 'setup' } })
       → agent spawns with that env
       → agent's `npx storybook ai setup` tool call inherits EVAL_SETUP_PROMPT
       → CLI's getPrompts() picks the 'setup' variant
    -```
    +```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/README.md` around lines 286 - 292, The fenced code block showing "eval.ts --prompt setup" in the README is missing a language identifier which triggers markdownlint MD040; fix it by adding a language tag (e.g., text) after the opening triple backticks so the block becomes ```text ... ```, ensuring the snippet that starts with "eval.ts --prompt setup" uses that language identifier; update the README fenced block accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 53a8c4b1-ce2b-4100-988f-ddc351c6d0e4
📒 Files selected for processing (1)
scripts/eval/README.md
Closes #
What I did
This PR makes the eval harness use the same AI setup prompt source as the public CLI.
- … `code/lib/cli-storybook/src/ai/prompts/`.
- … `EVAL_SETUP_PROMPT` env var, with `pattern-copy-play` as the default.
- … `code/lib/cli-storybook/src/ai/prompt.ts` so it delegates prompt selection to the registry.
- … `npx storybook ai setup` itself. The selected prompt variant is passed through the agent environment, which means evals now exercise the same CLI flow a user would use.
- … `scripts/eval/prompts/` and updated the supporting docs and tests.

This gives the CLI a single source of truth for AI setup prompts while still letting the eval suite switch between internal variants for experiments.
Checklist for Contributors
Testing
The changes in this PR are covered in the following automated tests:
- `code/lib/cli-storybook/src/ai/index.test.ts`, `scripts/eval/lib/utils.test.ts`, and `scripts/eval/lib/run-trial.test.ts` cover prompt lookup, validation, environment plumbing, and prompt capture.

Manual testing
Caution
This section is mandatory for all contributions. If you believe no manual test is necessary, please state so explicitly. Thanks!
- … `npx storybook ai setup` in a Storybook project and confirm the default output matches the current `pattern-copy-play` prompt.
- … `node scripts/eval/eval.ts --list-prompts` and confirm both `pattern-copy-play` and `setup` are listed.
- … `node scripts/eval/eval.ts -p mealdrop --prompt setup --manual` and confirm the printed command is prefixed with `EVAL_SETUP_PROMPT=setup`.
- … `npx storybook ai setup` and the saved `prompt.content` matches the selected variant.

Documentation
MIGRATION.MD
Updated `scripts/eval/README.md` to describe the prompt-variant workflow.

Checklist for Maintainers
- When this PR is ready for testing, make sure to add the `ci:normal`, `ci:merged` or `ci:daily` GH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found in `code/lib/cli-storybook/src/sandbox-templates.ts`
- Make sure this PR contains one of the labels below:

Available labels

- `bug`: Internal changes that fix incorrect behavior.
- `maintenance`: User-facing maintenance tasks.
- `dependencies`: Upgrading (sometimes downgrading) dependencies.
- `build`: Internal-facing build tooling & test updates. Will not show up in release changelog.
- `cleanup`: Minor cleanup style change. Will not show up in release changelog.
- `documentation`: Documentation only changes. Will not show up in release changelog.
- `feature request`: Introducing a new feature.
- `BREAKING CHANGE`: Changes that break compatibility in some way with current major version.
- `other`: Changes that don't fit in the above categories.

🦋 Canary release

This PR does not have a canary release associated. You can request a canary release of this pull request by mentioning the `@storybookjs/core` team here.

core team members can create a canary release here or locally with `gh workflow run --repo storybookjs/storybook publish.yml --field pr=<PR_NUMBER>`

Summary by CodeRabbit
Refactor
New Features
Documentation
Tests