CLI: Source eval prompts from ai setup via EVAL_SETUP_PROMPT by kasperpeulen · Pull Request #34602 · storybookjs/storybook

kasperpeulen · 2026-04-20T14:49:43Z

Closes #

What I did

This PR makes the eval harness use the same AI setup prompt source as the public CLI.

Moved the eval prompt variants into code/lib/cli-storybook/src/ai/prompts/.
Added a registry that selects a prompt variant from the internal EVAL_SETUP_PROMPT env var, with pattern-copy-play as the default.
Simplified code/lib/cli-storybook/src/ai/prompt.ts so it delegates prompt selection to the registry.
Updated the eval harness so the agent runs npx storybook ai setup itself. The selected prompt variant is passed through the agent environment, which means evals now exercise the same CLI flow a user would use.
Removed the duplicated markdown prompt files under scripts/eval/prompts/ and updated the supporting docs and tests.

This gives the CLI a single source of truth for AI setup prompts while still letting the eval suite switch between internal variants for experiments.
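
A minimal sketch of the selection logic described above, not the PR's actual code. `PROMPT_BUILDERS`, `DEFAULT_PROMPT_NAME`, `EVAL_SETUP_PROMPT`, and the two variant names appear in this PR. The `ProjectInfo` shape, the builder bodies, and the helper names `resolvePromptName` and `getInstructions` are assumptions made for illustration.

```typescript
// Hypothetical stand-in for the project info the real builders receive.
type ProjectInfo = { framework: string };

// Placeholder builders; the real ones return long, project-aware instructions.
const PROMPT_BUILDERS = {
  'pattern-copy-play': (info: ProjectInfo): string =>
    `pattern-copy-play instructions for ${info.framework}`,
  setup: (info: ProjectInfo): string => `setup instructions for ${info.framework}`,
};

type PromptName = keyof typeof PROMPT_BUILDERS;
const DEFAULT_PROMPT_NAME: PromptName = 'pattern-copy-play';

function isPromptName(value: string): value is PromptName {
  // Own-property check, so inherited names such as "toString" are rejected.
  return Object.prototype.hasOwnProperty.call(PROMPT_BUILDERS, value);
}

// Unset or unknown values fall back to the default variant.
function resolvePromptName(env: Record<string, string | undefined>): PromptName {
  const requested = env['EVAL_SETUP_PROMPT'];
  return requested !== undefined && isPromptName(requested) ? requested : DEFAULT_PROMPT_NAME;
}

function getInstructions(
  projectInfo: ProjectInfo,
  env: Record<string, string | undefined>
): string {
  return PROMPT_BUILDERS[resolvePromptName(env)](projectInfo);
}
```

Taking the environment as a parameter rather than reading `process.env` directly keeps the sketch easy to test. Real users never set the variable, so they always get the default.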

Checklist for Contributors

Testing

The changes in this PR are covered in the following automated tests:

stories
unit tests
integration tests
end-to-end tests

code/lib/cli-storybook/src/ai/index.test.ts, scripts/eval/lib/utils.test.ts, and scripts/eval/lib/run-trial.test.ts cover prompt lookup, validation, environment plumbing, and prompt capture.

Manual testing

Caution

This section is mandatory for all contributions. If you believe no manual test is necessary, please state so explicitly. Thanks!

Run npx storybook ai setup in a Storybook project and confirm the default output matches the current pattern-copy-play prompt.
Run node scripts/eval/eval.ts --list-prompts and confirm both pattern-copy-play and setup are listed.
Run node scripts/eval/eval.ts -p mealdrop --prompt setup --manual and confirm the printed command is prefixed with EVAL_SETUP_PROMPT=setup.
In a trial workspace, confirm the agent runs npx storybook ai setup and the saved prompt.content matches the selected variant.

Documentation

Add or update documentation reflecting your changes
If you are deprecating/removing a feature, make sure to update MIGRATION.MD

Updated scripts/eval/README.md to describe the prompt-variant workflow.

Checklist for Maintainers

When this PR is ready for testing, make sure to add ci:normal, ci:merged or ci:daily GH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found in code/lib/cli-storybook/src/sandbox-templates.ts
Make sure this PR contains one of the labels below:
Available labels
- bug: Internal changes that fixes incorrect behavior.
- maintenance: User-facing maintenance tasks.
- dependencies: Upgrading (sometimes downgrading) dependencies.
- build: Internal-facing build tooling & test updates. Will not show up in release changelog.
- cleanup: Minor cleanup style change. Will not show up in release changelog.
- documentation: Documentation only changes. Will not show up in release changelog.
- feature request: Introducing a new feature.
- BREAKING CHANGE: Changes that break compatibility in some way with current major version.
- other: Changes that don't fit in the above categories.

🦋 Canary release

This PR does not have a canary release associated. You can request a canary release of this pull request by mentioning the @storybookjs/core team here.

core team members can create a canary release here or locally with gh workflow run --repo storybookjs/storybook publish.yml --field pr=<PR_NUMBER>

Summary by CodeRabbit

Refactor
- Replaced markdown-based prompts with a registry of prompt variants backed by pluggable prompt builders.
New Features
- Choose prompt variants via CLI/env and pass selection into AI runs.
- Optional telemetry gating for the AI setup flow.
- Capture AI-setup markdown output during trials.
- Include a CssCheck status in grading and PR descriptions.
Documentation
- Updated eval docs and CLI help to describe prompt variants.
Tests
- Added/updated tests for prompt registry, agent env handling, trial capture, and grading outputs.

… loaded

Addresses #34594. Adds a prompt-level instruction and a grading flag that together catch "renders fine, but user CSS never loaded" failures.

- pattern-copy-play.md: new Step 7 requires exactly one story to assert a component-specific computed style via getComputedStyle.
- grade.ts: records hasComputedStyleAssertion based on whether the staged diff contains "getComputedStyle" (reuses the existing cached diff, no extra file reads).

Chose the prompt+diff approach over a runtime stylesheet heuristic (filtering document.styleSheets, isolated all:initial probe, etc.) because:

- The agent already knows what "styled correctly" means for a given component; a component-specific computed-style assertion catches the real failure ("bg-blue-600 did not apply") rather than a generic "something was applied" signal.
- No fragile filtering of vitest-browser / storybook / addon stylesheet sources. Addons keep shipping new sheets; that filter would bit-rot.
- Failures surface as normal Vitest assertion failures and already flow through pass/fail grading: no new counter, no new warning channel, no changes to render-analysis.
- Complementary to a future runtime heuristic if we want one: prompt-level catches "agent misconfigured the design system"; runtime catches "agent shipped a visibly unstyled story without the check".

'Render call' could read as 'you need a render: () => ... function', which is wrong — args stories have no render call and that's the preferred shape for prop-driven components. Softening to 'just rendering the component in the story is enough' keeps the intent without steering toward render().

Before: hasComputedStyleAssertion was a plain rawDiff.includes('getComputedStyle'), which matched the prompt markdown (written to .storybook/eval-results/prompt.md before grade runs) and the transcript JSON, both of which contain the token verbatim because the new prompt Step 7 and the agent's own tool-output lines include it. The flag was effectively tautological: true whenever the prompt was staged, regardless of what the agent did.

After: parse the unified patch, track which file each hunk belongs to via the '+++ b/<path>' headers, and only consider added lines (skipping the '+++' header itself) that live in files also present in storybookChanges. Uses the existing STORY_FILE_PATTERN from story-render.ts as the single source of truth for what counts as a story file. Exports diffAddsTokenInStoryFiles as a pure helper with unit tests covering the false-positive paths (prompt.md / data.json), deleted lines, the +++ header, and files not in storybookChanges.
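
The "After" behaviour could be sketched roughly as follows. This is an illustrative reconstruction from the commit message, not the code in grade.ts. The regular expression used for `STORY_FILE_PATTERN` is an assumption (the real one lives in story-render.ts and may differ), and the parameter names are guesses.

```typescript
// Assumed pattern; the real STORY_FILE_PATTERN in story-render.ts may differ.
const STORY_FILE_PATTERN = /\.stories\.[jt]sx?$/;

function diffAddsTokenInStoryFiles(
  rawDiff: string,
  token: string,
  storybookChanges: readonly string[]
): boolean {
  const allowed = new Set(storybookChanges);
  let currentFile: string | null = null;

  for (const line of rawDiff.split('\n')) {
    if (line.startsWith('diff --git ')) {
      // A new file section starts; wait for its "+++" header.
      currentFile = null;
      continue;
    }
    if (line.startsWith('+++ ')) {
      // "+++ b/<path>" names the new file; "+++ /dev/null" means it was deleted.
      // Simplification: an added line whose content starts with "++ " would be
      // misread as a header here.
      currentFile = line.startsWith('+++ b/') ? line.slice('+++ b/'.length).trim() : null;
      continue;
    }
    // Only added lines inside a known file are of interest.
    if (!line.startsWith('+') || currentFile === null) continue;
    // Ignore non-story files (prompt.md, data.json) and files outside storybookChanges.
    if (!STORY_FILE_PATTERN.test(currentFile) || !allowed.has(currentFile)) continue;
    if (line.slice(1).includes(token)) return true;
  }
  return false;
}
```

Because the helper is pure (string in, boolean out), the false-positive paths listed above can be covered with small inline diff fixtures.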

Aligns the prompt + grade check with the Slack agreement: instead of hoping the agent adds *some* `getComputedStyle` call somewhere, the prompt now asks for one story explicitly named `CssCheck`. That specific story name is what the AI-stories vitest run in core will grep for to attribute the pass/fail result in the `ai-setup-final-scoring` telemetry event.

- `pattern-copy-play.md` Step 7: heading + example updated to `export const CssCheck: Story = { ... }`.
- `grade.ts`: `hasComputedStyleAssertion` -> `hasCssCheckStory`, token matched in the diff changed from `getComputedStyle` to `export const CssCheck`.
- `grade.test.ts`: added two tests locking in the new use case (positive: story-file diff with the export; negative: prompt.md false positive).
- Trial / publish / result-docs test mocks renamed to match.

Rationale (from Slack): giving the story a known name means telemetry in core can report on the CSS check result directly, without layering on a separate tag. The story also ends up being educational: a visible example of how to verify CSS loaded. No tag, no new telemetry field required on top of whatever core adds in a follow-up PR.

Move the eval harness's prompt catalog into code/lib/cli-storybook/src/ai/prompts/ so trials exercise the exact prompt a real user gets from `npx storybook ai setup`. Each variant lives in its own fully isolated .ts file; the registry selects one at runtime via the internal EVAL_SETUP_PROMPT env var (unset for real users → always the default). The harness now hands the agent the AI_SETUP_PROMPT nudge and sets EVAL_SETUP_PROMPT on the agent's spawn, so the agent itself runs `ai setup` as a tool call — mirroring the real user flow instead of resolving the prompt upfront.

Bring the three prompt-content changes that were about to ship in #34596 onto the post-refactor layout. Applies to code/lib/cli-storybook/src/ai/prompts/pattern-copy-play.ts (previously getSetupInstructions in prompt.ts):

- New end-state paragraph in the intro clarifying that the shared preview should own all providers, CSS, browser state, and network mocks so rendering the component in the story is enough.
- New "#### Args vs render" subsection under Step 5 with two full examples (args-driven Button, render-based composition inside Card), via two new self-contained helpers getArgsStoryExample and getRenderCompositionExample.
- New Step 7 "Prove CSS is loaded in exactly one story named CssCheck" asserting a component-specific computed style via getComputedStyle to catch "renders but CSS never loaded" failures. Steps 8 and 9 renumbered accordingly.

Makes #34596 redundant against this branch.

…to cursor/eval-css-loaded-prompt-check # Conflicts: # scripts/eval/prompts/pattern-copy-play.md

nx-cloud · 2026-04-20T15:07:19Z

View your CI Pipeline Execution ↗ for commit a6aaf4f

Command	Status	Duration	Result
`nx run-many -t compile,check,knip,test,lint,fmt...`	✅ Succeeded	9m 48s	View ↗

☁️ Nx Cloud last updated this comment at 2026-04-22 14:31:06 UTC

Trivial one-line signature reflow picked up by `oxfmt --check` after the merge of #34602 into this branch. No behavior change.

coderabbitai · 2026-04-20T15:16:29Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Replaces file-based markdown prompts with a TypeScript prompt registry and prompt-builder modules, threads selected prompt variant via EVAL_SETUP_PROMPT through the eval harness into agent execution, and conditions ai-setup telemetry snapshotting on a new disableTelemetry option.

Changes

Cohort / File(s)	Summary
Prompt registry & builders `code/lib/cli-storybook/src/ai/prompts/index.ts`, `code/lib/cli-storybook/src/ai/prompts/pattern-copy-play.ts`, `code/lib/cli-storybook/src/ai/prompts/setup.ts`	Added centralized PROMPT_BUILDERS registry, exported `PROMPT_NAMES`/`DEFAULT_PROMPT_NAME`/`PromptName`, and two prompt-builder `instructions(projectInfo)` implementations that produce project-aware instruction strings.
Prompt generation refactor `code/lib/cli-storybook/src/ai/prompt.ts`	Removed inline prompt-building and docs-URL helper; `generateMarkdownOutput` now imports `getPrompts` from the new registry and composes final markdown.
Eval prompts → registry migration `scripts/eval/lib/utils.ts`, `scripts/eval/lib/utils.test.ts`, `scripts/eval/prompts/*` (deleted)	Replaced filesystem-based prompts/*.md discovery/loading with registry-driven `listPrompts()`/`loadPrompt()` validating against `PROMPT_NAMES`; deleted legacy markdown prompt files.
Agent env plumbing `scripts/eval/lib/agents/config.ts`, `scripts/eval/lib/agents/claude-code.ts`, `scripts/eval/lib/agents/codex.ts`	Extended agent execute API to accept optional `env?: Record<string,string>` and merge it into the SDK runtime env so caller-provided vars (e.g., EVAL_SETUP_PROMPT) can override process.env while still forcing telemetry disable flag.
Eval CLI / trial wiring `scripts/eval/eval.ts`, `scripts/eval/lib/run-trial.ts`, `scripts/eval/lib/run-trial.test.ts`, `scripts/eval/run-batch.ts`	Treat `--prompt` as a prompt variant name; pass selected variant via `EVAL_SETUP_PROMPT` into agent execution; added `captureAiSetupMarkdown` to run `npx storybook ai setup` and store captured markdown in trial artifacts; updated CLI help and tests.
Telemetry gating in aiSetup `code/lib/cli-storybook/src/ai/index.ts`, `code/lib/cli-storybook/src/ai/types.ts`, `code/lib/cli-storybook/src/ai/index.test.ts`	Added optional `disableTelemetry` to `AiSetupOptions` and skip preview snapshot + pending-cache evidence when `disableTelemetry` is true; tests updated/added to cover behavior.
Reporting & grading additions `scripts/eval/lib/story-render.ts`, `scripts/eval/lib/grade.ts`, `scripts/eval/lib/publish-trial.ts`, `scripts/eval/lib/publish-trial.test.ts`	Introduced a three-state `cssCheck` field (`pass` / `fail` / `not-run`) …
Docs & README `scripts/eval/README.md`	Updated docs to describe prompt variants, `--prompt` semantics, default `pattern-copy-play`, and instructions for adding new TypeScript prompt variants to the registry.
Tests & harness adjustments `scripts/eval/lib/run-trial.test.ts`, `scripts/eval/lib/utils.test.ts`, `code/lib/cli-storybook/src/ai/index.test.ts`	Updated tests to reflect registry-driven prompts, env propagation (EVAL_SETUP_PROMPT), captured `npx storybook ai setup` output assertions, and telemetry gating behaviour.
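
The "Agent env plumbing" row describes a merge order that is easy to get wrong, so here is a hedged sketch of it. The function name `buildAgentEnv` is hypothetical, and using `STORYBOOK_DISABLE_TELEMETRY` as the forced flag is an assumption based on the variable mentioned elsewhere in this PR for the capture subprocess.

```typescript
type Env = Record<string, string | undefined>;

// Merge order: base env first, then caller overrides, then the forced flag last.
function buildAgentEnv(
  baseEnv: Env,
  callerEnv: Record<string, string> = {}
): Record<string, string> {
  const merged: Record<string, string> = {};
  for (const [key, value] of Object.entries(baseEnv)) {
    // Drop undefined entries so the result is a plain string map.
    if (value !== undefined) merged[key] = value;
  }
  // Caller-provided vars (e.g. EVAL_SETUP_PROMPT) win over the base env.
  Object.assign(merged, callerEnv);
  // Assumed flag name; it is applied last so callers cannot re-enable telemetry.
  merged['STORYBOOK_DISABLE_TELEMETRY'] = '1';
  return merged;
}

// Example call shape (in the harness the base would presumably be process.env):
// buildAgentEnv(process.env, { EVAL_SETUP_PROMPT: 'setup' });
```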

Sequence Diagram(s)

sequenceDiagram
  participant Runner as Eval Runner
  participant Driver as Eval Driver / SDK
  participant Agent as Agent (Claude/Codex)
  participant SBCLI as Storybook CLI (npx storybook ai setup)
  participant Registry as Prompt Registry

  Runner->>Driver: execute(promptName, env={EVAL_SETUP_PROMPT: promptName})
  Driver->>Agent: start agent with merged env
  Agent->>SBCLI: run "npx storybook ai setup" (inherits env)
  SBCLI->>Registry: resolve getPrompts(projectInfo) using EVAL_SETUP_PROMPT
  Registry-->>SBCLI: return prompts/instructions
  SBCLI-->>Agent: prints generated markdown to stdout
  Agent-->>Driver: capture stdout
  Driver-->>Runner: return captured markdown (stored in trial)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

#34365 — Implements eval harness and migrates prompt handling to a CLI-side TypeScript prompt registry (EVAL_SETUP_PROMPT selection, prompt builders), overlapping registry/agent wiring.
#34421 — Refactors prompts to a registry and updates CLI/default registration patterns; directly related to PROMPT_BUILDERS/DEFAULT_PROMPT_NAME changes.
#34595 — Adds and propagates CSS-related checks through grading and reporting (cssCheck field), overlapping the new grading/PR-body additions.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai

🧹 Nitpick comments (1)

scripts/eval/README.md (1)
254-260: Add a language specifier to the fenced code block.

The code block showing the env var flow lacks a language identifier. Since this is a text diagram rather than executable code, consider using text or plaintext.
📝 Suggested fix
-```
+```text
 eval.ts --prompt setup
   → run-trial.ts calls driver.execute({ env: { EVAL_SETUP_PROMPT: 'setup' } })
     → agent spawns with that env
       → agent's `npx storybook ai setup` tool call inherits EVAL_SETUP_PROMPT
         → CLI's getPrompts() picks the 'setup' variant

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @scripts/eval/README.md around lines 254 - 260, Update the fenced code block
that contains the env var flow diagram so it includes a language specifier
(e.g., add "text" or "plaintext" after the opening triple backticks); locate the
block showing "eval.ts --prompt setup → run-trial.ts calls driver.execute({ env:
{ EVAL_SETUP_PROMPT: 'setup' } }) → agent spawns..." and change the opening
"```" to "```text" to mark it as a plain text diagram for proper rendering.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In @scripts/eval/README.md:
- Around line 254-260: Update the fenced code block that contains the env var
flow diagram so it includes a language specifier (e.g., add "text" or
"plaintext" after the opening triple backticks); locate the block showing
"eval.ts --prompt setup → run-trial.ts calls driver.execute({ env: {
EVAL_SETUP_PROMPT: 'setup' } }) → agent spawns..." and change the opening
"```" to "```text" to mark it as a plain text diagram for proper rendering.

---

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0a5e3e78-8cbe-4976-84b3-6f790fc90a20

📥 Commits

Reviewing files that changed from the base of the PR and between 8ed433230702e49101e26e08ee5a2713136044f5 and 9412695be9a13536ccce4778ed3e21e87ed8a936.

📒 Files selected for processing (15)

code/lib/cli-storybook/src/ai/prompt.ts
code/lib/cli-storybook/src/ai/prompts/index.ts
code/lib/cli-storybook/src/ai/prompts/pattern-copy-play.ts
code/lib/cli-storybook/src/ai/prompts/setup.ts
scripts/eval/README.md
scripts/eval/eval.ts
scripts/eval/lib/agents/claude-code.ts
scripts/eval/lib/agents/codex.ts
scripts/eval/lib/agents/config.ts
scripts/eval/lib/run-trial.ts
scripts/eval/lib/utils.test.ts
scripts/eval/lib/utils.ts
scripts/eval/prompts/pattern-copy-play.md
scripts/eval/prompts/setup.md
scripts/eval/run-batch.ts

💤 Files with no reviewable changes (2)

scripts/eval/prompts/setup.md
scripts/eval/prompts/pattern-copy-play.md

storybook-app-bot · 2026-04-20T15:21:55Z

Package Benchmarks

Commit: b35ff84, ran on 29 April 2026 at 18:27:29 UTC

The following packages have significant changes to their size or dependencies:

`@storybook/cli`

	Before	After	Difference
Dependency count	184	184	0
Self size	819 KB	835 KB	🚨 +16 KB 🚨
Dependency size	68.26 MB	68.26 MB	🚨 +281 B 🚨
Bundle Size Analyzer	Link	Link

…ompt exports

- Spawn `npx storybook ai setup` inside the trial workspace with `EVAL_SETUP_PROMPT=<name>` and save its stdout as `prompt.content` so `data.json` and transcript docs carry the project-aware instructions instead of the one-line nudge.
- Rename every prompt variant's builder to `instructions` and use namespace imports in the prompts registry so all variant files share one export convention.
- Fix two stale `run-trial.test.ts` assertions that still expected the full markdown as the agent prompt; mock `tinyexec` and cover the new `prompt.content` field.
- Collapse `buildManualCommand` signature in `eval.ts` onto one line so `yarn fmt:check` passes.

- Gate ai-setup preview snapshot + ai-setup-pending cache write on !disableTelemetry; the record is only consumed by the ai-setup-evidence telemetry event, so it has no consumer when telemetry is off.
- Plumb disableTelemetry through AiSetupOptions and add a unit test covering the enabled / disabled / default paths.
- Swap `requested in PROMPT_BUILDERS` for `Object.hasOwn(...)` so prototype property names in EVAL_SETUP_PROMPT fall back to the default.
- Wrap captureAiSetupMarkdown in try/catch so spawn or timeout failures log and return an empty string instead of aborting the trial; set STORYBOOK_DISABLE_TELEMETRY=1 on the subprocess env.

The single-to-double quote edit landed by accident in commit 4 of this PR and doesn't belong to the eval-prompts refactor. It also broke `yarn fmt:check` (oxfmt enforces single quotes). Restoring the base-branch state fixes CI and keeps this PR scoped to the CLI/eval work.

…to cursor/eval-css-loaded-prompt-check

…on' into kasper/eval-prompts-from-cli

…on' into cursor/eval-css-loaded-prompt-check

Sidnioulz · 2026-04-25T00:08:10Z

+  // collect evidence of what the agent accomplished — but only via telemetry
+  // (the `ai-setup-evidence` event). Skip the snapshot + cache write when
+  // telemetry is disabled so there's nobody to read it.
+  if (!disableTelemetry) {
+    const resolvedConfigDir = resolve(projectInfo.configDir);
+    const previewSnapshot = await snapshotPreviewFile(resolvedConfigDir);
+    const sessionId = await getSessionId();
+    const pendingRecord: AiSetupPendingRecord = {
+      timestamp: Date.now(),
+      sessionId,
+      configDir: resolvedConfigDir,
+      ...previewSnapshot,
+    };
+    await cache.set('ai-setup-pending', pendingRecord);
+  }


@ValentinFunk what do you think of this? Did you plan a new API for these use cases or is this still the "canonical" way to do things?

I think Kasper is doing the right thing here as we wanna avoid useless compute, but this happens earlier than when we actually can use the telemetry higher order function.

@valentinpalkovic

I'm so sorry Valentin, I need to stop pinging you by accident 😭

Sidnioulz · 2026-04-25T00:14:45Z

Haven't had time to read everything and run the eval but the architecture seems sound to me. Good idea capturing the conditional prompt and instrumenting the served prompt with an env var. My only concern so far would be avoiding bloating the prod build if we end up having more prompts, and more ai commands with their own prompts.

yannbf · 2026-04-27T07:28:43Z

@@ -0,0 +1,283 @@
+import { dedent } from 'ts-dedent';


Would be somehow nice to know that a specific prompt was created/experimented with at a specific date or which prompt that is (first ever? iteration number 5?), and even better if there is at least one link of an eval for it so it's easy to refer to its results. WDYT?

yannbf · 2026-04-27T07:50:48Z

+  // collect evidence of what the agent accomplished — but only via telemetry
+  // (the `ai-setup-evidence` event). Skip the snapshot + cache write when
+  // telemetry is disabled so there's nobody to read it.
+  if (!disableTelemetry) {


I think this now needs to be using isTelemetryModuleEnabled from storybook/internal/telemetry

…-cli

…ed-prompt-check

…-cli

…ed-prompt-check

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

scripts/eval/README.md (2)

286-292: ⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced block to satisfy MD040.

Use an explicit fence like text or console.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/README.md` around lines 286 - 292, The fenced code block showing
the eval.ts → run-trial.ts → agent → tool → getPrompts flow lacks a language tag
which triggers MD040; update the triple-backtick fence to include a language tag
(e.g., ```text or ```console) so the block becomes ```text (or ```console) ...
``` to satisfy the linter while preserving the existing content.

278-278: ⚠️ Potential issue | 🟡 Minor

README flow is still inaccurate about who runs ai setup.

This line says the harness never spawns ai setup, but the harness also runs it for prompt capture (prompt.content path).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/README.md` at line 278, The README incorrectly states the
harness never runs `ai setup`; update the text to reflect that the harness does
run `ai setup` in the prompt-capture path (i.e., when populating prompt.content)
while still handing the task to the trial agent for normal execution; mention
both behaviors and reference the harness's prompt capture flow (prompt.content)
and the `ai setup` command so readers understand the distinction.

🧹 Nitpick comments (2)

scripts/eval/lib/utils.ts (2)
5-9: Keep the prompt registry out of this low-level utility.

Importing PROMPT_NAMES from index.ts also loads the prompt-builder modules and their instruction payloads, so every eval command that touches scripts/eval/lib/utils.ts now pays that cost even when it only needs names. A names-only registry module would keep startup and bundle size lower.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/utils.ts` around lines 5 - 9, The utils module is importing
PROMPT_NAMES (and DEFAULT_PROMPT_NAME) from the full prompt registry (index.ts),
which triggers loading of prompt-builder modules and instruction payloads;
replace that heavy import with a lightweight "names-only" registry export and
update the import in scripts/eval/lib/utils.ts to pull PROMPT_NAMES and
DEFAULT_PROMPT_NAME from the new names-only module (e.g., prompts/names or
prompts/registry-names) so consumers get just the list of names without loading
builders/payloads; create the names-only module to re-export only the minimal
constants and ensure utils.ts references those symbols (PROMPT_NAMES,
DEFAULT_PROMPT_NAME) from the new module.
149-167: Rename this helper or split validation from loading.

This function no longer loads a prompt variant; it only validates a name and returns AI_SETUP_PROMPT. The current API is easy to misread at future call sites, especially since the returned text is independent of name.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/utils.ts` around lines 149 - 167, The function loadPrompt no
longer loads a variant by name (it only validates and returns the constant
AI_SETUP_PROMPT), so split validation from retrieval: add a new function
validatePromptName(name: string) that uses listPrompts() / PROMPT_NAMES and
throws the same error on missing names, then change loadPrompt to be a no-arg
function that simply returns AI_SETUP_PROMPT (or rename loadPrompt to
getSetupPrompt and make it no-arg); update all call sites to call
validatePromptName(name) where they currently pass a name, and call the no-arg
getSetupPrompt()/loadPrompt() to obtain AI_SETUP_PROMPT.
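
One possible shape for the split suggested above, offered as a sketch rather than the change that landed. `validatePromptName` and `getSetupPrompt` are the names the review proposes. The contents of `PROMPT_NAMES` and the wording of `AI_SETUP_PROMPT` are assumed here, since the real values live in the CLI registry and in utils.ts.

```typescript
// Assumed values for illustration only.
const PROMPT_NAMES = ['pattern-copy-play', 'setup'] as const;
const AI_SETUP_PROMPT = 'Run `npx storybook ai setup` and follow the instructions it prints.';

// Validation only: throws on unknown names and returns nothing.
function validatePromptName(name: string): void {
  const known = PROMPT_NAMES.some((candidate) => candidate === name);
  if (!known) {
    throw new Error(`Unknown prompt "${name}". Available prompts: ${PROMPT_NAMES.join(', ')}`);
  }
}

// Retrieval only: the returned text no longer depends on the variant name.
function getSetupPrompt(): string {
  return AI_SETUP_PROMPT;
}
```

A call site would then run `validatePromptName(name)` once when parsing `--prompt` and call `getSetupPrompt()` wherever the nudge text is needed.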

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/README.md`:
- Line 7: The README has malformed inline markdown like "`**gh` CLI** —
installed and authenticated (`gh auth login`)" (and another occurrence at the
other noted line); fix by removing the mixed backtick+bold markup and use
consistent markdown for commands/file names — e.g. use backticks for commands
(`gh`, `gh auth login`) or bold for emphasis (**) but not both, and update both
occurrences to the chosen correct form.

---

Duplicate comments:
In `@scripts/eval/README.md`:
- Around line 286-292: The fenced code block showing the eval.ts → run-trial.ts
→ agent → tool → getPrompts flow lacks a language tag which triggers MD040;
update the triple-backtick fence to include a language tag (e.g., ```text or
```console) so the block becomes ```text (or ```console) ... ``` to satisfy the
linter while preserving the existing content.
- Line 278: The README incorrectly states the harness never runs `ai setup`;
update the text to reflect that the harness does run `ai setup` in the
prompt-capture path (i.e., when populating prompt.content) while still handing
the task to the trial agent for normal execution; mention both behaviors and
reference the harness's prompt capture flow (prompt.content) and the `ai setup`
command so readers understand the distinction.

---

Nitpick comments:
In `@scripts/eval/lib/utils.ts`:
- Around line 5-9: The utils module is importing PROMPT_NAMES (and
DEFAULT_PROMPT_NAME) from the full prompt registry (index.ts), which triggers
loading of prompt-builder modules and instruction payloads; replace that heavy
import with a lightweight "names-only" registry export and update the import in
scripts/eval/lib/utils.ts to pull PROMPT_NAMES and DEFAULT_PROMPT_NAME from the
new names-only module (e.g., prompts/names or prompts/registry-names) so
consumers get just the list of names without loading builders/payloads; create
the names-only module to re-export only the minimal constants and ensure
utils.ts references those symbols (PROMPT_NAMES, DEFAULT_PROMPT_NAME) from the
new module.
- Around line 149-167: The function loadPrompt no longer loads a variant by name
(it only validates and returns the constant AI_SETUP_PROMPT), so split
validation from retrieval: add a new function validatePromptName(name: string)
that uses listPrompts() / PROMPT_NAMES and throws the same error on missing
names, then change loadPrompt to be a no-arg function that simply returns
AI_SETUP_PROMPT (or rename loadPrompt to getSetupPrompt and make it no-arg);
update all call sites to call validatePromptName(name) where they currently pass
a name, and call the no-arg getSetupPrompt()/loadPrompt() to obtain
AI_SETUP_PROMPT.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fb456c16-3186-4721-8c8a-60812aebadf8

📥 Commits

Reviewing files that changed from the base of the PR and between a6aaf4f and 96b782c.

📒 Files selected for processing (3)

scripts/eval/README.md
scripts/eval/lib/utils.ts
scripts/eval/run-batch.ts

✅ Files skipped from review due to trivial changes (1)

scripts/eval/run-batch.ts

…mpt-check Eval: Record hasCssCheckStory when the diff adds CssCheck

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

scripts/eval/lib/story-render.ts (1)

210-215: ⚠️ Potential issue | 🟡 Minor

Normalize cssCheck before returning summary.

At Line 214, cssCheck is copied directly from parsed.cssCheck. If parser output is missing/unknown, this violates the StoryRenderGrade runtime contract and leaks ambiguity downstream.

Suggested fix

   return {
     total: parsed.total,
     passed: parsed.passed,
     storyFiles,
-    cssCheck: parsed.cssCheck,
+    cssCheck:
+      parsed.cssCheck === 'pass' || parsed.cssCheck === 'fail' ? parsed.cssCheck : 'not-run',
   } satisfies StoryRenderGrade;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/story-render.ts` around lines 210 - 215, The returned
summary currently passes parsed.cssCheck directly which can be undefined and
violate the StoryRenderGrade contract; update the return to normalize
parsed.cssCheck first (e.g. validate type/value and fallback to a defined
default) and return that normalized value instead of parsed.cssCheck. Locate the
return in the same function where total/passed/storyFiles are assembled and
replace the raw parsed.cssCheck with a normalized variable (or call a small
helper like normalizeCssCheck) so the object always satisfies StoryRenderGrade.
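
A small helper along the lines the review suggests might look like this. The name `normalizeCssCheck` comes from the comment above; the `CssCheckStatus` type alias is an assumption about how the three states are typed.

```typescript
// Assumed alias for the three states used by the cssCheck field.
type CssCheckStatus = 'pass' | 'fail' | 'not-run';

// Map any parser output onto one of the three known states.
function normalizeCssCheck(value: unknown): CssCheckStatus {
  if (value === 'pass') return 'pass';
  if (value === 'fail') return 'fail';
  // Missing, undefined, or unrecognised values are treated as "not-run".
  return 'not-run';
}
```

Accepting `unknown` means the helper can sit directly on untrusted parser output without a cast.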

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/lib/grade.ts`:
- Around line 183-190: The current branch treating cssCheck === 'not-run' as an
error should be changed to an informational log: locate the code that reads
storyRenderRun.summary?.cssCheck into cssCheck and the three-branch logic that
calls logger.logSuccess / logger.logError; change the final else branch (the one
logging "CssCheck story missing or not run") to use an informational logging
method (e.g. logger.logInfo or logger.info) instead of logger.logError so
'not-run' is not reported as an error while keeping 'pass' as logSuccess and
'fail' as logError.
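
The three-branch mapping requested here can be sketched as a pure function that picks a log level, leaving the actual logger call to the caller. The function name `describeCssCheck` and the `LogLevel` type are hypothetical, and only the "missing or not run" message text is taken from the comment; the other two messages are placeholders.

```typescript
type CssCheckStatus = 'pass' | 'fail' | 'not-run';
type LogLevel = 'success' | 'error' | 'info';

function describeCssCheck(status: CssCheckStatus): { level: LogLevel; message: string } {
  switch (status) {
    case 'pass':
      return { level: 'success', message: 'CssCheck story passed' };
    case 'fail':
      return { level: 'error', message: 'CssCheck story failed' };
    case 'not-run':
      // Informational rather than an error: the story may simply be absent.
      return { level: 'info', message: 'CssCheck story missing or not run' };
  }
}
```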



ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a5bf4df9-6c42-4773-99b7-18500b928dc4

📥 Commits

Reviewing files that changed from the base of the PR and between 96b782c and 3b67e50.

📒 Files selected for processing (5)

scripts/eval/lib/grade.ts
scripts/eval/lib/publish-trial.test.ts
scripts/eval/lib/publish-trial.ts
scripts/eval/lib/run-trial.test.ts
scripts/eval/lib/story-render.ts

✅ Files skipped from review due to trivial changes (1)

scripts/eval/lib/publish-trial.ts

🚧 Files skipped from review as they are similar to previous changes (1)

scripts/eval/lib/run-trial.test.ts

Co-authored-by: Steve Dodier-Lazaro <Sidnioulz@users.noreply.github.com>

coderabbitai

♻️ Duplicate comments (2)

scripts/eval/README.md (2)

16-16: ⚠️ Potential issue | 🟡 Minor

Fix malformed markdown: move bold outside backticks.

The current format `**sync-baselines.ts**` places the bold markers inside the code backticks, so they render as literal asterisks instead of bold. Bold formatting does not apply inside inline code.

📝 Proposed fix

-1. `**sync-baselines.ts**` pushes a canonical `.storybook` config to each benchmark repo so every trial starts from the same known-good baseline.
+1. **`sync-baselines.ts`** pushes a canonical `.storybook` config to each benchmark repo so every trial starts from the same known-good baseline.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/README.md` at line 16, fix the malformed inline code: find the
literal string `**sync-baselines.ts**` and change it to **`sync-baselines.ts`**
so the file name renders as bold inline code (move the bold markers outside the
backticks); ensure no other inline code spans include bold markers inside
backticks.
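
To check for other inline code spans with the same problem, a small scan along these lines may help. This is a rough sketch, not part of the PR: `findBoldInsideCode` is a hypothetical name, and it assumes single-backtick code spans on one line (fenced blocks and multi-backtick spans are not handled).

```typescript
// Hypothetical helper: returns inline code spans whose content is wrapped in `**`,
// for example `**sync-baselines.ts**`, where the asterisks render literally.
function findBoldInsideCode(markdown: string): string[] {
  const offenders: string[] = [];
  // Matches one single-backtick code span at a time, left to right.
  const codeSpan = /`([^`\n]+)`/g;
  let match: RegExpExecArray | null;
  while ((match = codeSpan.exec(markdown)) !== null) {
    const content = match[1];
    if (content.length > 4 && content.startsWith('**') && content.endsWith('**')) {
      offenders.push(match[0]);
    }
  }
  return offenders;
}
```

Running it over the README text before the fix should flag the `sync-baselines.ts` span; after the fix the bold markers sit outside the backticks and nothing is reported.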

286-292: ⚠️ Potential issue | 🟡 Minor

Add language identifier to fenced code block.

The fenced block is missing a language identifier, triggering markdownlint MD040.

🔧 Proposed fix

-```
+```text
 eval.ts --prompt setup
   → run-trial.ts calls driver.execute({ env: { EVAL_SETUP_PROMPT: 'setup' } })
     → agent spawns with that env
       → agent's `npx storybook ai setup` tool call inherits EVAL_SETUP_PROMPT
         → CLI's getPrompts() picks the 'setup' variant
-```
+```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/README.md` around lines 286 - 292, The fenced code block showing
"eval.ts --prompt setup" in the README is missing a language identifier which
triggers markdownlint MD040; fix it by adding a language tag (e.g., text) after
the opening triple backticks so the block becomes ```text ... ```, ensuring the
snippet that starts with "eval.ts --prompt setup" uses that language identifier;
update the README fenced block accordingly.


ℹ️ Review info

Run ID: 53a8c4b1-ce2b-4100-988f-ddc351c6d0e4

📥 Commits

Reviewing files that changed from the base of the PR and between 3b67e50 and 8e7c397.

📒 Files selected for processing (1)

scripts/eval/README.md

kasperpeulen added 6 commits April 20, 2026 00:37

kasperpeulen mentioned this pull request Apr 20, 2026

Build: Source eval prompts from the CLI via EVAL_SETUP_PROMPT #34601

Closed

8 tasks

Merge remote-tracking branch 'origin/kasper/eval-prompts-from-cli' in…

5a85358

…to cursor/eval-css-loaded-prompt-check # Conflicts: # scripts/eval/prompts/pattern-copy-play.md

This was referenced Apr 20, 2026

Eval: Record hasCssCheckStory when the diff adds CssCheck #34595

Merged

CLI: Sync storybook setup prompt with eval pattern-copy-play improvements #34596

Closed

kasperpeulen added the build (Internal-facing build tooling & test updates) and ci:normal labels Apr 20, 2026

kasperpeulen marked this pull request as ready for review April 20, 2026 15:07

kasperpeulen requested review from Sidnioulz and yannbf April 20, 2026 15:07

kasperpeulen added the maintenance (User-facing maintenance tasks) label and removed the build (Internal-facing build tooling & test updates) label Apr 20, 2026

kasperpeulen changed the title ~~Build: Source eval prompts from the CLI via EVAL_SETUP_PROMPT~~ CLI: Source eval prompts from the CLI via EVAL_SETUP_PROMPT Apr 20, 2026

kasperpeulen marked this pull request as draft April 20, 2026 15:12

chore: oxfmt scripts/eval/eval.ts (fix CI format-check)

741237e

Trivial one-line signature reflow picked up by `oxfmt --check` after the merge of #34602 into this branch. No behavior change.

coderabbitai Bot reviewed Apr 20, 2026


kasperpeulen added 6 commits April 21, 2026 17:18

Merge remote-tracking branch 'origin/kasper/eval-prompts-from-cli' in…

86ae333

…to cursor/eval-css-loaded-prompt-check

Merge remote-tracking branch 'origin/kasper/eval-sync-storybook-versi…

9140f6c

…on' into kasper/eval-prompts-from-cli

Merge remote-tracking branch 'origin/kasper/eval-sync-storybook-versi…

62f2fa6

…on' into cursor/eval-css-loaded-prompt-check

Sidnioulz reviewed Apr 25, 2026
