CLI: Source eval prompts from ai setup via EVAL_SETUP_PROMPT #34602

Merged
yannbf merged 27 commits into project/sb-agentic-setup from kasper/eval-prompts-from-cli
Apr 30, 2026

Member

@kasperpeulen kasperpeulen commented Apr 20, 2026

Closes #

What I did

This PR makes the eval harness use the same AI setup prompt source as the public CLI.

  • Moved the eval prompt variants into code/lib/cli-storybook/src/ai/prompts/.
  • Added a registry that selects a prompt variant from the internal EVAL_SETUP_PROMPT env var, with pattern-copy-play as the default.
  • Simplified code/lib/cli-storybook/src/ai/prompt.ts so it delegates prompt selection to the registry.
  • Updated the eval harness so the agent runs npx storybook ai setup itself. The selected prompt variant is passed through the agent environment, which means evals now exercise the same CLI flow a user would use.
  • Removed the duplicated markdown prompt files under scripts/eval/prompts/ and updated the supporting docs and tests.

This gives the CLI a single source of truth for AI setup prompts while still letting the eval suite switch between internal variants for experiments.
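The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ProjectInfo` is reduced to one field, the builder bodies are stand-ins, and `resolvePromptName` and `getInstructions` are hypothetical names. The own-property check mirrors the PR's later switch to `Object.hasOwn(...)`; the older `hasOwnProperty.call` form is used here only so the sketch compiles against older lib targets.

```typescript
// Hypothetical minimal shape; the real ProjectInfo carries more fields.
type ProjectInfo = { configDir: string };
type PromptBuilder = (projectInfo: ProjectInfo) => string;

// Stand-in builders: the real variants return full, project-aware instructions.
const PROMPT_BUILDERS: Record<string, PromptBuilder> = {
  'pattern-copy-play': (info) => `pattern-copy-play instructions for ${info.configDir}`,
  setup: (info) => `setup instructions for ${info.configDir}`,
};

const DEFAULT_PROMPT_NAME = 'pattern-copy-play';

// Resolve the variant from an env-like object. Unset or unknown values,
// including inherited names such as "toString", fall back to the default.
function resolvePromptName(env: Record<string, string | undefined>): string {
  const requested = env.EVAL_SETUP_PROMPT;
  if (
    requested !== undefined &&
    Object.prototype.hasOwnProperty.call(PROMPT_BUILDERS, requested)
  ) {
    return requested;
  }
  return DEFAULT_PROMPT_NAME;
}

// Hypothetical helper: build the instructions for the resolved variant.
function getInstructions(
  env: Record<string, string | undefined>,
  info: ProjectInfo
): string {
  return PROMPT_BUILDERS[resolvePromptName(env)](info);
}
```

Because real users never set `EVAL_SETUP_PROMPT`, they always land on the default branch; only the eval harness takes the other path.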

Checklist for Contributors

Testing

The changes in this PR are covered in the following automated tests:

  • stories
  • unit tests
  • integration tests
  • end-to-end tests

code/lib/cli-storybook/src/ai/index.test.ts, scripts/eval/lib/utils.test.ts, and scripts/eval/lib/run-trial.test.ts cover prompt lookup, validation, environment plumbing, and prompt capture.

Manual testing

Caution

This section is mandatory for all contributions. If you believe no manual test is necessary, please state so explicitly. Thanks!

  1. Run npx storybook ai setup in a Storybook project and confirm the default output matches the current pattern-copy-play prompt.
  2. Run node scripts/eval/eval.ts --list-prompts and confirm both pattern-copy-play and setup are listed.
  3. Run node scripts/eval/eval.ts -p mealdrop --prompt setup --manual and confirm the printed command is prefixed with EVAL_SETUP_PROMPT=setup.
  4. In a trial workspace, confirm the agent runs npx storybook ai setup and the saved prompt.content matches the selected variant.

Documentation

  • Add or update documentation reflecting your changes
  • If you are deprecating/removing a feature, make sure to update
    MIGRATION.MD

Updated scripts/eval/README.md to describe the prompt-variant workflow.

Checklist for Maintainers

  • When this PR is ready for testing, make sure to add ci:normal, ci:merged or ci:daily GH label to it to run a specific set of sandboxes. The particular set of sandboxes can be found in code/lib/cli-storybook/src/sandbox-templates.ts

  • Make sure this PR contains one of the labels below:

    Available labels
    • bug: Internal changes that fix incorrect behavior.
    • maintenance: User-facing maintenance tasks.
    • dependencies: Upgrading (sometimes downgrading) dependencies.
    • build: Internal-facing build tooling & test updates. Will not show up in release changelog.
    • cleanup: Minor cleanup style change. Will not show up in release changelog.
    • documentation: Documentation only changes. Will not show up in release changelog.
    • feature request: Introducing a new feature.
    • BREAKING CHANGE: Changes that break compatibility in some way with current major version.
    • other: Changes that don't fit in the above categories.

🦋 Canary release

This PR does not have a canary release associated. You can request a canary release of this pull request by mentioning the @storybookjs/core team here.

core team members can create a canary release here or locally with gh workflow run --repo storybookjs/storybook publish.yml --field pr=<PR_NUMBER>

Summary by CodeRabbit

  • Refactor

    • Replaced markdown-based prompts with a registry of prompt variants backed by pluggable prompt builders.
  • New Features

    • Choose prompt variants via CLI/env and pass selection into AI runs.
    • Optional telemetry gating for the AI setup flow.
    • Capture AI-setup markdown output during trials.
    • Include a CssCheck status in grading and PR descriptions.
  • Documentation

    • Updated eval docs and CLI help to describe prompt variants.
  • Tests

    • Added/updated tests for prompt registry, agent env handling, trial capture, and grading outputs.

Addresses #34594. Adds a prompt-level instruction and a grading flag
that together catch "renders fine, but user CSS never loaded" failures.

- pattern-copy-play.md: new Step 7 requires exactly one story to assert
  a component-specific computed style via getComputedStyle.
- grade.ts: records hasComputedStyleAssertion based on whether the staged
  diff contains "getComputedStyle" (reuses the existing cached diff, no
  extra file reads).

Chose the prompt+diff approach over a runtime stylesheet heuristic
(filtering document.styleSheets, isolated all:initial probe, etc.)
because:

- The agent already knows what "styled correctly" means for a given
  component; a component-specific computed-style assertion catches the
  real failure ("bg-blue-600 did not apply") rather than a generic
  "something was applied" signal.
- No fragile filtering of vitest-browser / storybook / addon stylesheet
  sources. Addons keep shipping new sheets; that filter would bit-rot.
- Failures surface as normal Vitest assertion failures and already flow
  through pass/fail grading — no new counter, no new warning channel,
  no changes to render-analysis.
- Complementary to a future runtime heuristic if we want one: prompt-level
  catches "agent misconfigured the design system"; runtime catches "agent
  shipped a visibly unstyled story without the check".

'Render call' could read as 'you need a render: () => ... function',
which is wrong — args stories have no render call and that's the
preferred shape for prop-driven components. Softening to 'just
rendering the component in the story is enough' keeps the intent
without steering toward render().

Before: hasComputedStyleAssertion was a plain rawDiff.includes('getComputedStyle'),
which matched the prompt markdown (written to .storybook/eval-results/prompt.md
before grade runs) and the transcript JSON — both of which contain the token
verbatim because the new prompt Step 7 and the agent's own tool-output lines
include it. The flag was effectively tautological: true whenever the prompt was
staged, regardless of what the agent did.

After: parse the unified patch, track which file each hunk belongs to via the
'+++ b/<path>' headers, and only consider added lines (skipping the '+++' header
itself) that live in files also present in storybookChanges. Uses the existing
STORY_FILE_PATTERN from story-render.ts as the single source of truth for what
counts as a story file.

Exports diffAddsTokenInStoryFiles as a pure helper with unit tests covering the
false-positive paths (prompt.md / data.json), deleted lines, the +++ header,
and files not in storybookChanges.
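A rough sketch of the "After" behavior described in this commit message, under stated assumptions: the regex below is a hypothetical stand-in for the real `STORY_FILE_PATTERN` in story-render.ts, the parameter order is a guess, and the parser is deliberately simple (an added line whose own content starts with `++ ` would be misread as a file header).

```typescript
// Hypothetical stand-in; the real pattern is exported from story-render.ts.
const STORY_FILE_PATTERN = /\.stories\.[cm]?[jt]sx?$/;

// Returns true only when `token` appears on an added line of a story file
// that is also listed in `storybookChanges`.
function diffAddsTokenInStoryFiles(
  rawDiff: string,
  token: string,
  storybookChanges: readonly string[]
): boolean {
  const changed = new Set(storybookChanges);
  let currentFile: string | null = null;

  for (const line of rawDiff.split('\n')) {
    if (line.startsWith('+++ ')) {
      // File header: remember which file the following hunks belong to.
      // Deleted files show up as "+++ /dev/null" and are ignored.
      const target = line.slice(4).trim();
      currentFile = target.startsWith('b/') ? target.slice(2) : null;
      continue;
    }
    // Only added lines count; context and deleted lines are skipped.
    if (currentFile === null || !line.startsWith('+')) continue;
    if (!STORY_FILE_PATTERN.test(currentFile) || !changed.has(currentFile)) continue;
    if (line.slice(1).includes(token)) return true;
  }
  return false;
}
```

Scoping the match to story files is what removes the tautology: the staged prompt.md and transcript JSON still contain the token, but they never pass the file filter.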
Aligns the prompt + grade check with the Slack agreement: instead of
hoping the agent adds *some* `getComputedStyle` call somewhere, the
prompt now asks for one story explicitly named `CssCheck`. That
specific story name is what the AI-stories vitest run in core will
grep for to attribute the pass/fail result in the
`ai-setup-final-scoring` telemetry event.

- `pattern-copy-play.md` Step 7: heading + example updated to
  `export const CssCheck: Story = { ... }`.
- `grade.ts`: `hasComputedStyleAssertion` -> `hasCssCheckStory`,
  token matched in the diff changed from `getComputedStyle` to
  `export const CssCheck`.
- `grade.test.ts`: added two tests locking in the new use case
  (positive: story-file diff with the export; negative: prompt.md
  false positive).
- Trial / publish / result-docs test mocks renamed to match.

Rationale (from Slack): giving the story a known name means
telemetry in core can report on the CSS check result directly,
without layering on a separate tag. The story also ends up being
educational — a visible example of how to verify CSS loaded. No
tag, no new telemetry field required on top of whatever core
adds in a follow-up PR.
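For orientation, a `CssCheck` story along these lines might look like the sketch below. Everything here is an assumption made for illustration: the local `Story` and `PlayContext` types stand in for Storybook's own, the expected color is hypothetical, and `readStyle` is injected only so the sketch runs outside a browser. In a real story the play function would call the global `getComputedStyle(element)` directly.

```typescript
// Local stand-ins so the sketch type-checks without Storybook or a DOM.
type Styles = { backgroundColor: string };
type Canvas = { querySelector(selector: string): object | null };
type PlayContext = { canvasElement: Canvas; readStyle: (element: object) => Styles };
type Story = {
  args?: Record<string, unknown>;
  play?: (context: PlayContext) => Promise<void>;
};

// Hypothetical expected value: whatever the project's primary button resolves to.
const EXPECTED_BACKGROUND = 'rgb(37, 99, 235)';

export const CssCheck: Story = {
  args: { variant: 'primary' },
  play: async ({ canvasElement, readStyle }) => {
    const button = canvasElement.querySelector('button');
    if (button === null) throw new Error('CssCheck: button did not render');

    // A component-specific assertion: fails if the project's CSS never loaded.
    const { backgroundColor } = readStyle(button);
    if (backgroundColor !== EXPECTED_BACKGROUND) {
      throw new Error(`CssCheck: expected ${EXPECTED_BACKGROUND}, got ${backgroundColor}`);
    }
  },
};
```

The fixed export name is the point: the grader can look for the literal `export const CssCheck` in added story-file lines, and telemetry can attribute the pass or fail result by story name.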
Move the eval harness's prompt catalog into code/lib/cli-storybook/src/ai/prompts/
so trials exercise the exact prompt a real user gets from `npx storybook ai setup`.
Each variant lives in its own fully isolated .ts file; the registry selects one at
runtime via the internal EVAL_SETUP_PROMPT env var (unset for real users → always
the default). The harness now hands the agent the AI_SETUP_PROMPT nudge and sets
EVAL_SETUP_PROMPT on the agent's spawn, so the agent itself runs `ai setup` as a
tool call, mirroring the real user flow instead of resolving the prompt upfront.

Bring the three prompt-content changes that were about to ship in #34596 onto
the post-refactor layout. Applies to code/lib/cli-storybook/src/ai/prompts/
pattern-copy-play.ts (previously getSetupInstructions in prompt.ts):

- New end-state paragraph in the intro clarifying that the shared preview should
  own all providers, CSS, browser state, and network mocks so rendering the
  component in the story is enough.
- New "#### Args vs render" subsection under Step 5 with two full examples
  (args-driven Button, render-based composition inside Card), via two new
  self-contained helpers getArgsStoryExample and getRenderCompositionExample.
- New Step 7 "Prove CSS is loaded in exactly one story named CssCheck" asserting
  a component-specific computed style via getComputedStyle to catch "renders but
  CSS never loaded" failures. Steps 8 and 9 renumbered accordingly.

Makes #34596 redundant against this branch.

…to cursor/eval-css-loaded-prompt-check

# Conflicts:
#	scripts/eval/prompts/pattern-copy-play.md
@kasperpeulen kasperpeulen added the build and ci:normal labels Apr 20, 2026
@kasperpeulen kasperpeulen marked this pull request as ready for review April 20, 2026 15:07
nx-cloud Bot commented Apr 20, 2026

View your CI Pipeline Execution ↗ for commit a6aaf4f

| Command | Status | Duration | Result |
| --- | --- | --- | --- |
| nx run-many -t compile,check,knip,test,lint,fmt... | ✅ Succeeded | 9m 48s | View ↗ |

☁️ Nx Cloud last updated this comment at 2026-04-22 14:31:06 UTC

@kasperpeulen kasperpeulen added the maintenance label and removed the build label Apr 20, 2026
@kasperpeulen kasperpeulen changed the title from "Build: Source eval prompts from the CLI via EVAL_SETUP_PROMPT" to "CLI: Source eval prompts from the CLI via EVAL_SETUP_PROMPT" Apr 20, 2026
@kasperpeulen kasperpeulen marked this pull request as draft April 20, 2026 15:12
Trivial one-line signature reflow picked up by `oxfmt --check` after
the merge of #34602 into this branch. No behavior change.
Contributor

coderabbitai Bot commented Apr 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Replaces file-based markdown prompts with a TypeScript prompt registry and prompt-builder modules, threads selected prompt variant via EVAL_SETUP_PROMPT through the eval harness into agent execution, and conditions ai-setup telemetry snapshotting on a new disableTelemetry option.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Prompt registry & builders**: `code/lib/cli-storybook/src/ai/prompts/index.ts`, `code/lib/cli-storybook/src/ai/prompts/pattern-copy-play.ts`, `code/lib/cli-storybook/src/ai/prompts/setup.ts` | Added centralized PROMPT_BUILDERS registry, exported PROMPT_NAMES/DEFAULT_PROMPT_NAME/PromptName, and two prompt-builder instructions(projectInfo) implementations that produce project-aware instruction strings. |
| **Prompt generation refactor**: `code/lib/cli-storybook/src/ai/prompt.ts` | Removed inline prompt-building and docs-URL helper; generateMarkdownOutput now imports getPrompts from the new registry and composes final markdown. |
| **Eval prompts → registry migration**: `scripts/eval/lib/utils.ts`, `scripts/eval/lib/utils.test.ts`, `scripts/eval/prompts/*` (deleted) | Replaced filesystem-based `prompts/*.md` discovery/loading with registry-driven listPrompts()/loadPrompt() validating against PROMPT_NAMES; deleted legacy markdown prompt files. |
| **Agent env plumbing**: `scripts/eval/lib/agents/config.ts`, `scripts/eval/lib/agents/claude-code.ts`, `scripts/eval/lib/agents/codex.ts` | Extended agent execute API to accept optional `env?: Record<string,string>` and merge it into the SDK runtime env so caller-provided vars (e.g., EVAL_SETUP_PROMPT) can override process.env while still forcing telemetry disable flag. |
| **Eval CLI / trial wiring**: `scripts/eval/eval.ts`, `scripts/eval/lib/run-trial.ts`, `scripts/eval/lib/run-trial.test.ts`, `scripts/eval/run-batch.ts` | Treat --prompt as a prompt variant name; pass selected variant via EVAL_SETUP_PROMPT into agent execution; added captureAiSetupMarkdown to run npx storybook ai setup and store captured markdown in trial artifacts; updated CLI help and tests. |
| **Telemetry gating in aiSetup**: `code/lib/cli-storybook/src/ai/index.ts`, `code/lib/cli-storybook/src/ai/types.ts`, `code/lib/cli-storybook/src/ai/index.test.ts` | Added optional disableTelemetry to AiSetupOptions and skip preview snapshot + pending-cache evidence when disableTelemetry is true; tests updated/added to cover behavior. |
| **Reporting & grading additions**: `scripts/eval/lib/story-render.ts`, `scripts/eval/lib/grade.ts`, `scripts/eval/lib/publish-trial.ts`, `scripts/eval/lib/publish-trial.test.ts` | Introduced a three-state cssCheck field (pass |
| **Docs & README**: `scripts/eval/README.md` | Updated docs to describe prompt variants, --prompt semantics, default pattern-copy-play, and instructions for adding new TypeScript prompt variants to the registry. |
| **Tests & harness adjustments**: `scripts/eval/lib/run-trial.test.ts`, `scripts/eval/lib/utils.test.ts`, `code/lib/cli-storybook/src/ai/index.test.ts` | Updated tests to reflect registry-driven prompts, env propagation (EVAL_SETUP_PROMPT), captured npx storybook ai setup output assertions, and telemetry gating behaviour. |
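The env merge summarized under "Agent env plumbing" could be sketched as below. This is an assumption-laden illustration: `buildAgentEnv` is a hypothetical name, and `STORYBOOK_DISABLE_TELEMETRY` is borrowed from a later commit message about the capture subprocess, so the agent drivers may well force a different flag.

```typescript
type Env = Record<string, string | undefined>;

// Merge order matters: caller overrides beat the inherited process env,
// and the telemetry flag is applied last so neither source can re-enable it.
function buildAgentEnv(processEnv: Env, overrides: Record<string, string> = {}): Env {
  return {
    ...processEnv,
    ...overrides,
    // Assumed flag name; see the lead-in.
    STORYBOOK_DISABLE_TELEMETRY: '1',
  };
}
```

A call such as `buildAgentEnv(process.env, { EVAL_SETUP_PROMPT: 'setup' })` would then hand the agent an environment in which its own `npx storybook ai setup` tool call resolves the `setup` variant.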

Sequence Diagram(s)

sequenceDiagram
  participant Runner as Eval Runner
  participant Driver as Eval Driver / SDK
  participant Agent as Agent (Claude/Codex)
  participant SBCLI as Storybook CLI (npx storybook ai setup)
  participant Registry as Prompt Registry

  Runner->>Driver: execute(promptName, env={EVAL_SETUP_PROMPT: promptName})
  Driver->>Agent: start agent with merged env
  Agent->>SBCLI: run "npx storybook ai setup" (inherits env)
  SBCLI->>Registry: resolve getPrompts(projectInfo) using EVAL_SETUP_PROMPT
  Registry-->>SBCLI: return prompts/instructions
  SBCLI-->>Agent: prints generated markdown to stdout
  Agent-->>Driver: capture stdout
  Driver-->>Runner: return captured markdown (stored in trial)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • #34365 — Implements eval harness and migrates prompt handling to a CLI-side TypeScript prompt registry (EVAL_SETUP_PROMPT selection, prompt builders), overlapping registry/agent wiring.
  • #34421 — Refactors prompts to a registry and updates CLI/default registration patterns; directly related to PROMPT_BUILDERS/DEFAULT_PROMPT_NAME changes.
  • #34595 — Adds and propagates CSS-related checks through grading and reporting (cssCheck field), overlapping the new grading/PR-body additions.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
scripts/eval/README.md (1)

254-260: Add a language specifier to the fenced code block.

The code block showing the env var flow lacks a language identifier. Since this is a text diagram rather than executable code, consider using text or plaintext.

📝 Suggested fix
-```
+```text
 eval.ts --prompt setup
   → run-trial.ts calls driver.execute({ env: { EVAL_SETUP_PROMPT: 'setup' } })
     → agent spawns with that env
       → agent's `npx storybook ai setup` tool call inherits EVAL_SETUP_PROMPT
         → CLI's getPrompts() picks the 'setup' variant
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @scripts/eval/README.md around lines 254 - 260, Update the fenced code block
that contains the env var flow diagram so it includes a language specifier
(e.g., add "text" or "plaintext" after the opening triple backticks); locate the
block showing "eval.ts --prompt setup → run-trial.ts calls driver.execute({ env:
{ EVAL_SETUP_PROMPT: 'setup' } }) → agent spawns..." and change the opening
"```" to "```text" to mark it as a plain text diagram for proper rendering.


</details>

</blockquote></details>

</blockquote></details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `0a5e3e78-8cbe-4976-84b3-6f790fc90a20`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 8ed433230702e49101e26e08ee5a2713136044f5 and 9412695be9a13536ccce4778ed3e21e87ed8a936.

</details>

<details>
<summary>📒 Files selected for processing (15)</summary>

* `code/lib/cli-storybook/src/ai/prompt.ts`
* `code/lib/cli-storybook/src/ai/prompts/index.ts`
* `code/lib/cli-storybook/src/ai/prompts/pattern-copy-play.ts`
* `code/lib/cli-storybook/src/ai/prompts/setup.ts`
* `scripts/eval/README.md`
* `scripts/eval/eval.ts`
* `scripts/eval/lib/agents/claude-code.ts`
* `scripts/eval/lib/agents/codex.ts`
* `scripts/eval/lib/agents/config.ts`
* `scripts/eval/lib/run-trial.ts`
* `scripts/eval/lib/utils.test.ts`
* `scripts/eval/lib/utils.ts`
* `scripts/eval/prompts/pattern-copy-play.md`
* `scripts/eval/prompts/setup.md`
* `scripts/eval/run-batch.ts`

</details>

<details>
<summary>💤 Files with no reviewable changes (2)</summary>

* scripts/eval/prompts/setup.md
* scripts/eval/prompts/pattern-copy-play.md

</details>

</details>


storybook-app-bot Bot commented Apr 20, 2026

Package Benchmarks

Commit: b35ff84, ran on 29 April 2026 at 18:27:29 UTC

The following packages have significant changes to their size or dependencies:

@storybook/cli

| | Before | After | Difference |
| --- | --- | --- | --- |
| Dependency count | 184 | 184 | 0 |
| Self size | 819 KB | 835 KB | 🚨 +16 KB 🚨 |
| Dependency size | 68.26 MB | 68.26 MB | 🚨 +281 B 🚨 |
| Bundle Size Analyzer | Link | Link | |

…ompt exports

- Spawn `npx storybook ai setup` inside the trial workspace with
  `EVAL_SETUP_PROMPT=<name>` and save its stdout as `prompt.content` so
  `data.json` and transcript docs carry the project-aware instructions
  instead of the one-line nudge.
- Rename every prompt variant's builder to `instructions` and use
  namespace imports in the prompts registry so all variant files share
  one export convention.
- Fix two stale `run-trial.test.ts` assertions that still expected the
  full markdown as the agent prompt; mock `tinyexec` and cover the new
  `prompt.content` field.
- Collapse `buildManualCommand` signature in `eval.ts` onto one line so
  `yarn fmt:check` passes.
- Gate ai-setup preview snapshot + ai-setup-pending cache write on
  !disableTelemetry — the record is only consumed by the ai-setup-evidence
  telemetry event, so it has no consumer when telemetry is off.
- Plumb disableTelemetry through AiSetupOptions and add a unit test covering
  the enabled / disabled / default paths.
- Swap `requested in PROMPT_BUILDERS` for `Object.hasOwn(...)` so prototype
  property names in EVAL_SETUP_PROMPT fall back to the default.
- Wrap captureAiSetupMarkdown in try/catch so spawn or timeout failures log
  and return an empty string instead of aborting the trial; set
  STORYBOOK_DISABLE_TELEMETRY=1 on the subprocess env.
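The capture step described in these commit messages might be shaped roughly like this. It is a sketch under assumptions: the command runner is injected as a hypothetical `RunCommand` so the example stays self-contained (the PR's tests mock `tinyexec`, whose API is not reproduced here), and the parameter list is a guess.

```typescript
// Hypothetical runner signature standing in for the real process-spawning helper.
type RunCommand = (
  command: string,
  args: string[],
  options: { cwd: string; env: Record<string, string> }
) => Promise<{ stdout: string }>;

// Runs `npx storybook ai setup` in the trial workspace and returns its stdout.
// Spawn or timeout failures are logged and yield '' instead of aborting the trial.
async function captureAiSetupMarkdown(
  run: RunCommand,
  workspaceDir: string,
  promptName: string
): Promise<string> {
  try {
    const { stdout } = await run('npx', ['storybook', 'ai', 'setup'], {
      cwd: workspaceDir,
      env: { EVAL_SETUP_PROMPT: promptName, STORYBOOK_DISABLE_TELEMETRY: '1' },
    });
    return stdout;
  } catch (error) {
    console.warn('Could not capture ai setup output:', error);
    return '';
  }
}
```

Returning an empty string keeps `prompt.content` optional from the trial's point of view: a failed capture loses the saved prompt text but not the trial itself.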
The single-to-double quote edit landed by accident in commit 4 of this PR
and doesn't belong to the eval-prompts refactor. It also broke
`yarn fmt:check` (oxfmt enforces single quotes). Restoring the base-
branch state fixes CI and keeps this PR scoped to the CLI/eval work.
…on' into cursor/eval-css-loaded-prompt-check
Comment on lines +107 to +121
// collect evidence of what the agent accomplished — but only via telemetry
// (the `ai-setup-evidence` event). Skip the snapshot + cache write when
// telemetry is disabled so there's nobody to read it.
if (!disableTelemetry) {
  const resolvedConfigDir = resolve(projectInfo.configDir);
  const previewSnapshot = await snapshotPreviewFile(resolvedConfigDir);
  const sessionId = await getSessionId();
  const pendingRecord: AiSetupPendingRecord = {
    timestamp: Date.now(),
    sessionId,
    configDir: resolvedConfigDir,
    ...previewSnapshot,
  };
  await cache.set('ai-setup-pending', pendingRecord);
}
Member

@ValentinFunk what do you think of this? Did you plan a new API for these use cases or is this still the "canonical" way to do things?

I think Kasper is doing the right thing here as we wanna avoid useless compute, but this happens earlier than when we actually can use the telemetry higher order function.

Contributor

Member

I'm so sorry Valentin, I need to stop pinging you by accident 😭

@Sidnioulz
Member

Haven't had time to read everything and run the eval but the architecture seems sound to me. Good idea capturing the conditional prompt and instrumenting the served prompt with an env var. My only concern so far would be avoiding bloating the prod build if we end up having more prompts, and more ai commands with their own prompts.

@@ -0,0 +1,283 @@
import { dedent } from 'ts-dedent';
Member

@yannbf yannbf Apr 27, 2026

Would be somehow nice to know that a specific prompt was created/experimented with at a specific date or which prompt that is (first ever? iteration number 5?), and even better if there is at least one link of an eval for it so it's easy to refer to its results. WDYT?

Comment thread code/lib/cli-storybook/src/ai/setup-prompts/pattern-copy-play.ts
Comment thread code/lib/cli-storybook/src/ai/index.ts Outdated
// collect evidence of what the agent accomplished — but only via telemetry
// (the `ai-setup-evidence` event). Skip the snapshot + cache write when
// telemetry is disabled so there's nobody to read it.
if (!disableTelemetry) {
Member

I think this now needs to be using isTelemetryModuleEnabled from storybook/internal/telemetry

Base automatically changed from kasper/eval-sync-storybook-version to project/sb-agentic-setup April 28, 2026 07:56
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
scripts/eval/README.md (2)

286-292: ⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced block to satisfy MD040.

Use an explicit fence like text or console.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/README.md` around lines 286 - 292, The fenced code block showing
the eval.ts → run-trial.ts → agent → tool → getPrompts flow lacks a language tag
which triggers MD040; update the triple-backtick fence to include a language tag
(e.g., ```text or ```console) so the block becomes ```text (or ```console) ...
``` to satisfy the linter while preserving the existing content.

278-278: ⚠️ Potential issue | 🟡 Minor

README flow is still inaccurate about who runs ai setup.

This line says the harness never spawns ai setup, but the harness also runs it for prompt capture (prompt.content path).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/README.md` at line 278, The README incorrectly states the
harness never runs `ai setup`; update the text to reflect that the harness does
run `ai setup` in the prompt-capture path (i.e., when populating prompt.content)
while still handing the task to the trial agent for normal execution; mention
both behaviors and reference the harness's prompt capture flow (prompt.content)
and the `ai setup` command so readers understand the distinction.
🧹 Nitpick comments (2)
scripts/eval/lib/utils.ts (2)

5-9: Keep the prompt registry out of this low-level utility.

Importing PROMPT_NAMES from index.ts also loads the prompt-builder modules and their instruction payloads, so every eval command that touches scripts/eval/lib/utils.ts now pays that cost even when it only needs names. A names-only registry module would keep startup and bundle size lower.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/utils.ts` around lines 5 - 9, The utils module is importing
PROMPT_NAMES (and DEFAULT_PROMPT_NAME) from the full prompt registry (index.ts),
which triggers loading of prompt-builder modules and instruction payloads;
replace that heavy import with a lightweight "names-only" registry export and
update the import in scripts/eval/lib/utils.ts to pull PROMPT_NAMES and
DEFAULT_PROMPT_NAME from the new names-only module (e.g., prompts/names or
prompts/registry-names) so consumers get just the list of names without loading
builders/payloads; create the names-only module to re-export only the minimal
constants and ensure utils.ts references those symbols (PROMPT_NAMES,
DEFAULT_PROMPT_NAME) from the new module.

149-167: Rename this helper or split validation from loading.

This function no longer loads a prompt variant; it only validates a name and returns AI_SETUP_PROMPT. The current API is easy to misread at future call sites, especially since the returned text is independent of name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/utils.ts` around lines 149 - 167, The function loadPrompt no
longer loads a variant by name (it only validates and returns the constant
AI_SETUP_PROMPT), so split validation from retrieval: add a new function
validatePromptName(name: string) that uses listPrompts() / PROMPT_NAMES and
throws the same error on missing names, then change loadPrompt to be a no-arg
function that simply returns AI_SETUP_PROMPT (or rename loadPrompt to
getSetupPrompt and make it no-arg); update all call sites to call
validatePromptName(name) where they currently pass a name, and call the no-arg
getSetupPrompt()/loadPrompt() to obtain AI_SETUP_PROMPT.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/README.md`:
- Line 7: The README has malformed inline markdown like "`**gh` CLI** —
installed and authenticated (`gh auth login`)" (and another occurrence at the
other noted line); fix by removing the mixed backtick+bold markup and use
consistent markdown for commands/file names — e.g. use backticks for commands
(`gh`, `gh auth login`) or bold for emphasis (**) but not both, and update both
occurrences to the chosen correct form.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fb456c16-3186-4721-8c8a-60812aebadf8

📥 Commits

Reviewing files that changed from the base of the PR and between a6aaf4f and 96b782c.

📒 Files selected for processing (3)
  • scripts/eval/README.md
  • scripts/eval/lib/utils.ts
  • scripts/eval/run-batch.ts
✅ Files skipped from review due to trivial changes (1)
  • scripts/eval/run-batch.ts

Comment thread scripts/eval/README.md Outdated
…mpt-check

Eval: Record hasCssCheckStory when the diff adds CssCheck
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/eval/lib/story-render.ts (1)

210-215: ⚠️ Potential issue | 🟡 Minor

Normalize cssCheck before returning summary.

At Line 214, cssCheck is copied directly from parsed.cssCheck. If parser output is missing/unknown, this violates the StoryRenderGrade runtime contract and leaks ambiguity downstream.

Suggested fix
   return {
     total: parsed.total,
     passed: parsed.passed,
     storyFiles,
-    cssCheck: parsed.cssCheck,
+    cssCheck:
+      parsed.cssCheck === 'pass' || parsed.cssCheck === 'fail' ? parsed.cssCheck : 'not-run',
   } satisfies StoryRenderGrade;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/lib/story-render.ts` around lines 210 - 215, The returned
summary currently passes parsed.cssCheck directly which can be undefined and
violate the StoryRenderGrade contract; update the return to normalize
parsed.cssCheck first (e.g. validate type/value and fallback to a defined
default) and return that normalized value instead of parsed.cssCheck. Locate the
return in the same function where total/passed/storyFiles are assembled and
replace the raw parsed.cssCheck with a normalized variable (or call a small
helper like normalizeCssCheck) so the object always satisfies StoryRenderGrade.
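
A small helper along the lines suggested above might look like this. It is a sketch under assumptions: the `StoryRenderGrade` shape is inferred from the fields in the suggested fix, and `buildSummary` and `ParsedRun` are hypothetical names used only to show the helper in context.

```typescript
type CssCheck = 'pass' | 'fail' | 'not-run';

// Assumed shape, based on the fields visible in the suggested fix.
interface StoryRenderGrade {
  total: number;
  passed: number;
  storyFiles: string[];
  cssCheck: CssCheck;
}

// Maps any parser output to one of the three allowed values.
function normalizeCssCheck(value: unknown): CssCheck {
  if (value === 'pass') return 'pass';
  if (value === 'fail') return 'fail';
  return 'not-run'; // missing, undefined, or unrecognized values
}

// Hypothetical parser output: cssCheck may be absent or malformed.
interface ParsedRun {
  total: number;
  passed: number;
  cssCheck?: unknown;
}

// Hypothetical wrapper showing where the normalization would be applied.
function buildSummary(parsed: ParsedRun, storyFiles: string[]): StoryRenderGrade {
  return {
    total: parsed.total,
    passed: parsed.passed,
    storyFiles,
    cssCheck: normalizeCssCheck(parsed.cssCheck),
  };
}
```

Keeping the fallback in one function means downstream consumers only ever see the three documented values.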
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/lib/grade.ts`:
- Around line 183-190: The current branch treating cssCheck === 'not-run' as an
error should be changed to an informational log: locate the code that reads
storyRenderRun.summary?.cssCheck into cssCheck and the three-branch logic that
calls logger.logSuccess / logger.logError; change the final else branch (the one
logging "CssCheck story missing or not run") to use an informational logging
method (e.g. logger.logInfo or logger.info) instead of logger.logError so
'not-run' is not reported as an error while keeping 'pass' as logSuccess and
'fail' as logError.
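
The three-branch logic described above could be sketched like this. The `Logger` interface is an assumption: `logSuccess` and `logError` are named in the comment, while `logInfo` is a hypothetical informational method, and the pass/fail messages are invented for illustration.

```typescript
type CssCheck = 'pass' | 'fail' | 'not-run';

// Assumed logger shape; the real logger may expose a different informational method.
interface Logger {
  logSuccess(message: string): void;
  logError(message: string): void;
  logInfo(message: string): void;
}

function reportCssCheck(cssCheck: CssCheck | undefined, logger: Logger): void {
  if (cssCheck === 'pass') {
    logger.logSuccess('CssCheck story passed');
  } else if (cssCheck === 'fail') {
    logger.logError('CssCheck story failed');
  } else {
    // 'not-run' or undefined: informational, not an error.
    logger.logInfo('CssCheck story missing or not run');
  }
}
```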

---

Outside diff comments:
In `@scripts/eval/lib/story-render.ts`:
- Around line 210-215: The returned summary currently passes parsed.cssCheck
directly which can be undefined and violate the StoryRenderGrade contract;
update the return to normalize parsed.cssCheck first (e.g. validate type/value
and fallback to a defined default) and return that normalized value instead of
parsed.cssCheck. Locate the return in the same function where
total/passed/storyFiles are assembled and replace the raw parsed.cssCheck with a
normalized variable (or call a small helper like normalizeCssCheck) so the
object always satisfies StoryRenderGrade.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a5bf4df9-6c42-4773-99b7-18500b928dc4

📥 Commits

Reviewing files that changed from the base of the PR and between 96b782c and 3b67e50.

📒 Files selected for processing (5)
  • scripts/eval/lib/grade.ts
  • scripts/eval/lib/publish-trial.test.ts
  • scripts/eval/lib/publish-trial.ts
  • scripts/eval/lib/run-trial.test.ts
  • scripts/eval/lib/story-render.ts
✅ Files skipped from review due to trivial changes (1)
  • scripts/eval/lib/publish-trial.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/eval/lib/run-trial.test.ts

Comment thread scripts/eval/lib/grade.ts
Comment thread scripts/eval/README.md Outdated
Comment thread scripts/eval/README.md Outdated
Comment thread scripts/eval/README.md Outdated
Co-authored-by: Steve Dodier-Lazaro <Sidnioulz@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (2)
scripts/eval/README.md (2)

16-16: ⚠️ Potential issue | 🟡 Minor

Fix malformed markdown: move bold outside backticks.

The current format `**sync-baselines.ts**` places the bold markers inside the code span, so they render as literal asterisks rather than bold. Markdown does not process emphasis inside inline code.

📝 Proposed fix
-1. `**sync-baselines.ts**` pushes a canonical `.storybook` config to each benchmark repo so every trial starts from the same known-good baseline.
+1. **`sync-baselines.ts`** pushes a canonical `.storybook` config to each benchmark repo so every trial starts from the same known-good baseline.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/README.md` at line 16, fix the malformed inline code: find the
literal string `**sync-baselines.ts**` and change it to **`sync-baselines.ts`**
so the file name renders as bold inline code (move the bold markers outside the
backticks); ensure no other inline code spans contain bold markers inside the
backticks.

286-292: ⚠️ Potential issue | 🟡 Minor

Add language identifier to fenced code block.

The fenced block is missing a language identifier, triggering markdownlint MD040.

🔧 Proposed fix
-```
+```text
 eval.ts --prompt setup
   → run-trial.ts calls driver.execute({ env: { EVAL_SETUP_PROMPT: 'setup' } })
     → agent spawns with that env
       → agent's `npx storybook ai setup` tool call inherits EVAL_SETUP_PROMPT
         → CLI's getPrompts() picks the 'setup' variant
-```
+```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/README.md` around lines 286 - 292, The fenced code block showing
"eval.ts --prompt setup" in the README is missing a language identifier which
triggers markdownlint MD040; fix it by adding a language tag (e.g., text) after
the opening triple backticks so the block becomes ```text ... ```, ensuring the
snippet that starts with "eval.ts --prompt setup" uses that language identifier;
update the README fenced block accordingly.
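
For readers unfamiliar with the flow in that fenced block, the environment plumbing can be sketched with two pure functions. This is an illustration under assumptions: `buildAgentEnv` and `resolvePromptName` are hypothetical names, and treating an empty value as unset is a choice made for this sketch, not confirmed behavior of the CLI.

```typescript
type Env = Record<string, string | undefined>;

// Default variant named in the PR description.
const DEFAULT_PROMPT_NAME = 'pattern-copy-play';

// Harness side: copy the parent environment and add the selected variant,
// so any process the agent spawns inherits EVAL_SETUP_PROMPT.
function buildAgentEnv(parentEnv: Env, promptName: string): Env {
  return { ...parentEnv, EVAL_SETUP_PROMPT: promptName };
}

// CLI side: read the variant from the environment, falling back to the default.
// Assumption: an empty string is treated the same as an unset variable.
function resolvePromptName(env: Env): string {
  return env.EVAL_SETUP_PROMPT || DEFAULT_PROMPT_NAME;
}
```

With this shape, `resolvePromptName(buildAgentEnv({}, 'setup'))` yields `'setup'`, while a user running the CLI without the variable gets the default variant.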
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@scripts/eval/README.md`:
- Line 16: Fix the malformed inline code: find the literal string
`**sync-baselines.ts**` and change it to **`sync-baselines.ts`** so the file name
renders as bold inline code (move the bold markers outside the backticks);
ensure no other inline code spans contain bold markers inside the backticks.
- Around line 286-292: The fenced code block showing "eval.ts --prompt setup" in
the README is missing a language identifier which triggers markdownlint MD040;
fix it by adding a language tag (e.g., text) after the opening triple backticks
so the block becomes ```text ... ```, ensuring the snippet that starts with
"eval.ts --prompt setup" uses that language identifier; update the README fenced
block accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 53a8c4b1-ce2b-4100-988f-ddc351c6d0e4

📥 Commits

Reviewing files that changed from the base of the PR and between 3b67e50 and 8e7c397.

📒 Files selected for processing (1)
  • scripts/eval/README.md

@yannbf yannbf merged commit db8eabc into project/sb-agentic-setup Apr 30, 2026
122 checks passed
@yannbf yannbf deleted the kasper/eval-prompts-from-cli branch April 30, 2026 05:16
Labels

ci:normal maintenance User-facing maintenance tasks

4 participants