try: storybookjs/storybook#34766 - label on object JSON control #14
valentinpalkovic wants to merge 2 commits into
Verify Harness: Verdict: Replay: Screenshots
cd32170 to b5a0c1a
…k to author

Previously only evidence-missing/undetermined verdicts triggered retry. Regression verdicts (Playwright assertions failed) skipped retry entirely, even though the author could often self-correct from the error trace (wrong route, missing trigger state, stale selector).

This change extends retry to cover regression by:

1. Wrapping the initial verify-pr step with `|| true` so the workflow continues on Playwright failure; the final verdict gate at the end of the job preserves red-CI signal based on post-retry verdict.
2. Evidence-check + Retry steps switch from `if: success()` to `if: always()` and gate internally on the JSON verdict.
3. Retry step extends gate to include verdict==regression. Retry-context for regression cases is built from each failed test's error-context.md (page snapshot) + first Playwright error message from playwright-report.json, capped at 8 KB per snapshot.
4. verify-pr-generate's --retry-context preamble is softened to cover both regression (selector / route correction) and evidence (trigger-state correction) paths.
5. Authoring guide §6: explicit guidance for non-stories diffs: locate the sibling *.stories.tsx and derive kind-id from its file path under the registered titlePrefix in code/.storybook/main.ts. Avoids agents guessing docs-site routes (e.g. addons-controls-basics) that don't exist in internal-ui.

Empirical findings from PR #14 and #16 firetest:
- #14 (Object.tsx control label diff) agent navigated to non-existent route addons-controls-object--basic. Correct route would be addons-docs-blocks-controls-object--object (or --docs autodocs page).
- #16 (HighlightStyles conditional CSS) agent's toHaveCSS regex timed out because HighlightStyles only mounts on keyboard-nav highlight. Author had no way to learn this without the Playwright trace.

Retry-on-regression gives the author one chance to self-correct using the page snapshot at failure point.
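The retry gate and retry-context assembly described in this commit could look roughly like the sketch below. It is only an illustration: the `Verdict` string values, the `FailedTest` shape, and the function names are assumptions, and the 8 KB cap is approximated in characters rather than bytes.

```typescript
// Hypothetical verdict names; the real harness may spell them differently.
type Verdict = 'verified' | 'regression' | 'evidence-missing' | 'undetermined';

// Hypothetical shape for one failed Playwright test.
interface FailedTest {
  title: string;
  firstError: string; // first error message taken from playwright-report.json
  pageSnapshot: string; // contents of the test's error-context.md
}

// 8 KB cap per snapshot, approximated here as 8192 characters.
const SNAPSHOT_CAP = 8 * 1024;

// Retry now covers regression in addition to the evidence verdicts.
function shouldRetry(verdict: Verdict): boolean {
  return (
    verdict === 'regression' ||
    verdict === 'evidence-missing' ||
    verdict === 'undetermined'
  );
}

// Truncate an oversized snapshot and mark the cut.
function cap(text: string, max: number = SNAPSHOT_CAP): string {
  return text.length <= max ? text : text.slice(0, max) + '\n[truncated]';
}

// Build the text handed back to the recipe author as retry context.
function buildRetryContext(failed: FailedTest[]): string {
  return failed
    .map((t) =>
      [`### ${t.title}`, `Error: ${t.firstError}`, cap(t.pageSnapshot)].join('\n'),
    )
    .join('\n\n');
}
```

Keeping the gate as a pure function of the JSON verdict mirrors the move from `if: success()` to `if: always()`: the workflow step always runs, and the decision lives in data rather than in the step's exit status.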
Past dispatches have 404'd guessing kind-ids by hand (PR #14: agent picked addons-controls-object--basic when the real route was addons-docs-blocks-controls-object--object). Storybook's auto-title pipeline mangles paths differently than a naive kebabify (leaf/dir dedupe, index.stories.ts collapsing, titlePrefix interplay), so naive derivation keeps failing.

Fix: harness pre-computes the routes deterministically using Storybook's own algorithms and surfaces them in the prompt bundle.

New: scripts/verify/derive-story-routes.ts. Pure-TS module that:
- Parses code/.storybook/main.ts via the TypeScript AST to extract the `stories` config (string + object specifiers, with directory/files/titlePrefix resolved).
- Re-implements Storybook's autoTitle pipeline (sanitize + pathJoin + leaf/dir dedupe) and toId / storyNameFromExport / toStartCaseStr inline, kept verbatim from code/core/src/{csf,preview-api/...} so output matches what the indexer would emit at runtime.
- For a given *.stories.{ts,tsx,mdx} file, parses the meta object to extract `title:` override + `tags:` (for autodocs detection) + exports.
- Returns { title, kindId, autodocs, routes: [{exportName, storyId, storyUrl, docsUrl?}] }.

Integration: scripts/verify-pr-generate.ts collectRelevantStoryFiles() walks the PR diff and, for each touched file under code/:
- If the touched file is itself a story → include it.
- If the touched file is a non-stories source → look for a sibling story with the same basename (Object.tsx → Object.stories.tsx). If absent, fall back to sibling stories that *import* the changed module by name. If still none, emit nothing for that file (avoids alphabetical fallback that sent past runs to unrelated stories).

The harness then derives routes for each resolved story file and injects a "Story routes (computed deterministically by the harness)" section into the prompt before the optional retry-context block.
Authoring guide §6: rewritten to instruct agents to use the pre-computed routes verbatim instead of re-deriving kebab-case kind-ids by hand. Past 404 examples (addons-controls-object--basic, addons-controls-basics--docs) called out explicitly.

Smoke test: scripts/verify/derive-story-routes.test.ts. Asserts canonical routes for:
- Object.stories.tsx (auto-title with autodocs, satisfies-wrapped meta, titlePrefix 'addons/docs').
- Button.stories.tsx (auto-title with leaf/dir dedupe under titlePrefix 'components').

Both currently pass against the live internal-ui main.ts.
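To make the route derivation concrete, here is a simplified sketch of the auto-title and id steps. It is not the verbatim Storybook code that the commit says it inlines: `sanitize` below only handles ASCII letters and digits, and the relative file paths in the usage note are hypothetical, chosen so the result matches the route quoted above.

```typescript
// Simplified approximation of the sanitize step: lowercase, collapse every
// run of non-alphanumeric characters into one dash, trim dashes at the ends.
function sanitize(input: string): string {
  return input
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Story id = sanitized title, then "--", then sanitized story name.
function toId(title: string, storyName?: string): string {
  const kind = sanitize(title);
  return storyName ? `${kind}--${sanitize(storyName)}` : kind;
}

// Auto-title sketch: strip the stories extension, drop a trailing "index"
// or a leaf that repeats its directory name, then prepend the titlePrefix.
function autoTitle(relPath: string, titlePrefix: string): string {
  const withoutExt = relPath.replace(/\.stories\.(ts|tsx|mdx)$/, '');
  const parts = withoutExt.split('/').filter(Boolean);
  const leaf = parts[parts.length - 1];
  const dir = parts[parts.length - 2];
  if (/^index$/i.test(leaf ?? '') || (dir !== undefined && leaf === dir)) {
    parts.pop();
  }
  return [titlePrefix, ...parts].filter(Boolean).join('/');
}
```

For example, assuming Object.stories.tsx sits at `blocks/controls/Object.stories.tsx` under the specifier with titlePrefix 'addons/docs', `toId(autoTitle(...), 'Object')` gives `addons-docs-blocks-controls-object--object`, and `autoTitle('Button/Button.stories.tsx', 'components')` collapses to `components/Button`.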
After the deterministic story-route util lands, agents still regressed on non-trivial DOM details (PR #14: assumed textarea[name="value"] when the story passes `args: { name: "object" }` so the rendered input id derives from "object"; PR #16: couldn't see that the TreeNode story doesn't mount the Explorer / HighlightStyles tree). The page-snapshot retry feedback captures manager DOM only; iframe story content is opaque.

Fix: include the source of each resolved story file in the prompt bundle under a "Story file sources" section, capped at 160 lines per file so the prompt stays bounded. Agents now see `meta.args`, story-level `args`, `parameters`, and `tags`, enough to derive the rendered input id and the mount conditions without guessing.

Same file-resolution path as the routes section reuses collectRelevantStoryFiles(): touched stories first, then sibling stories that import the changed module by basename.
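A minimal sketch of the per-file cap behind the "Story file sources" section, assuming a simple line-based truncation. The `StoryFile` shape, the function names, and the truncation marker are hypothetical; only the 160-line limit and the section title come from the commit message.

```typescript
// Hypothetical shape for a resolved story file.
interface StoryFile {
  path: string;
  source: string;
}

const MAX_LINES_PER_STORY_FILE = 160;

// Keep the first maxLines lines and note how many were dropped.
function capLines(source: string, maxLines: number = MAX_LINES_PER_STORY_FILE): string {
  const lines = source.split('\n');
  if (lines.length <= maxLines) return source;
  const omitted = lines.length - maxLines;
  return [...lines.slice(0, maxLines), `// ... ${omitted} more line(s) omitted`].join('\n');
}

// Render the prompt section: one heading per file, then its capped source.
function renderStorySources(files: StoryFile[]): string {
  const body = files
    .map((f) => `--- ${f.path} ---\n${capLines(f.source)}`)
    .join('\n\n');
  return `Story file sources\n\n${body}`;
}
```

Capping from the top keeps the meta object and the first stories, which is where `args`, `parameters`, and `tags` usually live.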
PR #14 Day-3 firetest showed agent self-correcting routes + label text from the deterministic-route util but still missing the runtime trigger state (ObjectControl's RawButton toggle needs a click before the textarea mounts). The retry-on-regression page-snapshot only captures manager DOM; iframe story content is opaque, so the agent never sees the actual `switch "Edit object as JSON"` it needs to click.

Fix: auto-capture the preview iframe accessibility snapshot on test failure / timeout, then feed it into the retry-context alongside the existing manager snapshot.

- .verify-recipes/_util.ts: re-exports `expect` + a `test` extended with an auto-running `recipeFailureCapture` fixture. On non-passed status it finds the preview frame (matches `/preview|iframe\.html/`), takes `body.ariaSnapshot()`, writes it to `iframe-snapshot.md` in the test's outputDir, and attaches it to the report. Best-effort: never breaks the test reporter.
- .verify-recipes/example-smoke.spec.ts + authoring-guide §1: agent recipes now import `test` + `expect` from `./_util.ts` so the fixture applies automatically.
- scripts/verify/recipe-deny.ts: new pattern rejects `from '@playwright/test'` imports; importing directly bypasses the fixture and loses the iframe snapshot.
- scripts/verify/agent-prompt.ts: hard-requirements list updated to match.
- .github/workflows/verify-pr.yml: Retry-on-regression step ERROR_CTX builder now appends `iframe-snapshot.md` (capped 8 KB) after each `error-context.md` so the retry author sees both manager + iframe DOM at the failure point.
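The new deny pattern for direct `@playwright/test` imports might be sketched as follows. The regex and the `findDeniedImports` helper are assumptions for illustration; the actual pattern in recipe-deny.ts is not shown in this commit message.

```typescript
// Hypothetical deny pattern: any import or re-export that pulls from
// '@playwright/test' directly instead of going through ./_util.ts.
const DIRECT_PLAYWRIGHT_IMPORT = /from\s+['"]@playwright\/test['"]/;

// Return the offending lines so the rejection message can quote them.
function findDeniedImports(specSource: string): string[] {
  return specSource
    .split('\n')
    .filter((line) => DIRECT_PLAYWRIGHT_IMPORT.test(line));
}

// A recipe is acceptable only when no line matches the deny pattern.
function isRecipeAllowed(specSource: string): boolean {
  return findDeniedImports(specSource).length === 0;
}
```

A line-based regex is a coarse check. It would miss a dynamic `import('@playwright/test')`, so it makes sense as one layer alongside the lint policy rather than as the only guard.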
Verify Harness: No verdict produced. The workflow failed before the harness ran (likely recipe-author dispatch, deny-regex, or lint). See run log for details.
Verify Harness: Verdict: Evidence (after 1 retry) (vision-check, Replay: Screenshots
…x-target CI, Layer-1/Layer-2 security, retry on regression, telemetry

Squash of fork-side iteration on top of the single-round v6 pivot. Major changes since 00aa5c4:

## Verdict layering
- Three orthogonal signals: Playwright (recipe execution) + vision evidence-check (claude-haiku-4-5 reading the diff + spec + screenshots) + PR-added unit tests (vitest on *.test.* files from the PR diff).
- Final verdict gates on AND of Playwright + unit tests. Vision is informational (catches sr-only / invisible changes where assertions pass but screenshots can't confirm).
- regressionReason is derived from playwright-report.json when the recipe author doesn't populate one; reviewers see the failing test title + first error inline.

## Retry loop
- Retry-on-regression: feeds Playwright error context (page snapshot + iframe a11y snapshot + first error from playwright-report.json) back to the recipe author as --retry-context. Author re-emits the spec, Playwright re-runs. Single retry; final verdict gates label.
- Retry-on-evidence-undetermined: feeds vision reasoning back so the author can target the diff more precisely (e.g., tighter screenshot region).

## Sandbox-target CI path
- Recipes can set `// @verify-target: sandbox:<template>` (e.g., `sandbox:vue3-vite/default-ts`). The workflow detects the header, runs `nx run <template>:sandbox` (NX resolves implicitDependencies, emits the sandbox at code/sandbox/<key>), and verify-pr.ts boots Storybook against that sandbox instead of the internal-ui dev server.
- Allowlisted templates: react-vite, react-webpack, vue3-vite, svelte-vite, angular-cli, nextjs, nextjs-vite (all default-ts).
- Skips the global `compile` target when sandbox-bound; `:sandbox` handles all transitive deps via the NX project graph.

## Layer-1 security: secret stripping
- pull_request_target runs build / sandbox / recipe code from the untrusted PR head as the runner user that holds GITHUB_TOKEN (contents:write, pull-requests:write) and ANTHROPIC_API_KEY.
- The Verify-PR, Retry-on-regression, and Run-PR-added-unit-tests steps `unset GITHUB_TOKEN GH_TOKEN ANTHROPIC_API_KEY` before invoking any PR-head script. Trusted scripts above (verify-pr-generate, verify-pr-author) still see the keys because env -u (or env --unset on the inner command) only strips for the single command.

## Layer-2 security: @anthropic-ai/sandbox-runtime jail
- Wraps `yarn verify-pr` (initial attempt + retry) in srt with a bubblewrap-backed FS + network jail. Defence-in-depth on top of Layer-1.
- network.allowLocalBinding: true (Storybook dev server on localhost:6006); network.allowedDomains: [] (no public-internet egress).
- filesystem.allowWrite: $RUNNER_TEMP, /tmp, $HOME/.cache, $HOME/.local/share, $HOME/.storybook.
- filesystem.denyRead: $HOME/{.ssh, .aws, .docker, .npmrc, .gitconfig, .config/gh} (belt-and-suspenders alongside the env stripping).
- CLAUDE_CODE_TMPDIR=$RUNNER_TEMP/sandbox-tmp so the sandbox's TMPDIR bind source exists on the host.

## Recipe-author quality
- Deterministic story-route derivation: scripts/verify/derive-story-routes.ts parses code/.storybook/main.ts via TS AST + inlines Storybook's auto-title / toId / storyNameFromExport algorithms. Routes injected into the prompt bundle verbatim; agents stop guessing 404 paths.
- Full source of touched non-stories files in the prompt bundle (capped 250 lines per file, 4 files per PR). Agents see actual component props / ariaLabels / data-attrs upfront.
- Iframe a11y snapshot fixture in _util.ts: on test failure, writes the preview-iframe's body.ariaSnapshot() to iframe-snapshot.md. Retry step appends this alongside the manager page-snapshot.
- Authoring guide §8.1 expanded with evidence requirement + four-step evidence gate + worked examples (focus-ring, Save-from-Controls icon swap, sr-only label gating).
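The Layer-1 secret stripping described in this commit message is done in shell (`unset` / `env -u`). The same idea in TypeScript, for a harness that spawns PR-head commands itself, might look like the sketch below. The `stripSecrets` helper is hypothetical; the three variable names are the ones listed above.

```typescript
// Variables that must never reach untrusted PR-head code.
const STRIPPED_KEYS = ['GITHUB_TOKEN', 'GH_TOKEN', 'ANTHROPIC_API_KEY'] as const;

type Env = Record<string, string | undefined>;

// Return a copy of the environment without the secret-bearing variables.
// The input object is left untouched, so trusted steps keep their keys.
function stripSecrets(env: Env): Env {
  const clean: Env = { ...env };
  for (const key of STRIPPED_KEYS) {
    delete clean[key];
  }
  return clean;
}
```

A caller would pass `stripSecrets(process.env)` as the `env` option when spawning a PR-head script, which matches the per-command scope of `env -u`: only that one child process loses the keys.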
## Compile-failure surfacing
- When `nx compile` fails before Playwright runs, the workflow writes a stub verify-result.json with verdict=regression, regressionReason="compile failure", regressionDetails=tail -c 4000 of the log (ANSI-stripped). PR comment renders the build error in-line so reviewers see WHY without downloading artifacts.

## UX polish
- Vision reasoning collapsed inside a <details> block (verdict stays one-glance, reasoning one click away).
- PR comment unitTests block renders ✓/✗ alongside Playwright + vision so reviewers see all three signals together.
- Artifact zip staged under non-dot dirs so reviewers can browse it without toggling Finder's hidden-file display.
- Replay link points at the run-summary page (where the Artifacts section lives) instead of the 404-emitting /artifacts path.

## Telemetry
- New "Append telemetry" workflow step writes one CSV row per run to telemetry.csv on the _verify-screenshots side branch. Columns: run_id, pr_number, verdict, target, evidence_verdict, evidence_retry, unit_tests_ran, unit_tests_passed, duration_ms, timestamp. After 10-20 PRs the data drives v8 prioritisation (in-app role discovery, 2-retry budget, cross-package story heuristic, etc.).

## Validation
Firetest PRs (fork-side):
- #12 internal-ui smoke: verified
- #13 Save-from-Controls icon swap: verified + evidence found
- #14 ObjectControl raw JSON sr-only label: verified after retry
- #15 ArgsTable dark-mode border: regression (genuine compile-fail)
- #16 sidebar focus ring: verified, three signals positive
- #17 Vue3 page-style scoping (sandbox target): verified + found
- #18 Svelte docgen refactor (sandbox target): verified
- #21 Angular stats.json (sandbox target): verified

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
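The compile-failure stub from this commit message could be produced along these lines. This is a sketch under assumptions: the field names follow the message, but the ANSI regex, the order of stripping before tailing, and counting characters instead of bytes are illustrative choices.

```typescript
// Matches common ANSI CSI escape sequences such as colour codes.
const ANSI_PATTERN = /\u001b\[[0-9;]*[A-Za-z]/g;

interface CompileFailureStub {
  verdict: 'regression';
  regressionReason: string;
  regressionDetails: string;
}

function stripAnsi(text: string): string {
  return text.replace(ANSI_PATTERN, '');
}

// Build the stub result: strip colour codes, then keep the tail of the log,
// since the actual build error is usually near the end.
function buildCompileFailureStub(log: string, tailChars: number = 4000): CompileFailureStub {
  const clean = stripAnsi(log);
  return {
    verdict: 'regression',
    regressionReason: 'compile failure',
    regressionDetails: clean.slice(-tailChars),
  };
}
```

Stripping before tailing means the whole 4000-character budget goes to readable text rather than to escape sequences.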
d57333c to 046eda5
…posites
Adds an LLM-authored, single-round PR verification harness that runs
under `pull_request_target` when a maintainer applies `ci:verify` to a
non-draft PR. Authors a Playwright spec from the PR diff, executes it
against either the monorepo's internal Storybook UI or a sandbox
template, optionally validates evidence via Claude Haiku vision, runs
PR-added unit tests, and posts a verdict comment with screenshots.
Infrastructure is factored into reusable composite actions so future
agentic workflows can reuse the trust-boundary plumbing:
  .github/actions/agentic-pr-prepare/: actor gate, base + PR-head
    manual clones, sandbox-runtime (srt) install + sha-pin, srt
    settings, trusted-harness sync
  .github/actions/agentic-pr-publish/: verdict read, side-branch
    screenshot push, telemetry append, artifact upload
Trust-boundary hardenings (per dual-LLM security review):
- C1 HMAC-bound verdict: scripts/verify-pr.ts signs the trust-critical
subset of verify-result.json with VERIFY_PROVENANCE_SECRET; trusted
derive-verdict.ts downgrades 'verified' → 'regression' on signature
mismatch (closes in-srt forgery vector).
- H1: srt-sha256 has no composite default β caller must pass inline.
- H2: sync-files/sync-trees inputs reject `..` / leading `/` / extra `:`;
realpath asserts under $PR_HEAD_DIR; symlink-refuse before cp.
- H3: srt-settings.json arrays emitted via jq -R | jq -s.
- H4: screenshot URLs exposed as composite output FILE PATH (caller
fs.readFileSync), closing heredoc-terminator injection.
- M1: every publish sub-step that needs prior-step-failure tolerance
carries explicit `if: always()` (composite-level if: doesn't cascade).
- M2: VERIFY_PROVENANCE_SECRET written to file (mode 0600), not
$GITHUB_ENV.
- M3: tokens passed via env mapping only, never literal interpolation.
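
The C1 hardening (HMAC-bound verdict) can be illustrated with Node's built-in crypto module. This is only a sketch: which fields make up the "trust-critical subset" is not stated here, so `prNumber` and `headSha` are hypothetical, as are the function names.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hypothetical result shape; the real trust-critical subset may differ.
interface VerifyResult {
  verdict: string;
  prNumber: number;
  headSha: string;
  signature?: string;
}

// Serialise only the signed fields, in a fixed key order.
function canonicalPayload(r: VerifyResult): string {
  return JSON.stringify({ verdict: r.verdict, prNumber: r.prNumber, headSha: r.headSha });
}

// Producer side: sign with the provenance secret.
function sign(r: VerifyResult, secret: string): string {
  return createHmac('sha256', secret).update(canonicalPayload(r)).digest('hex');
}

// Trusted side: recompute the HMAC and downgrade 'verified' on mismatch.
function deriveVerdict(r: VerifyResult, secret: string): string {
  const expected = Buffer.from(sign(r, secret), 'hex');
  const actual = Buffer.from(r.signature ?? '', 'hex');
  const valid = actual.length === expected.length && timingSafeEqual(actual, expected);
  if (!valid && r.verdict === 'verified') return 'regression';
  return r.verdict;
}
```

The length check comes first because `timingSafeEqual` throws when the two buffers differ in length. Only a 'verified' claim is downgraded, which matches the forgery vector named above: code running inside the jail gains nothing by forging a worse verdict.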
Layer-2 isolation: every PR-controlled step (yarn install, nx compile,
nx run <tpl>:sandbox, Playwright recipe, PR-added unit tests) wraps
under @anthropic-ai/sandbox-runtime (bubblewrap mount/network namespaces).
Layer-1 controls (deny-regex, ESLint policy, enableScripts:false,
committed lockfile, scoped API keys) remain in place.
Trusted scripts (verify-pr-generate / verify-pr-author / recipe-author-core
/ recipe-deny / lint-invocation / authoring guide / canonical smoke) live
in the base checkout, so a malicious PR cannot weaken the gate.
Helper scripts under scripts/verify/ci/:
- derive-verdict.ts: reads verify-result.json + playwright report,
  validates HMAC, downgrades on mismatch.
- push-screenshots.ts: clones _agentic-pr-assets side branch, validates
  PNG mime + per-file (5MB) / bundle (50MB) caps, commits, pushes,
  emits raw.githubusercontent.com URLs.
- append-telemetry.ts: POSTs to Google Apps Script Sheet (no-op when
  webhook secrets unset).
- render-pr-comment.ts: renders verdict comment body, redacts
  token-shaped substrings, supports unit-test merge.
- write-compile-failure-stub.ts: emits signed regression stub when
  compile aborts before orchestrator runs.
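
As an illustration of the screenshot checks in push-screenshots.ts, the sketch below validates the PNG signature and applies the per-file and bundle caps. It assumes that "PNG mime" validation means checking the 8-byte PNG signature, that MB means 1024 * 1024 bytes, and that oversized or non-PNG files are skipped rather than failing the run. The `Screenshot` shape and function names are hypothetical.

```typescript
// The fixed 8-byte signature that starts every PNG file.
const PNG_MAGIC = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
const MAX_FILE_BYTES = 5 * 1024 * 1024; // per-file cap
const MAX_BUNDLE_BYTES = 50 * 1024 * 1024; // cap across all accepted files

interface Screenshot {
  name: string;
  bytes: Uint8Array;
}

function isPng(bytes: Uint8Array): boolean {
  return bytes.length >= PNG_MAGIC.length && PNG_MAGIC.every((b, i) => bytes[i] === b);
}

// Accept files in order until a check fails or the bundle budget runs out.
function selectScreenshots(files: Screenshot[]): { accepted: string[]; rejected: string[] } {
  const accepted: string[] = [];
  const rejected: string[] = [];
  let total = 0;
  for (const file of files) {
    const size = file.bytes.length;
    const ok = isPng(file.bytes) && size <= MAX_FILE_BYTES && total + size <= MAX_BUNDLE_BYTES;
    if (ok) {
      accepted.push(file.name);
      total += size;
    } else {
      rejected.push(file.name);
    }
  }
  return { accepted, rejected };
}
```

Checking content bytes rather than the file extension matters here because the screenshots come from PR-controlled recipe code and end up on a branch served from raw.githubusercontent.com.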
Documentation: scripts/verify/SECURITY.md (threat model + lethal-trifecta
breakers), scripts/verify/RUNBOOK.md (operational details),
.github/actions/agentic-pr-*/README.md (caller contract + worked example).
b5a0c1a to b675df5
Verify Harness: No verdict produced. The workflow failed before the harness ran (likely recipe-author dispatch, deny-regex, or lint). See run log for details.
ad75ba9 to 099b6f7
Cherry-pick of upstream 9f3bbc2 (storybookjs#34766) onto fork's next for v6 single-round verify harness firetest.
Reference: storybookjs#34766