try: storybookjs/storybook#34658 — sidebar focus ring for selected item#16
try: storybookjs/storybook#34658 — sidebar focus ring for selected item#16valentinpalkovic wants to merge 2 commits into
Conversation
|
Verify HarnessVerdict: Replay: Screenshots
|
Verify HarnessVerdict: Replay: |
5539a1d to
eea7e1f
Compare
Verify HarnessVerdict: Replay: Screenshots
|
…k to author Previously only evidence-missing/undetermined verdicts triggered retry. Regression verdicts (Playwright assertions failed) skipped retry entirely, even though the author could often self-correct from the error trace (wrong route, missing trigger state, stale selector). This change extends retry to cover regression by: 1. Wrapping the initial verify-pr step with `|| true` so the workflow continues on Playwright failure; the final verdict gate at the end of the job preserves red-CI signal based on post-retry verdict. 2. Evidence-check + Retry steps switch from `if: success()` to `if: always()` and gate internally on the JSON verdict. 3. Retry step extends gate to include verdict==regression. Retry-context for regression cases is built from each failed test's error-context.md (page snapshot) + first Playwright error message from playwright-report.json, capped at 8 KB per snapshot. 4. verify-pr-generate's --retry-context preamble is softened to cover both regression (selector / route correction) and evidence (trigger-state correction) paths. 5. Authoring guide §6: explicit guidance for non-stories diffs — locate the sibling *.stories.tsx and derive kind-id from its file path under the registered titlePrefix in code/.storybook/main.ts. Avoids agents guessing docs-site routes (e.g. addons-controls-basics) that don't exist in internal-ui. Empirical findings from PR #14 and #16 firetest: - #14 (Object.tsx control label diff) agent navigated to non-existent route addons-controls-object--basic. Correct route would be addons-docs-blocks-controls-object--object (or --docs autodocs page). - #16 (HighlightStyles conditional CSS) agent's toHaveCSS regex timed out because HighlightStyles only mounts on keyboard-nav highlight. Author had no way to learn this without the Playwright trace. Retry-on-regression gives the author one chance to self-correct using the page snapshot at failure point.
Verify HarnessVerdict: Replay: Screenshots
|
After the deterministic story-route util lands, agents still regressed on non-trivial DOM details (PR #14: assumed textarea[name="value"] when the story passes `args: { name: "object" }` so the rendered input id derives from "object"; PR #16: couldn't see that the TreeNode story doesn't mount the Explorer / HighlightStyles tree). The page-snapshot retry feedback captures manager DOM only — iframe story content is opaque. Fix: include the source of each resolved story file in the prompt bundle under a "Story file sources" section, capped at 160 lines per file so the prompt stays bounded. Agents now see `meta.args`, story-level `args`, `parameters`, and `tags` — enough to derive the rendered input id and the mount conditions without guessing. Same file-resolution path as the routes section reuses collectRelevantStoryFiles(): touched stories first, then sibling stories that import the changed module by basename.
Verify HarnessVerdict: Evidence (after 1 retry) (vision-check, Replay: Screenshots
|
…ignal
When a PR ships its own *.test.{ts,tsx,js,jsx} alongside the diff
(e.g. PR #16 added HighlightStyles.test.tsx that exercises the new CSS
rule directly), run them via vitest after the Playwright recipe. Result
is stored under verify-result.json's new `unitTests` field:
{ran, files, passed, summary, details}. PR comment now renders the
unit-test status alongside the Playwright verdict — reviewers see both
signals in one place.
This step doesn't override the Playwright verdict (independent signals
are most useful when shown side-by-side); a verified Playwright + failed
unit test, or the inverse, is informative on its own. Future iteration
may gate the final verdict on the AND of both signals.
Detection: greps /tmp/pr.diff for '+++ b/code/.+\.(test|spec)\.(ts|tsx|js|jsx)' file headers, only includes files that exist on disk. Skips
the step entirely (recording 'ran: false') when the diff has no test files.
Verify HarnessVerdict: Evidence (after 1 retry) (vision-check, PR-added unit tests: ✅ passed — 1 passed, 0 failed across 2 suite(s) Files: Replay: Screenshots
|
Verify HarnessVerdict: Evidence (after 1 retry) (vision-check, PR-added unit tests: ✅ passed — 1 passed, 0 failed across 2 suite(s) Files: Replay: Screenshots
|
Verify HarnessVerdict: Reason: PR-added unit tests: ✅ passed — 1 passed, 0 failed across 2 suite(s) Files: Replay: Screenshots
|
PR #16 run 25785401577 ran cleanly through the new telemetry path but every token/cost field landed at 0. Cause: recipe-author writes dispatch-response.json under \$GITHUB_WORKSPACE/.verify-output/<runId>/ (verify-pr-author runs from the trusted base checkout with default CWD), while the telemetry glob was scoped to \$PR_HEAD_DIR/.verify-output/ only — so the recipe-author dispatches were invisible to the aggregator, while evidence-check (which writes beside verify-result.json under \$PR_HEAD_DIR) would have been counted if a verified verdict had triggered it. Scan both roots. The Stage-artifacts step already separates the two trees (base/ vs runner-pr-head/), so this is a parity fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verify HarnessVerdict: Reason: PR-added unit tests: ✅ passed — 1 passed, 0 failed across 2 suite(s) Files: Replay: Screenshots
|
PR #16 run 25785774485 hit a runtime jq error during cost aggregation: jq: error (at .verify-output/<id>/dispatch-response.json:16): number (182708.75) and object ({"input_tok...) cannot be divided The exact same script + exact same artifact files ran cleanly on jq 1.8.1 locally and produced \$0.340414. The CI runner ships jq 1.6.x (Ubuntu 22.04 default), which evaluates the chained \`//\` alternative operator + arithmetic differently when an intermediate field is absent or non-numeric. Two fixes layered together: 1. Coerce every SDK field via a new \`num(x)\` helper that returns 0 for any non-number input (null / object / string), instead of relying on \`(.x // 0)\` defaulting which can leak the original non-numeric value when the field exists but is the wrong type. 2. Replace the \`(a // null) // (b // 0)\` chain for the 5m cache write tokens with an explicit \`if a > 0 then a else b end\` ladder. Same output as before on the real PR #16 artifacts (\$0.340414, 45803 input / 1996 output / 9111 cache_read / 9111 cache_write). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verify HarnessVerdict: Reason: PR-added unit tests: ✅ passed — 1 passed, 0 failed across 2 suite(s) Files: Replay: Screenshots
|
Verify HarnessVerdict: Reason: PR-added unit tests: ✅ passed — 1 passed, 0 failed across 2 suite(s) Files: Replay: Screenshots
|
…x-target CI, Layer-1/Layer-2 security, retry on regression, telemetry Squash of fork-side iteration on top of the single-round v6 pivot. Major changes since 00aa5c4: ## Verdict layering - Three orthogonal signals: Playwright (recipe execution) + vision evidence-check (claude-haiku-4-5 reading the diff + spec + screenshots) + PR-added unit tests (vitest on *.test.* files from the PR diff). - Final verdict gates on AND of Playwright + unit tests. Vision is informational (catches sr-only / invisible changes where assertions pass but screenshots can't confirm). - regressionReason is derived from playwright-report.json when the recipe author doesn't populate one — reviewers see the failing test title + first error inline. ## Retry loop - Retry-on-regression: feeds Playwright error context (page snapshot + iframe a11y snapshot + first error from playwright-report.json) back to the recipe author as --retry-context. Author re-emits the spec, Playwright re-runs. Single retry; final verdict gates label. - Retry-on-evidence-undetermined: feeds vision reasoning back so the author can target the diff more precisely (e.g., tighter screenshot region). ## Sandbox-target CI path - Recipes can set `// @verify-target: sandbox:<template>` (e.g., `sandbox:vue3-vite/default-ts`). The workflow detects the header, runs `nx run <template>:sandbox` (NX resolves implicitDependencies, emits the sandbox at code/sandbox/<key>), and verify-pr.ts boots Storybook against that sandbox instead of the internal-ui dev server. - Allowlisted templates: react-vite, react-webpack, vue3-vite, svelte-vite, angular-cli, nextjs, nextjs-vite (all default-ts). - Skips the global `compile` target when sandbox-bound — `:sandbox` handles all transitive deps via the NX project graph. ## Layer-1 security: secret stripping - pull_request_target runs build / sandbox / recipe code from the untrusted PR head as the runner user that holds GITHUB_TOKEN (contents:write, pull-requests:write) and ANTHROPIC_API_KEY. - The Verify-PR, Retry-on-regression, and Run-PR-added-unit-tests steps `unset GITHUB_TOKEN GH_TOKEN ANTHROPIC_API_KEY` before invoking any PR-head script. Trusted scripts above (verify-pr-generate, verify-pr-author) still see the keys because env -u (or env --unset on the inner command) only strips for the single command. ## Layer-2 security: @anthropic-ai/sandbox-runtime jail - Wraps `yarn verify-pr` (initial attempt + retry) in srt with a bubblewrap-backed FS + network jail. Defence-in-depth on top of Layer-1. - network.allowLocalBinding: true (Storybook dev server on localhost:6006); network.allowedDomains: [] (no public-internet egress). - filesystem.allowWrite: $RUNNER_TEMP, /tmp, $HOME/.cache, $HOME/.local/share, $HOME/.storybook. - filesystem.denyRead: $HOME/{.ssh, .aws, .docker, .npmrc, .gitconfig, .config/gh} (belt-and-suspenders alongside the env stripping). - CLAUDE_CODE_TMPDIR=$RUNNER_TEMP/sandbox-tmp so the sandbox's TMPDIR bind source exists on the host. ## Recipe-author quality - Deterministic story-route derivation: scripts/verify/derive-story- routes.ts parses code/.storybook/main.ts via TS AST + inlines Storybook's auto-title / toId / storyNameFromExport algorithms. Routes injected into the prompt bundle verbatim — agents stop guessing 404 paths. - Full source of touched non-stories files in the prompt bundle (capped 250 lines per file, 4 files per PR). Agents see actual component props / ariaLabels / data-attrs upfront. - Iframe a11y snapshot fixture in _util.ts: on test failure, writes the preview-iframe's body.ariaSnapshot() to iframe-snapshot.md. Retry step appends this alongside the manager page-snapshot. - Authoring guide §8.1 expanded with evidence requirement + four-step evidence gate + worked examples (focus-ring, Save-from-Controls icon swap, sr-only label gating). ## Compile-failure surfacing - When `nx compile` fails before Playwright runs, the workflow writes a stub verify-result.json with verdict=regression, regressionReason= "compile failure", regressionDetails=tail -c 4000 of the log (ANSI-stripped). PR comment renders the build error in-line so reviewers see WHY without downloading artifacts. ## UX polish - Vision reasoning collapsed inside a <details> block (verdict stays one-glance, reasoning one click away). - PR comment unitTests block renders ✅/❌ alongside Playwright + vision so reviewers see all three signals together. - Artifact zip staged under non-dot dirs so reviewers can browse it without toggling Finder's hidden-file display. - Replay link points at the run-summary page (where the Artifacts section lives) instead of the 404-emitting /artifacts path. ## Telemetry - New "Append telemetry" workflow step writes one CSV row per run to telemetry.csv on the _verify-screenshots side branch. Columns: run_id, pr_number, verdict, target, evidence_verdict, evidence_retry, unit_tests_ran, unit_tests_passed, duration_ms, timestamp. After 10–20 PRs the data drives v8 prioritisation (in-app role discovery, 2-retry budget, cross-package story heuristic, etc.). ## Validation Firetest PRs (fork-side): - #12 internal-ui smoke — verified - #13 Save-from-Controls icon swap — verified + evidence found - #14 ObjectControl raw JSON sr-only label — verified after retry - #15 ArgsTable dark-mode border — regression (genuine compile-fail) - #16 sidebar focus ring — verified, three signals positive - #17 Vue3 page-style scoping (sandbox target) — verified + found - #18 Svelte docgen refactor (sandbox target) — verified - #21 Angular stats.json (sandbox target) — verified Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d57333c to
046eda5
Compare
…posites
Adds an LLM-authored, single-round PR verification harness that runs
under `pull_request_target` when a maintainer applies `ci:verify` to a
non-draft PR. Authors a Playwright spec from the PR diff, executes it
against either the monorepo's internal Storybook UI or a sandbox
template, optionally validates evidence via Claude Haiku vision, runs
PR-added unit tests, and posts a verdict comment with screenshots.
Infrastructure is factored into reusable composite actions so future
agentic workflows can reuse the trust-boundary plumbing:
.github/actions/agentic-pr-prepare/ — actor gate, base + PR-head
manual clones, sandbox-runtime
(srt) install + sha-pin, srt
settings, trusted-harness sync
.github/actions/agentic-pr-publish/ — verdict read, side-branch
screenshot push, telemetry
append, artifact upload
Trust-boundary hardenings (per dual-LLM security review):
- C1 HMAC-bound verdict: scripts/verify-pr.ts signs the trust-critical
subset of verify-result.json with VERIFY_PROVENANCE_SECRET; trusted
derive-verdict.ts downgrades 'verified' → 'regression' on signature
mismatch (closes in-srt forgery vector).
- H1: srt-sha256 has no composite default — caller must pass inline.
- H2: sync-files/sync-trees inputs reject `..` / leading `/` / extra `:`;
realpath asserts under $PR_HEAD_DIR; symlink-refuse before cp.
- H3: srt-settings.json arrays emitted via jq -R | jq -s.
- H4: screenshot URLs exposed as composite output FILE PATH (caller
fs.readFileSync), closing heredoc-terminator injection.
- M1: every publish sub-step that needs prior-step-failure tolerance
carries explicit `if: always()` (composite-level if: doesn't cascade).
- M2: VERIFY_PROVENANCE_SECRET written to file (mode 0600), not
$GITHUB_ENV.
- M3: tokens passed via env mapping only, never literal interpolation.
Layer-2 isolation: every PR-controlled step (yarn install, nx compile,
nx run <tpl>:sandbox, Playwright recipe, PR-added unit tests) wraps
under @anthropic-ai/sandbox-runtime (bubblewrap mount/network namespaces).
Layer-1 controls (deny-regex, ESLint policy, enableScripts:false,
committed lockfile, scoped API keys) remain in place.
Trusted scripts (verify-pr-generate / verify-pr-author / recipe-author-core
/ recipe-deny / lint-invocation / authoring guide / canonical smoke) live
in the base checkout, so a malicious PR cannot weaken the gate.
Helper scripts under scripts/verify/ci/:
- derive-verdict.ts — reads verify-result.json + playwright report,
validates HMAC, downgrades on mismatch.
- push-screenshots.ts — clones _agentic-pr-assets side branch, validates
PNG mime + per-file (5MB) / bundle (50MB) caps,
commits, pushes, emits raw.githubusercontent.com
URLs.
- append-telemetry.ts — POSTs to Google Apps Script Sheet (no-op when
webhook secrets unset).
- render-pr-comment.ts — renders verdict comment body, redacts token-
shaped substrings, supports unit-test merge.
- write-compile-failure-stub.ts — emits signed regression stub when
compile aborts before orchestrator runs.
Documentation: scripts/verify/SECURITY.md (threat model + lethal-trifecta
breakers), scripts/verify/RUNBOOK.md (operational details),
.github/actions/agentic-pr-*/README.md (caller contract + worked example).
eea7e1f to
8404db8
Compare
Verify HarnessVerdict: PR-added unit tests: ❌ failed — vitest exited 1 without writing a JSON report (likely setup error); see Action log Files: vitest output (last 4KB)Replay: Screenshots
|
8e0a05c to
80ccd7d
Compare


























Cherry-pick of upstream 51ddf4c (storybookjs#34658) onto fork's next for v6 single-round verify harness firetest.
Reference: storybookjs#34658