[firetest] Refactor svelte docgen source file cache handling#18

Open
valentinpalkovic wants to merge 68 commits into next from try-pr-34644

@valentinpalkovic
Owner

Cross-renderer sweep firetest for upstream PR storybookjs#34644. Validates Layer-2 sandbox-runtime + recipe authoring against the target framework template. Do not merge.

valentinpalkovic and others added 30 commits May 8, 2026 23:36
Single-template (react-vite/default-ts), single-story
(example-button--primary) PR verification entry script with 6 helpers
under scripts/verify/.

Flow: compile core -> symlink code/core/dist into NX-cached sandbox ->
boot Storybook on :6006 -> Playwright capture via SbPage from
code/e2e-tests/util.ts -> emit verify-result.json + iframe-clipped
screenshot under .verify-output/<runId>/.

Helpers:
- core.ts: types, run-path math, computeVerdict, pruneOldRuns(10)
- symlink.ts: lifted EPERM/EEXIST cp fallback from
  scripts/tasks/sandbox-parts.ts:43-79 + net-new dangling-symlink heal
- sandbox.ts: multi-base resolveSandboxDir (code/sandbox, sandbox,
  ../storybook-sandboxes, STORYBOOK_SANDBOX_ROOT override),
  snapshot/restore, sanitizeResolutions
- sync.ts: yarn nx compile core (run from repoRoot) + symlink dist
- boot.ts: cross-platform port preflight, idempotent SIGINT/SIGTERM
  handlers, dual wait-on iframe.html + index.html (uses
  node:child_process.spawn per repo lint policy)
- capture.ts: page.on('pageerror'/'console') registered before goto,
  iframe-clipped screenshot
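
The retention behaviour behind `pruneOldRuns(10)` is not shown on this page. A minimal sketch of the selection step, assuming run IDs are timestamp-like strings that sort chronologically (as the `.verify-output/<runId>/` layout suggests); `selectRunsToPrune` is a hypothetical name, not the helper in `core.ts`:

```typescript
// Hypothetical helper: returns the run IDs that fall outside the newest
// `keep` runs. Assumes lexicographic order equals chronological order.
function selectRunsToPrune(runIds: string[], keep = 10): string[] {
  const newestFirst = [...runIds].sort((a, b) => (a < b ? 1 : a > b ? -1 : 0));
  return newestFirst.slice(keep);
}
```

The caller would then delete each returned directory; keeping the selection pure makes the cutoff easy to test without touching the filesystem.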

Run via `yarn verify-pr` (uses bun for native TS exec — node
strip-types rejects transitive enums in cli/projectTypes.ts).

Verification:
- V-1 sanity: verdict=verified, ~8s wall-time (well under 90s SLO)
- V-2 regression: VERIFY_HARNESS_TEST sentinel detected at compile,
  exit 1

Plan: .omc/plans/pr-verify-poc-mvp.md
Research: .omc/research/research-20260508-prverify/report.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ning

Pivot from custom Chromium launch (capture.ts) to spawning `bun x playwright test`
against committed specs under `.verify-recipes/`. Trace artifacts are produced by
Playwright's built-in tracing API and replayable via `npx playwright show-trace`.
Schema bumped to v2 with per-test results, attached pageErrors/consoleErrors, and
trace paths sourced from the Playwright JSON report contract.

Adds Phase-1 security hardening: `.claude/settings.json` deny rules (local),
`.dockerignore` for credential exclusion, `SECURITY.md` with phase-gated threat
model and isolation matrix, and a gated `.github/workflows/verify-pr.yml`
(if: false) scaffolding the Phase-2 container/proxy shape.

Recipe-local `RecipePage` (`.verify-recipes/_util.ts`) reimplements only the
subset of `SbPage` needed for verify recipes — Playwright's Node worker
processes cannot strip the non-erasable TS enums reached transitively from
`code/e2e-tests/util.ts`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Allow overriding the Storybook port (default 6006) so the harness can run
alongside side-processes that already occupy 6006. baseURL, preflightPort,
bootStorybook, and the --resync alive-check are all threaded through the
resolved port. Validates that the value parses as an integer in 1..65535.
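
A minimal sketch of the port validation described above. `resolvePort` is a hypothetical name, and how the raw value reaches it (flag or environment variable) is not specified here:

```typescript
// Hypothetical helper: accepts only base-10 integers in 1..65535 and falls
// back to the default Storybook port (6006) when no value is supplied.
function resolvePort(raw: string | undefined, fallback = 6006): number {
  if (raw === undefined || raw === "") return fallback;
  if (!/^\d+$/.test(raw)) {
    throw new Error(`Invalid port "${raw}": expected an integer in 1..65535`);
  }
  const port = Number(raw);
  if (port < 1 || port > 65535) {
    throw new Error(`Invalid port "${raw}": expected an integer in 1..65535`);
  }
  return port;
}
```

Resolving the port once and passing the number to every consumer is what keeps baseURL, the preflight check, and the boot step in agreement.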

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a two-step flow on top of the v2 raw-Playwright runner:

  yarn verify-pr-generate --pr <#>      # emit prompt bundle
  Skill: verify-recipe-author           # dispatch executor, write spec
  yarn verify-pr --recipe-spec ...      # run committed spec

The generator script does deterministic I/O only — gh pr fetch, triage
routing (19 path globs in scripts/verify/recipes/triage-table.ts mapping
addon/manager/csf-tools/builder/framework/renderer changes to reference
specs under code/e2e-tests/), per-file 500-line cap with 20-file total cap
sorted triage-matched first, and prompt-bundle emission. The script never
dispatches an agent and never writes the final spec.
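
The two caps and the ordering rule can be sketched as a pure function. The `ChangedFile` shape and the `capFilesForBundle` name are assumptions for illustration, not the generator's actual types:

```typescript
// Hypothetical input shape: one entry per changed file in the PR diff.
interface ChangedFile {
  path: string;
  lines: string[];
  triageMatched: boolean; // true when a triage glob matched this path
}

const MAX_FILES = 20;
const MAX_LINES_PER_FILE = 500;

function capFilesForBundle(files: ChangedFile[]): ChangedFile[] {
  // Array.prototype.sort is stable, so files keep their original relative
  // order within the matched and unmatched groups.
  const ordered = [...files].sort(
    (a, b) => Number(b.triageMatched) - Number(a.triageMatched)
  );
  return ordered.slice(0, MAX_FILES).map((file) => ({
    ...file,
    lines: file.lines.slice(0, MAX_LINES_PER_FILE),
  }));
}
```

Sorting before truncating is the point of the rule: a triage-matched file late in the diff still makes it into the bundle.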

The verify-recipe-author skill (.agents/skills/verify-recipe-author/SKILL.md
with redirect at .claude/skills/...) consumes the bundle, dispatches the
oh-my-claudecode:executor subagent (model=opus), runs a security deny-regex
guard (recipe-deny.ts: child_process, fs.unlink/rm, process.exit, eval,
node: imports), prepends a header-comment provenance block to the agent
output, writes .verify-recipes/pr-<#>.spec.ts, lints via
yarn --cwd code lint:js:cmd with one categorized retry (recipe-retry-policy.ts:
maxAttempts=2, errorCategories=[listener-before-goto, attach-pattern,
imports]), runs post-write regex checks for the listener-before-goto and
testInfo.attach pattern invariants, and emits result.json.
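
A rough sketch of what a deny-regex guard over the categories listed above could look like. The patterns below are illustrative guesses; the actual rules in `recipe-deny.ts` are not shown on this page:

```typescript
// Illustrative patterns only, one per category named in the commit message.
const DENY_PATTERNS: Array<{ name: string; pattern: RegExp }> = [
  { name: "child_process", pattern: /\bchild_process\b/ },
  { name: "fs.unlink/rm", pattern: /\bfs\.(unlink|rm)\w*\s*\(/ },
  { name: "process.exit", pattern: /\bprocess\.exit\s*\(/ },
  // The lookbehind avoids flagging Playwright's page.$eval( helper.
  { name: "eval", pattern: /(?<![\w$.])eval\s*\(/ },
  { name: "node: import", pattern: /from\s+['"]node:|require\(\s*['"]node:/ },
];

// Returns the names of every denied category found in the generated spec.
function findDeniedPatterns(source: string): string[] {
  return DENY_PATTERNS.filter(({ pattern }) => pattern.test(source)).map(
    ({ name }) => name
  );
}
```

A non-empty result would abort the write before the spec reaches disk. A text-level guard like this is a tripwire, not a sandbox, which is why the human-review gate stays in place.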

Spec-name collision = fail unless --force; the human-review gate from
v2's SECURITY.md is preserved (the skill never executes its output).

The authoring guide at .verify-recipes/_recipe-authoring-guide.md is the
agent's contract: import surface, listener-before-goto rule, attach
pattern, RecipePage API, what to avoid, story URL routing, and per-change
assertion shapes.
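
The listener-before-goto rule lends itself to a simple text-order check. This is a sketch under the assumption that the invariant means "a `pageerror` and a `console` listener are registered before the first `page.goto(`"; `listenersRegisteredBeforeGoto` is a hypothetical name:

```typescript
// Hypothetical post-write check: both listeners must appear in the source
// text before the first page.goto( call.
function listenersRegisteredBeforeGoto(source: string): boolean {
  const firstGoto = source.search(/\bpage\.goto\s*\(/);
  if (firstGoto === -1) return true; // nothing navigates, nothing to enforce
  const beforeGoto = source.slice(0, firstGoto);
  const hasPageError = /\bpage\.on\(\s*['"]pageerror['"]/.test(beforeGoto);
  const hasConsole = /\bpage\.on\(\s*['"]console['"]/.test(beforeGoto);
  return hasPageError && hasConsole;
}
```

The ordering matters because errors thrown during the initial load are lost if the listeners attach after navigation.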

Verification: structural ACs (V3-6, V3-7, V3-9, V3-10) pass via grep
against the new files; AC-V3-1 (generator exit 0 + bundle written +
next-step printed) and AC-V3-5 (committed spec runs end-to-end via
verify-pr, schema v2 verdict emitted with trace.zip and per-test
attachments) ran clean against PR storybookjs#34737 (manager-api/modules/stories.ts);
AC-V3-3/V3-4 (listener-before-goto + attach-pattern regex) and AC-V3-8
(deny-regex aborts on child_process) verified directly.

Phase-1 security model unchanged: spec-review gate is the lethal-trifecta
breaker; the bun script + skill make that gate easy to apply, but never
substitute for it. Phase-2 CI activation will require migration to a
direct Anthropic SDK call with API-key handling — tracked in the
SECURITY.md / README roadmap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a headless authoring path (yarn verify-pr-author) that consumes the
v3 prompt bundle and dispatches Claude directly via @anthropic-ai/sdk
(single-block prompt caching on guide + canonical smoke). Skill and CI
script share scripts/verify/recipe-author-core.ts so they cannot drift.

Three lanes:
- Lane A — scripts/verify/agent-dispatch.ts (SDK + MODEL_ID_MAP +
  transport retry + stub mode + DEBUG redaction), recipe-author-core.ts
  (TOCTOU -> dispatch -> deny-regex -> D8 header -> lint -> regex ->
  categorize + retry), verify-pr-author.ts CLI with --dispatch-mode
  {sdk|stdin} and --retry-of (D4-α EX_TEMPFAIL=75 sentinel),
  recipe-retry-policy.ts extension (categorizeEslintViolations +
  formatRetryMessage), three stub fixtures, @anthropic-ai/sdk 0.65.0
  exact-pinned.
- Lane B — .github/workflows/verify-pr.yml flipped from if:false to
  label-gated (ci:verify) + !draft + actor-permission-action; Generate
  bundle + Author recipe steps added on bare runner with
  ANTHROPIC_API_KEY scoped to Author recipe env only; spec-runner
  container keeps --network=none and never sees the key; proxy.sock
  mount removed (Envoy deferred to v5). SECURITY.md Phase-2 section +
  README two-paths section.
- Lane C — scripts/verify/lint-invocation.ts wrapper (eslint via
  require.resolve('eslint/package.json') + bin/eslint.js, --no-eslintrc
  --no-ignore --resolve-plugins-relative-to repo-root); D3-E dedicated
  recipe eslintrc (parserOptions.project:false, non-typed recommended,
  argsIgnorePattern:'none'); SKILL.md Step 8 rewritten for the D4-α
  retry contract.

Verification (10 acceptance criteria):
- AC-V4-2/4/5/6/7a/7b/8/9/10 PASS end-to-end against the existing v3
  bundle. AC-V4-1 and AC-V4-3b gated on a live ANTHROPIC_API_KEY (CI
  verification mandatory; local optional). AC-V4-3a passes 9/9
  buildAnthropicRequest shape checks.
- AC-V4-7a SHA-256 parity: stdin + sdk paths produce byte-identical
  specs (D8 header generatedAt pinned to bundle.metadata.generatedAt).
- AC-V4-9 redaction: dispatch-request.json contains no x-api-key /
  authorization / sk-ant- substrings.
- AC-V4-10 retry: stdin attempt 1 exits 75 with framed retry block +
  result.partial.json; stdin --retry-of <runId> attempt 2 exits 0 with
  attempts=2.
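
The redaction criterion (AC-V4-9) can be illustrated with a small serializer. This is a sketch, not the code in `agent-dispatch.ts`; the request shape and the `redactRequestForDebug` name are assumptions:

```typescript
const SENSITIVE_HEADERS = new Set(["x-api-key", "authorization"]);

// Hypothetical helper: serializes a request for a debug dump while dropping
// credential headers and masking anything that looks like an Anthropic key.
function redactRequestForDebug(request: {
  headers: Record<string, string>;
  body: unknown;
}): string {
  const headers: Record<string, string> = {};
  for (const [name, value] of Object.entries(request.headers)) {
    if (!SENSITIVE_HEADERS.has(name.toLowerCase())) headers[name] = value;
  }
  // Second pass: mask key-shaped strings that slipped into other fields.
  return JSON.stringify({ headers, body: request.body }, null, 2).replace(
    /sk-ant-[A-Za-z0-9_-]+/g,
    "[REDACTED]"
  );
}
```

Dropping the header entirely, rather than blanking its value, is what lets a substring check for `x-api-key` and `authorization` pass.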

scripts/verify-pr.ts (runner) untouched (frozen this increment). Envoy
credential-injector and author_association gating deferred to v5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve placeholder SHAs for the four third-party actions in
.github/workflows/verify-pr.yml to commit SHAs of their latest stable
releases. Required activation gate before the harness can fire in CI.

- prince-chrismc/check-actor-permissions-action: v3.0.2
- actions/checkout: v6.0.2
- actions/upload-artifact: v7.0.1
- actions/github-script: v9.0.0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous map pointed every entry at claude-opus-4-5-20250929, which
returns 404 from the Anthropic API. Update to current public IDs:

- claude-opus-4-7[1m] / claude-opus-4-7 → claude-opus-4-7
- claude-opus-4-6 → claude-opus-4-6
- claude-opus-4-5 → claude-opus-4-5-20251101 (correct snapshot)

Update MODEL_MAX_TOKENS keys to match. Verified live AC-V4-1 (spec
written) and AC-V4-3b (cache_read_input_tokens=4358 >= 1024) against
PR storybookjs#34761.
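
For reference, the mapping listed above written out as a lookup. The entries come from this commit message; the `Map` container and the `resolveModelId` helper are assumptions about shape, not the code in `agent-dispatch.ts`:

```typescript
// Alias to public model ID, as listed in the commit message.
const MODEL_ID_MAP = new Map<string, string>([
  ["claude-opus-4-7[1m]", "claude-opus-4-7"],
  ["claude-opus-4-7", "claude-opus-4-7"],
  ["claude-opus-4-6", "claude-opus-4-6"],
  ["claude-opus-4-5", "claude-opus-4-5-20251101"],
]);

// Hypothetical resolver: fails loudly on unknown aliases instead of sending
// an ID the API would answer with a 404.
function resolveModelId(alias: string): string {
  const id = MODEL_ID_MAP.get(alias);
  if (id === undefined) {
    throw new Error(`Unknown model alias: ${alias}`);
  }
  return id;
}
```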

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two activation-blocking bugs surfaced by A4 label-fire test on
valentinpalkovic/storybook fork:

1. Generate bundle step failed with "Couldn't find the node_modules
   state file" — workflow never ran yarn install after checkout. Add
   the standard `./.github/actions/setup-node-and-install` composite
   step between Checkout and Fetch PR diff.

2. Post PR comment hard-failed with ENOENT on
   `.verify-output/latest/verify-result.json`. The harness writes
   timestamped dirs and never creates a `latest` symlink, so the path
   was wrong on every run, not just failures. Replace with a sort-
   newest-first scan of `.verify-output/*/verify-result.json` and
   degrade gracefully when no verdict exists (workflow failed before
   harness ran), so the comment always posts a useful status.
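
The newest-first scan in item 2 can be sketched as follows, assuming the run directories are named with sortable timestamps and that the glob has already been expanded to a list of paths. `pickLatestVerdictPath` is a hypothetical name:

```typescript
// Hypothetical helper: given the paths matched by
// `.verify-output/*/verify-result.json`, return the newest one, or null when
// the harness never produced a verdict.
function pickLatestVerdictPath(paths: string[]): string | null {
  const newestFirst = [...paths].sort().reverse();
  return newestFirst[0] ?? null;
}
```

A `null` result is the graceful-degradation branch: the comment step can still post a status saying no verdict was produced.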

Remaining gap: `Run harness in container` step references
`verify-harness:pinned-sha` which has no Dockerfile in repo and is
not built anywhere in the workflow. Tracked as next activation gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A4 label-fire test on fork run #25673185333 failed at Generate bundle
with "command not found: bun". The verify-pr-generate and verify-pr
yarn scripts (in package.json:40,42) invoke bun directly. The composite
setup-node-and-install action provisions Node/Yarn but not Bun, so add
oven-sh/setup-bun pinned to v2.2.0 between Node setup and Fetch PR
diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Upload-artifact v7 emitted "No files were found with the provided path:
.verify-output/*/" on A4 run #25673778823 despite the dir existing. The
trailing-slash dir-glob isn't accepted as a file pattern in v7. Replace
with the directory path, which uploads the whole tree. Add explicit
`if-no-files-found: warn` so future glob drift surfaces as a warning
rather than silent zero artifacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A4 run #25674121554 confirmed .verify-output/ exists with the prompt
bundle inside, but upload-artifact v7 silently skipped it because the
default include-hidden-files: false rejects dot-prefixed paths. Set
include-hidden-files: true. Drop the temporary debug step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the v5-0 gap where the `Run harness in container` workflow step
referenced `verify-harness:pinned-sha` with no Dockerfile in repo. The
harness can now produce verdicts in CI.

Implementation follows the ralplan-approved design at
`.omc/plans/v5-0-dockerfile.md` (4 iterations to consensus APPROVE from
Architect + Critic under DELIBERATE mode).

What lands:

- `scripts/verify/Dockerfile` — multi-stage build pinned by SHA digest
  (Playwright v1.58.2-jammy base + Bun 1.3.0-slim via `COPY --from=`).
  Pre-bakes node_modules + code/core/dist + react-vite/default-ts sandbox
  so the runtime container can satisfy `--read-only` + `--network=none`.
  Corepack is bypassed — yarn invoked directly via `node $YARN_BIN`.
  Bakes `HEAD_SHA` for runtime drift detection.

- `scripts/verify/harden-build-context.sh` +
  `scripts/verify/strip-lifecycle-scripts.mjs` — supply-chain hardening
  that runs on the bare runner before `docker build`. Overlays trusted
  `.dockerignore` / `.yarnrc.yml` / `.yarn/releases/` from base-sha,
  strips lifecycle scripts from every workspace `package.json`,
  normalises `packageManager`, deletes head-supplied `.npmrc`,
  diff-asserts `Dockerfile` byte-identity. Walker is hardened with
  symlink-skip, max-depth, 1 MB file-size cap, 60s timeout, and
  prototype-chain hygiene.

- `.github/workflows/verify-pr.yml` — adds `Checkout PR head`,
  `Spec precheck`, `Harden build context`, `Build harness image`
  (with per-PR cache scope), `Smoke test image` (digest fail-closed),
  `Run harness in container` (named container), and `Mirror tmpfs
  output` (no `|| true` on the load-bearing copy).

- `.github/actions/verify-spec-precheck/action.yml` — extracted
  composite action so v5-1's first-time-use UX can swap internals
  without touching the workflow shape.

- `scripts/verify/core.ts` — adds `writeRegressionResult()` helper plus
  optional `regressionReason` / `inContainer` / `imageDigest` /
  `headSha` fields (schemaVersion unchanged).

- `scripts/verify-pr.ts` — honours `VERIFY_HARNESS_IN_CONTAINER=1` at
  every sandbox-prep call site; rejects `--resync` in-container;
  asserts `HEAD_SHA` via the new helper, warn-and-skip when
  `VERIFY_HARNESS_EXPECTED_HEAD_SHA` is unset (laptop dev mode).

- `scripts/verify/playwright.config.ts` — chromium-only projects.

- `renovate.json` — tracks Playwright + Bun digests on weekly schedule.

- `scripts/verify/SECURITY.md` § Image-build provenance — documents
  every supply-chain control plus the residual `GITHUB_TOKEN`-in-buildx
  risk and its v5-1 job-split mitigation.

- `scripts/verify/RUNBOOK.md` — diagnosis playbook for failure signals.

- `scripts/verify/__tests__/` — four integration tests covering the
  short-circuit, sandbox-root env, head-sha assertion, and hadolint.

Known residual risk (documented in SECURITY.md, deferred to v5-1):
`GITHUB_TOKEN` remains in the buildx daemon's process env on the
build step. The mitigation stack (lifecycle-script stripping,
`enableScripts: false`, `.npmrc` purge, corepack bypass, per-PR cache
scope, Dockerfile byte-identity) defends `yarn install` against
head-controlled code execution. v5-1 splits into prep + harness jobs
with `permissions: {}` on the harness job to eliminate this surface.

TODOs flagged in source:
- `timeout-minutes: 30` is a placeholder; AC-V5-0-2 cold-build
  measurement is required before final lock-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
actions/checkout's submodule-foreach cleanup pass aborts with exit 128
on this repo because of orphan gitlinks under `.external/` that have no
matching entries in any (missing) `.gitmodules` file. The base-sha
checkout escapes this because it doesn't pass `persist-credentials:
false` (the cleanup phase that runs `git submodule foreach` is gated on
needing to scrub credentials). The PR-head checkout did set the flag
for the v5-0 untrusted-context posture and hit the gitlinks/no-modules
mismatch.

Replace `actions/checkout@v6.0.2` for the PR-head step with a manual
`git clone --no-tags --no-checkout --filter=blob:none` followed by a
single-sha fetch and checkout. Strip the cached credential helper +
rewrite the remote URL to drop the token afterwards. Net posture is
equivalent to `persist-credentials: false` and `.git/` is excluded
from the docker build context by `.dockerignore`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xtraction

Build failure on fork firetest run 25684557942: playwright:v1.58.2-jammy
ships Node 24.13.0 (not 22.22.1), so the conditional Node re-install
always fires; but the base image is missing `xz-utils`, so `tar -xJf` on
the .tar.xz tarball aborts with "xz: Cannot exec: No such file or
directory". Add an apt-get install of xz-utils + ca-certificates inside
the same RUN block, gated on the same version-mismatch conditional so a
future Playwright base that already ships Node 22.22.1 skips the apt
fetch entirely (resolves the apt-vs-probe trade-off in OQ-V5-0-E).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Build failure on fork firetest PR #3 run 25685301642: playwright base
image ships `pwuser` at UID/GID 1000, so the unconditional `groupadd
--gid 1000` aborts with "GID '1000' already exists". Guard the group
and user creates with getent/id probes so the layer is idempotent
across base-image variants that may or may not ship the user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_BIN

Yarn 4 treats any YARN_<KEY> env var as a config setting, so 'YARN_BIN'
was being parsed as a 'bin' config key and rejected with 'Unrecognized or
legacy configuration settings found: bin'. Rename the variable to
HARNESS_YARN_BIN throughout the Dockerfile and matching docs.
…al target

scripts/package.json depends on eslint-plugin-local-rules via portal:
specifier. Yarn install fails ('Manifest not found') unless the portal
target directory is present in the build context. Add a COPY for the
whole scripts/eslint-plugin-local-rules/ directory in stage 2 so yarn
can resolve the portal manifest and 'yarn lint' has the rule files at
runtime.
WORKDIR /opt/verify-harness/repo runs as root, leaving the directory
root-owned even though COPY --chown=1000:1000 sets file ownership.
Yarn 4 runs as uid 1000 (USER 1000:1000) and fails the link step with
EACCES while creating node_modules/. Add an explicit chown of the
workdir before the USER switch so yarn can create node_modules and
persist the link tree.
… task

scripts/utils/cli-step.ts resolves dist/bin/index.js for both
cli-storybook and create-storybook at module-eval time. Stage 3 only
compiled 'core', so the sandbox task failed with MODULE_NOT_FOUND.
Expand the nx target list to compile core + cli-storybook +
create-storybook before the sandbox bootstrap step runs.
…torybook

The nx project name in code/lib/cli-storybook/project.json is 'cli'
(package name '@storybook/cli'), not 'cli-storybook'. The previous
nx run-many list silently dropped the unknown target, so the cli
package was never compiled and the sandbox task still failed with
MODULE_NOT_FOUND for cli-storybook/dist/bin/index.js.
… index.js

code/core/dist has no top-level index.js — the bundle is split into
per-entry-point subdirectories (preview-api/, manager-api/, etc.) plus
the bin script at dist/bin/dispatcher.js (declared in core
package.json#bin). Update the sandbox dist sanity check to verify the
dispatcher bin file instead.
…overlay

'yarn task sandbox' runs run-registry -> publish, which packs and
republishes packages through verdaccio. The publish step can churn
code/core/dist, so by the time stage 3.d tries to cp it the directory
no longer exists. Re-run 'nx compile core' immediately before the cp
to guarantee the freshly-built artifact from the PR head is in place,
and fail loudly with a directory listing if it is somehow still
missing.
…caffolding

Eleven v5-0 firetest rounds confirmed the Dockerfile architecture is
asymmetrically over-engineered: ~70% of the complexity (digest pins,
harden-build-context overlay, lifecycle-script stripping, Verdaccio publish
pipeline, BuildKit cache scope, smoke-test sentinel) addresses supply-chain
threats that `enableScripts: false` + lockfile + `.npmrc` purge already
mitigate. Runtime isolation — the threat the doc actually calls out for
CI/CD — was weakly addressed (no `--cap-drop ALL` / `--network=none` /
`--read-only` / `--tmpfs` in places it would matter). BuildKit also
proved fragile: `code/core/dist` kept disappearing between stages 6 and 7
despite multiple recompile attempts.

v6 drops the container and accepts the same isolation profile as the
existing Storybook PR CI (ephemeral GitHub Actions runner). The recipe
author step keeps `ANTHROPIC_API_KEY` scope-limited to one step; the
committed-spec runner remains the lethal-trifecta breaker.

If the threat model later expands to processing third-party PRs at scale
with adversarial recipe authors, sandbox-runtime (bubblewrap on Linux)
wraps just the playwright step in ~10 lines of config per Anthropic's
"Securely deploying AI agents" doc. Do not reintroduce the full Docker +
Verdaccio stack.

Deletes:
- scripts/verify/Dockerfile
- scripts/verify/harden-build-context.sh
- scripts/verify/strip-lifecycle-scripts.mjs
- scripts/verify/SECURITY.md
- scripts/verify/__tests__/dockerfile-lint.test.ts
- scripts/verify/__tests__/head-sha-assertion.test.ts
- scripts/verify/__tests__/in-container-shortcircuit.test.ts

Strips v5-0 additions from .dockerignore (preserves the pre-existing
sensitive-path entries) and removes the Dockerfile-pin rules from
renovate.json.

The workflow yml + verify-pr.ts still reference the container path; both
get rewritten in the subsequent v6 commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ispatch

Rewrites scripts/verify-pr.ts around two execution targets selected per
recipe via a `// @verify-target:` header comment (scanned in the first
30 lines of the spec):

  internal-ui (default)
      Builds code/storybook-static once via `yarn storybook:ui:build`,
      then serves it on the requested port via `yarn http-server`. The
      fast path for fixes that exercise the monorepo's own UI against
      the PR head's compiled packages — no sandbox bootstrap, no
      verdaccio publish, no docker.

  sandbox:<template>
      Pre-existing sandbox flow — snapshotSandbox, sanitizeResolutions,
      syncCorePackage (symlink code/core/dist into the sandbox), then
      bootStorybook. Use only when a fix is template-specific (rare).
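
A minimal sketch of the header scan described above. The type, the function name, and the choice to throw on an unrecognized value are assumptions; only the 30-line window, the `// @verify-target:` syntax, and the `internal-ui` default come from this commit message:

```typescript
type VerifyTarget =
  | { kind: "internal-ui" }
  | { kind: "sandbox"; template: string };

// Hypothetical parser: looks for `// @verify-target: <value>` in the first
// 30 lines of a spec and defaults to internal-ui when no header is found.
function parseVerifyTarget(specSource: string): VerifyTarget {
  const header = specSource.split(/\r?\n/).slice(0, 30);
  for (const line of header) {
    const match = /^\s*\/\/\s*@verify-target:\s*(\S+)\s*$/.exec(line);
    if (!match) continue;
    const value = match[1] ?? "";
    if (value === "internal-ui") return { kind: "internal-ui" };
    if (value.startsWith("sandbox:") && value.length > "sandbox:".length) {
      return { kind: "sandbox", template: value.slice("sandbox:".length) };
    }
    throw new Error(`Unrecognized @verify-target value: ${value}`);
  }
  return { kind: "internal-ui" };
}
```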

Also:

- Adds a positional <PR#> argument so `yarn verify-pr 34762` resolves to
  `.verify-recipes/pr-34762.spec.ts` automatically. The explicit
  `--recipe-spec <path>` flag still works and takes precedence.
- Drops every `VERIFY_HARNESS_IN_CONTAINER` short-circuit, the
  `/opt/verify-harness/HEAD_SHA` runtime assertion, and the
  `VERIFY_HARNESS_EXPECTED_HEAD_SHA` env-var contract. The container
  paths no longer exist.
- Drops the imageDigest / inContainer / headSha fields from
  VerifyResult writes and from writeRegressionResult's options. The
  fields stayed optional in the schema for backward-compat readers but
  are no longer populated.
- Widens VerifyResult.template from the
  `'react-vite/default-ts'` literal to `string` so the field can carry
  `'internal-ui'` and arbitrary sandbox templates.
- Switches the root `verify-pr` script from `bun scripts/verify-pr.ts`
  to `node ./scripts/verify-pr.ts`. verify-pr.ts no longer imports any
  of the non-erasable enum chain from cli-storybook, so Node 22.22.1's
  native TS strip is sufficient. The Playwright runner still spawns
  `bun x playwright test` internally because the recipe specs live
  under .verify-recipes/ and load through Playwright's own worker
  process, not through verify-pr.ts.
- --resync now only applies to sandbox-target recipes (the internal-ui
  build is fast enough that --resync would add no value); the script
  exits with an actionable error if --resync is passed for an
  internal-ui recipe.
- New: scripts/verify/target.ts (header parser, default = internal-ui).
- New: scripts/verify/internal-ui.ts (storybook:ui:build + http-server
  boot, waitOn :port/index.html).
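
The precedence between the positional `<PR#>` argument and `--recipe-spec` can be sketched like this. The `CliArgs` shape and `resolveRecipeSpec` name are hypothetical; argument parsing itself is out of scope:

```typescript
// Hypothetical parsed-argument shape.
interface CliArgs {
  positionalPr?: string; // e.g. "34762"
  recipeSpec?: string; // value of --recipe-spec, if given
}

function resolveRecipeSpec(args: CliArgs): string {
  // The explicit flag takes precedence over the positional sugar.
  if (args.recipeSpec) return args.recipeSpec;
  if (args.positionalPr && /^\d+$/.test(args.positionalPr)) {
    return `.verify-recipes/pr-${args.positionalPr}.spec.ts`;
  }
  throw new Error("Provide a <PR#> argument or --recipe-spec <path>");
}
```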

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the v5-0 Docker pipeline (harden build context, buildx, image
build, smoke test, docker run with --network=none / --cap-drop ALL /
--read-only / tmpfs, docker cp mirror) with a single shell step that
runs the harness directly on the GitHub Actions runner.

Verify step (only when `verify-spec-precheck` reports the committed spec
exists at the PR head):

    set -euo pipefail
    yarn install --immutable
    yarn nx run-many -t compile -p core,cli,create-storybook
    yarn verify-pr --recipe-spec ".verify-recipes/pr-${PR_NUMBER}.spec.ts"

The internal-ui target (default) builds code/storybook-static once and
serves it via http-server. Sandbox targets follow the pre-existing
snapshot + sanitize + sync + boot flow. The recipe header chooses.

New: a `Read verdict` step parses `pr-head/.verify-output/*/verify-result.json`
and an `Apply verified-by-harness label` step adds the label to the PR
when the verdict is `verified`. The PR comment script renders the same
two-state not-applicable message and a verified-vs-regression block,
but drops the imageDigest reference (no longer populated by v6).

Permissions: pull-requests + issues (label add) + statuses. The
ANTHROPIC_API_KEY remains scoped to the `Author recipe` step only;
nothing downstream of that step ever sees the secret. The committed
spec under review is still the lethal-trifecta breaker — the runner
executes only what was committed and reviewed at the PR head.

Workflow surface dropped:
- Harden build context (./scripts/verify/harden-build-context.sh)
- Set up Docker Buildx
- Build harness image (docker/build-push-action + cache-to/from)
- Capture image digest
- Smoke test image
- Run harness in container
- Mirror tmpfs output to runner workspace

Net delta: -78 lines, +33 lines (-45 net) on verify-pr.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrites the harness documentation around the v6 local-first
architecture. The Docker / Verdaccio / image-build-provenance sections
are dropped; new sections cover the per-recipe target dispatch
(`internal-ui` vs `sandbox:<template>`) and the runner-native CI
workflow.

scripts/verify/README.md
  * Architecture diagram updated to list target.ts + internal-ui.ts.
  * Flag table adds the positional <PR#> sugar and clarifies that
    `--resync` and `--restore-sandbox` only apply to sandbox-target
    recipes.
  * `verify-result.json` example uses `template: "internal-ui"`.
  * Prerequisites section calls out Node 22 (native TS-strip) as the
    primary runtime; Bun is needed only by the Playwright runner.
  * Side-effects section narrows to the sandbox target.
  * CI section documents the new yaml shape.
  * Drops the "Running inside the verify-harness container" section
    in its entirety.

scripts/verify/RUNBOOK.md
  * Full rewrite around the two flows: local AI fix-loop +
    GH Actions runner. Drops every Docker / buildx / harden-script /
    smoke-sentinel signal. Adds signals specific to v6:
    bootInternalUi timeout, --resync rejected, sandbox bootstrap
    missing, not-applicable verdict, label-step skipped, github-script
    verdict-read failure.

scripts/verify/SECURITY.md (recreated)
  * Brief threat-model note (~70 lines instead of 250+). Restates the
    eight load-bearing controls (committed-spec review, scoped API
    key, deny-regex, lint gate, provenance header, actor permission,
    label gate, repo-wide deny rules) and explains why v6 dropped the
    container without weakening the trifecta breakers. Notes the
    sandbox-runtime path as the next-step option if the threat model
    expands.

.verify-recipes/_recipe-authoring-guide.md
  * New §12 "Target selection (v6)" documenting the
    `// @verify-target:` header convention. Renumbers existing §12
    "Output budget" to §13.

.verify-recipes/example-smoke.spec.ts
  * Adds the explicit `// @verify-target: internal-ui` header as the
    canonical baseline example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…step

The Verify PR step runs 'yarn nx run-many -t compile -p core,cli,create-storybook'
inside the pr-head/ checkout. When the workflow runs on a fork without
Nx Cloud org access (e.g. the v6 firetest on valentinpalkovic/storybook),
nx aborts with:

  NX Cloud: Workspace is unable to be authorized. Exiting run.
  This Nx Cloud organization is disabled.

The Verify step doesn't need distributed cache for correctness — a
clean compile against the PR head is the whole point. Force-disable Nx
Cloud (NX_NO_CLOUD=true + empty access token) on the step's env block.

Upstream storybookjs/storybook CI is unaffected: other workflow steps
(Generate bundle, Author recipe) that already rely on Nx Cloud auth
continue to use it; only the Verify step opts out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ternal-ui

The Verify PR step previously ran:

    yarn nx run-many -t compile -p core,cli,create-storybook

This is sufficient for sandbox-target recipes (the sandbox already has
its own node_modules) but not for the internal-ui target. The
internal-ui build invokes 'yarn storybook:ui:build' in code/, which
loads code/.storybook/main.ts, which imports @storybook/react-vite
plus every addon (addon-onboarding, addon-themes, addon-docs,
addon-designs, addon-vitest, addon-a11y, addon-mcp) and transitive
renderer + builder packages. None of those are compiled by the
narrower filter, so .storybook/main.ts evaluation fails with
ERR_MODULE_NOT_FOUND.

Drop the project filter and let nx compile every project. Slower per
run but correct for the default target. Sandbox-target recipes are
unaffected — the same compile output is reused under the
syncCorePackage symlink path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r eslint-plugin topo

eslint-plugin's prebuild script (code/lib/eslint-plugin/scripts/...)
imports from 'storybook/csf' via jiti. nx run-many parallelises 42
projects without enforcing the compile-order edge (the repo lacks an
explicit dependsOn: ['^compile'] for that target), so eslint-plugin
runs before core finishes and the import resolves upward through the
parent base checkout's node_modules/storybook symlink → which has no
dist/csf yet → ERR_MODULE_NOT_FOUND.

Compile core first explicitly, then run-many for everything else. nx
caches the core build so the second pass is a no-op for that target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Fails
🚫

PR is not labeled with one of: ["cleanup","BREAKING CHANGE","feature request","bug","documentation","maintenance","build","dependencies"]

🚫

PR is not labeled with one of: ["ci:normal","ci:merged","ci:daily","ci:docs"]

🚫 PR title must be in the format of "Area: Summary", with both Area and Summary starting with a capital letter.
Good examples:
- "Docs: Describe Canvas Doc Block"
- "Svelte: Support Svelte v4"
Bad examples:
- "add new api docs"
- "fix: Svelte 4 support"
- "Vue: improve docs"

Generated by 🚫 dangerJS against cca5c1d

@valentinpalkovic added the ci:verify label (Trigger PR Verification Harness) May 12, 2026
@github-actions Bot added the verified-by-harness label (Verified by PR Verify Harness) May 12, 2026
github-actions Bot pushed a commit that referenced this pull request May 12, 2026
github-actions Bot pushed a commit that referenced this pull request May 12, 2026
@github-actions

Verify Harness

Verdict: verified (target svelte-vite/default-ts)

Evidence (after 1 retry) (vision-check, claude-haiku-4-5-20251001): undetermined

Vision reasoning

The diff is a pure refactoring of internal TypeScript code in the svelte-vite plugin (extracting helper functions for cache management). While the Playwright recipe asserts that the ArgsTable renders correctly with prop rows visible, the screenshots show the UI working as expected, which is consistent with the refactoring being functionally correct. However, the diff itself contains no user-visible changes—it's purely a code organization/extraction refactor with no observable UI differences from before/after.

Replay: npx playwright show-trace on the trace.zip listed under "Artifacts" on the run summary page.

Screenshots

2026-05-12T18-36-03.682Z/pr-18-svelte-vite-Button-d-50a6b-le-renders-via-docgen-cache-chromium/test-finished-1.png

2026-05-12T18-36-03.682Z/pr-18-svelte-vite-Button-d-50a6b-le-renders-via-docgen-cache-chromium/svelte-button-docs-argstable.png

2026-05-12T18-35-07.807Z/pr-18-svelte-vite-docgen-p-d6c4d-Types-table-for-Button-docs-chromium/test-failed-1.png

@valentinpalkovic valentinpalkovic added ci:verify Trigger PR Verification Harness and removed ci:verify Trigger PR Verification Harness verified-by-harness Verified by PR Verify Harness labels May 12, 2026
@github-actions github-actions Bot added the verified-by-harness Verified by PR Verify Harness label May 12, 2026
github-actions Bot pushed a commit that referenced this pull request May 12, 2026
github-actions Bot pushed a commit that referenced this pull request May 12, 2026
@github-actions

Verify Harness

Verdict: verified (target svelte-vite/default-ts)

Evidence (after 1 retry) (vision-check, claude-haiku-4-5-20251001): undetermined

Vision reasoning

The diff is a pure refactor of internal build-time docgen cache helper functions (extracting repeated logic into three new helper functions). This has no user-visible UI changes — the refactoring only reorganizes how the cache is accessed/updated internally. The Playwright recipe correctly verifies that the refactored code still produces correct docgen output by checking that prop rows (label, primary) appear in the Controls panel, which they do in the screenshots, but this is a functional correctness test rather than a visual change detection test.

Replay: npx playwright show-trace on the trace.zip listed under "Artifacts" on the run summary page.

Screenshots

2026-05-12T18-48-57.792Z/pr-18-svelte-docgen-cache--a704f--surfaces-props-in-Controls-chromium/test-finished-1.png

2026-05-12T18-48-57.792Z/pr-18-svelte-docgen-cache--a704f--surfaces-props-in-Controls-chromium/controls-panel.png

2026-05-12T18-48-57.792Z/pr-18-svelte-docgen-cache--a704f--surfaces-props-in-Controls-chromium/preview.png

2026-05-12T18-48-12.532Z/pr-18-svelte-vite-docgen-c-b9322-ArgTypes-for-Button-stories-chromium/test-finished-1.png

2026-05-12T18-48-12.532Z/pr-18-svelte-vite-docgen-c-b9322-ArgTypes-for-Button-stories-chromium/preview-secondary.png

2026-05-12T18-48-12.532Z/pr-18-svelte-vite-docgen-c-b9322-ArgTypes-for-Button-stories-chromium/controls-panel.png

valentinpalkovic added a commit that referenced this pull request May 13, 2026
…x-target CI, Layer-1/Layer-2 security, retry on regression, telemetry

Squash of fork-side iteration on top of the single-round v6 pivot.
Major changes since 00aa5c4:

## Verdict layering

- Three orthogonal signals: Playwright (recipe execution) + vision
  evidence-check (claude-haiku-4-5 reading the diff + spec + screenshots)
  + PR-added unit tests (vitest on *.test.* files from the PR diff).
- Final verdict gates on AND of Playwright + unit tests. Vision is
  informational (catches sr-only / invisible changes where assertions
  pass but screenshots can't confirm).
- regressionReason is derived from playwright-report.json when the
  recipe author doesn't populate one — reviewers see the failing test
  title + first error inline.
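
The gating rule above can be sketched as a small pure function. This is a minimal illustration, not the harness's actual code: the `Signals` shape, the `finalVerdict` name, the `"not-found"` evidence value, and the choice to treat "no PR-added unit tests" as passing are all assumptions.

```typescript
// Hypothetical types for illustration; the real verify-result.json schema
// is not shown in this commit message.
type Verdict = "verified" | "regression";
type EvidenceVerdict = "found" | "not-found" | "undetermined";

interface Signals {
  playwrightPassed: boolean;
  // null when the PR diff adds no *.test.* files (assumption: counts as passing)
  unitTestsPassed: boolean | null;
  // Informational only: never changes the final verdict
  evidence: EvidenceVerdict;
}

function finalVerdict(signals: Signals): Verdict {
  const unitTestsOk = signals.unitTestsPassed !== false;
  // AND of Playwright and unit tests; vision evidence is deliberately ignored
  return signals.playwrightPassed && unitTestsOk ? "verified" : "regression";
}
```

Under these assumptions, a run with passing Playwright, no PR-added unit tests, and `undetermined` evidence still resolves to `verified`, which matches the behaviour reported on this PR.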

## Retry loop

- Retry-on-regression: feeds Playwright error context (page snapshot +
  iframe a11y snapshot + first error from playwright-report.json) back
  to the recipe author as --retry-context. Author re-emits the spec,
  Playwright re-runs. Single retry; final verdict gates label.
- Retry-on-evidence-undetermined: feeds vision reasoning back so the
  author can target the diff more precisely (e.g., tighter screenshot
  region).
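
A rough sketch of the single-retry control flow described above. The `Attempt` shape and the `runWithSingleRetry` name are hypothetical, and the callback is synchronous only to keep the example short; the real steps shell out to the recipe author and Playwright.

```typescript
// Hypothetical shape of one recipe run; not the harness's real type.
interface Attempt {
  verdict: "verified" | "regression";
  // Error context (page snapshot, first Playwright error, ...) fed to the retry
  errorContext?: string;
}

function runWithSingleRetry(
  runRecipe: (retryContext?: string) => Attempt,
): { attempt: Attempt; retried: boolean } {
  const first = runRecipe();
  if (first.verdict === "verified") {
    return { attempt: first, retried: false };
  }
  // Exactly one retry: the author re-emits the spec with the failure context,
  // and the second result is final regardless of outcome.
  const second = runRecipe(first.errorContext ?? "");
  return { attempt: second, retried: true };
}
```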

## Sandbox-target CI path

- Recipes can set `// @verify-target: sandbox:<template>` (e.g.,
  `sandbox:vue3-vite/default-ts`). The workflow detects the header,
  runs `nx run <template>:sandbox` (NX resolves implicitDependencies,
  emits the sandbox at code/sandbox/<key>), and verify-pr.ts boots
  Storybook against that sandbox instead of the internal-ui dev server.
- Allowlisted templates: react-vite, react-webpack, vue3-vite,
  svelte-vite, angular-cli, nextjs, nextjs-vite (all default-ts).
- Skips the global `compile` target when sandbox-bound — `:sandbox`
  handles all transitive deps via the NX project graph.
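
Header detection with an allowlist could look roughly like the following. The function name, the regular expression, and the exact `<name>/default-ts` keys are assumptions derived from the list above, not the workflow's real implementation.

```typescript
// Allowlist built from the template names listed above (all default-ts).
const ALLOWED_TEMPLATES = new Set(
  ["react-vite", "react-webpack", "vue3-vite", "svelte-vite", "angular-cli", "nextjs", "nextjs-vite"]
    .map((name) => `${name}/default-ts`),
);

// Matches a line such as: // @verify-target: sandbox:vue3-vite/default-ts
const TARGET_HEADER = /^\/\/\s*@verify-target:\s*sandbox:(\S+)\s*$/m;

// Returns the sandbox template, or null when the recipe has no sandbox header
// (in which case the internal-ui dev server would be used instead).
function parseVerifyTarget(recipeSource: string): string | null {
  const template = TARGET_HEADER.exec(recipeSource)?.[1];
  if (!template) return null;
  if (!ALLOWED_TEMPLATES.has(template)) {
    throw new Error(`sandbox template not allowlisted: ${template}`);
  }
  return template;
}
```

Rejecting unknown templates before anything is interpolated into an `nx run <template>:sandbox` command keeps PR-controlled text out of the shell.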

## Layer-1 security: secret stripping

- pull_request_target runs build / sandbox / recipe code from the
  untrusted PR head as the runner user that holds GITHUB_TOKEN
  (contents:write, pull-requests:write) and ANTHROPIC_API_KEY.
- The Verify-PR, Retry-on-regression, and Run-PR-added-unit-tests
  steps `unset GITHUB_TOKEN GH_TOKEN ANTHROPIC_API_KEY` before
  invoking any PR-head script. Trusted scripts above
  (verify-pr-generate, verify-pr-author) still see the keys because
  env -u (or env --unset on the inner command) only strips for the
  single command.
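
The same idea expressed in TypeScript, for a caller that spawns the PR-head script itself rather than relying on shell `unset`. This is a sketch under assumptions: `stripSecrets` is a hypothetical helper, and only the three variable names come from the text above.

```typescript
// Secrets named in the commit message; everything else passes through.
const SECRET_KEYS = ["GITHUB_TOKEN", "GH_TOKEN", "ANTHROPIC_API_KEY"] as const;

type Env = Record<string, string | undefined>;

// Returns a copy of the environment without the secrets. The input is not
// mutated, so trusted code in the same process keeps its own view.
function stripSecrets(env: Env): Env {
  const clean: Env = { ...env };
  for (const key of SECRET_KEYS) {
    delete clean[key];
  }
  return clean;
}
```

The result can be passed as the `env` option of `spawn` from `node:child_process`, so the stripping applies to that one child process only.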

## Layer-2 security: @anthropic-ai/sandbox-runtime jail

- Wraps `yarn verify-pr` (initial attempt + retry) in srt with a
  bubblewrap-backed FS + network jail. Defence-in-depth on top of
  Layer-1.
- network.allowLocalBinding: true (Storybook dev server on
  localhost:6006); network.allowedDomains: [] (no public-internet
  egress).
- filesystem.allowWrite: $RUNNER_TEMP, /tmp, $HOME/.cache,
  $HOME/.local/share, $HOME/.storybook.
- filesystem.denyRead: $HOME/{.ssh, .aws, .docker, .npmrc, .gitconfig,
  .config/gh} (belt-and-suspenders alongside the env stripping).
- CLAUDE_CODE_TMPDIR=$RUNNER_TEMP/sandbox-tmp so the sandbox's TMPDIR
  bind source exists on the host.
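
For orientation, the settings listed above can be assembled as one object. Only the key names (`network.allowLocalBinding`, `network.allowedDomains`, `filesystem.allowWrite`, `filesystem.denyRead`) come from this commit message; the `buildSandboxSettings` helper is hypothetical, and how srt expects to receive the settings (file location, any further required keys) is not shown here and should be checked against its documentation.

```typescript
// Hypothetical builder; paths are expanded by the caller instead of relying
// on $HOME / $RUNNER_TEMP being interpolated by the sandbox tool.
function buildSandboxSettings(runnerTemp: string, home: string) {
  const deniedUnderHome = [".ssh", ".aws", ".docker", ".npmrc", ".gitconfig", ".config/gh"];
  return {
    network: {
      allowLocalBinding: true, // Storybook dev server on localhost:6006
      allowedDomains: [] as string[], // no public-internet egress
    },
    filesystem: {
      allowWrite: [
        runnerTemp,
        "/tmp",
        `${home}/.cache`,
        `${home}/.local/share`,
        `${home}/.storybook`,
      ],
      denyRead: deniedUnderHome.map((p) => `${home}/${p}`),
    },
  };
}
```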

## Recipe-author quality

- Deterministic story-route derivation:
  scripts/verify/derive-story-routes.ts parses code/.storybook/main.ts
  via TS AST + inlines Storybook's auto-title / toId /
  storyNameFromExport algorithms. Routes are injected into the prompt
  bundle verbatim, so agents stop guessing 404 paths.
- Full source of touched non-stories files in the prompt bundle
  (capped 250 lines per file, 4 files per PR). Agents see actual
  component props / ariaLabels / data-attrs upfront.
- Iframe a11y snapshot fixture in _util.ts: on test failure, writes
  the preview-iframe's body.ariaSnapshot() to iframe-snapshot.md.
  Retry step appends this alongside the manager page-snapshot.
- Authoring guide §8.1 expanded with evidence requirement + four-step
  evidence gate + worked examples (focus-ring, Save-from-Controls
  icon swap, sr-only label gating).
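
To illustrate the route-derivation bullet above: a simplified approximation of how a story title and story name map to a story id and iframe route. This is not Storybook's actual `toId` implementation (which handles a specific punctuation set and non-ASCII titles), and it skips the auto-title and `storyNameFromExport` steps entirely; `sanitize`, `toStoryId`, and `storyRoute` are names chosen for this sketch.

```typescript
// Simplified: lowercase, collapse every run of non-alphanumerics to "-",
// then trim leading/trailing dashes. ASCII-only by assumption.
function sanitize(input: string): string {
  return input
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// "<title>--<story name>", e.g. Example/Button + Primary
function toStoryId(title: string, storyName: string): string {
  return `${sanitize(title)}--${sanitize(storyName)}`;
}

// Route of the story inside the preview iframe
function storyRoute(title: string, storyName: string): string {
  return `/iframe.html?id=${toStoryId(title, storyName)}&viewMode=story`;
}
```

With these definitions, `toStoryId("Example/Button", "Primary")` yields `example-button--primary`, the story id used by the original single-story proof of concept.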

## Compile-failure surfacing

- When `nx compile` fails before Playwright runs, the workflow writes
  a stub verify-result.json with verdict=regression, regressionReason=
  "compile failure", regressionDetails=tail -c 4000 of the log
  (ANSI-stripped). PR comment renders the build error in-line so
  reviewers see WHY without downloading artifacts.
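
A minimal sketch of building that stub from a captured build log. The field names follow the text above; the helper name, the ANSI pattern (colour/CSI sequences only), and stripping before truncating are assumptions. Note that `tail -c` counts bytes, whereas `slice` below counts UTF-16 code units.

```typescript
// Matches common CSI escape sequences such as "\u001b[31m" and "\u001b[0m".
const ANSI_CSI = /\u001b\[[0-9;]*[A-Za-z]/g;

function compileFailureStub(log: string, maxChars = 4000) {
  // Strip first so the cut can never land in the middle of an escape sequence
  const details = log.replace(ANSI_CSI, "").slice(-maxChars);
  return {
    verdict: "regression" as const,
    regressionReason: "compile failure",
    regressionDetails: details,
  };
}
```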

## UX polish

- Vision reasoning collapsed inside a <details> block (verdict stays
  one-glance, reasoning one click away).
- PR comment unitTests block renders ✅/❌ alongside Playwright +
  vision so reviewers see all three signals together.
- Artifact zip staged under non-dot dirs so reviewers can browse it
  without toggling Finder's hidden-file display.
- Replay link points at the run-summary page (where the Artifacts
  section lives) instead of the 404-emitting /artifacts path.

## Telemetry

- New "Append telemetry" workflow step writes one CSV row per run to
  telemetry.csv on the _verify-screenshots side branch. Columns:
  run_id, pr_number, verdict, target, evidence_verdict,
  evidence_retry, unit_tests_ran, unit_tests_passed, duration_ms,
  timestamp. After 10–20 PRs the data drives v8 prioritisation
  (in-app role discovery, 2-retry budget, cross-package story
  heuristic, etc.).
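
One way to serialise a row with those columns, as a hedged sketch: the column order is taken from the list above, while the `TelemetryRow` type, the helper names, and the quoting rules (RFC 4180 style) are assumptions about how the step might be written.

```typescript
const COLUMNS = [
  "run_id", "pr_number", "verdict", "target", "evidence_verdict",
  "evidence_retry", "unit_tests_ran", "unit_tests_passed", "duration_ms",
  "timestamp",
] as const;

type TelemetryRow = Record<(typeof COLUMNS)[number], string | number | boolean>;

// Quote a field only when it contains a comma, quote, or line break
function csvField(value: string | number | boolean): string {
  const text = String(value);
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Emits fields in the fixed COLUMNS order, independent of object key order
function toCsvRow(row: TelemetryRow): string {
  return COLUMNS.map((column) => csvField(row[column])).join(",");
}
```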

## Validation

Firetest PRs (fork-side):
- #12 internal-ui smoke — verified
- #13 Save-from-Controls icon swap — verified + evidence found
- #14 ObjectControl raw JSON sr-only label — verified after retry
- #15 ArgsTable dark-mode border — regression (genuine compile-fail)
- #16 sidebar focus ring — verified, three signals positive
- #17 Vue3 page-style scoping (sandbox target) — verified + found
- #18 Svelte docgen refactor (sandbox target) — verified
- #21 Angular stats.json (sandbox target) — verified

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@valentinpalkovic valentinpalkovic force-pushed the next branch 16 times, most recently from ad75ba9 to 099b6f7 Compare May 15, 2026 19:48

Labels

ci:verify Trigger PR Verification Harness verified-by-harness Verified by PR Verify Harness
