feat(scripts): agentic PR verification harness (v6 single-round — author + execute in one run) by valentinpalkovic · Pull Request #34762 · storybookjs/storybook

valentinpalkovic · 2026-05-11T12:54:00Z

Summary

Introduces an agentic PR verification harness under scripts/verify/ and .github/workflows/verify-pr.yml.

v6 single-round: drops the v5-0 Docker / Verdaccio / image-build-provenance scaffolding and collapses the previous two-round flow (author at base → maintainer commits at PR head → re-fire ci:verify) into a single workflow run. The PR-head checkout is staged first; the LLM-authored Playwright recipe is materialised straight into $RUNNER_TEMP/pr-head/.verify-recipes/pr-<#>.spec.ts (ephemeral — never committed) and executed in the same job. ANTHROPIC_API_KEY stays scoped to the Author step only; the Verify step that runs the recipe has no API key and no GITHUB_TOKEN.

Architecture diffs vs v5-0

No Docker, no Verdaccio, no image-build-provenance. Runs on a stock GitHub Actions runner — same isolation profile as the existing Storybook PR CI (which already executes untrusted contributor code in normal test runs).
~70 % of v5-0's complexity was supply-chain hardening (digest pins, harden-build-context overlay, lifecycle-script stripping, Verdaccio publish pipeline). v6 covers the same surface via enableScripts: false + lockfile + .npmrc purge.
BuildKit's layer isolation kept dropping code/core/dist between stages 6 and 7 across 11 v5-0 firetest rounds. v6 eliminates the multi-stage Docker boundary entirely.

Architecture diffs vs the earlier v6 two-round model

Aspect	Two-round v6	Single-round v6
Author → execute gating	Committed-spec human review at PR head	None — agent authors and executes in one run
Recipe storage	Committed to `.verify-recipes/pr-<#>.spec.ts` on the PR branch	Materialised into `$RUNNER_TEMP/pr-head/.verify-recipes/` (ephemeral, artifact-only)
Workflow rounds	2 (kick off, review committed spec, re-fire)	1
Load-bearing controls	Committed-spec review + scoped key + deny-regex + lint + actor-perm + label gate	Scoped key + deny-regex + lint + structural pattern checks + controlled `outputSpecPath` + actor-perm + label gate + trusted-script provenance

See scripts/verify/SECURITY.md for the full re-audit. Local-dev with the verify-recipe-author skill (human-review path) remains supported for ambiguous PRs.

Per-recipe targets

Recipes declare execution target via a single header line (scanned in the first 30 lines):

Target	Boots	When
`internal-ui` (default)	`yarn storybook:ui:build` once → `yarn http-server code/storybook-static -p <port>`	Most fixes — exercises the monorepo's own Storybook UI against the PR head's compiled packages.
`sandbox:<template>`	Pre-existing sandbox flow (snapshot, sanitize, sync core/dist symlink, boot sandbox storybook)	Template-specific bugs (rare).

Local CLI: yarn verify-pr <PR#> resolves to .verify-recipes/pr-<#>.spec.ts; explicit --recipe-spec <path> still works.

CI workflow shape

.github/workflows/verify-pr.yml:

Check actor permission (≥ write)
Checkout base (pull_request.base.ref) + Setup Node.js / install root deps + Setup Bun
Compute paths — echo "PR_HEAD_DIR=$RUNNER_TEMP/pr-head" >> $GITHUB_ENV (workaround for runner.temp not being a valid job-env context)
Manual git clone PR head into $PR_HEAD_DIR (clone target is outside the base checkout so nx / yarn / jiti module resolution cannot escape upward into the base node_modules)
gh pr diff → /tmp/pr.diff
yarn verify-pr-generate --pr <#> --force --output "$PR_HEAD_DIR/.verify-recipes/pr-<#>.spec.ts" — trusted base scripts emit the prompt bundle with the ephemeral output path baked in
yarn verify-pr-author --bundle … (ANTHROPIC_API_KEY scoped here) dispatches the LLM, runs deny-regex + lint + structural pattern checks, and atomically renames the candidate onto the ephemeral path inside $PR_HEAD_DIR

Verify PR step (working-directory = $PR_HEAD_DIR):

yarn install --immutable
yarn playwright install --with-deps chromium
yarn nx compile core              # topo-pre to satisfy eslint-plugin prebuild
yarn nx run-many -t compile       # all 42 projects so internal-ui main.ts resolves
yarn verify-pr --recipe-spec ".verify-recipes/pr-${PR_NUMBER}.spec.ts"

Read verdict → on verified, gh label create (idempotent) + gh pr edit --add-label verified-by-harness
Push screenshots to the _verify-screenshots orphan side branch (raw.githubusercontent.com URLs pinned at the commit pushed)
Upload $PR_HEAD_DIR/.verify-output/ + the ephemeral spec + the base-checkout author run-dir as the artefact bundle
Post PR comment with verdict, trace replay link, and inline screenshots

Files dropped (vs v5-0 + two-round v6)

scripts/verify/Dockerfile                      # v5-0
scripts/verify/harden-build-context.sh         # v5-0
scripts/verify/strip-lifecycle-scripts.mjs     # v5-0
scripts/verify/__tests__/dockerfile-lint.test.ts        # v5-0
scripts/verify/__tests__/head-sha-assertion.test.ts     # v5-0
scripts/verify/__tests__/in-container-shortcircuit.test.ts  # v5-0
.github/actions/verify-spec-precheck/action.yml         # two-round v6 (gated on committed spec)

Plus the v5-0 additions in .dockerignore and the renovate.json Docker-pin rules.

New / changed in single-round v6

scripts/verify/target.ts — // @verify-target: header parser (default internal-ui).
scripts/verify/internal-ui.ts — storybook:ui:build + http-server boot.
scripts/verify-pr.ts — rewritten around per-recipe target dispatch; positional <PR#> arg; dropped inContainer short-circuit, HEAD_SHA runtime assertion, imageDigest field.
scripts/verify-pr-generate.ts — --output <path> flag so CI can point the bundle at an ephemeral PR-head path; default behaviour preserved for local-dev.
scripts/verify/recipe-author-core.ts — provenance header acknowledges both modes (local-dev human-reviewed; CI single-round materialised and executed without intermediate review).
scripts/verify/core.ts — template widened from a literal to string; computeVerdict filters generic "Failed to load resource:" console 404s (low-signal browser-side fetch misses).
scripts/verify/SECURITY.md — full rewrite for single-round: load-bearing controls, what single-round explicitly gives up, when to add stronger isolation.
scripts/verify/README.md and scripts/verify/RUNBOOK.md — full rewrite around single-round v6 architecture and failure-mode debugging.
.verify-recipes/_recipe-authoring-guide.md — new §12 documenting target selection.

Validation — fork firetest

Validated end-to-end on a throwaway fork PR (valentinpalkovic/storybook#12).

Single-round flow: run 25721180250 — conclusion success in 4m22s
Final verdict: verified (target internal-ui)
verified-by-harness label applied automatically
PR comment posted with verdict + inline screenshots + trace replay link
Total of 9 prior CI rounds during v6 development (root causes captured in commit messages); single-round pivot validated cleanly on the next round after the original two-round v6 shipped end-to-end.

Test plan

v6 force-pushed to fork's next and to a throwaway PR (#12)
ci:verify label fires the workflow
Recipe is authored AND executed in the same workflow run (no committed spec at PR head)
yarn install --immutable succeeds in $RUNNER_TEMP/pr-head/
yarn nx run-many -t compile succeeds for all 42 projects
yarn verify-pr boots internal-ui, runs Playwright recipe, writes verify-result.json
On verified: verified-by-harness label is applied (auto-created if missing)
PR comment renders verdict block + inline screenshots (raw.githubusercontent.com URLs pinned at the screenshot side-branch commit)
Land first real PR through the harness on next after merge

Security posture (single-round v6)

Load-bearing controls — re-audited in scripts/verify/SECURITY.md:

Scoped API key. ANTHROPIC_API_KEY is mounted only on the Author recipe step's env: block. The Verify step has no API key and no GITHUB_TOKEN.
Static deny-regex pass. Blocks child_process, fs.unlink*, fs.rm*, process.exit, eval(, import 'node:*', etc. before the spec lands on disk.
Scoped lint gate. Pinned ESLint config; first-failure retry with categorised errors; second failure aborts.
Structural pattern checks. listener-before-goto and finally-attach regex assertions on every dispatched candidate.
Controlled output path. bundle.outputSpecPath is computed by the trusted base script (verify-pr-generate) and consumed verbatim by recipe-author-core. The LLM cannot influence where its output lands.
Trusted-script provenance. verify-pr-generate, recipe-author-core, recipe-deny, lint-invocation, the authoring-guide, and the canonical-smoke reference all read from the base checkout (not the PR head), so a malicious PR cannot replace the gates.
Actor-permission gate + label gate + non-draft gate.
Local-dev fallback. The verify-recipe-author skill remains supported for PRs where a maintainer wants human review of the spec before execution.

What single-round gives up: no human review of the executed spec, and no replay-by-default in version control (artefacts have 14-day retention). If either is unacceptable for a given PR class, fall back to the local-dev path.

If the threat model later expands to processing third-party PRs at scale with adversarial recipe authors, wrap the playwright test step in sandbox-runtime (bubblewrap on Linux) — ~10 lines of config per Anthropic's "Securely deploying AI agents" doc. Do not reintroduce the Docker + Verdaccio stack.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Full PR verification harness: agent-assisted recipe authoring, safety deny-list and lint-driven retry, single-round in-band execution, CLI dispatch for authoring, and PR comments with trace-replay links, inline screenshots, and artifact uploads.
Documentation
- New authoring guide, verification README, security posture documentation, and field-debugging runbook.
Chores
- Playwright verification setup, helper utilities and fixtures, ESLint config for verification, updated ignore files, and new verification CLI scripts.

Day 3 + Day 4 follow-ups (post-v6 single-round pivot)

The squash commit feat(verify): Day 3 + Day 4 follow-ups on top of the single-round pivot adds:

Verdict layering — three orthogonal signals

Playwright (recipe execution) + vision evidence-check (claude-haiku-4-5 reading the diff + spec + screenshots) + PR-added unit tests (vitest on *.test.* files from the PR diff).
Final verdict gates on AND of Playwright + unit tests. Vision is informational (catches sr-only / invisible changes where assertions pass but screenshots can't confirm).
regressionReason derived from playwright-report.json when the recipe author doesn't populate one — reviewers see the failing test title + first error inline.

Retry loop

Retry-on-regression: feeds Playwright error context (page snapshot + iframe a11y snapshot + first error from playwright-report.json) back to the recipe author as --retry-context. Author re-emits the spec, Playwright re-runs. Single retry; final verdict gates the label.
Retry-on-evidence-undetermined: feeds vision reasoning back so the author can target the diff more precisely.

Sandbox-target CI path

Recipes can set // @verify-target: sandbox:<template> (e.g. sandbox:vue3-vite/default-ts). The workflow detects the header, runs nx run <template>:sandbox (NX resolves implicitDependencies, emits the sandbox at code/sandbox/<key>), and verify-pr.ts boots Storybook against that sandbox instead of the internal-ui dev server.
Allowlisted templates: react-vite, react-webpack, vue3-vite, svelte-vite, angular-cli, nextjs, nextjs-vite (all default-ts).
Skips the global compile target when sandbox-bound — :sandbox handles all transitive deps via the NX project graph.

Layer-1 security: secret stripping

pull_request_target runs build / sandbox / recipe code from the untrusted PR head as the runner user that holds GITHUB_TOKEN (contents:write, pull-requests:write) and ANTHROPIC_API_KEY.
The Verify-PR, Retry-on-regression, and Run-PR-added-unit-tests steps unset GITHUB_TOKEN GH_TOKEN ANTHROPIC_API_KEY before invoking any PR-head script. Trusted scripts above (verify-pr-generate, verify-pr-author) still see the keys because env -u only strips for the single command.

Layer-2 security: `@anthropic-ai/sandbox-runtime` jail

Wraps yarn verify-pr (initial attempt + retry) in srt with a bubblewrap-backed FS + network jail. Defence-in-depth on top of Layer-1.
network.allowLocalBinding: true (Storybook dev server on localhost:6006); network.allowedDomains: [] (no public-internet egress).
filesystem.allowWrite: $RUNNER_TEMP, /tmp, $HOME/.cache, $HOME/.local/share, $HOME/.storybook.
filesystem.denyRead: $HOME/{.ssh, .aws, .docker, .npmrc, .gitconfig, .config/gh} (belt-and-suspenders alongside the env stripping).
CLAUDE_CODE_TMPDIR=$RUNNER_TEMP/sandbox-tmp so the sandbox's TMPDIR bind source exists on the host.

Recipe-author quality

Deterministic story-route derivation: scripts/verify/derive-story-routes.ts parses code/.storybook/main.ts via TS AST + inlines Storybook's auto-title / toId / storyNameFromExport algorithms. Routes injected into the prompt bundle verbatim — agents stop guessing 404 paths.
Full source of touched non-stories files in the prompt bundle (capped 250 lines per file, 4 files per PR). Agents see actual component props / ariaLabels / data- attrs upfront.
Iframe a11y snapshot fixture in _util.ts: on test failure, writes the preview-iframe's body.ariaSnapshot() to iframe-snapshot.md. Retry step appends this alongside the manager page-snapshot.
Authoring guide §8.1 expanded with evidence requirement + four-step evidence gate + worked examples (focus-ring, Save-from-Controls icon swap, sr-only label gating).

Compile-failure surfacing

When nx compile fails before Playwright runs, the workflow writes a stub verify-result.json with verdict=regression, regressionReason="compile failure", regressionDetails=tail -c 4000 of the log (ANSI-stripped). PR comment renders the build error in-line so reviewers see WHY without downloading artifacts.

UX polish

Vision reasoning collapsed inside a <details> block (verdict stays one-glance, reasoning one click away).
PR comment unit-tests block renders ✅/❌ alongside Playwright + vision so reviewers see all three signals together.
Artifact zip staged under non-dot dirs so reviewers can browse it without toggling Finder's hidden-file display.
Replay link points at the run-summary page (where the Artifacts section lives) instead of the 404-emitting /artifacts path.

Telemetry

New "Append telemetry" workflow step writes one CSV row per run to telemetry.csv on the _verify-screenshots side branch. Columns: run_id, pr_number, verdict, target, evidence_verdict, evidence_retry, unit_tests_ran, unit_tests_passed, duration_ms, timestamp. After 10–20 PRs the data drives v8 prioritisation (in-app role discovery, 2-retry budget, cross-package story heuristic, etc.).

Validation

Firetest PRs (fork-side):

Fork PR	Upstream PR	Target	Verdict
#12	n/a	internal-ui smoke	verified
#13	#34767 (UndoIcon swap)	internal-ui	verified, evidence `found`
#14	#34766 (Object JSON sr-only label)	internal-ui	verified after retry
#15	#34756 (ArgsTable border)	internal-ui	regression (genuine compile-fail)
#16	#34658 (sidebar focus ring)	internal-ui	verified, three signals positive
#17	#34571 (Vue3 page-style scoping)	`vue3-vite/default-ts`	verified, evidence `found`
#18	#34644 (Svelte docgen refactor)	`svelte-vite/default-ts`	verified
#21	#34551 (Angular stats.json)	`angular-cli/default-ts`	verified

github-actions · 2026-05-11T12:54:26Z

	Fails
🚫	PR is not labeled with one of: ["cleanup","BREAKING CHANGE","feature request","bug","documentation","maintenance","build","dependencies"]
🚫	PR is not labeled with one of: ["ci:normal","ci:merged","ci:daily","ci:docs"]
🚫	PR title must be in the format of "Area: Summary", With both Area and Summary starting with a capital letter Good examples: - "Docs: Describe Canvas Doc Block" - "Svelte: Support Svelte v4" Bad examples: - "add new api docs" - "fix: Svelte 4 support" - "Vue: improve docs"
🚫	PR description is missing the mandatory "#### Manual testing" section. Please add it so that reviewers know how to manually test your changes.

Generated by 🚫 dangerJS against 58e0609

coderabbitai · 2026-05-11T13:03:58Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a PR verification harness: prompt-bundle generation, agent-driven Playwright spec authoring with deny-regex and ESLint gates (single retry), provenance-tagged committed specs, Storybook sandbox boot/sync, Playwright execution with trace capture, fixtures/docs, and a gated CI workflow that uploads artifacts and posts verdict comments.

Changes

PR Verification Harness

Layer / File(s)	Summary
Data Models & Prompt Builder `scripts/verify/agent-prompt.ts`	Defines PromptInput, prompt assembly, and token-budget validation.
Prompt Bundle Generator & CLI `scripts/verify-pr-generate.ts`, `scripts/verify-pr-author.ts`, `package.json`	Generates `.verify-output/<runId>/prompt-bundle.json`, provides dispatch CLI (stdin/sdk), and adds package scripts plus `@anthropic-ai/sdk` devDependency.
Agent Dispatch Layer `scripts/verify/agent-dispatch.ts`	Resolves model hints, builds Anthropic requests (guide+smoke), supports stub mode, and implements transport retry/backoff and debug artifact emission.
Recipe Author Engine `scripts/verify/recipe-author-core.ts`	Orchestrates dispatch attempts, fenced-spec extraction, deny checks, linting, structural listener/attach checks, TOCTOU collision handling, provenance header prepending, result persistence, and retry framing.
Static Deny & ESLint Runner `scripts/verify/recipe-deny.ts`, `scripts/verify/lint-invocation.ts`, `.verify-recipes/.eslintrc.cjs`	Provides DENY_PATTERNS and assertNoDeniedPatterns; programmatic ESLint runner pinned to scoped config and normalized violation output.
Playwright Runner & Core Parsing `scripts/verify/runner.ts`, `scripts/verify/core.ts`, `scripts/verify/playwright.config.ts`	Spawns Playwright runs, streams logs, discovers trace attachments, parses Playwright JSON into structured VerifyResult, computes verdicts, and prunes old runs.
Sandbox, Sync & Boot Utilities `scripts/verify/sandbox.ts`, `scripts/verify/sync.ts`, `scripts/verify/symlink.ts`, `scripts/verify/boot.ts`	Sandbox discovery/snapshot/restore, compile & sync core, symlink healing or copy fallback, port preflight, signal handlers, and Storybook boot/readiness checks.
Triage Routes & Selection `scripts/verify/recipes/triage-table.ts`, `scripts/verify/triage.ts`	Maps changed-path globs to reference spec basenames and resolves absolute reference spec paths for prompts and provenance metadata.
Helpers, Fixtures & Guide `.verify-recipes/_util.ts`, `.verify-recipes/_recipe-authoring-guide.md`, `.verify-recipes/example-smoke.spec.ts`, `scripts/verify/__fixtures__/*`	`RecipePage` helper, authoring contract guide, example smoke spec and stub fixtures used for smoke/lint/authoring validation.
CI Workflow, Docs & Ignores `.github/workflows/verify-pr.yml`, `scripts/verify/README.md`, `scripts/verify/SECURITY.md`, `.dockerignore`, `.gitignore`	Gated `pull_request_target` workflow with actor checks, hardened container execution, artifact upload and PR comment step; documentation and ignore updates for artifacts/credentials.

Sequence Diagram(s)

sequenceDiagram
  participant Dev as Dev / CI
  participant Generator as verify-pr-generate
  participant Bundle as PromptBundle (.verify-output)
  participant Agent as Anthropic / Agent
  participant AuthorCLI as verify-pr-author (stdin/sdk)
  participant Linter as ESLint (lint-invocation)
  participant Writer as Spec Writer (.verify-recipes)
  participant Runner as verify-pr (Playwright Runner)

  Dev->>Generator: yarn verify-pr-generate (create prompt-bundle.json)
  Generator->>Bundle: write prompt-bundle.json
  Bundle->>Agent: dispatch prompt (agent-dispatch)
  Agent->>AuthorCLI: assistant reply (stdin/sdk)
  AuthorCLI->>Linter: lint candidate spec
  Linter-->>AuthorCLI: lint result / violations
  AuthorCLI->>Writer: write provenance-tagged `.verify-recipes/pr-<N>.spec.ts` or emit retry (exit 75)
  Dev->>Runner: yarn verify-pr --recipe-spec (run committed spec)
  Runner-->>Dev: verify-result.json + artifacts (trace, logs)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

storybookjs/storybook#33270: Related sandbox-location and sandbox-usage changes; overlaps with sandbox discovery and boot logic.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 12

🧹 Nitpick comments (2)

scripts/verify/boot.ts (2)

30-58: 💤 Low value

Same command injection concern applies to lsof/netstat calls.

The lsof and netstat fallback commands on macOS/Linux have the same pattern. Consider refactoring these to use spawnSync with argument arrays as well.

♻️ Suggested refactor for Unix port checks

-    const out = execSync(`lsof -ti :${port}`, { encoding: 'utf-8' }).trim();
+    const lsof = spawnSync('lsof', ['-ti', `:${port}`], { encoding: 'utf-8' });
+    const out = (lsof.stdout ?? '').trim();

And for the netstat fallback:

-        const out = execSync(`netstat -an | grep :${port}`, { encoding: 'utf-8' }).trim();
+        const netstat = spawnSync('netstat', ['-an'], { encoding: 'utf-8' });
+        const out = (netstat.stdout ?? '')
+          .split('\n')
+          .filter((line) => line.includes(`:${port}`))
+          .join('\n')
+          .trim();

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/verify/boot.ts` around lines 30 - 58, The lsof/netstat invocations in
scripts/verify/boot.ts use execSync with string interpolation (the
execSync(`lsof -ti :${port}`) and execSync(`netstat -an | grep :${port}`) calls)
which allows shell injection; change both to use child_process.spawnSync with
argument arrays (e.g., spawnSync('lsof', ['-ti', String(port)]) and for the
fallback use spawnSync('netstat', ['-an']) and pipe/filter the stdout in Node
rather than using a shell grep), handle ENOENT the same way, and preserve the
existing stdout/trimming/error handling logic used in the try/catch around those
calls.

11-11: 💤 Low value

Prefer spawnSync with args array to avoid command injection risk.

While port is typed as number (limiting actual injection risk), using execSync with string interpolation is flagged by static analysis and doesn't follow secure coding best practices. Using spawnSync with an argument array is safer and eliminates the warning.

♻️ Suggested refactor for Windows port check

-      const out = execSync(`netstat -ano | findstr :${port}`, { encoding: 'utf-8' }).trim();
+      const netstat = spawnSync('netstat', ['-ano'], { encoding: 'utf-8' });
+      const out = (netstat.stdout ?? '')
+        .split('\n')
+        .filter((line) => line.includes(`:${port}`))
+        .join('\n')
+        .trim();

Import spawnSync alongside execSync:

-import { execSync, spawn } from 'node:child_process';
+import { execSync, spawn, spawnSync } from 'node:child_process';

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/verify/boot.ts` at line 11, Replace the execSync call with spawnSync:
import spawnSync from 'child_process' (or include it when importing execSync),
call spawnSync('netstat', ['-ano'], { encoding: 'utf-8' }) to get stdout, then
search/filter the stdout string for the port (e.g., split by newlines and find a
line that includes `:${port}`) instead of using a shell pipeline to findstr;
update the const out assignment to use the spawnSync stdout and handle
empty/no-match cases similarly.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.agents/skills/verify-recipe-author/SKILL.md:
- Around line 13-20: The SKILL.md currently hardcodes the absolute machine-local
path /Users/valentinpalkovic/Projects/storybook/... for bundle discovery and
other references; change the contract to use repository-relative or configurable
paths instead: describe auto-discovery as looking under the repo/workspace root
(e.g., ./ .verify-output/) or respect an environment/argument (e.g., BUNDLE_ROOT
or a CLI arg) and update all occurrences mentioned (the auto-discover section
and the other occurrences at lines ~80-86, 137-138, 175-177, 187-202) to
reference the new repo-relative or configurable path rather than the absolute
path so the skill works from any checkout.
- Around line 51-89: The docs in SKILL.md are out-of-date and describe legacy
result shapes and retry flows that conflict with the implemented logic in
verify-pr-author.ts and recipe-author-core.ts; update the step descriptions to
match the shipped contract: ensure the pre-flight check references
bundle.outputSpecPath / bundle.force but writes the current result.json shape
used by recipe-author-core.ts (use the exact status strings and attempts framing
implemented there), change the retry/fence-miss semantics to match
verify-pr-author.ts (how many attempts, what messages trigger retries vs final
failure), and update the deny-regex invocation notes to reference the actual
script function assertNoDeniedPatterns in recipe-deny.ts and the real temp-file
path convention (.verify-output/<runId>/.deny-input.txt) and failure handling
(no retry on throw). Ensure all mentions of statuses, retry counts, and
ownership of header/lint/write behavior align with the real functions in
verify-pr-author.ts and recipe-author-core.ts rather than the legacy text.

In @.github/workflows/verify-pr.yml:
- Around line 23-26: The workflow is checking out the base commit (Checkout
base) using ref: github.event.pull_request.base.sha so the job builds the wrong
code; update the checkout step to instead fetch and check out the PR head (use
ref: github.event.pull_request.head.sha) or perform a two-step checkout/merge
(keep Checkout base, then run a step that fetches
github.event.pull_request.head.ref and merges or checks out that commit) so the
subsequent yarn verify-pr runs against the PR changes rather than the base
commit.
- Around line 19-24: Summary: The workflow uses placeholder all-zero SHAs for
two actions which will fail; replace them with real pinned commit SHAs or stable
tags. Locate the uses:
"prince-chrismc/check-actor-permissions-action@0000000000000000000000000000000000000000"
and "actions/checkout@0000000000000000000000000000000000000000" (and the
repeated occurrence at lines referenced 54-61) and swap the placeholder refs for
the actual commit SHA (preferred) or an official released tag, then re-run the
workflow to confirm the actions resolve correctly.
- Around line 31-51: The workflow assumes Bun and a local Docker image exist;
add steps to provision them before the "Generate bundle" and "Run harness in
container" steps: install Bun (e.g., use actions/setup-bun or an install script)
so the yarn verify-pr-generate command runs on a fresh runner, and add a step to
produce or pull the verify-harness:pinned-sha image (e.g., a "Build harness
image" step that runs docker build -t verify-harness:pinned-sha ... or a docker
pull) so the docker run that executes yarn verify-pr --recipe-spec can find the
image; reference the "Generate bundle" (yarn verify-pr-generate), "Author
recipe" (yarn verify-pr-author) and the docker run using
verify-harness:pinned-sha to place these new steps immediately before those
existing steps.
- Around line 15-17: The workflow's permissions block is missing the repository
contents read scope required by actions/checkout; add "contents: read" to the
existing permissions map (alongside pull-requests: write and statuses: write) so
actions/checkout can access the repo before the harness steps run and avoid the
job failing early.

In `@scripts/verify-pr-author.ts`:
- Around line 186-191: The catch block after runRecipeAuthor currently only logs
to stderr and returns; instead build a structured failure result and persist it
to result.json before returning. In the catch for runRecipeAuthor (and any
thrown by dispatchRecipeAuthor/stdin), populate the existing result variable (or
create one) with failure fields (e.g. success: false, error: msg, errorStack:
err.stack, attempt info), then write that object to result.json (using the same
JSON structure the rest of the flow expects) and flush to disk before calling
console.error and returning 1 so every run always leaves a well-formed
result.json.

In `@scripts/verify/README.md`:
- Around line 228-309: The README describes pre-v4 behavior that no longer
matches the shipped implementation; update the Limitations/Roadmap/Increment 2
text to reflect label-gated CI, direct SDK dispatch for the authoring skill, and
the current stdout retry contract `===VERIFY_PR_AUTHOR_RETRY_BEGIN===` (and
remove obsolete notes about `if: false`, Claude Code-only dispatch, stderr/JSON
retry framing, and older retry ownership). Edit the Increment 2 four-step flow
and Spec-name collision/Header-comment/Triage routing/Diff truncation paragraphs
(and the mirrored section at lines ~322-363) so commands like `yarn
verify-pr-generate`, the skill path
`.claude/skills/verify-recipe-author/SKILL.md`, the emitted
`.verify-output/<runId>/result.json`, and CI behavior reference the actual
label-gated workflow and direct Anthropic SDK dispatch used in this PR; keep the
same user-facing commands but change the descriptive prose and troubleshooting
pointers to the new stdout contract and CI gating.

In `@scripts/verify/recipe-author-core.ts`:
- Around line 114-120: The extractSpecBody function currently returns an empty
string for fences that contain only whitespace; change it to treat
whitespace-only payloads as extraction failures by checking the extracted body
with body.trim() and returning null when it's empty. Locate extractSpecBody (and
the variables SPEC_FENCE_START / SPEC_FENCE_END) and after computing body from
reply.slice(...), add a guard that returns null if body.trim() === '' and
otherwise perform the existing whitespace-adjustments and return the body.
- Around line 252-268: The current existsSync + fs.promises.rename still allows
a race where rename can overwrite bundle.outputSpecPath; change the move logic
to an atomic create-that-fails approach: attempt to create the destination using
fs.promises.copyFile(candidatePath, bundle.outputSpecPath,
fs.constants.COPYFILE_EXCL) (or fs.promises.link(candidatePath,
bundle.outputSpecPath) which also fails if the target exists), and only on
success unlink the candidatePath; on failure treat it as a collision (return the
same 'collision' RecipeAuthorResult). Update the code paths around candidatePath
and bundle.outputSpecPath to use COPYFILE_EXCL or link and handle the thrown
EEXIST to preserve the "do not clobber unless --force" guarantee.

In `@scripts/verify/symlink.ts`:
- Around line 27-42: The dangling-symlink heal code incorrectly calls
access(linkTarget) directly because readlink can return a relative path resolved
against dirname(target); change it to resolve relative targets first by
computing const resolved = path.isAbsolute(linkTarget) ? linkTarget :
path.resolve(path.dirname(target), linkTarget) and then call await
access(resolved) (keep existing lstat, readlink, access, unlink flow and ensure
path is imported/available).

---

Nitpick comments:
In `@scripts/verify/boot.ts`:
- Around line 30-58: The lsof/netstat invocations in scripts/verify/boot.ts use
execSync with string interpolation (the execSync(`lsof -ti :${port}`) and
execSync(`netstat -an | grep :${port}`) calls) which allows shell injection;
change both to use child_process.spawnSync with argument arrays (e.g.,
spawnSync('lsof', ['-ti', String(port)]) and for the fallback use
spawnSync('netstat', ['-an']) and pipe/filter the stdout in Node rather than
using a shell grep), handle ENOENT the same way, and preserve the existing
stdout/trimming/error handling logic used in the try/catch around those calls.
- Line 11: Replace the execSync call with spawnSync: import spawnSync from
'child_process' (or include it when importing execSync), call
spawnSync('netstat', ['-ano'], { encoding: 'utf-8' }) to get stdout, then
search/filter the stdout string for the port (e.g., split by newlines and find a
line that includes `:${port}`) instead of using a shell pipeline to findstr;
update the const out assignment to use the spawnSync stdout and handle
empty/no-match cases similarly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eb8d11de-840a-42a1-863d-ef397d7616cc

📥 Commits

Reviewing files that changed from the base of the PR and between 0eab5b5 and 7f3c95e.

⛔ Files ignored due to path filters (1)

yarn.lock is excluded by !**/yarn.lock, !**/*.lock

📒 Files selected for processing (34)

.agents/skills/verify-recipe-author/SKILL.md
.claude/skills/verify-recipe-author/SKILL.md
.dockerignore
.github/workflows/verify-pr.yml
.gitignore
.verify-recipes/.eslintrc.cjs
.verify-recipes/.gitkeep
.verify-recipes/_recipe-authoring-guide.md
.verify-recipes/_util.ts
.verify-recipes/example-smoke.spec.ts
package.json
scripts/verify-pr-author.ts
scripts/verify-pr-generate.ts
scripts/verify-pr.ts
scripts/verify/README.md
scripts/verify/SECURITY.md
scripts/verify/__fixtures__/stub-assistant-reply-clean.txt
scripts/verify/__fixtures__/stub-assistant-reply-with-unused-var.txt
scripts/verify/__fixtures__/stub-assistant-reply.txt
scripts/verify/agent-dispatch.ts
scripts/verify/agent-prompt.ts
scripts/verify/boot.ts
scripts/verify/core.ts
scripts/verify/lint-invocation.ts
scripts/verify/playwright.config.ts
scripts/verify/recipe-author-core.ts
scripts/verify/recipe-deny.ts
scripts/verify/recipe-retry-policy.ts
scripts/verify/recipes/triage-table.ts
scripts/verify/runner.ts
scripts/verify/sandbox.ts
scripts/verify/symlink.ts
scripts/verify/sync.ts
scripts/verify/triage.ts

coderabbitai · 2026-05-11T13:04:01Z

+  try {
+    result = await runRecipeAuthor({ bundle, dispatch, runDir, attempt, mode });
+  } catch (err) {
+    const msg = err instanceof Error ? err.message : String(err);
+    console.error(`[verify-pr-author] dispatch failed: ${msg}`);
+    return 1;


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist dispatch failures to result.json.

If dispatchRecipeAuthor() or stdin consumption throws here, the CLI exits with only stderr output. That breaks the “always leave a structured per-run result” contract the rest of this flow relies on.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/verify-pr-author.ts` around lines 186 - 191, The catch block after runRecipeAuthor currently only logs to stderr and returns; instead build a structured failure result and persist it to result.json before returning. In the catch for runRecipeAuthor (and any thrown by dispatchRecipeAuthor/stdin), populate the existing result variable (or create one) with failure fields (e.g. success: false, error: msg, errorStack: err.stack, attempt info), then write that object to result.json (using the same JSON structure the rest of the flow expects) and flush to disk before calling console.error and returning 1 so every run always leaves a well-formed result.json.

coderabbitai · 2026-05-11T13:04:02Z

+  // Net-new dangling-symlink heal: if target exists as a symlink but points to a missing location,
+  // unlink it so ensureSymlink can recreate it correctly.
+  try {
+    const stat = await lstat(target);
+    if (stat.isSymbolicLink()) {
+      try {
+        const linkTarget = await readlink(target);
+        await access(linkTarget);
+      } catch {
+        await unlink(target);
+        console.log('[symlink] healed dangling target ' + target);
+      }
+    }
+  } catch (e: any) {
+    if (e?.code !== 'ENOENT') throw e;
+  }


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix relative symlink target resolution.

The dangling-symlink healing logic has a path-resolution bug. When readlink(target) returns a relative path, it's relative to dirname(target), not process.cwd(). However, access(linkTarget) resolves relative paths against process.cwd(), causing incorrect validation.

Example failure: If a symlink at /repo/foo/link points to ../bar/file, readlink returns ../bar/file, but access checks {cwd}/../bar/file instead of /repo/foo/../bar/file.

🔧 Proposed fix to resolve relative symlink targets correctly

const linkTarget = await readlink(target); - await access(linkTarget); + const resolvedTarget = path.isAbsolute(linkTarget) + ? linkTarget + : path.resolve(path.dirname(target), linkTarget); + await access(resolvedTarget); } catch {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/verify/symlink.ts` around lines 27 - 42, The dangling-symlink heal code incorrectly calls access(linkTarget) directly because readlink can return a relative path resolved against dirname(target); change it to resolve relative targets first by computing const resolved = path.isAbsolute(linkTarget) ? linkTarget : path.resolve(path.dirname(target), linkTarget) and then call await access(resolved) (keep existing lstat, readlink, access, unlink flow and ensure path is imported/available).

valentinpalkovic · 2026-05-11T13:14:58Z

Activation Gates A1–A4 — Results

Approach

A4 testing run on valentinpalkovic/storybook fork to avoid noise on upstream. Fork's next was fast-forwarded to valentin/agentic-review-harness, secret provisioned, draft PR opened + labeled.

A1 — action SHAs pinned ✅

Commit 478ba479ec2. All 4 third-party actions resolved to commit SHAs of latest stable tags.

A2 — `ANTHROPIC_API_KEY` repo secret ✅

Provisioned on both upstream and fork.

A3 — live AC-V4-1 + AC-V4-3b ✅

Locally validated against bundle from #34761.

Run 1 (populate):  input=11727  cache_creation=4358   cache_read=0     out=958
Run 2 (hit):       input=6      cache_creation=11721  cache_read=4358  out=867

Run 2 cache_read_input_tokens=4358 ≥ 1024. AC-V4-1, AC-V4-3b PASS.

A4 — label-fire end-to-end ✅ (modulo docker gap)

Workflow fires on label, actor-permission gate passes. 7 of 8 substantive steps pass on run 25674270957:

Step	Result
Check actor permission	✅
Setup Node.js and Install Dependencies	✅
Setup Bun	✅
Fetch PR diff	✅
Generate bundle (gh + triage)	✅
Author recipe (Anthropic SDK call)	✅
Run harness in container	❌ no Dockerfile in repo
Upload artifacts	✅ (7992 bytes, .verify-output/...)
Post PR comment	✅ (robust fallback)

Fixes shipped to make activation work

Commit	Fix
`478ba479ec2`	Pin 4 action SHAs
`51ca9477016`	`MODEL_ID_MAP` pointed at non-existent `claude-opus-4-5-20250929` → use valid IDs
`e3aae89b69d`	Workflow missing `yarn install` + `Post PR comment` referenced non-existent `latest/` symlink
`2a1aa10b9e9`	Workflow missing `bun` setup (scripts use bun)
`a256f3a7197`	`path: .verify-output/*/` finds 0 files in upload-artifact v7
`7b660d73e26`	Debug step (used to confirm dir layout)
`0993b55e24e`	`include-hidden-files: true` (`.verify-output` is dot-prefixed)

Remaining activation gap

v5-0 — Dockerfile + image build in CI. Run harness in container step references verify-harness:pinned-sha. No Dockerfile in repo. Needs:

scripts/verify/Dockerfile defining the spec-runner sandbox (Node + Playwright + minimal deps, non-root user)
Workflow step to build the image (or pull from a registry) before the harness step

Until v5-0, the harness step always fails. Everything upstream of it now works.

🤖 Generated with Claude Code

…posites LLM-authored single-round PR verification harness running under pull_request_target on ci:verify label. Authors a Playwright spec from the PR diff, executes against internal-ui or a sandbox template, runs vision evidence-check + PR-added unit tests, posts a verdict comment. Composite actions (.github/actions/agentic-pr-prepare + agentic-pr-publish) factor the trust-boundary plumbing. Hardenings: C1 HMAC-bound verdict, srt-sha256 caller-inline, sync-path traversal guards, jq-emitted srt settings, file-path screenshot output, provenance secret to file, env-mapped tokens. Layer-2 srt (bubblewrap) isolation wraps all PR-controlled steps. Includes eval-driven fixes: vitest --cache=false, evidence-check scans $PR_HEAD_DIR/.verify-output, playwright-report path via runId, evidence reasoning 2000-char cap, recipe-body block in PR comment, 4 authoring triage rules (addon-docs→sandbox, ActionBar hover+scroll, sidebar expand, .first()-visible), srt allowWrite includes sandbox root. Clean 64-file harness footprint atop current next (prior squash had captured stale upstream snapshots surfacing as spurious reverts).

…ssing Non-visual coverage gap (5 of 9 eval regressions were aria/XSS/type-only where vision returns `undetermined`): - scripts/verify/mode.ts: parse `@verify-mode` header (visual default, behavioral, pure-fn, build-config; type-only deliberately excluded). - core.ts: `mode` on VerifyResult, added to SIGNED_FIELDS so a forged in-srt result cannot claim a non-visual mode to dodge vision. - verify-pr.ts: parse/log/stamp mode; visual+behavioral share the Playwright path; pure-fn/build-config emit an explicit `skipped` (no false verdict) until their execution harness ships. - verify-evidence-check.ts: skip vision when mode != visual (kills the useless `undetermined` on behavioral recipes). - _recipe-authoring-guide.md §12.5: HARD GATE mode-selection triage + behavioral worked example (aria/XSS → behavioral). Scratch-dir blessing (sanctioned FS-write path, no srt loosening): - _util.ts: RecipePage.scratchDir + writeFixture(relPath, contents), traversal/absolute guarded; $PR_HEAD_DIR/.verify-scratch is already in srt allowWrite. - guide §4 documents it; .gitignore += .verify-scratch. DESIGN-nonvisual-coverage.md records full design; coverage gate (Part C) and pure-fn/build-config wiring deferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ule import Wave finding (#36 try-pr-34649 a11yRunner): recipe-author correctly chose @verify-mode: behavioral but reached the changed module via in-browser dynamic import() + monkeypatch — the deny-regex gate rejected it at attempt 1 with no retry, producing "no verdict". - recipe-author-core.ts: on assertNoDeniedPatterns failure, build a retry message (denied pattern + §12.5 pointer) and loop, mirroring the lint failure path. Terminal `deny-regex-hit` only after MAX_RECIPE_ATTEMPTS. - _recipe-authoring-guide.md §12.5: HARD GATE — a behavioral recipe must never import()/monkeypatch/eval the changed module (deny-regex blocks it pre-run = no verdict). Drive the public UI path and assert observable effect; if no UI path exists, fall back to visual smoke + filterPageErrors rather than fabricating a module import. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rity no-verdict cause Wave finding (#31 try-pr-34712 XSS): recipe-author correctly chose behavioral mode and drove the change via the public manager-api `setOptions` path (no module import), but needed `(window as any).__STORYBOOK_ADDONS_MANAGER` to reach the runtime singleton. `@typescript-eslint/recommended` makes `no-explicit-any` an error, so the scoped lint gate failed twice → no verdict. - .verify-recipes/.eslintrc.cjs: `@typescript-eslint/no-explicit-any: 'off'`. Code-quality rule, NOT a security control — deny-regex and no-restricted-{globals,imports,syntax} remain the load-bearing gates. `as any` for window/manager-api globals is correct and unavoidable. - _recipe-authoring-guide.md §12.5: note that `as any` for runtime globals is allowed; don't waste retries trying to type them. Verified: behavioral recipe using `(window as any).__STORYBOOK_ADDONS_MANAGER` now lints clean (exit 0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ActionBar scope Wave findings (#28/#29/#31 stuck at regression despite passing PR unit tests — recipe-author mis-targeted the DOM): - ActionBar/Canvas rule was conflating the docs-Canvas Zoom/Show-code toolbar with the generic `ActionBar` component. Scope-tagged it to the docs-Canvas surface only. - New HARD GATE "additive-only API changes with no story/consumer" — the #1 false-regression cause. #28/#29 add `ActionItem.ariaLabel` but no story or in-diff consumer passes it, so the attribute is never in the DOM; asserting it always fails. Rule: detect additive-no-consumer, fall back to `@verify-mode: visual` smoke on the component's existing story (`components-actionbar--many-items`), never `getByRole('toolbar')` (the component renders plain <button>s) nor `.docs-story`. - New HARD GATE for `Brand` / `theme.brand.title`: the sanitized dangerouslySetInnerHTML path runs ONLY when `theme.brand.image === null`. Target the existing `manager-sidebar-heading--only-text` / `--link-and-text` stories (already `{title, image:null}`); never runtime `api.setOptions({theme})` (#31 false regression — never reaches the path). XSS-inert proof is the PR's unit test; recipe is a render/boot smoke. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… TMPDIR pinned Two distinct wave-#31/#36 root causes, both false regressions: (a) _util.ts previewRoot() filtered `#storybook-root:visible`. Stories with `parameters.layout:'fullscreen'` + the internal-ui side-by-side/stacked theme decorator wrap the story so #storybook-root has a zero-size (Playwright-"not visible") box though it rendered — locator matched nothing, waitForStoryLoaded timed out (#31 manager-sidebar-heading--*). Use `:has(> *)` instead: selects whichever container actually has children, keeps story-vs-docs disambiguation, drops the bounding-box requirement. (b) verify-pr.yml unit-test step runs `env -i … srt … yarn vitest`. `env -i` strips TMPDIR, so Yarn's run-temp realpaths a nonexistent srt path (`lstat '/tmp/claude'` ENOENT) and aborts before vitest starts → false "vitest exited without JSON report" regression (#36 a11yRunner). Pin TMPDIR to an existing allowWrite dir ($PR_HEAD_DIR/.verify-output/vitest-tmp), same rationale REPORT/LOG already live there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…bans root-visible assert Re-run of #36/#31 showed both prior fixes missed the real cause: - #36: TMPDIR pin had zero effect — Yarn's mktempPromise still ENOENT `/tmp/claude`. Root cause: srt derives its sandbox tmp from CLAUDE_CODE_TMPDIR, NOT TMPDIR. The main recipe run inherits it via $GITHUB_ENV ($SANDBOX_TMPDIR); the unit-test step's `env -i` strips it, so srt falls back to its hardcoded `/tmp/claude` (never created). Pass CLAUDE_CODE_TMPDIR=$VITEST_TMPDIR (existing allowWrite dir) in env -i. - #31: previewRoot `:has(> *)` fix removed the _util.ts:66 timeout, but the recipe-author hand-rolled `expect('#storybook-root').toBeVisible()` which is "hidden" for `Sidebar/Heading` (layout:fullscreen + side-by-side = zero-box root). Brand triage rule now explicitly bans root-visibility asserts and prescribes a child `toBeAttached()` content assertion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ndate it Wave #36 (after CLAUDE_CODE_TMPDIR fix let vitest run): Playwright recipe failed on `expect(consoleErrors).toEqual([])` because the srt egress jail denies every non-allowlisted domain, so internal-ui's external probes always log `Failed to load resource: net::ERR_INTERNET_DISCONNECTED` — environmental, not a PR regression. No console-error equivalent of filterPageErrors existed. - _util.ts: add `filterConsoleErrors()` dropping `net::ERR_*` (INTERNET_DISCONNECTED / NAME_NOT_RESOLVED / BLOCKED_BY_CLIENT / CONNECTION_REFUSED / FAILED) + the shared cross-origin sessionStorage SecurityError. Verified: keeps only genuine errors. - _recipe-authoring-guide.md §3: MANDATORY subsection — never assert the raw consoleErrors array; always filterConsoleErrors(consoleErrors), mirroring the filterPageErrors mandate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…on-visual flipped Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pragmatic-Programmer Wave-1 (EPIC-1/2/4), 17 P0 defects: Crash Early / Do No Harm: - loadCostLedger: ENOENT→0, unreadable→fail-safe over-budget; atomic ledger write (tmp+rename) so concurrent reads can't see a torn file - getPricing: unknown model throws instead of silent opus fallback; telemetry warns instead of silently charging $0 - telemetry sink hiccup warns + exits 0 (no longer gates the verdict) - computeVerdict: zero-tests carries an explicit regressionReason - sync/symlink failures propagate (no stale-core false "verified"); atomic dir copy; pruneOldRuns surfaces undeletable dirs Finish What You Start: - preflightPort uses an authoritative bind probe (no fail-open) - abort: SIGTERM→timed SIGKILL, process-group kill, await child exit - abort-race listener leak fixed; runner treats AbortError as non-fatal - SDK call honours an optional AbortSignal; runResync fetch timeout Design by Contract (HMAC verdict integrity): - trusted post-processors re-sign verify-result.json after mutation via exported core.signResultFile; result+sig published atomically (sig before result); evidence-check missing-key writes undetermined - SECURITY.md §c1 cites core.RESULT_FILENAME/verifyResultPath - workflow re-sign is rollout-guarded (feature-detect signResultFile) Security-reviewed: SHIP, 0 CRITICAL/HIGH/MED. Follow-ups (LOW) tracked as Wave-1.1: atomicWrite O_EXCL, derive-verdict atomic write, SIGNED_FIELDS exclusion assertion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Layers the prior-session Pragmatic-Programmer H1–H7 fixes on top of the committed Wave-1 P0 work. Clean H1–H7 files taken as-is; the 5 files that overlap Wave-1 were 3-way merged (base f573e99 ↔ Wave-1 ↔ H1–H7): - H1: delete dead recipe-retry-policy.ts (inlined into recipe-author-core) - H2: model-pricing.ts single source; agent-dispatch + append-telemetry now consume it. Semantic merge with Wave-1: getPricing keeps W1's fail-loud HARD throw on an unknown model (no silent opus fallback), expressed against the H2 single-source table; telemetry single-sources too but warns + records 0 on unknown (non-blocking, unchanged intent). - H3: deny-policy single-sourced; docs cite recipe-deny.ts - H4: env-jail single source (strip-untrusted-secrets.sh) — merged cleanly with Wave-1's EPIC-4 re-sign + rollout guard in verify-pr.yml - H5/H6: README skeleton + SKILL runbook corrected - H7: dead provenance verifier removed; SECURITY.md §c1 merged with Wave-1's RESULT_FILENAME/verifyResultPath citation - evidence-check: union import — keeps Wave-1 signResultFile (re-sign) AND H2 getModelPrice (vision cost single-source) Verified: 12 .ts typecheck clean, verify-pr.yml YAML valid, shellcheck clean, pricing roundtrip (opus-4-7 $5/$25, unknown→throws), zero conflict markers, no unrelated cross-branch WIP staged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nvariant guard Security-review follow-ups on the PR-verify harness (all in-worktree): - core.ts atomicWrite: unpredictable randomBytes(12) temp suffix (was guessable pid+Date.now()) opened O_WRONLY|O_CREAT|O_EXCL|O_NOFOLLOW 0o600 so a pre-planted temp path / symlink cannot hijack, observe, or redirect the trusted signed-result write before the atomic rename (CWE-377/59). Exported as the one safe writer; no-retry-on-EEXIST documented (a retry would reintroduce a name-guessing oracle). - core.ts: module-load assertion that SIGNED_FIELDS is disjoint from {unitTests, evidenceRetry, evidenceVerdict} — crashes early instead of silently breaking every verified PR if a post-processor field is ever moved into the signed set (EPIC-4.1 HMAC invariant, fail-loud). - ci/derive-verdict.ts: both result writes (forgery-downgrade persist + pre-re-sign write) routed through core.ts atomicWrite so a reader never sees a torn result and the temp path can't be symlink-hijacked. Re-sign failure after the unit-test merge now fails the step loudly (process.exitCode=1) instead of persisting an unregenerable stale .sig. - SECURITY.md §c1: rewrote the post-signing invariant into the two distinct mechanisms (machine-enforced disjointness = load-bearing; re-sign = mandatory only where a signed field changes with the secret in scope, else defense-in-depth). Corrected the scope: only the unit-test jq writers are deliberately unsigned (secret stripped before untrusted vitest); evidenceRetry IS re-signed by the workflow. Verified: tsc clean, node --experimental-strip-types --check + module load (assertion passes, non-vacuous), atomicWrite functional (write + atomic overwrite + zero tmp leftovers), workflow YAML still parses. Mandatory security-reviewer pass: 0 CRIT/HIGH; MED (fail-loud re-sign) and LOW(a) applied; LOW(b) test-pin deferred to Wave 2 (EPIC-5). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…8 single-sourcing EPIC-5 (test the verifier — was 3 test files, zero on the security/cost core): 8 new vitest suites, 181 tests, auto-globbed by the `scripts` project (no CI wiring needed; satisfies 5.11): - 5.1 recipe-deny.test.ts — all 19 DENY patterns, exact 1-based line, per-line tripwire (no comment-awareness) pinned; eval-#36 `dynamic import(` pinned (isolated + overlapping). - 5.2 agent-dispatch-cost.test.ts — budget gate boundary (computed, not hardcoded), resolveModelId round-trip, pricing digit-transpose guard. - 5.4 derive-verdict-hmac.test.ts — saboteur suite (forged/tampered/ correct/wrong-secret/non-signed vs signed field) + the deferred Wave-1.1 LOW(b) disjointness pin, made non-vacuous (poisoned-set replica proves the guard has teeth). - 5.6 triage/target-suggest, 5.7 mode/target (30-line window edge), 5.10 agent-prompt-sanitize (ANSI/NUL/fence redaction, cap boundary). EPIC-6: - 6.1 SKILL.md — de-absolutized 4 hardcoded /Users/... paths +1 stale prose note to runtime-resolved $REPO_ROOT (git rev-parse). - 6.8 srt pin — replaced manual-paste srt-version/srt-sha256 in verify-pr.yml with committed scripts/verify/srt.lock.json read fail-closed from the TRUSTED base checkout (values byte-identical: 0.0.51 / 36de…6338); load-bearing sha verification in the composite untouched. Added workflow_dispatch-only _srt-sha-probe.yml. Mandatory separate review pass (security-reviewer + code-reviewer), findings addressed and re-verified: - sec HIGH: probe no longer auto-commits/pushes the supply-chain pin — emit-only (summary + artifact + outputs), contents: read, human lands it via reviewed PR (restores the "reviewed diff" invariant). - sec LOW: strict srt version regex (rejects ./../leading-trailing dot) in both workflows. - code HIGH: derive-verdict-hmac.test.ts typed (Partial<VerifyResult>/ RecipeTest/StepStatus) — 22 tsc errors cleared, assertions unchanged. - code MED/LOW: dropped unused beforeEach import; over-claiming "ordering invariant" tests renamed to honest "disjoint rules resolve independently" (no dual-match input exists in the real globs); added target grammar negative cases. Verified: tsc clean for all 8 test files (scripts/tsconfig.json), 181/181 green, both workflows parse, bash -n OK, srt values unchanged, trust boundary + fail-closed intact (security-reviewed). Scope clean (no yarn.lock/shared-tree drift). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed May 11, 2026

View reviewed changes

valentinpalkovic mentioned this pull request May 11, 2026

chore: verify-harness A4 label-fire test (do not merge) #34763

Closed

valentinpalkovic changed the title ~~feat(scripts): agentic PR verification harness (PoC → v4)~~ feat(scripts): agentic PR verification harness (v6 — runner-native, no Docker) May 11, 2026

valentinpalkovic changed the title ~~feat(scripts): agentic PR verification harness (v6 — runner-native, no Docker)~~ feat(scripts): agentic PR verification harness (v6 single-round — author + execute in one run) May 12, 2026

valentinpalkovic force-pushed the valentin/agentic-review-harness branch from d57333c to 046eda5 Compare May 14, 2026 10:54

valentinpalkovic requested review from jonniebigodes and kylegach as code owners May 14, 2026 10:54

valentinpalkovic force-pushed the valentin/agentic-review-harness branch 13 times, most recently from 160bdcd to ad75ba9 Compare May 15, 2026 19:40

valentinpalkovic force-pushed the valentin/agentic-review-harness branch from ad75ba9 to 099b6f7 Compare May 15, 2026 19:48

valentinpalkovic and others added 5 commits May 15, 2026 22:42

valentinpalkovic and others added 7 commits May 18, 2026 23:17

docs(verify): record 2026-05-18 eval outcome — 7 fixes shipped, 4/5 n…

f573e99

…on-visual flipped Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Uh oh!

Conversation

valentinpalkovic commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture diffs vs v5-0

Architecture diffs vs the earlier v6 two-round model

Per-recipe targets

CI workflow shape

Files dropped (vs v5-0 + two-round v6)

New / changed in single-round v6

Validation — fork firetest

Test plan

Security posture (single-round v6)

Summary by CodeRabbit

Day 3 + Day 4 follow-ups (post-v6 single-round pivot)

Verdict layering — three orthogonal signals

Retry loop

Sandbox-target CI path

Layer-1 security: secret stripping

Layer-2 security: @anthropic-ai/sandbox-runtime jail

Recipe-author quality

Compile-failure surfacing

UX polish

Telemetry

Validation

Uh oh!

github-actions Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

valentinpalkovic commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Activation Gates A1–A4 — Results

Approach

A1 — action SHAs pinned ✅

A2 — ANTHROPIC_API_KEY repo secret ✅

A3 — live AC-V4-1 + AC-V4-3b ✅

A4 — label-fire end-to-end ✅ (modulo docker gap)

Fixes shipped to make activation work

Remaining activation gap

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

valentinpalkovic commented May 11, 2026 •

edited

Loading

Layer-2 security: `@anthropic-ai/sandbox-runtime` jail

github-actions Bot commented May 11, 2026 •

edited

Loading

coderabbitai Bot commented May 11, 2026 •

edited

Loading

valentinpalkovic commented May 11, 2026 •

edited

Loading

A2 — `ANTHROPIC_API_KEY` repo secret ✅