
feat(scripts): agentic PR verification harness (v6 single-round — author + execute in one run) #34762

Open
valentinpalkovic wants to merge 13 commits into next from valentin/agentic-review-harness

Conversation

@valentinpalkovic
Contributor

@valentinpalkovic valentinpalkovic commented May 11, 2026

Summary

Introduces an agentic PR verification harness under scripts/verify/ and .github/workflows/verify-pr.yml.

v6 single-round: drops the v5-0 Docker / Verdaccio / image-build-provenance scaffolding and collapses the previous two-round flow (author at base → maintainer commits at PR head → re-fire ci:verify) into a single workflow run. The PR-head checkout is staged first; the LLM-authored Playwright recipe is materialised straight into $RUNNER_TEMP/pr-head/.verify-recipes/pr-<#>.spec.ts (ephemeral — never committed) and executed in the same job. ANTHROPIC_API_KEY stays scoped to the Author step only; the Verify step that runs the recipe has no API key and no GITHUB_TOKEN.

Architecture diffs vs v5-0

  • No Docker, no Verdaccio, no image-build-provenance. Runs on a stock GitHub Actions runner — same isolation profile as the existing Storybook PR CI (which already executes untrusted contributor code in normal test runs).
  • ~70 % of v5-0's complexity was supply-chain hardening (digest pins, harden-build-context overlay, lifecycle-script stripping, Verdaccio publish pipeline). v6 covers the same surface via enableScripts: false + lockfile + .npmrc purge.
  • BuildKit's layer isolation kept dropping code/core/dist between stages 6 and 7 across 11 v5-0 firetest rounds. v6 eliminates the multi-stage Docker boundary entirely.

Architecture diffs vs the earlier v6 two-round model

| Aspect | Two-round v6 | Single-round v6 |
| --- | --- | --- |
| Author → execute gating | Committed-spec human review at PR head | None; agent authors and executes in one run |
| Recipe storage | Committed to .verify-recipes/pr-<#>.spec.ts on the PR branch | Materialised into $RUNNER_TEMP/pr-head/.verify-recipes/ (ephemeral, artifact-only) |
| Workflow rounds | 2 (kick off, review committed spec, re-fire) | 1 |
| Load-bearing controls | Committed-spec review + scoped key + deny-regex + lint + actor-perm + label gate | Scoped key + deny-regex + lint + structural pattern checks + controlled outputSpecPath + actor-perm + label gate + trusted-script provenance |

See scripts/verify/SECURITY.md for the full re-audit. Local-dev with the verify-recipe-author skill (human-review path) remains supported for ambiguous PRs.

Per-recipe targets

Recipes declare execution target via a single header line (scanned in the first 30 lines):

| Target | Boots | When |
| --- | --- | --- |
| internal-ui (default) | yarn storybook:ui:build once → yarn http-server code/storybook-static -p <port> | Most fixes; exercises the monorepo's own Storybook UI against the PR head's compiled packages. |
| sandbox:<template> | Pre-existing sandbox flow (snapshot, sanitize, sync core/dist symlink, boot sandbox storybook) | Template-specific bugs (rare). |

Local CLI: yarn verify-pr <PR#> resolves to .verify-recipes/pr-<#>.spec.ts; explicit --recipe-spec <path> still works.
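
To make the header convention concrete, here is a minimal sketch of how such a target header could be parsed. It is not the code in scripts/verify/target.ts; the names parseVerifyTarget, VerifyTarget, and HEADER_SCAN_LINES are hypothetical, and the behaviour on an unrecognised value (throwing) is an assumption.

```typescript
// Hypothetical sketch; the real parser lives in scripts/verify/target.ts and may differ.
type VerifyTarget =
  | { kind: 'internal-ui' }
  | { kind: 'sandbox'; template: string };

// The PR description says the header is scanned in the first 30 lines.
const HEADER_SCAN_LINES = 30;
const TARGET_HEADER = /^\s*\/\/\s*@verify-target:\s*(\S+)\s*$/;
const SANDBOX_PREFIX = 'sandbox:';

function parseVerifyTarget(specSource: string): VerifyTarget {
  const head = specSource.split(/\r?\n/).slice(0, HEADER_SCAN_LINES);
  for (const line of head) {
    const match = TARGET_HEADER.exec(line);
    if (!match) continue;
    const value = match[1] ?? '';
    if (value === 'internal-ui') return { kind: 'internal-ui' };
    if (value.startsWith(SANDBOX_PREFIX) && value.length > SANDBOX_PREFIX.length) {
      return { kind: 'sandbox', template: value.slice(SANDBOX_PREFIX.length) };
    }
    // Assumption: an unknown value is an authoring error rather than a silent default.
    throw new Error(`Unrecognised @verify-target value: ${value}`);
  }
  // No header found within the scan window: fall back to the documented default.
  return { kind: 'internal-ui' };
}
```

A header placed after line 30 is ignored under this sketch, so the recipe would run against internal-ui.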

CI workflow shape

.github/workflows/verify-pr.yml:

  1. Check actor permission (≥ write)
  2. Checkout base (pull_request.base.ref) + Setup Node.js / install root deps + Setup Bun
  3. Compute paths: echo "PR_HEAD_DIR=$RUNNER_TEMP/pr-head" >> $GITHUB_ENV (workaround for runner.temp not being a valid job-env context)
  4. Manual git clone PR head into $PR_HEAD_DIR (clone target is outside the base checkout so nx / yarn / jiti module resolution cannot escape upward into the base node_modules)
  5. gh pr diff → /tmp/pr.diff
  6. yarn verify-pr-generate --pr <#> --force --output "$PR_HEAD_DIR/.verify-recipes/pr-<#>.spec.ts" — trusted base scripts emit the prompt bundle with the ephemeral output path baked in
  7. yarn verify-pr-author --bundle … (ANTHROPIC_API_KEY scoped here) dispatches the LLM, runs deny-regex + lint + structural pattern checks, and atomically renames the candidate onto the ephemeral path inside $PR_HEAD_DIR
  8. Verify PR step (working-directory = $PR_HEAD_DIR):
    yarn install --immutable
    yarn playwright install --with-deps chromium
    yarn nx compile core              # topo-pre to satisfy eslint-plugin prebuild
    yarn nx run-many -t compile       # all 42 projects so internal-ui main.ts resolves
    yarn verify-pr --recipe-spec ".verify-recipes/pr-${PR_NUMBER}.spec.ts"
  9. Read verdict → on verified, gh label create (idempotent) + gh pr edit --add-label verified-by-harness
  10. Push screenshots to the _verify-screenshots orphan side branch (raw.githubusercontent.com URLs pinned at the commit pushed)
  11. Upload $PR_HEAD_DIR/.verify-output/ + the ephemeral spec + the base-checkout author run-dir as the artefact bundle
  12. Post PR comment with verdict, trace replay link, and inline screenshots

Files dropped (vs v5-0 + two-round v6)

scripts/verify/Dockerfile                      # v5-0
scripts/verify/harden-build-context.sh         # v5-0
scripts/verify/strip-lifecycle-scripts.mjs     # v5-0
scripts/verify/__tests__/dockerfile-lint.test.ts        # v5-0
scripts/verify/__tests__/head-sha-assertion.test.ts     # v5-0
scripts/verify/__tests__/in-container-shortcircuit.test.ts  # v5-0
.github/actions/verify-spec-precheck/action.yml         # two-round v6 (gated on committed spec)

Plus the v5-0 additions in .dockerignore and the renovate.json Docker-pin rules.

New / changed in single-round v6

  • scripts/verify/target.ts: // @verify-target: header parser (default internal-ui).
  • scripts/verify/internal-ui.ts: storybook:ui:build + http-server boot.
  • scripts/verify-pr.ts — rewritten around per-recipe target dispatch; positional <PR#> arg; dropped inContainer short-circuit, HEAD_SHA runtime assertion, imageDigest field.
  • scripts/verify-pr-generate.ts: --output <path> flag so CI can point the bundle at an ephemeral PR-head path; default behaviour preserved for local-dev.
  • scripts/verify/recipe-author-core.ts — provenance header acknowledges both modes (local-dev human-reviewed; CI single-round materialised and executed without intermediate review).
  • scripts/verify/core.ts: template widened from a literal to string; computeVerdict filters generic "Failed to load resource:" console 404s (low-signal browser-side fetch misses).
  • scripts/verify/SECURITY.md — full rewrite for single-round: load-bearing controls, what single-round explicitly gives up, when to add stronger isolation.
  • scripts/verify/README.md and scripts/verify/RUNBOOK.md — full rewrite around single-round v6 architecture and failure-mode debugging.
  • .verify-recipes/_recipe-authoring-guide.md — new §12 documenting target selection.

Validation — fork firetest

Validated end-to-end on a throwaway fork PR (valentinpalkovic/storybook#12).

  • Single-round flow: run 25721180250 — conclusion success in 4m22s
  • Final verdict: verified (target internal-ui)
  • verified-by-harness label applied automatically
  • PR comment posted with verdict + inline screenshots + trace replay link
  • Nine prior CI rounds during v6 development (root causes captured in commit messages). The original two-round v6 shipped end-to-end first; the single-round pivot then validated cleanly on the next round.

Test plan

  • v6 force-pushed to fork's next and to a throwaway PR (#12)
  • ci:verify label fires the workflow
  • Recipe is authored AND executed in the same workflow run (no committed spec at PR head)
  • yarn install --immutable succeeds in $RUNNER_TEMP/pr-head/
  • yarn nx run-many -t compile succeeds for all 42 projects
  • yarn verify-pr boots internal-ui, runs Playwright recipe, writes verify-result.json
  • On verified: verified-by-harness label is applied (auto-created if missing)
  • PR comment renders verdict block + inline screenshots (raw.githubusercontent.com URLs pinned at the screenshot side-branch commit)
  • Land first real PR through the harness on next after merge

Security posture (single-round v6)

Load-bearing controls — re-audited in scripts/verify/SECURITY.md:

  • Scoped API key. ANTHROPIC_API_KEY is mounted only on the Author recipe step's env: block. The Verify step has no API key and no GITHUB_TOKEN.
  • Static deny-regex pass. Blocks child_process, fs.unlink*, fs.rm*, process.exit, eval(, import 'node:*', etc. before the spec lands on disk.
  • Scoped lint gate. Pinned ESLint config; first-failure retry with categorised errors; second failure aborts.
  • Structural pattern checks. listener-before-goto and finally-attach regex assertions on every dispatched candidate.
  • Controlled output path. bundle.outputSpecPath is computed by the trusted base script (verify-pr-generate) and consumed verbatim by recipe-author-core. The LLM cannot influence where its output lands.
  • Trusted-script provenance. verify-pr-generate, recipe-author-core, recipe-deny, lint-invocation, the authoring-guide, and the canonical-smoke reference all read from the base checkout (not the PR head), so a malicious PR cannot replace the gates.
  • Actor-permission gate + label gate + non-draft gate.
  • Local-dev fallback. The verify-recipe-author skill remains supported for PRs where a maintainer wants human review of the spec before execution.
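
As an illustration of the static deny-regex pass, the sketch below checks a candidate spec against a small pattern list before anything is written to disk. The names DENY_PATTERNS and assertNoDeniedPatterns are taken from the review walkthrough of scripts/verify/recipe-deny.ts, but the patterns themselves and the helper findDeniedPatterns are assumptions; the shipped list is likely longer and worded differently.

```typescript
// Illustrative only: the real list lives in scripts/verify/recipe-deny.ts.
const DENY_PATTERNS: ReadonlyArray<{ name: string; pattern: RegExp }> = [
  { name: 'child_process', pattern: /\bchild_process\b/ },
  { name: 'fs.unlink*', pattern: /\bfs\s*\.\s*unlink\w*/ },
  { name: 'fs.rm*', pattern: /\bfs\s*\.\s*rm\w*/ },
  { name: 'process.exit', pattern: /\bprocess\s*\.\s*exit\b/ },
  { name: 'eval(', pattern: /\beval\s*\(/ },
  // Matches both `from 'node:…'` and dynamic/side-effect `import('node:…')` forms.
  { name: "import 'node:*'", pattern: /\bfrom\s+['"]node:|\bimport\s*\(?\s*['"]node:/ },
];

// Hypothetical helper: returns the names of every pattern the candidate trips.
function findDeniedPatterns(candidate: string): string[] {
  return DENY_PATTERNS.filter(({ pattern }) => pattern.test(candidate)).map(({ name }) => name);
}

// Throws before the candidate spec is materialised; callers treat a throw as a hard stop.
function assertNoDeniedPatterns(candidate: string): void {
  const hits = findDeniedPatterns(candidate);
  if (hits.length > 0) {
    throw new Error(`Denied pattern(s) in candidate spec: ${hits.join(', ')}`);
  }
}
```

A textual scan like this can be evaded by string concatenation or indirection, which is one reason the list above treats it as a single control among several rather than a sandbox.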

What single-round gives up: no human review of the executed spec, and no replay-by-default in version control (artefacts have 14-day retention). If either is unacceptable for a given PR class, fall back to the local-dev path.

If the threat model later expands to processing third-party PRs at scale with adversarial recipe authors, wrap the playwright test step in sandbox-runtime (bubblewrap on Linux) — ~10 lines of config per Anthropic's "Securely deploying AI agents" doc. Do not reintroduce the Docker + Verdaccio stack.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Full PR verification harness: agent-assisted recipe authoring, safety deny-list and lint-driven retry, single-round in-band execution, CLI dispatch for authoring, and PR comments with trace-replay links, inline screenshots, and artifact uploads.
  • Documentation
    • New authoring guide, verification README, security posture documentation, and field-debugging runbook.
  • Chores
    • Playwright verification setup, helper utilities and fixtures, ESLint config for verification, updated ignore files, and new verification CLI scripts.


Day 3 + Day 4 follow-ups (post-v6 single-round pivot)

The squash commit feat(verify): Day 3 + Day 4 follow-ups on top of the single-round pivot adds:

Verdict layering — three orthogonal signals

  • Playwright (recipe execution) + vision evidence-check (claude-haiku-4-5 reading the diff + spec + screenshots) + PR-added unit tests (vitest on *.test.* files from the PR diff).
  • The final verdict is the logical AND of the Playwright and unit-test results. Vision is informational (it catches sr-only / invisible changes where assertions pass but screenshots can't confirm the change).
  • regressionReason derived from playwright-report.json when the recipe author doesn't populate one — reviewers see the failing test title + first error inline.
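
The layering can be sketched as a small pure function. Everything here is hypothetical (combineSignals, the Signals shape, the evidence labels); in particular, treating "no PR-added unit tests ran" as a pass is an assumption, not something the PR description states.

```typescript
// Hypothetical model of the three signals; field names are illustrative.
type Verdict = 'verified' | 'regression';

interface Signals {
  playwrightPassed: boolean;
  unitTests: { ran: boolean; passed: boolean };
  visionEvidence: 'found' | 'undetermined' | 'not-run';
}

function combineSignals(signals: Signals): { verdict: Verdict; notes: string[] } {
  const notes: string[] = [];
  // Assumption: a PR that adds no unit tests does not fail the unit-test gate.
  const unitTestsOk = !signals.unitTests.ran || signals.unitTests.passed;
  if (!signals.playwrightPassed) notes.push('Playwright recipe failed');
  if (!unitTestsOk) notes.push('PR-added unit tests failed');
  // Vision never changes the verdict; it only annotates it.
  if (signals.visionEvidence === 'undetermined') {
    notes.push('Vision could not confirm the change from screenshots (informational)');
  }
  const verdict: Verdict = signals.playwrightPassed && unitTestsOk ? 'verified' : 'regression';
  return { verdict, notes };
}
```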

Retry loop

  • Retry-on-regression: feeds Playwright error context (page snapshot + iframe a11y snapshot + first error from playwright-report.json) back to the recipe author as --retry-context. Author re-emits the spec, Playwright re-runs. Single retry; final verdict gates the label.
  • Retry-on-evidence-undetermined: feeds vision reasoning back so the author can target the diff more precisely.
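
The single-retry shape described above can be sketched as follows. This is a synchronous simplification with hypothetical names (runWithSingleRetry, AttemptResult); the real flow spans separate workflow steps and asynchronous processes.

```typescript
// Hypothetical shapes; the real retry context bundles snapshots and the first Playwright error.
interface AttemptResult {
  verdict: 'verified' | 'regression';
  errorContext?: string;
}

// An attempt authors a spec (optionally informed by retry context) and runs it.
type Attempt = (retryContext?: string) => AttemptResult;

function runWithSingleRetry(attempt: Attempt): { result: AttemptResult; retried: boolean } {
  const first = attempt();
  if (first.verdict === 'verified') {
    return { result: first, retried: false };
  }
  // Exactly one retry: feed the failure context back to the author, then accept the outcome.
  const second = attempt(first.errorContext ?? 'no error context captured');
  return { result: second, retried: true };
}
```

Whatever the second attempt returns is final, which matches the "single retry; final verdict gates the label" rule.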

Sandbox-target CI path

  • Recipes can set // @verify-target: sandbox:<template> (e.g. sandbox:vue3-vite/default-ts). The workflow detects the header, runs nx run <template>:sandbox (NX resolves implicitDependencies, emits the sandbox at code/sandbox/<key>), and verify-pr.ts boots Storybook against that sandbox instead of the internal-ui dev server.
  • Allowlisted templates: react-vite, react-webpack, vue3-vite, svelte-vite, angular-cli, nextjs, nextjs-vite (all default-ts).
  • Skips the global compile target when sandbox-bound — :sandbox handles all transitive deps via the NX project graph.
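
A minimal sketch of the allowlist check, assuming the template keys are exactly the seven names listed above with a /default-ts suffix. ALLOWLISTED_TEMPLATES and resolveSandboxTemplate are hypothetical names, and rejecting by throwing is an assumption about how the workflow reacts.

```typescript
// Template keys derived from the bullet above ("all default-ts"); the real list may be spelled differently.
const ALLOWLISTED_TEMPLATES: ReadonlySet<string> = new Set(
  ['react-vite', 'react-webpack', 'vue3-vite', 'svelte-vite', 'angular-cli', 'nextjs', 'nextjs-vite'].map(
    (name) => `${name}/default-ts`
  )
);

// Takes the header value (e.g. "sandbox:vue3-vite/default-ts") and returns the template key.
function resolveSandboxTemplate(target: string): string {
  const prefix = 'sandbox:';
  if (!target.startsWith(prefix)) {
    throw new Error(`Not a sandbox target: ${target}`);
  }
  const template = target.slice(prefix.length);
  if (!ALLOWLISTED_TEMPLATES.has(template)) {
    throw new Error(`Template is not allowlisted: ${template}`);
  }
  return template;
}
```

Checking against a fixed set before the template name reaches an nx run command keeps an LLM-authored header from selecting an arbitrary project target.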

Layer-1 security: secret stripping

  • pull_request_target runs build / sandbox / recipe code from the untrusted PR head as the runner user that holds GITHUB_TOKEN (contents:write, pull-requests:write) and ANTHROPIC_API_KEY.
  • The Verify-PR, Retry-on-regression, and Run-PR-added-unit-tests steps strip GITHUB_TOKEN, GH_TOKEN, and ANTHROPIC_API_KEY (via env -u) before invoking any PR-head script. Trusted scripts above (verify-pr-generate, verify-pr-author) still see the keys because env -u only strips them for the single command it wraps.
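
The workflow does this stripping in shell. For a Node-side equivalent, the sketch below builds a child-process environment without the three secrets; withoutSecrets and SECRET_ENV_KEYS are hypothetical names, not part of the harness.

```typescript
// The three variables named in the bullet above.
const SECRET_ENV_KEYS = ['GITHUB_TOKEN', 'GH_TOKEN', 'ANTHROPIC_API_KEY'] as const;

// Returns a copy of the environment with the secrets removed; the input is left untouched,
// mirroring how `env -u` affects only the command it wraps.
function withoutSecrets(
  env: Readonly<Record<string, string | undefined>>
): Record<string, string | undefined> {
  const copy: Record<string, string | undefined> = { ...env };
  for (const key of SECRET_ENV_KEYS) {
    delete copy[key];
  }
  return copy;
}
```

The result could be passed as the env option of a child-process spawn call so that only the untrusted command runs without the keys.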

Layer-2 security: @anthropic-ai/sandbox-runtime jail

  • Wraps yarn verify-pr (initial attempt + retry) in srt with a bubblewrap-backed FS + network jail. Defence-in-depth on top of Layer-1.
  • network.allowLocalBinding: true (Storybook dev server on localhost:6006); network.allowedDomains: [] (no public-internet egress).
  • filesystem.allowWrite: $RUNNER_TEMP, /tmp, $HOME/.cache, $HOME/.local/share, $HOME/.storybook.
  • filesystem.denyRead: $HOME/{.ssh, .aws, .docker, .npmrc, .gitconfig, .config/gh} (belt-and-suspenders alongside the env stripping).
  • CLAUDE_CODE_TMPDIR=$RUNNER_TEMP/sandbox-tmp so the sandbox's TMPDIR bind source exists on the host.
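
Collected into one object, the settings above might look like the sketch below. The key names are copied from the bullets; the surrounding shape (SandboxSettings, buildSandboxSettings) is hypothetical and is not a full reproduction of the sandbox-runtime settings schema, which should be checked against that package's documentation.

```typescript
// Only the fields mentioned in the bullets above; the real settings file may require more.
interface SandboxSettings {
  network: { allowLocalBinding: boolean; allowedDomains: string[] };
  filesystem: { allowWrite: string[]; denyRead: string[] };
}

function buildSandboxSettings(runnerTemp: string, home: string): SandboxSettings {
  return {
    network: {
      allowLocalBinding: true, // Storybook dev server binds to localhost
      allowedDomains: [], // no public-internet egress
    },
    filesystem: {
      allowWrite: [runnerTemp, '/tmp', `${home}/.cache`, `${home}/.local/share`, `${home}/.storybook`],
      denyRead: ['.ssh', '.aws', '.docker', '.npmrc', '.gitconfig', '.config/gh'].map(
        (entry) => `${home}/${entry}`
      ),
    },
  };
}
```

Serialising the result with JSON.stringify would give a file that could be handed to srt, assuming it accepts settings in this form.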

Recipe-author quality

  • Deterministic story-route derivation: scripts/verify/derive-story-routes.ts parses code/.storybook/main.ts via TS AST + inlines Storybook's auto-title / toId / storyNameFromExport algorithms. Routes injected into the prompt bundle verbatim — agents stop guessing 404 paths.
  • Full source of touched non-stories files in the prompt bundle (capped 250 lines per file, 4 files per PR). Agents see actual component props / ariaLabels / data- attrs upfront.
  • Iframe a11y snapshot fixture in _util.ts: on test failure, writes the preview-iframe's body.ariaSnapshot() to iframe-snapshot.md. Retry step appends this alongside the manager page-snapshot.
  • Authoring guide §8.1 expanded with evidence requirement + four-step evidence gate + worked examples (focus-ring, Save-from-Controls icon swap, sr-only label gating).
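
The source-inclusion cap (250 lines per file, 4 files per PR) could be applied along these lines. This is a sketch: selectSourcesForPrompt, the stories-file regex, the choice of the first four files in input order, and the truncation marker are all assumptions.

```typescript
interface TouchedFile {
  path: string;
  source: string;
}

const MAX_FILES_PER_PR = 4;
const MAX_LINES_PER_FILE = 250;

// Assumption: stories files are recognised by a ".stories.<ext>" suffix.
function isStoriesFile(path: string): boolean {
  return /\.stories\.[cm]?[jt]sx?$/.test(path);
}

function selectSourcesForPrompt(files: readonly TouchedFile[]): TouchedFile[] {
  return files
    .filter((file) => !isStoriesFile(file.path))
    .slice(0, MAX_FILES_PER_PR) // assumption: first four in the order given
    .map((file) => {
      const lines = file.source.split('\n');
      if (lines.length <= MAX_LINES_PER_FILE) return file;
      // Keep the head of the file and leave a marker so the agent knows content was cut.
      const head = lines.slice(0, MAX_LINES_PER_FILE).join('\n');
      return { path: file.path, source: `${head}\n// [truncated]` };
    });
}
```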

Compile-failure surfacing

  • When nx compile fails before Playwright runs, the workflow writes a stub verify-result.json with verdict=regression, regressionReason="compile failure", regressionDetails=tail -c 4000 of the log (ANSI-stripped). PR comment renders the build error in-line so reviewers see WHY without downloading artifacts.
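
A sketch of how the stub result could be assembled from a compile log. The field names verdict, regressionReason, and regressionDetails come from the bullet above; buildCompileFailureResult, the ANSI regex, and stripping before tailing are assumptions. Note that tail -c counts bytes while slice counts UTF-16 code units, so the two can differ on non-ASCII logs.

```typescript
interface StubVerifyResult {
  verdict: 'regression';
  regressionReason: string;
  regressionDetails: string;
}

// Matches CSI escape sequences such as colour codes (ESC [ ... final byte).
const ANSI_CSI = /\u001b\[[0-?]*[ -\/]*[@-~]/g;

function buildCompileFailureResult(log: string, tailChars = 4000): StubVerifyResult {
  // Assumption: strip colour codes first so the tail budget is spent on readable text.
  const clean = log.replace(ANSI_CSI, '');
  return {
    verdict: 'regression',
    regressionReason: 'compile failure',
    regressionDetails: clean.slice(-tailChars),
  };
}
```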

UX polish

  • Vision reasoning collapsed inside a <details> block (verdict stays one-glance, reasoning one click away).
  • PR comment unit-tests block renders ✅/❌ alongside Playwright + vision so reviewers see all three signals together.
  • Artifact zip staged under non-dot dirs so reviewers can browse it without toggling Finder's hidden-file display.
  • Replay link points at the run-summary page (where the Artifacts section lives) instead of the 404-emitting /artifacts path.

Telemetry

  • New "Append telemetry" workflow step writes one CSV row per run to telemetry.csv on the _verify-screenshots side branch. Columns: run_id, pr_number, verdict, target, evidence_verdict, evidence_retry, unit_tests_ran, unit_tests_passed, duration_ms, timestamp. After 10–20 PRs the data drives v8 prioritisation (in-app role discovery, 2-retry budget, cross-package story heuristic, etc.).
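
For illustration, one telemetry row could be serialised as below. The column list is taken from the bullet above; toCsvLine, escapeCsv, and the value types are hypothetical, and the actual workflow step may well do this in shell.

```typescript
// Column order as listed in the PR description.
const TELEMETRY_COLUMNS = [
  'run_id',
  'pr_number',
  'verdict',
  'target',
  'evidence_verdict',
  'evidence_retry',
  'unit_tests_ran',
  'unit_tests_passed',
  'duration_ms',
  'timestamp',
] as const;

type TelemetryColumn = (typeof TELEMETRY_COLUMNS)[number];
type TelemetryRow = Record<TelemetryColumn, string | number | boolean>;

// Quote a field only when it contains a comma, quote, or line break; double embedded quotes.
function escapeCsv(value: string | number | boolean): string {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsvLine(row: TelemetryRow): string {
  return TELEMETRY_COLUMNS.map((column) => escapeCsv(row[column])).join(',');
}
```

Driving the output from a single column array keeps the header and every appended row in the same order.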

Validation

Firetest PRs (fork-side):

| Fork PR | Upstream PR | Target | Verdict |
| --- | --- | --- | --- |
| #12 | n/a | internal-ui smoke | verified |
| #13 | #34767 (UndoIcon swap) | internal-ui | verified, evidence found |
| #14 | #34766 (Object JSON sr-only label) | internal-ui | verified after retry |
| #15 | #34756 (ArgsTable border) | internal-ui | regression (genuine compile-fail) |
| #16 | #34658 (sidebar focus ring) | internal-ui | verified, three signals positive |
| #17 | #34571 (Vue3 page-style scoping) | vue3-vite/default-ts | verified, evidence found |
| #18 | #34644 (Svelte docgen refactor) | svelte-vite/default-ts | verified |
| #21 | #34551 (Angular stats.json) | angular-cli/default-ts | verified |

@github-actions
Contributor

github-actions Bot commented May 11, 2026

Fails
🚫 PR is not labeled with one of: ["cleanup","BREAKING CHANGE","feature request","bug","documentation","maintenance","build","dependencies"]

🚫 PR is not labeled with one of: ["ci:normal","ci:merged","ci:daily","ci:docs"]

🚫 PR title must be in the format of "Area: Summary", with both Area and Summary starting with a capital letter.
Good examples:
- "Docs: Describe Canvas Doc Block"
- "Svelte: Support Svelte v4"
Bad examples:
- "add new api docs"
- "fix: Svelte 4 support"
- "Vue: improve docs"
🚫 PR description is missing the mandatory "#### Manual testing" section. Please add it so that reviewers know how to manually test your changes.

Generated by 🚫 dangerJS against 58e0609

@coderabbitai
Contributor

coderabbitai Bot commented May 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a PR verification harness: prompt-bundle generation, agent-driven Playwright spec authoring with deny-regex and ESLint gates (single retry), provenance-tagged committed specs, Storybook sandbox boot/sync, Playwright execution with trace capture, fixtures/docs, and a gated CI workflow that uploads artifacts and posts verdict comments.

Changes

PR Verification Harness

| Layer / File(s) | Summary |
| --- | --- |
| Data Models & Prompt Builder: scripts/verify/agent-prompt.ts | Defines PromptInput, prompt assembly, and token-budget validation. |
| Prompt Bundle Generator & CLI: scripts/verify-pr-generate.ts, scripts/verify-pr-author.ts, package.json | Generates .verify-output/<runId>/prompt-bundle.json, provides dispatch CLI (stdin/sdk), and adds package scripts plus @anthropic-ai/sdk devDependency. |
| Agent Dispatch Layer: scripts/verify/agent-dispatch.ts | Resolves model hints, builds Anthropic requests (guide+smoke), supports stub mode, and implements transport retry/backoff and debug artifact emission. |
| Recipe Author Engine: scripts/verify/recipe-author-core.ts | Orchestrates dispatch attempts, fenced-spec extraction, deny checks, linting, structural listener/attach checks, TOCTOU collision handling, provenance header prepending, result persistence, and retry framing. |
| Static Deny & ESLint Runner: scripts/verify/recipe-deny.ts, scripts/verify/lint-invocation.ts, .verify-recipes/.eslintrc.cjs | Provides DENY_PATTERNS and assertNoDeniedPatterns; programmatic ESLint runner pinned to scoped config and normalized violation output. |
| Playwright Runner & Core Parsing: scripts/verify/runner.ts, scripts/verify/core.ts, scripts/verify/playwright.config.ts | Spawns Playwright runs, streams logs, discovers trace attachments, parses Playwright JSON into structured VerifyResult, computes verdicts, and prunes old runs. |
| Sandbox, Sync & Boot Utilities: scripts/verify/sandbox.ts, scripts/verify/sync.ts, scripts/verify/symlink.ts, scripts/verify/boot.ts | Sandbox discovery/snapshot/restore, compile & sync core, symlink healing or copy fallback, port preflight, signal handlers, and Storybook boot/readiness checks. |
| Triage Routes & Selection: scripts/verify/recipes/triage-table.ts, scripts/verify/triage.ts | Maps changed-path globs to reference spec basenames and resolves absolute reference spec paths for prompts and provenance metadata. |
| Helpers, Fixtures & Guide: .verify-recipes/_util.ts, .verify-recipes/_recipe-authoring-guide.md, .verify-recipes/example-smoke.spec.ts, scripts/verify/__fixtures__/* | RecipePage helper, authoring contract guide, example smoke spec and stub fixtures used for smoke/lint/authoring validation. |
| CI Workflow, Docs & Ignores: .github/workflows/verify-pr.yml, scripts/verify/README.md, scripts/verify/SECURITY.md, .dockerignore, .gitignore | Gated pull_request_target workflow with actor checks, hardened container execution, artifact upload and PR comment step; documentation and ignore updates for artifacts/credentials. |

Sequence Diagram(s)

sequenceDiagram
  participant Dev as Dev / CI
  participant Generator as verify-pr-generate
  participant Bundle as PromptBundle (.verify-output)
  participant Agent as Anthropic / Agent
  participant AuthorCLI as verify-pr-author (stdin/sdk)
  participant Linter as ESLint (lint-invocation)
  participant Writer as Spec Writer (.verify-recipes)
  participant Runner as verify-pr (Playwright Runner)

  Dev->>Generator: yarn verify-pr-generate (create prompt-bundle.json)
  Generator->>Bundle: write prompt-bundle.json
  Bundle->>Agent: dispatch prompt (agent-dispatch)
  Agent->>AuthorCLI: assistant reply (stdin/sdk)
  AuthorCLI->>Linter: lint candidate spec
  Linter-->>AuthorCLI: lint result / violations
  AuthorCLI->>Writer: write provenance-tagged `.verify-recipes/pr-<N>.spec.ts` or emit retry (exit 75)
  Dev->>Runner: yarn verify-pr --recipe-spec (run committed spec)
  Runner-->>Dev: verify-result.json + artifacts (trace, logs)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

🧹 Nitpick comments (2)
scripts/verify/boot.ts (2)

30-58: 💤 Low value

Same command injection concern applies to lsof/netstat calls.

The lsof and netstat fallback commands on macOS/Linux have the same pattern. Consider refactoring these to use spawnSync with argument arrays as well.

♻️ Suggested refactor for Unix port checks
-    const out = execSync(`lsof -ti :${port}`, { encoding: 'utf-8' }).trim();
+    const lsof = spawnSync('lsof', ['-ti', `:${port}`], { encoding: 'utf-8' });
+    const out = (lsof.stdout ?? '').trim();

And for the netstat fallback:

-        const out = execSync(`netstat -an | grep :${port}`, { encoding: 'utf-8' }).trim();
+        const netstat = spawnSync('netstat', ['-an'], { encoding: 'utf-8' });
+        const out = (netstat.stdout ?? '')
+          .split('\n')
+          .filter((line) => line.includes(`:${port}`))
+          .join('\n')
+          .trim();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/verify/boot.ts` around lines 30 - 58, The lsof/netstat invocations in
scripts/verify/boot.ts use execSync with string interpolation (the
execSync(`lsof -ti :${port}`) and execSync(`netstat -an | grep :${port}`) calls)
which allows shell injection; change both to use child_process.spawnSync with
argument arrays (e.g., spawnSync('lsof', ['-ti', `:${port}`]) and for the
fallback use spawnSync('netstat', ['-an']) and pipe/filter the stdout in Node
rather than using a shell grep), handle ENOENT the same way, and preserve the
existing stdout/trimming/error handling logic used in the try/catch around those
calls.

11-11: 💤 Low value

Prefer spawnSync with args array to avoid command injection risk.

While port is typed as number (limiting actual injection risk), using execSync with string interpolation is flagged by static analysis and doesn't follow secure coding best practices. Using spawnSync with an argument array is safer and eliminates the warning.

♻️ Suggested refactor for Windows port check
-      const out = execSync(`netstat -ano | findstr :${port}`, { encoding: 'utf-8' }).trim();
+      const netstat = spawnSync('netstat', ['-ano'], { encoding: 'utf-8' });
+      const out = (netstat.stdout ?? '')
+        .split('\n')
+        .filter((line) => line.includes(`:${port}`))
+        .join('\n')
+        .trim();

Import spawnSync alongside execSync:

-import { execSync, spawn } from 'node:child_process';
+import { execSync, spawn, spawnSync } from 'node:child_process';
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/verify/boot.ts` at line 11, Replace the execSync call with spawnSync:
import { spawnSync } from 'node:child_process' (or add it to the existing execSync import),
call spawnSync('netstat', ['-ano'], { encoding: 'utf-8' }) to get stdout, then
search/filter the stdout string for the port (e.g., split by newlines and find a
line that includes `:${port}`) instead of using a shell pipeline to findstr;
update the const out assignment to use the spawnSync stdout and handle
empty/no-match cases similarly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.agents/skills/verify-recipe-author/SKILL.md:
- Around line 13-20: The SKILL.md currently hardcodes the absolute machine-local
path /Users/valentinpalkovic/Projects/storybook/... for bundle discovery and
other references; change the contract to use repository-relative or configurable
paths instead: describe auto-discovery as looking under the repo/workspace root
(e.g., ./.verify-output/) or respect an environment variable or argument (e.g., BUNDLE_ROOT
or a CLI arg) and update all occurrences mentioned (the auto-discover section
and the other occurrences at lines ~80-86, 137-138, 175-177, 187-202) to
reference the new repo-relative or configurable path rather than the absolute
path so the skill works from any checkout.
- Around line 51-89: The docs in SKILL.md are out-of-date and describe legacy
result shapes and retry flows that conflict with the implemented logic in
verify-pr-author.ts and recipe-author-core.ts; update the step descriptions to
match the shipped contract: ensure the pre-flight check references
bundle.outputSpecPath / bundle.force but writes the current result.json shape
used by recipe-author-core.ts (use the exact status strings and attempts framing
implemented there), change the retry/fence-miss semantics to match
verify-pr-author.ts (how many attempts, what messages trigger retries vs final
failure), and update the deny-regex invocation notes to reference the actual
script function assertNoDeniedPatterns in recipe-deny.ts and the real temp-file
path convention (.verify-output/<runId>/.deny-input.txt) and failure handling
(no retry on throw). Ensure all mentions of statuses, retry counts, and
ownership of header/lint/write behavior align with the real functions in
verify-pr-author.ts and recipe-author-core.ts rather than the legacy text.

In @.github/workflows/verify-pr.yml:
- Around line 23-26: The workflow is checking out the base commit (Checkout
base) using ref: github.event.pull_request.base.sha so the job builds the wrong
code; update the checkout step to instead fetch and check out the PR head (use
ref: github.event.pull_request.head.sha) or perform a two-step checkout/merge
(keep Checkout base, then run a step that fetches
github.event.pull_request.head.ref and merges or checks out that commit) so the
subsequent yarn verify-pr runs against the PR changes rather than the base
commit.
- Around line 19-24: Summary: The workflow uses placeholder all-zero SHAs for
two actions which will fail; replace them with real pinned commit SHAs or stable
tags. Locate the uses:
"prince-chrismc/check-actor-permissions-action@0000000000000000000000000000000000000000"
and "actions/checkout@0000000000000000000000000000000000000000" (and the
repeated occurrence at lines referenced 54-61) and swap the placeholder refs for
the actual commit SHA (preferred) or an official released tag, then re-run the
workflow to confirm the actions resolve correctly.
- Around line 31-51: The workflow assumes Bun and a local Docker image exist;
add steps to provision them before the "Generate bundle" and "Run harness in
container" steps: install Bun (e.g., use actions/setup-bun or an install script)
so the yarn verify-pr-generate command runs on a fresh runner, and add a step to
produce or pull the verify-harness:pinned-sha image (e.g., a "Build harness
image" step that runs docker build -t verify-harness:pinned-sha ... or a docker
pull) so the docker run that executes yarn verify-pr --recipe-spec can find the
image; reference the "Generate bundle" (yarn verify-pr-generate), "Author
recipe" (yarn verify-pr-author) and the docker run using
verify-harness:pinned-sha to place these new steps immediately before those
existing steps.
- Around line 15-17: The workflow's permissions block is missing the repository
contents read scope required by actions/checkout; add "contents: read" to the
existing permissions map (alongside pull-requests: write and statuses: write) so
actions/checkout can access the repo before the harness steps run and avoid the
job failing early.

In `@scripts/verify-pr-author.ts`:
- Around line 186-191: The catch block after runRecipeAuthor currently only logs
to stderr and returns; instead build a structured failure result and persist it
to result.json before returning. In the catch for runRecipeAuthor (and any
thrown by dispatchRecipeAuthor/stdin), populate the existing result variable (or
create one) with failure fields (e.g. success: false, error: msg, errorStack:
err.stack, attempt info), then write that object to result.json (using the same
JSON structure the rest of the flow expects) and flush to disk before calling
console.error and returning 1 so every run always leaves a well-formed
result.json.

In `@scripts/verify/README.md`:
- Around line 228-309: The README describes pre-v4 behavior that no longer
matches the shipped implementation; update the Limitations/Roadmap/Increment 2
text to reflect label-gated CI, direct SDK dispatch for the authoring skill, and
the current stdout retry contract `===VERIFY_PR_AUTHOR_RETRY_BEGIN===` (and
remove obsolete notes about `if: false`, Claude Code-only dispatch, stderr/JSON
retry framing, and older retry ownership). Edit the Increment 2 four-step flow
and Spec-name collision/Header-comment/Triage routing/Diff truncation paragraphs
(and the mirrored section at lines ~322-363) so commands like `yarn
verify-pr-generate`, the skill path
`.claude/skills/verify-recipe-author/SKILL.md`, the emitted
`.verify-output/<runId>/result.json`, and CI behavior reference the actual
label-gated workflow and direct Anthropic SDK dispatch used in this PR; keep the
same user-facing commands but change the descriptive prose and troubleshooting
pointers to the new stdout contract and CI gating.

In `@scripts/verify/recipe-author-core.ts`:
- Around line 114-120: The extractSpecBody function currently returns an empty
string for fences that contain only whitespace; change it to treat
whitespace-only payloads as extraction failures by checking the extracted body
with body.trim() and returning null when it's empty. Locate extractSpecBody (and
the variables SPEC_FENCE_START / SPEC_FENCE_END) and after computing body from
reply.slice(...), add a guard that returns null if body.trim() === '' and
otherwise perform the existing whitespace-adjustments and return the body.
- Around line 252-268: The current existsSync + fs.promises.rename still allows
a race where rename can overwrite bundle.outputSpecPath; change the move logic
to an atomic create-that-fails approach: attempt to create the destination using
fs.promises.copyFile(candidatePath, bundle.outputSpecPath,
fs.constants.COPYFILE_EXCL) (or fs.promises.link(candidatePath,
bundle.outputSpecPath) which also fails if the target exists), and only on
success unlink the candidatePath; on failure treat it as a collision (return the
same 'collision' RecipeAuthorResult). Update the code paths around candidatePath
and bundle.outputSpecPath to use COPYFILE_EXCL or link and handle the thrown
EEXIST to preserve the "do not clobber unless --force" guarantee.
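
The two suggestions above can be sketched as follows. This is a minimal illustration, not the code in `recipe-author-core.ts`: the fence marker values, the trailing-newline normalization, and the helper name `moveWithoutClobber` are assumptions, and the synchronous `fs` variants stand in for the `fs.promises` calls named in the comment.

```typescript
import * as fs from "node:fs";

// Hypothetical fence markers; the real SPEC_FENCE_START / SPEC_FENCE_END values
// in recipe-author-core.ts are not shown in this thread.
const SPEC_FENCE_START = "===SPEC_BEGIN===";
const SPEC_FENCE_END = "===SPEC_END===";

/** Returns the fenced body, or null when the fence is missing or whitespace-only. */
export function extractSpecBody(reply: string): string | null {
  const start = reply.indexOf(SPEC_FENCE_START);
  if (start === -1) return null;
  const bodyStart = start + SPEC_FENCE_START.length;
  const end = reply.indexOf(SPEC_FENCE_END, bodyStart);
  if (end === -1) return null;
  const body = reply.slice(bodyStart, end);
  // Whitespace-only payloads count as extraction failures.
  if (body.trim() === "") return null;
  // Assumed normalization: trimmed body with a single trailing newline.
  return body.trim() + "\n";
}

/**
 * Moves candidatePath to outputSpecPath without clobbering an existing file.
 * COPYFILE_EXCL makes the copy fail with EEXIST if the destination exists,
 * so there is no check-then-rename window.
 */
export function moveWithoutClobber(
  candidatePath: string,
  outputSpecPath: string,
): "moved" | "collision" {
  try {
    fs.copyFileSync(candidatePath, outputSpecPath, fs.constants.COPYFILE_EXCL);
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") return "collision";
    throw err;
  }
  // Only remove the candidate once the destination was created successfully.
  fs.unlinkSync(candidatePath);
  return "moved";
}
```

On a collision the candidate file is left in place, so the caller can still report or inspect it.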

In `@scripts/verify/symlink.ts`:
- Around line 27-42: The dangling-symlink heal code calls access(linkTarget)
directly, but readlink can return a relative path, and that path is relative to
dirname(target) rather than process.cwd(); resolve relative targets first by
computing const resolved = path.isAbsolute(linkTarget) ? linkTarget :
path.resolve(path.dirname(target), linkTarget) and then call await
access(resolved) (keep existing lstat, readlink, access, unlink flow and ensure
path is imported/available).
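
As a small sketch of the resolution step described above, the helper below isolates the path logic so it can be checked without touching the filesystem. The name `resolveLinkTarget` is hypothetical and does not appear in `symlink.ts`.

```typescript
import * as path from "node:path";

/**
 * Resolves the string returned by readlink(target) to the path the link points at.
 * Relative link targets are relative to the directory containing the symlink,
 * not to process.cwd().
 */
export function resolveLinkTarget(target: string, linkTarget: string): string {
  return path.isAbsolute(linkTarget)
    ? linkTarget
    : path.resolve(path.dirname(target), linkTarget);
}
```

In the heal flow this would be used as `await access(resolveLinkTarget(target, await readlink(target)))`, with the surrounding `lstat` and `unlink` logic unchanged.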

---

Nitpick comments:
In `@scripts/verify/boot.ts`:
- Around line 30-58: The lsof/netstat invocations in scripts/verify/boot.ts use
execSync with string interpolation (the execSync(`lsof -ti :${port}`) and
execSync(`netstat -an | grep :${port}`) calls), which passes the interpolated
value through a shell and allows shell injection; change both to use
child_process.spawnSync with argument arrays (e.g.,
spawnSync('lsof', ['-ti', `:${port}`]) and for the fallback use
spawnSync('netstat', ['-an']) and filter the stdout in Node rather than
using a shell grep), handle ENOENT the same way (spawnSync reports it on
result.error instead of throwing), and preserve the existing
stdout/trimming/error handling logic used in the try/catch around those calls.
- Line 11: Replace the execSync call with spawnSync: import { spawnSync } from
'child_process' (or include it when importing execSync), call
spawnSync('netstat', ['-ano'], { encoding: 'utf-8' }) to get stdout, then
search/filter the stdout string for the port (e.g., split by newlines and find a
line that includes `:${port}`) instead of using a shell pipeline to findstr;
update the const out assignment to use the spawnSync stdout and handle
empty/no-match cases similarly.
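
A minimal sketch of the shell-free port probe suggested above. The function names `linesForPort` and `probePort` are hypothetical, and the `address:port` output format is an assumption; some platforms print `address.port` instead, which this filter would not match.

```typescript
import { spawnSync } from "node:child_process";

/** Returns the netstat lines that contain `:<port>` as a complete port number. */
export function linesForPort(netstatOutput: string, port: number): string[] {
  // `(?!\d)` keeps port 6006 from matching `:60061`.
  const needle = new RegExp(`:${port}(?!\\d)`);
  return netstatOutput.split(/\r?\n/).filter((line) => needle.test(line));
}

/** Runs netstat without a shell; returns null when the binary is missing. */
export function probePort(port: number): string[] | null {
  const res = spawnSync("netstat", ["-an"], { encoding: "utf-8" });
  if (res.error) {
    // spawnSync does not throw on a missing binary; it sets result.error.
    if ((res.error as NodeJS.ErrnoException).code === "ENOENT") return null;
    throw res.error;
  }
  return linesForPort(res.stdout, port);
}
```

Filtering in Node also avoids the substring problem of a plain `grep :${port}`, which would treat `:60061` as a match for port 6006.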
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eb8d11de-840a-42a1-863d-ef397d7616cc

📥 Commits

Reviewing files that changed from the base of the PR and between 0eab5b5 and 7f3c95e.

⛔ Files ignored due to path filters (1)
  • yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (34)
  • .agents/skills/verify-recipe-author/SKILL.md
  • .claude/skills/verify-recipe-author/SKILL.md
  • .dockerignore
  • .github/workflows/verify-pr.yml
  • .gitignore
  • .verify-recipes/.eslintrc.cjs
  • .verify-recipes/.gitkeep
  • .verify-recipes/_recipe-authoring-guide.md
  • .verify-recipes/_util.ts
  • .verify-recipes/example-smoke.spec.ts
  • package.json
  • scripts/verify-pr-author.ts
  • scripts/verify-pr-generate.ts
  • scripts/verify-pr.ts
  • scripts/verify/README.md
  • scripts/verify/SECURITY.md
  • scripts/verify/__fixtures__/stub-assistant-reply-clean.txt
  • scripts/verify/__fixtures__/stub-assistant-reply-with-unused-var.txt
  • scripts/verify/__fixtures__/stub-assistant-reply.txt
  • scripts/verify/agent-dispatch.ts
  • scripts/verify/agent-prompt.ts
  • scripts/verify/boot.ts
  • scripts/verify/core.ts
  • scripts/verify/lint-invocation.ts
  • scripts/verify/playwright.config.ts
  • scripts/verify/recipe-author-core.ts
  • scripts/verify/recipe-deny.ts
  • scripts/verify/recipe-retry-policy.ts
  • scripts/verify/recipes/triage-table.ts
  • scripts/verify/runner.ts
  • scripts/verify/sandbox.ts
  • scripts/verify/symlink.ts
  • scripts/verify/sync.ts
  • scripts/verify/triage.ts

Comment thread .agents/skills/verify-recipe-author/SKILL.md Outdated
Comment thread .agents/skills/verify-recipe-author/SKILL.md Outdated
Comment thread .github/workflows/verify-pr.yml
Comment thread .github/workflows/verify-pr.yml Outdated
Comment thread .github/workflows/verify-pr.yml Outdated
Comment on lines +186 to +191
try {
  result = await runRecipeAuthor({ bundle, dispatch, runDir, attempt, mode });
} catch (err) {
  const msg = err instanceof Error ? err.message : String(err);
  console.error(`[verify-pr-author] dispatch failed: ${msg}`);
  return 1;

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist dispatch failures to result.json.

If dispatchRecipeAuthor() or stdin consumption throws here, the CLI exits with only stderr output. That breaks the “always leave a structured per-run result” contract the rest of this flow relies on.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/verify-pr-author.ts` around lines 186 - 191, The catch block after
runRecipeAuthor currently only logs to stderr and returns; instead build a
structured failure result and persist it to result.json before returning. In the
catch for runRecipeAuthor (and any thrown by dispatchRecipeAuthor/stdin),
populate the existing result variable (or create one) with failure fields (e.g.
success: false, error: msg, errorStack: err.stack, attempt info), then write
that object to result.json (using the same JSON structure the rest of the flow
expects) and flush to disk before calling console.error and returning 1 so every
run always leaves a well-formed result.json.
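
One way the catch block could persist a failure is sketched below. The `FailureResult` shape and the helper name `persistDispatchFailure` are assumptions based on the field names suggested in the comment; the real `result.json` schema is not shown in this thread.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical failure shape; the real result.json schema is not shown here.
interface FailureResult {
  success: false;
  error: string;
  errorStack: string | undefined;
  attempt: number;
}

/** Writes a structured failure record to <runDir>/result.json and returns it. */
export function persistDispatchFailure(
  runDir: string,
  err: unknown,
  attempt: number,
): FailureResult {
  const result: FailureResult = {
    success: false,
    error: err instanceof Error ? err.message : String(err),
    errorStack: err instanceof Error ? err.stack : undefined,
    attempt,
  };
  fs.mkdirSync(runDir, { recursive: true });
  // Synchronous write, so the file exists before the CLI returns its exit code.
  fs.writeFileSync(
    path.join(runDir, "result.json"),
    JSON.stringify(result, null, 2) + "\n",
  );
  return result;
}
```

The catch block would then call `persistDispatchFailure(runDir, err, attempt)` before `console.error` and `return 1`.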

Comment thread scripts/verify/README.md Outdated
Comment thread scripts/verify/recipe-author-core.ts
Comment thread scripts/verify/recipe-author-core.ts Outdated
Comment thread scripts/verify/symlink.ts
Comment on lines +27 to +42
// Net-new dangling-symlink heal: if target exists as a symlink but points to a missing location,
// unlink it so ensureSymlink can recreate it correctly.
try {
  const stat = await lstat(target);
  if (stat.isSymbolicLink()) {
    try {
      const linkTarget = await readlink(target);
      await access(linkTarget);
    } catch {
      await unlink(target);
      console.log('[symlink] healed dangling target ' + target);
    }
  }
} catch (e: any) {
  if (e?.code !== 'ENOENT') throw e;
}

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix relative symlink target resolution.

The dangling-symlink healing logic has a path-resolution bug. When readlink(target) returns a relative path, it's relative to dirname(target), not process.cwd(). However, access(linkTarget) resolves relative paths against process.cwd(), causing incorrect validation.

Example failure: If a symlink at /repo/foo/link points to ../bar/file, readlink returns ../bar/file, but access checks {cwd}/../bar/file instead of /repo/foo/../bar/file.

🔧 Proposed fix to resolve relative symlink targets correctly
       const linkTarget = await readlink(target);
-      await access(linkTarget);
+      const resolvedTarget = path.isAbsolute(linkTarget)
+        ? linkTarget
+        : path.resolve(path.dirname(target), linkTarget);
+      await access(resolvedTarget);
     } catch {

@valentinpalkovic

valentinpalkovic commented May 11, 2026

Activation Gates A1–A4 — Results

Approach

A4 testing was run on the valentinpalkovic/storybook fork to avoid noise on upstream. The fork's next was fast-forwarded to valentin/agentic-review-harness, the secret was provisioned, and a draft PR was opened and labeled.

A1 — action SHAs pinned ✅

Commit 478ba479ec2. All 4 third-party actions resolved to commit SHAs of latest stable tags.

A2 — ANTHROPIC_API_KEY repo secret ✅

Provisioned on both upstream and fork.

A3 — live AC-V4-1 + AC-V4-3b ✅

Locally validated against bundle from #34761.

Run 1 (populate):  input=11727  cache_creation=4358   cache_read=0     out=958
Run 2 (hit):       input=6      cache_creation=11721  cache_read=4358  out=867

Run 2 cache_read_input_tokens=4358 ≥ 1024. AC-V4-1, AC-V4-3b PASS.
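
The cache-hit check above amounts to a single threshold comparison. The sketch below assumes the criterion is only `cache_read_input_tokens >= 1024`; the `Usage` field names follow the Anthropic Messages API `usage` object, and `isCacheHit` is a hypothetical helper name.

```typescript
// Usage counters as reported for the two runs above.
interface Usage {
  input_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
  output_tokens: number;
}

// Threshold used in the A3 check above.
const MIN_CACHE_READ_TOKENS = 1024;

/** True when enough input tokens were served from the prompt cache. */
export function isCacheHit(usage: Usage): boolean {
  return usage.cache_read_input_tokens >= MIN_CACHE_READ_TOKENS;
}

const run1: Usage = {
  input_tokens: 11727,
  cache_creation_input_tokens: 4358,
  cache_read_input_tokens: 0,
  output_tokens: 958,
};
const run2: Usage = {
  input_tokens: 6,
  cache_creation_input_tokens: 11721,
  cache_read_input_tokens: 4358,
  output_tokens: 867,
};

console.log(isCacheHit(run1), isCacheHit(run2));
```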

A4 — label-fire end-to-end ✅ (modulo docker gap)

Workflow fires on label, actor-permission gate passes. 7 of 8 substantive steps pass on run 25674270957:

| Step | Result |
| --- | --- |
| Check actor permission | |
| Setup Node.js and Install Dependencies | |
| Setup Bun | |
| Fetch PR diff | |
| Generate bundle (gh + triage) | |
| Author recipe (Anthropic SDK call) | |
| Run harness in container | ❌ no Dockerfile in repo |
| Upload artifacts | ✅ (7992 bytes, .verify-output/...) |
| Post PR comment | ✅ (robust fallback) |

Fixes shipped to make activation work

| Commit | Fix |
| --- | --- |
| 478ba479ec2 | Pin 4 action SHAs |
| 51ca9477016 | MODEL_ID_MAP pointed at non-existent claude-opus-4-5-20250929 → use valid IDs |
| e3aae89b69d | Workflow missing yarn install + Post PR comment referenced non-existent latest/ symlink |
| 2a1aa10b9e9 | Workflow missing bun setup (scripts use bun) |
| a256f3a7197 | path: .verify-output/*/ finds 0 files in upload-artifact v7 |
| 7b660d73e26 | Debug step (used to confirm dir layout) |
| 0993b55e24e | include-hidden-files: true (.verify-output is dot-prefixed) |

Remaining activation gap

v5-0 — Dockerfile + image build in CI. Run harness in container step references verify-harness:pinned-sha. No Dockerfile in repo. Needs:

  • scripts/verify/Dockerfile defining the spec-runner sandbox (Node + Playwright + minimal deps, non-root user)
  • Workflow step to build the image (or pull from a registry) before the harness step

Until v5-0, the harness step always fails. Everything upstream of it now works.

🤖 Generated with Claude Code

@valentinpalkovic valentinpalkovic changed the title feat(scripts): agentic PR verification harness (PoC → v4) feat(scripts): agentic PR verification harness (v6 — runner-native, no Docker) May 11, 2026
@valentinpalkovic valentinpalkovic changed the title feat(scripts): agentic PR verification harness (v6 — runner-native, no Docker) feat(scripts): agentic PR verification harness (v6 single-round — author + execute in one run) May 12, 2026
@valentinpalkovic valentinpalkovic force-pushed the valentin/agentic-review-harness branch from d57333c to 046eda5 Compare May 14, 2026 10:54
@valentinpalkovic valentinpalkovic force-pushed the valentin/agentic-review-harness branch 13 times, most recently from 160bdcd to ad75ba9 Compare May 15, 2026 19:40
…posites

LLM-authored single-round PR verification harness running under
pull_request_target on ci:verify label. Authors a Playwright spec from
the PR diff, executes against internal-ui or a sandbox template, runs
vision evidence-check + PR-added unit tests, posts a verdict comment.

Composite actions (.github/actions/agentic-pr-prepare + agentic-pr-publish)
factor the trust-boundary plumbing. Hardenings: C1 HMAC-bound verdict,
srt-sha256 caller-inline, sync-path traversal guards, jq-emitted srt
settings, file-path screenshot output, provenance secret to file,
env-mapped tokens. Layer-2 srt (bubblewrap) isolation wraps all
PR-controlled steps.

Includes eval-driven fixes: vitest --cache=false, evidence-check scans
$PR_HEAD_DIR/.verify-output, playwright-report path via runId, evidence
reasoning 2000-char cap, recipe-body block in PR comment, 4 authoring
triage rules (addon-docs→sandbox, ActionBar hover+scroll, sidebar
expand, .first()-visible), srt allowWrite includes sandbox root.

Clean 64-file harness footprint atop current next (prior squash had
captured stale upstream snapshots surfacing as spurious reverts).
@valentinpalkovic valentinpalkovic force-pushed the valentin/agentic-review-harness branch from ad75ba9 to 099b6f7 Compare May 15, 2026 19:48
valentinpalkovic and others added 5 commits May 15, 2026 22:42
…ssing

Non-visual coverage gap (5 of 9 eval regressions were aria/XSS/type-only
where vision returns `undetermined`):

- scripts/verify/mode.ts: parse `@verify-mode` header (visual default,
  behavioral, pure-fn, build-config; type-only deliberately excluded).
- core.ts: `mode` on VerifyResult, added to SIGNED_FIELDS so a forged
  in-srt result cannot claim a non-visual mode to dodge vision.
- verify-pr.ts: parse/log/stamp mode; visual+behavioral share the
  Playwright path; pure-fn/build-config emit an explicit `skipped`
  (no false verdict) until their execution harness ships.
- verify-evidence-check.ts: skip vision when mode != visual (kills the
  useless `undetermined` on behavioral recipes).
- _recipe-authoring-guide.md §12.5: HARD GATE mode-selection triage +
  behavioral worked example (aria/XSS → behavioral).

Scratch-dir blessing (sanctioned FS-write path, no srt loosening):

- _util.ts: RecipePage.scratchDir + writeFixture(relPath, contents),
  traversal/absolute guarded; $PR_HEAD_DIR/.verify-scratch is already
  in srt allowWrite.
- guide §4 documents it; .gitignore += .verify-scratch.

DESIGN-nonvisual-coverage.md records full design; coverage gate (Part C)
and pure-fn/build-config wiring deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
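
The traversal/absolute guard described for `writeFixture` in the commit above could look roughly like this. It is a sketch under stated assumptions: `resolveInScratch` is a hypothetical helper name, and the real `_util.ts` implementation is not shown in this thread.

```typescript
import * as path from "node:path";

/**
 * Resolves relPath inside scratchDir, rejecting absolute paths and any
 * path that escapes scratchDir via `..` segments.
 */
export function resolveInScratch(scratchDir: string, relPath: string): string {
  if (path.isAbsolute(relPath)) {
    throw new Error(`absolute path not allowed: ${relPath}`);
  }
  const root = path.resolve(scratchDir);
  const resolved = path.resolve(root, relPath);
  const rel = path.relative(root, resolved);
  // An empty `rel` means relPath points at the scratch dir itself, not a file in it.
  if (
    rel === "" ||
    rel === ".." ||
    rel.startsWith(".." + path.sep) ||
    path.isAbsolute(rel)
  ) {
    throw new Error(`path escapes scratch dir: ${relPath}`);
  }
  return resolved;
}
```

Checking the relative path after resolution also catches inputs such as `a/../../x` that do not start with `..`.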
…ule import

Wave finding (#36 try-pr-34649 a11yRunner): recipe-author correctly chose
@verify-mode: behavioral but reached the changed module via in-browser
dynamic import() + monkeypatch — the deny-regex gate rejected it at attempt 1
with no retry, producing "no verdict".

- recipe-author-core.ts: on assertNoDeniedPatterns failure, build a retry
  message (denied pattern + §12.5 pointer) and loop, mirroring the lint
  failure path. Terminal `deny-regex-hit` only after MAX_RECIPE_ATTEMPTS.
- _recipe-authoring-guide.md §12.5: HARD GATE — a behavioral recipe must
  never import()/monkeypatch/eval the changed module (deny-regex blocks it
  pre-run = no verdict). Drive the public UI path and assert observable
  effect; if no UI path exists, fall back to visual smoke + filterPageErrors
  rather than fabricating a module import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rity no-verdict cause

Wave finding (#31 try-pr-34712 XSS): recipe-author correctly chose behavioral
mode and drove the change via the public manager-api `setOptions` path (no
module import), but needed `(window as any).__STORYBOOK_ADDONS_MANAGER` to
reach the runtime singleton. `@typescript-eslint/recommended` makes
`no-explicit-any` an error, so the scoped lint gate failed twice → no verdict.

- .verify-recipes/.eslintrc.cjs: `@typescript-eslint/no-explicit-any: 'off'`.
  Code-quality rule, NOT a security control — deny-regex and
  no-restricted-{globals,imports,syntax} remain the load-bearing gates.
  `as any` for window/manager-api globals is correct and unavoidable.
- _recipe-authoring-guide.md §12.5: note that `as any` for runtime globals
  is allowed; don't waste retries trying to type them.

Verified: behavioral recipe using `(window as any).__STORYBOOK_ADDONS_MANAGER`
now lints clean (exit 0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ActionBar scope

Wave findings (#28/#29/#31 stuck at regression despite passing PR unit tests —
recipe-author mis-targeted the DOM):

- ActionBar/Canvas rule was conflating the docs-Canvas Zoom/Show-code toolbar
  with the generic `ActionBar` component. Scope-tagged it to the docs-Canvas
  surface only.
- New HARD GATE "additive-only API changes with no story/consumer" — the #1
  false-regression cause. #28/#29 add `ActionItem.ariaLabel` but no story or
  in-diff consumer passes it, so the attribute is never in the DOM; asserting
  it always fails. Rule: detect additive-no-consumer, fall back to
  `@verify-mode: visual` smoke on the component's existing story
  (`components-actionbar--many-items`), never `getByRole('toolbar')` (the
  component renders plain <button>s) nor `.docs-story`.
- New HARD GATE for `Brand` / `theme.brand.title`: the sanitized
  dangerouslySetInnerHTML path runs ONLY when `theme.brand.image === null`.
  Target the existing `manager-sidebar-heading--only-text` /
  `--link-and-text` stories (already `{title, image:null}`); never runtime
  `api.setOptions({theme})` (#31 false regression — never reaches the path).
  XSS-inert proof is the PR's unit test; recipe is a render/boot smoke.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… TMPDIR pinned

Two distinct wave-#31/#36 root causes, both false regressions:

(a) _util.ts previewRoot() filtered `#storybook-root:visible`. Stories with
    `parameters.layout:'fullscreen'` + the internal-ui side-by-side/stacked
    theme decorator wrap the story so #storybook-root has a zero-size
    (Playwright-"not visible") box though it rendered — locator matched
    nothing, waitForStoryLoaded timed out (#31 manager-sidebar-heading--*).
    Use `:has(> *)` instead: selects whichever container actually has
    children, keeps story-vs-docs disambiguation, drops the bounding-box
    requirement.

(b) verify-pr.yml unit-test step runs `env -i … srt … yarn vitest`. `env -i`
    strips TMPDIR, so Yarn's run-temp realpaths a nonexistent srt path
    (`lstat '/tmp/claude'` ENOENT) and aborts before vitest starts →
    false "vitest exited without JSON report" regression (#36 a11yRunner).
    Pin TMPDIR to an existing allowWrite dir
    ($PR_HEAD_DIR/.verify-output/vitest-tmp), for the same reason REPORT/LOG
    already live there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
valentinpalkovic and others added 7 commits May 18, 2026 23:17
…bans root-visible assert

Re-run of #36/#31 showed both prior fixes missed the real cause:

- #36: TMPDIR pin had zero effect — Yarn's mktempPromise still ENOENT
  `/tmp/claude`. Root cause: srt derives its sandbox tmp from
  CLAUDE_CODE_TMPDIR, NOT TMPDIR. The main recipe run inherits it via
  $GITHUB_ENV ($SANDBOX_TMPDIR); the unit-test step's `env -i` strips it,
  so srt falls back to its hardcoded `/tmp/claude` (never created). Pass
  CLAUDE_CODE_TMPDIR=$VITEST_TMPDIR (existing allowWrite dir) in env -i.

- #31: previewRoot `:has(> *)` fix removed the _util.ts:66 timeout, but the
  recipe-author hand-rolled `expect('#storybook-root').toBeVisible()` which
  is "hidden" for `Sidebar/Heading` (layout:fullscreen + side-by-side =
  zero-box root). Brand triage rule now explicitly bans root-visibility
  asserts and prescribes a child `toBeAttached()` content assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndate it

Wave #36 (after CLAUDE_CODE_TMPDIR fix let vitest run): Playwright recipe
failed on `expect(consoleErrors).toEqual([])` because the srt egress jail
denies every non-allowlisted domain, so internal-ui's external probes always
log `Failed to load resource: net::ERR_INTERNET_DISCONNECTED` — environmental,
not a PR regression. No console-error equivalent of filterPageErrors existed.

- _util.ts: add `filterConsoleErrors()` dropping `net::ERR_*`
  (INTERNET_DISCONNECTED / NAME_NOT_RESOLVED / BLOCKED_BY_CLIENT /
  CONNECTION_REFUSED / FAILED) + the shared cross-origin sessionStorage
  SecurityError. Verified: keeps only genuine errors.
- _recipe-authoring-guide.md §3: MANDATORY subsection — never assert the
  raw consoleErrors array; always filterConsoleErrors(consoleErrors),
  mirroring the filterPageErrors mandate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
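
A rough sketch of the `net::ERR_*` part of the filter described in the commit above. It assumes the collected console errors are plain strings; the real `filterConsoleErrors` signature in `_util.ts` is not shown here, and the sessionStorage SecurityError filter is omitted because its exact message text does not appear in this thread.

```typescript
// Environmental network failures caused by the egress jail, per the commit message.
const NETWORK_NOISE = /net::ERR_[A-Z_]+/;

/** Drops console errors that only report sandbox network denial. */
export function filterConsoleErrors(consoleErrors: readonly string[]): string[] {
  return consoleErrors.filter((message) => !NETWORK_NOISE.test(message));
}
```

A recipe would then assert `expect(filterConsoleErrors(consoleErrors)).toEqual([])` instead of asserting on the raw array.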
…on-visual flipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pragmatic-Programmer Wave-1 (EPIC-1/2/4), 17 P0 defects:

Crash Early / Do No Harm:
- loadCostLedger: ENOENT→0, unreadable→fail-safe over-budget; atomic
  ledger write (tmp+rename) so concurrent reads can't see a torn file
- getPricing: unknown model throws instead of silent opus fallback;
  telemetry warns instead of silently charging $0
- telemetry sink hiccup warns + exits 0 (no longer gates the verdict)
- computeVerdict: zero-tests carries an explicit regressionReason
- sync/symlink failures propagate (no stale-core false "verified");
  atomic dir copy; pruneOldRuns surfaces undeletable dirs

Finish What You Start:
- preflightPort uses an authoritative bind probe (no fail-open)
- abort: SIGTERM→timed SIGKILL, process-group kill, await child exit
- abort-race listener leak fixed; runner treats AbortError as non-fatal
- SDK call honours an optional AbortSignal; runResync fetch timeout

Design by Contract (HMAC verdict integrity):
- trusted post-processors re-sign verify-result.json after mutation
  via exported core.signResultFile; result+sig published atomically
  (sig before result); evidence-check missing-key writes undetermined
- SECURITY.md §c1 cites core.RESULT_FILENAME/verifyResultPath
- workflow re-sign is rollout-guarded (feature-detect signResultFile)

Security-reviewed: SHIP, 0 CRITICAL/HIGH/MED. Follow-ups (LOW) tracked
as Wave-1.1: atomicWrite O_EXCL, derive-verdict atomic write,
SIGNED_FIELDS exclusion assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
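
The HMAC verdict-integrity idea in the commit above can be illustrated with a small sign/verify pair. This is only a sketch: apart from `mode`, the field names in `SIGNED_FIELDS` are hypothetical, the payload serialization is an assumption, and `signResult` / `verifyResult` are not the harness's `core.signResultFile`.

```typescript
import * as crypto from "node:crypto";

// Hypothetical signed-field list; the real SIGNED_FIELDS in core.ts is not shown here.
const SIGNED_FIELDS = ["runId", "verdict", "mode"] as const;

type Result = Record<string, unknown>;

/** HMAC-SHA256 over the signed fields only, serialized in a fixed order. */
export function signResult(result: Result, key: string): string {
  const payload = JSON.stringify(
    SIGNED_FIELDS.map((field) => [field, result[field] ?? null]),
  );
  return crypto.createHmac("sha256", key).update(payload).digest("hex");
}

/** Constant-time comparison of the expected and presented signatures. */
export function verifyResult(result: Result, sig: string, key: string): boolean {
  const expected = Buffer.from(signResult(result, key), "hex");
  const actual = Buffer.from(sig, "hex");
  // timingSafeEqual throws on unequal lengths, so check the length first.
  return expected.length === actual.length && crypto.timingSafeEqual(expected, actual);
}
```

Because only the signed fields enter the payload, a post-processor can add or change unsigned fields without invalidating the signature, while any change to a signed field requires a re-sign with the key in scope.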
Layers the prior-session Pragmatic-Programmer H1–H7 fixes on top of the
committed Wave-1 P0 work. Clean H1–H7 files taken as-is; the 5 files that
overlap Wave-1 were 3-way merged (base f573e99 ↔ Wave-1 ↔ H1–H7):

- H1: delete dead recipe-retry-policy.ts (inlined into recipe-author-core)
- H2: model-pricing.ts single source; agent-dispatch + append-telemetry
  now consume it. Semantic merge with Wave-1: getPricing keeps W1's
  fail-loud HARD throw on an unknown model (no silent opus fallback),
  expressed against the H2 single-source table; telemetry single-sources
  too but warns + records 0 on unknown (non-blocking, unchanged intent).
- H3: deny-policy single-sourced; docs cite recipe-deny.ts
- H4: env-jail single source (strip-untrusted-secrets.sh) — merged
  cleanly with Wave-1's EPIC-4 re-sign + rollout guard in verify-pr.yml
- H5/H6: README skeleton + SKILL runbook corrected
- H7: dead provenance verifier removed; SECURITY.md §c1 merged with
  Wave-1's RESULT_FILENAME/verifyResultPath citation
- evidence-check: union import — keeps Wave-1 signResultFile (re-sign)
  AND H2 getModelPrice (vision cost single-source)

Verified: 12 .ts typecheck clean, verify-pr.yml YAML valid, shellcheck
clean, pricing roundtrip (opus-4-7 $5/$25, unknown→throws), zero conflict
markers, no unrelated cross-branch WIP staged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nvariant guard

Security-review follow-ups on the PR-verify harness (all in-worktree):

- core.ts atomicWrite: unpredictable randomBytes(12) temp suffix (was
  guessable pid+Date.now()) opened O_WRONLY|O_CREAT|O_EXCL|O_NOFOLLOW
  0o600 so a pre-planted temp path / symlink cannot hijack, observe, or
  redirect the trusted signed-result write before the atomic rename
  (CWE-377/59). Exported as the one safe writer; no-retry-on-EEXIST
  documented (a retry would reintroduce a name-guessing oracle).
- core.ts: module-load assertion that SIGNED_FIELDS is disjoint from
  {unitTests, evidenceRetry, evidenceVerdict} — crashes early instead of
  silently breaking every verified PR if a post-processor field is ever
  moved into the signed set (EPIC-4.1 HMAC invariant, fail-loud).
- ci/derive-verdict.ts: both result writes (forgery-downgrade persist +
  pre-re-sign write) routed through core.ts atomicWrite so a reader never
  sees a torn result and the temp path can't be symlink-hijacked. Re-sign
  failure after the unit-test merge now fails the step loudly
  (process.exitCode=1) instead of persisting an unregenerable stale .sig.
- SECURITY.md §c1: rewrote the post-signing invariant into the two
  distinct mechanisms (machine-enforced disjointness = load-bearing;
  re-sign = mandatory only where a signed field changes with the secret
  in scope, else defense-in-depth). Corrected the scope: only the
  unit-test jq writers are deliberately unsigned (secret stripped before
  untrusted vitest); evidenceRetry IS re-signed by the workflow.

Verified: tsc clean, node --experimental-strip-types --check + module
load (assertion passes, non-vacuous), atomicWrite functional (write +
atomic overwrite + zero tmp leftovers), workflow YAML still parses.
Mandatory security-reviewer pass: 0 CRIT/HIGH; MED (fail-loud re-sign)
and LOW(a) applied; LOW(b) test-pin deferred to Wave 2 (EPIC-5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
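
The hardened temp-file write described in the commit above can be sketched as follows. This is a simplified synchronous illustration, not the `atomicWrite` exported from `core.ts`; the temp-file naming scheme is an assumption.

```typescript
import * as crypto from "node:crypto";
import * as fs from "node:fs";

/**
 * Writes data to a freshly created temp file next to destPath, then renames
 * it into place. O_EXCL refuses a pre-existing temp path and O_NOFOLLOW
 * refuses a symlink at that path.
 */
export function atomicWrite(destPath: string, data: string): void {
  // Unpredictable suffix instead of a guessable pid + timestamp.
  const tmpPath = `${destPath}.${crypto.randomBytes(12).toString("hex")}.tmp`;
  const { O_WRONLY, O_CREAT, O_EXCL } = fs.constants;
  // O_NOFOLLOW is not defined on every platform (for example Windows).
  const noFollow = fs.constants.O_NOFOLLOW ?? 0;
  // No retry on EEXIST: the error propagates to the caller.
  const fd = fs.openSync(tmpPath, O_WRONLY | O_CREAT | O_EXCL | noFollow, 0o600);
  try {
    try {
      fs.writeFileSync(fd, data);
    } finally {
      fs.closeSync(fd);
    }
    // rename replaces destPath in one step, so readers never see a torn file.
    fs.renameSync(tmpPath, destPath);
  } catch (err) {
    fs.rmSync(tmpPath, { force: true });
    throw err;
  }
}
```

Keeping the temp file in the same directory as the destination keeps the rename on one filesystem, which is what makes the final step atomic.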
…8 single-sourcing

EPIC-5 (test the verifier — was 3 test files, zero on the security/cost
core): 8 new vitest suites, 181 tests, auto-globbed by the `scripts`
project (no CI wiring needed; satisfies 5.11):
- 5.1 recipe-deny.test.ts — all 19 DENY patterns, exact 1-based line,
  per-line tripwire (no comment-awareness) pinned; eval-#36
  `dynamic import(` pinned (isolated + overlapping).
- 5.2 agent-dispatch-cost.test.ts — budget gate boundary (computed, not
  hardcoded), resolveModelId round-trip, pricing digit-transpose guard.
- 5.4 derive-verdict-hmac.test.ts — saboteur suite (forged/tampered/
  correct/wrong-secret/non-signed vs signed field) + the deferred
  Wave-1.1 LOW(b) disjointness pin, made non-vacuous (poisoned-set
  replica proves the guard has teeth).
- 5.6 triage/target-suggest, 5.7 mode/target (30-line window edge),
  5.10 agent-prompt-sanitize (ANSI/NUL/fence redaction, cap boundary).

EPIC-6:
- 6.1 SKILL.md — de-absolutized 4 hardcoded /Users/... paths +1 stale
  prose note to runtime-resolved $REPO_ROOT (git rev-parse).
- 6.8 srt pin — replaced manual-paste srt-version/srt-sha256 in
  verify-pr.yml with committed scripts/verify/srt.lock.json read
  fail-closed from the TRUSTED base checkout (values byte-identical:
  0.0.51 / 36de…6338); load-bearing sha verification in the composite
  untouched. Added workflow_dispatch-only _srt-sha-probe.yml.

Mandatory separate review pass (security-reviewer + code-reviewer),
findings addressed and re-verified:
- sec HIGH: probe no longer auto-commits/pushes the supply-chain pin —
  emit-only (summary + artifact + outputs), contents: read, human lands
  it via reviewed PR (restores the "reviewed diff" invariant).
- sec LOW: strict srt version regex (rejects ./../leading-trailing dot)
  in both workflows.
- code HIGH: derive-verdict-hmac.test.ts typed (Partial<VerifyResult>/
  RecipeTest/StepStatus) — 22 tsc errors cleared, assertions unchanged.
- code MED/LOW: dropped unused beforeEach import; over-claiming
  "ordering invariant" tests renamed to honest "disjoint rules resolve
  independently" (no dual-match input exists in the real globs); added
  target grammar negative cases.

Verified: tsc clean for all 8 test files (scripts/tsconfig.json),
181/181 green, both workflows parse, bash -n OK, srt values unchanged,
trust boundary + fail-closed intact (security-reviewed). Scope clean
(no yarn.lock/shared-tree drift).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>