feat(verify): verify-on-stop — run verify.toml gates before a 'done' parks for review#136

Merged
thejustinwalsh merged 3 commits into main from feat/verify-on-stop on May 25, 2026

@thejustinwalsh

@thejustinwalsh thejustinwalsh commented May 25, 2026

Owner

What you asked for

Force the agent to run the verifications before it can finish, and feed failures back so it auto-fixes — so it can't ship an unverified PR.

Why this is the first live gate run

The verify.toml gates run nowhere in a live dispatch today: the per-phase checkbox-revert reconciler exists and is tested, but its production trigger is unbuilt (#101). So this is the first place gates actually execute against a running agent.

How (workflow-side, by design)

We discussed a literal Claude Stop hook, but that races the workflow's existing agent.stopped turn-boundary (it would classify + kill the session mid-fix). So this lives in the workflow, reusing the nudge machinery:

When the agent's Stop classifies as done (about to park for review), the workflow runs runGates(config.gates, {cwd: worktree}):

  • pass → done stands → parks for review as before.
  • fail → nudge the agent in-session with the gate report ("fix these, then finish"), re-await Stop, re-verify — up to verifyRoundCap (default 3).
  • exhausted → park for a human (waiting-human, worktree kept) — an unverified PR never goes up.
  • A re-classification that's no longer done (agent asked a question, etc.) is handed back to the normal branch.

This runs after the plan-comment guard and never races the Stop boundary or touches the hook protocol. It is opt-in per repo: a missing or malformed verify.toml skips verification, so only repos that define gates (like middle's own) enforce it.
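The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in implementation.ts: the `Classification`, `GateResult`, and `VerifyDeps` shapes, the `nudge` and `awaitStopAndClassify` helpers, and the exact way the cap is counted (here, as the number of fix nudges allowed) are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the real types may differ.
type Classification = { kind: "done" } | { kind: "question" } | { kind: "bare-stop" };
type GateResult = { ok: true } | { ok: false; report: string };
type VerifyOutcome = Classification | { kind: "nudge-exhausted" };

interface VerifyDeps {
  runVerifyGates: (worktree: string) => Promise<GateResult>;
  nudge: (message: string) => Promise<void>; // assumed: sends text into the live session
  awaitStopAndClassify: () => Promise<Classification>; // assumed: waits for the next Stop
  verifyRoundCap?: number;
}

const DEFAULT_VERIFY_ROUND_CAP = 3;

async function enforceVerifyOnDone(
  worktree: string,
  deps: VerifyDeps,
): Promise<VerifyOutcome> {
  const cap = deps.verifyRoundCap ?? DEFAULT_VERIFY_ROUND_CAP;
  let nudgesSent = 0;
  for (;;) {
    const result = await deps.runVerifyGates(worktree);
    // pass: 'done' stands and the caller parks for review as before.
    if (result.ok) return { kind: "done" };
    // exhausted: the caller parks for a human and keeps the worktree.
    if (nudgesSent >= cap) return { kind: "nudge-exhausted" };
    // fail: feed the gate report back in-session, then wait for the next Stop.
    await deps.nudge(`Verification failed. Fix these, then finish:\n${result.report}`);
    nudgesSent++;
    const next = await deps.awaitStopAndClassify();
    // No longer 'done' (for example, the agent asked a question): hand it back.
    if (next.kind !== "done") return next;
  }
}
```

In this sketch the caller would only invoke `enforceVerifyOnDone` when a `runVerifyGates` dependency is configured, which matches the "no seam, unchanged" behavior the tests cover.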

Verification

format/lint/typecheck clean; full suite 493 pass, incl. 3 new tests: fail-then-pass→parks-for-review, cap-exhausted→waiting-human (worktree kept), and no-seam→unchanged.

Note

Complements, doesn't replace, #101 (the per-push per-phase trigger). This is the "before the PR goes up" backstop.

Summary by CodeRabbit

  • New Features

    • Verify-on-stop: repository verification gates run automatically when tasks are marked done, returning concise, formatted failure reports.
    • In-session nudges show gate failure details so users can fix and retry; verification attempts are capped to avoid infinite retries.
    • Verification runs integrate with stop flow and extended timeouts to allow retry/repair during the stop process.
  • Tests

    • Added end-to-end tests for verify-on-stop retries, failure exhaustion, bare-stop behavior, and session cleanup.

@thejustinwalsh thejustinwalsh added the dogfood Work dispatched through middle once dogfooding starts (phases 3+) label May 25, 2026
@coderabbitai

coderabbitai Bot commented May 25, 2026


Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 376b264d-43c1-4d07-bed0-2dfcde4e83ae

📥 Commits

Reviewing files that changed from the base of the PR and between 12c3082 and 2ce9509.

📒 Files selected for processing (3)
  • packages/dispatcher/src/build-deps.ts
  • packages/dispatcher/src/workflows/implementation.ts
  • packages/dispatcher/test/implementation-workflow.test.ts

📝 Walkthrough

Adds an optional runVerifyGates hook and enforceVerifyOnDone loop: when a workflow outcome is done, the workflow runs repository verify.toml gates, nudges the agent with gate failure reports and re-awaits Stop until gates pass or a retry cap is reached; tests cover pass/fail/bare-stop/absent seam behaviors.

Changes

Verify-on-stop gate enforcement

| Layer | File(s) | Summary |
| --- | --- | --- |
| Gate verification capability | packages/dispatcher/src/build-deps.ts | formatGateFailures helper truncates and formats gate failure output; new deps.runVerifyGates(worktree) loads and executes verify.toml gates, logs and skips on load error, and returns formatted failure report on gate failure. |
| Workflow dependency contract and constants | packages/dispatcher/src/workflows/implementation.ts | ImplementationDeps gains optional runVerifyGates and verifyRoundCap fields; DEFAULT_VERIFY_ROUND_CAP and local verifyRoundCap binding added. |
| Verify-on-done enforcement and integration | packages/dispatcher/src/workflows/implementation.ts | New enforceVerifyOnDone loops: runs gates, returns done on pass, otherwise nudges agent and re-awaits Stop; exhausting verifyRoundCap returns nudge-exhausted. driveOnce invokes this when runVerifyGates is configured; launch-and-drive timeout expanded to include retry budget. |
| Verify-on-stop test suite | packages/dispatcher/test/implementation-workflow.test.ts | E2E tests: verify failing then passing in-session, verify never passing with round cap (parks for human), bare-stop re-stop cannot bypass verification, and seam-absent behavior (done parks unchanged). |
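
As an illustration of the "truncates and formats gate failure output" step, a helper along these lines would do the job. It is only a sketch: the `GateOutcome` shape, the character limit, the `[...]` marker, and the choice to keep the tail of the output are assumptions for the example, not details taken from build-deps.ts.

```typescript
// Hypothetical result shape; the real runGates output may differ.
interface GateOutcome {
  name: string;
  ok: boolean;
  output: string;
}

// Assumed limit, chosen only for illustration.
const MAX_OUTPUT_CHARS = 2_000;

function formatGateFailures(
  outcomes: GateOutcome[],
  maxChars: number = MAX_OUTPUT_CHARS,
): string {
  return outcomes
    .filter((o) => !o.ok) // passing gates add nothing to the nudge
    .map((o) => {
      const trimmed = o.output.trim();
      // Keep the tail: the last lines of tool output usually hold the error summary.
      const body =
        trimmed.length > maxChars ? `[...] ${trimmed.slice(-maxChars)}` : trimmed;
      return `FAILED ${o.name}\n${body}`;
    })
    .join("\n\n");
}
```

Bounding the report keeps the in-session nudge short enough for the agent to act on, at the cost of dropping early output from very noisy gates.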

Sequence Diagram(s)

sequenceDiagram
  participant driveOnce
  participant enforceVerifyOnDone
  participant runVerifyGates
  participant tmuxSession as tmux session

  driveOnce->>enforceVerifyOnDone: outcome done + runVerifyGates available
  loop verify rounds until pass or cap exhausted
    enforceVerifyOnDone->>runVerifyGates: run gates for worktree
    runVerifyGates-->>enforceVerifyOnDone: { ok, report }
    opt gates failed
      enforceVerifyOnDone->>tmuxSession: nudge agent with failure report
      enforceVerifyOnDone->>enforceVerifyOnDone: re-await Stop, re-classify
    end
  end

  alt gates eventually pass
    enforceVerifyOnDone->>driveOnce: return done (park for review)
  else verifyRoundCap exhausted
    enforceVerifyOnDone->>driveOnce: return nudge-exhausted (park for human)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • thejustinwalsh/middle#99: Introduces loadVerifyConfig and runGates utilities that are consumed by the new runVerifyGates implementation to load and execute verify.toml gates.

Suggested labels

ready-for-review

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main feature: adding verify-on-stop functionality to run verify.toml gates before 'done' outcomes park for review. It is specific, concise, and clearly communicates the primary change. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/dispatcher/src/workflows/implementation.ts`:
- Line 391: The step timeout calculation for the launch-and-drive flow currently
only accounts for maxNudges, so additional awaits caused by verifyRoundCap can
make the workflow hit the timeout during verify retries; update the timeout
computation and/or retry loop to include the verifyRoundCap waits (or
multiply/add verifyRoundCap * awaitStop) so verifyRoundCap is reflected in the
overall step timeout budget, and adjust the awaitStop/verify retry logic in the
functions referencing verifyRoundCap and maxNudges (look for verifyRoundCap,
maxNudges, awaitStop and the step timeout/launch-and-drive timeout computation)
to ensure the total estimated wait time includes both maxNudges and
verifyRoundCap contributions.
- Around line 463-468: The current early return of classification after
args.classifyAt(stopPayload) lets a re-classified "bare-stop" bypass done-signal
enforcement; change the branch so that if classification.kind === "bare-stop"
you call the resolver (e.g., deps.resolveBareStop or the existing
resolveBareStop helper) with the stopPayload/classification and return its
result, otherwise keep the existing return for non-"done" kinds; keep using
awaitStop, args.classifyAt and classification to locate the logic and ensure
"bare-stop" flows through resolveBareStop instead of returning raw
classification.

In `@packages/dispatcher/test/implementation-workflow.test.ts`:
- Around line 697-700: The test currently only checks the row reached
"waiting-human" and the worktree was kept; update it to assert the cap-exhausted
terminal semantics by verifying the wait signal was consumed/not re-armed: after
awaitRow(id, "waiting-human") and the expect(verifyCalls).toBe(2) check, add an
assertion that there are no further "waiting-human" signals for this id (e.g.,
query the row state or try a follow-up awaitRow/inspection and assert it is not
"waiting-human"/no new wait created) so the test proves the path is terminal
rather than a reusable park; keep existing checks on listWorktrees and
verifyCalls.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a67f81f5-ddaf-4d11-884d-da503b9ee723

📥 Commits

Reviewing files that changed from the base of the PR and between 4cf69ca and dd7b3d6.

📒 Files selected for processing (3)
  • packages/dispatcher/src/build-deps.ts
  • packages/dispatcher/src/workflows/implementation.ts
  • packages/dispatcher/test/implementation-workflow.test.ts

thejustinwalsh added a commit that referenced this pull request May 25, 2026
…lassification

Addresses CodeRabbit review on #136:
- widen launch-and-drive step timeout to include verifyRoundCap nudges
- route a re-classified bare-stop in the verify loop through resolveBareStop
  so it can't bypass the #80 done-signal as legacy completion; loop to
  re-verify when it clears, hand a parked outcome back otherwise
- assert terminal (no armed resume signal) semantics on the cap-exhaustion
  test, and add a bare-stop-bypass regression test
@thejustinwalsh

Owner Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
packages/dispatcher/src/workflows/implementation.ts (1)

857-862: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Step timeout still undercounts verify→bare-stop retry paths.

Line 857 widens the backstop for verifyRoundCap, but it still only budgets one awaitStop(...) per verify round. That misses the nested resolveBareStop(...) path at Lines 475-483, which can spend up to maxNudges additional stop waits before returning. With maxNudges=3 and verifyRoundCap=3, a done -> verify fail -> bare-stop -> ... -> done cycle can consume 12 nudge waits while this timeout only budgets 6, so bunqueue can still kill the live session mid-retry.

Proposed fix
       .step("launch-and-drive", launchAndDrive, {
         retry: 1,
         // Backstop above the internal waits, widened for both bounded in-session
         // nudge loops — bare-stop done-signal nudges (up to maxNudges) and
-        // verify-on-stop fix nudges (up to verifyRoundCap), each a further
-        // Stop-await — so it can't fire mid-nudge.
+        // verify-on-stop fix nudges. A verify retry can itself fall into the
+        // bare-stop loop, so budget `maxNudges` for the initial stop handling
+        // plus `verifyRoundCap * (1 + maxNudges)` further Stop-awaits.
         timeout:
-          launchTimeout + stopTimeout + (maxNudges + verifyRoundCap) * nudgeStopTimeout + 60_000,
+          launchTimeout +
+          stopTimeout +
+          (maxNudges + verifyRoundCap * (maxNudges + 1)) * nudgeStopTimeout +
+          60_000,
       })
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/dispatcher/src/workflows/implementation.ts` around lines 857 - 862,
The timeout backstop undercounts nested stop waits because each verify round can
itself trigger up to maxNudges additional awaitStop waits via resolveBareStop;
update the timeout calculation (the timeout property computed from
launchTimeout, stopTimeout, maxNudges, verifyRoundCap, nudgeStopTimeout) so it
budgets maxNudges for the initial nudges plus (1 + maxNudges) waits for each
verify round, e.g. replace the (maxNudges + verifyRoundCap) * nudgeStopTimeout
term with (maxNudges + verifyRoundCap * (1 + maxNudges)) * nudgeStopTimeout, as
in the proposed fix, so the timeout covers the nested verify→bare-stop retry
paths.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 091db836-c873-40fa-a73e-5b16f3fe1b4f

📥 Commits

Reviewing files that changed from the base of the PR and between dd7b3d6 and ec6e0e3.

📒 Files selected for processing (2)
  • packages/dispatcher/src/workflows/implementation.ts
  • packages/dispatcher/test/implementation-workflow.test.ts

@thejustinwalsh

Owner Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

… to fix (verify-on-stop)

The agent's verify.toml gates run NOWHERE in a live dispatch today — the
checkbox-revert reconciler exists but its production trigger is unbuilt (#101).
This adds the first live gate run: when the agent claims 'done' (about to park
for review), the workflow runs runGates(config.gates) in the worktree. On
failure it nudges the agent IN-SESSION with the gate report ('fix these, then
finish') and re-awaits the Stop, up to verifyRoundCap (3). It finalizes 'done'
only when gates pass; exhausting the cap parks for a human (worktree kept) so an
unverified PR never goes up. A re-classification that's no longer 'done' is
handed back to the normal branch.

Workflow-side (not a blocking Claude Stop hook) so it reuses the existing nudge
machinery and never races the agent.stopped turn-boundary. Opt-in per repo: a
missing/malformed verify.toml → skip. Wired in build-deps via loadVerifyConfig +
runGates; tests cover fail-then-pass→park, cap-exhausted→waiting-human, and the
no-seam no-op.
…lassification

Addresses CodeRabbit review on #136:
- widen launch-and-drive step timeout to include verifyRoundCap nudges
- route a re-classified bare-stop in the verify loop through resolveBareStop
  so it can't bypass the #80 done-signal as legacy completion; loop to
  re-verify when it clears, hand a parked outcome back otherwise
- assert terminal (no armed resume signal) semantics on the cap-exhaustion
  test, and add a bare-stop-bypass regression test
…step timeout

The launch-and-drive backstop counted one awaitStop per verify round plus
maxNudges, but each verify round's re-stop can re-enter resolveBareStop for up
to maxNudges further stop waits. Worst case is maxNudges + verifyRoundCap*(1 +
maxNudges) (15 with the defaults, not 6), so a done->verify-fail->bare-stop
retry cycle could be killed mid-flight. Hoisted as maxNudgeStopWaits.
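
The arithmetic in this commit message can be checked with a small sketch. The function and parameter names below follow the identifiers quoted in the review (maxNudges, verifyRoundCap, nudgeStopTimeout, maxNudgeStopWaits), but the signatures and the `NudgeBudget` and `Timeouts` types are assumptions for illustration, not the code in implementation.ts.

```typescript
interface NudgeBudget {
  maxNudges: number; // bare-stop done-signal nudges
  verifyRoundCap: number; // verify-on-stop fix nudges
}

interface Timeouts {
  launchTimeout: number;
  stopTimeout: number;
  nudgeStopTimeout: number;
}

// Earlier count: one Stop-await per verify round, plus the initial bare-stop nudges.
function flatStopWaits({ maxNudges, verifyRoundCap }: NudgeBudget): number {
  return maxNudges + verifyRoundCap;
}

// Corrected count: each verify round's re-stop may re-enter the bare-stop
// loop for up to maxNudges further Stop-awaits.
function maxNudgeStopWaits({ maxNudges, verifyRoundCap }: NudgeBudget): number {
  return maxNudges + verifyRoundCap * (1 + maxNudges);
}

// Step timeout in milliseconds, with the 60_000 ms margin shown in the proposed fix.
function stepTimeoutMs(budget: NudgeBudget, t: Timeouts): number {
  return (
    t.launchTimeout +
    t.stopTimeout +
    maxNudgeStopWaits(budget) * t.nudgeStopTimeout +
    60_000
  );
}

const defaults: NudgeBudget = { maxNudges: 3, verifyRoundCap: 3 };
console.log(flatStopWaits(defaults)); // 6
console.log(maxNudgeStopWaits(defaults)); // 15
```

With maxNudges = 3 and verifyRoundCap = 3, the flat count gives 3 + 3 = 6 while the nested count gives 3 + 3 × (1 + 3) = 15, which is the gap the commit closes.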
@thejustinwalsh thejustinwalsh merged commit 855e222 into main May 25, 2026
1 check was pending
@thejustinwalsh thejustinwalsh deleted the feat/verify-on-stop branch May 25, 2026 04:59