fix(dispatcher): self-heal a blocked agent that doesn't exit cleanly by thejustinwalsh · Pull Request #165 · thejustinwalsh/middle

thejustinwalsh · 2026-05-27T22:34:32Z

Summary

A dispatched agent that wrote .middle/blocked.json but did not exit cleanly (kept running, then idled) ended up compensated with its worktree pruned — so the human's "you can continue" reply could never resume it. This makes a blocked sentinel mean "waiting for a human," not "dead," so the watchdog and drive self-heal around an agent that doesn't follow the exit protocol.

Discovered while debugging why epic #60's parked work never resumed.

Root cause (epic #60 timeline)

The last epic-60 workflow (wf_1779820885218_8bt66i4n):

18:41 — resumed; agent wrote blocked.json but kept running, then went quiet (never fired Stop)
19:00 — watchdog idle-kill: killed the session, set row → failed (no prune; triggerCompensation isn't wired in prod)
~22:41 — driveOnce's 4-hour awaitStop timeout finally elapsed → threw → launch-and-drive failed → saga ran cleanupWorktree → row compensated and the worktree was destroyed

loadPollableWaits only returns state = 'waiting-human' workflows, so the armed blocked:<id> signal was orphaned on a compensated row and the poller (running fine) never resumed it. A watchdog-only fix is insufficient — the 4h awaitStop→compensation path is what actually pruned the worktree.
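
The orphaning described above can be illustrated with a minimal sketch. The types and the `pollableWaits` helper are hypothetical stand-ins rather than the dispatcher's real schema or the real `loadPollableWaits`; the only behaviour taken from this PR is that armed signals are returned only for workflows in the `waiting-human` state.

```typescript
// Hypothetical, simplified shapes; the real schema lives in packages/dispatcher.
type WorkflowState = "running" | "waiting-human" | "failed" | "compensated";

interface Workflow {
  id: string;
  state: WorkflowState;
}

interface ArmedSignal {
  signalName: string;
  workflowId: string;
}

// Stand-in for loadPollableWaits: only signals whose workflow is parked
// in 'waiting-human' are handed to the poller.
function pollableWaits(workflows: Workflow[], signals: ArmedSignal[]): ArmedSignal[] {
  const parked = new Set(
    workflows.filter((w) => w.state === "waiting-human").map((w) => w.id),
  );
  return signals.filter((s) => parked.has(s.workflowId));
}

const signals: ArmedSignal[] = [{ signalName: "blocked:wf_1", workflowId: "wf_1" }];

// Before the fix: the row ended up 'compensated', so the armed signal is never polled.
const orphaned = pollableWaits([{ id: "wf_1", state: "compensated" }], signals);

// After the fix (assuming a park leaves the row in 'waiting-human'): the signal is pollable.
const resumable = pollableWaits([{ id: "wf_1", state: "waiting-human" }], signals);

console.log(orphaned.length, resumable.length); // prints "0 1"
```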

What changed (`packages/dispatcher`)

driveOnce races the Stop hook against tmux session-liveness and the wait timeout (awaitStopOrSessionEnd). When no Stop arrives but a blocked.json is present, it parks (asked-question) instead of throwing — the saga never compensates, the worktree survives, and the armed signal stays pollable. A dead/hung session with no sentinel still fails (unchanged).
TmuxOps gains an optional status probe; build-deps wires tmux.status so production wakes within ~one poll (5s) of a kill rather than after the 4h timeout.
Watchdog idle-kill defers to a blocked sentinel: it kills the hung session (waking the drive's liveness race) and arms a resume signal, but never fails/compensates. Recorded once via watchdog.blocked-handoff.
parkForResume arms idempotently, so the watchdog's blocked:<id> and the park's epic-<n>-answered don't both land as pollable waits; the earlier created_at is preserved so a reply made during the hang isn't filtered out by classifyNewHumanReply.
HookServer.#await hardening: an abandoned awaitStop (liveness race won) no longer leaves a stale timer that evicts a same-named successor's waiter. Continuations reuse the deterministic session name, so without this the resumed drive would spuriously time out.
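
The idempotent arming in the list above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `WaitStore` and `armOnce` are hypothetical names, the store is an in-memory stand-in for the durable `waitfor_signals` table, and the "already armed" check is assumed to be keyed by workflow.

```typescript
// Hypothetical in-memory stand-in for the durable waitfor_signals table.
interface WaitRow {
  signalName: string;
  workflowId: string;
  createdAt: number;
}

class WaitStore {
  private rows: WaitRow[] = [];

  // Assumption: "armed" is checked per workflow, not per signal name.
  isArmed(workflowId: string): boolean {
    return this.rows.some((r) => r.workflowId === workflowId);
  }

  arm(signalName: string, workflowId: string, now: number): void {
    this.rows.push({ signalName, workflowId, createdAt: now });
  }

  forWorkflow(workflowId: string): WaitRow[] {
    return this.rows.filter((r) => r.workflowId === workflowId);
  }
}

// Arms only when nothing is armed yet, so an existing row and its
// createdAt are left untouched.
function armOnce(store: WaitStore, signalName: string, workflowId: string, now: number): boolean {
  if (store.isArmed(workflowId)) return false;
  store.arm(signalName, workflowId, now);
  return true;
}

const store = new WaitStore();
armOnce(store, "blocked:wf_1", "wf_1", 1000); // the watchdog arms during the hang
armOnce(store, "epic-60-answered", "wf_1", 5000); // the later park is a no-op
console.log(store.forWorkflow("wf_1"));
```

Because the first row keeps its original timestamp, a human reply posted at, say, t=3000 still counts as newer than the wait in this sketch, which is the property the PR relies on when it says the reply is not filtered out.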

Verification

bun test        # 730 pass, 0 fail
bun run typecheck
bun run lint && bun run format

New tests:

test/stop-wait.test.ts — the stop/session-end/timeout race, incl. inconclusive-probe handling
test/watchdog.test.ts — idle-kill + blocked sentinel hands off (no fail/compensate), recorded once
test/implementation-workflow.test.ts — hung agent parks + worktree preserved; pre-armed signal not duplicated; no-sentinel hang still compensates
test/hook-server.test.ts — a re-registered awaitStop survives an abandoned waiter's stale timeout

Out of scope

Epic #60's own worktree was already pruned (5/26), so this fix prevents recurrence but does not recover #60 — that needs a fresh dispatch.

Summary by CodeRabbit

New Features
- Session liveness probing during stop flows to classify stop vs. session-ended vs. timeout
- Watchdog blocked-handoff path: when a blocked sentinel exists, workflows are preserved and handed off for resume instead of failing
Bug Fixes
- Prevent stale-timer races when re-registering waiters
- Avoid duplicate resume-signal arming and skip recording handoff if session-kill fails
Tests
- Added tests for liveness, blocked-sentinel self-heal, waiters, and watchdog handoff scenarios

An agent that wrote .middle/blocked.json but kept running (instead of exiting) was idle-killed by the watchdog and then, when driveOnce's 4-hour awaitStop finally elapsed, compensated, which pruned the worktree and orphaned the armed resume signal. loadPollableWaits only sees waiting-human workflows, so the human's "you can continue" reply could never resume it (epic #60).

Make the blocked sentinel mean "waiting for a human", not "dead":

- driveOnce races the Stop hook against tmux session-liveness and the wait timeout (awaitStopOrSessionEnd). When no Stop arrives but a blocked.json is present, park (asked-question) instead of throwing, so the saga never compensates and the worktree survives. A dead session with no sentinel still fails, unchanged.
- TmuxOps gains an optional status probe; build-deps wires tmux.status.
- watchdog idle-kill defers to a blocked sentinel: it kills the hung session (waking the drive's liveness race) and arms a resume signal, but never fails/compensates. Recorded once via watchdog.blocked-handoff.
- parkForResume arms idempotently, so the watchdog's blocked:<id> and the park's epic-<n>-answered don't both land as pollable waits; the earlier created_at is kept so a reply made during the hang isn't filtered out.
- HookServer #await supersedes a stale waiter and identity-guards its timeout, so an abandoned awaitStop (race lost) can't evict the same-named continuation's waiter and make the resumed drive time out.

coderabbitai · 2026-05-27T22:34:38Z

Caution

Review failed

Pull request was closed or merged during review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 92df8174-01be-4b3b-9861-6dc0cd558297

📥 Commits

Reviewing files that changed from the base of the PR and between 9c83ba4 and 3f686ee.

📒 Files selected for processing (5)

packages/dispatcher/src/build-deps.ts
packages/dispatcher/src/watchdog.ts
packages/dispatcher/src/workflows/implementation.ts
packages/dispatcher/test/implementation-workflow.test.ts
packages/dispatcher/test/watchdog.test.ts

🚧 Files skipped from review as they are similar to previous changes (2)

packages/dispatcher/src/watchdog.ts
packages/dispatcher/src/workflows/implementation.ts

📝 Walkthrough

Walkthrough

This PR implements blocked-sentinel self-heal for idle workflows: adds session liveness probing to detect when tmux sessions die, races the Stop hook against session death, hands off idle workflows with blocked.json sentinels instead of failing them, fixes a HookServer waiter re-registration race, and makes park-for-resume idempotent.

Changes

Blocked-sentinel self-heal and session liveness

Layer / File(s)	Summary
Session liveness contracts, wiring, and imports `packages/dispatcher/src/workflows/implementation.ts`, `packages/dispatcher/src/build-deps.ts`	TmuxOps gains optional `status(sessionName): Promise<{ alive: boolean }>` interface, `ImplementationDeps` gains optional `livenessPollMs`, and `buildImplementationDeps` wires `tmux.status` and forwards `livenessPollMs`.
Stop-wait racing against session liveness `packages/dispatcher/src/workflows/implementation.ts`, `packages/dispatcher/test/stop-wait.test.ts`	Adds `StopWaitResult` and `awaitStopOrSessionEnd` that polls optional liveness and races it against `awaitStop`, mapping outcomes to `via: "stop"`, `via: "session-ended"`, or `via: "timeout"`.
Implementation workflow Stop boundary with liveness racing `packages/dispatcher/src/workflows/implementation.ts`	Routes all in-session Stop waits through `awaitNextStop` (which uses `awaitStopOrSessionEnd`), classifies outcomes, synthesizes park payloads when `.middle/blocked.json` exists, and integrates worktree-aware calls across drive/resolution/verify flows.
Idempotent park for resume `packages/dispatcher/src/workflows/implementation.ts`	`parkForResume` checks `isWaitForArmed` before calling `armWaitForSignal` to avoid duplicate durable `waitfor_signals` entries.
Watchdog blocked-sentinel handoff behavior `packages/dispatcher/src/watchdog.ts`, `packages/dispatcher/test/watchdog.test.ts`	Adds `BLOCKED_HANDOFF_EVENT`, makes `safeKillSession` return boolean, and implements a handoff path at idle threshold when `blocked.json` exists: optionally kill tmux, arm `blocked:<workflowId>` if not armed, record the handoff event once, and skip failure/compensation. Tests cover kill-failure and idempotency.
HookServer waiter supersession race fix `packages/dispatcher/src/hook-server.ts`, `packages/dispatcher/test/hook-server.test.ts`	`HookServer.#await()` now supersedes prior waiter for the same key: clears prior timer, rejects the prior waiter as superseded, and guards timeout callbacks to avoid stale-timer races. Regression test added.
Implementation workflow blocked self-heal test suite `packages/dispatcher/test/implementation-workflow.test.ts`	Adds e2e tests that simulate a hanging SessionGate and a `blocked.json` sentinel to assert parking, idempotent arming, preservation of worktree, compensation when no sentinel, and nudge-session-death self-heal behavior.
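
The waiter supersession row above can be sketched roughly like this. `WaiterRegistry`, `wait`, and `deliver` are hypothetical names chosen for illustration and are not the `HookServer` API; the sketch only mirrors the three behaviours the summary names: clear the prior timer, reject the prior waiter as superseded, and guard the timeout callback by identity.

```typescript
type Waiter<T> = {
  resolve: (value: T) => void;
  reject: (err: Error) => void;
  timer: ReturnType<typeof setTimeout>;
};

// Hypothetical registry keyed by session name (one waiter per key).
class WaiterRegistry<T> {
  private waiters = new Map<string, Waiter<T>>();

  wait(key: string, timeoutMs: number): Promise<T> {
    // Supersede any prior waiter for this key: clear its timer and reject it.
    const prior = this.waiters.get(key);
    if (prior) {
      clearTimeout(prior.timer);
      prior.reject(new Error(`superseded: ${key}`));
    }
    return new Promise<T>((resolve, reject) => {
      const waiter: Waiter<T> = {
        resolve,
        reject,
        timer: setTimeout(() => {
          // Identity guard: a stale timer must not evict a successor's waiter.
          if (this.waiters.get(key) !== waiter) return;
          this.waiters.delete(key);
          reject(new Error(`timeout: ${key}`));
        }, timeoutMs),
      };
      this.waiters.set(key, waiter);
    });
  }

  // Delivers a payload to the current waiter, if any.
  deliver(key: string, value: T): boolean {
    const waiter = this.waiters.get(key);
    if (!waiter) return false;
    clearTimeout(waiter.timer);
    this.waiters.delete(key);
    waiter.resolve(value);
    return true;
  }
}
```

Clearing the prior timer and guarding by identity are redundant on purpose in this sketch: either one alone would stop the stale timeout from evicting the successor, and together they tolerate a timer that has already fired but whose callback has not run yet. A caller that abandons the first promise should still attach a rejection handler to it, since a superseded waiter rejects.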

Sequence Diagram: Stop vs Liveness race

sequenceDiagram
  participant Workflow as Implementation Workflow
  participant AwaitStop as awaitStopOrSessionEnd
  participant StopHook as sessionGate.awaitStop
  participant Liveness as tmux.status
  Workflow->>AwaitStop: request stop wait
  AwaitStop->>StopHook: start Stop hook wait
  AwaitStop->>Liveness: poll session.alive (interval)
  alt Stop arrives first
    StopHook->>AwaitStop: Stop payload
    AwaitStop->>Workflow: via: "stop"
  else Session dies first
    Liveness->>AwaitStop: alive = false
    AwaitStop->>Workflow: via: "session-ended"
  else Stop times out
    StopHook->>AwaitStop: reject/timeout
    AwaitStop->>Workflow: via: "timeout"
  end
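
The race in the diagram can be approximated with the sketch below. The option names (`awaitStop`, `timeoutMs`, `isAlive`, `pollMs`) and the three `via` outcomes follow the review text, but the body is an assumption about how such a helper might look, not the merged implementation. In particular, it treats any `awaitStop` rejection as a timeout and an inconclusive probe (`undefined` or a thrown error) as "keep waiting".

```typescript
type StopWaitResult<P> =
  | { via: "stop"; payload: P }
  | { via: "session-ended" }
  | { via: "timeout" };

interface StopWaitOptions<P> {
  awaitStop: (timeoutMs: number) => Promise<P>; // assumed to reject on timeout
  timeoutMs: number;
  isAlive?: () => Promise<boolean | undefined>; // undefined = inconclusive probe
  pollMs?: number;
}

async function awaitStopOrSessionEnd<P>(opts: StopWaitOptions<P>): Promise<StopWaitResult<P>> {
  const pollMs = opts.pollMs ?? 5000;
  const probe = opts.isAlive;
  let timer: ReturnType<typeof setTimeout> | undefined;
  let settled = false;

  // Stop hook arm: a payload means "stop"; any rejection is reported as "timeout".
  const stop = opts.awaitStop(opts.timeoutMs).then(
    (payload): StopWaitResult<P> => ({ via: "stop", payload }),
    (): StopWaitResult<P> => ({ via: "timeout" }),
  );

  // Liveness arm: resolves only when the probe conclusively reports a dead session.
  const ended = new Promise<StopWaitResult<P>>((resolve) => {
    if (probe === undefined) return; // no probe wired: Stop or timeout decides
    const tick = async (): Promise<void> => {
      let alive: boolean | undefined;
      try {
        alive = await probe();
      } catch {
        alive = undefined; // a failing probe is inconclusive, not a death signal
      }
      if (settled) return;
      if (alive === false) {
        resolve({ via: "session-ended" });
        return;
      }
      timer = setTimeout(tick, pollMs);
    };
    timer = setTimeout(tick, pollMs);
  });

  try {
    return await Promise.race([stop, ended]);
  } finally {
    settled = true;
    clearTimeout(timer); // stop polling once either arm has won
  }
}
```

A caller would then branch on `via`: per the PR description, a `"session-ended"` or `"timeout"` outcome with `.middle/blocked.json` present parks the workflow, while the same outcome without a sentinel still fails.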

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

thejustinwalsh/middle#136: Modifies workflows/implementation.ts around Stop→park/verify flow related to liveness-aware Stop handling.
thejustinwalsh/middle#75: Introduced blocked-sentinel and watchdog flow that this PR extends with handoff events and self-heal behavior.

Suggested labels

bug

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly and specifically describes the main change: fixing a failure mode where a blocked agent doesn't exit cleanly by implementing self-healing logic. It's concise, directly related to the core objective, and avoids vague language.
Docstring Coverage	✅ Passed	Docstring coverage is 93.75% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

packages/dispatcher/src/build-deps.ts (1)

158-163: ⚡ Quick win

Expose livenessPollMs through the factory.

ImplementationDeps now supports a configurable liveness cadence, but buildImplementationDeps() never accepts or forwards it. Callers using the canonical factory are therefore pinned to the 5s default while the other timeout knobs remain configurable.

♻️ Suggested wiring

 export type BuildImplementationDepsArgs = {
   db: Database;
   ...
   launchTimeoutMs?: number;
   stopTimeoutMs?: number;
+  livenessPollMs?: number;
   reviewRoundCap?: number;
   maxNudges?: number;
   nudgeStopTimeoutMs?: number;
 };

   const deps: ImplementationDeps = {
     ...
     launchTimeoutMs: args.launchTimeoutMs,
     stopTimeoutMs: args.stopTimeoutMs,
+    livenessPollMs: args.livenessPollMs,
     reviewRoundCap: args.reviewRoundCap,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/dispatcher/src/build-deps.ts` around lines 158 - 163, The
ImplementationDeps object being constructed in buildImplementationDeps (the deps
variable) does not include livenessPollMs, so callers cannot override the
default cadence; update the buildImplementationDeps factory to accept a
livenessPollMs parameter (or read it from args, e.g., args.livenessPollMs) and
pass that value into the deps object (set deps.livenessPollMs = provided value,
with a fallback to the existing default if undefined) so the ImplementationDeps
returned by buildImplementationDeps honors the caller-configured liveness
cadence.

packages/dispatcher/test/implementation-workflow.test.ts (1)

378-404: ⚡ Quick win

Assert that the original wait row's timestamp survives the park.

This test proves parkForResume doesn't add a second row, but it would still pass if the existing blocked:<id> row were replaced with a newer created_at. That is the timestamp-preservation contract that prevents replies posted during the hang from being filtered out.

Suggested test hardening

   test("parkForResume keeps a pre-armed blocked signal (no duplicate)", async () => {
     const tmux = makeTmuxStub();
+    let preArmedCreatedAt: number | null = null;
     const deps = makeDeps({
       tmux: { ...tmux.ops, status: async () => ({ alive: false }) },
       sessionGate: hangingGate,
       livenessPollMs: 20,
       getAdapter: () =>
@@
         blockedAdapter(() => {
           const row = db
             .query(
               "SELECT id FROM workflows WHERE epic_number = ? AND state IN ('launching','running')",
             )
             .get(EPIC) as { id: string } | null;
-          if (row) armWaitForSignal(db, `blocked:${row.id}`, row.id, null);
+          if (row) {
+            armWaitForSignal(db, `blocked:${row.id}`, row.id, null);
+            preArmedCreatedAt = (
+              db.query("SELECT created_at FROM waitfor_signals WHERE workflow_id = ?").get(row.id) as {
+                created_at: number;
+              }
+            ).created_at;
+          }
         }),
     });
     const id = await start(deps);
     await awaitParked(id);
 
     const rows = db
-      .query("SELECT signal_name FROM waitfor_signals WHERE workflow_id = ?")
-      .all(id) as Array<{ signal_name: string }>;
+      .query("SELECT signal_name, created_at FROM waitfor_signals WHERE workflow_id = ?")
+      .all(id) as Array<{ signal_name: string; created_at: number }>;
     expect(rows).toHaveLength(1);
     expect(rows[0]!.signal_name).toBe(`blocked:${id}`);
+    expect(rows[0]!.created_at).toBe(preArmedCreatedAt);
   });

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/dispatcher/test/implementation-workflow.test.ts` around lines 378 -
404, The test currently only checks that a single blocked:<id> wait row exists
after parkForResume, but doesn't verify the row's created_at wasn't replaced;
modify the test around start(deps)/awaitParked to capture the original
timestamp: after const id = await start(deps) query the waitfor_signals row for
signal_name = `blocked:${id}` and store its created_at, then call
awaitParked(id) and re-query the same row and assert the created_at is
unchanged. Use the existing helpers (start, awaitParked) and reference the
waitfor_signals table, signal_name `blocked:${id}`, and the created_at column
when adding the assertions.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/dispatcher/src/watchdog.ts`:
- Around line 222-233: The code records BLOCKED_HANDOFF_EVENT even when
safeKillSession()/killSession failed, preventing retries; change the flow so the
session-kill is attempted first and only if it actually succeeds do you call
armWaitForSignal() and recordEvent() for BLOCKED_HANDOFF_EVENT. Concretely:
update safeKillSession (or call killSession directly in a try/catch) to return a
success boolean or propagate errors, call that from the watchdog block around
latestEventType(...), and only when the kill returns success proceed to call
armWaitForSignal(deps.db, `blocked:${row.id}`, row.id, null) and
recordEvent(deps.db, {... type: BLOCKED_HANDOFF_EVENT ...}); keep
isWaitForArmed() checks as-is but ensure failures do not persist the event so
watchdog can retry.

In `@packages/dispatcher/src/workflows/implementation.ts`:
- Around line 797-808: The follow-up Stop waits in resolveBareStop and
enforceVerifyOnDone still call sessionGate.awaitStop(...) directly and must use
the liveness-aware awaitStopOrSessionEnd wrapper; change those direct calls to
invoke awaitStopOrSessionEnd with awaitStop: (timeoutMs) =>
deps.sessionGate.awaitStop(sessionName, timeoutMs), timeoutMs:
nudgeStopTimeoutMs (or the appropriate per-call timeout), isAlive wired to
deps.tmux.status if available (same probeStatus pattern used above), and pollMs:
deps.livenessPollMs so the watchdog/session-death race is handled consistently.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5ac147cc-d82b-41d3-af5a-7740db8cd4b1

📥 Commits

Reviewing files that changed from the base of the PR and between 720044c and 9c83ba4.

📒 Files selected for processing (8)

packages/dispatcher/src/build-deps.ts
packages/dispatcher/src/hook-server.ts
packages/dispatcher/src/watchdog.ts
packages/dispatcher/src/workflows/implementation.ts
packages/dispatcher/test/hook-server.test.ts
packages/dispatcher/test/implementation-workflow.test.ts
packages/dispatcher/test/stop-wait.test.ts
packages/dispatcher/test/watchdog.test.ts

…lf-heal

- watchdog: safeKillSession now reports success; the blocked-handoff only arms the signal and records watchdog.blocked-handoff when the kill actually succeeded. A failed kill retries next pass instead of recording the handoff (which latestEventType would then suppress, stranding the workflow in 'running' with the session still alive).
- drive: route every in-session Stop wait through one liveness-aware helper (awaitNextStop). The initial Stop, the resolveBareStop nudges, and the enforceVerifyOnDone verify rounds now all park-on-blocked / fail-on-dead uniformly. Previously only the first Stop boundary was hardened, so a watchdog kill or blocked sentinel mid-nudge sat until nudgeStopTimeout then compensated the worktree.
- build-deps: forward livenessPollMs through buildImplementationDeps, consistent with the sibling launch/stop/nudge timeout knobs.

Tests: nudge-round park-on-blocked; watchdog kill-failure does not record the handoff.
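
The kill-then-record ordering in the watchdog bullet can be sketched as below. This is a simplified, synchronous illustration: `HandoffDeps` and `handOffBlocked` are hypothetical, the real `safeKillSession` is likely asynchronous, and the position of the `latestEventType` check relative to the kill is an assumption. The event name is the one this PR records.

```typescript
const BLOCKED_HANDOFF_EVENT = "watchdog.blocked-handoff";

// Hypothetical dependency surface; the real watchdog talks to tmux and the database.
interface HandoffDeps {
  killSession: (sessionName: string) => boolean; // true only when the kill succeeded
  isArmed: (workflowId: string) => boolean;
  arm: (signalName: string, workflowId: string) => void;
  latestEventType: (workflowId: string) => string | undefined;
  recordEvent: (workflowId: string, type: string) => void;
}

type HandoffOutcome = "handed-off" | "already-handed-off" | "retry-next-pass";

function handOffBlocked(deps: HandoffDeps, workflowId: string, sessionName: string): HandoffOutcome {
  // Recorded once: a prior handoff event suppresses further work.
  if (deps.latestEventType(workflowId) === BLOCKED_HANDOFF_EVENT) return "already-handed-off";
  // Kill first. On failure, persist nothing so the next watchdog pass retries.
  if (!deps.killSession(sessionName)) return "retry-next-pass";
  if (!deps.isArmed(workflowId)) deps.arm(`blocked:${workflowId}`, workflowId);
  deps.recordEvent(workflowId, BLOCKED_HANDOFF_EVENT);
  return "handed-off";
}

// In-memory stubs to exercise the three outcomes.
const events: string[] = [];
const armed: string[] = [];
let killOk = false;
const deps: HandoffDeps = {
  killSession: () => killOk,
  isArmed: (id) => armed.includes(`blocked:${id}`),
  arm: (name) => {
    armed.push(name);
  },
  latestEventType: () => events[events.length - 1],
  recordEvent: (_id, type) => {
    events.push(type);
  },
};

const first = handOffBlocked(deps, "wf_1", "session-a"); // kill fails
killOk = true;
const second = handOffBlocked(deps, "wf_1", "session-a"); // kill succeeds
const third = handOffBlocked(deps, "wf_1", "session-a"); // already recorded
console.log(first, second, third); // prints "retry-next-pass handed-off already-handed-off"
```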

thejustinwalsh marked this pull request as ready for review May 27, 2026 22:37

coderabbitai Bot requested changes May 27, 2026

Comment thread packages/dispatcher/src/watchdog.ts Outdated

Comment thread packages/dispatcher/src/workflows/implementation.ts Outdated

thejustinwalsh merged commit 74b4845 into main May 27, 2026
1 check was pending

thejustinwalsh deleted the fix/watchdog-blocked-sentinel-self-heal branch May 27, 2026 22:57

thejustinwalsh mentioned this pull request Jun 4, 2026

fix(dispatcher): race the recommender's Stop wait against tmux liveness #233

Merged

coderabbitai Bot mentioned this pull request Jun 5, 2026

fix(dispatcher): park arms the actual-reason signal even when blocked: is pre-armed #235

Merged
