fix(dispatcher): self-heal a blocked agent that doesn't exit cleanly#165

Merged
thejustinwalsh merged 2 commits into
mainfrom
fix/watchdog-blocked-sentinel-self-heal
May 27, 2026

@thejustinwalsh thejustinwalsh commented May 27, 2026

Summary

A dispatched agent that wrote .middle/blocked.json but did not exit cleanly (kept running, then idled) ended up compensated with its worktree pruned — so the human's "you can continue" reply could never resume it. This makes a blocked sentinel mean "waiting for a human," not "dead," so the watchdog and drive self-heal around an agent that doesn't follow the exit protocol.

Discovered while debugging why epic #60's parked work never resumed.

Root cause (epic #60 timeline)

The last epic-60 workflow (wf_1779820885218_8bt66i4n):

  • 18:41 - resumed; agent wrote blocked.json but kept running, then went quiet (never fired Stop)
  • 19:00 - watchdog idle-kill: killed the session, set row → failed (no prune; triggerCompensation isn't wired in prod)
  • ~22:41 - driveOnce's 4-hour awaitStop timeout finally elapsed → threw → launch-and-drive failed → saga ran cleanupWorktree → row compensated and the worktree was destroyed

loadPollableWaits only returns state = 'waiting-human' workflows, so the armed blocked:<id> signal was orphaned on a compensated row and the poller (running fine) never resumed it. A watchdog-only fix is insufficient — the 4h awaitStop→compensation path is what actually pruned the worktree.

What changed (packages/dispatcher)

  • driveOnce races the Stop hook against tmux session-liveness and the wait timeout (awaitStopOrSessionEnd). When no Stop arrives but a blocked.json is present, it parks (asked-question) instead of throwing — the saga never compensates, the worktree survives, and the armed signal stays pollable. A dead/hung session with no sentinel still fails (unchanged).
  • TmuxOps gains an optional status probe; build-deps wires tmux.status so production wakes within ~one poll (5s) of a kill rather than after the 4h timeout.
  • Watchdog idle-kill defers to a blocked sentinel: it kills the hung session (waking the drive's liveness race) and arms a resume signal, but never fails/compensates. Recorded once via watchdog.blocked-handoff.
  • parkForResume arms idempotently, so the watchdog's blocked:<id> and the park's epic-<n>-answered don't both land as pollable waits; the earlier created_at is preserved so a reply made during the hang isn't filtered out by classifyNewHumanReply.
  • HookServer.#await hardening: an abandoned awaitStop (liveness race won) no longer leaves a stale timer that evicts a same-named successor's waiter. Continuations reuse the deterministic session name, so without this the resumed drive would spuriously time out.
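
The park-or-fail rule from the first bullet can be sketched as a small pure function. This is only an illustration: `WaitOutcome`, `DriveDecision`, and `classifyStopBoundary` are hypothetical names, not identifiers from the dispatcher, and the real driveOnce does more than this.

```typescript
// Hypothetical types for illustration; not taken from packages/dispatcher.
type WaitOutcome = "stop" | "session-ended" | "timeout";

type DriveDecision =
  | { kind: "continue" } // a real Stop arrived: normal handling
  | { kind: "park"; reason: "asked-question" } // sentinel present: wait for a human
  | { kind: "fail"; reason: string }; // no Stop and no sentinel: unchanged failure path

// Decide what the drive does at a Stop boundary, given how the wait ended
// and whether .middle/blocked.json exists in the worktree.
function classifyStopBoundary(
  outcome: WaitOutcome,
  hasBlockedSentinel: boolean,
): DriveDecision {
  if (outcome === "stop") return { kind: "continue" };
  if (hasBlockedSentinel) return { kind: "park", reason: "asked-question" };
  return { kind: "fail", reason: `no Stop (${outcome}) and no blocked sentinel` };
}
```

Keeping the decision separate from the waiting makes the invariant easy to test: a sentinel never leads to the failure branch, so the saga has nothing to compensate.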

Verification

bun test        # 730 pass, 0 fail
bun run typecheck
bun run lint && bun run format

New tests:

  • test/stop-wait.test.ts — the stop/session-end/timeout race, incl. inconclusive-probe handling
  • test/watchdog.test.ts — idle-kill + blocked sentinel hands off (no fail/compensate), recorded once
  • test/implementation-workflow.test.ts — hung agent parks + worktree preserved; pre-armed signal not duplicated; no-sentinel hang still compensates
  • test/hook-server.test.ts — a re-registered awaitStop survives an abandoned waiter's stale timeout

Out of scope

Epic #60's own worktree was already pruned (5/26), so this fix prevents recurrence but does not recover #60 — that needs a fresh dispatch.

Summary by CodeRabbit

  • New Features

    • Session liveness probing during stop flows to classify stop vs. session-ended vs. timeout
    • Watchdog blocked-handoff path: when a blocked sentinel exists, workflows are preserved and handed off for resume instead of failing
  • Bug Fixes

    • Prevent stale-timer races when re-registering waiters
    • Avoid duplicate resume-signal arming and skip recording handoff if session-kill fails
  • Tests

    • Added tests for liveness, blocked-sentinel self-heal, waiters, and watchdog handoff scenarios

An agent that wrote .middle/blocked.json but kept running (instead of
exiting) was idle-killed by the watchdog and then, when driveOnce's
4-hour awaitStop finally elapsed, compensated — pruning the worktree and
orphaning the armed resume signal. loadPollableWaits only sees
waiting-human workflows, so the human's "you can continue" reply could
never resume it (epic #60).

Make the blocked sentinel mean "waiting for a human", not "dead":

- driveOnce races the Stop hook against tmux session-liveness and the
  wait timeout (awaitStopOrSessionEnd). When no Stop arrives but a
  blocked.json is present, park (asked-question) instead of throwing —
  so the saga never compensates and the worktree survives. A dead
  session with no sentinel still fails, unchanged.
- TmuxOps gains an optional status probe; build-deps wires tmux.status.
- watchdog idle-kill defers to a blocked sentinel: it kills the hung
  session (waking the drive's liveness race) and arms a resume signal,
  but never fails/compensates. Recorded once via watchdog.blocked-handoff.
- parkForResume arms idempotently, so the watchdog's blocked:<id> and the
  park's epic-<n>-answered don't both land as pollable waits; the earlier
  created_at is kept so a reply made during the hang isn't filtered out.
- HookServer #await supersedes a stale waiter and identity-guards its
  timeout, so an abandoned awaitStop (race lost) can't evict the
  same-named continuation's waiter and make the resumed drive time out.
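
A minimal sketch of the supersede-and-identity-guard idea from the last bullet, assuming a simple keyed waiter map. `WaiterRegistry`, `waitFor`, and `deliver` are hypothetical names; the real HookServer.#await is not shown in this PR text and may be structured differently.

```typescript
// Hypothetical waiter registry keyed by session name (illustration only).
type Waiter<T> = {
  resolve: (value: T) => void;
  reject: (err: Error) => void;
  timer: ReturnType<typeof setTimeout>;
};

class WaiterRegistry<T> {
  private waiters = new Map<string, Waiter<T>>();

  waitFor(key: string, timeoutMs: number): Promise<T> {
    // Supersede any prior waiter for the same key: stop its timer and reject it.
    const prior = this.waiters.get(key);
    if (prior) {
      clearTimeout(prior.timer);
      prior.reject(new Error(`superseded: ${key}`));
    }
    return new Promise<T>((resolve, reject) => {
      const waiter: Waiter<T> = {
        resolve,
        reject,
        timer: setTimeout(() => {
          // Identity guard: only evict the entry if it is still this waiter.
          if (this.waiters.get(key) !== waiter) return;
          this.waiters.delete(key);
          reject(new Error(`timeout: ${key}`));
        }, timeoutMs),
      };
      this.waiters.set(key, waiter);
    });
  }

  // Deliver a Stop payload to whoever currently waits on the key.
  deliver(key: string, value: T): boolean {
    const waiter = this.waiters.get(key);
    if (!waiter) return false;
    clearTimeout(waiter.timer);
    this.waiters.delete(key);
    waiter.resolve(value);
    return true;
  }
}
```

Clearing the prior timer already prevents the stale eviction in this sketch; the identity check is a second line of defense for a callback that was queued before the timer was cleared.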

coderabbitai Bot commented May 27, 2026

Caution

Review failed

Pull request was closed or merged during review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 92df8174-01be-4b3b-9861-6dc0cd558297

📥 Commits

Reviewing files that changed from the base of the PR and between 9c83ba4 and 3f686ee.

📒 Files selected for processing (5)
  • packages/dispatcher/src/build-deps.ts
  • packages/dispatcher/src/watchdog.ts
  • packages/dispatcher/src/workflows/implementation.ts
  • packages/dispatcher/test/implementation-workflow.test.ts
  • packages/dispatcher/test/watchdog.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/dispatcher/src/watchdog.ts
  • packages/dispatcher/src/workflows/implementation.ts

📝 Walkthrough

This PR implements blocked-sentinel self-heal for idle workflows: adds session liveness probing to detect when tmux sessions die, races the Stop hook against session death, hands off idle workflows with blocked.json sentinels instead of failing them, fixes a HookServer waiter re-registration race, and makes park-for-resume idempotent.

Changes

Blocked-sentinel self-heal and session liveness

| Layer | File(s) | Summary |
| --- | --- | --- |
| Session liveness contracts, wiring, and imports | packages/dispatcher/src/workflows/implementation.ts, packages/dispatcher/src/build-deps.ts | TmuxOps gains an optional status(sessionName): Promise<{ alive: boolean }> method, ImplementationDeps gains an optional livenessPollMs, and buildImplementationDeps wires tmux.status and forwards livenessPollMs. |
| Stop-wait racing against session liveness | packages/dispatcher/src/workflows/implementation.ts, packages/dispatcher/test/stop-wait.test.ts | Adds StopWaitResult and awaitStopOrSessionEnd, which polls the optional liveness probe and races it against awaitStop, mapping outcomes to via: "stop", "session-ended", or "timeout". |
| Implementation workflow Stop boundary with liveness racing | packages/dispatcher/src/workflows/implementation.ts | Routes all in-session Stop waits through awaitNextStop (which uses awaitStopOrSessionEnd), classifies outcomes, synthesizes park payloads when .middle/blocked.json exists, and integrates worktree-aware calls across drive/resolution/verify flows. |
| Idempotent park for resume | packages/dispatcher/src/workflows/implementation.ts | parkForResume checks isWaitForArmed before calling armWaitForSignal to avoid duplicate durable waitfor_signals entries. |
| Watchdog blocked-sentinel handoff behavior | packages/dispatcher/src/watchdog.ts, packages/dispatcher/test/watchdog.test.ts | Adds BLOCKED_HANDOFF_EVENT, makes safeKillSession return boolean, and implements a handoff path at idle threshold when blocked.json exists: optionally kill tmux, arm blocked:<workflowId> if not armed, record the handoff event once, and skip failure/compensation. Tests cover kill-failure and idempotency. |
| HookServer waiter supersession race fix | packages/dispatcher/src/hook-server.ts, packages/dispatcher/test/hook-server.test.ts | HookServer.#await() now supersedes the prior waiter for the same key: clears the prior timer, rejects the prior waiter as superseded, and guards timeout callbacks to avoid stale-timer races. Regression test added. |
| Implementation workflow blocked self-heal test suite | packages/dispatcher/test/implementation-workflow.test.ts | Adds e2e tests that simulate a hanging SessionGate and a blocked.json sentinel to assert parking, idempotent arming, preservation of worktree, compensation when no sentinel, and nudge-session-death self-heal behavior. |
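
The "Idempotent park for resume" behavior can be illustrated with an in-memory stand-in for the waitfor_signals table. This is a sketch under stated assumptions: `WaitStore` and its methods are hypothetical, and the real isWaitForArmed and armWaitForSignal take a database handle and may have different signatures.

```typescript
// In-memory stand-in for the durable waitfor_signals table (hypothetical).
type WaitRow = { signalName: string; workflowId: string; createdAt: number };

class WaitStore {
  private rows: WaitRow[] = [];

  isArmed(workflowId: string): boolean {
    return this.rows.some((r) => r.workflowId === workflowId);
  }

  arm(signalName: string, workflowId: string, now: number): void {
    this.rows.push({ signalName, workflowId, createdAt: now });
  }

  forWorkflow(workflowId: string): WaitRow[] {
    return this.rows.filter((r) => r.workflowId === workflowId);
  }
}

// Park without re-arming: if the watchdog already armed blocked:<id>, keep
// that row (and its earlier createdAt) instead of adding a second wait.
function parkForResume(
  store: WaitStore,
  workflowId: string,
  signalName: string,
  now: number,
): WaitRow[] {
  if (!store.isArmed(workflowId)) store.arm(signalName, workflowId, now);
  return store.forWorkflow(workflowId);
}
```

Because the earlier row is kept rather than replaced, a reply posted between the watchdog's arm and the later park still falls after createdAt and is not filtered out as stale.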

Sequence Diagram: Stop vs Liveness race

sequenceDiagram
  participant Workflow as Implementation Workflow
  participant AwaitStop as awaitStopOrSessionEnd
  participant StopHook as sessionGate.awaitStop
  participant Liveness as tmux.status
  Workflow->>AwaitStop: request stop wait
  AwaitStop->>StopHook: start Stop hook wait
  AwaitStop->>Liveness: poll session.alive (interval)
  alt Stop arrives first
    StopHook->>AwaitStop: Stop payload
    AwaitStop->>Workflow: via: "stop"
  else Session dies first
    Liveness->>AwaitStop: alive = false
    AwaitStop->>Workflow: via: "session-ended"
  else Stop times out
    StopHook->>AwaitStop: reject/timeout
    AwaitStop->>Workflow: via: "timeout"
  end
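
The race in the diagram can be sketched roughly as follows. The result shape mirrors the `via` values above, but the argument names (`isAlive`, `pollMs`) and the handling of an inconclusive probe are assumptions, not the PR's actual code.

```typescript
// Assumed shapes; the PR's real StopWaitResult may carry different fields.
type StopWaitResult<T> =
  | { via: "stop"; payload: T }
  | { via: "session-ended" }
  | { via: "timeout" };

type StopWaitArgs<T> = {
  awaitStop: (timeoutMs: number) => Promise<T>; // rejects when the wait times out
  timeoutMs: number;
  isAlive?: () => Promise<boolean | undefined>; // undefined = inconclusive probe
  pollMs?: number;
};

async function awaitStopOrSessionEnd<T>(
  args: StopWaitArgs<T>,
): Promise<StopWaitResult<T>> {
  const { awaitStop, timeoutMs, isAlive, pollMs = 5_000 } = args;
  let settled = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Simplification: any rejection of the Stop wait is reported as a timeout.
  const stop = awaitStop(timeoutMs).then(
    (payload): StopWaitResult<T> => ({ via: "stop", payload }),
    (): StopWaitResult<T> => ({ via: "timeout" }),
  );

  const sessionEnded = new Promise<StopWaitResult<T>>((resolve) => {
    if (!isAlive) return; // no probe wired: only Stop or timeout can win
    const tick = async (): Promise<void> => {
      if (settled) return;
      const alive = await isAlive().catch(() => undefined);
      if (settled) return;
      // Only a definite "not alive" ends the wait; inconclusive keeps polling.
      if (alive === false) resolve({ via: "session-ended" });
      else timer = setTimeout(tick, pollMs);
    };
    timer = setTimeout(tick, pollMs);
  });

  try {
    return await Promise.race([stop, sessionEnded]);
  } finally {
    settled = true; // stop polling once either side has won
    clearTimeout(timer);
  }
}
```

With a 5-second poll, a killed session is noticed within about one interval rather than after the full Stop timeout, which matches the wake-up behavior the PR description claims for production.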

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • thejustinwalsh/middle#136: Modifies workflows/implementation.ts around Stop→park/verify flow related to liveness-aware Stop handling.
  • thejustinwalsh/middle#75: Introduced blocked-sentinel and watchdog flow that this PR extends with handoff events and self-heal behavior.

Suggested labels

bug

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and specifically describes the main change: fixing a failure mode where a blocked agent doesn't exit cleanly by implementing self-healing logic. It's concise, directly related to the core objective, and avoids vague language. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 93.75% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@thejustinwalsh thejustinwalsh marked this pull request as ready for review May 27, 2026 22:37

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
packages/dispatcher/src/build-deps.ts (1)

158-163: ⚡ Quick win

Expose livenessPollMs through the factory.

ImplementationDeps now supports a configurable liveness cadence, but buildImplementationDeps() never accepts or forwards it. Callers using the canonical factory are therefore pinned to the 5s default while the other timeout knobs remain configurable.

♻️ Suggested wiring
 export type BuildImplementationDepsArgs = {
   db: Database;
   ...
   launchTimeoutMs?: number;
   stopTimeoutMs?: number;
+  livenessPollMs?: number;
   reviewRoundCap?: number;
   maxNudges?: number;
   nudgeStopTimeoutMs?: number;
 };

   const deps: ImplementationDeps = {
     ...
     launchTimeoutMs: args.launchTimeoutMs,
     stopTimeoutMs: args.stopTimeoutMs,
+    livenessPollMs: args.livenessPollMs,
     reviewRoundCap: args.reviewRoundCap,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/dispatcher/src/build-deps.ts` around lines 158 - 163, The
ImplementationDeps object being constructed in buildImplementationDeps (the deps
variable) does not include livenessPollMs, so callers cannot override the
default cadence; update the buildImplementationDeps factory to accept a
livenessPollMs parameter (or read it from args, e.g., args.livenessPollMs) and
pass that value into the deps object (set deps.livenessPollMs = provided value,
with a fallback to the existing default if undefined) so the ImplementationDeps
returned by buildImplementationDeps honors the caller-configured liveness
cadence.
packages/dispatcher/test/implementation-workflow.test.ts (1)

378-404: ⚡ Quick win

Assert that the original wait row's timestamp survives the park.

This test proves parkForResume doesn't add a second row, but it would still pass if the existing blocked:<id> row were replaced with a newer created_at. That is the timestamp-preservation contract that prevents replies posted during the hang from being filtered out.

Suggested test hardening
   test("parkForResume keeps a pre-armed blocked signal (no duplicate)", async () => {
     const tmux = makeTmuxStub();
+    let preArmedCreatedAt: number | null = null;
     const deps = makeDeps({
       tmux: { ...tmux.ops, status: async () => ({ alive: false }) },
       sessionGate: hangingGate,
       livenessPollMs: 20,
       getAdapter: () =>
@@
         blockedAdapter(() => {
           const row = db
             .query(
               "SELECT id FROM workflows WHERE epic_number = ? AND state IN ('launching','running')",
             )
             .get(EPIC) as { id: string } | null;
-          if (row) armWaitForSignal(db, `blocked:${row.id}`, row.id, null);
+          if (row) {
+            armWaitForSignal(db, `blocked:${row.id}`, row.id, null);
+            preArmedCreatedAt = (
+              db.query("SELECT created_at FROM waitfor_signals WHERE workflow_id = ?").get(row.id) as {
+                created_at: number;
+              }
+            ).created_at;
+          }
         }),
     });
     const id = await start(deps);
     await awaitParked(id);
 
     const rows = db
-      .query("SELECT signal_name FROM waitfor_signals WHERE workflow_id = ?")
-      .all(id) as Array<{ signal_name: string }>;
+      .query("SELECT signal_name, created_at FROM waitfor_signals WHERE workflow_id = ?")
+      .all(id) as Array<{ signal_name: string; created_at: number }>;
     expect(rows).toHaveLength(1);
     expect(rows[0]!.signal_name).toBe(`blocked:${id}`);
+    expect(rows[0]!.created_at).toBe(preArmedCreatedAt);
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/dispatcher/test/implementation-workflow.test.ts` around lines 378 -
404, The test currently only checks that a single blocked:<id> wait row exists
after parkForResume, but doesn't verify the row's created_at wasn't replaced;
modify the test around start(deps)/awaitParked to capture the original
timestamp: after const id = await start(deps) query the waitfor_signals row for
signal_name = `blocked:${id}` and store its created_at, then call
awaitParked(id) and re-query the same row and assert the created_at is
unchanged. Use the existing helpers (start, awaitParked) and reference the
waitfor_signals table, signal_name `blocked:${id}`, and the created_at column
when adding the assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/dispatcher/src/watchdog.ts`:
- Around line 222-233: The code records BLOCKED_HANDOFF_EVENT even when
safeKillSession()/killSession failed, preventing retries; change the flow so the
session-kill is attempted first and only if it actually succeeds do you call
armWaitForSignal() and recordEvent() for BLOCKED_HANDOFF_EVENT. Concretely:
update safeKillSession (or call killSession directly in a try/catch) to return a
success boolean or propagate errors, call that from the watchdog block around
latestEventType(...), and only when the kill returns success proceed to call
armWaitForSignal(deps.db, `blocked:${row.id}`, row.id, null) and
recordEvent(deps.db, {... type: BLOCKED_HANDOFF_EVENT ...}); keep
isWaitForArmed() checks as-is but ensure failures do not persist the event so
watchdog can retry.

In `@packages/dispatcher/src/workflows/implementation.ts`:
- Around line 797-808: The follow-up Stop waits in resolveBareStop and
enforceVerifyOnDone still call sessionGate.awaitStop(...) directly and must use
the liveness-aware awaitStopOrSessionEnd wrapper; change those direct calls to
invoke awaitStopOrSessionEnd with awaitStop: (timeoutMs) =>
deps.sessionGate.awaitStop(sessionName, timeoutMs), timeoutMs:
nudgeStopTimeoutMs (or the appropriate per-call timeout), isAlive wired to
deps.tmux.status if available (same probeStatus pattern used above), and pollMs:
deps.livenessPollMs so the watchdog/session-death race is handled consistently.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5ac147cc-d82b-41d3-af5a-7740db8cd4b1

📥 Commits

Reviewing files that changed from the base of the PR and between 720044c and 9c83ba4.

📒 Files selected for processing (8)
  • packages/dispatcher/src/build-deps.ts
  • packages/dispatcher/src/hook-server.ts
  • packages/dispatcher/src/watchdog.ts
  • packages/dispatcher/src/workflows/implementation.ts
  • packages/dispatcher/test/hook-server.test.ts
  • packages/dispatcher/test/implementation-workflow.test.ts
  • packages/dispatcher/test/stop-wait.test.ts
  • packages/dispatcher/test/watchdog.test.ts

Comment thread packages/dispatcher/src/watchdog.ts Outdated
Comment thread packages/dispatcher/src/workflows/implementation.ts Outdated
…lf-heal

- watchdog: safeKillSession now reports success; the blocked-handoff only
  arms the signal and records watchdog.blocked-handoff when the kill
  actually succeeded. A failed kill retries next pass instead of recording
  the handoff (which latestEventType would then suppress, stranding the
  workflow in 'running' with the session still alive).
- drive: route every in-session Stop wait through one liveness-aware helper
  (awaitNextStop). The initial Stop, the resolveBareStop nudges, and the
  enforceVerifyOnDone verify rounds now all park-on-blocked / fail-on-dead
  uniformly — previously only the first Stop boundary was hardened, so a
  watchdog kill or blocked sentinel mid-nudge sat until nudgeStopTimeout
  then compensated the worktree.
- build-deps: forward livenessPollMs through buildImplementationDeps,
  consistent with the sibling launch/stop/nudge timeout knobs.

Tests: nudge-round park-on-blocked; watchdog kill-failure does not record
the handoff.
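
The kill-gated handoff described in the first bullet might look roughly like this. It is a sketch, not the watchdog's code: `HandoffDeps` flattens the real dependencies (which take a db handle) into hypothetical callbacks, and the event string is assumed to match the `watchdog.blocked-handoff` name mentioned above.

```typescript
// Hypothetical, simplified dependency surface for illustration.
type HandoffDeps = {
  killSession: (sessionName: string) => Promise<void>;
  isWaitForArmed: (workflowId: string) => boolean;
  armWaitForSignal: (signalName: string, workflowId: string) => void;
  latestEventType: (workflowId: string) => string | undefined;
  recordEvent: (workflowId: string, type: string) => void;
};

// Assumed value; the PR only says the handoff is recorded via this name.
const BLOCKED_HANDOFF_EVENT = "watchdog.blocked-handoff";

// Report whether the kill actually succeeded instead of swallowing errors.
async function safeKillSession(deps: HandoffDeps, sessionName: string): Promise<boolean> {
  try {
    await deps.killSession(sessionName);
    return true;
  } catch {
    return false;
  }
}

type HandoffResult = "handed-off" | "already-recorded" | "retry-next-pass";

async function handOffBlocked(
  deps: HandoffDeps,
  workflowId: string,
  sessionName: string,
): Promise<HandoffResult> {
  // Recorded once: a prior handoff event suppresses repeats.
  if (deps.latestEventType(workflowId) === BLOCKED_HANDOFF_EVENT) return "already-recorded";
  // A failed kill records nothing, so the next watchdog pass can retry.
  if (!(await safeKillSession(deps, sessionName))) return "retry-next-pass";
  if (!deps.isWaitForArmed(workflowId)) {
    deps.armWaitForSignal(`blocked:${workflowId}`, workflowId);
  }
  deps.recordEvent(workflowId, BLOCKED_HANDOFF_EVENT);
  return "handed-off";
}
```

Ordering matters here: recording the event before confirming the kill would make the once-only check suppress every later attempt while the session is still alive.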