fix(dispatcher): self-heal a blocked agent that doesn't exit cleanly#165
Conversation
An agent that wrote .middle/blocked.json but kept running (instead of exiting) was idle-killed by the watchdog and then, when driveOnce's 4-hour awaitStop finally elapsed, compensated — pruning the worktree and orphaning the armed resume signal. loadPollableWaits only sees waiting-human workflows, so the human's "you can continue" reply could never resume it (epic #60). Make the blocked sentinel mean "waiting for a human", not "dead": - driveOnce races the Stop hook against tmux session-liveness and the wait timeout (awaitStopOrSessionEnd). When no Stop arrives but a blocked.json is present, park (asked-question) instead of throwing — so the saga never compensates and the worktree survives. A dead session with no sentinel still fails, unchanged. - TmuxOps gains an optional status probe; build-deps wires tmux.status. - watchdog idle-kill defers to a blocked sentinel: it kills the hung session (waking the drive's liveness race) and arms a resume signal, but never fails/compensates. Recorded once via watchdog.blocked-handoff. - parkForResume arms idempotently, so the watchdog's blocked:<id> and the park's epic-<n>-answered don't both land as pollable waits; the earlier created_at is kept so a reply made during the hang isn't filtered out. - HookServer #await supersedes a stale waiter and identity-guards its timeout, so an abandoned awaitStop (race lost) can't evict the same-named continuation's waiter and make the resumed drive time out.
|
Caution Review failedPull request was closed or merged during review No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThis PR implements blocked-sentinel self-heal for idle workflows: adds session liveness probing to detect when tmux sessions die, races the Stop hook against session death, hands off idle workflows with blocked.json sentinels instead of failing them, fixes a HookServer waiter re-registration race, and makes park-for-resume idempotent. ChangesBlocked-sentinel self-heal and session liveness
Sequence Diagram: Stop vs Liveness racesequenceDiagram
participant Workflow as Implementation Workflow
participant AwaitStop as awaitStopOrSessionEnd
participant StopHook as sessionGate.awaitStop
participant Liveness as tmux.status
Workflow->>AwaitStop: request stop wait
AwaitStop->>StopHook: start Stop hook wait
AwaitStop->>Liveness: poll session.alive (interval)
alt Stop arrives first
StopHook->>AwaitStop: Stop payload
AwaitStop->>Workflow: via: "stop"
else Session dies first
Liveness->>AwaitStop: alive = false
AwaitStop->>Workflow: via: "session-ended"
else Stop times out
StopHook->>AwaitStop: reject/timeout
AwaitStop->>Workflow: via: "timeout"
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
packages/dispatcher/src/build-deps.ts (1)
158-163: ⚡ Quick winExpose
livenessPollMsthrough the factory.
ImplementationDepsnow supports a configurable liveness cadence, butbuildImplementationDeps()never accepts or forwards it. Callers using the canonical factory are therefore pinned to the 5s default while the other timeout knobs remain configurable.♻️ Suggested wiring
export type BuildImplementationDepsArgs = { db: Database; ... launchTimeoutMs?: number; stopTimeoutMs?: number; + livenessPollMs?: number; reviewRoundCap?: number; maxNudges?: number; nudgeStopTimeoutMs?: number; }; const deps: ImplementationDeps = { ... launchTimeoutMs: args.launchTimeoutMs, stopTimeoutMs: args.stopTimeoutMs, + livenessPollMs: args.livenessPollMs, reviewRoundCap: args.reviewRoundCap,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/dispatcher/src/build-deps.ts` around lines 158 - 163, The ImplementationDeps object being constructed in buildImplementationDeps (the deps variable) does not include livenessPollMs, so callers cannot override the default cadence; update the buildImplementationDeps factory to accept a livenessPollMs parameter (or read it from args, e.g., args.livenessPollMs) and pass that value into the deps object (set deps.livenessPollMs = provided value, with a fallback to the existing default if undefined) so the ImplementationDeps returned by buildImplementationDeps honors the caller-configured liveness cadence.packages/dispatcher/test/implementation-workflow.test.ts (1)
378-404: ⚡ Quick winAssert that the original wait row's timestamp survives the park.
This test proves
parkForResumedoesn't add a second row, but it would still pass if the existingblocked:<id>row were replaced with a newercreated_at. That is the timestamp-preservation contract that prevents replies posted during the hang from being filtered out.Suggested test hardening
test("parkForResume keeps a pre-armed blocked signal (no duplicate)", async () => { const tmux = makeTmuxStub(); + let preArmedCreatedAt: number | null = null; const deps = makeDeps({ tmux: { ...tmux.ops, status: async () => ({ alive: false }) }, sessionGate: hangingGate, livenessPollMs: 20, getAdapter: () => @@ blockedAdapter(() => { const row = db .query( "SELECT id FROM workflows WHERE epic_number = ? AND state IN ('launching','running')", ) .get(EPIC) as { id: string } | null; - if (row) armWaitForSignal(db, `blocked:${row.id}`, row.id, null); + if (row) { + armWaitForSignal(db, `blocked:${row.id}`, row.id, null); + preArmedCreatedAt = ( + db.query("SELECT created_at FROM waitfor_signals WHERE workflow_id = ?").get(row.id) as { + created_at: number; + } + ).created_at; + } }), }); const id = await start(deps); await awaitParked(id); const rows = db - .query("SELECT signal_name FROM waitfor_signals WHERE workflow_id = ?") - .all(id) as Array<{ signal_name: string }>; + .query("SELECT signal_name, created_at FROM waitfor_signals WHERE workflow_id = ?") + .all(id) as Array<{ signal_name: string; created_at: number }>; expect(rows).toHaveLength(1); expect(rows[0]!.signal_name).toBe(`blocked:${id}`); + expect(rows[0]!.created_at).toBe(preArmedCreatedAt); });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/dispatcher/test/implementation-workflow.test.ts` around lines 378 - 404, The test currently only checks that a single blocked:<id> wait row exists after parkForResume, but doesn't verify the row's created_at wasn't replaced; modify the test around start(deps)/awaitParked to capture the original timestamp: after const id = await start(deps) query the waitfor_signals row for signal_name = `blocked:${id}` and store its created_at, then call awaitParked(id) and re-query the same row and assert the created_at is unchanged. Use the existing helpers (start, awaitParked) and reference the waitfor_signals table, signal_name `blocked:${id}`, and the created_at column when adding the assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/dispatcher/src/watchdog.ts`:
- Around line 222-233: The code records BLOCKED_HANDOFF_EVENT even when
safeKillSession()/killSession failed, preventing retries; change the flow so the
session-kill is attempted first and only if it actually succeeds do you call
armWaitForSignal() and recordEvent() for BLOCKED_HANDOFF_EVENT. Concretely:
update safeKillSession (or call killSession directly in a try/catch) to return a
success boolean or propagate errors, call that from the watchdog block around
latestEventType(...), and only when the kill returns success proceed to call
armWaitForSignal(deps.db, `blocked:${row.id}`, row.id, null) and
recordEvent(deps.db, {... type: BLOCKED_HANDOFF_EVENT ...}); keep
isWaitForArmed() checks as-is but ensure failures do not persist the event so
watchdog can retry.
In `@packages/dispatcher/src/workflows/implementation.ts`:
- Around line 797-808: The follow-up Stop waits in resolveBareStop and
enforceVerifyOnDone still call sessionGate.awaitStop(...) directly and must use
the liveness-aware awaitStopOrSessionEnd wrapper; change those direct calls to
invoke awaitStopOrSessionEnd with awaitStop: (timeoutMs) =>
deps.sessionGate.awaitStop(sessionName, timeoutMs), timeoutMs:
nudgeStopTimeoutMs (or the appropriate per-call timeout), isAlive wired to
deps.tmux.status if available (same probeStatus pattern used above), and pollMs:
deps.livenessPollMs so the watchdog/session-death race is handled consistently.
---
Nitpick comments:
In `@packages/dispatcher/src/build-deps.ts`:
- Around line 158-163: The ImplementationDeps object being constructed in
buildImplementationDeps (the deps variable) does not include livenessPollMs, so
callers cannot override the default cadence; update the buildImplementationDeps
factory to accept a livenessPollMs parameter (or read it from args, e.g.,
args.livenessPollMs) and pass that value into the deps object (set
deps.livenessPollMs = provided value, with a fallback to the existing default if
undefined) so the ImplementationDeps returned by buildImplementationDeps honors
the caller-configured liveness cadence.
In `@packages/dispatcher/test/implementation-workflow.test.ts`:
- Around line 378-404: The test currently only checks that a single blocked:<id>
wait row exists after parkForResume, but doesn't verify the row's created_at
wasn't replaced; modify the test around start(deps)/awaitParked to capture the
original timestamp: after const id = await start(deps) query the waitfor_signals
row for signal_name = `blocked:${id}` and store its created_at, then call
awaitParked(id) and re-query the same row and assert the created_at is
unchanged. Use the existing helpers (start, awaitParked) and reference the
waitfor_signals table, signal_name `blocked:${id}`, and the created_at column
when adding the assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 5ac147cc-d82b-41d3-af5a-7740db8cd4b1
📒 Files selected for processing (8)
packages/dispatcher/src/build-deps.tspackages/dispatcher/src/hook-server.tspackages/dispatcher/src/watchdog.tspackages/dispatcher/src/workflows/implementation.tspackages/dispatcher/test/hook-server.test.tspackages/dispatcher/test/implementation-workflow.test.tspackages/dispatcher/test/stop-wait.test.tspackages/dispatcher/test/watchdog.test.ts
…lf-heal - watchdog: safeKillSession now reports success; the blocked-handoff only arms the signal and records watchdog.blocked-handoff when the kill actually succeeded. A failed kill retries next pass instead of recording the handoff (which latestEventType would then suppress, stranding the workflow in 'running' with the session still alive). - drive: route every in-session Stop wait through one liveness-aware helper (awaitNextStop). The initial Stop, the resolveBareStop nudges, and the enforceVerifyOnDone verify rounds now all park-on-blocked / fail-on-dead uniformly — previously only the first Stop boundary was hardened, so a watchdog kill or blocked sentinel mid-nudge sat until nudgeStopTimeout then compensated the worktree. - build-deps: forward livenessPollMs through buildImplementationDeps, consistent with the sibling launch/stop/nudge timeout knobs. Tests: nudge-round park-on-blocked; watchdog kill-failure does not record the handoff.
Summary
A dispatched agent that wrote
.middle/blocked.jsonbut did not exit cleanly (kept running, then idled) ended upcompensatedwith its worktree pruned — so the human's "you can continue" reply could never resume it. This makes a blocked sentinel mean "waiting for a human," not "dead," so the watchdog and drive self-heal around an agent that doesn't follow the exit protocol.Discovered while debugging why epic #60's parked work never resumed.
Root cause (epic #60 timeline)
The last epic-60 workflow (
wf_1779820885218_8bt66i4n):blocked.jsonbut kept running, then went quiet (never fired Stop)failed(no prune;triggerCompensationisn't wired in prod)driveOnce's 4-hourawaitStoptimeout finally elapsed → threw →launch-and-drivefailed → saga rancleanupWorktree→ rowcompensatedand the worktree was destroyedloadPollableWaitsonly returnsstate = 'waiting-human'workflows, so the armedblocked:<id>signal was orphaned on acompensatedrow and the poller (running fine) never resumed it. A watchdog-only fix is insufficient — the 4hawaitStop→compensation path is what actually pruned the worktree.What changed (
packages/dispatcher)driveOnceraces the Stop hook against tmux session-liveness and the wait timeout (awaitStopOrSessionEnd). When no Stop arrives but ablocked.jsonis present, it parks (asked-question) instead of throwing — the saga never compensates, the worktree survives, and the armed signal stays pollable. A dead/hung session with no sentinel still fails (unchanged).TmuxOpsgains an optionalstatusprobe;build-depswirestmux.statusso production wakes within ~one poll (5s) of a kill rather than after the 4h timeout.watchdog.blocked-handoff.parkForResumearms idempotently, so the watchdog'sblocked:<id>and the park'sepic-<n>-answereddon't both land as pollable waits; the earliercreated_atis preserved so a reply made during the hang isn't filtered out byclassifyNewHumanReply.HookServer.#awaithardening: an abandonedawaitStop(liveness race won) no longer leaves a stale timer that evicts a same-named successor's waiter. Continuations reuse the deterministic session name, so without this the resumed drive would spuriously time out.Verification
New tests:
test/stop-wait.test.ts— the stop/session-end/timeout race, incl. inconclusive-probe handlingtest/watchdog.test.ts— idle-kill + blocked sentinel hands off (no fail/compensate), recorded oncetest/implementation-workflow.test.ts— hung agent parks + worktree preserved; pre-armed signal not duplicated; no-sentinel hang still compensatestest/hook-server.test.ts— a re-registeredawaitStopsurvives an abandoned waiter's stale timeoutOut of scope
Epic #60's own worktree was already pruned (5/26), so this fix prevents recurrence but does not recover #60 — that needs a fresh dispatch.
Summary by CodeRabbit
New Features
Bug Fixes
Tests