fix(dispatcher): race the recommender's Stop wait against tmux liveness #233
The recommender's `spawn-recommender-agent` step awaited the Stop hook with the
bare `sessionGate.awaitStop(name, agentTimeout)`. That call has no liveness
race, so a tmux session killed out from under the drive — a watchdog idle-kill,
a daemon SIGTERM with `engine.close(true)` force-closing mid-run, a manual
`tmux kill-session` — would never receive `agent.stopped` and the step would
block for the full `agentTimeout` (up to the 30-min ceiling). During that
stall, bunqueue's worker tears the job's lock token down on shutdown and the
step fails generically; `prepare-shallow-worktree`'s compensation runs and
the workflow lands `compensated` with no recoverable reason.
The implementation drive already solved this with `awaitStopOrSessionEnd`,
which races the Stop hook against an `isAlive` probe. Reuse that here:
- import + call `awaitStopOrSessionEnd` from the spawn step, plumbing
`tmux.status` through as the liveness probe and surfacing `session-ended` /
`timeout` outcomes as specific errors ("session ended before Stop hook" /
"Stop hook timed out after Nms") instead of the generic compensation;
- wire `status` into both the daemon's recommender tmux ops (`main.ts`) and
the standalone runner's (`recommender-run.ts`) so the race actually has a
probe in production — matching how `buildImplementationDeps` already wires
the implementation drive's;
- expose `livenessPollMs` on `RecommenderDeps` so the cadence is tunable in
tests (defaults to the 5s the implementation drive uses);
- regression test: a `tmux.status` that flips to `alive: false` mid-run
combined with a never-resolving `awaitStop` settles the workflow in under
the harness's 1.5s budget (≪ the 2s `agentTimeoutMs` the harness sets) —
proving the race won, not the Stop timeout.
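The race described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the repository's `awaitStopOrSessionEnd`: the function name `raceStopAgainstLiveness`, the `RaceOptions` shape, the `"stopped"` outcome label, and the choice to treat a throwing probe as inconclusive are all assumptions. Only the `session-ended` / `timeout` outcomes, the optional probe, and the 5s default cadence come from the description.

```typescript
// Hypothetical sketch; names and signature are assumptions, not the repo's code.
type StopOutcome = "stopped" | "session-ended" | "timeout";

interface RaceOptions {
  awaitStop: () => Promise<void>;      // resolves when the Stop hook fires
  isAlive?: () => Promise<boolean>;    // liveness probe; may be unwired
  timeoutMs: number;                   // overall Stop timeout
  livenessPollMs?: number;             // probe cadence (5s default per the PR text)
}

async function raceStopAgainstLiveness(opts: RaceOptions): Promise<StopOutcome> {
  const { awaitStop, isAlive, timeoutMs, livenessPollMs = 5_000 } = opts;
  let timeoutTimer: ReturnType<typeof setTimeout> | undefined;
  let pollTimer: ReturnType<typeof setTimeout> | undefined;
  let settled = false;

  const stopped = awaitStop().then((): StopOutcome => "stopped");
  const timedOut = new Promise<StopOutcome>((resolve) => {
    timeoutTimer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  const contenders: Promise<StopOutcome>[] = [stopped, timedOut];

  // Without a probe the race degrades to plain Stop-or-timeout.
  if (isAlive) {
    contenders.push(
      new Promise<StopOutcome>((resolve) => {
        const poll = async (): Promise<void> => {
          if (settled) return;
          let alive = true;
          try {
            alive = await isAlive();
          } catch {
            // Assumption: a failing probe is inconclusive, so keep waiting.
          }
          if (!alive) {
            resolve("session-ended");
            return;
          }
          if (!settled) pollTimer = setTimeout(poll, livenessPollMs);
        };
        pollTimer = setTimeout(poll, livenessPollMs);
      }),
    );
  }

  try {
    return await Promise.race(contenders);
  } finally {
    // Stop polling and drop the timeout so no timers outlive the race.
    settled = true;
    clearTimeout(timeoutTimer);
    clearTimeout(pollTimer);
  }
}
```

With a never-resolving `awaitStop` and a probe that flips to `false`, this sketch settles as `"session-ended"` within roughly one poll interval, which is the shape of the regression test in the last bullet.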
When `tmux.status` is unwired (tests that don't opt in, or any future seam
that omits it), the helper degrades to Stop-or-timeout — identical to the
pre-fix behavior. No production caller is left without `status`.
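For the error surfacing mentioned above, one plausible shape is a small mapping from race outcome to thrown error. The helper name `assertStopped` and the `"stopped"` outcome label are assumptions for illustration; the two message strings are the ones quoted in this description.

```typescript
// Hypothetical helper; only the two error strings are taken from the PR text.
type StopOutcome = "stopped" | "session-ended" | "timeout";

function assertStopped(outcome: StopOutcome, timeoutMs: number): void {
  switch (outcome) {
    case "stopped":
      return; // Stop hook fired; the step continues normally.
    case "session-ended":
      throw new Error("session ended before Stop hook");
    case "timeout":
      throw new Error(`Stop hook timed out after ${timeoutMs}ms`);
  }
}
```

Keeping the two failure strings distinct is what lets a reader of the workflow row tell a killed session apart from a genuinely slow agent.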
No actionable comments were generated in the recent review.
Walkthrough
The recommender workflow now implements a liveness-aware race between a Stop hook and tmux session status, enabling fast failure when sessions are killed mid-run rather than waiting for timeouts.
Changes
Session Liveness-Aware Stop Race
Estimated code review effort: 3 (Moderate), ~25 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Reviewer brief: #233 (recommender session-liveness race)
Authored by an autonomous subagent run; root-cause + fix shape captured in the PR body. NOT a regression from #229: long-standing session-liveness gap where the recommender drive's
How to run it
gh pr checkout 233
bun install
bun run typecheck && bun run lint # both clean
bun test packages/dispatcher/test/recommender-workflow.test.ts # the regression test
bun test # 853 / 853 (pre-redesign baseline)
What to verify
Fragile bits / follow-ups
Why
mm's recommender workflow has been landing `compensated` instead of `completed` on consecutive runs whenever the tmux session is killed mid-run: a watchdog idle-kill, a daemon `mm stop`/`SIGTERM` with `engine.close(true)` force-closing, or any other path that removes the session out from under the drive. The spawn step's bare `sessionGate.awaitStop(name, agentTimeout)` has no liveness race, so the wait blocks for the full per-repo agent timeout (up to the 30-min ceiling). During that stall bunqueue tears the job's lock token down, the step fails generically (you see the `[dispatch] suppressed benign bunqueue lifecycle race: Invalid or expired lock token` log line, load-bearing per `packages/dispatcher/CLAUDE.md`), and `prepare-shallow-worktree`'s compensation runs, marking the row `compensated` with no recoverable reason.
The implementation drive solved this long ago with `awaitStopOrSessionEnd`: it races the Stop hook against an `isAlive` probe so a confirmed-dead session ends the wait. This PR points the recommender at the same helper and surfaces specific errors when the race wins (`session ended before Stop hook` / `Stop hook timed out after Nms`) instead of a generic compensation.
What changed
- `packages/dispatcher/src/workflows/recommender.ts`: `spawn-recommender-agent` now calls `awaitStopOrSessionEnd` (imported from `implementation.ts`), passing `tmux.status` as the liveness probe and the new `RecommenderDeps.livenessPollMs` as the cadence. `session-ended` / `timeout` outcomes throw with specific reasons.
- `packages/dispatcher/src/main.ts`: the daemon's recommender-deps wiring now includes `status` in its `tmux` ops (mirrors how `buildImplementationDeps` wires it for the implementation drive).
- `packages/dispatcher/src/recommender-run.ts`: the standalone runner does the same.
- `packages/dispatcher/test/recommender-workflow.test.ts`: regression test. A `tmux.status` that flips to `alive: false` mid-run, combined with a never-resolving `awaitStop`, settles the workflow inside a 1.5s budget (the harness's `agentTimeoutMs` is 2s, which proves the race won, not the Stop timeout).
When `tmux.status` is unwired (a future seam, or a test that doesn't opt in), the helper degrades to Stop-or-timeout, identical to the pre-fix behavior. No production caller is left without `status`.
How to verify locally
A clean run completes in ~100s on this repo. To exercise the new fast-fail path, dispatch a recommender and `tmux kill-session -t middle-rec-<slug>-<hash>` against its session. The row should land `compensated` within `livenessPollMs` (5s prod default) plus the worktree rollback, instead of the prior multi-minute stall.
What to review
- `awaitStopOrSessionEnd` invocation in `spawnRecommenderAgent`: specifically that the two non-stop branches throw with distinct, identifiable strings (so the existing per-package CLAUDE.md note about the swallower's narrow regex still holds; the new errors are normal step failures, not lock-token races).
- `main.ts` and `recommender-run.ts`: they're one-line additions to the `tmux` object literal, but the symmetry with `buildImplementationDeps` is the contract.
- `sessionDiesAfterMs` + `neverStopping`) to the existing harness; no existing test changes its assertions.
Fragile bits
- `livenessPollMs` (defaults to 5s in production, matching the implementation drive). A session that dies between two polls won't be detected until the next poll. That is within seconds, not minutes, so this is a meaningful improvement, but it's not synchronous.
- `engine.close(true)` on SIGTERM) can still race the step's exception handler; if the lock token expires before the new throws propagate, you'll still see `compensated`. The fix reduces the window the bunqueue race can land in from `agentTimeout` (15–30 min) to `livenessPollMs` (5s). A graceful drain (`engine.close(false)`) is the deeper fix, but it's outside this PR's blast radius.
Punted as follow-up
- `compensated` rate (~50/50 over the workflow's lifetime, per `SELECT state, COUNT(*) FROM workflows WHERE kind='recommender' GROUP BY state`) reflects this same class of failure under the prior daemon-bouncing pattern; no DB cleanup is in scope. A follow-up could add a "recommender row was force-killed mid-run" distinction to the compensated state, but the schema's terminal-state enum doesn't currently carry that nuance.
- `origin/main` showed it only touches `recommender-cron.ts` (positive-number guards), `github.ts` (`mapGhIssueState` extraction), `repo-config.ts` (path normalization + migration 011), `blocker-resolution.ts` (empty-title fallback + truncation), and `file-epic-gateway.ts` (non-numeric ref guard). None of those touch the recommender lifecycle, tmux, hook-server, or the bunqueue lock-token race; the user-reported regression is the symptom of the long-standing session-liveness gap PR fix(dispatcher): multi-repo coordination — close the real holes #229 didn't introduce.
- `dash/operator-console-refined`) would benefit from rendering the new `session ended before Stop hook` / `Stop hook timed out after Nms` reasons explicitly; that's a UI follow-up, not in this PR's blast radius (per the brief's constraint: don't touch `packages/dashboard/**`).
Summary by CodeRabbit
Release Notes
Improvements
New Features
`livenessPollMs` configuration to control session liveness polling frequency