fix(dispatcher): race the recommender's Stop wait against tmux liveness by thejustinwalsh · Pull Request #233 · thejustinwalsh/middle

thejustinwalsh · 2026-06-04T18:30:57Z

Why

mm's recommender workflow has been landing compensated instead of completed on consecutive runs whenever the tmux session is killed mid-run — a watchdog idle-kill, a daemon mm stop/SIGTERM with engine.close(true) force-closing, or any other path that removes the session out from under the drive. The spawn step's bare sessionGate.awaitStop(name, agentTimeout) has no liveness race, so the wait blocks for the full per-repo agent timeout (up to the 30-min ceiling). During that stall bunqueue tears the job's lock token down, the step fails generically (you see the [dispatch] suppressed benign bunqueue lifecycle race: Invalid or expired lock token log line — load-bearing per packages/dispatcher/CLAUDE.md), and prepare-shallow-worktree's compensation runs, marking the row compensated with no recoverable reason.

The implementation drive solved this long ago with awaitStopOrSessionEnd — it races the Stop hook against an isAlive probe so a confirmed-dead session ends the wait. This PR points the recommender at the same helper and surfaces specific errors when the race wins (session ended before Stop hook / Stop hook timed out after Nms) instead of a generic compensation.

What changed

packages/dispatcher/src/workflows/recommender.ts — spawn-recommender-agent now calls awaitStopOrSessionEnd (imported from implementation.ts), passing tmux.status as the liveness probe and the new RecommenderDeps.livenessPollMs as the cadence. session-ended / timeout outcomes throw with specific reasons.
packages/dispatcher/src/main.ts — the daemon's recommender-deps wiring now includes status in its tmux ops (mirrors how buildImplementationDeps wires it for the implementation drive).
packages/dispatcher/src/recommender-run.ts — the standalone runner does the same.
packages/dispatcher/test/recommender-workflow.test.ts — regression test: a tmux.status that flips to alive: false mid-run, combined with a never-resolving awaitStop, settles the workflow inside a 1.5s budget (the harness's agentTimeoutMs is 2s — proof the race won, not the Stop timeout).

When tmux.status is unwired (a future seam, or a test that doesn't opt in), the helper degrades to Stop-or-timeout — identical to the pre-fix behavior. No production caller is left without status.

How to verify locally

git fetch origin fix/recommender-session-gone && git checkout fix/recommender-session-gone
bun install
bun test packages/dispatcher/test/recommender-workflow.test.ts
bun run typecheck
mm stop && mm start
mm run-recommender ~/Developer/middle
# watch the workflow row land 'completed', not 'compensated':
sqlite3 ~/.middle/db.sqlite3 \
  "SELECT id, state, (updated_at-created_at)/1000.0 AS dur_s FROM workflows \
     WHERE kind='recommender' ORDER BY created_at DESC LIMIT 3;"

A clean run completes in ~100s on this repo. To exercise the new fast-fail path, dispatch a recommender and tmux kill-session -t middle-rec-<slug>-<hash> against its session — the row should land compensated within livenessPollMs (5s prod default) plus the worktree rollback, instead of the prior multi-minute stall.

What to review

The awaitStopOrSessionEnd invocation in spawnRecommenderAgent — specifically that the two non-stop branches throw with distinct, identifiable strings (so the existing per-package CLAUDE.md note about the swallower's narrow regex still holds; the new errors are normal step failures, not lock-token races).
The wiring change in main.ts and recommender-run.ts — they're one-line additions to the tmux object literal, but the symmetry with buildImplementationDeps is the contract.
The test only adds a thin opt-in seam (sessionDiesAfterMs + neverStopping) to the existing harness; no existing test changes its assertions.

Fragile bits

The race's cadence is livenessPollMs (defaults to 5s in production, matching the implementation drive). A session that dies between two polls won't be detected until the next poll — within seconds, not minutes, so this is a meaningful improvement, but it's not synchronous.
A force-close (engine.close(true) on SIGTERM) can still race the step's exception handler; if the lock token expires before the new throws propagate, you'll still see compensated. The fix reduces the window the bunqueue race can land in from agentTimeout (15–30 min) to livenessPollMs (5s). A graceful drain (engine.close(false)) is the deeper fix, but it's outside this PR's blast radius.

Punted as follow-up

The historic compensated rate (~50/50 over the workflow's lifetime, per SELECT state, COUNT(*) FROM workflows WHERE kind='recommender' GROUP BY state) reflects this same class of failure under the prior daemon-bouncing pattern; no DB cleanup is in scope. A follow-up could add a "recommender row was force-killed mid-run" distinction to the compensated state, but the schema's terminal-state enum doesn't currently carry that nuance.
This does NOT address the PR fix(dispatcher): multi-repo coordination — close the real holes #229 suspicion in the brief — diffing PR fix(dispatcher): multi-repo coordination — close the real holes #229 against origin/main showed it only touches recommender-cron.ts (positive-number guards), github.ts (mapGhIssueState extraction), repo-config.ts (path normalization + migration 011), blocker-resolution.ts (empty-title fallback + truncation), and file-epic-gateway.ts (non-numeric ref guard). None of those touch the recommender lifecycle, tmux, hook-server, or the bunqueue lock-token race; the user-reported regression is the symptom of the long-standing session-liveness gap PR fix(dispatcher): multi-repo coordination — close the real holes #229 didn't introduce.
The dashboard's "recommender run health" surface (the user's separate work on dash/operator-console-refined) would benefit from rendering the new session ended before Stop hook / Stop hook timed out after Nms reasons explicitly; that's a UI follow-up, not in this PR's blast radius (per the brief's constraint: don't touch packages/dashboard/**).

Summary by CodeRabbit

Release Notes

Improvements
- Recommender workflows now detect session failures faster and fail immediately instead of waiting for timeouts
- Enhanced session liveness monitoring for better error recovery during runs
New Features
- Added optional livenessPollMs configuration to control session liveness polling frequency

The recommender's `spawn-recommender-agent` step awaited the Stop hook with the bare `sessionGate.awaitStop(name, agentTimeout)`. That call has no liveness race, so a tmux session killed out from under the drive — a watchdog idle-kill, a daemon SIGTERM with `engine.close(true)` force-closing mid-run, a manual `tmux kill-session` — would never receive `agent.stopped` and the step would block for the full `agentTimeout` (up to the 30-min ceiling). During that stall, bunqueue's worker tears the job's lock token down on shutdown and the step fails generically; `prepare-shallow-worktree`'s compensation runs and the workflow lands `compensated` with no recoverable reason. The implementation drive already solved this with `awaitStopOrSessionEnd`, which races the Stop hook against an `isAlive` probe. Reuse that here: - import + call `awaitStopOrSessionEnd` from the spawn step, plumbing `tmux.status` through as the liveness probe and surfacing `session-ended` / `timeout` outcomes as specific errors ("session ended before Stop hook" / "Stop hook timed out after Nms") instead of the generic compensation; - wire `status` into both the daemon's recommender tmux ops (`main.ts`) and the standalone runner's (`recommender-run.ts`) so the race actually has a probe in production — matching how `buildImplementationDeps` already wires the implementation drive's; - expose `livenessPollMs` on `RecommenderDeps` so the cadence is tunable in tests (defaults to the 5s the implementation drive uses); - regression test: a `tmux.status` that flips to `alive: false` mid-run combined with a never-resolving `awaitStop` settles the workflow in under the harness's 1.5s budget (≪ the 2s `agentTimeoutMs` the harness sets) — proving the race won, not the Stop timeout. When `tmux.status` is unwired (tests that don't opt in, or any future seam that omits it), the helper degrades to Stop-or-timeout — identical to the pre-fix behavior. No production caller is left without `status`.

coderabbitai · 2026-06-04T18:31:04Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 09f72ce4-20a2-470b-b4d1-ce72b4e1be05

📥 Commits

Reviewing files that changed from the base of the PR and between 128056e and dd50ad5.

📒 Files selected for processing (4)

packages/dispatcher/src/main.ts
packages/dispatcher/src/recommender-run.ts
packages/dispatcher/src/workflows/recommender.ts
packages/dispatcher/test/recommender-workflow.test.ts

📝 Walkthrough

Walkthrough

The recommender workflow now implements a liveness-aware race between a Stop hook and tmux session status, enabling fast failure when sessions are killed mid-run rather than waiting for timeouts. The status function is threaded through the dependency bundle, and a new livenessPollMs configuration controls polling frequency. Tests verify the race behavior deterministically.

Changes

Session Liveness-Aware Stop Race

Layer / File(s)	Summary
Liveness-aware stop contract and imports `packages/dispatcher/src/workflows/recommender.ts`	Added `awaitStopOrSessionEnd` import and optional `livenessPollMs: number` field to `RecommenderDeps` contract to configure session status polling cadence.
Spawn step stop-race implementation `packages/dispatcher/src/workflows/recommender.ts`	Replaced `awaitStop` with `awaitStopOrSessionEnd` to race the Stop hook against periodic tmux session status checks, throwing specific errors when the session ends early or when the Stop hook times out.
Dependency wiring of tmux.status across entry points `packages/dispatcher/src/recommender-run.ts`, `packages/dispatcher/src/main.ts`	Imported and passed `status` function through the `tmux` dependency bundle in both the recommender run module and the daemon's main workflow registration so the race logic can query session liveness.
Test harness liveness simulation and regression test `packages/dispatcher/test/recommender-workflow.test.ts`	Extended `makeHarness` test seam with `sessionDiesAfterMs` and `neverStopping` options to deterministically simulate tmux session death and non-resolving Stop hooks; added regression test verifying the workflow compensates cleanly and completes quickly when a session is killed mid-run.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

thejustinwalsh/middle#165: Implements awaitStopOrSessionEnd and livenessPollMs liveness-aware stop race for watchdog self-healing, directly related to the session-status polling mechanism added here.
thejustinwalsh/middle#105: Modifies recommender workflow wiring in the same files and dependency areas, establishing prior context for the spawn step and SessionGate integration.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: racing the Stop wait against tmux liveness to prevent long blocks when sessions are killed mid-run.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

thejustinwalsh · 2026-06-04T20:54:58Z

Reviewer brief — #233 (recommender session-liveness race)

Authored by an autonomous subagent run; root-cause + fix shape captured in the PR body. NOT a regression from #229 — long-standing session-liveness gap where the recommender drive's spawn-recommender-agent called bare awaitStop without a tmux-status race, so a killed tmux session stalled for the full agentTimeoutMs, bunqueue tore the lock token, and the step compensated.

How to run it

gh pr checkout 233
bun install
bun run typecheck && bun run lint                # both clean
bun test packages/dispatcher/test/recommender-workflow.test.ts   # the regression test
bun test                                          # 853 / 853 (pre-redesign baseline)

What to verify

The new regression test (a tmux session killed mid-run fails the spawn step fast (#229-followup)) settles compensated in under 1.5s instead of stalling the full 2s agentTimeoutMs. Proves the race won, not the Stop timeout.

End-to-end smoke — this is the gate the agent deferred to the human / operator:

mm stop && mm start                              # bounce daemon onto the fix
mm run-recommender thejustinwalsh/middle         # dispatch one tick
# watch ~/.middle/db.sqlite3 — the new wf row lands `completed`, not `compensated`

Read the new awaitStopOrSessionEnd call site in packages/dispatcher/src/workflows/recommender.ts — pattern is the same one the implementation drive has used since fix(dispatcher): self-heal a blocked agent that doesn't exit cleanly #165.

Fragile bits / follow-ups

~400 of 828 historical recommender runs are compensated. Same class. Not cleaned up in this PR.
A graceful drain on SIGTERM (engine.close(false) instead of close(true)) would eliminate the lock-token race window entirely. Out of scope.
The dashboard doesn't yet surface the new session ended before Stop hook / Stop hook timed out reasons — deferred per the brief's "do NOT touch packages/dashboard/**" constraint.

thejustinwalsh mentioned this pull request Jun 4, 2026

agent-queue: dispatch state #84

Open

thejustinwalsh marked this pull request as ready for review June 4, 2026 20:51

thejustinwalsh added the ready-for-review All phases done and verified — PR ready for final human review and merge label Jun 4, 2026

coderabbitai Bot approved these changes Jun 4, 2026

View reviewed changes

thejustinwalsh merged commit 3dc0752 into main Jun 4, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(dispatcher): race the recommender's Stop wait against tmux liveness#233

fix(dispatcher): race the recommender's Stop wait against tmux liveness#233
thejustinwalsh merged 1 commit into
mainfrom
fix/recommender-session-gone

thejustinwalsh commented Jun 4, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

thejustinwalsh commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thejustinwalsh commented Jun 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What changed

How to verify locally

What to review

Fragile bits

Punted as follow-up

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

thejustinwalsh commented Jun 4, 2026

Reviewer brief — #233 (recommender session-liveness race)

How to run it

What to verify

Fragile bits / follow-ups

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

thejustinwalsh commented Jun 4, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading