Skip to content

fix(dispatcher): race the recommender's Stop wait against tmux liveness#233

Merged
thejustinwalsh merged 1 commit into
mainfrom
fix/recommender-session-gone
Jun 4, 2026
Merged

fix(dispatcher): race the recommender's Stop wait against tmux liveness#233
thejustinwalsh merged 1 commit into
mainfrom
fix/recommender-session-gone

Conversation

@thejustinwalsh

@thejustinwalsh thejustinwalsh commented Jun 4, 2026

Copy link
Copy Markdown
Owner

Why

mm's recommender workflow has been landing compensated instead of completed on consecutive runs whenever the tmux session is killed mid-run — a watchdog idle-kill, a daemon mm stop/SIGTERM with engine.close(true) force-closing, or any other path that removes the session out from under the drive. The spawn step's bare sessionGate.awaitStop(name, agentTimeout) has no liveness race, so the wait blocks for the full per-repo agent timeout (up to the 30-min ceiling). During that stall bunqueue tears the job's lock token down, the step fails generically (you see the [dispatch] suppressed benign bunqueue lifecycle race: Invalid or expired lock token log line — load-bearing per packages/dispatcher/CLAUDE.md), and prepare-shallow-worktree's compensation runs, marking the row compensated with no recoverable reason.

The implementation drive solved this long ago with awaitStopOrSessionEnd — it races the Stop hook against an isAlive probe so a confirmed-dead session ends the wait. This PR points the recommender at the same helper and surfaces specific errors when the race wins (session ended before Stop hook / Stop hook timed out after Nms) instead of a generic compensation.

What changed

  • packages/dispatcher/src/workflows/recommender.tsspawn-recommender-agent now calls awaitStopOrSessionEnd (imported from implementation.ts), passing tmux.status as the liveness probe and the new RecommenderDeps.livenessPollMs as the cadence. session-ended / timeout outcomes throw with specific reasons.
  • packages/dispatcher/src/main.ts — the daemon's recommender-deps wiring now includes status in its tmux ops (mirrors how buildImplementationDeps wires it for the implementation drive).
  • packages/dispatcher/src/recommender-run.ts — the standalone runner does the same.
  • packages/dispatcher/test/recommender-workflow.test.ts — regression test: a tmux.status that flips to alive: false mid-run, combined with a never-resolving awaitStop, settles the workflow inside a 1.5s budget (the harness's agentTimeoutMs is 2s — proof the race won, not the Stop timeout).

When tmux.status is unwired (a future seam, or a test that doesn't opt in), the helper degrades to Stop-or-timeout — identical to the pre-fix behavior. No production caller is left without status.

How to verify locally

git fetch origin fix/recommender-session-gone && git checkout fix/recommender-session-gone
bun install
bun test packages/dispatcher/test/recommender-workflow.test.ts
bun run typecheck
mm stop && mm start
mm run-recommender ~/Developer/middle
# watch the workflow row land 'completed', not 'compensated':
sqlite3 ~/.middle/db.sqlite3 \
  "SELECT id, state, (updated_at-created_at)/1000.0 AS dur_s FROM workflows \
     WHERE kind='recommender' ORDER BY created_at DESC LIMIT 3;"

A clean run completes in ~100s on this repo. To exercise the new fast-fail path, dispatch a recommender and tmux kill-session -t middle-rec-<slug>-<hash> against its session — the row should land compensated within livenessPollMs (5s prod default) plus the worktree rollback, instead of the prior multi-minute stall.

What to review

  • The awaitStopOrSessionEnd invocation in spawnRecommenderAgent — specifically that the two non-stop branches throw with distinct, identifiable strings (so the existing per-package CLAUDE.md note about the swallower's narrow regex still holds; the new errors are normal step failures, not lock-token races).
  • The wiring change in main.ts and recommender-run.ts — they're one-line additions to the tmux object literal, but the symmetry with buildImplementationDeps is the contract.
  • The test only adds a thin opt-in seam (sessionDiesAfterMs + neverStopping) to the existing harness; no existing test changes its assertions.

Fragile bits

  • The race's cadence is livenessPollMs (defaults to 5s in production, matching the implementation drive). A session that dies between two polls won't be detected until the next poll — within seconds, not minutes, so this is a meaningful improvement, but it's not synchronous.
  • A force-close (engine.close(true) on SIGTERM) can still race the step's exception handler; if the lock token expires before the new throws propagate, you'll still see compensated. The fix reduces the window the bunqueue race can land in from agentTimeout (15–30 min) to livenessPollMs (5s). A graceful drain (engine.close(false)) is the deeper fix, but it's outside this PR's blast radius.

Punted as follow-up

  • The historic compensated rate (~50/50 over the workflow's lifetime, per SELECT state, COUNT(*) FROM workflows WHERE kind='recommender' GROUP BY state) reflects this same class of failure under the prior daemon-bouncing pattern; no DB cleanup is in scope. A follow-up could add a "recommender row was force-killed mid-run" distinction to the compensated state, but the schema's terminal-state enum doesn't currently carry that nuance.
  • This does NOT address the PR fix(dispatcher): multi-repo coordination — close the real holes #229 suspicion in the brief — diffing PR fix(dispatcher): multi-repo coordination — close the real holes #229 against origin/main showed it only touches recommender-cron.ts (positive-number guards), github.ts (mapGhIssueState extraction), repo-config.ts (path normalization + migration 011), blocker-resolution.ts (empty-title fallback + truncation), and file-epic-gateway.ts (non-numeric ref guard). None of those touch the recommender lifecycle, tmux, hook-server, or the bunqueue lock-token race; the user-reported regression is the symptom of the long-standing session-liveness gap PR fix(dispatcher): multi-repo coordination — close the real holes #229 didn't introduce.
  • The dashboard's "recommender run health" surface (the user's separate work on dash/operator-console-refined) would benefit from rendering the new session ended before Stop hook / Stop hook timed out after Nms reasons explicitly; that's a UI follow-up, not in this PR's blast radius (per the brief's constraint: don't touch packages/dashboard/**).

Summary by CodeRabbit

Release Notes

  • Improvements

    • Recommender workflows now detect session failures faster and fail immediately instead of waiting for timeouts
    • Enhanced session liveness monitoring for better error recovery during runs
  • New Features

    • Added optional livenessPollMs configuration to control session liveness polling frequency

The recommender's `spawn-recommender-agent` step awaited the Stop hook with the
bare `sessionGate.awaitStop(name, agentTimeout)`. That call has no liveness
race, so a tmux session killed out from under the drive — a watchdog idle-kill,
a daemon SIGTERM with `engine.close(true)` force-closing mid-run, a manual
`tmux kill-session` — would never receive `agent.stopped` and the step would
block for the full `agentTimeout` (up to the 30-min ceiling). During that
stall, bunqueue's worker tears the job's lock token down on shutdown and the
step fails generically; `prepare-shallow-worktree`'s compensation runs and
the workflow lands `compensated` with no recoverable reason.

The implementation drive already solved this with `awaitStopOrSessionEnd`,
which races the Stop hook against an `isAlive` probe. Reuse that here:

- import + call `awaitStopOrSessionEnd` from the spawn step, plumbing
  `tmux.status` through as the liveness probe and surfacing `session-ended` /
  `timeout` outcomes as specific errors ("session ended before Stop hook" /
  "Stop hook timed out after Nms") instead of the generic compensation;
- wire `status` into both the daemon's recommender tmux ops (`main.ts`) and
  the standalone runner's (`recommender-run.ts`) so the race actually has a
  probe in production — matching how `buildImplementationDeps` already wires
  the implementation drive's;
- expose `livenessPollMs` on `RecommenderDeps` so the cadence is tunable in
  tests (defaults to the 5s the implementation drive uses);
- regression test: a `tmux.status` that flips to `alive: false` mid-run
  combined with a never-resolving `awaitStop` settles the workflow in under
  the harness's 1.5s budget (≪ the 2s `agentTimeoutMs` the harness sets) —
  proving the race won, not the Stop timeout.

When `tmux.status` is unwired (tests that don't opt in, or any future seam
that omits it), the helper degrades to Stop-or-timeout — identical to the
pre-fix behavior. No production caller is left without `status`.
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 09f72ce4-20a2-470b-b4d1-ce72b4e1be05

📥 Commits

Reviewing files that changed from the base of the PR and between 128056e and dd50ad5.

📒 Files selected for processing (4)
  • packages/dispatcher/src/main.ts
  • packages/dispatcher/src/recommender-run.ts
  • packages/dispatcher/src/workflows/recommender.ts
  • packages/dispatcher/test/recommender-workflow.test.ts

📝 Walkthrough

Walkthrough

The recommender workflow now implements a liveness-aware race between a Stop hook and tmux session status, enabling fast failure when sessions are killed mid-run rather than waiting for timeouts. The status function is threaded through the dependency bundle, and a new livenessPollMs configuration controls polling frequency. Tests verify the race behavior deterministically.

Changes

Session Liveness-Aware Stop Race

Layer / File(s) Summary
Liveness-aware stop contract and imports
packages/dispatcher/src/workflows/recommender.ts
Added awaitStopOrSessionEnd import and optional livenessPollMs: number field to RecommenderDeps contract to configure session status polling cadence.
Spawn step stop-race implementation
packages/dispatcher/src/workflows/recommender.ts
Replaced awaitStop with awaitStopOrSessionEnd to race the Stop hook against periodic tmux session status checks, throwing specific errors when the session ends early or when the Stop hook times out.
Dependency wiring of tmux.status across entry points
packages/dispatcher/src/recommender-run.ts, packages/dispatcher/src/main.ts
Imported and passed status function through the tmux dependency bundle in both the recommender run module and the daemon's main workflow registration so the race logic can query session liveness.
Test harness liveness simulation and regression test
packages/dispatcher/test/recommender-workflow.test.ts
Extended makeHarness test seam with sessionDiesAfterMs and neverStopping options to deterministically simulate tmux session death and non-resolving Stop hooks; added regression test verifying the workflow compensates cleanly and completes quickly when a session is killed mid-run.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • thejustinwalsh/middle#165: Implements awaitStopOrSessionEnd and livenessPollMs liveness-aware stop race for watchdog self-healing, directly related to the session-status polling mechanism added here.
  • thejustinwalsh/middle#105: Modifies recommender workflow wiring in the same files and dependency areas, establishing prior context for the spawn step and SessionGate integration.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: racing the Stop wait against tmux liveness to prevent long blocks when sessions are killed mid-run.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@thejustinwalsh thejustinwalsh marked this pull request as ready for review June 4, 2026 20:51
@thejustinwalsh thejustinwalsh added the ready-for-review All phases done and verified — PR ready for final human review and merge label Jun 4, 2026
@thejustinwalsh

Copy link
Copy Markdown
Owner Author

Reviewer brief — #233 (recommender session-liveness race)

Authored by an autonomous subagent run; root-cause + fix shape captured in the PR body. NOT a regression from #229 — long-standing session-liveness gap where the recommender drive's spawn-recommender-agent called bare awaitStop without a tmux-status race, so a killed tmux session stalled for the full agentTimeoutMs, bunqueue tore the lock token, and the step compensated.

How to run it

gh pr checkout 233
bun install
bun run typecheck && bun run lint                # both clean
bun test packages/dispatcher/test/recommender-workflow.test.ts   # the regression test
bun test                                          # 853 / 853 (pre-redesign baseline)

What to verify

  • The new regression test (a tmux session killed mid-run fails the spawn step fast (#229-followup)) settles compensated in under 1.5s instead of stalling the full 2s agentTimeoutMs. Proves the race won, not the Stop timeout.
  • End-to-end smoke — this is the gate the agent deferred to the human / operator:
    mm stop && mm start                              # bounce daemon onto the fix
    mm run-recommender thejustinwalsh/middle         # dispatch one tick
    # watch ~/.middle/db.sqlite3 — the new wf row lands `completed`, not `compensated`
  • Read the new awaitStopOrSessionEnd call site in packages/dispatcher/src/workflows/recommender.ts — pattern is the same one the implementation drive has used since fix(dispatcher): self-heal a blocked agent that doesn't exit cleanly #165.

Fragile bits / follow-ups

  • ~400 of 828 historical recommender runs are compensated. Same class. Not cleaned up in this PR.
  • A graceful drain on SIGTERM (engine.close(false) instead of close(true)) would eliminate the lock-token race window entirely. Out of scope.
  • The dashboard doesn't yet surface the new session ended before Stop hook / Stop hook timed out reasons — deferred per the brief's "do NOT touch packages/dashboard/**" constraint.

@thejustinwalsh thejustinwalsh merged commit 3dc0752 into main Jun 4, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-review All phases done and verified — PR ready for final human review and merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant