
feat(dispatcher): hooks + watchdog — Phase 2 (Epic #14)#75

Merged: thejustinwalsh merged 10 commits into main from middle-issue-14 on May 23, 2026

@thejustinwalsh thejustinwalsh commented May 23, 2026

Owner

Summary

Closes #14

Phase 2 ships middle's observability layer: a full token-authenticated hook receiver (the spec's "HMAC token", implemented as a per-session bearer token rather than a body signature), the universal hook script, the complete Claude hook-event set, event/heartbeat persistence into SQLite, the watchdog reconciler cron, and reactive rate-limit detection. Hooks are the fast-path notification; the on-disk transcript + SQLite are the durable state the crons reconcile against.

What changed

  • packages/dispatcher/src/hook-server.ts — full event-name validation, per-session bearer-token auth (X-Middle-Token, timing-safe), and an injected persistence sink; session.started/agent.stopped still drive the SessionGate.
  • packages/dispatcher/src/hook-store.ts (DbHookStore): one events row per hook (payload ≤16KB), heartbeat bump on tool.pre/tool.post, session-field write on session.started; events are session-correlated and unmatchable sessions are dropped.
  • packages/dispatcher/src/watchdog.ts + watchdog-cron.ts — 30s reconciler: launch-timeout (stuck-launching / prompt-not-accepted), tmux liveness, idle/idle-timeout (skipped while controlled_by='human', transcript cross-check), blocked.json sentinel re-arm, and a companion transcript-drift pass.
  • packages/dispatcher/src/rate-limits.ts: rate_limit_state read/write + parseResetAt + probe-via-real-work revert.
  • packages/adapters/claude/src/hooks.ts — full 8-hook .claude/settings.json mapped to the normalized taxonomy.
  • packages/adapters/claude/src/classify.ts / index.ts: detectRateLimit Stop-hook detector.
  • packages/core/src/events.ts: NORMALIZED_EVENTS + isNormalizedEvent; hook-script.ts: canonical HOOK_SH.
  • packages/cli/src/bootstrap-assets/hooks/hook.sh — committed universal script (drift-guarded against HOOK_SH).
  • packages/dispatcher/src/main.ts — wires DbHookStore + the watchdog cron into the long-running process.
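
The event-name validation in the list above can be pictured as a closed vocabulary plus a type guard. The sketch below is an assumption about shape only: the array holds just the normalized names that appear in this PR's text, while the real NORMALIZED_EVENTS in @middle/core covers the full taxonomy, and every `Sketch`-suffixed name is hypothetical.

```typescript
// Partial, illustrative vocabulary: only event names mentioned in this PR.
const NORMALIZED_EVENTS_SKETCH = [
  "session.started",
  "turn.started",
  "tool.pre",
  "tool.post",
  "agent.stopped",
] as const;

type NormalizedEventSketch = (typeof NORMALIZED_EVENTS_SKETCH)[number];

// Type guard: narrows an untrusted :event path segment to the known vocabulary.
function isNormalizedEventSketch(name: string): name is NormalizedEventSketch {
  return (NORMALIZED_EVENTS_SKETCH as readonly string[]).indexOf(name) !== -1;
}

// Only tool events bump the heartbeat, per the hook-store bullet above.
function bumpsHeartbeatSketch(event: NormalizedEventSketch): boolean {
  return event === "tool.pre" || event === "tool.post";
}
```

A receiver built this way can reject an unknown `:event` before it touches auth or persistence, which matches the ordering the reviewer's brief describes.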

Why these changes

The interactive agent process never exits between turns, so the dispatcher can't read an exit code — hooks POST turn boundaries and the on-disk JSONL transcript is the authoritative state. The hook server is the fast path (authenticated, validated, persisted); the watchdog is the safety net that only acts on staleness and never overrides an in-progress hook decision. Auth is a per-session bearer token, not a body signature, because hook.sh forwards the token as a header — resolved from the workflow row (populated before launch) and compared constant-time. The hook script is single-sourced from one @middle/core constant so the adapter-written copy and the mm init bootstrap asset can never drift.

Verification

bun run typecheck clean; bun test: 173 pass / 0 fail across 22 files.

Acceptance evidence (Epic #14)

| Criterion | Status | Evidence |
| --- | --- | --- |
| All sub-issues closed | ✅ Met | #15-#20 all CLOSED |
| Hook events flow into SQLite; session.started records session_id+transcript_path; reconciler keeps state honest | ✅ Met | hook-store.test.ts (wired HookServer→DbHookStore→SQLite + session.started write); watchdog.test.ts reconcileTranscriptDrift |
| Killing a tmux session is caught by the watchdog within 30s | ✅ Met | watchdog.test.ts "dead session → failed + compensation"; cron cadence WATCHDOG_INTERVAL_MS = 30_000 |
| A forced rate-limit error is recorded with the correct reset_at | ✅ Met | rate-limits.test.ts + implementation-workflow.test.ts "rate-limited classifyStop → RATE_LIMITED with parsed reset_at" |

The three live criteria are proven at the mechanism level by the suite (every link in the path has a test); the spec's Phase 2 manual acceptance — spawn a real Claude agent and watch events flow, kill its tmux session, force a usage-limit message — remains the operator's final hands-on check and needs a live, authenticated Claude CLI.

Status

Decisions

planning/issues/14/decisions.md, distilled into inline review comments on this PR (sink seam, bearer-token auth, watchdog-vs-execution boundary, idle-once-per-period, probe-via-real-work revert).

Scope boundaries carried forward (already-planned phases — no new issues)

  • Cross-run delayed re-enqueue of a rate-limited adapter (read reset_at, delay until reset) → Phase 8 auto-dispatch loop. Phase 2 writes the durable rate_limit_state signal.
  • Execution cancellation on watchdog failure (so bunqueue compensation unwinds the blocked awaitStop) → Phase 5 waitFor integration. Phase 2 provides the triggerCompensation seam + durable state failure.
  • Committed .middle/hooks/hook.sh dogfood copy → Phase 3 task 22 (mm init middle into itself). The dispatch hard-rule forbids staging/committing under .middle/ (the running dispatch's operational dir), and #17's substantive deliverable — the reusable universal script — ships in bootstrap-assets/. Flagged for reviewer.

Summary by CodeRabbit

  • New Features
    • Hook-based event delivery with per-session authentication, durable recording, and a committed universal POSIX hook script for adapters
    • Adapter rate-limit detection surfaced and persisted with reset times
    • Watchdog: recurring monitoring, idle detection, transcript reconciliation, and automated recovery for stuck workflows
  • Tests
    • Expanded coverage for hooks, hook-store, rate-limits, watchdog, and adapter rate-limit detection
  • Documentation
    • Planning and decisions documenting hooks, auth, watchdog, and rate-limit approach


@coderabbitai

coderabbitai Bot commented May 23, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 8a2e5951-dd50-4b5b-923a-cc7a1d548117

📥 Commits

Reviewing files that changed from the base of the PR and between 47538ca and e7bcf05.

📒 Files selected for processing (2)
  • packages/dispatcher/src/watchdog.ts
  • packages/dispatcher/test/watchdog.test.ts

📝 Walkthrough

Walkthrough

Adds a canonical HOOK_SH and normalized event list, installs hooks for Claude (with detectRateLimit), introduces authenticated SQLite-backed hook persistence (HookStore/DbHookStore), durable per-adapter rate-limit state, a watchdog cron (reconcileTranscriptDrift + runWatchdog), and dispatcher wiring and tests.

Changes

Phase 2: Hooks + watchdog implementation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Core hook infrastructure & event validation | packages/core/src/hook-script.ts, packages/core/src/events.ts, packages/core/src/index.ts | HOOK_SH canonical hook script; NORMALIZED_EVENTS and isNormalizedEvent provide runtime-validated event vocabulary and re-exports. |
| Hook script bootstrap asset & validation | packages/cli/src/bootstrap-assets/hooks/hook.sh, packages/cli/test/bootstrap-hook.test.ts | Committed bootstrap hook.sh matches HOOK_SH; tests verify dispatcher wiring, headers, timeout behavior, no-op-on-error, and executable permission. |
| Claude adapter hook installation & rate-limit detection | packages/adapters/claude/src/hooks.ts, packages/adapters/claude/src/classify.ts, packages/adapters/claude/src/index.ts, packages/adapters/claude/test/adapter.test.ts | CLAUDE_EVENT_MAP and installHooks write the universal HOOK_SH and full .claude/settings.json; detectRateLimit parses Stop transcript tails for usage-limit/reset messages and is exported from the adapter; tests validate installation and detection. |
| Dispatcher package & dispatch wiring | packages/dispatcher/package.json, packages/dispatcher/src/dispatch.ts | Adds @middle/adapter-claude workspace dependency and constructs HookServer with DbHookStore(db) during dispatch. |
| HookServer auth & SQLite persistence | packages/dispatcher/src/hook-server.ts, packages/dispatcher/src/hook-store.ts, packages/dispatcher/test/hook-server.test.ts, packages/dispatcher/test/hook-store.test.ts | HookServer accepts optional HookStore to enable per-session token auth (constant-time compare) and persistence; DbHookStore serializes/truncates payloads, records events, touches heartbeat for tool events, and persists session metadata; tests cover token resolution, recording, truncation, and end-to-end server+store behavior. |
| Workflow record persistence & event helpers | packages/dispatcher/src/workflow-record.ts | DB helpers: findActiveWorkflowBySession, recordEvent, touchHeartbeat, event-inspection helpers, and idempotent armWaitForSignal for wait-signal control. |
| Durable rate-limit state management | packages/dispatcher/src/rate-limits.ts, packages/dispatcher/test/rate-limits.test.ts | Per-adapter rate_limit_state types and helpers: getRateLimitState, setRateLimited, markAvailable, markAvailableOnSuccess, and parseResetAt; tests verify upsert behavior, transitions, and parsing. |
| Workflow cleanup: persist rate-limit/availability signals | packages/dispatcher/src/workflows/implementation.ts, packages/dispatcher/test/implementation-workflow.test.ts | Workflow cleanup records rate_limit_state when classification is rate-limited and marks adapter AVAILABLE on completed runs; tests assert persisted state transitions. |
| Watchdog reconciliation & lifecycle monitoring | packages/dispatcher/src/watchdog.ts, packages/dispatcher/src/watchdog-cron.ts, packages/dispatcher/test/watchdog.test.ts | runWatchdog enforces launch/prompt timeouts, tmux liveness checks and kills, activity freshness (idle events and idle-timeout), and sentinel re-arming; reconcileTranscriptDrift advances last_heartbeat from transcripts; startWatchdog schedules ticks every 30s. Tests validate scenarios and transcript-drift corrections. |
| Dispatcher startup, watchdog lifecycle & shutdown | packages/dispatcher/src/main.ts | Dispatcher constructs HookServer with DbHookStore, starts watchdog cron on startup, captures stop function, and safely handles watchdog teardown during shutdown. |
| Phase 2 planning & decisions documentation | planning/issues/14/plan.md, planning/issues/14/decisions.md | Plan and decisions describe Phase 2 scope, sub-issue breakdown, HookStore/DbHookStore separation, token auth semantics, watchdog responsibilities, and idle-event throttling. |

Sequence Diagram

sequenceDiagram
  participant Agent as Adapter (Claude)
  participant HookScript as HOOK_SH
  participant Dispatcher as HookServer (+ DbHookStore)
  participant DB as SQLite
  participant Watchdog as Watchdog Cron

  Agent->>HookScript: execute hook with normalized event + stdin
  HookScript->>Dispatcher: POST /hooks/:event (X-Middle-Session, X-Middle-Token)
  Dispatcher->>DB: resolve session token (HookStore)
  Dispatcher->>DB: record event (serialize/truncate) and touch heartbeat for tool events
  Dispatcher-->>HookScript: respond (fire-and-forget)
  Note over Watchdog: every 30s
  Watchdog->>DB: load reconcilable workflows
  Watchdog->>DB: check timeouts, tmux liveness, transcript drift, emit idle/failure events, update rate_limit_state
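
The POST step in the diagram above can be sketched as a plain request description. This is illustrative only: the base URL and port in the usage note, the Content-Type header, and the `buildHookRequestSketch` name are assumptions; the `/hooks/:event` path and the two `X-Middle-*` headers come from the diagram.

```typescript
interface HookRequestSketch {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Describes the request the hook script sends for one normalized event.
function buildHookRequestSketch(
  baseUrl: string,
  event: string,
  session: string,
  token: string,
  stdinJson: string,
): HookRequestSketch {
  return {
    // Trim trailing slashes so the path is not doubled.
    url: `${baseUrl.replace(/\/+$/, "")}/hooks/${encodeURIComponent(event)}`,
    method: "POST",
    headers: {
      "Content-Type": "application/json", // assumed; not stated in the PR
      "X-Middle-Session": session,
      "X-Middle-Token": token,
    },
    // The hook's stdin is forwarded as the request body.
    body: stdinJson,
  };
}
```

For example, `buildHookRequestSketch("http://127.0.0.1:8787/", "tool.pre", "sess-1", "tok", "{}")` would target `/hooks/tool.pre` on a hypothetical local port.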

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • thejustinwalsh/middle#73: Shares Claude rate-limit detection work; this PR factors detection into an exported detectRateLimit helper.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main change: implementing Phase 2 hooks and watchdog for the dispatcher, directly addressing issue #14. |
| Linked Issues check | ✅ Passed | The PR implements all coding requirements from issue #14: hook receiver with HMAC auth, universal hook script, event/heartbeat persistence, watchdog reconciler, and rate-limit detection. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to Phase 2 deliverables. No unrelated modifications detected outside the defined hook and watchdog implementation objectives. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

* database, and lets the events/heartbeat persistence (task 15 / #18) evolve
* without re-touching the server.
*/
export interface HookStore {

Owner Author

Decision: persistence sink seam. The HookStore interface is the seam between the transport/auth layer (HookServer) and SQLite. #15 scoped event persistence to #18, but the server still needed token resolution (auth) and the session.started field write — a single injected interface let #15 ship auth + session.started recording and #18 fill in events/heartbeat without re-touching the server. The store is optional on HookServer so the SessionGate mechanics stay unit-testable unauthenticated; the live dispatcher always supplies a DbHookStore.
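
A minimal sketch of how an optional store can gate auth and persistence, assuming a two-method interface. The method names, the `HookCallSketch` shape, and the 200 success status are hypothetical; the 400-before-auth and 401-never-persisted behavior follows the reviewer's brief in this PR.

```typescript
interface HookStoreSketch {
  // Expected token for the session's active workflow, or null if there is none.
  resolveToken(sessionName: string): string | null;
  // Persist one accepted hook.
  record(sessionName: string, event: string, payloadJson: string): void;
}

interface HookCallSketch {
  event: string;
  session: string;
  token: string | null;
  payloadJson: string;
}

// Returns the HTTP status the receiver would answer with.
function handleHookSketch(
  call: HookCallSketch,
  knownEvents: ReadonlySet<string>,
  store?: HookStoreSketch,
): number {
  // Unknown events are rejected before auth or persistence.
  if (!knownEvents.has(call.event)) return 400;
  if (store) {
    const expected = store.resolveToken(call.session);
    // A real server would compare constant-time; plain equality keeps this short.
    if (expected === null || call.token === null || call.token !== expected) {
      return 401;
    }
    store.record(call.session, call.event, call.payloadJson);
  }
  // Without a store the call passes unauthenticated (the unit-test mode).
  return 200;
}
```

Keeping the store optional means gate mechanics can be exercised with no database at all, while a composition root that always passes a store gets auth and persistence together.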

* fixed-length UUIDs), but the per-byte comparison is not short-circuited, so a
* caller can't time their way to the correct token.
*/
function tokensMatch(provided: string, expected: string): boolean {

Owner Author

Decision: "HMAC token" = per-session bearer token, timing-safe compared. hook.sh forwards the token as a header, not a signature over the body, so this is bearer-token auth, not a true HMAC. The server resolves the expected session_token from the workflow row (persisted before the session launches, so it is always populated before any hook fires) and compares constant-time to avoid a token-guessing oracle on the localhost endpoint.
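
One way to get a constant-time comparison in a Bun or Node runtime is `timingSafeEqual` from `node:crypto`. The sketch below is not the PR's `tokensMatch` (which, per its doc comment, uses a non-short-circuited per-byte comparison); it swaps in a hash-then-compare approach so the two buffers always have equal length, which `timingSafeEqual` requires. The function name is hypothetical.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash both sides to fixed 32-byte digests, then compare without early exit.
function tokensMatchSketch(provided: string, expected: string): boolean {
  const a = createHash("sha256").update(provided, "utf8").digest();
  const b = createHash("sha256").update(expected, "utf8").digest();
  return timingSafeEqual(a, b);
}
```

Hashing first also avoids leaking the expected token's length through a length check, at the cost of two digest computations per request.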

}
}

function failWorkflow(deps: WatchdogDeps, id: string, reason: string, now: number): void {

Owner Author

Decision: watchdog reconciles durable state, not the live bunqueue execution. It marks the row failed, records watchdog.failed with the reason, and kills the session — it does not cancel the execution blocked in awaitStop. Cancelling a running execution from outside the engine (so compensation unwinds) is the waitFor/cancellation integration scoped to Phase 5; triggerCompensation is the seam where that plugs in. The DB is the source of truth the durable engine reconciles against, so failing the row + killing the session is the complete Phase 2 action.

}
if (age >= idleThreshold && latestEventType(deps.db, row.id) !== IDLE_EVENT) {
// Mark idle once per idle period (not every 30s tick).
recordEvent(deps.db, { workflowId: row.id, ts: now, type: IDLE_EVENT, payloadJson: null });

Owner Author

Decision: idle is marked once per idle period, not every tick. Without the latestEventType !== IDLE_EVENT guard a genuinely-idle agent would accrue an idle event every 30s (120/hour) until the kill threshold, flooding the events table and the dashboard timeline. One marker per idle period is the signal the dashboard needs; the kill threshold handles escalation.
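
The once-per-period guard can be checked with a small simulation. This is a sketch under stated assumptions: the 5-minute idle and 15-minute kill thresholds come from the reviewer's brief, while the function names and the choice to test the kill threshold before the idle marker are hypothetical.

```typescript
const IDLE_EVENT_SKETCH = "watchdog.idle";

type IdleActionSketch = "none" | "mark-idle" | "kill";

// Decide what one watchdog tick does for a workflow with the given staleness.
function idleActionSketch(
  ageMs: number,
  latestEventType: string | null,
  idleThresholdMs: number,
  killThresholdMs: number,
): IdleActionSketch {
  if (ageMs >= killThresholdMs) return "kill";
  // The guard: skip the marker when the latest event is already an idle marker.
  if (ageMs >= idleThresholdMs && latestEventType !== IDLE_EVENT_SKETCH) {
    return "mark-idle";
  }
  return "none";
}

// Simulate ticks for an agent that never produces activity; count idle markers
// written before the kill threshold is reached.
function countIdleMarkersSketch(
  tickMs: number,
  idleThresholdMs: number,
  killThresholdMs: number,
): number {
  let latest: string | null = null;
  let markers = 0;
  for (let age = 0; ; age += tickMs) {
    const action = idleActionSketch(age, latest, idleThresholdMs, killThresholdMs);
    if (action === "kill") return markers;
    if (action === "mark-idle") {
      markers++;
      latest = IDLE_EVENT_SKETCH;
    }
  }
}
```

With 30s ticks, a 5-minute idle threshold, and a 15-minute kill threshold, `countIdleMarkersSketch(30_000, 300_000, 900_000)` yields a single marker instead of one per tick.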

* the adapter was rate-limited, so a normal completion never thrashes the row.
* Returns whether it flipped.
*/
export function markAvailableOnSuccess(

Owner Author

Decision: probe-via-real-work revert. A dispatch that actually completed proves the adapter is serving again, so markAvailableOnSuccess flips RATE_LIMITED → AVAILABLE. It is a no-op unless the adapter was rate-limited, so a normal completion never thrashes the row. Per the spec there is no proactive probing; the revert is reactive, gated on real work succeeding.
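
A minimal in-memory sketch of the flip-only-if-rate-limited rule. The real module reads and writes the rate_limit_state table; the `Map`, the class, and the row shape here are stand-ins and not the PR's API.

```typescript
type AdapterStatusSketch = "AVAILABLE" | "RATE_LIMITED";

interface RateLimitRowSketch {
  status: AdapterStatusSketch;
  resetAt: number | null; // epoch milliseconds, or null when unknown
}

class RateLimitStateSketch {
  private rows = new Map<string, RateLimitRowSketch>();

  setRateLimited(adapter: string, resetAt: number | null): void {
    this.rows.set(adapter, { status: "RATE_LIMITED", resetAt });
  }

  get(adapter: string): RateLimitRowSketch | undefined {
    return this.rows.get(adapter);
  }

  // Flip RATE_LIMITED to AVAILABLE after a completed dispatch.
  // Returns whether it flipped; a no-op for absent or already-available rows.
  markAvailableOnSuccess(adapter: string): boolean {
    const row = this.rows.get(adapter);
    if (!row || row.status !== "RATE_LIMITED") return false;
    this.rows.set(adapter, { status: "AVAILABLE", resetAt: null });
    return true;
  }
}
```

Returning the boolean lets a caller log or emit an event only when a real transition happened, so ordinary completions stay silent.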

@thejustinwalsh thejustinwalsh marked this pull request as ready for review May 23, 2026 17:51
@thejustinwalsh thejustinwalsh added the ready-for-review All phases done and verified — PR ready for final human review and merge label May 23, 2026
@thejustinwalsh

Owner Author

Reviewer's brief — Epic #14 (Phase 2: hooks + watchdog) → PR #75

What this delivers: the full observability layer, namely the token-authenticated hook receiver (the spec's "HMAC token", implemented as a per-session bearer token), universal hook.sh, the complete Claude hook-event set, event/heartbeat persistence into SQLite, the 30s watchdog reconciler, and reactive rate-limit detection. All six sub-issues (#15-#20) land on one branch / one PR.

How to run it

bun install
bun run typecheck      # clean
bun test               # 173 pass / 0 fail across 22 files
sh -n packages/cli/src/bootstrap-assets/hooks/hook.sh   # POSIX-clean

Targeted suites: bun test packages/dispatcher/test/hook-server.test.ts packages/dispatcher/test/hook-store.test.ts packages/dispatcher/test/watchdog.test.ts packages/dispatcher/test/rate-limits.test.ts packages/adapters/claude/test/adapter.test.ts.

What to verify (and what "correct" looks like)

  1. Auth (hook-server.ts): a hook with a wrong/absent X-Middle-Token, or for a session with no active workflow, is 401 and never persisted; an unknown :event is 400 before auth/persistence. The token is the per-session session_token resolved from the workflow row (written before launch in launchAndDrive), compared timing-safe.
  2. Persistence (hook-store.ts): one events row per hook (payload clipped at 16KB), last_heartbeat advances on tool.pre/tool.post only, session.started writes session_id+transcript_path. The "live HookServer → DbHookStore → SQLite" test exercises the whole wired path.
  3. Watchdog (watchdog.ts): launch-timeout → stuck-launching; running-but-no-turn.started-in-window → prompt-not-accepted; dead tmux session → failed + triggerCompensation; idle ≥5m → one watchdog.idle event (not per-tick), ≥15m → kill + idle-timeout; skipped entirely while controlled_by='human'; staleness cross-checked against the transcript. Cadence is 30s (WATCHDOG_INTERVAL_MS).
  4. Rate limit (rate-limits.ts, classify.ts): a Resets at <ISO>. transcript tail → classifyStop rate-limited → rate_limit_state RATE_LIMITED with parsed reset_at; a later completed dispatch flips RATE_LIMITED → AVAILABLE (probe-via-real-work).
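
Item 4's `Resets at <ISO>.` parsing can be sketched as follows. This is an illustration and not the PR's parseResetAt: the regular expression, the epoch-milliseconds return type, and the function name are assumptions.

```typescript
// Extract the reset timestamp from a transcript tail, or null if absent/invalid.
function parseResetAtSketch(transcriptTail: string): number | null {
  // ISO 8601 date-time with a Z or +hh:mm offset; the sentence's trailing
  // period is left outside the capture group.
  const match = /Resets at (\d{4}-\d{2}-\d{2}T[0-9:.]+(?:Z|[+-]\d{2}:\d{2}))/.exec(
    transcriptTail,
  );
  if (!match) return null;
  const ms = Date.parse(match[1] ?? "");
  return Number.isNaN(ms) ? null : ms;
}
```

Returning null on a missing or malformed timestamp lets the caller still record the rate-limited status without inventing a reset time.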

How to review

Read planning/issues/14/decisions.md (also distilled into inline review comments) for the load-bearing calls: the optional-store sink seam, bearer-token-not-HMAC, watchdog-reconciles-DB-not-execution, idle-once-per-period, probe-via-real-work. The reconcile logic is pure/injectable — the cron wrapper and main.ts wiring are thin glue.

Fragile / needs extra eyes

  • bunqueue cron + Engine coexistence in main.ts: the watchdog uses a Bunqueue (queue+worker) .every(30s) alongside the workflow Engine; main.test.ts confirms the process boots and SIGTERM-exits cleanly with both. Worth a sanity read of the shutdown ordering (stopWatchdog → hookServer.stop → engine.close → db.close).
  • @middle/adapter-claude is now a @middle/dispatcher dependency — only at the main.ts composition root, so the watchdog can read transcripts; dispatch.ts/workflow stay adapter-agnostic via injection.

Scope boundaries (already-planned phases — deliberately not done here)

  • Cross-run delayed re-enqueue of a rate-limited adapter → Phase 8 (auto-dispatch). Phase 2 writes the durable rate_limit_state.
  • Execution cancellation so bunqueue compensation unwinds a blocked awaitStop → Phase 5 (waitFor). Phase 2 provides the triggerCompensation seam + durable failure.
  • Committed .middle/hooks/hook.sh dogfood copy → Phase 3 task 22 (mm init middle into itself). The dispatch hard-rule forbids committing under .middle/; #17's reusable deliverable ships in bootstrap-assets/. This is the one item to confirm you're comfortable deferring.

@thejustinwalsh thejustinwalsh mentioned this pull request May 23, 2026
4 tasks

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/dispatcher/src/watchdog.ts`:
- Around line 141-149: Wrap each per-workflow tmux interaction so that
exceptions from deps.tmux.status() and deps.tmux.killSession() are caught and
handled without aborting runWatchdog; specifically, around the blocks that call
deps.tmux.status(row.session_name) and deps.tmux.killSession(row.session_name)
(and the similar calls at the other locations) add a try/catch that on error
calls failWorkflow(deps, row.id, `tmux error: ${err.message || err}`, now) (or
equivalent message), increments acted if appropriate, and continues to the next
row so reconciliation proceeds for other workflows. Ensure you catch errors for
both status() and killSession() calls individually so one failing call does not
prevent persisting the failure decision for that workflow.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: bb84d30f-c22e-4b83-9e3f-2f8537758b2a

📥 Commits

Reviewing files that changed from the base of the PR and between 7a20cf3 and 326fb3d.

⛔ Files ignored due to path filters (1)
  • bun.lock is excluded by !**/*.lock
📒 Files selected for processing (26)
  • packages/adapters/claude/src/classify.ts
  • packages/adapters/claude/src/hooks.ts
  • packages/adapters/claude/src/index.ts
  • packages/adapters/claude/test/adapter.test.ts
  • packages/cli/src/bootstrap-assets/hooks/hook.sh
  • packages/cli/test/bootstrap-hook.test.ts
  • packages/core/src/events.ts
  • packages/core/src/hook-script.ts
  • packages/core/src/index.ts
  • packages/dispatcher/package.json
  • packages/dispatcher/src/dispatch.ts
  • packages/dispatcher/src/hook-server.ts
  • packages/dispatcher/src/hook-store.ts
  • packages/dispatcher/src/main.ts
  • packages/dispatcher/src/rate-limits.ts
  • packages/dispatcher/src/watchdog-cron.ts
  • packages/dispatcher/src/watchdog.ts
  • packages/dispatcher/src/workflow-record.ts
  • packages/dispatcher/src/workflows/implementation.ts
  • packages/dispatcher/test/hook-server.test.ts
  • packages/dispatcher/test/hook-store.test.ts
  • packages/dispatcher/test/implementation-workflow.test.ts
  • packages/dispatcher/test/rate-limits.test.ts
  • packages/dispatcher/test/watchdog.test.ts
  • planning/issues/14/decisions.md
  • planning/issues/14/plan.md

Comment thread packages/dispatcher/src/watchdog.ts Outdated
… + persistence sink

Expand the Phase 1 minimal receiver to the whole normalized event taxonomy:
validate :event against NORMALIZED_EVENTS, authenticate every request against
the per-session token (X-Middle-Token, timing-safe), and hand bodies to an
injected HookStore. session.started records session_id+transcript_path and
still signals the SessionGate. DbHookStore correlates by session name and
drops unmatchable sessions. Wired into dispatchEpic and main.

Closes #15.

Replace the SessionStart+Stop-only settings with the whole taxonomy
(SessionStart, UserPromptSubmit, PreToolUse, PostToolUse, Notification, Stop,
SubagentStop, SessionEnd), each mapped to its normalized event and invoking the
absolute, quoted hook.sh path. The script body is now single-sourced from a
canonical HOOK_SH constant in @middle/core.

Closes #16.

packages/cli/src/bootstrap-assets/hooks/hook.sh is what mm init stamps into a
target repo. It is generated byte-identically from the canonical HOOK_SH
constant and guarded by a drift test; POSIX sh, executable, 3s curl ceiling,
failure is a no-op (exit 0).

Closes #17.

…nto SQLite

Cover event-row persistence, last_heartbeat advancing only on tool.pre/tool.post,
session.started writing session_id+transcript_path, unmatchable-session drop,
16KB payload truncation, terminal-session exclusion, and a live HookServer →
DbHookStore → SQLite path. The persistence implementation shipped alongside the
sink seam in e5f504e; this is its dedicated coverage.

Closes #18.

A 30s bunqueue cron reconciles every launching/running workflow: launch-timeout
(stuck-launching), tmux liveness (dead session → failed + compensation hook),
activity freshness (idle marker / idle-timeout kill, skipped while
controlled_by='human', cross-checked against transcript staleness), and
blocked.json sentinel re-arm. A companion pass corrects heartbeat drift from the
source-of-truth transcript. The reconcile logic is pure and unit-tested; the
cron wrapper is thin scheduling glue wired into the long-running dispatcher.

Closes #19.

Add ClaudeAdapter.detectRateLimit (Stop-hook usage-limit detector) and a
rate_limit_state read/write module. On a rate-limited Stop classification the
implementation workflow persists { status: RATE_LIMITED, reset_at, source:
'transcript' }; a completed dispatch reverts a previously RATE_LIMITED adapter
to AVAILABLE (probe-via-real-work). The cross-run delayed re-enqueue that reads
reset_at is the Phase 8 auto-dispatch loop's job.

Closes #20.

Complete the launch-timeout criterion's second reason: a running workflow that
went ready (session.started recorded) but whose driven prompt never landed (no
turn.started within the launch window) is failed 'prompt-not-accepted' and its
session killed. turn.started (UserPromptSubmit) is the prompt-landed signal;
its absence past the window is the failure. Measured from the session.started
event ts, not updated_at (which heartbeats bump).

Refines #19.
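
The prompt-not-accepted rule from the commit message above can be sketched as a pure function over a workflow's event rows. The row shape, the function name, and the strict `>` comparison are assumptions; the launch window is a parameter because its value is not stated in this PR.

```typescript
interface EventRowSketch {
  type: string;
  ts: number; // epoch milliseconds
}

// True when the session went ready but no prompt landed within the window.
function promptNotAcceptedSketch(
  events: EventRowSketch[],
  now: number,
  launchWindowMs: number,
): boolean {
  let startedTs: number | null = null;
  for (const e of events) {
    // turn.started is the prompt-landed signal: nothing to fail.
    if (e.type === "turn.started") return false;
    if (e.type === "session.started" && startedTs === null) startedTs = e.ts;
  }
  // Never went ready: that case belongs to the stuck-launching rule instead.
  if (startedTs === null) return false;
  // Measured from the session.started event ts, not a heartbeat-bumped column.
  return now - startedTs > launchWindowMs;
}
```

Anchoring the window to the event timestamp keeps later heartbeat writes from resetting the clock on a prompt that never landed.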

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
packages/dispatcher/src/watchdog.ts (1)

141-150: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard tmux calls to prevent reconciliation abort on single session failure.

A thrown error from deps.tmux.status() will abort runWatchdog mid-loop, leaving remaining workflows unreconciled. The same applies to the killSession calls at lines 164 and 179. Wrap each per-workflow tmux interaction in try/catch to ensure one dead/misbehaving session doesn't block reconciliation of the others.

Suggested fix
     // 2. tmux liveness — a dead session under a 'running' workflow.
     if (row.session_name) {
-      const status = await deps.tmux.status(row.session_name);
-      if (!status.alive) {
-        await deps.tmux.killSession(row.session_name);
-        failWorkflow(deps, row.id, "tmux session disappeared", now);
-        acted++;
+      let status: { alive: boolean; paneCount: number };
+      try {
+        status = await deps.tmux.status(row.session_name);
+      } catch (error) {
+        console.error(`[watchdog] tmux.status failed for ${row.session_name}: ${(error as Error).message}`);
+        failWorkflow(deps, row.id, "tmux status check failed", now);
+        acted++;
         continue;
       }
+      if (!status.alive) {
+        try {
+          await deps.tmux.killSession(row.session_name);
+        } catch (error) {
+          console.error(`[watchdog] killSession failed for ${row.session_name}: ${(error as Error).message}`);
+        }
+        failWorkflow(deps, row.id, "tmux session disappeared", now);
+        acted++;
+        continue;
+      }
     }

Apply similar try/catch guards to the killSession calls at lines 164 and 179.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/dispatcher/src/watchdog.ts` around lines 141 - 150, Wrap each
per-workflow tmux interaction (calls to deps.tmux.status and
deps.tmux.killSession) inside a try/catch within runWatchdog so that exceptions
for one row do not abort the loop; for each row where row.session_name is
truthy, call deps.tmux.status(row.session_name) inside try, handle non-alive by
calling deps.tmux.killSession(row.session_name) also in try/catch, call
failWorkflow(deps, row.id, "tmux session disappeared", now) and increment acted
only when the workflow is handled, and on caught errors log or record the error
and continue to the next row (do the same guard around the other killSession
usages that reference deps.tmux.killSession so a thrown error won't stop
reconciliation).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@packages/dispatcher/src/watchdog.ts`:
- Around line 141-150: Wrap each per-workflow tmux interaction (calls to
deps.tmux.status and deps.tmux.killSession) inside a try/catch within
runWatchdog so that exceptions for one row do not abort the loop; for each row
where row.session_name is truthy, call deps.tmux.status(row.session_name) inside
try, handle non-alive by calling deps.tmux.killSession(row.session_name) also in
try/catch, call failWorkflow(deps, row.id, "tmux session disappeared", now) and
increment acted only when the workflow is handled, and on caught errors log or
record the error and continue to the next row (do the same guard around the
other killSession usages that reference deps.tmux.killSession so a thrown error
won't stop reconciliation).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: aee7d20c-8be9-4741-9cf7-51a97a590489

📥 Commits

Reviewing files that changed from the base of the PR and between 326fb3d and d155f29.

⛔ Files ignored due to path filters (1)
  • bun.lock is excluded by !**/*.lock
📒 Files selected for processing (26)
  • packages/adapters/claude/src/classify.ts
  • packages/adapters/claude/src/hooks.ts
  • packages/adapters/claude/src/index.ts
  • packages/adapters/claude/test/adapter.test.ts
  • packages/cli/src/bootstrap-assets/hooks/hook.sh
  • packages/cli/test/bootstrap-hook.test.ts
  • packages/core/src/events.ts
  • packages/core/src/hook-script.ts
  • packages/core/src/index.ts
  • packages/dispatcher/package.json
  • packages/dispatcher/src/dispatch.ts
  • packages/dispatcher/src/hook-server.ts
  • packages/dispatcher/src/hook-store.ts
  • packages/dispatcher/src/main.ts
  • packages/dispatcher/src/rate-limits.ts
  • packages/dispatcher/src/watchdog-cron.ts
  • packages/dispatcher/src/watchdog.ts
  • packages/dispatcher/src/workflow-record.ts
  • packages/dispatcher/src/workflows/implementation.ts
  • packages/dispatcher/test/hook-server.test.ts
  • packages/dispatcher/test/hook-store.test.ts
  • packages/dispatcher/test/implementation-workflow.test.ts
  • packages/dispatcher/test/rate-limits.test.ts
  • packages/dispatcher/test/watchdog.test.ts
  • planning/issues/14/decisions.md
  • planning/issues/14/plan.md
✅ Files skipped from review due to trivial changes (2)
  • planning/issues/14/plan.md
  • planning/issues/14/decisions.md

…on pass

A tmux status()/killSession() error previously bubbled out of runWatchdog,
aborting the whole pass and leaving the remaining launching/running rows
unreconciled. Guard each tmux interaction per-workflow:

- status() errors are inconclusive (not a confirmed-dead signal), so the row
  is skipped this pass rather than failed — the 30s cron retries, and the
  activity-freshness rule is the backstop for a genuinely stuck agent.
- killSession() runs only after a failure decision is already made, so a kill
  error is swallowed+logged via safeKillSession; the failure is still recorded.
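
The per-workflow guard described in this commit message could look roughly like the sketch below. It is only an illustration under stated assumptions: the `Tmux`, `Row`, and `Log` shapes, the `probeStatus` and `livenessPass` helpers, and the synchronous call style are hypothetical. Only the names `status()`, `killSession()`, `safeKillSession`, and the "tmux session disappeared" reason come from the PR text, and the real signatures in `watchdog.ts` may differ.

```typescript
// Hypothetical shapes for illustration; the real interfaces in watchdog.ts may differ
// (for example, the real tmux calls may be async).
interface TmuxStatus { alive: boolean }
interface Tmux {
  status(sessionName: string): TmuxStatus;
  killSession(sessionName: string): void;
}
interface Row { id: number; session_name: string | null }
type Log = (message: string) => void;

// Best-effort kill: a tmux error is logged and swallowed so it cannot abort the pass.
function safeKillSession(tmux: Tmux, sessionName: string, log: Log): void {
  try {
    tmux.killSession(sessionName);
  } catch (err) {
    log(`killSession(${sessionName}) failed: ${String(err)}`);
  }
}

// Returns null when status() throws, i.e. liveness is unknown for this pass.
function probeStatus(tmux: Tmux, sessionName: string, log: Log): TmuxStatus | null {
  try {
    return tmux.status(sessionName);
  } catch (err) {
    log(`status(${sessionName}) failed: ${String(err)}`);
    return null;
  }
}

// One liveness pass over the rows; returns how many rows were failed.
function livenessPass(
  rows: Row[],
  tmux: Tmux,
  fail: (id: number, reason: string) => void,
  log: Log,
): number {
  let acted = 0;
  for (const row of rows) {
    if (!row.session_name) continue;
    const status = probeStatus(tmux, row.session_name, log);
    if (status === null) continue; // inconclusive: skip, the next cron tick retries
    if (!status.alive) {
      fail(row.id, "tmux session disappeared"); // record the failure first
      safeKillSession(tmux, row.session_name, log); // then clean up, best effort
      acted++;
    }
  }
  return acted;
}
```

In this sketch a throwing `status()` affects only its own row, and a throwing `killSession()` cannot undo a failure that has already been recorded.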
@thejustinwalsh

Owner Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/dispatcher/src/watchdog.ts`:
- Around line 154-167: The tmux liveness catch block currently does a "continue"
which aborts all remaining durable checks for that row; instead remove the
continue so the code proceeds to prompt-not-accepted, idle/idle-timeout,
transcript checks and sentinel re-arm, but still log the tmux error; to keep
downstream logic safe, in the catch set a fallback status (e.g., status = {
alive: false, paneCount: 0 }) after logging the error so later checks see a
defined status and reconciliation continues for the workflow referenced by
row.session_name and deps.tmux.status().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 56b8215b-a2a2-4d42-8973-0f09d61bc2a6

📥 Commits

Reviewing files that changed from the base of the PR and between d155f29 and 47538ca.

📒 Files selected for processing (2)
  • packages/dispatcher/src/watchdog.ts
  • packages/dispatcher/test/watchdog.test.ts

Comment thread packages/dispatcher/src/watchdog.ts Outdated
…hecks

The previous guard used `continue` on a status() error, which skipped not just
the liveness check but also prompt-not-accepted, idle/idle-timeout (rule 3) and
sentinel re-arm for that row. That defeated the stated backstop: a row whose
status() kept erroring would never idle-time-out and could stay 'running'
forever.

Leave status null on error and guard the liveness branch (`if (status &&
!status.alive)`) so the row falls through to the wall-clock freshness check.
Add a regression test asserting a stale row still idle-times-out when status()
persistently throws.
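
The revised control flow can be sketched as follows. This is a minimal illustration, not the code in `watchdog.ts`: the `Row` fields, the `Verdict` type, the `checkRow` helper, and the `idleTimeoutMs` parameter are hypothetical names introduced here. Only the null-on-error status and the `status && !status.alive` guard are taken from the commit message.

```typescript
// Hypothetical shapes for illustration; field names are assumptions.
interface TmuxStatus { alive: boolean }
interface Row { id: number; session_name: string; last_activity_ms: number }
type Verdict = "dead" | "idle-timeout" | "ok";

function checkRow(
  row: Row,
  getStatus: (sessionName: string) => TmuxStatus,
  now: number,
  idleTimeoutMs: number,
): Verdict {
  // Leave status null when the probe throws: liveness is unknown, not "dead".
  let status: TmuxStatus | null = null;
  try {
    status = getStatus(row.session_name);
  } catch {
    // Inconclusive. No early exit here, so the durable checks below still run.
  }

  // Only a confirmed-dead signal fails the row on liveness grounds.
  if (status && !status.alive) return "dead";

  // Wall-clock freshness backstop: applies even when status() keeps erroring.
  if (now - row.last_activity_ms > idleTimeoutMs) return "idle-timeout";

  return "ok";
}
```

With this shape, a row whose `status()` persistently throws still reaches the freshness check, which is the behavior the regression test in the commit asserts.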
@thejustinwalsh

Owner Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


Labels

ready-for-review All phases done and verified — PR ready for final human review and merge


Development

Successfully merging this pull request may close these issues.

Hooks and watchdog

1 participant