feat(dispatcher): hooks + watchdog — Phase 2 (Epic #14)#75
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds a canonical HOOK_SH and normalized event list, installs hooks for Claude (with detectRateLimit), introduces authenticated SQLite-backed hook persistence (HookStore/DbHookStore), durable per-adapter rate-limit state, a watchdog cron (reconcileTranscriptDrift + runWatchdog), and dispatcher wiring and tests. ChangesPhase 2: Hooks + watchdog implementation
Sequence DiagramsequenceDiagram
participant Agent as Adapter (Claude)
participant HookScript as HOOK_SH
participant Dispatcher as HookServer (+ DbHookStore)
participant DB as SQLite
participant Watchdog as Watchdog Cron
Agent->>HookScript: execute hook with normalized event + stdin
HookScript->>Dispatcher: POST /hooks/:event (X-Middle-Session, X-Middle-Token)
Dispatcher->>DB: resolve session token (HookStore)
Dispatcher->>DB: record event (serialize/truncate) and touch heartbeat for tool events
Dispatcher-->>HookScript: respond (fire-and-forget)
Note over Watchdog: every 30s
Watchdog->>DB: load reconcilable workflows
Watchdog->>DB: check timeouts, tmux liveness, transcript drift, emit idle/failure events, update rate_limit_state
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
| * database, and lets the events/heartbeat persistence (task 15 / #18) evolve | ||
| * without re-touching the server. | ||
| */ | ||
| export interface HookStore { |
There was a problem hiding this comment.
Decision: persistence sink seam. The HookStore interface is the seam between the transport/auth layer (HookServer) and SQLite. #15 scoped event persistence to #18, but the server still needed token resolution (auth) and the session.started field write — a single injected interface let #15 ship auth + session.started recording and #18 fill in events/heartbeat without re-touching the server. The store is optional on HookServer so the SessionGate mechanics stay unit-testable unauthenticated; the live dispatcher always supplies a DbHookStore.
| * fixed-length UUIDs), but the per-byte comparison is not short-circuited, so a | ||
| * caller can't time their way to the correct token. | ||
| */ | ||
| function tokensMatch(provided: string, expected: string): boolean { |
There was a problem hiding this comment.
Decision: "HMAC token" = per-session bearer token, timing-safe compared. hook.sh forwards the token as a header, not a signature over the body, so this is bearer-token auth, not a true HMAC. The server resolves the expected session_token from the workflow row (persisted before the session launches, so it is always populated before any hook fires) and compares constant-time to avoid a token-guessing oracle on the localhost endpoint.
| } | ||
| } | ||
|
|
||
| function failWorkflow(deps: WatchdogDeps, id: string, reason: string, now: number): void { |
There was a problem hiding this comment.
Decision: watchdog reconciles durable state, not the live bunqueue execution. It marks the row failed, records watchdog.failed with the reason, and kills the session — it does not cancel the execution blocked in awaitStop. Cancelling a running execution from outside the engine (so compensation unwinds) is the waitFor/cancellation integration scoped to Phase 5; triggerCompensation is the seam where that plugs in. The DB is the source of truth the durable engine reconciles against, so failing the row + killing the session is the complete Phase 2 action.
| } | ||
| if (age >= idleThreshold && latestEventType(deps.db, row.id) !== IDLE_EVENT) { | ||
| // Mark idle once per idle period (not every 30s tick). | ||
| recordEvent(deps.db, { workflowId: row.id, ts: now, type: IDLE_EVENT, payloadJson: null }); |
There was a problem hiding this comment.
Decision: idle is marked once per idle period, not every tick. Without the latestEventType !== IDLE_EVENT guard a genuinely-idle agent would accrue an idle event every 30s (120/hour) until the kill threshold, flooding the events table and the dashboard timeline. One marker per idle period is the signal the dashboard needs; the kill threshold handles escalation.
| * the adapter was rate-limited, so a normal completion never thrashes the row. | ||
| * Returns whether it flipped. | ||
| */ | ||
| export function markAvailableOnSuccess( |
There was a problem hiding this comment.
Decision: probe-via-real-work revert. A dispatch that actually completed proves the adapter is serving again, so markAvailableOnSuccess flips RATE_LIMITED → AVAILABLE. It is a no-op unless the adapter was rate-limited, so a normal completion never thrashes the row. Per the spec there is no proactive probing — the revert is reactive, gated on real work succeeding.
Reviewer's brief — Epic #14 (Phase 2: hooks + watchdog) → PR #75What this delivers: the full observability layer — HMAC-validated hook receiver, universal How to run itbun install
bun run typecheck # clean
bun test # 173 pass / 0 fail across 22 files
sh -n packages/cli/src/bootstrap-assets/hooks/hook.sh # POSIX-cleanTargeted suites: What to verify (and what "correct" looks like)
How to reviewRead Fragile / needs extra eyes
Scope boundaries (already-planned phases — deliberately not done here)
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/dispatcher/src/watchdog.ts`:
- Around line 141-149: Wrap each per-workflow tmux interaction so that
exceptions from deps.tmux.status() and deps.tmux.killSession() are caught and
handled without aborting runWatchdog; specifically, around the blocks that call
deps.tmux.status(row.session_name) and deps.tmux.killSession(row.session_name)
(and the similar calls at the other locations) add a try/catch that on error
calls failWorkflow(deps, row.id, `tmux error: ${err.message || err}`, now) (or
equivalent message), increments acted if appropriate, and continues to the next
row so reconciliation proceeds for other workflows. Ensure you catch errors for
both status() and killSession() calls individually so one failing call does not
prevent persisting the failure decision for that workflow.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: bb84d30f-c22e-4b83-9e3f-2f8537758b2a
⛔ Files ignored due to path filters (1)
bun.lockis excluded by!**/*.lock
📒 Files selected for processing (26)
packages/adapters/claude/src/classify.tspackages/adapters/claude/src/hooks.tspackages/adapters/claude/src/index.tspackages/adapters/claude/test/adapter.test.tspackages/cli/src/bootstrap-assets/hooks/hook.shpackages/cli/test/bootstrap-hook.test.tspackages/core/src/events.tspackages/core/src/hook-script.tspackages/core/src/index.tspackages/dispatcher/package.jsonpackages/dispatcher/src/dispatch.tspackages/dispatcher/src/hook-server.tspackages/dispatcher/src/hook-store.tspackages/dispatcher/src/main.tspackages/dispatcher/src/rate-limits.tspackages/dispatcher/src/watchdog-cron.tspackages/dispatcher/src/watchdog.tspackages/dispatcher/src/workflow-record.tspackages/dispatcher/src/workflows/implementation.tspackages/dispatcher/test/hook-server.test.tspackages/dispatcher/test/hook-store.test.tspackages/dispatcher/test/implementation-workflow.test.tspackages/dispatcher/test/rate-limits.test.tspackages/dispatcher/test/watchdog.test.tsplanning/issues/14/decisions.mdplanning/issues/14/plan.md
… + persistence sink Expand the Phase 1 minimal receiver to the whole normalized event taxonomy: validate :event against NORMALIZED_EVENTS, authenticate every request against the per-session token (X-Middle-Token, timing-safe), and hand bodies to an injected HookStore. session.started records session_id+transcript_path and still signals the SessionGate. DbHookStore correlates by session name and drops unmatchable sessions. Wired into dispatchEpic and main. Closes #15.
Replace the SessionStart+Stop-only settings with the whole taxonomy (SessionStart, UserPromptSubmit, PreToolUse, PostToolUse, Notification, Stop, SubagentStop, SessionEnd), each mapped to its normalized event and invoking the absolute, quoted hook.sh path. The script body is now single-sourced from a canonical HOOK_SH constant in @middle/core. Closes #16.
packages/cli/src/bootstrap-assets/hooks/hook.sh is what mm init stamps into a target repo. It is generated byte-identically from the canonical HOOK_SH constant and guarded by a drift test; POSIX sh, executable, 3s curl ceiling, failure is a no-op (exit 0). Closes #17.
…nto SQLite Cover event-row persistence, last_heartbeat advancing only on tool.pre/tool.post, session.started writing session_id+transcript_path, unmatchable-session drop, 16KB payload truncation, terminal-session exclusion, and a live HookServer → DbHookStore → SQLite path. The persistence implementation shipped alongside the sink seam in e5f504e; this is its dedicated coverage. Closes #18.
A 30s bunqueue cron reconciles every launching/running workflow: launch-timeout (stuck-launching), tmux liveness (dead session → failed + compensation hook), activity freshness (idle marker / idle-timeout kill, skipped while controlled_by='human', cross-checked against transcript staleness), and blocked.json sentinel re-arm. A companion pass corrects heartbeat drift from the source-of-truth transcript. The reconcile logic is pure and unit-tested; the cron wrapper is thin scheduling glue wired into the long-running dispatcher. Closes #19.
Add ClaudeAdapter.detectRateLimit (Stop-hook usage-limit detector) and a
rate_limit_state read/write module. On a rate-limited Stop classification the
implementation workflow persists { status: RATE_LIMITED, reset_at, source:
'transcript' }; a completed dispatch reverts a previously RATE_LIMITED adapter
to AVAILABLE (probe-via-real-work). The cross-run delayed re-enqueue that reads
reset_at is the Phase 8 auto-dispatch loop's job.
Closes #20.
Complete the launch-timeout criterion's second reason: a running workflow that went ready (session.started recorded) but whose driven prompt never landed (no turn.started within the launch window) is failed 'prompt-not-accepted' and its session killed. turn.started (UserPromptSubmit) is the prompt-landed signal; its absence past the window is the failure. Measured from the session.started event ts, not updated_at (which heartbeats bump). Refines #19.
326fb3d to
d155f29
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
packages/dispatcher/src/watchdog.ts (1)
141-150:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard tmux calls to prevent reconciliation abort on single session failure.
A thrown error from
deps.tmux.status()will abortrunWatchdogmid-loop, leaving remaining workflows unreconciled. The same applies to thekillSessioncalls at lines 164 and 179. Wrap each per-workflow tmux interaction in try/catch to ensure one dead/misbehaving session doesn't block reconciliation of the others.Suggested fix
// 2. tmux liveness — a dead session under a 'running' workflow. if (row.session_name) { - const status = await deps.tmux.status(row.session_name); - if (!status.alive) { - await deps.tmux.killSession(row.session_name); - failWorkflow(deps, row.id, "tmux session disappeared", now); - acted++; + let status: { alive: boolean; paneCount: number }; + try { + status = await deps.tmux.status(row.session_name); + } catch (error) { + console.error(`[watchdog] tmux.status failed for ${row.session_name}: ${(error as Error).message}`); + failWorkflow(deps, row.id, "tmux status check failed", now); + acted++; continue; } + if (!status.alive) { + try { + await deps.tmux.killSession(row.session_name); + } catch (error) { + console.error(`[watchdog] killSession failed for ${row.session_name}: ${(error as Error).message}`); + } + failWorkflow(deps, row.id, "tmux session disappeared", now); + acted++; + continue; + } }Apply similar try/catch guards to the
killSessioncalls at lines 164 and 179.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/dispatcher/src/watchdog.ts` around lines 141 - 150, Wrap each per-workflow tmux interaction (calls to deps.tmux.status and deps.tmux.killSession) inside a try/catch within runWatchdog so that exceptions for one row do not abort the loop; for each row where row.session_name is truthy, call deps.tmux.status(row.session_name) inside try, handle non-alive by calling deps.tmux.killSession(row.session_name) also in try/catch, call failWorkflow(deps, row.id, "tmux session disappeared", now) and increment acted only when the workflow is handled, and on caught errors log or record the error and continue to the next row (do the same guard around the other killSession usages that reference deps.tmux.killSession so a thrown error won't stop reconciliation).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@packages/dispatcher/src/watchdog.ts`:
- Around line 141-150: Wrap each per-workflow tmux interaction (calls to
deps.tmux.status and deps.tmux.killSession) inside a try/catch within
runWatchdog so that exceptions for one row do not abort the loop; for each row
where row.session_name is truthy, call deps.tmux.status(row.session_name) inside
try, handle non-alive by calling deps.tmux.killSession(row.session_name) also in
try/catch, call failWorkflow(deps, row.id, "tmux session disappeared", now) and
increment acted only when the workflow is handled, and on caught errors log or
record the error and continue to the next row (do the same guard around the
other killSession usages that reference deps.tmux.killSession so a thrown error
won't stop reconciliation).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: aee7d20c-8be9-4741-9cf7-51a97a590489
⛔ Files ignored due to path filters (1)
bun.lockis excluded by!**/*.lock
📒 Files selected for processing (26)
packages/adapters/claude/src/classify.tspackages/adapters/claude/src/hooks.tspackages/adapters/claude/src/index.tspackages/adapters/claude/test/adapter.test.tspackages/cli/src/bootstrap-assets/hooks/hook.shpackages/cli/test/bootstrap-hook.test.tspackages/core/src/events.tspackages/core/src/hook-script.tspackages/core/src/index.tspackages/dispatcher/package.jsonpackages/dispatcher/src/dispatch.tspackages/dispatcher/src/hook-server.tspackages/dispatcher/src/hook-store.tspackages/dispatcher/src/main.tspackages/dispatcher/src/rate-limits.tspackages/dispatcher/src/watchdog-cron.tspackages/dispatcher/src/watchdog.tspackages/dispatcher/src/workflow-record.tspackages/dispatcher/src/workflows/implementation.tspackages/dispatcher/test/hook-server.test.tspackages/dispatcher/test/hook-store.test.tspackages/dispatcher/test/implementation-workflow.test.tspackages/dispatcher/test/rate-limits.test.tspackages/dispatcher/test/watchdog.test.tsplanning/issues/14/decisions.mdplanning/issues/14/plan.md
✅ Files skipped from review due to trivial changes (2)
- planning/issues/14/plan.md
- planning/issues/14/decisions.md
…on pass A tmux status()/killSession() error previously bubbled out of runWatchdog, aborting the whole pass and leaving the remaining launching/running rows unreconciled. Guard each tmux interaction per-workflow: - status() errors are inconclusive (not a confirmed-dead signal), so the row is skipped this pass rather than failed — the 30s cron retries, and the activity-freshness rule is the backstop for a genuinely stuck agent. - killSession() runs only after a failure decision is already made, so a kill error is swallowed+logged via safeKillSession; the failure is still recorded.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/dispatcher/src/watchdog.ts`:
- Around line 154-167: The tmux liveness catch block currently does a "continue"
which aborts all remaining durable checks for that row; instead remove the
continue so the code proceeds to prompt-not-accepted, idle/idle-timeout,
transcript checks and sentinel re-arm, but still log the tmux error; to keep
downstream logic safe, in the catch set a fallback status (e.g., status = {
alive: false, paneCount: 0 }) after logging the error so later checks see a
defined status and reconciliation continues for the workflow referenced by
row.session_name and deps.tmux.status().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 56b8215b-a2a2-4d42-8973-0f09d61bc2a6
📒 Files selected for processing (2)
packages/dispatcher/src/watchdog.tspackages/dispatcher/test/watchdog.test.ts
…hecks The previous guard used `continue` on a status() error, which skipped not just the liveness check but also prompt-not-accepted, idle/idle-timeout (rule 3) and sentinel re-arm for that row. That defeated the stated backstop: a row whose status() kept erroring would never idle-time-out and could stay 'running' forever. Leave status null on error and guard the liveness branch (`if (status && !status.alive)`) so the row falls through to the wall-clock freshness check. Add a regression test asserting a stale row still idle-times-out when status() persistently throws.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Summary
Closes #14
Phase 2 ships middle's observability layer: a full HMAC-validated hook receiver, the universal hook script, the complete Claude hook-event set, event/heartbeat persistence into SQLite, the watchdog reconciler cron, and reactive rate-limit detection. Hooks are the fast-path notification; the on-disk transcript + SQLite are the durable state the crons reconcile against.
What changed
packages/dispatcher/src/hook-server.ts— full event-name validation, per-session bearer-token auth (X-Middle-Token, timing-safe), and an injected persistence sink;session.started/agent.stoppedstill drive theSessionGate.packages/dispatcher/src/hook-store.ts—DbHookStore: events row per hook (payload ≤16KB), heartbeat bump ontool.pre/tool.post, session-field write onsession.started, session-correlated, unmatchable sessions dropped.packages/dispatcher/src/watchdog.ts+watchdog-cron.ts— 30s reconciler: launch-timeout (stuck-launching/prompt-not-accepted), tmux liveness, idle/idle-timeout (skipped whilecontrolled_by='human', transcript cross-check),blocked.jsonsentinel re-arm, and a companion transcript-drift pass.packages/dispatcher/src/rate-limits.ts—rate_limit_stateread/write +parseResetAt+ probe-via-real-work revert.packages/adapters/claude/src/hooks.ts— full 8-hook.claude/settings.jsonmapped to the normalized taxonomy.packages/adapters/claude/src/classify.ts/index.ts—detectRateLimitStop-hook detector.packages/core/src/events.ts—NORMALIZED_EVENTS+isNormalizedEvent;hook-script.ts— canonicalHOOK_SH.packages/cli/src/bootstrap-assets/hooks/hook.sh— committed universal script (drift-guarded againstHOOK_SH).packages/dispatcher/src/main.ts— wiresDbHookStore+ the watchdog cron into the long-running process.Why these changes
The interactive agent process never exits between turns, so the dispatcher can't read an exit code — hooks POST turn boundaries and the on-disk JSONL transcript is the authoritative state. The hook server is the fast path (authenticated, validated, persisted); the watchdog is the safety net that only acts on staleness and never overrides an in-progress hook decision. Auth is a per-session bearer token, not a body signature, because
hook.shforwards the token as a header — resolved from the workflow row (populated before launch) and compared constant-time. The hook script is single-sourced from one@middle/coreconstant so the adapter-written copy and themm initbootstrap asset can never drift.Verification
bun run typecheckclean;bun test→ 173 pass / 0 fail across 22 files.packages/dispatcher/test/hook-server.test.ts: valid POST accepted+recorded, bad-HMAC → 401, unknown session → 401, unknown event → 400.packages/adapters/claude/test/adapter.test.ts: all 8 Claude hook keys + each event→normalized mapping.packages/cli/test/bootstrap-hook.test.ts: byte-identity drift test, POSIX shape, executable (git mode100755),sh -nclean.packages/dispatcher/test/hook-store.test.ts: events rows, heartbeat advance on tool.pre/post only, session.started field write, unmatchable-session drop, 16KB truncation, live HookServer → DbHookStore → SQLite.packages/dispatcher/test/watchdog.test.ts: launch-timeout, prompt-not-accepted, tmux liveness (+compensation), idle/idle-timeout, human-controlled skip, transcript cross-check, sentinel re-arm, drift correction.packages/dispatcher/test/rate-limits.test.ts+ the workflow rate-limit tests: classification →rate_limit_stateRATE_LIMITED with parsedreset_at; completed dispatch reverts to AVAILABLE.Acceptance evidence (Epic #14)
session.startedrecordssession_id+transcript_path; reconciler keeps state honesthook-store.test.ts(wired HookServer→DbHookStore→SQLite + session.started write);watchdog.test.tsreconcileTranscriptDriftwatchdog.test.ts"dead session → failed + compensation"; cron cadenceWATCHDOG_INTERVAL_MS = 30_000reset_atrate-limits.test.ts+implementation-workflow.test.ts"rate-limited classifyStop → RATE_LIMITED with parsed reset_at"The three live criteria are proven at the mechanism level by the suite (every link in the path has a test); the spec's Phase 2 manual acceptance — spawn a real Claude agent and watch events flow, kill its tmux session, force a usage-limit message — remains the operator's final hands-on check and needs a live, authenticated Claude CLI.
Status
POST /hooks/:event+ event validation + sink seamClaudeAdapter.installHooksfull 8-hook.claude/settings.jsonhook.shbootstrap asseteventstable + heartbeats from hooksclassifyStopDecisions
planning/issues/14/decisions.md, distilled into inline review comments on this PR (sink seam, bearer-token auth, watchdog-vs-execution boundary, idle-once-per-period, probe-via-real-work revert).Scope boundaries carried forward (already-planned phases — no new issues)
reset_at, delay until reset) → Phase 8 auto-dispatch loop. Phase 2 writes the durablerate_limit_statesignal.awaitStop) → Phase 5waitForintegration. Phase 2 provides thetriggerCompensationseam + durable state failure..middle/hooks/hook.shdogfood copy → Phase 3 task 22 (mm initmiddle into itself). The dispatch hard-rule forbids staging/committing under.middle/(the running dispatch's operational dir), and#17's substantive deliverable — the reusable universal script — ships inbootstrap-assets/. Flagged for reviewer.Summary by CodeRabbit