feat(dispatcher): persist parked executions across daemon restart (#116)#160

Merged
thejustinwalsh merged 7 commits into
mainfrom
middle-issue-116
May 28, 2026

Conversation

@thejustinwalsh thejustinwalsh commented May 26, 2026

Owner

Summary

Closes #116

The daemon's workflow engine was in-memory (new Engine({ embedded: true })), so a daemon restart lost every parked waiting execution while the durable waitfor_signals rows survived — after a restart the poller would fire a resume signal at an execution that no longer existed. This gives the engine a durable execution store (a SQLite db alongside db.sqlite3) with a transient in-memory step queue, calls recover() on boot, and reconciles orphaned signals so nothing is left silently stuck.

What changed

  • packages/dispatcher/src/recovery.ts (new) — createDurableEngine (persistent store + in-memory queue), recoverEngine (drop mid-drive execs → recover()), reconcileOrphanedSignals.
  • packages/dispatcher/src/main.ts — build the engine via createDurableEngine; run recoverEngine + reconcileOrphanedSignals on boot (after the workflow registers, before the poller), surfacing orphans on the Epic.
  • packages/dispatcher/src/index.ts — export the recovery surface.
  • packages/dispatcher/CLAUDE.md — replace the stale "in-memory; don't add recover()" note with the durable-store / transient-queue reality + boot order.
  • Tests — restart-resume e2e + orphan reconciliation.

Why these changes

Only the execution store needs durability; the step queue must stay transient. A persistent queue replays stale step jobs onto the fresh worker after a restart, re-driving launch-and-drive and double-launching a tmux session the restart left alive (the regression #116's out-of-scope note guards against). bunqueue couples queue+store to one dataPath via a process-singleton manager and exposes no store-only option, so createDurableEngine claims that singleton as in-memory (a throwaway Queue) before constructing the Engine. recoverEngine first drops running/compensating executions (re-driving those is the watchdog's domain, explicitly out of scope) then re-arms parked waiting ones. Rationale is inlined as PR review comments and lives in planning/issues/116/decisions.md.
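The singleton claim described above can be illustrated with a toy model. This is a sketch under stated assumptions, not bunqueue's API: `ToyQueue`, `ToyEngine`, `claimManager`, and `resetManager` are hypothetical stand-ins that only mimic the behavior the description relies on (a process-wide manager whose backing is fixed by the first caller).

```typescript
// Toy model only: hypothetical stand-ins, not bunqueue's real classes.
type Backing = { kind: "memory" } | { kind: "sqlite"; dataPath: string };

// Process-wide manager: the first caller decides the queue's backing.
let manager: Backing | undefined;

function claimManager(dataPath?: string): Backing {
  if (manager === undefined) {
    manager = dataPath === undefined ? { kind: "memory" } : { kind: "sqlite", dataPath };
  }
  return manager;
}

// Stands in for shutdownManager(): lets one process model a restart.
function resetManager(): void {
  manager = undefined;
}

class ToyQueue {
  readonly backing: Backing;
  constructor(dataPath?: string) {
    this.backing = claimManager(dataPath);
  }
}

class ToyEngine {
  readonly queueBacking: Backing;
  readonly storePath: string;
  constructor(dataPath: string) {
    // The queue is coupled to the same dataPath unless the manager is already claimed.
    this.queueBacking = claimManager(dataPath);
    // The execution store always persists at dataPath.
    this.storePath = dataPath;
  }
}

function createDurableToyEngine(dataPath: string): ToyEngine {
  new ToyQueue(); // throwaway: claims the singleton as in-memory first
  return new ToyEngine(dataPath);
}
```

In this model, constructing `ToyEngine` directly yields a SQLite-backed queue, while `createDurableToyEngine` yields an in-memory queue with the store still at `dataPath`, which mirrors the split the PR describes.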

Status

  • Phase 1: Persistent engine + boot recovery (createDurableEngine, recoverEngine)
  • Phase 2: Orphaned-signal reconciliation
  • Phase 3: Tests (restart-resume e2e + orphan unit) + docs

Acceptance criteria

Verification

  • bun test: 730 pass, 0 fail across 81 files (includes 9 new recovery.test.ts + 2 new restart e2e tests).
  • bun run typecheck — clean (tsc --noEmit).
  • bun run lint (oxlint --deny-warnings) + bun run format — clean.
  • Behavior was de-risked with throwaway spikes before implementation: a persistent queue replays a stale step job on a fresh worker (re-drive); the in-memory-queue + persistent-store config survives a simulated restart and resumes with zero re-drives; engine.signal on a missing id throws Execution "<id>" not found (the orphan symptom).

Stumbling points

  • The headline fix (new Engine({ dataPath })) is a trap: bunqueue persists the queue and the store off one dataPath, and the persistent queue replays stale step jobs on the fresh worker — silently re-driving launch-and-drive. The first restart e2e caught it (an extra "initial" drive before any signal). The fix is the transient-queue/persistent-store split via the process-singleton claim.
  • bunqueue's queue manager is a process-level singleton keyed by the first dataPath, so a single test process can't model a restart without shutdownManager() between engines.

Suggested CLAUDE.md updates

Applied in this PR (packages/dispatcher/CLAUDE.md): the durable-store/transient-queue rule, the boot recovery order, and the singleton/shutdownManager() test note.

Known edges (out of scope)

  • Broader crash-recovery of running (mid-drive) executions — the watchdog's domain; this PR deliberately preserves that by dropping them before recover().
  • A parked workflow surviving a >7-day daemon outage: recover() fires the waitFor timeout and bunqueue fails the workflow (with compensation/worktree cleanup). Pre-existing bunqueue behavior for any waitFor timeout; extreme and benign.

Follow-up issues

None — no parallelizable discovery surfaced.

Summary by CodeRabbit

  • New Features

    • Durable workflow recovery across daemon restarts so parked executions persist and can be resumed
    • Orphaned-signal reconciliation to detect and finalize unrecoverable parked workflows, preventing repeated polls
  • Behavior Change

    • Parked-workflow finalization now enforces terminal-only final states to prevent invalid transitions
  • Tests

    • End-to-end and unit tests covering durable recovery, restart/resume scenarios, and orphan reconciliation
  • Documentation

    • Added planning/decision docs describing recovery, boot sequence, and durability guidance

@coderabbitai

coderabbitai Bot commented May 26, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: f793c06f-7207-44d6-b3e0-d8eeebeadc29

📥 Commits

Reviewing files that changed from the base of the PR and between 9f244d7 and 708fe4a.

📒 Files selected for processing (9)
  • packages/dispatcher/CLAUDE.md
  • packages/dispatcher/src/index.ts
  • packages/dispatcher/src/main.ts
  • packages/dispatcher/src/recovery.ts
  • packages/dispatcher/src/workflow-record.ts
  • packages/dispatcher/test/implementation-workflow.test.ts
  • packages/dispatcher/test/recovery.test.ts
  • planning/issues/116/decisions.md
  • planning/issues/116/plan.md

📝 Walkthrough

Walkthrough

Makes the dispatcher’s bunqueue Engine durable (SQLite-backed), adds createDurableEngine/recoverEngine/reconcileOrphanedSignals, integrates recovery into daemon boot before polling, tightens terminal-state typing, and adds unit and end-to-end tests plus planning docs for Issue #116.

Changes

Durable Workflow Recovery

| Layer | File(s) | Summary |
|---|---|---|
| Recovery module: durable engine and orphan reconciliation | packages/dispatcher/src/recovery.ts | createDurableEngine(dataPath) builds an Engine with persistent SQLite storage while keeping the step queue transient. recoverEngine(engine) clears running/compensating then awaits engine.recover(). reconcileOrphanedSignals(deps) scans pollable waits, finalizes orphaned waiting-human workflows to a terminal state, and consumes corresponding signal rows, optionally surfacing orphans. |
| Daemon boot sequence: durable engine initialization and recovery | packages/dispatcher/src/main.ts | Replaces in-memory Engine with durable Engine (queue.sqlite3), registers workflows, then runs recoverEngine and reconcileOrphanedSignals before starting the poller; finalized orphans may trigger best-effort GitHub comments when an Epic is present. |
| Public API surface: recovery exports | packages/dispatcher/src/index.ts | Re-exports createDurableEngine, recoverEngine, reconcileOrphanedSignals, and types (EngineRecoveryResult, OrphanedSignal, ReconcileOrphanedSignalsDeps). |
| Workflow record typing | packages/dispatcher/src/workflow-record.ts | Adds TerminalWorkflowState type, narrows finalizeParkedWorkflow to accept only terminal states, and retypes TERMINAL_STATES. |
| Unit tests: recovery and orphan reconciliation | packages/dispatcher/test/recovery.test.ts | Comprehensive tests for reconcileOrphanedSignals (orphan detection, finalization, consumption, surface callback behavior, epicNumber null handling) and recoverEngine (durable restart re-arming and clearing mid-drive executions). |
| Integration tests: end-to-end restart and orphaned signal scenarios | packages/dispatcher/test/implementation-workflow.test.ts | Adds durable-recovery suite: parks workflows, simulates daemon restart via engine.close(true) + shutdownManager(), runs recoverEngine, verifies resume behavior and no re-drive, and tests orphan reconciliation by deleting the durable store before restart. |
| Planning and documentation | packages/dispatcher/CLAUDE.md, planning/issues/116/plan.md, planning/issues/116/decisions.md | Documents durability split (persistent workflow store; in-memory step queue), engine construction, boot cleanup→recover→reconcile ordering, dataPath co-location with db.sqlite3, and test restart modeling (shutdownManager). |

Sequence Diagram

sequenceDiagram
  participant Daemon as Dispatcher Daemon
  participant createDurable as createDurableEngine
  participant Engine
  participant Recover as recoverEngine
  participant Reconcile as reconcileOrphanedSignals
  participant DB as queue.sqlite3
  participant Poller as Resume Poller

  Note over Daemon,DB: Boot with persistent store and in-memory queue
  Daemon->>createDurable: createDurableEngine(dataPath)
  createDurable->>Engine: returns Engine(embedded, dataPath)
  Daemon->>Daemon: registerWorkflows()

  Note over Daemon,DB: Recovery: re-arm parked executions
  Daemon->>Recover: recoverEngine(engine)
  Recover->>Engine: engine.cleanup() — clear running/compensating
  Recover->>Engine: engine.recover() — re-arm waiting executions

  Note over Daemon,Reconcile: Reconcile orphaned signals
  Daemon->>Reconcile: reconcileOrphanedSignals({db, hasExecution})
  Reconcile->>DB: scan waiting-human signals
  Reconcile->>Reconcile: finalize orphan + consume signal

  Note over Poller,DB: Start polling/resume flow
  Daemon->>Poller: startPolling()
  Poller->>DB: poll waitfor_signals
  Poller->>Engine: engine.signal() on recoverable executions

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • thejustinwalsh/middle#139: Touches parked-workflow finalization and calls into finalizeParkedWorkflow, related to the terminal-state typing and finalization behavior changed here.
  • thejustinwalsh/middle#90: Introduces the waitfor_signals poll/consume mechanism; this PR extends durable handling and reconciliation for those same DB rows.
  • thejustinwalsh/middle#160: Implements the same Issue #116 durable parked-execution restart/recovery changes (recovery.ts, main wiring, exports, and tests).
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 76.92% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'feat(dispatcher): persist parked executions across daemon restart (#116)' clearly and concisely summarizes the main change: implementing durable persistence for parked workflow executions across daemon restarts. |
| Linked Issues check | ✅ Passed | All acceptance criteria from issue #116 are implemented: durable dataPath for engine store, engine.recover() called on boot, parked executions survive and resume after restart, orphaned signals reconciled, comprehensive tests added, and tests pass. |
| Out of Scope Changes check | ✅ Passed | All changes align with issue #116 objectives: recovery infrastructure, durable engine initialization, boot-time recovery sequence, orphan reconciliation, and tests for these scenarios. No unrelated changes detected. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@thejustinwalsh thejustinwalsh left a comment

Decision-log rationale (from planning/issues/116/decisions.md), inlined where it lives in the code.

* Caveat: if a `BUNQUEUE_DATA_PATH`/`BQ_DATA_PATH`/`DATA_PATH`/`SQLITE_PATH` env var is
* set, the throwaway `Queue` would itself become persistent — middle never sets these.
*/
export function createDurableEngine(dataPath: string): Engine {

Why a transient queue, not just new Engine({ dataPath }). A spike proved a persistent queue replays stale step jobs onto the fresh worker after a restart — re-driving launch-and-drive and double-launching a tmux session the restart left alive (the exact regression #116's out-of-scope note guards against). The branch-before-waitFor shape in the implementation workflow leaves a non-terminal step job the new worker auto-processes on construct. Only the execution store needs durability; recoverEngine rebuilds the queue from it. bunqueue couples queue+store to one dataPath via a process-singleton manager keyed by the first caller and exposes no store-only option, so claiming the singleton in-memory first (the throwaway Queue) is the lever. Verified by spikes: branchless→no replay, branch→replay, in-memory-queue→no replay + resume works.

* needs the workflow definition) and BEFORE the poller starts (so it never fires a
* resume at an exec recover hasn't re-armed yet).
*/
export async function recoverEngine(engine: Engine): Promise<EngineRecoveryResult> {

Why drop running/compensating before recover(). engine.recover() is all-or-nothing: it re-enqueues running execs (re-running launch-and-drive) and re-runs compensating ones. #116 explicitly scopes running-execution recovery OUT ("a tmux session lost to a restart is the watchdog's domain"), and a daemon restart does not kill the agent's tmux sessions, so re-driving would double-launch. Dropping them preserves today's behavior exactly: the watchdog reconciles launching/running rows on its first tick. cleanup(0, …) deletes execs with updated_at < now — every pre-restart row in those states, and never waiting.
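The drop-then-re-arm order can be sketched over a plain array. This is a toy model, assuming a simplified execution shape: `ToyExecution`, `ToyRecoveryResult`, and `recoverToyStore` are hypothetical names for illustration and do not reflect the real `EngineRecoveryResult` fields or bunqueue's store.

```typescript
// Toy model only: a simplified stand-in for the durable execution store.
type ToyExecState = "running" | "compensating" | "waiting" | "completed";

interface ToyExecution {
  id: string;
  state: ToyExecState;
}

interface ToyRecoveryResult {
  kept: ToyExecution[]; // rows left in the store after the drop
  dropped: string[];    // mid-drive executions removed before recovery
  rearmed: string[];    // parked executions recovery would re-arm
}

function recoverToyStore(executions: ToyExecution[]): ToyRecoveryResult {
  const isMidDrive = (e: ToyExecution): boolean =>
    e.state === "running" || e.state === "compensating";

  // Step 1: drop mid-drive executions so recovery cannot re-drive them.
  const dropped = executions.filter(isMidDrive).map((e) => e.id);
  const kept = executions.filter((e) => !isMidDrive(e));

  // Step 2: only parked `waiting` executions remain to be re-armed.
  const rearmed = kept.filter((e) => e.state === "waiting").map((e) => e.id);

  return { kept, dropped, rearmed };
}
```

Because the drop happens first, nothing in a running or compensating state is left for the recovery step to re-enqueue, which is the property the comment above depends on.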

* its signal row consumed so the poller stops watching it, and surfaced for a human.
* Returns the orphans it reconciled.
*/
export async function reconcileOrphanedSignals(

Why an orphan is finalized failed (not left, not cancelled). Post-restart the execution should be recoverable from the durable store; an orphan means the store never had it (a park from before persistence shipped, or a wiped queue db). The bug #116 calls out is the poller firing engine.signal at a dead execution — which throws Execution "<id>" not found every pass forever. Finalizing stops the poller (its loadPollableWaits only sees waiting-human rows), frees the slot, and makes the failure visible; failed (vs cancelled) because it genuinely failed to recover and warrants a human look. finalizeParkedWorkflow is conditional on the row still being waiting-human, so it can't clobber a concurrent resume. finalState is overridable if a reviewer prefers cancelled.
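A minimal sketch of that reconciliation loop follows. It is synchronous for brevity (the real function is async), and the dependency names (`loadPollableWaits`, `hasExecution`, `finalize`, `consumeSignal`, `surface`) and the `ToyWait` shape are assumptions for illustration, not the actual `ReconcileOrphanedSignalsDeps` interface.

```typescript
// Toy model only: hypothetical dependency names, synchronous for brevity.
type ToyTerminalState = "completed" | "compensated" | "failed" | "cancelled";

interface ToyWait {
  workflowId: string;
  executionId: string;
  epicNumber: number | null;
}

interface ToyReconcileDeps {
  loadPollableWaits(): ToyWait[];
  hasExecution(executionId: string): boolean;
  // Returns false when the row is no longer waiting-human (a concurrent resume won).
  finalize(workflowId: string, finalState: ToyTerminalState): boolean;
  consumeSignal(workflowId: string): void;
  surface?(orphan: ToyWait): void;
  finalState?: ToyTerminalState;
}

function reconcileToyOrphans(deps: ToyReconcileDeps): ToyWait[] {
  const finalState = deps.finalState ?? "failed";
  const orphans: ToyWait[] = [];
  for (const wait of deps.loadPollableWaits()) {
    // Recoverable park: the store still has the execution, leave it alone.
    if (deps.hasExecution(wait.executionId)) continue;
    // Conditional finalize: skip if the row already moved on.
    if (!deps.finalize(wait.workflowId, finalState)) continue;
    deps.consumeSignal(wait.workflowId); // the poller stops watching this row
    deps.surface?.(wait);                // make the failure visible to a human
    orphans.push(wait);
  }
  return orphans;
}
```

The default of `failed` and the optional `finalState` override mirror the disposition discussed above; a caller preferring `cancelled` would pass it explicitly.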

// watchdog's domain — then re-arm parked `waiting` executions so the poller can
// resume them. Runs AFTER the registers (recover needs the definitions) and
// BEFORE the poller (so it never signals an exec recover hasn't re-armed).
const recovery = await recoverEngine(engine);

Boot recovery order. Runs AFTER both engine.register(...) (recover may re-enqueue/resume, which needs the workflow definitions) and AFTER the control-feed workflow observer is registered (so an orphan's failed transition broadcasts), and BEFORE startWatchdog/startPoller (so the poller never fires a resume at an exec recover hasn't re-armed). engine.recover() is mandated by the issue and uniquely re-arms the waitFor 7-day timeout timer — resume-via-signal itself already works off the durable store without it.

@thejustinwalsh thejustinwalsh marked this pull request as ready for review May 26, 2026 21:14
@thejustinwalsh thejustinwalsh added the ready-for-review All phases done and verified — PR ready for final human review and merge label May 26, 2026
@thejustinwalsh

Reviewer's brief — #116 durable parked-execution recovery (PR #160)

What it does: the dispatcher's workflow engine now has a durable execution store (<dbdir>/queue.sqlite3) with a transient in-memory step queue, recovers parked waiting executions on boot, and reconciles orphaned waitfor_signals rows.

How to run it

bun install
bun run typecheck                                   # tsc --noEmit, clean
bun run lint                                        # oxlint --deny-warnings, clean
bun test                                            # 730 pass, 0 fail
bun test packages/dispatcher/test/recovery.test.ts  # the new unit suite (9)
bun test packages/dispatcher/test/implementation-workflow.test.ts -t "durable recovery"  # the 2 restart e2e

What to verify (and what "correct" looks like)

  1. Transient queue is the load-bearing call. In packages/dispatcher/src/recovery.ts, createDurableEngine constructs a throwaway in-memory Queue before new Engine({ embedded: true, dataPath }). This is intentional: bunqueue's embedded queue+worker share a process-singleton manager keyed by the first dataPath, so constructing the Engine first would make the queue persistent, and a persistent queue replays stale step jobs on the fresh worker after a restart and re-drives launch-and-drive (double tmux session). Correct = only the WorkflowStore persists; the queue stays in-memory; recover() rebuilds it. The e2e asserts prompts === ["initial"] after recover; a persistent queue would show a second drive.
  2. Boot order (main.ts, ~L602): recoverEngine runs after both engine.register(...) and after the control-feed observer is registered, and before startWatchdog/startPoller. recoverEngine drops running/compensating execs (engine.cleanup(0, …)) — re-driving those is the watchdog's domain, explicitly out of scope — then engine.recover() re-arms parked waiting ones.
  3. Orphan reconciliation (reconcileOrphanedSignals): a waiting-human workflow whose bunqueue execution the store no longer has (engine.getExecution(id) === null) is finalized failed, its signal consumed, and surfaced (log + best-effort Epic comment). Without it the poller fires engine.signal at a dead execution every pass forever (Execution "<id>" not found). The orphan-after-store-loss e2e physically deletes (rm) the queue db to manufacture this.

How to review

  • The decision rationale is inlined as four PR review comments (and in planning/issues/116/decisions.md).
  • The orphan disposition is failed (visible, frees the slot, stops the poller); finalState is overridable if you'd prefer cancelled.

Fragile / extra eyes

  • The process-singleton claim in createDurableEngine is the one bunqueue-internals coupling. It breaks if (a) another embedded bunqueue is constructed before the engine in main.ts, or (b) a BUNQUEUE_DATA_PATH/DATA_PATH env var is set (middle sets none). Documented in packages/dispatcher/CLAUDE.md.
  • Tests simulate a restart with shutdownManager() between engines (resets the singleton) — required because a single process otherwise reuses the closed manager.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
packages/dispatcher/test/implementation-workflow.test.ts (2)

957-1028: ⚡ Quick win

Guarantee durable-engine teardown even when an assertion fails.

These tests close the shared in-memory engine first, then manage e1/e2 manually. If anything throws before the final close(true), the file-level afterEach only closes the already-closed shared engine, so the durable engine and its SQLite handle are left behind. A try/finally around each test body, or storing the active durable engine in shared teardown, would make these restart cases much less flaky.

Also applies to: 1030-1067

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/dispatcher/test/implementation-workflow.test.ts` around lines 957 -
1028, The test tears down the shared in-memory engine but can leak durable
engines e1/e2 (and their SQLite handles) if an assertion throws; wrap the test
body that creates e1/e2 in a try/finally (or assign the active durable Engine to
the module-level engine variable) and in the finally ensure any created durable
Engine (e1, e2 or whatever Engine instance is currentEngine) is closed with
close(true) and shutdownManager() so the SQLite handle is always released;
update both the restart-case test at lines ~957–1028 and the similar block at
~1030–1067 to follow this pattern and reference currentEngine/e1/e2 to locate
where to add the finally cleanup.

1013-1016: ⚡ Quick win

Assert that the post-restart continuation reuses the parked worktree.

This restart path currently proves the continuation re-drives, but not that it keeps the original checkout alive. A regression that recreates a fresh worktree after restart would still pass here while silently dropping the parked state. Please add the same worktreePath equality check the in-process resume test already has.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/dispatcher/test/implementation-workflow.test.ts` around lines 1013 -
1016, The test verifies continuation re-drives but not that it reused the
original parked worktree; capture the worktreePath before the restart (e.g.
worktreePathBefore) and after the restart (e.g. worktreePathAfter) and assert
they are equal — mirror the worktreePath equality assertion from the in-process
resume test by obtaining the parked worktreePath for id1 (using the same helper
used there) after awaitParkedOn(e2, id1) and comparing it to the original
worktreePath saved earlier in the test, keeping the existing
awaitContinuation/awaitParkedOn, prompts, and readPromptBrief assertions intact.
packages/dispatcher/test/recovery.test.ts (1)

225-252: ⚡ Quick win

Always close the durable engines on failure paths.

Both restart tests only close e1/e2 on the happy path. If an assertion fails earlier, the suite-level afterEach only resets bunqueue and removes the temp dir, so the durable engine can stay alive and bleed into the next case. Wrap each test body in try/finally or track the active engine in shared teardown.

Also applies to: 254-289

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/dispatcher/test/recovery.test.ts` around lines 225 - 252, The tests
create durable engines e1 and e2 (via createDurableEngine) but only close them
on the happy path, so if an assertion fails the engine stays alive and leaks
state; update each test to ensure e1 and e2 are always closed by wrapping the
test body in try/finally (or track the active engine and close in shared
teardown) and call e1.close(true) and e2.close(true) in finally; ensure
shutdownManager() and recoverEngine() usage remains the same but that any
created engine references are closed on all paths to prevent cross-test leakage.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/dispatcher/src/recovery.ts`:
- Around line 78-90: The declared finalState on ReconcileOrphanedSignalsDeps
must be narrowed to only terminal workflow states to prevent consuming wait rows
into non-terminal states; define a new union type (e.g., TerminalWorkflowState)
containing the concrete terminal states used by the engine and change
finalState?: WorkflowState to finalState?: TerminalWorkflowState on
ReconcileOrphanedSignalsDeps, and add a runtime check where the deps object is
constructed/used to assert/throw if a non-terminal value is passed (so misuses
fail fast).
- Around line 25-30: createDurableEngine currently creates a transient
Queue("__mm:engine-queue", { embedded: true }) but relies only on docs to ensure
BUNQUEUE_DATA_PATH/BQ_DATA_PATH/DATA_PATH/SQLITE_PATH are unset; add a runtime
guard at the start of createDurableEngine that checks process.env for those four
vars and throws a clear Error if any are set so the embedded Queue cannot be
made persistent via parent-process env injection, then proceed to instantiate
new Queue("__mm:engine-queue", { embedded: true }) and return new Engine({
embedded: true, dataPath }) as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: f0b8b13e-5548-4547-8162-874ddca5ab9a

📥 Commits

Reviewing files that changed from the base of the PR and between 720044c and 7c4245c.

📒 Files selected for processing (8)
  • packages/dispatcher/CLAUDE.md
  • packages/dispatcher/src/index.ts
  • packages/dispatcher/src/main.ts
  • packages/dispatcher/src/recovery.ts
  • packages/dispatcher/test/implementation-workflow.test.ts
  • packages/dispatcher/test/recovery.test.ts
  • planning/issues/116/decisions.md
  • planning/issues/116/plan.md

Comment thread packages/dispatcher/src/recovery.ts Outdated
Comment thread packages/dispatcher/src/recovery.ts Outdated
thejustinwalsh added a commit that referenced this pull request May 26, 2026
Two robustness fixes from the PR #160 review:

- createDurableEngine now throws if any of BUNQUEUE_DATA_PATH / BQ_DATA_PATH /
  DATA_PATH / SQLITE_PATH is set. bunqueue's getDataPath() coalesces exactly
  those four with `??`, so a parent-process env injection would otherwise make
  the throwaway in-memory Queue persistent — replaying stale step jobs onto the
  fresh worker after a restart and double-launching a session. The guard runs
  before claiming the singleton and matches bunqueue's nullish semantics ("" is
  a set value, not a fallback).

- Narrow finalizeParkedWorkflow's finalState and ReconcileOrphanedSignalsDeps's
  finalState from WorkflowState to a new TerminalWorkflowState (a strict subset:
  completed | compensated | failed | cancelled). Finalizing a parked workflow to
  a non-terminal state would consume its wait row yet strand it with no recovery
  path; the narrowing makes that a compile error at every call site. TERMINAL_STATES
  now `satisfies readonly TerminalWorkflowState[]` to keep the two in sync.
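The env guard from the first fix might look roughly like the sketch below. The four variable names come from the commit message; the function names and the error text are hypothetical, and the environment is passed in as a parameter so the check can be exercised without touching the real process environment.

```typescript
// Sketch only: function names and error wording are assumptions.
const PERSISTENT_QUEUE_ENV_VARS = [
  "BUNQUEUE_DATA_PATH",
  "BQ_DATA_PATH",
  "DATA_PATH",
  "SQLITE_PATH",
] as const;

type EnvLike = Record<string, string | undefined>;

// Nullish semantics, matching `??`: an empty string counts as set.
function findPersistentQueueEnvVars(env: EnvLike): string[] {
  return PERSISTENT_QUEUE_ENV_VARS.filter((name) => env[name] !== undefined);
}

function assertTransientQueueEnv(env: EnvLike): void {
  const offending = findPersistentQueueEnvVars(env);
  if (offending.length > 0) {
    throw new Error(
      `Refusing to build the engine: ${offending.join(", ")} would make the step queue persistent`,
    );
  }
}
```

A caller would run `assertTransientQueueEnv(process.env)` before claiming the singleton, so a misconfigured parent process fails fast instead of silently persisting the queue.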
@thejustinwalsh

Review round 1 — addressed (3cc2004, 9f244d7)

Actionable (2): replied in-thread.

  • Transient-queue env guard in createDurableEngine — runtime guard over the exact four vars getDataPath() reads, runs before claiming the singleton.
  • finalState narrowed to a new TerminalWorkflowState — at the write site (finalizeParkedWorkflow), so both callers are covered, plus a @ts-expect-error typecheck pin.
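The type narrowing can be sketched as follows. The four terminal states are taken from the commit message; the non-terminal members of `WorkflowState` listed here and the helper names (`isTerminal`, `finalizeParked`) are assumptions for illustration, not the contents of workflow-record.ts.

```typescript
// Sketch only: the non-terminal states listed here are assumed.
type WorkflowState =
  | "launching"
  | "running"
  | "waiting-human"
  | "completed"
  | "compensated"
  | "failed"
  | "cancelled";

type TerminalWorkflowState = Extract<
  WorkflowState,
  "completed" | "compensated" | "failed" | "cancelled"
>;

// `satisfies` keeps the runtime list and the type in sync.
const TERMINAL_STATES = [
  "completed",
  "compensated",
  "failed",
  "cancelled",
] as const satisfies readonly TerminalWorkflowState[];

function isTerminal(state: WorkflowState): state is TerminalWorkflowState {
  return (TERMINAL_STATES as readonly WorkflowState[]).indexOf(state) !== -1;
}

// Only a parked (waiting-human) workflow is finalized; anything else is left as-is.
function finalizeParked(
  current: WorkflowState,
  finalState: TerminalWorkflowState,
): WorkflowState {
  return current === "waiting-human" ? finalState : current;
}

// Typecheck pin: passing a non-terminal state must be a compile error.
// @ts-expect-error "running" is not a TerminalWorkflowState
finalizeParked("waiting-human", "running");
```

Putting the narrowing on the write-site parameter means every caller inherits it, which is the reason the round-1 reply gives for placing it on finalizeParkedWorkflow.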

Nitpicks (3): all addressed.

  • Durable-engine teardown on failure paths (both test files) — each restart suite now tracks every engine it opens and closes them idempotently in afterEach; the post-restart shutdownManager + fresh in-memory engine reset is centralized into teardown, so a mid-test assertion failure can't leak a durable engine/SQLite handle or bleed the bunqueue singleton into the next case.
  • Post-restart worktree reuse (implementation-workflow.test.ts) — added the worktreePath equality assertion mirroring the in-process resume test.
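The track-and-close teardown can be sketched with a small helper. `ToyClosable` and `EngineTracker` are hypothetical names, and `close` is modeled as synchronous for brevity; the actual suites may be structured differently.

```typescript
// Sketch only: a generic tracker, not the helper used in the test suites.
interface ToyClosable {
  close(force: boolean): void;
}

class EngineTracker<T extends ToyClosable> {
  private open: T[] = [];

  // Register an engine at the moment it is created, before any assertion runs.
  track(engine: T): T {
    this.open.push(engine);
    return engine;
  }

  // Safe to call from afterEach on every path: a second call is a no-op.
  closeAll(): void {
    const pending = this.open;
    this.open = [];
    for (const engine of pending) {
      try {
        engine.close(true);
      } catch {
        // Already closed by the test body; teardown should not fail on that.
      }
    }
  }
}
```

A test would wrap each engine it opens in `tracker.track(...)` and call `tracker.closeAll()` from `afterEach`, so an assertion failure midway still releases every engine.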

Verification: bun run typecheck clean, bun run lint clean, bun test packages/dispatcher/ → 421 pass / 0 fail.

Give the workflow engine a durable execution store (a SQLite db alongside
db.sqlite3) so a parked `waiting` execution survives a daemon restart, while
keeping the step queue in-memory — a persistent queue replays stale step jobs
onto the fresh worker and re-drives launch-and-drive (double session). The
`createDurableEngine` factory claims bunqueue's process-singleton queue manager
as in-memory before constructing the Engine; `recoverEngine` drops mid-drive
(running/compensating) executions (the watchdog's domain) then re-arms parked
`waiting` ones, and `reconcileOrphanedSignals` finalizes any waiting-human row
whose execution the store no longer has so the poller stops firing at a dead
execution.
…onciliation (#116)

Restart is simulated by engine.close(true) + shutdownManager() (resetting
bunqueue's process-singleton manager) before the second createDurableEngine on
the same dataPath — modelling a real separate-process boot. Covers: a real
implementation workflow parked on .waitFor(RESUME_EVENT) surviving a restart and
a review verdict resuming it (no re-drive); recoverEngine re-arming a parked
waiting exec and dropping a mid-drive running one; and orphaned-signal
reconciliation (orphan finalized + consumed + surfaced; alive parks untouched).
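The orphan case these tests cover (orphan finalized, consumed and surfaced; live parks untouched) can be sketched as below. All names here (`WaitRow`, `ReconcileDeps`, `reconcileOrphans`) are hypothetical simplifications; the actual `reconcileOrphanedSignals` takes its own dependency object and works against the durable `waitfor_signals` rows.

```typescript
// Hypothetical, simplified view of a durable wait row.
interface WaitRow {
  executionId: string;
  consumed: boolean;
}

// Hypothetical dependencies; the real deps object has a different shape.
interface ReconcileDeps {
  hasExecution(executionId: string): boolean; // does the store still hold it?
  finalize(executionId: string): void; // move the workflow to a terminal state
  surface(executionId: string): void; // report the orphan, e.g. on the Epic
}

function reconcileOrphans(rows: WaitRow[], deps: ReconcileDeps): string[] {
  const orphans: string[] = [];
  for (const row of rows) {
    // Live parks and already-consumed rows are left untouched.
    if (row.consumed || deps.hasExecution(row.executionId)) continue;
    deps.finalize(row.executionId);
    row.consumed = true; // the poller will no longer fire at this row
    deps.surface(row.executionId);
    orphans.push(row.executionId);
  }
  return orphans;
}
```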
Replace the stale "engine is in-memory; don't add recover()" note with the
durable-store / transient-queue reality, the boot recovery order, and why the
queue must not persist. Add the recovery surface to the module front door.
Two robustness fixes from the PR #160 review:

- createDurableEngine now throws if any of BUNQUEUE_DATA_PATH / BQ_DATA_PATH /
  DATA_PATH / SQLITE_PATH is set. bunqueue's getDataPath() coalesces exactly
  those four with `??`, so a parent-process env injection would otherwise make
  the throwaway in-memory Queue persistent — replaying stale step jobs onto the
  fresh worker after a restart and double-launching a session. The guard runs
  before claiming the singleton and matches bunqueue's nullish semantics ("" is
  a set value, not a fallback).

- Narrow finalizeParkedWorkflow's finalState and ReconcileOrphanedSignalsDeps's
  finalState from WorkflowState to a new TerminalWorkflowState (a strict subset:
  completed | compensated | failed | cancelled). Finalizing a parked workflow to
  a non-terminal state would consume its wait row yet strand it with no recovery
  path; the narrowing makes that a compile error at every call site. TERMINAL_STATES
  now `satisfies readonly TerminalWorkflowState[]` to keep the two in sync.
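The type narrowing in the second fix can be illustrated with a small sketch. The four terminal states come from the commit message; the non-terminal members of `WorkflowState` and the `finalizeParked` helper are assumptions made only so the example is self-contained.

```typescript
// Assumption: the non-terminal members here are illustrative, not the real union.
type WorkflowState =
  | "running"
  | "waiting"
  | "compensating"
  | "completed"
  | "compensated"
  | "failed"
  | "cancelled";

// Strict subset named in the commit message.
type TerminalWorkflowState = Extract<
  WorkflowState,
  "completed" | "compensated" | "failed" | "cancelled"
>;

// `satisfies` rejects any entry that is not a terminal state,
// so the runtime list cannot drift from the type.
const TERMINAL_STATES = [
  "completed",
  "compensated",
  "failed",
  "cancelled",
] satisfies readonly TerminalWorkflowState[];

function isTerminal(state: WorkflowState): state is TerminalWorkflowState {
  return TERMINAL_STATES.some((terminal) => terminal === state);
}

// Hypothetical stand-in for the write site: only terminal states are accepted.
function finalizeParked(executionId: string, finalState: TerminalWorkflowState): string {
  return `${executionId} -> ${finalState}`;
}

// @ts-expect-error "waiting" is not terminal, so this call must not type-check.
finalizeParked("exec-1", "waiting");
```

The `@ts-expect-error` line doubles as a regression pin: if the parameter is ever widened back to `WorkflowState`, the directive becomes unused and the typecheck fails.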
… all paths

- Assert createDurableEngine throws for each persistent-queue env var (and an
  empty-string value, matching bunqueue's `??`), naming every offending var.
- Add a compile-time @ts-expect-error guard that finalState rejects non-terminal
  states (enforced by the typecheck gate).
- Track every durable engine the restart suites open and close them idempotently
  in afterEach, so a mid-test assertion failure can't leak a durable engine + its
  SQLite handle or bleed the bunqueue singleton into the next case. Centralizes the
  post-restart cleanup (shutdownManager + fresh in-memory engine) into teardown.
- After a restart-driven continuation, assert it inherits the parked worktree path
  (mirrors the in-process resume test).
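The env guard exercised by the first test bullet might look roughly like this. The four variable names come from the commit message; `assertTransientQueueEnv` and the error wording are hypothetical, and the environment is passed in as a plain object so the sketch does not depend on any runtime's `process.env`.

```typescript
// The four variables the commit message says bunqueue's getDataPath() reads.
const PERSISTENT_QUEUE_ENV_VARS = [
  "BUNQUEUE_DATA_PATH",
  "BQ_DATA_PATH",
  "DATA_PATH",
  "SQLITE_PATH",
] as const;

type Env = Readonly<Record<string, string | undefined>>;

// Hypothetical guard: throws before any queue is constructed if a data path is set.
function assertTransientQueueEnv(env: Env): void {
  // `!== undefined` mirrors `??`: an empty string is a set value, not a fallback.
  const offending = PERSISTENT_QUEUE_ENV_VARS.filter((name) => env[name] !== undefined);
  if (offending.length > 0) {
    throw new Error(
      `transient step queue requires these variables to be unset: ${offending.join(", ")}`,
    );
  }
}
```

Listing every offending variable in one error, rather than failing on the first, lets an operator fix the environment in a single pass.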
@thejustinwalsh thejustinwalsh merged commit c496ae7 into main May 28, 2026
1 check was pending
@thejustinwalsh thejustinwalsh deleted the middle-issue-116 branch May 28, 2026 19:31
thejustinwalsh added a commit that referenced this pull request May 28, 2026
thejustinwalsh added a commit that referenced this pull request May 29, 2026
Single-pass, new-work-as-base merge of origin/main, used because rebase kept
re-conflicting on the same hunks across multiple commits (the CLAUDE.md
escape hatch).

- packages/dispatcher/src/poller-cron.ts — unified `startPoller(deps,
  opts)` signature; folded `ReconcilerHooks` into `StartPollerOptions`
  as `opts.reconcilers` (alongside `opts.checkboxRevert` and
  `opts.intervalMs`).
- packages/dispatcher/src/main.ts — unified daemon-startup: keeps the
  durable engine + `recoverEngine` + `reconcileOrphanedSignals` from
  #160, the notification-failsafe watchdog comment from #162, and adds
  the `reconcileOpenPRsForRepo` block + `reconcilers` config in the
  `startPoller` call. Dropped the now-unused `Engine` import (main
  routes through `createDurableEngine`).
- packages/core/src/index.ts — kept both export blocks: integration
  rubric from #163, `selectAdapter` from this PR.
- packages/dispatcher/test/recommender-run.test.ts — kept both describe
  blocks (adapter-enabled gate from this PR, schema-resolution from
  #157); added `enabled: true` to the schema test's adapter config so
  it passes the new gate.
- packages/dispatcher/test/gates/checkbox-revert-pass.test.ts — added to
  the test stub the five new `GitHubGateway` methods that main gained
  during the marathon (`listOpenIssues`, `addLabel`,
  `listMergedPrsClosingRefs`, `closeIssue`, `createIssue`).

Gates re-verified locally: `bun run typecheck` clean, `bun test
packages/dispatcher` 620/620 pass, `bun run lint` clean, `bun run
format` clean (no changes).

Labels

ready-for-review All phases done and verified — PR ready for final human review and merge

Development

Successfully merging this pull request may close these issues.

Persist parked executions across daemon restart (durable bunqueue store)

1 participant