fix(gateway): surface worker failures to chat clients as terminal errors (#946) by buremba · Pull Request #971 · lobu-ai/lobu

buremba · 2026-05-20T16:14:18Z

Summary

Fixes #946. A worker that can't produce a reply (crash, hang, or pod death) left the client's SSE/CLI hanging forever, or returned a silent complete with no content. This makes a failed turn always produce a terminal event the client renders — correct under N>1 k8s replicas (the constraint codified in #962).

Supersedes the closed #958 (which was in-memory / single-process only).

Mechanism — turn-liveness election (no new schema)

Every dispatched turn arms a passive marker row in public.runs on the consumer-less internal:turn_timeout queue (turn-liveness.ts). It's never claimed as a job, so the queue's status machinery never touches it — its existence is the durable obligation that the turn owes the client a terminal event.

Election: DELETE … RETURNING is first-writer-wins; the deleter emits a terminal thread_response{error} in the same transaction (atomic, crash-safe).
Fast path (instant): EmbeddedDeploymentManager child.once("exit"/"error") fails the deployment's markers immediately. killWorker sets an intentionalExit flag so deliberate stops never notify.
Backstop (deadline): a periodic SKIP LOCKED sweep fails lapsed markers — covers a hung worker and a worker-pod death (the marker outlives the pod; any replica sweeps it). Deadline pushed forward by the worker's 20s heartbeat.
Owner-gated delivery: terminal API rows (success completion and error) re-queue unless this pod holds the client's SSE, so a row produced on one replica reaches the SSE-owning replica. Also fixes the pre-existing cross-pod success-completion drop. Marker key deploymentName:messageId (globally unique — platform message IDs are per-chat).
Render via error: trackFailedDeployment + input-guardrail reject moved off the dropped content field onto error (renders end-to-end: SSE error event + CLI exit 1; platforms post Error: …).

Reproducer (red → green)

Red (#946 + handoff): lobu chat against an agent whose worker can't reply → SSE sits on connected + ping forever (or silent complete).

Green — live PGlite gateway, LOBU_WORKER_ENTRYPOINT = process.exit(1), real Agent API:

connected → error → agent-error → complete
error: "The worker handling your request stopped unexpectedly before it could reply…"

Validation

Strict tsc --noEmit ✅
EXPLAIN (100k-row runs): all three marker queries index-backed via runs_lobu_claim_idx / runs_idempotency_key_uniq — caught + fixed a discharge seq-scan (status='pending'). ✅
Single-pod E2E (broken worker → terminal error, no hang), ran before & after review fixes. ✅
Multi-replica 2-process E2E (two real RunsQueue + two SseManager + real UnifiedThreadResponseConsumer, shared Postgres): error produced on the non-SSE pod → owner-gate re-queues → SSE-owning pod delivers error,agent-error,complete; the other pod delivers nothing. This is the test fix(server): surface worker failures to chat clients instead of silent complete / hang (#946) #958 could never satisfy. ✅
discharge / sweep / exactly-once verified vs Postgres. ✅
Reviewed twice with GPT-5.5 (codex) on the implementation diff — 5 HIGH findings caught and fixed (arm-before-enqueue, enqueue-before-discharge crash-safety, owner-gate platform||teamId, composite marker key, trackFailedDeployment discharge); final re-review clean. ✅

Accepted tradeoff (documented in code): a mid-window pod crash between enqueueing the real reply and discharging the marker can emit one bounded duplicate error — deliberately favored over re-introducing a hang.

Notes for the reviewer

make review was not run here — its integration tests run DDL against the worktree .env's shared prod DB (per the repo handoff). Run it against a local throwaway DB before merge. The GPT-5.5 diff reviews above are the DB-safe substitute.
Happy-path discharge-on-success is validated at the SQL level + the multi-replica test delivered the success complete cross-pod; a full-stack happy-path E2E needs a real worker (provider key).

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Durable turn-liveness markers with arming/extend/discharge and periodic sweep to ensure stalled turns emit exactly one terminal response.
- Atomic terminal-reply commit that persists terminal replies and discharges liveness markers in one step.
- Owner-based gating for terminal API/SSE deliveries so terminal messages route to the owning pod.
Bug Fixes
- Heartbeat ACKs extend turn deadlines to prevent premature failures.
- Worker crashes reliably fail in-progress turns and avoid duplicate terminal notifications.
Tests
- Added/updated tests validating turn-liveness, single-shot failures, and guardrail error routing.

coderabbitai · 2026-05-20T16:14:26Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 9e487fc4-aa8c-417c-8590-540976f51c40

📥 Commits

Reviewing files that changed from the base of the PR and between bbc0c13 and 95613d5.

📒 Files selected for processing (10)

packages/server/src/__tests__/guardrails-runtime.test.ts
packages/server/src/gateway/__tests__/turn-liveness.test.ts
packages/server/src/gateway/gateway-main.ts
packages/server/src/gateway/gateway/index.ts
packages/server/src/gateway/infrastructure/queue/index.ts
packages/server/src/gateway/infrastructure/queue/types.ts
packages/server/src/gateway/orchestration/impl/embedded-deployment.ts
packages/server/src/gateway/orchestration/message-consumer.ts
packages/server/src/gateway/orchestration/turn-liveness.ts
packages/server/src/gateway/platform/unified-thread-consumer.ts

📝 Walkthrough

Walkthrough

Durable turn-liveness markers are persisted and managed to ensure exactly-once terminal delivery: markers are armed on dispatch, extended by heartbeats, discharged by terminal replies or worker failure, backstopped by a periodic sweep, and integrated across queue options, gateway lifecycle, unified consumer owner-gating, and tests.

Changes

Turn-Liveness Error Backstop Implementation

Layer / File(s)	Summary
Turn-Liveness Core Module: Marker Lifecycle and Failure Paths `packages/server/src/gateway/orchestration/turn-liveness.ts`	Introduces `TurnRouting`, arm/discharge/extend primitives, `failTurnsForDeployment`, `sweepExpiredTurns` with SKIP LOCKED exactly-once election, helpers to atomically enqueue terminal `thread_response` rows, `failTurnIfPending`, `commitTerminalReply`, `pg_notify` notifications, and sweep lifecycle controls.
Queue Infrastructure for Terminal Delivery `packages/server/src/gateway/infrastructure/queue/types.ts`, `packages/server/src/gateway/infrastructure/queue/index.ts`	Defines `TERMINAL_DELIVERY_SEND_OPTS` with terminal retry settings and re-exports `QueueOptions` and terminal options for consumers.
Gateway Startup and Periodic Sweep Scheduling `packages/server/src/gateway/gateway-main.ts`	Starts periodic turn-timeout sweep during `Gateway.start()` after the unified thread_response consumer is ready; stops it during graceful shutdown before stopping consumers and platforms.
Message Consumer: Dispatch Arming and Terminal Error Routing `packages/server/src/gateway/orchestration/message-consumer.ts`	Arms timeout markers before sending work to workers; captures input guardrail trips and emits terminal error `thread_response` rows with `TERMINAL_DELIVERY_SEND_OPTS`; replaces direct startup-failure enqueue with `failTurnIfPending`.
Worker Gateway: Terminal Response Detection and Discharge `packages/server/src/gateway/gateway/index.ts`	Best-effort `extendTurnDeadlines` on received/heartbeat ACKs; classifies terminal responses (explicit `error` or non-empty `processedMessageIds`) and uses `commitTerminalReply` to atomically discharge markers and insert terminal replies; non-terminal responses still enqueue normally.
Embedded Deployment: Worker Crash Detection and Failure Surfacing `packages/server/src/gateway/orchestration/impl/embedded-deployment.ts`	Tracks intentional vs. unexpected worker exits via `intentionalExits`; on spawn errors or unexpected exits calls `failTurnsForDeployment` with `WORKER_DIED_MESSAGE`; `killWorker` marks intentional teardowns and handles already-exited cases.
Unified Thread Consumer: Terminal Owner-Routing Gate `packages/server/src/gateway/platform/unified-thread-consumer.ts`	Owner-gates terminal API/SSE rows by checking SSE session ownership and throwing to force re-queue when the current pod does not own the session; non-terminal rows retain prior behavior.
Tests and Guardrails Update `packages/server/src/gateway/__tests__/turn-liveness.test.ts`, `packages/server/src/__tests__/guardrails-runtime.test.ts`	Adds end-to-end turn-liveness integration tests and updates guardrail runtime unit tests to expect rejection payloads in `thread_response.data.error`.

Sequence Diagram

sequenceDiagram
  participant Dispatcher
  participant Worker
  participant TurnLiveness
  participant DB
  participant ThreadResponseConsumer
  Dispatcher->>TurnLiveness: armTurnTimeout(routing)
  TurnLiveness->>DB: INSERT internal marker (delayed run_at)
  Dispatcher->>Worker: enqueue work (worker queue)
  Worker->>Dispatcher: enrichedResponse (including heartbeat/ACK)
  alt Heartbeat/ACK
    Dispatcher->>TurnLiveness: extendTurnDeadlines(deployment)
    TurnLiveness->>DB: UPDATE run_at for deployment markers
  end
  alt Worker responds terminally (error/processedMessageIds)
    Dispatcher->>TurnLiveness: commitTerminalReply(deployment, messageIds, reply)
    TurnLiveness->>DB: DELETE markers RETURNING -> INSERT terminal thread_response (atomic)
    TurnLiveness->>ThreadResponseConsumer: pg_notify after commit
  else Worker dies / crash detected
    Dispatcher->>TurnLiveness: failTurnsForDeployment(deployment, reason)
    TurnLiveness->>DB: DELETE markers -> INSERT error thread_response rows (atomic)
    TurnLiveness->>ThreadResponseConsumer: pg_notify after commit
  else Periodic sweep fires
    TurnLiveness->>DB: sweepExpiredTurns() (FOR UPDATE SKIP LOCKED + DELETE ... RETURNING)
    TurnLiveness->>DB: INSERT terminal error thread_response rows
    TurnLiveness->>ThreadResponseConsumer: pg_notify after commit
  end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly Related PRs

lobu-ai/lobu#919: Modifies message-consumer.ts and thread_response terminal rejection generation—overlaps with guardrail-to-error routing changes included here.

Poem

🐰 I arm the turns with tiny clocks,

I hop to guard both knocks and knocks,
When workers sleep, I mark and sweep,
I nudge late replies from buried deep,
A cheerful hop to keep threads neat.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: surfacing worker failures to chat clients as terminal errors, with reference to the issue `#946`.
Description check	✅ Passed	The description is comprehensive and well-structured, covering summary, mechanism, reproducer, validation, and reviewer notes. All required template sections are present and thoroughly filled out with detailed technical context.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/fix-worker-failure-multipod

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov-commenter · 2026-05-20T16:16:34Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/gateway/orchestration/message-consumer.ts`:
- Around line 295-317: The catch block after attempting to create/send to
responseQueue ("thread_response") currently swallows enqueue failures
(notifyError) and returns before arming any turn liveness marker; update the
catch to either re-throw notifyError or invoke the same liveness-backed failure
flow used elsewhere (e.g., routeThroughLivenessFailure or the turn-marker arming
routine) before returning so blocked turns still emit a terminal event and are
swept; ensure you handle failures from this.queue.createQueue and
this.queue.send (with TERMINAL_DELIVERY_SEND_OPTS and blockMessage) by routing
them through that liveness-backed handler instead of simply logging and
returning.

In `@packages/server/src/gateway/orchestration/turn-liveness.ts`:
- Around line 317-324: The payload currently may emit platform as undefined
(platform: routing.platform) which violates the explicit-platform requirement;
update the payload construction in turn-liveness.ts so the platform field always
defaults (e.g., platform: routing.platform ?? "api")—mirror the existing teamId
defaulting pattern—so InteractionService events always carry an explicit
platform value and platform-based routing/isolation works correctly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 960c3bb5-61a3-4a74-bfa3-5cbb2ac7f33c

📥 Commits

Reviewing files that changed from the base of the PR and between 9e30b65 and 78e0226.

📒 Files selected for processing (8)

packages/server/src/gateway/gateway-main.ts
packages/server/src/gateway/gateway/index.ts
packages/server/src/gateway/infrastructure/queue/index.ts
packages/server/src/gateway/infrastructure/queue/types.ts
packages/server/src/gateway/orchestration/impl/embedded-deployment.ts
packages/server/src/gateway/orchestration/message-consumer.ts
packages/server/src/gateway/orchestration/turn-liveness.ts
packages/server/src/gateway/platform/unified-thread-consumer.ts

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (3)

packages/server/src/gateway/orchestration/turn-liveness.ts (1)
317-324: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Default platform on the terminal payload.

Line 323 can still enqueue platform: undefined for API turns, which violates the explicit-platform contract for gateway events and makes renderer-side filtering brittle.
Proposed fix
   const payload = {
     messageId: routing.messageId,
     channelId: routing.channelId,
     conversationId: routing.conversationId,
     userId: routing.userId,
     teamId: routing.platform ?? "api",
-    platform: routing.platform,
+    platform: routing.platform ?? "api",
     platformMetadata: routing.platformMetadata,
     error: reason,
     processedMessageIds: [routing.messageId],
     timestamp: Date.now(),
   };
As per coding guidelines, packages/server/src/gateway/**/*.ts: Gateway orchestration and platform isolation: InteractionService events MUST carry an explicit platform field; each platform renderer MUST filter on its own identity (e.g., platform === "telegram"), never reference another platform's identifier.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/orchestration/turn-liveness.ts` around lines 317
- 324, The payload being enqueued may set platform to undefined; update the
payload construction in turn-liveness.ts so the payload.platform uses the
explicit default (use routing.platform ?? "api") just like teamId does, ensuring
InteractionService events always carry an explicit platform value and renderers
can reliably filter by platform.
packages/server/src/gateway/orchestration/message-consumer.ts (2)
295-322: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't complete the guardrail short-circuit when terminal delivery fails.

If thread_response enqueue fails here, this branch logs and returns before any liveness marker exists. That permanently drops the blocked turn: no client error, no retry, and no sweep backstop. Re-throw notifyError or route this branch through the same liveness-backed failure helper before returning.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/orchestration/message-consumer.ts` around lines
295 - 322, The current guardrail short-circuit swallows failures from
creating/sending to responseQueue ("thread_response") by logging and returning;
update the try/catch in the block that uses
this.queue.createQueue(responseQueue) and this.queue.send(...,
TERMINAL_DELIVERY_SEND_OPTS) so that on notifyError you either re-throw the
notifyError or invoke the existing liveness-backed failure helper (the same
helper used by other failure paths) before returning, rather than simply calling
logger.error; ensure the catch includes any relevant context (messageId, userId,
channelId, blockMessage) when delegating/throwing so the blocked turn is retried
or swept as intended.
679-701: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Gate this startup-failure send through the same first-writer-wins path.

By this point the turn is already in thread_message_${deploymentName}. A worker can still win the race and enqueue a terminal reply before this background failure path runs, but this code sends its own terminal error unconditionally and only discharges afterward. That can double-signal the client. Use the atomic liveness failure helper here instead of manual queue.send(...) + dischargeTurn(...).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/gateway/gateway/index.ts`:
- Around line 442-468: The terminal-response enqueue followed by separate
dischargeTurn calls is not crash-atomic: if the process dies after
queue.send("thread_response", enrichedResponse, ...) but before
dischargeTurn(deploymentName, id) runs, a surviving marker can later cause a
duplicate terminal error; fix by making the persist-terminal-row + delete-marker
operation atomic (either implement a transactional helper that writes the
terminal delivery row and removes the marker in one DB transaction inside the
same code path that enqueues the terminal response, or modify the sweep to
detect and suppress markers when a terminal row already exists for the same
(deploymentName, messageId)); locate the enqueue call to queue.send and the
subsequent dischargeTurn loop (using enrichedResponse,
enrichedResponse.processedMessageIds, isTerminalResponse and
TERMINAL_DELIVERY_SEND_OPTS) and either (a) move/delete the marker removal into
the same transactional helper used to persist the thread_response or (b) add a
sweep-side existence check for a terminal row for (deploymentName, messageId)
before emitting a terminal error.

In `@packages/server/src/gateway/orchestration/turn-liveness.ts`:
- Around line 118-129: The try/catch around queue.createQueue and queue.send
currently swallows errors and allows dispatch to continue without the durable
turn-timeout marker; change it so failures abort dispatch instead of logging and
continuing: in the block that calls queue.createQueue(TURN_TIMEOUT_QUEUE) and
queue.send(..., { singletonKey: turnMarkerKey(...) }), remove or alter the catch
to rethrow the error (or return a failure) so the caller will not proceed to
dispatch; this ensures dischargeTurn, failTurnsForDeployment and the periodic
sweep have the marker to elect against.

---

Duplicate comments:
In `@packages/server/src/gateway/orchestration/message-consumer.ts`:
- Around line 295-322: The current guardrail short-circuit swallows failures
from creating/sending to responseQueue ("thread_response") by logging and
returning; update the try/catch in the block that uses
this.queue.createQueue(responseQueue) and this.queue.send(...,
TERMINAL_DELIVERY_SEND_OPTS) so that on notifyError you either re-throw the
notifyError or invoke the existing liveness-backed failure helper (the same
helper used by other failure paths) before returning, rather than simply calling
logger.error; ensure the catch includes any relevant context (messageId, userId,
channelId, blockMessage) when delegating/throwing so the blocked turn is retried
or swept as intended.

In `@packages/server/src/gateway/orchestration/turn-liveness.ts`:
- Around line 317-324: The payload being enqueued may set platform to undefined;
update the payload construction in turn-liveness.ts so the payload.platform uses
the explicit default (use routing.platform ?? "api") just like teamId does,
ensuring InteractionService events always carry an explicit platform value and
renderers can reliably filter by platform.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 833e2580-c4db-4650-91f6-99f85cf8d7d1

📥 Commits

Reviewing files that changed from the base of the PR and between 78e0226 and f0842a9.

📒 Files selected for processing (9)

packages/server/src/__tests__/guardrails-runtime.test.ts
packages/server/src/gateway/gateway-main.ts
packages/server/src/gateway/gateway/index.ts
packages/server/src/gateway/infrastructure/queue/index.ts
packages/server/src/gateway/infrastructure/queue/types.ts
packages/server/src/gateway/orchestration/impl/embedded-deployment.ts
packages/server/src/gateway/orchestration/message-consumer.ts
packages/server/src/gateway/orchestration/turn-liveness.ts
packages/server/src/gateway/platform/unified-thread-consumer.ts

buremba · 2026-05-20T16:50:53Z

Thanks @coderabbitai — all four findings were real; addressed in bbc0c133:

Guardrail swallows enqueue failure — the reject path now re-throws on enqueue failure so the messages run retries (the trip is deterministic, so the reject is re-attempted). No more vanishing blocked turn.
trackFailedDeployment double-signal — replaced the unconditional send + post-hoc discharge with a new failTurnIfPending(deployment, messageId, reason): an atomic DELETE marker … RETURNING + emit-in-same-tx, election-gated. If a still-attached worker raced a real reply (discharging the marker), it no-ops instead of double-signalling.
platform: undefined — terminal payloads now default platform (and teamId) to "api", honoring the explicit-platform rule.
enqueue + discharge not crash-atomic — terminal replies now go through commitTerminalReply(), which deletes the marker(s) and inserts the terminal thread_response row in one transaction. A mid-transaction crash leaves the marker (sweep emits one honest error) instead of producing a duplicate.

Added a committed integration test (turn-liveness.test.ts, 8 cases) covering arm/discharge, the election (win + no-double + worker-raced), atomic commitTerminalReply, the sweep + exactly-once, heartbeat extension, and the globally-unique deploymentName:messageId marker key. Strict packages/server tsc --noEmit clean.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

packages/server/src/gateway/orchestration/impl/embedded-deployment.ts (1)

797-807: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate the error handler to spawn failures only.

Node.js emits the error event not only for spawn failures, but also for failed kill() and send() operations after the child has been spawned. The current handler will wrongly clear intentionalExits and surface a bogus terminal WORKER_DIED_MESSAGE during a deliberate teardown if error fires after a post-spawn kill() call. Track the spawn event and gate the failure path to true pre-spawn failures; let post-spawn failures flow through the exit path instead.

Fix

+    let sawSpawn = false;
+    child.once("spawn", () => {
+      sawSpawn = true;
+    });
+
     child.once("error", (err) => {
+      if (sawSpawn) {
+        logger.error(
+          `Embedded worker ${deploymentName} child-process error after spawn: ${err.message}`
+        );
+        return;
+      }
       logger.error(
         `Embedded worker ${deploymentName} spawn error: ${err.message}`
       );
       this.workers.delete(deploymentName);
       releaseLockOnce();
       // A spawn error is never a deliberate stop. Fail any in-flight turn(s)
       // for this deployment so the client gets a terminal error instead of a
       // hang. No-op if nothing is in flight (markers already discharged).
       this.intentionalExits.delete(deploymentName);
       void failTurnsForDeployment(deploymentName, WORKER_DIED_MESSAGE);
     });

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts` around
lines 797 - 807, The current child.once("error") handler treats all child errors
as pre-spawn spawn failures; change it to only handle true spawn failures by
tracking the child's "spawn" event: add a boolean flag (e.g., spawned = false),
set spawned = true in child.once("spawn", ...) and in the child.once("error",
...) handler only perform the pre-spawn failure path (logger.error with
err.message, this.workers.delete(deploymentName), releaseLockOnce(),
this.intentionalExits.delete(deploymentName), and void
failTurnsForDeployment(deploymentName, WORKER_DIED_MESSAGE)) when spawned is
false; for errors after spawn, let the normal "exit" handling run and avoid
clearing intentionalExits or surfacing WORKER_DIED_MESSAGE.

♻️ Duplicate comments (2)

packages/server/src/gateway/orchestration/turn-liveness.ts (1)
118-129: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't continue when timeout-marker arming fails.

armTurnTimeout() still logs and returns on createQueue()/send() failure, so dispatch can proceed without any durable marker. In that state there is nothing for failTurnIfPending(), failTurnsForDeployment(), or the sweep to elect against, which brings back the hanging-turn behavior this module is meant to remove.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/orchestration/turn-liveness.ts` around lines 118
- 129, The armTurnTimeout() implementation currently swallows errors from
queue.createQueue()/queue.send(), allowing dispatch to continue without a
durable TURN_TIMEOUT_QUEUE marker; change it so failures are not ignored: when
createQueue() or send() throws, log the error as before but then rethrow (or
otherwise propagate) the exception so the caller cannot proceed. Update
armTurnTimeout() to let the error bubble up (instead of returning normally) so
callers like failTurnIfPending() / failTurnsForDeployment() / the sweeper always
have a marker to elect against; reference TURN_TIMEOUT_QUEUE, turnMarkerKey(),
createQueue(), and send() to locate the code to modify.
packages/server/src/gateway/orchestration/message-consumer.ts (1)
295-352: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

The guardrail reject path still fails open on notify errors.

throw notifyError at Line 328 never escapes handleMessage(): the outer catch at Lines 343-352 catches it, logs "proceeding without guardrails", and then continues into marker arming and worker dispatch. If thread_response enqueue fails, a message that already tripped the input guardrail reaches the worker instead of being retried/blocked.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/orchestration/message-consumer.ts` around lines
295 - 352, The guardrail notification failure is still being swallowed by the
outer input-guardrail catch in handleMessage(), so a tripped block can continue
into normal processing. Update the control flow around the notifyError path in
message-consumer.ts so failures from the thread_response send are not caught by
the broader “proceeding without guardrails” handler; either rethrow outside that
catch scope or add a dedicated branch so the message is retried/aborted instead
of falling through to marker arming and worker dispatch.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/gateway/orchestration/turn-liveness.ts`:
- Around line 424-435: The current transaction always calls
insertThreadResponseRow() and notifyThreadResponse() even if none of the TURN
markers were deleted; change the logic in commitTerminalReply()/the tx block to
only emit the terminal row when the transaction actually deleted at least one
pending marker: replace the per-message DELETEs with a DELETE ... RETURNING (or
use tx`DELETE ... RETURNING idempotency_key`) inside the loop or a single
set-based DELETE that RETURNS rows, track whether any rows were returned (i.e.
any marker was removed), and only call insertThreadResponseRow(tx, replyPayload,
organizationId) and the subsequent notifyThreadResponse() if that delete
count/returned set is non-empty.

In `@packages/server/src/gateway/platform/unified-thread-consumer.ts`:
- Around line 189-205: The SSE-owner check in routeToRenderer is incorrectly
applied to Chat SDK rows; update the terminal API-row gate so it skips the SSE
ownership throw when the Chat SDK can handle the message. Concretely, in the
block that computes isApiRow / isTerminal and checks
!this.sseManager.hasActiveConnection(sseKey) before throwing, add a guard that
also allows delivery when chatResponseBridge.canHandle(data) returns true (i.e.,
add && !this.chatResponseBridge?.canHandle(data) to the condition or
equivalent), so Chat SDK-handled rows are not re-queued.

---

Outside diff comments:
In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts`:
- Around line 797-807: The current child.once("error") handler treats all child
errors as pre-spawn spawn failures; change it to only handle true spawn failures
by tracking the child's "spawn" event: add a boolean flag (e.g., spawned =
false), set spawned = true in child.once("spawn", ...) and in the
child.once("error", ...) handler only perform the pre-spawn failure path
(logger.error with err.message, this.workers.delete(deploymentName),
releaseLockOnce(), this.intentionalExits.delete(deploymentName), and void
failTurnsForDeployment(deploymentName, WORKER_DIED_MESSAGE)) when spawned is
false; for errors after spawn, let the normal "exit" handling run and avoid
clearing intentionalExits or surfacing WORKER_DIED_MESSAGE.

---

Duplicate comments:
In `@packages/server/src/gateway/orchestration/message-consumer.ts`:
- Around line 295-352: The guardrail notification failure is still being
swallowed by the outer input-guardrail catch in handleMessage(), so a tripped
block can continue into normal processing. Update the control flow around the
notifyError path in message-consumer.ts so failures from the thread_response
send are not caught by the broader “proceeding without guardrails” handler;
either rethrow outside that catch scope or add a dedicated branch so the message
is retried/aborted instead of falling through to marker arming and worker
dispatch.

In `@packages/server/src/gateway/orchestration/turn-liveness.ts`:
- Around line 118-129: The armTurnTimeout() implementation currently swallows
errors from queue.createQueue()/queue.send(), allowing dispatch to continue
without a durable TURN_TIMEOUT_QUEUE marker; change it so failures are not
ignored: when createQueue() or send() throws, log the error as before but then
rethrow (or otherwise propagate) the exception so the caller cannot proceed.
Update armTurnTimeout() to let the error bubble up (instead of returning
normally) so callers like failTurnIfPending() / failTurnsForDeployment() / the
sweeper always have a marker to elect against; reference TURN_TIMEOUT_QUEUE,
turnMarkerKey(), createQueue(), and send() to locate the code to modify.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 7893bef9-bd7b-4c91-a107-71f7ae246f70

📥 Commits

Reviewing files that changed from the base of the PR and between f0842a9 and bbc0c13.

📒 Files selected for processing (10)

packages/server/src/__tests__/guardrails-runtime.test.ts
packages/server/src/gateway/__tests__/turn-liveness.test.ts
packages/server/src/gateway/gateway-main.ts
packages/server/src/gateway/gateway/index.ts
packages/server/src/gateway/infrastructure/queue/index.ts
packages/server/src/gateway/infrastructure/queue/types.ts
packages/server/src/gateway/orchestration/impl/embedded-deployment.ts
packages/server/src/gateway/orchestration/message-consumer.ts
packages/server/src/gateway/orchestration/turn-liveness.ts
packages/server/src/gateway/platform/unified-thread-consumer.ts

✅ Files skipped from review due to trivial changes (1)

packages/server/src/gateway/tests/turn-liveness.test.ts

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (2)

packages/server/src/gateway/orchestration/message-consumer.ts (1)
295-352: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guardrail notification failures still fail open here.

The throw notifyError at Line 328 is immediately caught by the outer catch at Line 343, which logs and continues with normal dispatch. That turns a thread_response outage into a guardrail bypass instead of a retry. Narrow the outer catch to just guardrail evaluation/metadata work, or explicitly rethrow notification failures past it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/orchestration/message-consumer.ts` around lines
295 - 352, The inner catch around the queue.send to "thread_response" currently
throws notifyError but that gets swallowed by the outer catch (the guardrail
evaluation block), turning enqueue failures into a fail-open; fix by surfacing
notification failures past the outer catch: declare an outer-scope variable
(e.g., let notifyErrorOutside: unknown = undefined) before the guardrail try,
assign notifyErrorOutside = notifyError in the inner catch instead of directly
throwing, and after the outer try/catch (or at the end of the guardrail
handling) if notifyErrorOutside is set rethrow it so the messages handler can
retry; reference the inner catch variable notifyError, the queue.send to
responseQueue (TERMINAL_DELIVERY_SEND_OPTS), and the outer catch that currently
logs and continues.
packages/server/src/gateway/platform/unified-thread-consumer.ts (1)
189-205: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip the SSE owner gate for Chat SDK-deliverable terminal rows.

routeToRenderer() is also used by the Chat SDK fast path at Lines 97-100. For a terminal API row that chatResponseBridge.canHandle(data) can already deliver, this block still throws when there is no matching SSE session on the pod, so the reply gets re-queued instead of sent.
💡 Localized fix
       if (
         isTerminal &&
         sseKey &&
-        !this.sseManager.hasActiveConnection(sseKey)
+        !this.sseManager.hasActiveConnection(sseKey) &&
+        !this.chatResponseBridge?.canHandle(data)
       ) {
         throw new Error(
           `API SSE session ${sseKey} not owned by this gateway instance; re-queueing for owner delivery`
         );
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/platform/unified-thread-consumer.ts` around lines
189 - 205, The SSE-owner throw currently blocks terminal API rows even when the
Chat SDK fast path can deliver them; update the terminal-row check around
isTerminal/sseKey so it only throws when the Chat SDK cannot already handle
delivery (i.e., add a guard using chatResponseBridge.canHandle(data)).
Concretely, modify the if that checks isTerminal && sseKey &&
!this.sseManager.hasActiveConnection(sseKey) to also require
!chatResponseBridge.canHandle(data) (or equivalent), referencing the existing
isTerminal, sseKey, this.sseManager.hasActiveConnection, and
chatResponseBridge.canHandle symbols so SDK-deliverable terminal rows skip the
SSE owner gate.

🧹 Nitpick comments (3)

packages/server/src/gateway/infrastructure/queue/types.ts (1)
44-47: ⚡ Quick win

Rename this exported constant to camelCase.

TERMINAL_DELIVERY_SEND_OPTS adds a non-standard public API name that every consumer now has to mirror. Prefer terminalDeliverySendOpts and update the re-export/import sites with it. As per coding guidelines, Use camelCase for variable names in TypeScript/JavaScript.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/infrastructure/queue/types.ts` around lines 44 -
47, Rename the exported constant TERMINAL_DELIVERY_SEND_OPTS to camelCase
(terminalDeliverySendOpts) and update its declaration (export const
terminalDeliverySendOpts: QueueOptions = { retryLimit: 30, retryDelay: 1 }) and
all places that import or re-export TERMINAL_DELIVERY_SEND_OPTS to use
terminalDeliverySendOpts instead; ensure any re-export barrels or named imports
are updated so consumers use the new camelCase public API and run tests/TS build
to catch remaining references.
packages/server/src/__tests__/guardrails-runtime.test.ts (1)
500-514: ⚡ Quick win

Assert the terminal send options here too.

This now proves the payload uses error, but it still won't catch someone dropping TERMINAL_DELIVERY_SEND_OPTS from the thread_response enqueue. Capturing the third send() arg and asserting it matches the terminal delivery opts would lock down the full contract this PR depends on.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/__tests__/guardrails-runtime.test.ts` around lines 500 -
514, The test currently checks threadResponses for a single rejection payload
but doesn't assert the terminal delivery options; update the assertions to also
capture the third send() argument for the thread_response enqueue and assert it
equals TERMINAL_DELIVERY_SEND_OPTS. Locate the threadResponses array derived
from sentToQueue, extract the send-args (third argument) from the matching
sentToQueue entry for queue === 'thread_response' (or from threadResponses[0] if
it contains send args), and add an
expect(...).toEqual(TERMINAL_DELIVERY_SEND_OPTS) alongside the existing
error/content assertions to lock down the delivery contract.
packages/server/src/gateway/__tests__/turn-liveness.test.ts (1)
1-33: ⚡ Quick win

Move this suite under packages/server/src/__tests__/.

This new file lives under packages/server/src/gateway/__tests__/, which breaks the repo's test layout convention and makes package-level test discovery less consistent. Please move it into the package src/__tests__/ tree instead. As per coding guidelines, Test files follow package structure: packages/*/src/__tests__/.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/__tests__/turn-liveness.test.ts` around lines 1 -
33, The test file turn-liveness.test.ts currently living under the gateway tests
needs to be moved into the package's src/__tests__ tree to match project
conventions; relocate the file into the package-level src/__tests__ directory,
update any relative imports used in that file (references such as getDb,
RunsQueue, armTurnTimeout, commitTerminalReply, dischargeTurn,
extendTurnDeadlines, failTurnIfPending, failTurnsForDeployment,
sweepExpiredTurns and helpers like
ensurePgliteForGatewayTests/resetTestDatabase) so they resolve from the new
location, and run the test discovery to confirm the suite is picked up under the
correct package test tree.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/gateway/gateway/index.ts`:
- Around line 413-416: The fire-and-forget call to
extendTurnDeadlines(deploymentName) can reject and cause an unhandled promise
rejection; change the call site to attach a .catch(...) handler that logs the
error (including deploymentName and context) via the existing logger so failures
are observed; ensure you do not block the handler (keep it best-effort) but log
the rejection and any relevant metadata to aid debugging.

In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts`:
- Around line 803-807: The async calls to failTurnsForDeployment are currently
fire-and-forget (using void) so any rejection is unobserved; update both
crash/exit sites where intentionalExits.delete(deploymentName); void
failTurnsForDeployment(deploymentName, WORKER_DIED_MESSAGE); is used to attach a
.catch(...) that logs the error (include deploymentName and WORKER_DIED_MESSAGE
context) so rejections from the liveness/DB work are not swallowed; search for
the same pattern (the other site uses the same intentionalExits.delete + void
failTurnsForDeployment(...) call) and add identical .catch logging there.

In `@packages/server/src/gateway/orchestration/turn-liveness.ts`:
- Around line 219-223: The current logic deletes the turn marker before
validating action_input via asTurnRouting, which causes malformed rows to be
permanently removed; change the flow so you parse/validate with
asTurnRouting(row.action_input) first and only delete the marker if parsing
succeeds, otherwise abort the transaction (or keep the marker pending and emit a
fallback terminal error) to avoid committing a delete on malformed rows. Apply
this same rollback-first pattern to the sweep and single-turn helper paths that
use asTurnRouting (the other occurrences you noted at the 282-287 and 393-395
sites) so all marker-deletes are conditional on successful asTurnRouting
parsing.

---

Duplicate comments:
In `@packages/server/src/gateway/orchestration/message-consumer.ts`:
- Around line 295-352: The inner catch around the queue.send to
"thread_response" currently throws notifyError but that gets swallowed by the
outer catch (the guardrail evaluation block), turning enqueue failures into a
fail-open; fix by surfacing notification failures past the outer catch: declare
an outer-scope variable (e.g., let notifyErrorOutside: unknown = undefined)
before the guardrail try, assign notifyErrorOutside = notifyError in the inner
catch instead of directly throwing, and after the outer try/catch (or at the end
of the guardrail handling) if notifyErrorOutside is set rethrow it so the
messages handler can retry; reference the inner catch variable notifyError, the
queue.send to responseQueue (TERMINAL_DELIVERY_SEND_OPTS), and the outer catch
that currently logs and continues.

In `@packages/server/src/gateway/platform/unified-thread-consumer.ts`:
- Around line 189-205: The SSE-owner throw currently blocks terminal API rows
even when the Chat SDK fast path can deliver them; update the terminal-row check
around isTerminal/sseKey so it only throws when the Chat SDK cannot already
handle delivery (i.e., add a guard using chatResponseBridge.canHandle(data)).
Concretely, modify the if that checks isTerminal && sseKey &&
!this.sseManager.hasActiveConnection(sseKey) to also require
!chatResponseBridge.canHandle(data) (or equivalent), referencing the existing
isTerminal, sseKey, this.sseManager.hasActiveConnection, and
chatResponseBridge.canHandle symbols so SDK-deliverable terminal rows skip the
SSE owner gate.

---

Nitpick comments:
In `@packages/server/src/__tests__/guardrails-runtime.test.ts`:
- Around line 500-514: The test currently checks threadResponses for a single
rejection payload but doesn't assert the terminal delivery options; update the
assertions to also capture the third send() argument for the thread_response
enqueue and assert it equals TERMINAL_DELIVERY_SEND_OPTS. Locate the
threadResponses array derived from sentToQueue, extract the send-args (third
argument) from the matching sentToQueue entry for queue === 'thread_response'
(or from threadResponses[0] if it contains send args), and add an
expect(...).toEqual(TERMINAL_DELIVERY_SEND_OPTS) alongside the existing
error/content assertions to lock down the delivery contract.

In `@packages/server/src/gateway/__tests__/turn-liveness.test.ts`:
- Around line 1-33: The test file turn-liveness.test.ts currently living under
the gateway tests needs to be moved into the package's src/__tests__ tree to
match project conventions; relocate the file into the package-level
src/__tests__ directory, update any relative imports used in that file
(references such as getDb, RunsQueue, armTurnTimeout, commitTerminalReply,
dischargeTurn, extendTurnDeadlines, failTurnIfPending, failTurnsForDeployment,
sweepExpiredTurns and helpers like
ensurePgliteForGatewayTests/resetTestDatabase) so they resolve from the new
location, and run the test discovery to confirm the suite is picked up under the
correct package test tree.

In `@packages/server/src/gateway/infrastructure/queue/types.ts`:
- Around line 44-47: Rename the exported constant TERMINAL_DELIVERY_SEND_OPTS to
camelCase (terminalDeliverySendOpts) and update its declaration (export const
terminalDeliverySendOpts: QueueOptions = { retryLimit: 30, retryDelay: 1 }) and
all places that import or re-export TERMINAL_DELIVERY_SEND_OPTS to use
terminalDeliverySendOpts instead; ensure any re-export barrels or named imports
are updated so consumers use the new camelCase public API and run tests/TS build
to catch remaining references.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 6e1164f2-bf80-4b02-90bc-8a9d4d0d163c

📥 Commits

Reviewing files that changed from the base of the PR and between 9e30b65 and bbc0c13.

📒 Files selected for processing (10)

packages/server/src/__tests__/guardrails-runtime.test.ts
packages/server/src/gateway/__tests__/turn-liveness.test.ts
packages/server/src/gateway/gateway-main.ts
packages/server/src/gateway/gateway/index.ts
packages/server/src/gateway/infrastructure/queue/index.ts
packages/server/src/gateway/infrastructure/queue/types.ts
packages/server/src/gateway/orchestration/impl/embedded-deployment.ts
packages/server/src/gateway/orchestration/message-consumer.ts
packages/server/src/gateway/orchestration/turn-liveness.ts
packages/server/src/gateway/platform/unified-thread-consumer.ts

coderabbitai · 2026-05-20T17:03:22Z

+        // A worker ACK (delivery receipt or heartbeat) is a worker-driven
+        // liveness signal — push the turn-liveness deadline forward so a live
+        // but slow worker is never falsely failed by the sweep. Best-effort.
+        void extendTurnDeadlines(deploymentName);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't fire-and-forget extendTurnDeadlines() without handling rejection.

This is a DB write on the heartbeat path. If it rejects, the handler still returns 200 while the promise goes unobserved and can surface as an unhandled rejection. Add a .catch() here and log the failure.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/gateway/index.ts` around lines 413 - 416, The fire-and-forget call to extendTurnDeadlines(deploymentName) can reject and cause an unhandled promise rejection; change the call site to attach a .catch(...) handler that logs the error (including deploymentName and context) via the existing logger so failures are observed; ensure you do not block the handler (keep it best-effort) but log the rejection and any relevant metadata to aid debugging.

coderabbitai · 2026-05-20T17:03:22Z

+      // A spawn error is never a deliberate stop. Fail any in-flight turn(s)
+      // for this deployment so the client gets a terminal error instead of a
+      // hang. No-op if nothing is in flight (markers already discharged).
+      this.intentionalExits.delete(deploymentName);
+      void failTurnsForDeployment(deploymentName, WORKER_DIED_MESSAGE);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle rejections from these crash-path failTurnsForDeployment() calls.

Both sites launch async DB work with void. If the liveness write rejects during a worker crash/exit, that rejection is unobserved and the client-facing terminal failure can be lost. Attach .catch() and log at both call sites.

Also applies to: 847-848

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts` around lines 803 - 807, The async calls to failTurnsForDeployment are currently fire-and-forget (using void) so any rejection is unobserved; update both crash/exit sites where intentionalExits.delete(deploymentName); void failTurnsForDeployment(deploymentName, WORKER_DIED_MESSAGE); is used to attach a .catch(...) that logs the error (include deploymentName and WORKER_DIED_MESSAGE context) so rejections from the liveness/DB work are not swallowed; search for the same pattern (the other site uses the same intentionalExits.delete + void failTurnsForDeployment(...) call) and add identical .catch logging there.

coderabbitai · 2026-05-20T17:03:22Z

+        const routing = asTurnRouting(row.action_input);
+        if (!routing) {
+          logger.warn("Skipping malformed turn-timeout marker (fast path)");
+          continue;
+        }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't commit marker deletes for malformed routing rows.

These paths delete the marker first and only then validate action_input. If parsing fails, the transaction still commits with no terminal row, permanently dropping the only durable obligation and reintroducing the hang this module is meant to prevent. Roll back on malformed rows, or keep the marker pending until you can emit a fallback terminal error.

Suggested pattern

const routing = asTurnRouting(row.action_input); if (!routing) { - logger.warn("Skipping malformed turn-timeout marker (fast path)"); - continue; + logger.error("Malformed turn-timeout marker; rolling back so another pass can recover it"); + throw new Error("Malformed turn-timeout marker"); }

Apply the same rollback-on-parse-failure behavior in the sweep and single-turn helper as well.

Also applies to: 282-287, 393-395

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/orchestration/turn-liveness.ts` around lines 219 - 223, The current logic deletes the turn marker before validating action_input via asTurnRouting, which causes malformed rows to be permanently removed; change the flow so you parse/validate with asTurnRouting(row.action_input) first and only delete the marker if parsing succeeds, otherwise abort the transaction (or keep the marker pending and emit a fallback terminal error) to avoid committing a delete on malformed rows. Apply this same rollback-first pattern to the sweep and single-turn helper paths that use asTurnRouting (the other occurrences you noted at the 282-287 and 393-395 sites) so all marker-deletes are conditional on successful asTurnRouting parsing.

buremba · 2026-05-20T17:29:20Z

bug_free 72, simplicity 74, slop 25, bugs 1, 0 blockers

Read diff/logs; skipped extra boot probe. Typecheck passed and relevant guardrails-runtime/turn-liveness tests passed. [env] unit failed on connector-sdk dist export in CLI tests; integration had unrelated public-origin/MCP/sandbox/etc failures in untouched files.

Suggested fixes

File	Line	Change
`packages/server/src/gateway/orchestration/message-consumer.ts`	343	Do not let the guardrail fail-open catch swallow block-message enqueue failures after a trip; move the enqueue/return path outside the guardrail store/runner try-catch, or rethrow from this catch when the error came from notifying a tripped guardrail so the message job retries instead of dispatching the blocked input.

Full verdict JSON

{
  "bug_free_confidence": 72,
  "bugs": 1,
  "slop": 25,
  "simplicity": 74,
  "blockers": [],
  "change_type": "fix",
  "behavior_change_risk": "high",
  "tests_adequate": false,
  "suggested_fixes": [
    {
      "file": "packages/server/src/gateway/orchestration/message-consumer.ts",
      "line": 343,
      "change": "Do not let the guardrail fail-open catch swallow block-message enqueue failures after a trip; move the enqueue/return path outside the guardrail store/runner try-catch, or rethrow from this catch when the error came from notifying a tripped guardrail so the message job retries instead of dispatching the blocked input."
    }
  ],
  "notes": "Read diff/logs; skipped extra boot probe. Typecheck passed and relevant guardrails-runtime/turn-liveness tests passed. [env] unit failed on connector-sdk dist export in CLI tests; integration had unrelated public-origin/MCP/sandbox/etc failures in untouched files.",
  "categories": {
    "src": 724,
    "tests": 202,
    "docs": 0,
    "config": 0,
    "deps": 0,
    "migrations": 0,
    "ci": 0,
    "generated": 0
  }
}

Local review gate — branch protection can require the pi-review commit status. See docs/REVIEW_SCHEMA.md.

…ors (#946) A worker that fails to produce a reply (crash, hang, or pod death) left the client's SSE/CLI hanging forever or returning a silent `complete` with no content. Three defects, one theme: a failed turn must produce a terminal event the client renders, correct under N>1 k8s replicas. - Turn-liveness election (turn-liveness.ts): a passive `runs` marker on the consumer-less `internal:turn_timeout` queue records that a turn owes a terminal event. DELETE...RETURNING is a first-writer-wins election; the deleter emits the terminal error in the same transaction (atomic, crash-safe). Fast path: child exit/error fails the deployment's in-flight turns instantly. Backstop: a periodic SKIP-LOCKED sweep fails lapsed turns (hung worker / pod death). Deadline pushed forward by the 20s worker heartbeat. - Owner-gated terminal delivery (unified-thread-consumer.ts): terminal API rows (success completion AND error) re-queue unless this pod holds the client's SSE, so a row produced on one replica reaches the SSE-owning replica. Fixes the pre-existing cross-pod success-completion drop too. - Render via `error` (message-consumer.ts): trackFailedDeployment and the input-guardrail reject moved off the dropped `content` field onto `error`, which renders end-to-end (SSE error event + CLI exit 1; platforms post Error:). Validated: live PGlite gateway with a worker that exits 1 -> connected/error/ agent-error/complete (was: hang). Discharge/sweep/exactly-once verified against Postgres. All claim queries index-backed via runs_lobu_claim_idx (EXPLAIN).

buremba · 2026-05-20T17:35:39Z

Round 2 (95613d58) — triaged all five:

Fixed (real):

armTurnTimeout continuing after marker-arm failure — now fail-closed: it throws, so dispatch aborts and the messages run retries the whole (idempotent) turn rather than dispatching an unprotected one.
commitTerminalReply inserting on 0-marker delete — now uses DELETE … RETURNING and only inserts the reply if it won ≥1 pending marker. A late worker reply after the sweep deletes 0 markers → dropped, no double-signal. Added a test for exactly this.
Owner-gate vs Chat SDK rows — gate now excludes rows with platformMetadata.connectionId (Chat SDK rows already owner-route via canHandle), so a ready-to-deliver reply isn't re-queued.

Also fixed (your earlier comment + pi caught the same): the guardrail-trip rejection delivery is now outside the fail-open try-catch. Previously a re-throw was caught by the "proceeding without guardrails" handler and fell through to dispatch the blocked input. Now the trip is captured and delivered after the try; a delivery failure propagates to retry, never to dispatch.

Not changed (with reasoning):
4. void extendTurnDeadlines() / void failTurnsForDeployment() need .catch() — both functions are fully wrapped in try/catch internally and return (never reject), so void cannot produce an unhandled rejection. Adding .catch() would be dead code.
5. "Don't delete markers for malformed routing rows" — asTurnRouting only rejects rows missing messageId/deploymentName, which are unroutable (can't address a terminal event), so dropping them is correct; keeping them would re-select the same expired row every sweep forever. Kept the delete, upgraded the log to error so genuine schema drift is noticed.

Added 2 tests (now 9 in turn-liveness.test.ts); strict packages/server tsc clean.

…ABASE_URL guard; trim apple_photos comment - turn-liveness.test.ts (added by #971) imported ensurePgliteForGatewayTests, which this PR's PGlite removal renamed to ensureDbForGatewayTests. The merge produced no textual conflict but a broken import that failed the gateway suite. Point it at the renamed helper. Validated: 9/9 pass vs Postgres. - review.sh: `unset DATABASE_URL` (run tests on isolated embedded PG) left a later `DATABASE_URL="$DATABASE_URL"` reference that tripped `set -u` (unbound variable), so pi never produced a verdict. Default to empty. - apple_photos.ts: trim the 13-line geo-tier TODO to a 4-line note (pi slop).

…; earthdistance geo (#965) * feat(server)!: replace PGlite with embedded Postgres; bundle pgvector; earthdistance geo Local `lobu run` and the test backend now run a real embedded PostgreSQL 18 (embedded-postgres) instead of PGlite, with pgvector injected from the new @lobu/pgvector-embedded package. This removes the entire dual-backend: the LOBU_DISABLE_PREPARE single-connection pool pin, the conversation-lock no-op, the PGlite test harness, and the @electric-sql/pglite* deps are all gone — one engine everywhere. - DATABASE_URL is the sole backend selector: postgres:// = external; a path (or file:) = embedded PG with its cluster at <path>/.lobu/pgdata. - geo_lookup reimplemented on core-contrib cube+earthdistance (no PostGIS, works on every backend); restores the geo schema lost in the migration squash. - pgvector ships as prebuilt per-platform artifacts; a CI workflow rebuilds them, and only the control + pinned install SQL are vendored. - Test backend auto-spawns an ephemeral embedded PG when DATABASE_URL is unset, so `make test` needs no external database. - review.sh now unsets DATABASE_URL so the suite runs against an isolated embedded PG instead of whatever (often shared/prod) DB .env points at. BREAKING CHANGE: the local `lobu run` / test database engine is now a real embedded PostgreSQL instead of PGlite. Existing ~/.lobu PGlite data dirs are not migrated — a fresh embedded PG cluster is created. Production (external Postgres via DATABASE_URL) is unchanged. * feat(cli): lobu init asks embedded vs external Postgres; write DATABASE_URL `lobu init` now prompts for the database backend and writes an explicit DATABASE_URL into the scaffolded .env: - "Local embedded Postgres" (default) → DATABASE_URL=file://. — an isolated per-project cluster at ./.lobu/pgdata (gitignored). - "Connect to an existing Postgres" → prompts for the postgres:// URL. Both already work at runtime (the DATABASE_URL path/URL routing); init is just the front-door UX. Modernized .env.tmpl off the PGlite/LOBU_DATA_DIR model onto the file:// vs postgres:// scheme. * refactor(server): consolidate the two server entrypoints into one server.ts is now the single entry for both backends, branching on DATABASE_URL: postgres:// connects to an external Postgres; a path/file:// spawns a local embedded Postgres. The embedded boot lives in the new embedded-runtime.ts and is loaded ONLY via `await import(...)` in the embedded branch, so the external/prod path never resolves or loads the embedded-postgres binary even though it sits in node_modules. - Deleted start-local.ts; build one bundle (server.bundle.mjs) instead of two. - dev.ts drops bundle-switching (one bundle, self-selects on DATABASE_URL) and maps LOBU_DATA_DIR → an embedded DATABASE_URL, so the Mac app (which spawns `lobu run` with LOBU_DATA_DIR) keeps working with no Swift change. - dev:local + e2e-lobu-apply.sh point at server.ts. Validated both modes from the one bundle: embedded (file://) spawns PG, runs migrations, all 4 extensions live; external (postgres://) connects, passes the schema-version check, and never triggers the lazy embedded-postgres import (0 spawn lines). typecheck 0, gateway tests 24/24. * refactor: remove LOBU_DATA_DIR — DATABASE_URL is the single backend selector LOBU_DATA_DIR was a second, redundant way to point at the embedded DB's data location. It's now purely the embedded-PG root (nothing else reads it — the old bootstrap-pat/ephemeral-key persistence claims were stale), so it folds entirely into DATABASE_URL=file://<dir>. - embedded-runtime.ts requires DATABASE_URL (no LOBU_DATA_DIR fallback). - user-config `dataDir` and the CLI `--data-dir` now map to DATABASE_URL=file://. - dev.ts drops the LOBU_DATA_DIR fallback; dev-native.sh + e2e-lobu-apply.sh set DATABASE_URL=file://<dir>. Validated: embedded boots from DATABASE_URL=file:// alone (cluster at <dir>/.lobu/pgdata), typecheck 0. (Mac app LocalLobuRunner switch to DATABASE_URL=file:// lands via the owletto submodule bump.) * chore: bump owletto pointer — Mac runner uses DATABASE_URL=file:// (drops LOBU_DATA_DIR) * chore: bump packages/owletto pointer for Keychain ENCRYPTION_KEY persistence (0c54927) * fix(pgvector-embedded): stage artifacts from build dir + build portable binaries The build matrix never produced non-darwin artifacts: - build.sh copied vector.control / vector--<v>.sql from the OS Postgres SHAREDIR, but the workflow only runs `make` (not `make install`), so those files never land there — every fresh-PG cell died on `cp: cannot stat '.../extension/vector.control'`. Stage straight from the built pgvector tree instead (vector.control is in-repo, the SQL is make-generated). Validated locally: reproduces the committed darwin-arm64 artifact byte-for-bit. - Pass OPTFLAGS="" so the redistributed binary drops pgvector's default `-march=native`; otherwise it bakes in the CI runner's microarch and SIGILLs on older user CPUs. - Linux install used a hand-rolled curl|gpg --dearmor pipe that flaked with "gpg: cannot open '/dev/tty'"; switch to the official pgdg apt.postgresql.org.sh setup script. * refactor(server): drop simpleQuery identity wrapper simpleQuery(query) just returned query — a no-op left over from the removed PGlite `.simple()` prepared-statement path. Unwrapped all 30 call sites (resolve_path.ts, multi-tenant.ts) and removed the export. * chore(pgvector-embedded): add linux-x64 + linux-arm64 prebuilt artifacts Built by build-pgvector-embedded.yml (pgvector v0.8.1, PG18, OPTFLAGS="" for portability). darwin-x64 follows once the macos-13 cell clears the runner queue; darwin-arm64 already committed. * fix(docker): wire @lobu/pgvector-embedded into the build, prune embedded PG from runtime server now depends on @lobu/pgvector-embedded (workspace:*), but the image never copied or stubbed it — bun install + the builder tsc would have failed the first time this hit main (build-images runs only on main). COPY its manifest + source and build its dist before the server type-check. Then prune embedded-postgres + @embedded-postgres/* (~145MB platform binary) from node_modules after the bundle: prod always uses an external DATABASE_URL, so the embedded path (the only loader, via await import) never runs. Validated with a full --target builder build: install + dist + tsc + bundle succeed and the runtime image no longer carries the 145MB binary. * refactor(server): drop dead applyDevProjectPathDefault, fix start-local.ts comment drift applyDevProjectPathDefault had zero callers (the LOBU_DEV_PROJECT_PATH fallback is inlined in server.ts main()). Removed it, and corrected the server-lifecycle.ts / dev-vite.ts headers that still described a two-entry-point design — start-local.ts was deleted in this PR. * fix(ci): build @lobu/pgvector-embedded before the integration suite The integration job (vitest under Node) failed with ERR_RESOLVE_PACKAGE_ENTRY_FAIL: the test backend imports @lobu/pgvector-embedded, so vite resolves its `import` condition (./dist/index.js) at transform time — but CI never built that package's dist (it's a new workspace dep in this PR). Build it alongside the other workspace packages, same as the Docker builder fix. Validated: full integration suite vs pg16 with dist freshly built → 0 resolve errors, 287/294 pass (the 5 fails are isolated-vm prebuild gaps local to the dev Mac; they load under CI's pinned Node 22). Also fix a stale comment: the test backend mirrors src/embedded-runtime.ts, not the deleted src/start-local.ts. * fix(ci): build darwin-x64 pgvector on macos-15-intel, not retired macos-13 GitHub retired the macos-13 (Intel) hosted runner in Dec 2025, so the darwin-x64 matrix cell queued indefinitely (no runner ever assigned). macos-15-intel is GitHub's current x86_64 macOS runner (Intel on macOS 15), the supported replacement until x86_64 macOS is dropped in 2027. * chore(pgvector-embedded): add darwin-x64 prebuilt artifact Built on macos-15-intel (x86_64 Mach-O bundle), completing all four platforms: darwin-arm64, darwin-x64, linux-x64, linux-arm64. * fix: turn-liveness test uses renamed gateway DB helper; review.sh DATABASE_URL guard; trim apple_photos comment - turn-liveness.test.ts (added by #971) imported ensurePgliteForGatewayTests, which this PR's PGlite removal renamed to ensureDbForGatewayTests. The merge produced no textual conflict but a broken import that failed the gateway suite. Point it at the renamed helper. Validated: 9/9 pass vs Postgres. - review.sh: `unset DATABASE_URL` (run tests on isolated embedded PG) left a later `DATABASE_URL="$DATABASE_URL"` reference that tripped `set -u` (unbound variable), so pi never produced a verdict. Default to empty. - apple_photos.ts: trim the 13-line geo-tier TODO to a 4-line note (pi slop). * docs(server): correct entry-point comment — embedded-runtime is statically imported The header said embedded-runtime loads via await import; it's actually a static import (the heavy embedded-postgres + pgvector injector are what load lazily, inside that module). Matches the dynamic-import allow-list in AGENTS.md.

buremba marked this pull request as ready for review May 20, 2026 16:17

buremba force-pushed the feat/fix-worker-failure-multipod branch from 036b2ff to 78e0226 Compare May 20, 2026 16:20

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread packages/server/src/gateway/orchestration/message-consumer.ts Outdated

Comment thread packages/server/src/gateway/orchestration/message-consumer.ts Outdated

Comment thread packages/server/src/gateway/orchestration/turn-liveness.ts Outdated

buremba force-pushed the feat/fix-worker-failure-multipod branch from 78e0226 to f0842a9 Compare May 20, 2026 16:31

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread packages/server/src/gateway/gateway/index.ts Outdated

Comment thread packages/server/src/gateway/orchestration/turn-liveness.ts Outdated

buremba force-pushed the feat/fix-worker-failure-multipod branch from f0842a9 to bbc0c13 Compare May 20, 2026 16:50

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread packages/server/src/gateway/orchestration/turn-liveness.ts Outdated

Comment thread packages/server/src/gateway/platform/unified-thread-consumer.ts Outdated

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

buremba force-pushed the feat/fix-worker-failure-multipod branch from bbc0c13 to 95613d5 Compare May 20, 2026 17:35

buremba merged commit c8553c1 into main May 20, 2026
17 checks passed

buremba deleted the feat/fix-worker-failure-multipod branch May 20, 2026 17:40

buremba mentioned this pull request May 20, 2026

chore(main): release lobu 9.0.0 #939

Merged

buremba mentioned this pull request May 20, 2026

ci: build pgvector-embedded before vitest security tests #986

Merged

coderabbitai Bot mentioned this pull request May 26, 2026

fix(server): orchestration cleanup — dead dischargeTurn + cross-pod spawn gate #1068

Merged

Conversation

buremba commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Mechanism — turn-liveness election (no new schema)

Reproducer (red → green)

Validation

Notes for the reviewer

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated Code Review Effort

Possibly Related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

codecov-commenter commented May 20, 2026

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

buremba commented May 20, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

buremba commented May 20, 2026

Suggested fixes

Uh oh!

buremba commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

buremba commented May 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading