fix(gateway): surface worker failures to chat clients as terminal errors (#946)#971
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (10)
📝 WalkthroughWalkthroughDurable turn-liveness markers are persisted and managed to ensure exactly-once terminal delivery: markers are armed on dispatch, extended by heartbeats, discharged by terminal replies or worker failure, backstopped by a periodic sweep, and integrated across queue options, gateway lifecycle, unified consumer owner-gating, and tests. ChangesTurn-Liveness Error Backstop Implementation
Sequence DiagramsequenceDiagram
participant Dispatcher
participant Worker
participant TurnLiveness
participant DB
participant ThreadResponseConsumer
Dispatcher->>TurnLiveness: armTurnTimeout(routing)
TurnLiveness->>DB: INSERT internal marker (delayed run_at)
Dispatcher->>Worker: enqueue work (worker queue)
Worker->>Dispatcher: enrichedResponse (including heartbeat/ACK)
alt Heartbeat/ACK
Dispatcher->>TurnLiveness: extendTurnDeadlines(deployment)
TurnLiveness->>DB: UPDATE run_at for deployment markers
end
alt Worker responds terminally (error/processedMessageIds)
Dispatcher->>TurnLiveness: commitTerminalReply(deployment, messageIds, reply)
TurnLiveness->>DB: DELETE markers RETURNING -> INSERT terminal thread_response (atomic)
TurnLiveness->>ThreadResponseConsumer: pg_notify after commit
else Worker dies / crash detected
Dispatcher->>TurnLiveness: failTurnsForDeployment(deployment, reason)
TurnLiveness->>DB: DELETE markers -> INSERT error thread_response rows (atomic)
TurnLiveness->>ThreadResponseConsumer: pg_notify after commit
else Periodic sweep fires
TurnLiveness->>DB: sweepExpiredTurns() (FOR UPDATE SKIP LOCKED + DELETE ... RETURNING)
TurnLiveness->>DB: INSERT terminal error thread_response rows
TurnLiveness->>ThreadResponseConsumer: pg_notify after commit
end
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly Related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
036b2ff to
78e0226
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/gateway/orchestration/message-consumer.ts`:
- Around line 295-317: The catch block after attempting to create/send to
responseQueue ("thread_response") currently swallows enqueue failures
(notifyError) and returns before arming any turn liveness marker; update the
catch to either re-throw notifyError or invoke the same liveness-backed failure
flow used elsewhere (e.g., routeThroughLivenessFailure or the turn-marker arming
routine) before returning so blocked turns still emit a terminal event and are
swept; ensure you handle failures from this.queue.createQueue and
this.queue.send (with TERMINAL_DELIVERY_SEND_OPTS and blockMessage) by routing
them through that liveness-backed handler instead of simply logging and
returning.
In `@packages/server/src/gateway/orchestration/turn-liveness.ts`:
- Around line 317-324: The payload currently may emit platform as undefined
(platform: routing.platform) which violates the explicit-platform requirement;
update the payload construction in turn-liveness.ts so the platform field always
defaults (e.g., platform: routing.platform ?? "api")—mirror the existing teamId
defaulting pattern—so InteractionService events always carry an explicit
platform value and platform-based routing/isolation works correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 960c3bb5-61a3-4a74-bfa3-5cbb2ac7f33c
📒 Files selected for processing (8)
packages/server/src/gateway/gateway-main.tspackages/server/src/gateway/gateway/index.tspackages/server/src/gateway/infrastructure/queue/index.tspackages/server/src/gateway/infrastructure/queue/types.tspackages/server/src/gateway/orchestration/impl/embedded-deployment.tspackages/server/src/gateway/orchestration/message-consumer.tspackages/server/src/gateway/orchestration/turn-liveness.tspackages/server/src/gateway/platform/unified-thread-consumer.ts
78e0226 to
f0842a9
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (3)
packages/server/src/gateway/orchestration/turn-liveness.ts (1)
317-324:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDefault
platformon the terminal payload.Line 323 can still enqueue
platform: undefinedfor API turns, which violates the explicit-platform contract for gateway events and makes renderer-side filtering brittle.Proposed fix
const payload = { messageId: routing.messageId, channelId: routing.channelId, conversationId: routing.conversationId, userId: routing.userId, teamId: routing.platform ?? "api", - platform: routing.platform, + platform: routing.platform ?? "api", platformMetadata: routing.platformMetadata, error: reason, processedMessageIds: [routing.messageId], timestamp: Date.now(), };As per coding guidelines,
packages/server/src/gateway/**/*.ts: Gateway orchestration and platform isolation: InteractionService events MUST carry an explicitplatformfield; each platform renderer MUST filter on its own identity (e.g.,platform === "telegram"), never reference another platform's identifier.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/orchestration/turn-liveness.ts` around lines 317 - 324, The payload being enqueued may set platform to undefined; update the payload construction in turn-liveness.ts so the payload.platform uses the explicit default (use routing.platform ?? "api") just like teamId does, ensuring InteractionService events always carry an explicit platform value and renderers can reliably filter by platform.packages/server/src/gateway/orchestration/message-consumer.ts (2)
295-322:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't complete the guardrail short-circuit when terminal delivery fails.
If
thread_responseenqueue fails here, this branch logs and returns before any liveness marker exists. That permanently drops the blocked turn: no client error, no retry, and no sweep backstop. Re-thrownotifyErroror route this branch through the same liveness-backed failure helper before returning.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/orchestration/message-consumer.ts` around lines 295 - 322, The current guardrail short-circuit swallows failures from creating/sending to responseQueue ("thread_response") by logging and returning; update the try/catch in the block that uses this.queue.createQueue(responseQueue) and this.queue.send(..., TERMINAL_DELIVERY_SEND_OPTS) so that on notifyError you either re-throw the notifyError or invoke the existing liveness-backed failure helper (the same helper used by other failure paths) before returning, rather than simply calling logger.error; ensure the catch includes any relevant context (messageId, userId, channelId, blockMessage) when delegating/throwing so the blocked turn is retried or swept as intended.
679-701:⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy liftGate this startup-failure send through the same first-writer-wins path.
By this point the turn is already in
thread_message_${deploymentName}. A worker can still win the race and enqueue a terminal reply before this background failure path runs, but this code sends its own terminalerrorunconditionally and only discharges afterward. That can double-signal the client. Use the atomic liveness failure helper here instead of manualqueue.send(...)+dischargeTurn(...).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/gateway/gateway/index.ts`:
- Around line 442-468: The terminal-response enqueue followed by separate
dischargeTurn calls is not crash-atomic: if the process dies after
queue.send("thread_response", enrichedResponse, ...) but before
dischargeTurn(deploymentName, id) runs, a surviving marker can later cause a
duplicate terminal error; fix by making the persist-terminal-row + delete-marker
operation atomic (either implement a transactional helper that writes the
terminal delivery row and removes the marker in one DB transaction inside the
same code path that enqueues the terminal response, or modify the sweep to
detect and suppress markers when a terminal row already exists for the same
(deploymentName, messageId)); locate the enqueue call to queue.send and the
subsequent dischargeTurn loop (using enrichedResponse,
enrichedResponse.processedMessageIds, isTerminalResponse and
TERMINAL_DELIVERY_SEND_OPTS) and either (a) move/delete the marker removal into
the same transactional helper used to persist the thread_response or (b) add a
sweep-side existence check for a terminal row for (deploymentName, messageId)
before emitting a terminal error.
In `@packages/server/src/gateway/orchestration/turn-liveness.ts`:
- Around line 118-129: The try/catch around queue.createQueue and queue.send
currently swallows errors and allows dispatch to continue without the durable
turn-timeout marker; change it so failures abort dispatch instead of logging and
continuing: in the block that calls queue.createQueue(TURN_TIMEOUT_QUEUE) and
queue.send(..., { singletonKey: turnMarkerKey(...) }), remove or alter the catch
to rethrow the error (or return a failure) so the caller will not proceed to
dispatch; this ensures dischargeTurn, failTurnsForDeployment and the periodic
sweep have the marker to elect against.
---
Duplicate comments:
In `@packages/server/src/gateway/orchestration/message-consumer.ts`:
- Around line 295-322: The current guardrail short-circuit swallows failures
from creating/sending to responseQueue ("thread_response") by logging and
returning; update the try/catch in the block that uses
this.queue.createQueue(responseQueue) and this.queue.send(...,
TERMINAL_DELIVERY_SEND_OPTS) so that on notifyError you either re-throw the
notifyError or invoke the existing liveness-backed failure helper (the same
helper used by other failure paths) before returning, rather than simply calling
logger.error; ensure the catch includes any relevant context (messageId, userId,
channelId, blockMessage) when delegating/throwing so the blocked turn is retried
or swept as intended.
In `@packages/server/src/gateway/orchestration/turn-liveness.ts`:
- Around line 317-324: The payload being enqueued may set platform to undefined;
update the payload construction in turn-liveness.ts so the payload.platform uses
the explicit default (use routing.platform ?? "api") just like teamId does,
ensuring InteractionService events always carry an explicit platform value and
renderers can reliably filter by platform.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 833e2580-c4db-4650-91f6-99f85cf8d7d1
📒 Files selected for processing (9)
packages/server/src/__tests__/guardrails-runtime.test.tspackages/server/src/gateway/gateway-main.tspackages/server/src/gateway/gateway/index.tspackages/server/src/gateway/infrastructure/queue/index.tspackages/server/src/gateway/infrastructure/queue/types.tspackages/server/src/gateway/orchestration/impl/embedded-deployment.tspackages/server/src/gateway/orchestration/message-consumer.tspackages/server/src/gateway/orchestration/turn-liveness.tspackages/server/src/gateway/platform/unified-thread-consumer.ts
f0842a9 to
bbc0c13
Compare
|
Thanks @coderabbitai — all four findings were real; addressed in
Added a committed integration test ( |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/server/src/gateway/orchestration/impl/embedded-deployment.ts (1)
797-807:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGate the
errorhandler to spawn failures only.Node.js emits the
errorevent not only for spawn failures, but also for failedkill()andsend()operations after the child has been spawned. The current handler will wrongly clearintentionalExitsand surface a bogus terminalWORKER_DIED_MESSAGEduring a deliberate teardown iferrorfires after a post-spawnkill()call. Track thespawnevent and gate the failure path to true pre-spawn failures; let post-spawn failures flow through theexitpath instead.Fix
+ let sawSpawn = false; + child.once("spawn", () => { + sawSpawn = true; + }); + child.once("error", (err) => { + if (sawSpawn) { + logger.error( + `Embedded worker ${deploymentName} child-process error after spawn: ${err.message}` + ); + return; + } logger.error( `Embedded worker ${deploymentName} spawn error: ${err.message}` ); this.workers.delete(deploymentName); releaseLockOnce(); // A spawn error is never a deliberate stop. Fail any in-flight turn(s) // for this deployment so the client gets a terminal error instead of a // hang. No-op if nothing is in flight (markers already discharged). this.intentionalExits.delete(deploymentName); void failTurnsForDeployment(deploymentName, WORKER_DIED_MESSAGE); });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts` around lines 797 - 807, The current child.once("error") handler treats all child errors as pre-spawn spawn failures; change it to only handle true spawn failures by tracking the child's "spawn" event: add a boolean flag (e.g., spawned = false), set spawned = true in child.once("spawn", ...) and in the child.once("error", ...) handler only perform the pre-spawn failure path (logger.error with err.message, this.workers.delete(deploymentName), releaseLockOnce(), this.intentionalExits.delete(deploymentName), and void failTurnsForDeployment(deploymentName, WORKER_DIED_MESSAGE)) when spawned is false; for errors after spawn, let the normal "exit" handling run and avoid clearing intentionalExits or surfacing WORKER_DIED_MESSAGE.
♻️ Duplicate comments (2)
packages/server/src/gateway/orchestration/turn-liveness.ts (1)
118-129:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't continue when timeout-marker arming fails.
armTurnTimeout()still logs and returns oncreateQueue()/send()failure, so dispatch can proceed without any durable marker. In that state there is nothing forfailTurnIfPending(),failTurnsForDeployment(), or the sweep to elect against, which brings back the hanging-turn behavior this module is meant to remove.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/orchestration/turn-liveness.ts` around lines 118 - 129, The armTurnTimeout() implementation currently swallows errors from queue.createQueue()/queue.send(), allowing dispatch to continue without a durable TURN_TIMEOUT_QUEUE marker; change it so failures are not ignored: when createQueue() or send() throws, log the error as before but then rethrow (or otherwise propagate) the exception so the caller cannot proceed. Update armTurnTimeout() to let the error bubble up (instead of returning normally) so callers like failTurnIfPending() / failTurnsForDeployment() / the sweeper always have a marker to elect against; reference TURN_TIMEOUT_QUEUE, turnMarkerKey(), createQueue(), and send() to locate the code to modify.packages/server/src/gateway/orchestration/message-consumer.ts (1)
295-352:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winThe guardrail reject path still fails open on notify errors.
throw notifyErrorat Line 328 never escapeshandleMessage(): the outer catch at Lines 343-352 catches it, logs "proceeding without guardrails", and then continues into marker arming and worker dispatch. Ifthread_responseenqueue fails, a message that already tripped the input guardrail reaches the worker instead of being retried/blocked.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/orchestration/message-consumer.ts` around lines 295 - 352, The guardrail notification failure is still being swallowed by the outer input-guardrail catch in handleMessage(), so a tripped block can continue into normal processing. Update the control flow around the notifyError path in message-consumer.ts so failures from the thread_response send are not caught by the broader “proceeding without guardrails” handler; either rethrow outside that catch scope or add a dedicated branch so the message is retried/aborted instead of falling through to marker arming and worker dispatch.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/gateway/orchestration/turn-liveness.ts`:
- Around line 424-435: The current transaction always calls
insertThreadResponseRow() and notifyThreadResponse() even if none of the TURN
markers were deleted; change the logic in commitTerminalReply()/the tx block to
only emit the terminal row when the transaction actually deleted at least one
pending marker: replace the per-message DELETEs with a DELETE ... RETURNING (or
use tx`DELETE ... RETURNING idempotency_key`) inside the loop or a single
set-based DELETE that RETURNS rows, track whether any rows were returned (i.e.
any marker was removed), and only call insertThreadResponseRow(tx, replyPayload,
organizationId) and the subsequent notifyThreadResponse() if that delete
count/returned set is non-empty.
In `@packages/server/src/gateway/platform/unified-thread-consumer.ts`:
- Around line 189-205: The SSE-owner check in routeToRenderer is incorrectly
applied to Chat SDK rows; update the terminal API-row gate so it skips the SSE
ownership throw when the Chat SDK can handle the message. Concretely, in the
block that computes isApiRow / isTerminal and checks
!this.sseManager.hasActiveConnection(sseKey) before throwing, add a guard that
also allows delivery when chatResponseBridge.canHandle(data) returns true (i.e.,
add && !this.chatResponseBridge?.canHandle(data) to the condition or
equivalent), so Chat SDK-handled rows are not re-queued.
---
Outside diff comments:
In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts`:
- Around line 797-807: The current child.once("error") handler treats all child
errors as pre-spawn spawn failures; change it to only handle true spawn failures
by tracking the child's "spawn" event: add a boolean flag (e.g., spawned =
false), set spawned = true in child.once("spawn", ...) and in the
child.once("error", ...) handler only perform the pre-spawn failure path
(logger.error with err.message, this.workers.delete(deploymentName),
releaseLockOnce(), this.intentionalExits.delete(deploymentName), and void
failTurnsForDeployment(deploymentName, WORKER_DIED_MESSAGE)) when spawned is
false; for errors after spawn, let the normal "exit" handling run and avoid
clearing intentionalExits or surfacing WORKER_DIED_MESSAGE.
---
Duplicate comments:
In `@packages/server/src/gateway/orchestration/message-consumer.ts`:
- Around line 295-352: The guardrail notification failure is still being
swallowed by the outer input-guardrail catch in handleMessage(), so a tripped
block can continue into normal processing. Update the control flow around the
notifyError path in message-consumer.ts so failures from the thread_response
send are not caught by the broader “proceeding without guardrails” handler;
either rethrow outside that catch scope or add a dedicated branch so the message
is retried/aborted instead of falling through to marker arming and worker
dispatch.
In `@packages/server/src/gateway/orchestration/turn-liveness.ts`:
- Around line 118-129: The armTurnTimeout() implementation currently swallows
errors from queue.createQueue()/queue.send(), allowing dispatch to continue
without a durable TURN_TIMEOUT_QUEUE marker; change it so failures are not
ignored: when createQueue() or send() throws, log the error as before but then
rethrow (or otherwise propagate) the exception so the caller cannot proceed.
Update armTurnTimeout() to let the error bubble up (instead of returning
normally) so callers like failTurnIfPending() / failTurnsForDeployment() / the
sweeper always have a marker to elect against; reference TURN_TIMEOUT_QUEUE,
turnMarkerKey(), createQueue(), and send() to locate the code to modify.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 7893bef9-bd7b-4c91-a107-71f7ae246f70
📒 Files selected for processing (10)
packages/server/src/__tests__/guardrails-runtime.test.tspackages/server/src/gateway/__tests__/turn-liveness.test.tspackages/server/src/gateway/gateway-main.tspackages/server/src/gateway/gateway/index.tspackages/server/src/gateway/infrastructure/queue/index.tspackages/server/src/gateway/infrastructure/queue/types.tspackages/server/src/gateway/orchestration/impl/embedded-deployment.tspackages/server/src/gateway/orchestration/message-consumer.tspackages/server/src/gateway/orchestration/turn-liveness.tspackages/server/src/gateway/platform/unified-thread-consumer.ts
✅ Files skipped from review due to trivial changes (1)
- packages/server/src/gateway/tests/turn-liveness.test.ts
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (2)
packages/server/src/gateway/orchestration/message-consumer.ts (1)
295-352:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuardrail notification failures still fail open here.
The
throw notifyErrorat Line 328 is immediately caught by the outercatchat Line 343, which logs and continues with normal dispatch. That turns athread_responseoutage into a guardrail bypass instead of a retry. Narrow the outer catch to just guardrail evaluation/metadata work, or explicitly rethrow notification failures past it.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/orchestration/message-consumer.ts` around lines 295 - 352, The inner catch around the queue.send to "thread_response" currently throws notifyError but that gets swallowed by the outer catch (the guardrail evaluation block), turning enqueue failures into a fail-open; fix by surfacing notification failures past the outer catch: declare an outer-scope variable (e.g., let notifyErrorOutside: unknown = undefined) before the guardrail try, assign notifyErrorOutside = notifyError in the inner catch instead of directly throwing, and after the outer try/catch (or at the end of the guardrail handling) if notifyErrorOutside is set rethrow it so the messages handler can retry; reference the inner catch variable notifyError, the queue.send to responseQueue (TERMINAL_DELIVERY_SEND_OPTS), and the outer catch that currently logs and continues.packages/server/src/gateway/platform/unified-thread-consumer.ts (1)
189-205:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSkip the SSE owner gate for Chat SDK-deliverable terminal rows.
routeToRenderer()is also used by the Chat SDK fast path at Lines 97-100. For a terminal API row thatchatResponseBridge.canHandle(data)can already deliver, this block still throws when there is no matching SSE session on the pod, so the reply gets re-queued instead of sent.💡 Localized fix
if ( isTerminal && sseKey && - !this.sseManager.hasActiveConnection(sseKey) + !this.sseManager.hasActiveConnection(sseKey) && + !this.chatResponseBridge?.canHandle(data) ) { throw new Error( `API SSE session ${sseKey} not owned by this gateway instance; re-queueing for owner delivery` ); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/platform/unified-thread-consumer.ts` around lines 189 - 205, The SSE-owner throw currently blocks terminal API rows even when the Chat SDK fast path can deliver them; update the terminal-row check around isTerminal/sseKey so it only throws when the Chat SDK cannot already handle delivery (i.e., add a guard using chatResponseBridge.canHandle(data)). Concretely, modify the if that checks isTerminal && sseKey && !this.sseManager.hasActiveConnection(sseKey) to also require !chatResponseBridge.canHandle(data) (or equivalent), referencing the existing isTerminal, sseKey, this.sseManager.hasActiveConnection, and chatResponseBridge.canHandle symbols so SDK-deliverable terminal rows skip the SSE owner gate.
🧹 Nitpick comments (3)
packages/server/src/gateway/infrastructure/queue/types.ts (1)
44-47: ⚡ Quick winRename this exported constant to camelCase.
TERMINAL_DELIVERY_SEND_OPTSadds a non-standard public API name that every consumer now has to mirror. PreferterminalDeliverySendOptsand update the re-export/import sites with it. As per coding guidelines, Use camelCase for variable names in TypeScript/JavaScript.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/infrastructure/queue/types.ts` around lines 44 - 47, Rename the exported constant TERMINAL_DELIVERY_SEND_OPTS to camelCase (terminalDeliverySendOpts) and update its declaration (export const terminalDeliverySendOpts: QueueOptions = { retryLimit: 30, retryDelay: 1 }) and all places that import or re-export TERMINAL_DELIVERY_SEND_OPTS to use terminalDeliverySendOpts instead; ensure any re-export barrels or named imports are updated so consumers use the new camelCase public API and run tests/TS build to catch remaining references.packages/server/src/__tests__/guardrails-runtime.test.ts (1)
500-514: ⚡ Quick winAssert the terminal send options here too.
This now proves the payload uses
error, but it still won't catch someone droppingTERMINAL_DELIVERY_SEND_OPTSfrom thethread_responseenqueue. Capturing the thirdsend()arg and asserting it matches the terminal delivery opts would lock down the full contract this PR depends on.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/__tests__/guardrails-runtime.test.ts` around lines 500 - 514, The test currently checks threadResponses for a single rejection payload but doesn't assert the terminal delivery options; update the assertions to also capture the third send() argument for the thread_response enqueue and assert it equals TERMINAL_DELIVERY_SEND_OPTS. Locate the threadResponses array derived from sentToQueue, extract the send-args (third argument) from the matching sentToQueue entry for queue === 'thread_response' (or from threadResponses[0] if it contains send args), and add an expect(...).toEqual(TERMINAL_DELIVERY_SEND_OPTS) alongside the existing error/content assertions to lock down the delivery contract.packages/server/src/gateway/__tests__/turn-liveness.test.ts (1)
1-33: ⚡ Quick winMove this suite under
packages/server/src/__tests__/.This new file lives under
packages/server/src/gateway/__tests__/, which breaks the repo's test layout convention and makes package-level test discovery less consistent. Please move it into the packagesrc/__tests__/tree instead. As per coding guidelines, Test files follow package structure:packages/*/src/__tests__/.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/__tests__/turn-liveness.test.ts` around lines 1 - 33, The test file turn-liveness.test.ts currently living under the gateway tests needs to be moved into the package's src/__tests__ tree to match project conventions; relocate the file into the package-level src/__tests__ directory, update any relative imports used in that file (references such as getDb, RunsQueue, armTurnTimeout, commitTerminalReply, dischargeTurn, extendTurnDeadlines, failTurnIfPending, failTurnsForDeployment, sweepExpiredTurns and helpers like ensurePgliteForGatewayTests/resetTestDatabase) so they resolve from the new location, and run the test discovery to confirm the suite is picked up under the correct package test tree.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/gateway/gateway/index.ts`:
- Around line 413-416: The fire-and-forget call to
extendTurnDeadlines(deploymentName) can reject and cause an unhandled promise
rejection; change the call site to attach a .catch(...) handler that logs the
error (including deploymentName and context) via the existing logger so failures
are observed; ensure you do not block the handler (keep it best-effort) but log
the rejection and any relevant metadata to aid debugging.
In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts`:
- Around line 803-807: The async calls to failTurnsForDeployment are currently
fire-and-forget (using void) so any rejection is unobserved; update both
crash/exit sites where intentionalExits.delete(deploymentName); void
failTurnsForDeployment(deploymentName, WORKER_DIED_MESSAGE); is used to attach a
.catch(...) that logs the error (include deploymentName and WORKER_DIED_MESSAGE
context) so rejections from the liveness/DB work are not swallowed; search for
the same pattern (the other site uses the same intentionalExits.delete + void
failTurnsForDeployment(...) call) and add identical .catch logging there.
In `@packages/server/src/gateway/orchestration/turn-liveness.ts`:
- Around line 219-223: The current logic deletes the turn marker before
validating action_input via asTurnRouting, which causes malformed rows to be
permanently removed; change the flow so you parse/validate with
asTurnRouting(row.action_input) first and only delete the marker if parsing
succeeds, otherwise abort the transaction (or keep the marker pending and emit a
fallback terminal error) to avoid committing a delete on malformed rows. Apply
this same rollback-first pattern to the sweep and single-turn helper paths that
use asTurnRouting (the other occurrences you noted at the 282-287 and 393-395
sites) so all marker-deletes are conditional on successful asTurnRouting
parsing.
---
Duplicate comments:
In `@packages/server/src/gateway/orchestration/message-consumer.ts`:
- Around line 295-352: The inner catch around the queue.send to
"thread_response" currently throws notifyError but that gets swallowed by the
outer catch (the guardrail evaluation block), turning enqueue failures into a
fail-open; fix by surfacing notification failures past the outer catch: declare
an outer-scope variable (e.g., let notifyErrorOutside: unknown = undefined)
before the guardrail try, assign notifyErrorOutside = notifyError in the inner
catch instead of directly throwing, and after the outer try/catch (or at the end
of the guardrail handling) if notifyErrorOutside is set rethrow it so the
messages handler can retry; reference the inner catch variable notifyError, the
queue.send to responseQueue (TERMINAL_DELIVERY_SEND_OPTS), and the outer catch
that currently logs and continues.
In `@packages/server/src/gateway/platform/unified-thread-consumer.ts`:
- Around line 189-205: The SSE-owner throw currently blocks terminal API rows
even when the Chat SDK fast path can deliver them; update the terminal-row check
around isTerminal/sseKey so it only throws when the Chat SDK cannot already
handle delivery (i.e., add a guard using chatResponseBridge.canHandle(data)).
Concretely, modify the if that checks isTerminal && sseKey &&
!this.sseManager.hasActiveConnection(sseKey) to also require
!chatResponseBridge.canHandle(data) (or equivalent), referencing the existing
isTerminal, sseKey, this.sseManager.hasActiveConnection, and
chatResponseBridge.canHandle symbols so SDK-deliverable terminal rows skip the
SSE owner gate.
---
Nitpick comments:
In `@packages/server/src/__tests__/guardrails-runtime.test.ts`:
- Around line 500-514: The test currently checks threadResponses for a single
rejection payload but doesn't assert the terminal delivery options; update the
assertions to also capture the third send() argument for the thread_response
enqueue and assert it equals TERMINAL_DELIVERY_SEND_OPTS. Locate the
threadResponses array derived from sentToQueue, extract the send-args (third
argument) from the matching sentToQueue entry for queue === 'thread_response'
(or from threadResponses[0] if it contains send args), and add an
expect(...).toEqual(TERMINAL_DELIVERY_SEND_OPTS) alongside the existing
error/content assertions to lock down the delivery contract.
In `@packages/server/src/gateway/__tests__/turn-liveness.test.ts`:
- Around line 1-33: The test file turn-liveness.test.ts currently living under
the gateway tests needs to be moved into the package's src/__tests__ tree to
match project conventions; relocate the file into the package-level
src/__tests__ directory, update any relative imports used in that file
(references such as getDb, RunsQueue, armTurnTimeout, commitTerminalReply,
dischargeTurn, extendTurnDeadlines, failTurnIfPending, failTurnsForDeployment,
sweepExpiredTurns and helpers like
ensurePgliteForGatewayTests/resetTestDatabase) so they resolve from the new
location, and run the test discovery to confirm the suite is picked up under the
correct package test tree.
In `@packages/server/src/gateway/infrastructure/queue/types.ts`:
- Around line 44-47: Rename the exported constant TERMINAL_DELIVERY_SEND_OPTS to
camelCase (terminalDeliverySendOpts) and update its declaration (export const
terminalDeliverySendOpts: QueueOptions = { retryLimit: 30, retryDelay: 1 }) and
all places that import or re-export TERMINAL_DELIVERY_SEND_OPTS to use
terminalDeliverySendOpts instead; ensure any re-export barrels or named imports
are updated so consumers use the new camelCase public API and run tests/TS build
to catch remaining references.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 6e1164f2-bf80-4b02-90bc-8a9d4d0d163c
📒 Files selected for processing (10)
packages/server/src/__tests__/guardrails-runtime.test.tspackages/server/src/gateway/__tests__/turn-liveness.test.tspackages/server/src/gateway/gateway-main.tspackages/server/src/gateway/gateway/index.tspackages/server/src/gateway/infrastructure/queue/index.tspackages/server/src/gateway/infrastructure/queue/types.tspackages/server/src/gateway/orchestration/impl/embedded-deployment.tspackages/server/src/gateway/orchestration/message-consumer.tspackages/server/src/gateway/orchestration/turn-liveness.tspackages/server/src/gateway/platform/unified-thread-consumer.ts
| // A worker ACK (delivery receipt or heartbeat) is a worker-driven | ||
| // liveness signal — push the turn-liveness deadline forward so a live | ||
| // but slow worker is never falsely failed by the sweep. Best-effort. | ||
| void extendTurnDeadlines(deploymentName); |
There was a problem hiding this comment.
Don't fire-and-forget extendTurnDeadlines() without handling rejection.
This is a DB write on the heartbeat path. If it rejects, the handler still returns 200 while the promise goes unobserved and can surface as an unhandled rejection. Add a .catch() here and log the failure.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/gateway/gateway/index.ts` around lines 413 - 416, The
fire-and-forget call to extendTurnDeadlines(deploymentName) can reject and cause
an unhandled promise rejection; change the call site to attach a .catch(...)
handler that logs the error (including deploymentName and context) via the
existing logger so failures are observed; ensure you do not block the handler
(keep it best-effort) but log the rejection and any relevant metadata to aid
debugging.
| // A spawn error is never a deliberate stop. Fail any in-flight turn(s) | ||
| // for this deployment so the client gets a terminal error instead of a | ||
| // hang. No-op if nothing is in flight (markers already discharged). | ||
| this.intentionalExits.delete(deploymentName); | ||
| void failTurnsForDeployment(deploymentName, WORKER_DIED_MESSAGE); |
There was a problem hiding this comment.
Handle rejections from these crash-path failTurnsForDeployment() calls.
Both sites launch async DB work with void. If the liveness write rejects during a worker crash/exit, that rejection is unobserved and the client-facing terminal failure can be lost. Attach .catch() and log at both call sites.
Also applies to: 847-848
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts` around
lines 803 - 807, The async calls to failTurnsForDeployment are currently
fire-and-forget (using void) so any rejection is unobserved; update both
crash/exit sites where intentionalExits.delete(deploymentName); void
failTurnsForDeployment(deploymentName, WORKER_DIED_MESSAGE); is used to attach a
.catch(...) that logs the error (include deploymentName and WORKER_DIED_MESSAGE
context) so rejections from the liveness/DB work are not swallowed; search for
the same pattern (the other site uses the same intentionalExits.delete + void
failTurnsForDeployment(...) call) and add identical .catch logging there.
| const routing = asTurnRouting(row.action_input); | ||
| if (!routing) { | ||
| logger.warn("Skipping malformed turn-timeout marker (fast path)"); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Don't commit marker deletes for malformed routing rows.
These paths delete the marker first and only then validate action_input. If parsing fails, the transaction still commits with no terminal row, permanently dropping the only durable obligation and reintroducing the hang this module is meant to prevent. Roll back on malformed rows, or keep the marker pending until you can emit a fallback terminal error.
Suggested pattern
const routing = asTurnRouting(row.action_input);
if (!routing) {
- logger.warn("Skipping malformed turn-timeout marker (fast path)");
- continue;
+ logger.error("Malformed turn-timeout marker; rolling back so another pass can recover it");
+ throw new Error("Malformed turn-timeout marker");
}Apply the same rollback-on-parse-failure behavior in the sweep and single-turn helper as well.
Also applies to: 282-287, 393-395
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/gateway/orchestration/turn-liveness.ts` around lines 219
- 223, The current logic deletes the turn marker before validating action_input
via asTurnRouting, which causes malformed rows to be permanently removed; change
the flow so you parse/validate with asTurnRouting(row.action_input) first and
only delete the marker if parsing succeeds, otherwise abort the transaction (or
keep the marker pending and emit a fallback terminal error) to avoid committing
a delete on malformed rows. Apply this same rollback-first pattern to the sweep
and single-turn helper paths that use asTurnRouting (the other occurrences you
noted at the 282-287 and 393-395 sites) so all marker-deletes are conditional on
successful asTurnRouting parsing.
|
bug_free 72, simplicity 74, slop 25, bugs 1, 0 blockers Read diff/logs; skipped extra boot probe. Typecheck passed and relevant guardrails-runtime/turn-liveness tests passed. [env] unit failed on connector-sdk dist export in CLI tests; integration had unrelated public-origin/MCP/sandbox/etc failures in untouched files. Suggested fixes
Full verdict JSON{
"bug_free_confidence": 72,
"bugs": 1,
"slop": 25,
"simplicity": 74,
"blockers": [],
"change_type": "fix",
"behavior_change_risk": "high",
"tests_adequate": false,
"suggested_fixes": [
{
"file": "packages/server/src/gateway/orchestration/message-consumer.ts",
"line": 343,
"change": "Do not let the guardrail fail-open catch swallow block-message enqueue failures after a trip; move the enqueue/return path outside the guardrail store/runner try-catch, or rethrow from this catch when the error came from notifying a tripped guardrail so the message job retries instead of dispatching the blocked input."
}
],
"notes": "Read diff/logs; skipped extra boot probe. Typecheck passed and relevant guardrails-runtime/turn-liveness tests passed. [env] unit failed on connector-sdk dist export in CLI tests; integration had unrelated public-origin/MCP/sandbox/etc failures in untouched files.",
"categories": {
"src": 724,
"tests": 202,
"docs": 0,
"config": 0,
"deps": 0,
"migrations": 0,
"ci": 0,
"generated": 0
}
}Local review gate — branch protection can require the |
…ors (#946) A worker that fails to produce a reply (crash, hang, or pod death) left the client's SSE/CLI hanging forever or returning a silent `complete` with no content. Three defects, one theme: a failed turn must produce a terminal event the client renders, correct under N>1 k8s replicas. - Turn-liveness election (turn-liveness.ts): a passive `runs` marker on the consumer-less `internal:turn_timeout` queue records that a turn owes a terminal event. DELETE...RETURNING is a first-writer-wins election; the deleter emits the terminal error in the same transaction (atomic, crash-safe). Fast path: child exit/error fails the deployment's in-flight turns instantly. Backstop: a periodic SKIP-LOCKED sweep fails lapsed turns (hung worker / pod death). Deadline pushed forward by the 20s worker heartbeat. - Owner-gated terminal delivery (unified-thread-consumer.ts): terminal API rows (success completion AND error) re-queue unless this pod holds the client's SSE, so a row produced on one replica reaches the SSE-owning replica. Fixes the pre-existing cross-pod success-completion drop too. - Render via `error` (message-consumer.ts): trackFailedDeployment and the input-guardrail reject moved off the dropped `content` field onto `error`, which renders end-to-end (SSE error event + CLI exit 1; platforms post Error:). Validated: live PGlite gateway with a worker that exits 1 -> connected/error/ agent-error/complete (was: hang). Discharge/sweep/exactly-once verified against Postgres. All claim queries index-backed via runs_lobu_claim_idx (EXPLAIN).
bbc0c13 to
95613d5
Compare
|
Round 2 ( Fixed (real):
Also fixed (your earlier comment + Not changed (with reasoning): Added 2 tests (now 9 in |
…ABASE_URL guard; trim apple_photos comment - turn-liveness.test.ts (added by #971) imported ensurePgliteForGatewayTests, which this PR's PGlite removal renamed to ensureDbForGatewayTests. The merge produced no textual conflict but a broken import that failed the gateway suite. Point it at the renamed helper. Validated: 9/9 pass vs Postgres. - review.sh: `unset DATABASE_URL` (run tests on isolated embedded PG) left a later `DATABASE_URL="$DATABASE_URL"` reference that tripped `set -u` (unbound variable), so pi never produced a verdict. Default to empty. - apple_photos.ts: trim the 13-line geo-tier TODO to a 4-line note (pi slop).
…; earthdistance geo (#965) * feat(server)!: replace PGlite with embedded Postgres; bundle pgvector; earthdistance geo Local `lobu run` and the test backend now run a real embedded PostgreSQL 18 (embedded-postgres) instead of PGlite, with pgvector injected from the new @lobu/pgvector-embedded package. This removes the entire dual-backend: the LOBU_DISABLE_PREPARE single-connection pool pin, the conversation-lock no-op, the PGlite test harness, and the @electric-sql/pglite* deps are all gone — one engine everywhere. - DATABASE_URL is the sole backend selector: postgres:// = external; a path (or file:) = embedded PG with its cluster at <path>/.lobu/pgdata. - geo_lookup reimplemented on core-contrib cube+earthdistance (no PostGIS, works on every backend); restores the geo schema lost in the migration squash. - pgvector ships as prebuilt per-platform artifacts; a CI workflow rebuilds them, and only the control + pinned install SQL are vendored. - Test backend auto-spawns an ephemeral embedded PG when DATABASE_URL is unset, so `make test` needs no external database. - review.sh now unsets DATABASE_URL so the suite runs against an isolated embedded PG instead of whatever (often shared/prod) DB .env points at. BREAKING CHANGE: the local `lobu run` / test database engine is now a real embedded PostgreSQL instead of PGlite. Existing ~/.lobu PGlite data dirs are not migrated — a fresh embedded PG cluster is created. Production (external Postgres via DATABASE_URL) is unchanged. * feat(cli): lobu init asks embedded vs external Postgres; write DATABASE_URL `lobu init` now prompts for the database backend and writes an explicit DATABASE_URL into the scaffolded .env: - "Local embedded Postgres" (default) → DATABASE_URL=file://. — an isolated per-project cluster at ./.lobu/pgdata (gitignored). - "Connect to an existing Postgres" → prompts for the postgres:// URL. Both already work at runtime (the DATABASE_URL path/URL routing); init is just the front-door UX. Modernized .env.tmpl off the PGlite/LOBU_DATA_DIR model onto the file:// vs postgres:// scheme. * refactor(server): consolidate the two server entrypoints into one server.ts is now the single entry for both backends, branching on DATABASE_URL: postgres:// connects to an external Postgres; a path/file:// spawns a local embedded Postgres. The embedded boot lives in the new embedded-runtime.ts and is loaded ONLY via `await import(...)` in the embedded branch, so the external/prod path never resolves or loads the embedded-postgres binary even though it sits in node_modules. - Deleted start-local.ts; build one bundle (server.bundle.mjs) instead of two. - dev.ts drops bundle-switching (one bundle, self-selects on DATABASE_URL) and maps LOBU_DATA_DIR → an embedded DATABASE_URL, so the Mac app (which spawns `lobu run` with LOBU_DATA_DIR) keeps working with no Swift change. - dev:local + e2e-lobu-apply.sh point at server.ts. Validated both modes from the one bundle: embedded (file://) spawns PG, runs migrations, all 4 extensions live; external (postgres://) connects, passes the schema-version check, and never triggers the lazy embedded-postgres import (0 spawn lines). typecheck 0, gateway tests 24/24. * refactor: remove LOBU_DATA_DIR — DATABASE_URL is the single backend selector LOBU_DATA_DIR was a second, redundant way to point at the embedded DB's data location. It's now purely the embedded-PG root (nothing else reads it — the old bootstrap-pat/ephemeral-key persistence claims were stale), so it folds entirely into DATABASE_URL=file://<dir>. - embedded-runtime.ts requires DATABASE_URL (no LOBU_DATA_DIR fallback). - user-config `dataDir` and the CLI `--data-dir` now map to DATABASE_URL=file://. - dev.ts drops the LOBU_DATA_DIR fallback; dev-native.sh + e2e-lobu-apply.sh set DATABASE_URL=file://<dir>. Validated: embedded boots from DATABASE_URL=file:// alone (cluster at <dir>/.lobu/pgdata), typecheck 0. (Mac app LocalLobuRunner switch to DATABASE_URL=file:// lands via the owletto submodule bump.) * chore: bump owletto pointer — Mac runner uses DATABASE_URL=file:// (drops LOBU_DATA_DIR) * chore: bump packages/owletto pointer for Keychain ENCRYPTION_KEY persistence (0c54927) * fix(pgvector-embedded): stage artifacts from build dir + build portable binaries The build matrix never produced non-darwin artifacts: - build.sh copied vector.control / vector--<v>.sql from the OS Postgres SHAREDIR, but the workflow only runs `make` (not `make install`), so those files never land there — every fresh-PG cell died on `cp: cannot stat '.../extension/vector.control'`. Stage straight from the built pgvector tree instead (vector.control is in-repo, the SQL is make-generated). Validated locally: reproduces the committed darwin-arm64 artifact byte-for-bit. - Pass OPTFLAGS="" so the redistributed binary drops pgvector's default `-march=native`; otherwise it bakes in the CI runner's microarch and SIGILLs on older user CPUs. - Linux install used a hand-rolled curl|gpg --dearmor pipe that flaked with "gpg: cannot open '/dev/tty'"; switch to the official pgdg apt.postgresql.org.sh setup script. * refactor(server): drop simpleQuery identity wrapper simpleQuery(query) just returned query — a no-op left over from the removed PGlite `.simple()` prepared-statement path. Unwrapped all 30 call sites (resolve_path.ts, multi-tenant.ts) and removed the export. * chore(pgvector-embedded): add linux-x64 + linux-arm64 prebuilt artifacts Built by build-pgvector-embedded.yml (pgvector v0.8.1, PG18, OPTFLAGS="" for portability). darwin-x64 follows once the macos-13 cell clears the runner queue; darwin-arm64 already committed. * fix(docker): wire @lobu/pgvector-embedded into the build, prune embedded PG from runtime server now depends on @lobu/pgvector-embedded (workspace:*), but the image never copied or stubbed it — bun install + the builder tsc would have failed the first time this hit main (build-images runs only on main). COPY its manifest + source and build its dist before the server type-check. Then prune embedded-postgres + @embedded-postgres/* (~145MB platform binary) from node_modules after the bundle: prod always uses an external DATABASE_URL, so the embedded path (the only loader, via await import) never runs. Validated with a full --target builder build: install + dist + tsc + bundle succeed and the runtime image no longer carries the 145MB binary. * refactor(server): drop dead applyDevProjectPathDefault, fix start-local.ts comment drift applyDevProjectPathDefault had zero callers (the LOBU_DEV_PROJECT_PATH fallback is inlined in server.ts main()). Removed it, and corrected the server-lifecycle.ts / dev-vite.ts headers that still described a two-entry-point design — start-local.ts was deleted in this PR. * fix(ci): build @lobu/pgvector-embedded before the integration suite The integration job (vitest under Node) failed with ERR_RESOLVE_PACKAGE_ENTRY_FAIL: the test backend imports @lobu/pgvector-embedded, so vite resolves its `import` condition (./dist/index.js) at transform time — but CI never built that package's dist (it's a new workspace dep in this PR). Build it alongside the other workspace packages, same as the Docker builder fix. Validated: full integration suite vs pg16 with dist freshly built → 0 resolve errors, 287/294 pass (the 5 fails are isolated-vm prebuild gaps local to the dev Mac; they load under CI's pinned Node 22). Also fix a stale comment: the test backend mirrors src/embedded-runtime.ts, not the deleted src/start-local.ts. * fix(ci): build darwin-x64 pgvector on macos-15-intel, not retired macos-13 GitHub retired the macos-13 (Intel) hosted runner in Dec 2025, so the darwin-x64 matrix cell queued indefinitely (no runner ever assigned). macos-15-intel is GitHub's current x86_64 macOS runner (Intel on macOS 15), the supported replacement until x86_64 macOS is dropped in 2027. * chore(pgvector-embedded): add darwin-x64 prebuilt artifact Built on macos-15-intel (x86_64 Mach-O bundle), completing all four platforms: darwin-arm64, darwin-x64, linux-x64, linux-arm64. * fix: turn-liveness test uses renamed gateway DB helper; review.sh DATABASE_URL guard; trim apple_photos comment - turn-liveness.test.ts (added by #971) imported ensurePgliteForGatewayTests, which this PR's PGlite removal renamed to ensureDbForGatewayTests. The merge produced no textual conflict but a broken import that failed the gateway suite. Point it at the renamed helper. Validated: 9/9 pass vs Postgres. - review.sh: `unset DATABASE_URL` (run tests on isolated embedded PG) left a later `DATABASE_URL="$DATABASE_URL"` reference that tripped `set -u` (unbound variable), so pi never produced a verdict. Default to empty. - apple_photos.ts: trim the 13-line geo-tier TODO to a 4-line note (pi slop). * docs(server): correct entry-point comment — embedded-runtime is statically imported The header said embedded-runtime loads via await import; it's actually a static import (the heavy embedded-postgres + pgvector injector are what load lazily, inside that module). Matches the dynamic-import allow-list in AGENTS.md.
Summary
Fixes #946. A worker that can't produce a reply (crash, hang, or pod death) left the client's SSE/CLI hanging forever, or returned a silent
completewith no content. This makes a failed turn always produce a terminal event the client renders — correct under N>1 k8s replicas (the constraint codified in #962).Supersedes the closed #958 (which was in-memory / single-process only).
Mechanism — turn-liveness election (no new schema)
Every dispatched turn arms a passive marker row in
public.runson the consumer-lessinternal:turn_timeoutqueue (turn-liveness.ts). It's never claimed as a job, so the queue's status machinery never touches it — its existence is the durable obligation that the turn owes the client a terminal event.DELETE … RETURNINGis first-writer-wins; the deleter emits a terminalthread_response{error}in the same transaction (atomic, crash-safe).EmbeddedDeploymentManagerchild.once("exit"/"error")fails the deployment's markers immediately.killWorkersets anintentionalExitflag so deliberate stops never notify.SKIP LOCKEDsweep fails lapsed markers — covers a hung worker and a worker-pod death (the marker outlives the pod; any replica sweeps it). Deadline pushed forward by the worker's 20s heartbeat.deploymentName:messageId(globally unique — platform message IDs are per-chat).error:trackFailedDeployment+ input-guardrail reject moved off the droppedcontentfield ontoerror(renders end-to-end: SSEerrorevent + CLI exit 1; platforms postError: …).Reproducer (red → green)
Red (#946 + handoff):
lobu chatagainst an agent whose worker can't reply → SSE sits onconnected+ ping forever (or silentcomplete).Green — live PGlite gateway,
LOBU_WORKER_ENTRYPOINT=process.exit(1), real Agent API:Validation
tsc --noEmit✅runs): all three marker queries index-backed viaruns_lobu_claim_idx/runs_idempotency_key_uniq— caught + fixed a discharge seq-scan (status='pending'). ✅RunsQueue+ twoSseManager+ realUnifiedThreadResponseConsumer, shared Postgres): error produced on the non-SSE pod → owner-gate re-queues → SSE-owning pod deliverserror,agent-error,complete; the other pod delivers nothing. This is the test fix(server): surface worker failures to chat clients instead of silent complete / hang (#946) #958 could never satisfy. ✅platform||teamId, composite marker key,trackFailedDeploymentdischarge); final re-review clean. ✅Accepted tradeoff (documented in code): a mid-window pod crash between enqueueing the real reply and discharging the marker can emit one bounded duplicate error — deliberately favored over re-introducing a hang.
Notes for the reviewer
make reviewwas not run here — its integration tests run DDL against the worktree.env's shared prod DB (per the repo handoff). Run it against a local throwaway DB before merge. The GPT-5.5 diff reviews above are the DB-safe substitute.completecross-pod; a full-stack happy-path E2E needs a real worker (provider key).🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Tests