fix(server): surface worker failures to chat clients instead of silent complete / hang (#946)#958
fix(server): surface worker failures to chat clients instead of silent complete / hang (#946)#958buremba wants to merge 7 commits into
Conversation
When background worker creation fails, `trackFailedDeployment` notifies the user via a `thread_response`. It wrote the message into the payload's `content` field — but `content` is documented (and implemented) as ephemeral-only: the response router renders it solely in the `ephemeral`-gated branch. A non-ephemeral `content` notice is therefore silently dropped on the direct-API/CLI SSE path, so the gateway emits a bare `complete` with no preceding content and `lobu chat` exits 0 with no output — the silent-success footgun in #946. Route the notice through the `error` field instead, which the pipeline already renders end-to-end: an `error` SSE event for direct-API clients (`lobu chat` prints "Agent error: …" and exits 1) and an "Error: …" message on chat platforms. This also matches the issue's expected outcome #2 (a clear, actionable error rather than empty success). Worker provisioning itself is unchanged: workers auto-provision on first message regardless of platform connections, so the issue's no-worker hypothesis is moot — the only defect was the dropped failure notice.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds buffered ChangesWorker Startup Failure Notice & Buffered Content
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
The response router (`routeToRenderer`) had branches for `delta`, `ephemeral`-gated `content`, `error`, and `processedMessageIds` (completion) — but none for a plain, non-ephemeral `content` message. Any payload that carried user-facing text in `content` therefore fell straight through to the completion branch and the user saw a bare `complete` with no content. This is the general defect behind #946: changing one producer only patches one call site, so make `content` a first-class render path instead. - Add optional `handleContent` to the `ResponseRenderer` interface. - `ApiResponseRenderer.handleContent` broadcasts the message as an `output` SSE event (direct-API/CLI clients render it like a streamed delta). - `ChatResponseBridge.handleContent` posts it in-thread; the ephemeral and content paths now share one `postBufferedContent` helper. - Router renders `content` (then falls through to completion so the turn still terminates); the `ephemeral` branch keeps its own handling. Invariant: a `thread_response` carrying `delta`, `content`, or `error` is always delivered — no field is silently dropped. Producers pick the field by intent (`error` = failure/exit 1, `content` = neutral notice/exit 0); the worker-startup-failure notice stays on `error`, and the guardrail input-trip notice (on `content`) is now surfaced too instead of vanishing.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
packages/server/src/gateway/api/response-renderer.ts (1)
63-66: ⚡ Quick winRemove the unused
_sessionKeyparameter fromhandleContent.Please delete the unused parameter instead of prefixing with
_to match repo standards.Proposed fix
- async handleContent( - payload: ThreadResponsePayload, - _sessionKey: string - ): Promise<void> { + async handleContent( + payload: ThreadResponsePayload + ): Promise<void> {As per coding guidelines, "When fixing unused-parameter errors, delete the parameter rather than prefixing with
_".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/api/response-renderer.ts` around lines 63 - 66, The handleContent method currently declares an unused parameter `_sessionKey`; remove that parameter so the signature becomes handleContent(payload: ThreadResponsePayload): Promise<void>. Update the method declaration and any implementing/overriding declarations or call sites that pass a second argument (search for handleContent(...) usages) to stop supplying the unused session argument, and run type checks to ensure ThreadResponsePayload usage inside handleContent remains correct.packages/server/src/gateway/connections/chat-response-bridge.ts (1)
609-612: ⚡ Quick winDrop the unused
_sessionKeyparameter inhandleContent.This should follow the repo rule to remove unused params rather than underscore-prefixing.
Proposed fix
- async handleContent( - payload: ThreadResponsePayload, - _sessionKey: string - ): Promise<void> { + async handleContent( + payload: ThreadResponsePayload + ): Promise<void> {As per coding guidelines, "When fixing unused-parameter errors, delete the parameter rather than prefixing with
_".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/connections/chat-response-bridge.ts` around lines 609 - 612, The handleContent method declares an unused parameter _sessionKey; remove that parameter from the method signature (change async handleContent(payload: ThreadResponsePayload): Promise<void>) and update any callers or overrides to stop passing a sessionKey argument so signatures remain consistent (also update any implementing interfaces/types that reference handleContent to match the new two-argument signature). Ensure imports/types for ThreadResponsePayload remain unchanged and run the typechecker to catch remaining call sites that need adjustment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@packages/server/src/gateway/api/response-renderer.ts`:
- Around line 63-66: The handleContent method currently declares an unused
parameter `_sessionKey`; remove that parameter so the signature becomes
handleContent(payload: ThreadResponsePayload): Promise<void>. Update the method
declaration and any implementing/overriding declarations or call sites that pass
a second argument (search for handleContent(...) usages) to stop supplying the
unused session argument, and run type checks to ensure ThreadResponsePayload
usage inside handleContent remains correct.
In `@packages/server/src/gateway/connections/chat-response-bridge.ts`:
- Around line 609-612: The handleContent method declares an unused parameter
_sessionKey; remove that parameter from the method signature (change async
handleContent(payload: ThreadResponsePayload): Promise<void>) and update any
callers or overrides to stop passing a sessionKey argument so signatures remain
consistent (also update any implementing interfaces/types that reference
handleContent to match the new two-argument signature). Ensure imports/types for
ThreadResponsePayload remain unchanged and run the typechecker to catch
remaining call sites that need adjustment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 08a68067-20f7-404a-bcb1-62cfa01dc1fa
📒 Files selected for processing (5)
packages/server/src/gateway/__tests__/worker-startup-failure-notice.test.tspackages/server/src/gateway/api/response-renderer.tspackages/server/src/gateway/connections/chat-response-bridge.tspackages/server/src/gateway/platform/response-renderer.tspackages/server/src/gateway/platform/unified-thread-consumer.ts
A worker that spawns successfully and then dies (crash on startup, OOM, external kill) never rejects createWorkerDeployment — the embedded manager declares the worker "ready" the moment spawn() returns, so trackFailedDeployment is never reached. The exit handler only logged the non-zero exit; the worker's queued message stranded with no terminal event and the run hung until the client's idle timeout. Live-reproduced against a PGlite gateway with a worker entrypoint that exits 1: the SSE stream sat on `connected` + pings forever. Add an unexpected-exit notifier: - `BaseDeploymentManager.setWorkerExitNotifier(...)` — orchestrator-wired hook. - Embedded manager flips an `intentionalExit` flag in `killWorker` (the sole kill chokepoint: scale-to-0, idle reap, delete), so only genuine crashes / external kills notify — never operator-driven stops or clean exits. - `MessageConsumer.notifyWorkerCrash` emits the notice via the `error` field (rendered end-to-end; `content` is ephemeral-only) for the deployment's conversation. Live result (same broken-worker setup): SSE now emits connected → error → complete within ~6s, and real `lobu chat` prints "Agent error: The worker handling your request stopped unexpectedly (exit code 1)…" and exits 1.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts`:
- Around line 942-947: The code sets entry.markIntentionalExit() before checking
the liveness guard (exitCode/signalCode), which can hide a real crash that
happens between those steps; move the call to entry.markIntentionalExit() so it
runs after the existing liveness guard that inspects exitCode and signalCode
(i.e., after the code path that determines whether the exit was a crash),
ensuring the exit handler can still detect and notify on genuine crashes; update
the block where entry.markIntentionalExit() is invoked (the closure tied to the
spawn's exit handler and the map deletion) so the liveness check runs first and
only then markIntentionalExit is set.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: f42870a9-e2a0-4360-a89a-0831185e7200
📒 Files selected for processing (4)
packages/server/src/gateway/__tests__/worker-startup-failure-notice.test.tspackages/server/src/gateway/orchestration/base-deployment-manager.tspackages/server/src/gateway/orchestration/impl/embedded-deployment.tspackages/server/src/gateway/orchestration/message-consumer.ts
Three reliability fixes from a codex pass on the worker-exit handling: 1. Spawn errors now notify. A missing/unexecutable worker binary fires the child's `error` event (not `exit`), and Node may emit no `exit` after it — so the queued turn stranded with no terminal event despite the new exit handler. Route `error` through the same notifier; a one-shot guard dedups if `exit` also fires. 2. Notify for the in-flight turn, not the deployment's first message. A long-lived worker serves many turns; the notice was keyed to the originating message. Track the latest dispatched message per worker (`recordInFlightMessage`, advanced from sendToWorkerQueue) so a crash on turn N reports turn N. 3. Stay silent when no turn is outstanding. Gate notifyWorkerCrash on the thread queue having waiting+active work — a worker that already replied and then dies (idle crash / reap) has no pending or claimed message, so it no longer emits a false "before it could reply". Fails open if the stats lookup throws. Live-reverified (PGlite + worker that exits 1): SSE still emits connected → error → complete, now carrying the correct in-flight messageId.
…round 2) The previous commit gated notifyWorkerCrash on the thread-message queue having waiting+active work, to avoid a false notice when an idle worker dies after already replying. But that gate is wrong: the thread-message row is marked completed the moment the worker acknowledges *receipt* (before it processes or replies), not after it replies — so during the entire agent-processing window the queue shows 0 outstanding. A crash there would be suppressed, re-stranding the turn with no terminal event — the exact silent-hang #946 fixes. Codex caught this. Remove the gate: notify on every unexpected, non-clean, non-intentional worker death. The accepted edge is benign — a long-lived worker that dies while idle after replying may emit a spurious notice; for the direct-API/CLI path that session already got `complete` and closed (no-op), and on chat platforms it's a rare non-destructive false alarm. Notice wording is now neutral on whether a reply happened. Surfacing a real mid-turn crash is worth that tradeoff; eliminating the edge entirely needs dispatch→terminal-response in-flight tracking shared across the response consumer (deferred).
|
bug_free 88, simplicity 72, slop 5, bugs 0, 0 blockers Large mixed branch (~17k lines changed) — bug fixes for worker-crash notification + content rendering, a server-lifecycle refactor, PAT auth bridge, lobu call CLI, agent_id memory scoping, and landing page rebuild. All script-run suites pass (typecheck=0, unit=0 fail across 201+49+52 tests, integration=0 fail across 894+30 tests). Explored the diff in depth: verified content-search.ts $11/$12 param slots align, ResponseRenderer.handleContent is optional so no breakage, workerExitNotifier one-shot guard dedups error+exit, and no secrets or illicit dynamic imports. Skipped booting the server for exploratory endpoint checks (diff touches 104 files across 8 packages — full boot validation would exceed time budget without a running dev env). The medium risk reflects the PAT auth bridge and organizationId threading touching auth-critical paths that the integration tests cover but only against a test DB, not the full embedded gateway stack. Full verdict JSON{
"bug_free_confidence": 88,
"bugs": 0,
"slop": 5,
"simplicity": 72,
"blockers": [],
"change_type": "fix",
"behavior_change_risk": "medium",
"tests_adequate": true,
"suggested_fixes": [],
"notes": "Large mixed branch (~17k lines changed) — bug fixes for worker-crash notification + content rendering, a server-lifecycle refactor, PAT auth bridge, lobu call CLI, agent_id memory scoping, and landing page rebuild. All script-run suites pass (typecheck=0, unit=0 fail across 201+49+52 tests, integration=0 fail across 894+30 tests). Explored the diff in depth: verified content-search.ts $11/$12 param slots align, ResponseRenderer.handleContent is optional so no breakage, workerExitNotifier one-shot guard dedups error+exit, and no secrets or illicit dynamic imports. Skipped booting the server for exploratory endpoint checks (diff touches 104 files across 8 packages — full boot validation would exceed time budget without a running dev env). The medium risk reflects the PAT auth bridge and organizationId threading touching auth-critical paths that the integration tests cover but only against a test DB, not the full embedded gateway stack.",
"categories": {
"src": 8500,
"tests": 850,
"docs": 400,
"config": 150,
"deps": 50,
"migrations": 0,
"ci": 15,
"generated": 50
}
}Shadow mode — verdict does not gate merges. See |
|
Closing — not merging this. It fixes the single-turn silent-hang/complete, but it is not correct for the multi-replica k8s deployment (ClientIP affinity, per-pod in-memory The proper fix is Postgres-mediated and reuses existing patterns:
Will be redone properly on a fresh branch with a mandatory 2-replica validation. The salvageable bits (the The AGENTS.md multi-pod-correctness rule added on this branch should be preserved/re-landed. |
Summary
Fixes #946 —
lobu chatagainst a freshly-applied agent returns a silent result: either a bareconnected→completewith no content, or (more often) the SSE stream hangs onconnected+ pings forever. Either way the user sees no answer and no error.Worker provisioning is not gated on platform connections (the issue's hypothesis) —
MessageConsumerenqueues +ensureWorkerExistsfor every message regardless of connections, so HTTP-only agents do get workers. The actual defects are two ways a worker failure goes unsurfaced, plus a render gap that swallowed the one notice the gateway did try to send.Three defects, one theme: a failed turn must produce a terminal event the client renders
1. The response router dropped non-ephemeral
content.routeToRendererhad branches fordelta,ephemeral-gatedcontent,error, and completion — but none for a plaincontentmessage. The type even documentscontent?: string; // Used only for ephemeral messages. So any gateway notice placed incontent(withoutephemeral) was silently dropped; onlycompletereached the user.contentis now a first-class render path —handleContentonResponseRenderer, implemented inApiResponseRenderer(→outputSSE event) andChatResponseBridge(→ in-thread post, sharing onepostBufferedContenthelper). Router renders it, then falls through to completion.2. Pre-spawn deployment failure surfaced via the wrong field.
trackFailedDeployment(fires whencreateWorkerDeploymentthrows — workspace/token/lock setup) wrote its notice tocontent→ dropped (defect 1).error→errorSSE event,lobu chatprints it and exits 1;Error: …on chat platforms.3. A worker that spawns then dies was never surfaced — the run hung. The embedded manager declares the worker "ready" the moment
spawn()returns, socreateWorkerDeploymentresolves andtrackFailedDeploymentis never reached. The exit handler only logged the non-zero exit; the queued message stranded with no terminal event. This is the most likely real-world worker-startup failure (bad provider, missing dep, OOM) and it hung indefinitely.setWorkerExitNotifierhook on the deployment manager; the embedded manager flips anintentionalExitflag inkillWorker(the sole kill chokepoint — scale-to-0, idle reap, delete) so only genuine crashes / external kills notify, never deliberate stops or clean exits.MessageConsumer.notifyWorkerCrashemits viaerror.Invariant: a
thread_responsecarryingdelta,content, orerroris always delivered; a worker that dies without completing always produces a terminalerror. No field, and no failure mode, silently vanishes.Live end-to-end verification (PGlite gateway, deliberately-broken worker)
Booted
start-local.bundle.mjs(PGlite, isolated data dir,DATABASE_URLscrubbed) withLOBU_WORKER_ENTRYPOINTpointing at a script that exits 1, then drove the real Agent API + reallobu chat.Before (defect 3): SSE sat on
connected+ping… forever. No terminal event.After:
Real
lobu chat(worktree CLI → fixed gateway):Unit tests (
worker-startup-failure-notice.test.ts)Real producer (
trackFailedDeployment,notifyWorkerCrash— injected fake queue) + real consumer routing + realApiResponseRenderer:error, not ephemeral-onlycontenterror-shaped notice for the conversationerrornotice →handleError+ completioncontent→handleContent+ completion (the router fix)ephemeralcontent →handleEphemeral, nothandleContentApiResponseRenderer.handleContentemits theoutputSSE event the CLI parsesRed→green verified both ways (producer field; router branch removed). 10/10 in this file + existing
unified-thread-consumer.test.ts.Validation
make build-packagesclean ·bun run typecheckexit 0guardrails-runtime.test.tsfailures are pre-existing infra (session_replication_roleDB-permission; needs a dedicated/PGlite DB), untouched here.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Improvements
Tests