fix(server): surface worker failures to chat clients instead of silent complete / hang (#946) #958

Closed
buremba wants to merge 7 commits into main from feat/fix-api-content-render
@buremba buremba commented May 20, 2026

Summary

Fixes #946. lobu chat against a freshly-applied agent returns a silent result: either a bare connected → complete with no content, or (more often) the SSE stream hangs on connected + pings forever. Either way the user sees no answer and no error.

Worker provisioning is not gated on platform connections (the issue's hypothesis) — MessageConsumer enqueues + ensureWorkerExists for every message regardless of connections, so HTTP-only agents do get workers. The actual defects are two ways a worker failure goes unsurfaced, plus a render gap that swallowed the one notice the gateway did try to send.

Three defects, one theme: a failed turn must produce a terminal event the client renders

1. The response router dropped non-ephemeral content. routeToRenderer had branches for delta, ephemeral-gated content, error, and completion — but none for a plain content message. The type even documents content?: string; // Used only for ephemeral messages. So any gateway notice placed in content (without ephemeral) was silently dropped; only complete reached the user.

  • Fix: content is now a first-class render path — handleContent on ResponseRenderer, implemented in ApiResponseRenderer (→ output SSE event) and ChatResponseBridge (→ in-thread post, sharing one postBufferedContent helper). Router renders it, then falls through to completion.

2. Pre-spawn deployment failure surfaced via the wrong field. trackFailedDeployment (fires when createWorkerDeployment throws — workspace/token/lock setup) wrote its notice to content → dropped (defect 1).

  • Fix: route it through error → error SSE event; lobu chat prints it and exits 1; Error: … on chat platforms.

3. A worker that spawns then dies was never surfaced — the run hung. The embedded manager declares the worker "ready" the moment spawn() returns, so createWorkerDeployment resolves and trackFailedDeployment is never reached. The exit handler only logged the non-zero exit; the queued message stranded with no terminal event. This is the most likely real-world worker-startup failure (bad provider, missing dep, OOM) and it hung indefinitely.

  • Fix: setWorkerExitNotifier hook on the deployment manager; the embedded manager flips an intentionalExit flag in killWorker (the sole kill chokepoint — scale-to-0, idle reap, delete) so only genuine crashes / external kills notify, never deliberate stops or clean exits. MessageConsumer.notifyWorkerCrash emits via error.

Invariant: a thread_response carrying delta, content, or error is always delivered; a worker that dies without completing always produces a terminal error. No field, and no failure mode, silently vanishes.
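
The invariant can be sketched as a small router. This is a minimal illustration, not the code in unified-thread-consumer.ts: the `Payload` and `Renderer` shapes and the `recordingRenderer` helper are assumptions made for the example, and whether the delta and ephemeral branches return early is also assumed. Only the field names (`delta`, `content`, `ephemeral`, `error`) and handler names come from the description above.

```typescript
// Hypothetical payload shape; only the field names come from the PR description.
interface Payload {
  messageId: string;
  delta?: string;
  content?: string;
  ephemeral?: boolean;
  error?: string;
}

// Hypothetical renderer contract; handleContent is optional, as in the PR.
interface Renderer {
  handleDelta(p: Payload): void;
  handleEphemeral(p: Payload): void;
  handleError(p: Payload): void;
  handleCompletion(p: Payload): void;
  handleContent?(p: Payload): void;
}

function routeToRenderer(p: Payload, r: Renderer): void {
  if (p.delta !== undefined) {
    r.handleDelta(p); // streamed chunk: the turn stays open (assumed)
    return;
  }
  if (p.ephemeral && p.content !== undefined) {
    r.handleEphemeral(p); // ephemeral content keeps its own branch
    return;
  }
  if (p.error !== undefined) {
    r.handleError(p);
  } else if (p.content !== undefined) {
    // Previously missing: plain content is rendered instead of dropped.
    r.handleContent?.(p);
  }
  // Error and content both fall through so the turn still terminates.
  r.handleCompletion(p);
}

// Test double that records which handlers ran, in order.
function recordingRenderer(): { calls: string[]; renderer: Renderer } {
  const calls: string[] = [];
  const renderer: Renderer = {
    handleDelta: () => { calls.push("delta"); },
    handleEphemeral: () => { calls.push("ephemeral"); },
    handleError: () => { calls.push("error"); },
    handleCompletion: () => { calls.push("completion"); },
    handleContent: () => { calls.push("content"); },
  };
  return { calls, renderer };
}
```

With this shape, a payload carrying only `content` records `content` then `completion`, and an ephemeral payload never reaches `handleContent`.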

Live end-to-end verification (PGlite gateway, deliberately-broken worker)

Booted start-local.bundle.mjs (PGlite, isolated data dir, DATABASE_URL scrubbed) with LOBU_WORKER_ENTRYPOINT pointing at a script that exits 1, then drove the real Agent API + real lobu chat.

Before (defect 3): SSE sat on connected + ping … forever. No terminal event.

After:

event: connected
event: error      → "The worker handling your request stopped unexpectedly (exit code 1)
                     before it could reply. Please retry in a moment."
event: complete

Real lobu chat (worktree CLI → fixed gateway):

Agent error: The worker handling your request stopped unexpectedly (exit code 1) before it could reply.
$ echo $?  → 1
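
To make the before/after difference concrete, here is a rough sketch of how a client could turn such a stream into an exit code. It is not the `lobu chat` implementation: `parseSse` and `exitCodeFor` are hypothetical names, the `data:` payloads are assumed to be plain text, and the exit code 2 for "no terminal event" is an invented convention for the example.

```typescript
interface SseEvent {
  event: string;
  data: string;
}

// Parse a complete SSE transcript. Frames are separated by a blank line;
// lines starting with ":" are comments (for example keep-alive pings).
function parseSse(text: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const frame of text.split(/\r?\n\r?\n/)) {
    let event = "message"; // SSE default event name
    const data: string[] = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line.startsWith("event:")) event = line.slice(6).trim();
      else if (line.startsWith("data:")) data.push(line.slice(5).trimStart());
    }
    // Skip frames that carried neither an event name nor data (pings, blanks).
    if (data.length > 0 || event !== "message") {
      events.push({ event, data: data.join("\n") });
    }
  }
  return events;
}

// Hypothetical policy: 1 if an error arrived, 0 on a clean complete,
// 2 if the stream ended with no terminal event at all.
function exitCodeFor(events: SseEvent[]): number {
  if (events.some((e) => e.event === "error")) return 1;
  if (events.some((e) => e.event === "complete")) return 0;
  return 2;
}
```

Under this policy the "before" transcript (connected plus pings only) has no terminal event, while the "after" transcript maps to exit code 1.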

Unit tests (worker-startup-failure-notice.test.ts)

Real producer (trackFailedDeployment, notifyWorkerCrash — injected fake queue) + real consumer routing + real ApiResponseRenderer:

  • producer emits the pre-spawn-failure notice via error, not ephemeral-only content
  • unexpected worker exit → error-shaped notice for the conversation
  • error notice → handleError + completion
  • non-ephemeral content → handleContent + completion (the router fix)
  • ephemeral content → handleEphemeral, not handleContent
  • real ApiResponseRenderer.handleContent emits the output SSE event the CLI parses

Red→green verified both ways (producer field; router branch removed). 10/10 in this file + existing unified-thread-consumer.test.ts.

Validation

  • make build-packages clean · bun run typecheck exit 0
  • Unit suites green · live E2E above
  • guardrails-runtime.test.ts failures are pre-existing infra (session_replication_role DB-permission; needs a dedicated/PGlite DB), untouched here.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Complete (non-streamed) responses now deliver full content to SSE, CLI, and chat channels; chat bridges render cards/actions or plain text fallbacks.
  • Bug Fixes

    • Worker startup failures and unexpected worker crashes now surface as user-facing error notifications (with correct conversation/session routing); intentional worker stops are suppressed.
  • Improvements

    • Routing now cleanly separates ephemeral vs non-ephemeral content and ensures content payloads are handled before completion; in-flight message tracking improves crash targeting.
  • Tests

    • Added tests for error, content, ephemeral, routing, and SSE output behaviors.

When background worker creation fails, `trackFailedDeployment` notifies the
user via a `thread_response`. It wrote the message into the payload's
`content` field — but `content` is documented (and implemented) as
ephemeral-only: the response router renders it solely in the
`ephemeral`-gated branch. A non-ephemeral `content` notice is therefore
silently dropped on the direct-API/CLI SSE path, so the gateway emits a bare
`complete` with no preceding content and `lobu chat` exits 0 with no output —
the silent-success footgun in #946.

Route the notice through the `error` field instead, which the pipeline
already renders end-to-end: an `error` SSE event for direct-API clients
(`lobu chat` prints "Agent error: …" and exits 1) and an "Error: …" message
on chat platforms. This also matches the issue's expected outcome #2 (a
clear, actionable error rather than empty success).

Worker provisioning itself is unchanged: workers auto-provision on first
message regardless of platform connections, so the issue's no-worker
hypothesis is moot — the only defect was the dropped failure notice.
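
As a rough illustration of the producer-side change, the sketch below enqueues the failure notice under `error` rather than `content`. The `ThreadResponse` shape, the `FakeQueue` class, the function signature, and the message wording are all assumptions for the example; the real `trackFailedDeployment` lives in message-consumer.ts and may differ.

```typescript
// Hypothetical thread_response shape for this example.
interface ThreadResponse {
  messageId: string;
  conversationId: string;
  error?: string;
  content?: string;
}

interface ResponseQueue {
  send(queueName: string, payload: ThreadResponse): void;
}

// In-memory queue, standing in for an injected fake queue in tests.
class FakeQueue implements ResponseQueue {
  sent: { queueName: string; payload: ThreadResponse }[] = [];

  send(queueName: string, payload: ThreadResponse): void {
    this.sent.push({ queueName, payload });
  }
}

function trackFailedDeployment(
  queue: ResponseQueue,
  messageId: string,
  conversationId: string,
  cause: Error
): void {
  // Wording is illustrative, not the gateway's actual message.
  const userMessage = `Could not start a worker for this request: ${cause.message}`;
  // Use `error`, not `content`: the pipeline already renders `error` end-to-end.
  queue.send("thread_response", { messageId, conversationId, error: userMessage });
}
```

A test can then assert on the payload shape directly: `error` is set and `content` is absent.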

coderabbitai Bot commented May 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds buffered handleContent paths and chat bridge consolidation so complete (non-streamed) content messages are routed to renderers and SSE/CLI. Worker-startup failure notices are enqueued using error so they surface on non-ephemeral paths. Tests validate error/content/ephemeral routing and queue payload shape.

Changes

Worker Startup Failure Notice & Buffered Content

| Layer | File(s) | Summary |
|---|---|---|
| ResponseRenderer contract extension | packages/server/src/gateway/platform/response-renderer.ts | Adds optional handleContent?(payload, sessionKey) to render complete, non-streamed content messages as the buffered counterpart to handleDelta. |
| Unified consumer content routing | packages/server/src/gateway/platform/unified-thread-consumer.ts | routeToRenderer now calls renderer.handleContent and broadcasts to CLI/SSE when data.content is present on non-ephemeral payloads before completion handling. |
| ApiResponseRenderer.handleContent | packages/server/src/gateway/api/response-renderer.ts | New handleContent resolves sessionId, logs when missing, and emits an SSE output event with { type: "delta", content, timestamp, messageId }. |
| ChatResponseBridge buffered posting | packages/server/src/gateway/connections/chat-response-bridge.ts | Introduce postBufferedContent(payload, label) used by handleEphemeral and new handleContent; centralizes card/actions rendering and plain-text fallback; logs reference label. |
| BaseDeploymentManager notifier | packages/server/src/gateway/orchestration/base-deployment-manager.ts | Add optional workerExitNotifier and setWorkerExitNotifier to register an unexpected worker-exit callback; add recordInFlightMessage hook for subclasses. |
| Embedded deployment lifecycle | packages/server/src/gateway/orchestration/impl/embedded-deployment.ts | Track intentionalExit via markIntentionalExit and currentMessage to suppress notifier on operator-driven or clean exits; notify on unexpected non-clean exits; update recordInFlightMessage and killWorker behavior. |
| trackFailedDeployment / notifyWorkerCrash | packages/server/src/gateway/orchestration/message-consumer.ts | Wire setWorkerExitNotifier to notifyWorkerCrash that enqueues a thread_response with error; call deploymentManager.recordInFlightMessage when sending to worker queue; update trackFailedDeployment to enqueue error: userMessage (not content) with clarifying comments. |
| Worker startup failure & routing tests | packages/server/src/gateway/__tests__/worker-startup-failure-notice.test.ts | New test suite with mocked queues/renderers/SSE verifying enqueue payload uses error, notifyWorkerCrash payload shape, and handleThreadResponse routing for error/content/ephemeral. Includes integration-style SSE broadcast check for ApiResponseRenderer.handleContent. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble on the message queue,
Error fields now hop into view.
No silent completes, no empty night,
Content buffered, routed right.
A tiny thump — the logs alight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | Title accurately summarizes the main change: surfacing worker failures to chat clients instead of silent completion or hangs, directly addressing issue #946. |
| Description check | ✅ Passed | Description is comprehensive and complete, covering summary, three specific defects, fixes, live E2E verification, unit tests, and validation steps matching the template structure. |
| Linked Issues check | ✅ Passed | All code changes directly address the core requirements from issue #946: surfacing worker failures via error events instead of silent completion, implementing handleContent for non-ephemeral messages, and notifying on worker crashes. |
| Out of Scope Changes check | ✅ Passed | All changes are tightly scoped to addressing the three defects described in #946: response routing, pre-spawn failure surfacing, and worker crash notification. No unrelated modifications detected. |

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.

The response router (`routeToRenderer`) had branches for `delta`,
`ephemeral`-gated `content`, `error`, and `processedMessageIds` (completion)
— but none for a plain, non-ephemeral `content` message. Any payload that
carried user-facing text in `content` therefore fell straight through to the
completion branch and the user saw a bare `complete` with no content. This is
the general defect behind #946: changing one producer only patches
one call site, so make `content` a first-class render path instead.

- Add optional `handleContent` to the `ResponseRenderer` interface.
- `ApiResponseRenderer.handleContent` broadcasts the message as an `output`
  SSE event (direct-API/CLI clients render it like a streamed delta).
- `ChatResponseBridge.handleContent` posts it in-thread; the ephemeral and
  content paths now share one `postBufferedContent` helper.
- Router renders `content` (then falls through to completion so the turn
  still terminates); the `ephemeral` branch keeps its own handling.

Invariant: a `thread_response` carrying `delta`, `content`, or `error` is
always delivered — no field is silently dropped. Producers pick the field by
intent (`error` = failure/exit 1, `content` = neutral notice/exit 0); the
worker-startup-failure notice stays on `error`, and the guardrail input-trip
notice (on `content`) is now surfaced too instead of vanishing.
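
A minimal sketch of the API-side `handleContent` path described above, under stated assumptions: the class name, the conversation-to-session map, and the `Broadcast` callback are hypothetical stand-ins for the real SSE manager, and the method is synchronous here and returns a boolean purely so the example is easy to check. The `output` event name and the `{ type: "delta", content, timestamp, messageId }` body follow the walkthrough summary.

```typescript
interface ContentPayload {
  messageId: string;
  conversationId: string;
  content?: string;
}

interface OutputEvent {
  type: "delta";
  content: string;
  timestamp: number;
  messageId: string;
}

// Hypothetical broadcast callback standing in for the SSE manager.
type Broadcast = (sessionId: string, event: string, data: OutputEvent) => void;

class ApiResponseRendererSketch {
  private sessions: Map<string, string>; // conversationId -> sessionId (assumed)
  private broadcast: Broadcast;

  constructor(sessions: Map<string, string>, broadcast: Broadcast) {
    this.sessions = sessions;
    this.broadcast = broadcast;
  }

  // Returns true when an `output` event was emitted.
  handleContent(payload: ContentPayload): boolean {
    if (payload.content === undefined) return false;
    const sessionId = this.sessions.get(payload.conversationId);
    if (sessionId === undefined) {
      console.warn(`no SSE session for conversation ${payload.conversationId}`);
      return false;
    }
    // Buffered content is sent in the same shape as a streamed delta,
    // so clients that already render deltas need no new code path.
    this.broadcast(sessionId, "output", {
      type: "delta",
      content: payload.content,
      timestamp: Date.now(),
      messageId: payload.messageId,
    });
    return true;
  }
}
```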
@buremba buremba changed the title from "fix(server): surface worker-startup-failure notice to API/CLI clients (#946)" to "fix(server): never silently drop user-facing thread_response content (#946)" May 20, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
packages/server/src/gateway/api/response-renderer.ts (1)

63-66: ⚡ Quick win

Remove the unused _sessionKey parameter from handleContent.

Please delete the unused parameter instead of prefixing with _ to match repo standards.

Proposed fix
-  async handleContent(
-    payload: ThreadResponsePayload,
-    _sessionKey: string
-  ): Promise<void> {
+  async handleContent(
+    payload: ThreadResponsePayload
+  ): Promise<void> {

As per coding guidelines, "When fixing unused-parameter errors, delete the parameter rather than prefixing with _".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/api/response-renderer.ts` around lines 63 - 66,
The handleContent method currently declares an unused parameter `_sessionKey`;
remove that parameter so the signature becomes handleContent(payload:
ThreadResponsePayload): Promise<void>. Update the method declaration and any
implementing/overriding declarations or call sites that pass a second argument
(search for handleContent(...) usages) to stop supplying the unused session
argument, and run type checks to ensure ThreadResponsePayload usage inside
handleContent remains correct.
packages/server/src/gateway/connections/chat-response-bridge.ts (1)

609-612: ⚡ Quick win

Drop the unused _sessionKey parameter in handleContent.

This should follow the repo rule to remove unused params rather than underscore-prefixing.

Proposed fix
-  async handleContent(
-    payload: ThreadResponsePayload,
-    _sessionKey: string
-  ): Promise<void> {
+  async handleContent(
+    payload: ThreadResponsePayload
+  ): Promise<void> {

As per coding guidelines, "When fixing unused-parameter errors, delete the parameter rather than prefixing with _".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/connections/chat-response-bridge.ts` around lines
609 - 612, The handleContent method declares an unused parameter _sessionKey;
remove that parameter from the method signature (change async
handleContent(payload: ThreadResponsePayload): Promise<void>) and update any
callers or overrides to stop passing a sessionKey argument so signatures remain
consistent (also update any implementing interfaces/types that reference
handleContent to match the new single-argument signature). Ensure imports/types for
ThreadResponsePayload remain unchanged and run the typechecker to catch
remaining call sites that need adjustment.
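
One point that may be worth checking before touching call sites: TypeScript lets an implementation declare fewer parameters than the interface method it satisfies, so dropping `_sessionKey` from an implementation does not by itself require changing the interface or callers that go through it. The sketch below uses hypothetical names and a synchronous `string` return (the real method returns `Promise<void>`) to keep it checkable.

```typescript
// Hypothetical, simplified contract: two parameters, method is optional.
interface ResponseRendererSketch {
  handleContent?(payload: { content: string }, sessionKey: string): string;
}

// The implementation omits the unused second parameter; this still
// satisfies the interface under TypeScript's assignability rules.
class ApiSketch implements ResponseRendererSketch {
  handleContent(payload: { content: string }): string {
    return payload.content;
  }
}

// A caller typed against the interface keeps passing both arguments;
// the extra argument is simply ignored at runtime.
const viaInterface: ResponseRendererSketch = new ApiSketch();
const rendered = viaInterface.handleContent?.({ content: "hi" }, "session-1");
```

Call sites typed against the concrete class, by contrast, would need the second argument removed.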
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@packages/server/src/gateway/api/response-renderer.ts`:
- Around line 63-66: The handleContent method currently declares an unused
parameter `_sessionKey`; remove that parameter so the signature becomes
handleContent(payload: ThreadResponsePayload): Promise<void>. Update the method
declaration and any implementing/overriding declarations or call sites that pass
a second argument (search for handleContent(...) usages) to stop supplying the
unused session argument, and run type checks to ensure ThreadResponsePayload
usage inside handleContent remains correct.

In `@packages/server/src/gateway/connections/chat-response-bridge.ts`:
- Around line 609-612: The handleContent method declares an unused parameter
_sessionKey; remove that parameter from the method signature (change async
handleContent(payload: ThreadResponsePayload): Promise<void>) and update any
callers or overrides to stop passing a sessionKey argument so signatures remain
consistent (also update any implementing interfaces/types that reference
handleContent to match the new single-argument signature). Ensure imports/types for
ThreadResponsePayload remain unchanged and run the typechecker to catch
remaining call sites that need adjustment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 08a68067-20f7-404a-bcb1-62cfa01dc1fa

📥 Commits

Reviewing files that changed from the base of the PR and between 6bbd508 and 3d63931.

📒 Files selected for processing (5)
  • packages/server/src/gateway/__tests__/worker-startup-failure-notice.test.ts
  • packages/server/src/gateway/api/response-renderer.ts
  • packages/server/src/gateway/connections/chat-response-bridge.ts
  • packages/server/src/gateway/platform/response-renderer.ts
  • packages/server/src/gateway/platform/unified-thread-consumer.ts

buremba added 2 commits May 20, 2026 03:58
A worker that spawns successfully and then dies (crash on startup, OOM,
external kill) never rejects createWorkerDeployment — the embedded manager
declares the worker "ready" the moment spawn() returns, so trackFailedDeployment
is never reached. The exit handler only logged the non-zero exit; the worker's
queued message stranded with no terminal event and the run hung until the
client's idle timeout. Live-reproduced against a PGlite gateway with a worker
entrypoint that exits 1: the SSE stream sat on `connected` + pings forever.

Add an unexpected-exit notifier:
- `BaseDeploymentManager.setWorkerExitNotifier(...)` — orchestrator-wired hook.
- Embedded manager flips an `intentionalExit` flag in `killWorker` (the sole
  kill chokepoint: scale-to-0, idle reap, delete), so only genuine crashes /
  external kills notify — never operator-driven stops or clean exits.
- `MessageConsumer.notifyWorkerCrash` emits the notice via the `error` field
  (rendered end-to-end; `content` is ephemeral-only) for the deployment's
  conversation.

Live result (same broken-worker setup): SSE now emits connected → error →
complete within ~6s, and real `lobu chat` prints "Agent error: The worker
handling your request stopped unexpectedly (exit code 1)…" and exits 1.
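
The notifier logic in this commit can be sketched roughly as follows. This is an illustrative model only: the class names, `register`, and `onExit` are hypothetical, no real process is spawned, and exit information is passed in as plain values. Only `setWorkerExitNotifier`, `killWorker`, `markIntentionalExit`, and the `intentionalExit` flag are names taken from the PR.

```typescript
type ExitNotifier = (deploymentName: string, detail: string) => void;

class WorkerEntrySketch {
  private intentionalExit = false;

  markIntentionalExit(): void {
    this.intentionalExit = true;
  }

  // True when the exit should be surfaced to the user.
  shouldNotify(code: number | null, signal: string | null): boolean {
    if (this.intentionalExit) return false; // killWorker asked for this stop
    if (code === 0 && signal === null) return false; // clean exit
    return true; // crash or external kill
  }
}

class EmbeddedManagerSketch {
  private workers = new Map<string, WorkerEntrySketch>();
  private notifier: ExitNotifier | undefined;

  setWorkerExitNotifier(notifier: ExitNotifier): void {
    this.notifier = notifier;
  }

  register(name: string): void {
    this.workers.set(name, new WorkerEntrySketch());
  }

  // The single kill chokepoint: flag first, then (in real code) signal the process.
  killWorker(name: string): void {
    this.workers.get(name)?.markIntentionalExit();
  }

  // Stand-in for the child process exit handler.
  onExit(name: string, code: number | null, signal: string | null): void {
    const entry = this.workers.get(name);
    this.workers.delete(name);
    if (!entry || !entry.shouldNotify(code, signal)) return;
    const detail = signal !== null ? `signal ${signal}` : `exit code ${code}`;
    this.notifier?.(name, detail);
  }
}
```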
@buremba buremba changed the title from "fix(server): never silently drop user-facing thread_response content (#946)" to "fix(server): surface worker failures to chat clients instead of silent complete / hang (#946)" May 20, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/gateway/orchestration/impl/embedded-deployment.ts`:
- Around line 942-947: The code sets entry.markIntentionalExit() before checking
the liveness guard (exitCode/signalCode), which can hide a real crash that
happens between those steps; move the call to entry.markIntentionalExit() so it
runs after the existing liveness guard that inspects exitCode and signalCode
(i.e., after the code path that determines whether the exit was a crash),
ensuring the exit handler can still detect and notify on genuine crashes; update
the block where entry.markIntentionalExit() is invoked (the closure tied to the
spawn's exit handler and the map deletion) so the liveness check runs first and
only then markIntentionalExit is set.
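
A rough sketch of the ordering this comment asks for, with the liveness guard ahead of the flag. The `ChildLike` and `Entry` shapes are assumptions for the example; `exitCode` and `signalCode` mirror the properties Node's `ChildProcess` exposes, and the actual code around lines 942-947 may be structured differently.

```typescript
// Minimal stand-in for the parts of a child process used here.
interface ChildLike {
  exitCode: number | null;
  signalCode: string | null;
  kill(signal?: string): boolean;
}

interface Entry {
  child: ChildLike;
  intentionalExit: boolean;
}

// Returns true if a kill was actually requested.
function killWorker(entry: Entry): boolean {
  // Liveness guard first: a process that already died on its own is left
  // unmarked, so its exit handler can still report the crash.
  if (entry.child.exitCode !== null || entry.child.signalCode !== null) {
    return false;
  }
  // Only a live process that we are about to stop counts as intentional.
  entry.intentionalExit = true;
  return entry.child.kill("SIGTERM");
}
```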

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f42870a9-e2a0-4360-a89a-0831185e7200

📥 Commits

Reviewing files that changed from the base of the PR and between 57a73ba and a5d9371.

📒 Files selected for processing (4)
  • packages/server/src/gateway/__tests__/worker-startup-failure-notice.test.ts
  • packages/server/src/gateway/orchestration/base-deployment-manager.ts
  • packages/server/src/gateway/orchestration/impl/embedded-deployment.ts
  • packages/server/src/gateway/orchestration/message-consumer.ts

buremba added 2 commits May 20, 2026 14:10
Three reliability fixes from a codex pass on the worker-exit handling:

1. Spawn errors now notify. A missing/​unexecutable worker binary fires the
   child's `error` event (not `exit`), and Node may emit no `exit` after it —
   so the queued turn stranded with no terminal event despite the new exit
   handler. Route `error` through the same notifier; a one-shot guard dedups
   if `exit` also fires.

2. Notify for the in-flight turn, not the deployment's first message. A
   long-lived worker serves many turns; the notice was keyed to the
   originating message. Track the latest dispatched message per worker
   (`recordInFlightMessage`, advanced from sendToWorkerQueue) so a crash on
   turn N reports turn N.

3. Stay silent when no turn is outstanding. Gate notifyWorkerCrash on the
   thread queue having waiting+active work — a worker that already replied and
   then dies (idle crash / reap) has no pending or claimed message, so it no
   longer emits a false "before it could reply". Fails open if the stats
   lookup throws.

Live-reverified (PGlite + worker that exits 1): SSE still emits connected →
error → complete, now carrying the correct in-flight messageId.
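
Fixes 1 and 2 can be illustrated together with a small tracker: a one-shot guard so `error` followed by `exit` produces a single notice, and a per-worker pointer to the latest dispatched message. The class and its method names other than `recordInFlightMessage` are hypothetical, and events are fed in by hand rather than from a real child process.

```typescript
type CrashNotice = (messageId: string | undefined, reason: string) => void;

class WorkerDeathTracker {
  private notified = false;
  private inFlightMessageId: string | undefined;
  private notify: CrashNotice;

  constructor(notify: CrashNotice) {
    this.notify = notify;
  }

  // Advanced on every dispatch so a crash on turn N reports turn N.
  recordInFlightMessage(messageId: string): void {
    this.inFlightMessageId = messageId;
  }

  // Stand-in for the child's `error` event (e.g. missing binary).
  onSpawnError(err: Error): void {
    this.fire(`spawn error: ${err.message}`);
  }

  // Stand-in for the child's `exit` event.
  onExit(code: number | null): void {
    if (code === 0) return; // clean exit, nothing to report
    this.fire(`exit code ${code}`);
  }

  // One-shot guard: whichever event arrives first wins.
  private fire(reason: string): void {
    if (this.notified) return;
    this.notified = true;
    this.notify(this.inFlightMessageId, reason);
  }
}
```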
…round 2)

The previous commit gated notifyWorkerCrash on the thread-message queue having
waiting+active work, to avoid a false notice when an idle worker dies after
already replying. But that gate is wrong: the thread-message row is marked
completed the moment the worker acknowledges *receipt* (before it processes or
replies), not after it replies — so during the entire agent-processing window
the queue shows 0 outstanding. A crash there would be suppressed, re-stranding
the turn with no terminal event — the exact silent-hang #946 fixes. Codex
caught this.

Remove the gate: notify on every unexpected, non-clean, non-intentional worker
death. The accepted edge is benign — a long-lived worker that dies while idle
after replying may emit a spurious notice; for the direct-API/CLI path that
session already got `complete` and closed (no-op), and on chat platforms it's a
rare non-destructive false alarm. Notice wording is now neutral on whether a
reply happened. Surfacing a real mid-turn crash is worth that tradeoff;
eliminating the edge entirely needs dispatch→terminal-response in-flight
tracking shared across the response consumer (deferred).
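
The flaw in the removed gate can be shown with a toy timeline. The phase names and helper functions below are invented for illustration; the only fact taken from the commit message is that the thread-message row is marked completed on receipt acknowledgement, before the worker processes or replies.

```typescript
type Phase = "queued" | "claimed" | "acknowledged" | "replied";
type RowState = "waiting" | "active" | "completed";

// Per the commit message: the row completes on receipt, before any reply.
function rowState(phase: Phase): RowState {
  if (phase === "queued") return "waiting";
  if (phase === "claimed") return "active";
  return "completed";
}

// The removed gate: notify only while the queue shows waiting or active work.
function gateAllowsNotice(phase: Phase): boolean {
  return rowState(phase) !== "completed";
}

// What the user needs: a terminal event for any turn not yet answered.
function turnNeedsNotice(phase: Phase): boolean {
  return phase !== "replied";
}

const phases: Phase[] = ["queued", "claimed", "acknowledged", "replied"];

// Phases where a crash needs a notice but the gate would suppress it.
const suppressedCrashes = phases.filter(
  (p) => turnNeedsNotice(p) && !gateAllowsNotice(p)
);
// suppressedCrashes is ["acknowledged"]: the whole agent-processing window.
```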

buremba commented May 20, 2026

bug_free 88, simplicity 72, slop 5, bugs 0, 0 blockers

Large mixed branch (~17k lines changed) — bug fixes for worker-crash notification + content rendering, a server-lifecycle refactor, PAT auth bridge, lobu call CLI, agent_id memory scoping, and landing page rebuild. All script-run suites pass (typecheck=0, unit=0 fail across 201+49+52 tests, integration=0 fail across 894+30 tests). Explored the diff in depth: verified content-search.ts $11/$12 param slots align, ResponseRenderer.handleContent is optional so no breakage, workerExitNotifier one-shot guard dedups error+exit, and no secrets or illicit dynamic imports. Skipped booting the server for exploratory endpoint checks (diff touches 104 files across 8 packages — full boot validation would exceed time budget without a running dev env). The medium risk reflects the PAT auth bridge and organizationId threading touching auth-critical paths that the integration tests cover but only against a test DB, not the full embedded gateway stack.

Full verdict JSON
{
  "bug_free_confidence": 88,
  "bugs": 0,
  "slop": 5,
  "simplicity": 72,
  "blockers": [],
  "change_type": "fix",
  "behavior_change_risk": "medium",
  "tests_adequate": true,
  "suggested_fixes": [],
  "notes": "Large mixed branch (~17k lines changed) — bug fixes for worker-crash notification + content rendering, a server-lifecycle refactor, PAT auth bridge, lobu call CLI, agent_id memory scoping, and landing page rebuild. All script-run suites pass (typecheck=0, unit=0 fail across 201+49+52 tests, integration=0 fail across 894+30 tests). Explored the diff in depth: verified content-search.ts $11/$12 param slots align, ResponseRenderer.handleContent is optional so no breakage, workerExitNotifier one-shot guard dedups error+exit, and no secrets or illicit dynamic imports. Skipped booting the server for exploratory endpoint checks (diff touches 104 files across 8 packages — full boot validation would exceed time budget without a running dev env). The medium risk reflects the PAT auth bridge and organizationId threading touching auth-critical paths that the integration tests cover but only against a test DB, not the full embedded gateway stack.",
  "categories": {
    "src": 8500,
    "tests": 850,
    "docs": 400,
    "config": 150,
    "deps": 50,
    "migrations": 0,
    "ci": 15,
    "generated": 50
  }
}

Shadow mode — verdict does not gate merges. See docs/REVIEW_SCHEMA.md.

buremba commented May 20, 2026

Closing — not merging this. It fixes the single-turn silent-hang/complete, but it is not correct for the multi-replica k8s deployment (ClientIP affinity, per-pod in-memory SseManager, competitive thread_response consume). The approaches here — an in-memory in-flight registry, and a queue-depth gate — are fundamentally single-process and were shown wrong (the gate even re-introduces the silent hang because the thread-message run completes on receipt, not reply).

The proper fix is Postgres-mediated and reuses existing patterns:

  1. Run-as-turn-lease — keep the per-turn run running+heartbeated until the worker replies; a dead worker's stale claim is recovered by any pod's sweep → terminal failure → notify. (Kills the in-memory edge cases.)
  2. Owner-routed thread_response for API/SSE — mirror ChatResponseBridge.canHandle so the pod holding the client's SSE claims the row; others re-queue. (Also fixes the pre-existing "API replies drop cross-pod" gap.)
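
The run-as-turn-lease idea in item 1 could look roughly like the model below. This is a sketch under heavy assumptions: an in-memory map stands in for the Postgres table, all names are hypothetical, and time is passed in explicitly. A real implementation would presumably do the sweep as a single transactional update so that two pods cannot both fail the same run.

```typescript
type RunStatus = "running" | "completed" | "failed";

interface TurnRun {
  messageId: string;
  status: RunStatus;
  heartbeatAt: number; // epoch milliseconds
}

// In-memory stand-in for a shared table of per-turn runs.
class TurnLeaseTable {
  private runs = new Map<string, TurnRun>();

  // Worker claims the turn when it starts handling the message.
  claim(messageId: string, now: number): void {
    this.runs.set(messageId, { messageId, status: "running", heartbeatAt: now });
  }

  // Worker refreshes the lease while it is still alive and processing.
  heartbeat(messageId: string, now: number): void {
    const run = this.runs.get(messageId);
    if (run && run.status === "running") run.heartbeatAt = now;
  }

  // Worker replied: the turn ends normally.
  complete(messageId: string): void {
    const run = this.runs.get(messageId);
    if (run && run.status === "running") run.status = "completed";
  }

  // Any pod may sweep. Stale running turns become failed and are returned
  // once, so the caller can emit exactly one terminal error per turn.
  sweep(now: number, staleAfterMs: number): string[] {
    const failed: string[] = [];
    this.runs.forEach((run) => {
      if (run.status === "running" && now - run.heartbeatAt > staleAfterMs) {
        run.status = "failed";
        failed.push(run.messageId);
      }
    });
    return failed;
  }
}
```

Because the lease is held until the reply rather than until receipt, a worker that dies mid-turn leaves a stale row that the next sweep turns into a terminal failure, with no per-process registry involved.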

Will be redone properly on a fresh branch with a mandatory 2-replica validation. The salvageable bits (the error-field rendering, the one-shot death-detection guard + intentionalExit) are noted in the handoff. #946 stays open.

The AGENTS.md multi-pod-correctness rule added on this branch should be preserved/re-landed.

Development

Successfully merging this pull request may close these issues.

Cloud chat returns silent complete with no content when agent has no platform connections
