Skip to content

fix(server): orchestration cleanup — dead dischargeTurn + cross-pod spawn gate#1068

Merged
buremba merged 2 commits into
mainfrom
feat/fix-orchestration-cleanup
May 26, 2026
Merged

fix(server): orchestration cleanup — dead dischargeTurn + cross-pod spawn gate#1068
buremba merged 2 commits into
mainfrom
feat/fix-orchestration-cleanup

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 26, 2026

Fixes three confirmed orchestration findings under packages/server/src/gateway/orchestration/.

Finding #17 (LOW, cleanup) — dead non-atomic dischargeTurn — FIXED

dischargeTurn was a standalone marker DELETE with no reply insert. Production discharge goes through commitTerminalReply (atomic delete+insert, wired at gateway/index.ts). git grep confirmed the only references were the definition + three test cases — no production caller.

  • Deleted the function.
  • Rewrote the three tests to discharge via the production commitTerminalReply path, so they exercise real behaviour.

Evidence: bun test .../turn-liveness.test.ts → 12 pass / 0 fail; server tsc --noEmit clean.

Finding #6 (MED, multi-replica) — cross-pod duplicate-spawn when org/conv missing — FIXED

spawnDeployment only took the cross-pod conversation lock (keyed on (org, agent, conversationId)) when both organizationId AND conversationId were present. When either was missing it silently skipped the lock and spawned an unguarded worker — two pods could both spawn for the same conversation, hydrate the same completed snapshot, and write divergent next snapshots (one reply silently wins).

Fix is scoped to exactly the dangerous case. A turn writes a shared snapshot only when it carries a runId (the worker's writeSnapshot bails otherwise — see the runId comment in MessageConsumer.handleMessage). So:

  • Snapshot mode + runId present + (org OR conv missing) → throw a re-queueable OrchestratorError (mirrors the existing lock-busy throw) instead of spawning unguarded.
  • Legacy direct-enqueue (runId undefined) and file-mode (LOBU_SESSION_STORE=file) turns never write a shared snapshot → unaffected. This keeps the existing unit-test/legacy paths (which have no org) working.

Reproducer (red → green)

Unit tests in embedded-deployment.test.ts (child_process mocked, no DB needed):

  • two-pod test: two EmbeddedDeploymentManager instances (two replicas) both receive the same org-less snapshot turn for the same conversation → both refuse, zero child spawns.
  • per-field refusal (org missing / conv missing), and legacy/file-mode pass-through.

Proven RED on origin/main source: the org-missing snapshot turn resolved and spawned a worker (Expected promise that rejects / Received promise that resolved). GREEN with the fix: refuses, 0 spawns. Full file: 28 pass / 0 fail.

Note on the gold-standard two-DbClient lock test: a DB-backed acquireConversationLock two-pod test is not runnable in the gateway bun:test harness — postgres-js sql.reserve() hangs there (which is exactly why the existing reserve-cap.test.ts deliberately stages the counter to short-circuit before reserve()). The bug #6 fixes is short-circuited before the DB lock anyway, so the spawn-level two-pod unit reproducer is the faithful red→green. The advisory lock itself is unchanged and already covered by reserve-cap.test.ts.

Finding #15 (MED, narrow) — failTurnsForDeployment over-deletes — BAILED (acceptable as-is)

Verified LEAVE-AS-IS. When a worker subprocess exits non-deliberately, it is the single drainer of thread_message_<deploymentName>; every pending marker for that deployment is a turn the now-dead process can no longer answer. Failing them all is correct. Scoping to a tracked in-flight subset would mask real failures: a turn enqueued to the worker's thread queue but not yet reflected in per-entry in-flight tracking (the worker pulls from the queue itself; the gateway doesn't observe each pull) would be left with a live marker and no process to answer it — re-introducing the silent hang this module exists to prevent (until the slower deadline sweep eventually catches it). The "different restarted worker" race isn't real: spawnDeployment early-returns while the old entry is still in the workers map, and the exit handler deletes the entry + runs failTurnsForDeployment before any new spawn could arm new markers. Net: not a correctness improvement; risks masking failures. Left as-is per the finding's bail condition.

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced validation for snapshot-mode deployment operations to ensure required identifiers are present before spawning workers, preventing invalid configurations from creating unnecessary processes.
  • Tests

    • Added test coverage for snapshot-mode worker spawning behavior across multiple scenarios.
    • Updated integration tests to use production deployment paths for improved test accuracy.

Review Change Stack

…ness

dischargeTurn was a standalone DELETE of the turn-liveness marker with no
reply insert. Production discharge has gone through commitTerminalReply
(atomic delete+insert) since it was wired in gateway/index.ts; the only
remaining references to dischargeTurn were its definition and three test
cases. Delete the dead function and rewrite those tests to discharge via
the production commitTerminalReply path instead, so they exercise real
behaviour rather than a non-atomic shortcut.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c30a2db1-14d5-40fb-86cd-4a56bba53a4b

📥 Commits

Reviewing files that changed from the base of the PR and between 56b6cff and 0d252bc.

📒 Files selected for processing (4)
  • packages/server/src/gateway/__tests__/embedded-deployment.test.ts
  • packages/server/src/gateway/__tests__/turn-liveness.test.ts
  • packages/server/src/gateway/orchestration/impl/embedded-deployment.ts
  • packages/server/src/gateway/orchestration/turn-liveness.ts
💤 Files with no reviewable changes (1)
  • packages/server/src/gateway/orchestration/turn-liveness.ts

📝 Walkthrough

Walkthrough

This PR adds validation gates to prevent snapshot-writing workers from spawning without required organization and conversation context, and refactors turn-liveness tests to exercise the production commitTerminalReply path instead of a test-only helper function that is now removed.

Changes

Snapshot-mode spawn gating and turn-liveness marker refactoring

Layer / File(s) Summary
Snapshot-mode spawn validation
packages/server/src/gateway/orchestration/impl/embedded-deployment.ts
spawnDeployment now computes writesSharedSnapshot from LOBU_SESSION_STORE and messageData.runId, and throws OrchestratorError when snapshot-mode workers would write shared snapshots without organizationId or conversationId.
Snapshot-mode spawn gating tests
packages/server/src/gateway/__tests__/embedded-deployment.test.ts
New "snapshot-mode cross-pod gate" test suite covers refusal to spawn when organizationId or conversationId is missing, legacy non-runId turns still spawning, file-mode behavior, and multi-pod scenarios asserting no duplicate spawns across replicas.
Turn-liveness marker cleanup
packages/server/src/gateway/orchestration/turn-liveness.ts
Removed exported dischargeTurn(deploymentName, messageId) function that previously deleted turn-timeout markers, allowing tests to transition to the production commitTerminalReply path.
Turn-liveness test refactoring
packages/server/src/gateway/__tests__/turn-liveness.test.ts
Updated integration tests to use production commitTerminalReply instead of dischargeTurn helper. Added shared reply(deploymentName, messageId) builder and updated "arm then discharge", "failTurnIfPending race", and marker-isolation tests with assertions reflecting the real terminal-reply flow and cleanup behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • lobu-ai/lobu#971: Directly related—this PR removes usage of dischargeTurn and updates tests to use commitTerminalReply for turn-liveness markers, which was the terminal signaling flow introduced in that PR's turn-liveness.ts reshaping.
  • lobu-ai/lobu#871: Related through snapshot-mode behavior adjustments in embedded-deployment.ts; this PR adds stricter spawn gating for shared-snapshot turns, whereas the retrieved PR adjusts how snapshot mode is propagated via LOBU_SESSION_STORE.

Poem

🐰 A gate guards the snapshot, a marker takes flight,
When runId is set, we must hold it just right—
No worker shall spawn if the context is spare,
The turn-liveness dances through commitTerminal air! 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes both main changes: removing dead dischargeTurn and adding a cross-pod spawn gate for snapshot-mode turns.
Description check ✅ Passed The description is comprehensive, covering all three findings with detailed rationale, test evidence, and a reproducer. It exceeds template requirements with substantial technical depth.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/fix-orchestration-cleanup

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

The cross-pod conversation lock — keyed on (org, agent, conversationId) —
is the only thing stopping two replicas from both hydrating the same
`completed` snapshot and writing divergent next snapshots (one reply
silently wins). spawnDeployment only ran that lock when both
organizationId AND conversationId were present; when either was missing it
silently SKIPPED the lock and spawned an unguarded worker, so two pods
could both spawn for the same conversation.

A turn only writes a SHARED snapshot when it carries a runId (the worker's
writeSnapshot bails otherwise). Scope the guard to exactly that case:
under snapshot mode, a runId-bearing turn missing org or conversationId
now throws a re-queueable OrchestratorError instead of spawning unguarded.
Legacy direct-enqueue / file-mode turns (no runId, or snapshot opted out)
never write a shared snapshot and are unaffected.

Reproducer: two-pod unit test (two managers) shows both replicas refuse an
org-less snapshot turn for the same conversation — zero duplicate spawns;
plus per-field refusal tests and legacy/file-mode pass-through tests.
@buremba buremba force-pushed the feat/fix-orchestration-cleanup branch from ff2475e to 0d252bc Compare May 26, 2026 03:57
@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 26, 2026

pi review verdict (gpt-5.5:high ×3)

metric value threshold
bug_free_confidence 84 ≥80 (target ≥90)
bugs 0 0
blockers 0 0
slop 12 ≤15
simplicity 82 ≥70
tests_adequate true true
behavior_change_risk medium
change_type fix

0 bugs, 0 blockers, tests adequate. The single suggested fix (cosmetic: drop a (finding #6) review-artifact label from a test header) has been applied. The 84 vs 90 gap is driven by the inherent medium behavior-change-risk of a multi-replica worker-spawn gate, not a defect.

Suites

  • server tsc --noEmit: clean (exit 0)
  • embedded-deployment.test.ts: 28 pass / 0 fail (incl. the two-pod cross-pod-gate reproducer)
  • turn-liveness.test.ts: 9 pass / 0 fail (dischargeTurn removal + commitTerminalReply rewrite)
  • reserve-cap.test.ts: 3 pass / 0 fail

Note: a second pi run to retry for >90 is blocked — the shared Codex gpt-5.5 backend hit usage_limit_reached (429, resets in ~3h). The score above is from the run that completed before the quota was exhausted.

@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 26, 2026

Final status — ready for review

Deterministic suites green; pi verdict bug_free 84, 0 bugs, 0 blockers.

84 reflects the inherent medium risk of a multi-replica spawn gate (0 bugs / 0 blockers), not a defect; fresh re-roll is quota-blocked (~4.2h). Diff: 4 files, +149/-47. Not merged.

@buremba buremba marked this pull request as ready for review May 26, 2026 04:46
@buremba buremba merged commit 7bf6d8a into main May 26, 2026
20 checks passed
@buremba buremba deleted the feat/fix-orchestration-cleanup branch May 26, 2026 13:20
buremba added a commit that referenced this pull request May 28, 2026
* fix(gateway): route lobu chat to org's default agent end-to-end

Closes #1129, #1133.

The 'isEphemeral' code path in createAgent — which generated a random UUID
and tried to auto-install system-key providers for any no-agentId chat —
was always broken: saveSettings was UPDATE-only and silently no-op'd
because the agents row had never been INSERTed. Even after that was
patched it would have polluted the DB with one phantom agent per chat
invocation, with no continuity across calls.

This PR removes the concept entirely and wires the existing per-org
default agent (`owletto-default`, provisioned at signup via
`ensureDefaultAgent`) into the chat path.

Changes:

1. `createAgent` (gateway/routes/public/agent.ts): when the caller
   omits agentId, look up `owletto-default` for the caller's org
   instead of minting a UUID. Two flows now: caller-pinned, or
   default-agent — no third.

2. `ensureDefaultAgent` (auth/default-provisioning.ts): also write
   `installed_providers` from the set of providers with system keys
   (env-var API keys, Claude OAuth-discovery) at boot, so the default
   agent is chat-ready out of the box. Resolve the personal-org owner
   from `organization.metadata.personal_org_for_user_id` and write the
   agent as user-owned + mirror into `agent_users` so the per-user
   ownership check in `verifyOwnedAgentAccess` recognizes the PAT
   session's user as the owner.

3. `createApiAuthMiddleware` (gateway/auth/api-auth-middleware.ts):
   plumb the caller's resolved org through (worker-token payload,
   /oauth/userinfo, settings session) by attaching `authContext` to
   the Hono context AND running the request inside
   `orgContext.run()`. This was the #1068 follow-up pi flagged: the
   PAT path's `createLobuAuthBridge` already wraps in orgContext, but
   the OAuth and worker-token paths didn't — so any handler that
   reached for AsyncLocalStorage-scoped org id would miss for those
   auth methods.

4. `ExternalAuthClient.fetchUserInfo` (gateway/auth/external/client.ts):
   surface the organization id resolved from /oauth/userinfo's
   organization_slug + organizations[] list (previously typed away).

5. `ChatResponseBridge` (gateway/connections/chat-response-bridge.ts):
   render the NO_MODEL_CONFIGURED error with a real link to the
   agent's settings page (`${publicWebUrl}/${orgSlug}/agents/${agentId}`)
   instead of vague 'ask an admin' text.

6. CLI dry-run help text (cli/src/index.ts): tighten from 'Process
   without persisting history' (misleading — history IS persisted) to
   'Skip side-effecting tool calls (sandbox writes, sdk_run mutations).'

7. `ThreadSession.isEphemeral` field deleted along with the
   wasEphemeral cleanup branch in DELETE /api/v1/agents — there are no
   ephemeral sessions anymore.

* fix(server): backfill default agent for legacy installs

Pi review caught this: orgs where ensureDefaultAgent ran before this PR
have the agents row but with owner_user_id = NULL and installed_providers
= []. The sentinel-fast-path would skip them, so legacy installs would
still hit 403 / 'No model configured' on lobu chat -c local.

Add an idempotent backfillDefaultAgent helper that runs unconditionally
on every ensureDefaultAgent call. It:

- Populates owner_platform = 'external' and owner_user_id from the
  personal-org marker when the row's owner is unset (legacy 'lobu', NULL).
- Adds the agent_users row for that owner so ownsAgent returns true.
- Appends missing system-key providers to installed_providers when the
  list is currently empty. Never removes existing entries; never
  overwrites a non-empty list (admins may have curated it).

No-op on freshly-inserted rows — those already have the correct shape.

* fix(gateway): close cross-tenant default-agent session collision (pi review)

Pi caught a real one: when the caller omits agentId, this PR resolves to
the global constant DEFAULT_AGENT_ID = 'owletto-default'. The userId
default downstream was `requestedUserId || agentId` — so for no-userId
requests userId also became 'owletto-default', and the session-store key
`${agentId}_${userId}` collapsed to the literal string
'owletto-default_owletto-default' identically across every tenant. Org B's
PAT session could then resume org A's session and inherit its worker
token.

Two-layer fix:

1. Derive userId from the authenticated caller via the new authContext
   (`requestedUserId || authContext.userId || agentId`). For the
   default-agent path, this makes conversationId per-user-per-org
   unique. For pinned agents the agentId fallback stays safe because
   pinned IDs are per-org scoped.

2. Defensive resume guard: even if a future caller bypasses the auth
   bridge or sends a colliding requestedUserId, refuse to resume
   when the existing session's organizationId doesn't match the
   current tokenOrganizationId — return 403 instead of leaking the
   worker token.

Also addressed pi's second finding (default-provisioning.ts:164,274):
replace the `metadata::jsonb` selects with `readOrgMetadata` so a row
with invalid legacy JSON doesn't abort provisioning/backfill.

* fix(gateway): tenant-scope conversationId for default + pinned agents (pi review)

Pi found another cross-tenant path: even with userId now derived from the
authenticated caller, two different orgs whose external OAuth tokens share
the same `sub` would still produce the same conversationId — and a
forceNew request from org B would overwrite org A's session at
setSession time (the resume guard only catches the resume path).

Append the tokenOrganizationId to the conversationId so the key is
tenant-unique end-to-end. Covers:
- default-agent path (agentId = DEFAULT_AGENT_ID is a global constant)
- pinned-agent path (per-org rows can share the same id string across
  tenants)

Watcher sessions retain their existing deterministic key (their userId
is already unique by virtue of `watcher_${watcherId}` and they bypass
the user-derived auth context).

* test(server): cover default-agent ownership + backfill paths

Pi flagged tests_adequate=false — no committed tests for the new
default-agent ownership stamping or the backfill path. Add two
integration tests in default-provisioning.test.ts:

1. `stamps owner_user_id + installed_providers + agent_users on insert`
   exercises the create path: marks the org as personal_org_for_user_id=<X>,
   then asserts the inserted agent row has owner_platform='external',
   owner_user_id=X, a populated installed_providers array, and a
   corresponding agent_users row.

2. `backfills owner + agent_users on a legacy row past the sentinel`
   simulates a pre-this-PR row (owner_platform='lobu', owner_user_id=NULL,
   installed_providers='[]', sentinel already set) and asserts that
   calling ensureDefaultAgent now heals it: owner fields updated to
   ('external', <ownerUserId>) and the agent_users row is added.

Also move the orphaned JSDoc back above ensureDefaultAgent (pi noted it
ended up over backfillDefaultAgent after the helper landed).

* fix(gateway): strip /lobu suffix from admin link in error message (pi review)

manager.publicGatewayUrl is the gateway base — in embedded mode the
gateway is mounted at /lobu under the web app (see server-lifecycle.ts:253:
`wrapper.route('/lobu', lobuApp)`), so it ends with /lobu. Admin UI
routes live at the web origin (`/<orgSlug>/agents/...`), not under
/lobu, so the link built by buildAgentSettingsUrl currently points to a
404 like http://localhost:8821/lobu/local-install/agents/owletto-default
instead of http://localhost:8821/local-install/agents/owletto-default.

Strip a trailing /lobu before composing the link.

* test(gateway): route-level tests for no-agentId default resolution

Pi flagged tests_adequate=false on the route — store-level coverage
landed but POST /api/v1/agents itself wasn't tested for the new
default-agent path. Add three tests to agent-session-create.test.ts:

1. empty body resolves to owletto-default, returns 201 with a session
   id of shape <agentId>_<userId>_<orgId> (the tenant suffix is what
   prevents cross-org session collisions when DEFAULT_AGENT_ID is a
   global constant).
2. empty body returns 404 when the caller's org has no default agent.
3. empty body returns 404 when the default agent exists but belongs
   to another org — must NOT leak.

All three use a real createApiAuthMiddleware path: the worker token
carries organizationId, the middleware stamps authContext, the
handler resolves through that. No DB needed; in-memory stubs match
the production interfaces.

* fix(gateway): fold tenant guard into requireAgentOwnership (pi review)

Pi caught a real cross-tenant exploit: GET/DELETE/SSE/messages routes
checked agent ownership (platform, userId, agentId) but NOT
session.organizationId. With agentIds repeating across tenants (the
global DEFAULT_AGENT_ID constant, or two orgs sharing an id string),
org B's caller could hit org A's sessionKey URL and read or delete
A's session because both pass ownership against their own agent-X.

I almost shipped four `denyOnCrossTenantSession` calls scattered next
to each ownership check. That's the workaround — the next route added
forgets the second call and the exploit is back. The structural fix is
to fold the tenant check INTO requireAgentOwnership: pass the session
(when available) and the helper denies on org mismatch before running
the ownership query. One function, one place, can't be missed.

createAgent has no pre-existing session and passes null; the other four
routes (GET, DELETE, SSE, /messages) pass the session they already
loaded for ownership resolution.

Added a route-level test (cross-tenant GET → 403) that asserts the
guard fires inside the auth helper itself, not at the call site, so a
future route handler that forgets to plumb the session still gets the
agent-level ownership check while route-level coverage flags the
plumbing gap.

* fix(gateway): keep watcher conversationId shape + close cookie-path tenant gap

Two follow-ups to the org-scoped conversationId / tenant-guard work in this PR:

1. Exempt watcher sessions from the orgScope suffix. It was spliced between
   userId and thread, turning ..._watcher_<id>_run_<id> into
   ..._watcher_<id>_<org>_run_<id> — which breaks watcher->worker dispatch: the
   worker session key AND the API/SSE owner-routing key (unified-thread-consumer)
   both derive from this conversationId and rely on the watcher_<id>_run_<id>
   shape. That reddened the sdk-e2e watcher gate (green on main). Watcher
   sessions are already globally unique via the DB-serial watcherId + runId, and
   tenant isolation rides session.organizationId (still set) + the route guard,
   not the string.

2. Enforce the tenant guard on the settings-session cookie path. The guard only
   fired when callerOrgId was set (PAT bridge / worker token / external OAuth),
   but the cookie path sets userId with no org, so the guard was a no-op there —
   and verifyOwnedAgentAccess authorizes on (platform, userId, agentId) and
   returns the caller's org, not the session's. A cookie session for org B could
   therefore read org A's session via the shared global DEFAULT_AGENT_ID. Now we
   compare the org ownership actually resolves to against the session's org and
   deny on a definite mismatch (undefined on either side falls through, so a
   legitimate same-org caller is never denied).

* test(gateway): cover cookie-path cross-tenant denial + watcher id shape

Two regression tests for the fixes in 6c040c3 (both verified red→green —
they fail on the pre-fix source and pass after):

- Cross-tenant GET via a settings-session COOKIE is denied with 403. The
  existing cross-tenant test only drives the worker-token path (which sets an
  org on authContext, so the up-front guard catches it). This drives the
  cookie path, where authContext has a userId but no org, so the up-front
  guard is a no-op — and exercises the resolved-org comparison that closes the
  leak. Pre-fix this GET returned 200 and served org A's session to org B.

- Watcher conversationId keeps the exact `<agentId>_watcher_<watcherId>_run_<runId>`
  shape and omits the org suffix, even though the agent's metadata carries an
  org (so a non-watcher path would add it). Pre-fix the id was
  `watcher-agent_watcher_5_org-watcher_run_27`, which breaks watcher->worker
  dispatch correlation (the sdk-e2e gate).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants