Skip to content

fix(gateway): create ephemeral agent row before saveSettings + loud-fail saveSettings on no-op#1133

Closed
buremba wants to merge 4 commits into
mainfrom
feat/fix-ephemeral-providers
Closed

fix(gateway): create ephemeral agent row before saveSettings + loud-fail saveSettings on no-op#1133
buremba wants to merge 4 commits into
mainfrom
feat/fix-ephemeral-providers

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 28, 2026

Summary

Two-part fix for the latent "ephemeral agent ends up with no provider" bug uncovered while validating the chat-org propagation fix (#1129).

Root defect: PostgresAgentConfigStore.saveSettings is UPDATE-only — no agents row, no update, no error. The "auto-provision system providers" call in createAgent for ephemeral agents hit exactly this: it logged "provisioned [claude, gemini, …]" but actually persisted nothing because the agents row didn't exist yet. Downstream, the worker's session-context resolved installedProviders = [] → no provider → "No model configured. Ask an admin to connect a provider for the base agent".

Fix 1 (the legitimate caller): createAgent now calls AgentMetadataStore.createAgent before saveSettings for ephemeral agents. saveMetadata is INSERT … ON CONFLICT DO UPDATE so the call is idempotent; re-creating the same id is benign.

Fix 2 (the footgun): saveSettings now throws when result.count === 0. Previous behavior was a silent success. Any future caller that saves settings before the agents row exists fails loud at the write site instead of as a confusing downstream "No model configured".

Test plan

Reproducer (origin/main + #1129 merged, pre-fix):

$ lobu run --port 8816
$ lobu chat -c local "ping"
Agent error: No model configured. Ask an admin to connect a provider for the base agent.

# gateway log:
[info] [agent-settings-store] Saved settings for agent da2299dc-…
[info] [agent-api] Ephemeral agent da2299dc-…: provisioned system providers [claude, gemini, z-ai, openrouter]
[info] [worker-gateway] Session context for da2299dc-…: provider: none, cliBackends: none

With this fix:

$ lobu chat -c local "ping"
❌ Session failed: 403 OAuth authentication is currently not allowed for this organization.

The "No model configured" is gone — the worker resolves the provider (claude is picked first), contacts Anthropic, and the request reaches the API. The remaining 403 is a separate, unrelated bug (claude.hasSystemKey() always returns true because it does OAuth, but no real OAuth/key is set in this install — different ticket).

  • bun run typecheck clean
  • cd packages/server && bunx tsc --noEmit clean
  • make build-packages green
  • E2E: lobu run boot → lobu chat -c local "ping" now reaches the provider call layer (previously stuck at "No model configured")

Stacks on top of #1129 — without that fix, the cross-pod lock blocks the chat before this path even runs.

Confidence

~85. Reproduced both bugs (Refusing to spawn worker then No model configured), applied both fixes, re-ran the same flow, confirmed both errors are gone. The behavior change of saveSettings throwing on no-op is small and targeted — full integration suite is being re-run as part of the review.

Follow-ups

  1. claude.hasSystemKey() heuristic — should return false when no ANTHROPIC_API_KEY and no OAuth state is present, so the provider-priority loop picks a usable provider (gemini / z-ai / openrouter) when only those have credentials. The current OAuth-assumed-true heuristic preempts that.
  2. Add an integration test for the full lobu chat -c local flow (install_operator → ephemeral agent → worker → provider call). The bug existed for weeks because no test exercised this path end-to-end.

View with Codesmith Autofix with Codesmith
Need help on this PR? Tag @codesmith with what you need. Autofix is disabled.

buremba added 4 commits May 28, 2026 17:14
 follow-up)

The cross-pod conversation-lock guard introduced in #1068 refuses to
spawn a worker when an org-less turn is enqueued. Ephemeral agents
created by 'lobu chat -c local' (no project / no metadata) fell into
that path: createAgent stamped tokenOrganizationId from the agent's
own metadata only, leaving it undefined for ephemeral agents — so the
session shipped to enqueueMessage with no organizationId and the
deployment manager threw 'Cannot acquire per-conversation lock'.

Fall back to the caller's organizationId resolved by createLobuAuthBridge
(from the PAT or Better Auth session). Also surface the same info via a
new ApiAuthContext that createApiAuthMiddleware sets after worker-token
or /oauth/userinfo validation, so consumers that don't sit behind the
Lobu auth bridge can read it too. ExternalAuthClient.fetchUserInfo now
forwards the organization_id resolved from /oauth/userinfo's
organization_slug + organizations[] list.

E2E: lobu run + lobu chat -c local 'ping' now reaches the worker (no
more 'Worker startup failed: cross-pod conversation lock requires
organizationId'). The remaining 'No model configured' error is a
separate ephemeral-provider-binding gap.
saveSettings on the Postgres-backed AgentConfigStore is UPDATE-only —
no row, no save. Ephemeral agents created via 'lobu chat -c local'
never had a corresponding row in the agents table, so the
'auto-provision system providers' UPDATE matched 0 rows silently.
Downstream, the worker's session-context resolved installedProviders
to [] → defaultProvider = none → 'No model configured. Ask an admin
to connect a provider for the base agent'.

Call AgentMetadataStore.createAgent before saveSettings so the row
exists. Same call site that already provisioned the providers, just
runs the underlying INSERT ... ON CONFLICT DO UPDATE one step
earlier.
The Postgres-backed AgentConfigStore.saveSettings was UPDATE-only and
silently returned success when 0 rows matched. That made it a footgun:
any caller that saved settings before the agents row existed (the
ephemeral-chat path being the case we hit) would get no error and no
persisted data — surfacing downstream as 'No model configured' once
the worker read installedProviders = [].

Make saveSettings throw when the UPDATE matches 0 rows. The previous
commit fixes the legitimate caller (createAgent for ephemeral agents
now calls saveMetadata first). This commit ensures future regressions
of the same pattern fail loud at the write site instead of materializing
as a confusing downstream error.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Warning

Review limit reached

@buremba, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 49 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 685370d5-6c50-4c57-a535-287064216ecd

📥 Commits

Reviewing files that changed from the base of the PR and between 2bc616f and 096dee1.

📒 Files selected for processing (4)
  • packages/server/src/gateway/auth/api-auth-middleware.ts
  • packages/server/src/gateway/auth/external/client.ts
  • packages/server/src/gateway/routes/public/agent.ts
  • packages/server/src/lobu/stores/postgres-stores.ts
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/fix-ephemeral-providers

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 28, 2026

bug_free 52, simplicity 62, slop 20, bugs 1, 1 blockers

Read diff and canonical logs: typecheck=0, unit=0, integration=0. Skipped extra server boot; reviewed the org-context path statically. External OAuth sets authContext org, but Postgres stores use orgContext, which is only established earlier from PAT/session user.

Blockers

  • external OAuth-authenticated ephemeral agent creation still lacks orgContext for Postgres-backed stores

Suggested fixes

File Line Change
packages/server/src/gateway/auth/api-auth-middleware.ts 83 When userInfo.organizationId is present, also set c.set("organizationId", userInfo.organizationId) and run next() inside orgContext.run({ organizationId: userInfo.organizationId }, ...), so createAgent/saveSettings see the org in AsyncLocalStorage.
packages/server/src/gateway/routes/public/agent.ts 675 Do not swallow createAgent errors; remove the catch or rethrow after logging. saveMetadata is already an upsert, and masking org/FK failures can let the route proceed without an agents row.
Full verdict JSON
{
  "bug_free_confidence": 52,
  "bugs": 1,
  "slop": 20,
  "simplicity": 62,
  "blockers": [
    "external OAuth-authenticated ephemeral agent creation still lacks orgContext for Postgres-backed stores"
  ],
  "change_type": "fix",
  "behavior_change_risk": "medium",
  "tests_adequate": false,
  "suggested_fixes": [
    {
      "file": "packages/server/src/gateway/auth/api-auth-middleware.ts",
      "line": 83,
      "change": "When userInfo.organizationId is present, also set c.set(\"organizationId\", userInfo.organizationId) and run next() inside orgContext.run({ organizationId: userInfo.organizationId }, ...), so createAgent/saveSettings see the org in AsyncLocalStorage."
    },
    {
      "file": "packages/server/src/gateway/routes/public/agent.ts",
      "line": 675,
      "change": "Do not swallow createAgent errors; remove the catch or rethrow after logging. saveMetadata is already an upsert, and masking org/FK failures can let the route proceed without an agents row."
    }
  ],
  "notes": "Read diff and canonical logs: typecheck=0, unit=0, integration=0. Skipped extra server boot; reviewed the org-context path statically. External OAuth sets authContext org, but Postgres stores use orgContext, which is only established earlier from PAT/session user.",
  "categories": {
    "src": 133,
    "tests": 0,
    "docs": 0,
    "config": 0,
    "deps": 0,
    "migrations": 0,
    "ci": 0,
    "generated": 0
  }
}

Local review gate — branch protection can require the pi-review commit status. See docs/REVIEW_SCHEMA.md.

@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 28, 2026

Replaced by #PENDING_NUMBER (will update). The 'ephemeral' path was the wrong shape — routing no-agentId chats to the org's default agent instead of minting phantom UUIDs is the correct fix.

@buremba buremba closed this May 28, 2026
buremba added a commit that referenced this pull request May 28, 2026
* fix(gateway): route lobu chat to org's default agent end-to-end

Closes #1129, #1133.

The 'isEphemeral' code path in createAgent — which generated a random UUID
and tried to auto-install system-key providers for any no-agentId chat —
was always broken: saveSettings was UPDATE-only and silently no-op'd
because the agents row had never been INSERTed. Even after that was
patched it would have polluted the DB with one phantom agent per chat
invocation, with no continuity across calls.

This PR removes the concept entirely and wires the existing per-org
default agent (`owletto-default`, provisioned at signup via
`ensureDefaultAgent`) into the chat path.

Changes:

1. `createAgent` (gateway/routes/public/agent.ts): when the caller
   omits agentId, look up `owletto-default` for the caller's org
   instead of minting a UUID. Two flows now: caller-pinned, or
   default-agent — no third.

2. `ensureDefaultAgent` (auth/default-provisioning.ts): also write
   `installed_providers` from the set of providers with system keys
   (env-var API keys, Claude OAuth-discovery) at boot, so the default
   agent is chat-ready out of the box. Resolve the personal-org owner
   from `organization.metadata.personal_org_for_user_id` and write the
   agent as user-owned + mirror into `agent_users` so the per-user
   ownership check in `verifyOwnedAgentAccess` recognizes the PAT
   session's user as the owner.

3. `createApiAuthMiddleware` (gateway/auth/api-auth-middleware.ts):
   plumb the caller's resolved org through (worker-token payload,
   /oauth/userinfo, settings session) by attaching `authContext` to
   the Hono context AND running the request inside
   `orgContext.run()`. This was the #1068 follow-up pi flagged: the
   PAT path's `createLobuAuthBridge` already wraps in orgContext, but
   the OAuth and worker-token paths didn't — so any handler that
   reached for AsyncLocalStorage-scoped org id would miss for those
   auth methods.

4. `ExternalAuthClient.fetchUserInfo` (gateway/auth/external/client.ts):
   surface the organization id resolved from /oauth/userinfo's
   organization_slug + organizations[] list (previously typed away).

5. `ChatResponseBridge` (gateway/connections/chat-response-bridge.ts):
   render the NO_MODEL_CONFIGURED error with a real link to the
   agent's settings page (`${publicWebUrl}/${orgSlug}/agents/${agentId}`)
   instead of vague 'ask an admin' text.

6. CLI dry-run help text (cli/src/index.ts): tighten from 'Process
   without persisting history' (misleading — history IS persisted) to
   'Skip side-effecting tool calls (sandbox writes, sdk_run mutations).'

7. `ThreadSession.isEphemeral` field deleted along with the
   wasEphemeral cleanup branch in DELETE /api/v1/agents — there are no
   ephemeral sessions anymore.

* fix(server): backfill default agent for legacy installs

Pi review caught this: orgs where ensureDefaultAgent ran before this PR
have the agents row but with owner_user_id = NULL and installed_providers
= []. The sentinel-fast-path would skip them, so legacy installs would
still hit 403 / 'No model configured' on lobu chat -c local.

Add an idempotent backfillDefaultAgent helper that runs unconditionally
on every ensureDefaultAgent call. It:

- Populates owner_platform = 'external' and owner_user_id from the
  personal-org marker when the row's owner is unset (legacy 'lobu', NULL).
- Adds the agent_users row for that owner so ownsAgent returns true.
- Appends missing system-key providers to installed_providers when the
  list is currently empty. Never removes existing entries; never
  overwrites a non-empty list (admins may have curated it).

No-op on freshly-inserted rows — those already have the correct shape.

* fix(gateway): close cross-tenant default-agent session collision (pi review)

Pi caught a real one: when the caller omits agentId, this PR resolves to
the global constant DEFAULT_AGENT_ID = 'owletto-default'. The userId
default downstream was `requestedUserId || agentId` — so for no-userId
requests userId also became 'owletto-default', and the session-store key
`${agentId}_${userId}` collapsed to the literal string
'owletto-default_owletto-default' identically across every tenant. Org B's
PAT session could then resume org A's session and inherit its worker
token.

Two-layer fix:

1. Derive userId from the authenticated caller via the new authContext
   (`requestedUserId || authContext.userId || agentId`). For the
   default-agent path, this makes conversationId per-user-per-org
   unique. For pinned agents the agentId fallback stays safe because
   pinned IDs are per-org scoped.

2. Defensive resume guard: even if a future caller bypasses the auth
   bridge or sends a colliding requestedUserId, refuse to resume
   when the existing session's organizationId doesn't match the
   current tokenOrganizationId — return 403 instead of leaking the
   worker token.

Also addressed pi's second finding (default-provisioning.ts:164,274):
replace the `metadata::jsonb` selects with `readOrgMetadata` so a row
with invalid legacy JSON doesn't abort provisioning/backfill.

* fix(gateway): tenant-scope conversationId for default + pinned agents (pi review)

Pi found another cross-tenant path: even with userId now derived from the
authenticated caller, two different orgs whose external OAuth tokens share
the same `sub` would still produce the same conversationId — and a
forceNew request from org B would overwrite org A's session at
setSession time (the resume guard only catches the resume path).

Append the tokenOrganizationId to the conversationId so the key is
tenant-unique end-to-end. Covers:
- default-agent path (agentId = DEFAULT_AGENT_ID is a global constant)
- pinned-agent path (per-org rows can share the same id string across
  tenants)

Watcher sessions retain their existing deterministic key (their userId
is already unique by virtue of `watcher_${watcherId}` and they bypass
the user-derived auth context).

* test(server): cover default-agent ownership + backfill paths

Pi flagged tests_adequate=false — no committed tests for the new
default-agent ownership stamping or the backfill path. Add two
integration tests in default-provisioning.test.ts:

1. `stamps owner_user_id + installed_providers + agent_users on insert`
   exercises the create path: marks the org as personal_org_for_user_id=<X>,
   then asserts the inserted agent row has owner_platform='external',
   owner_user_id=X, a populated installed_providers array, and a
   corresponding agent_users row.

2. `backfills owner + agent_users on a legacy row past the sentinel`
   simulates a pre-this-PR row (owner_platform='lobu', owner_user_id=NULL,
   installed_providers='[]', sentinel already set) and asserts that
   calling ensureDefaultAgent now heals it: owner fields updated to
   ('external', <ownerUserId>) and the agent_users row is added.

Also move the orphaned JSDoc back above ensureDefaultAgent (pi noted it
ended up over backfillDefaultAgent after the helper landed).

* fix(gateway): strip /lobu suffix from admin link in error message (pi review)

manager.publicGatewayUrl is the gateway base — in embedded mode the
gateway is mounted at /lobu under the web app (see server-lifecycle.ts:253:
`wrapper.route('/lobu', lobuApp)`), so it ends with /lobu. Admin UI
routes live at the web origin (`/<orgSlug>/agents/...`), not under
/lobu, so the link built by buildAgentSettingsUrl currently points to a
404 like http://localhost:8821/lobu/local-install/agents/owletto-default
instead of http://localhost:8821/local-install/agents/owletto-default.

Strip a trailing /lobu before composing the link.

* test(gateway): route-level tests for no-agentId default resolution

Pi flagged tests_adequate=false on the route — store-level coverage
landed but POST /api/v1/agents itself wasn't tested for the new
default-agent path. Add three tests to agent-session-create.test.ts:

1. empty body resolves to owletto-default, returns 201 with a session
   id of shape <agentId>_<userId>_<orgId> (the tenant suffix is what
   prevents cross-org session collisions when DEFAULT_AGENT_ID is a
   global constant).
2. empty body returns 404 when the caller's org has no default agent.
3. empty body returns 404 when the default agent exists but belongs
   to another org — must NOT leak.

All three use a real createApiAuthMiddleware path: the worker token
carries organizationId, the middleware stamps authContext, the
handler resolves through that. No DB needed; in-memory stubs match
the production interfaces.

* fix(gateway): fold tenant guard into requireAgentOwnership (pi review)

Pi caught a real cross-tenant exploit: GET/DELETE/SSE/messages routes
checked agent ownership (platform, userId, agentId) but NOT
session.organizationId. With agentIds repeating across tenants (the
global DEFAULT_AGENT_ID constant, or two orgs sharing an id string),
org B's caller could hit org A's sessionKey URL and read or delete
A's session because both pass ownership against their own agent-X.

I almost shipped four `denyOnCrossTenantSession` calls scattered next
to each ownership check. That's the workaround — the next route added
forgets the second call and the exploit is back. The structural fix is
to fold the tenant check INTO requireAgentOwnership: pass the session
(when available) and the helper denies on org mismatch before running
the ownership query. One function, one place, can't be missed.

createAgent has no pre-existing session and passes null; the other four
routes (GET, DELETE, SSE, /messages) pass the session they already
loaded for ownership resolution.

Added a route-level test (cross-tenant GET → 403) that asserts the
guard fires inside the auth helper itself, not at the call site, so a
future route handler that forgets to plumb the session still gets the
agent-level ownership check while route-level coverage flags the
plumbing gap.

* fix(gateway): keep watcher conversationId shape + close cookie-path tenant gap

Two follow-ups to the org-scoped conversationId / tenant-guard work in this PR:

1. Exempt watcher sessions from the orgScope suffix. It was spliced between
   userId and thread, turning ..._watcher_<id>_run_<id> into
   ..._watcher_<id>_<org>_run_<id> — which breaks watcher->worker dispatch: the
   worker session key AND the API/SSE owner-routing key (unified-thread-consumer)
   both derive from this conversationId and rely on the watcher_<id>_run_<id>
   shape. That reddened the sdk-e2e watcher gate (green on main). Watcher
   sessions are already globally unique via the DB-serial watcherId + runId, and
   tenant isolation rides session.organizationId (still set) + the route guard,
   not the string.

2. Enforce the tenant guard on the settings-session cookie path. The guard only
   fired when callerOrgId was set (PAT bridge / worker token / external OAuth),
   but the cookie path sets userId with no org, so the guard was a no-op there —
   and verifyOwnedAgentAccess authorizes on (platform, userId, agentId) and
   returns the caller's org, not the session's. A cookie session for org B could
   therefore read org A's session via the shared global DEFAULT_AGENT_ID. Now we
   compare the org ownership actually resolves to against the session's org and
   deny on a definite mismatch (undefined on either side falls through, so a
   legitimate same-org caller is never denied).

* test(gateway): cover cookie-path cross-tenant denial + watcher id shape

Two regression tests for the fixes in 6c040c3 (both verified red→green —
they fail on the pre-fix source and pass after):

- Cross-tenant GET via a settings-session COOKIE is denied with 403. The
  existing cross-tenant test only drives the worker-token path (which sets an
  org on authContext, so the up-front guard catches it). This drives the
  cookie path, where authContext has a userId but no org, so the up-front
  guard is a no-op — and exercises the resolved-org comparison that closes the
  leak. Pre-fix this GET returned 200 and served org A's session to org B.

- Watcher conversationId keeps the exact `<agentId>_watcher_<watcherId>_run_<runId>`
  shape and omits the org suffix, even though the agent's metadata carries an
  org (so a non-watcher path would add it). Pre-fix the id was
  `watcher-agent_watcher_5_org-watcher_run_27`, which breaks watcher->worker
  dispatch correlation (the sdk-e2e gate).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants