
fix(server): scope tenant boundaries across egress judge, secret proxy, and oauth state #836

Merged
buremba merged 3 commits into main from fix/multi-tenant-isolation
May 18, 2026


@buremba buremba commented May 18, 2026

Summary

Closes four cross-org leakage paths surfaced by a multi-tenant audit:

  • Egress judge cache: the key now includes orgId so org A's verdict cannot satisfy org B's identical request. organizationId is plumbed onto WorkerTokenData (and through generateWorkerToken / base-deployment-manager.ts / agent-threads.ts), then read by the HTTP proxy and fed into VerdictCache.key.
  • Secret proxy: SecretMapping carries organizationId; the new module-level lookupPlaceholderMapping(placeholder, expectedOrganizationId?) rejects an org mismatch the same way as a missing mapping (log + return null, never throw). generatePlaceholder() accepts an organizationId option, set at the deployment-manager call site.
  • OAuth install state: SlackInstallStateData now carries organizationId. /slack/install reads the active org from the session/Hono context (rejects with 401 if absent) and stamps it into the state. /slack/oauth_callback rejects with 403 when the callback session's active org doesn't match the install-state's org.
  • Telegram polling in cloud: ChatInstanceManager.addConnection rejects mode: "polling" when LOBU_CLOUD_MODE=1. Self-hosters (LOBU_CLOUD_MODE unset/0) keep the polling option for tunnel-less dev. mode: "auto" is unaffected because it resolves to webhook whenever the gateway has a public URL, which cloud always has. AGENTS.md updated to document the cloud constraint.
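
To illustrate the first bullet, here is a minimal sketch of an org-scoped verdict key. The tuple fields (policyHash, hostname, method, path) come from the test plan; the function name `verdictKey`, the `VerdictKeyParts` type, and the JSON encoding are assumptions made for illustration and are not taken from the real VerdictCache.

```typescript
// Hypothetical shape: field names follow the PR text, not the real source.
interface VerdictKeyParts {
  organizationId: string;
  policyHash: string;
  hostname: string;
  method: string;
  path: string;
}

// Encoding the tuple as a JSON array keeps components unambiguous even if
// one of them happens to contain a delimiter character.
function verdictKey(parts: VerdictKeyParts): string {
  return JSON.stringify([
    parts.organizationId,
    parts.policyHash,
    parts.hostname,
    parts.method.toUpperCase(),
    parts.path,
  ]);
}

// Two orgs issuing the identical request get different cache keys.
const request = {
  policyHash: "p1",
  hostname: "api.example.com",
  method: "GET",
  path: "/v1/items",
};
const keyForOrgA = verdictKey({ organizationId: "org-a", ...request });
const keyForOrgB = verdictKey({ organizationId: "org-b", ...request });
```

With the org id as the leading component, a verdict cached for org A can never be returned for org B, even when every other field matches.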

Test plan

  • make typecheck passes (strict — matches the prod Dockerfile)
  • make build-packages passes (full bundle build clean)
  • bun test src/gateway/__tests__/ — 798 tests pass
  • Unit test: secret-proxy lookupPlaceholderMapping returns null on org mismatch, returns mapping when org matches or no expectation is supplied, falls through when the mapping has no org tag (legacy)
  • Unit test: VerdictCache key — different orgId produces a different key for identical (policyHash, hostname, method, path)
  • Unit test: Slack /slack/install returns 401 with no session org; /slack/oauth_callback returns 403 when callback session org differs from install-state org
  • Manual: trying to create a Telegram polling connection with LOBU_CLOUD_MODE=1 returns a clear error
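
The lookup semantics listed in the secret-proxy unit test can be sketched as follows. This is an illustrative stand-in, not the module's code: the in-memory `mappings` map, the `warnings` array (standing in for the logger), and the `secretName` field are assumptions; only the function name, its two parameters, and the null-on-mismatch behaviour come from the PR text.

```typescript
interface SecretMapping {
  agentId: string;
  secretName: string; // hypothetical field for illustration
  organizationId?: string; // absent on legacy mappings
}

// Stand-ins for the real store and logger (assumptions).
const mappings = new Map<string, SecretMapping>();
const warnings: string[] = [];

function lookupPlaceholderMapping(
  placeholder: string,
  expectedOrganizationId?: string,
): SecretMapping | null {
  const mapping = mappings.get(placeholder);
  if (!mapping) {
    warnings.push(`unknown placeholder ${placeholder}`);
    return null;
  }
  // Enforce the org only when the caller supplied an expectation AND the
  // mapping carries an org tag; legacy untagged mappings fall through.
  if (
    expectedOrganizationId !== undefined &&
    mapping.organizationId !== undefined &&
    mapping.organizationId !== expectedOrganizationId
  ) {
    // Same outcome as a missing mapping: log and return null, never throw.
    warnings.push(`org mismatch for placeholder ${placeholder}`);
    return null;
  }
  return mapping;
}

mappings.set("ph-tagged", { agentId: "a1", secretName: "API_KEY", organizationId: "org-a" });
mappings.set("ph-legacy", { agentId: "a2", secretName: "API_KEY" });
```

Returning null for a mismatch, rather than a distinct error, means a caller cannot tell a foreign placeholder apart from a nonexistent one.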

Summary by CodeRabbit

  • Bug Fixes

    • Telegram polling mode is rejected in cloud deployments to prevent connection/startup errors.
  • Improvements

    • Multi-tenant isolation: verdict caching, policy storage, secret placeholder resolution, and proxy/domain checks are now scoped per organization.
    • OAuth flows and install/state handling enforce organization matching.
    • Worker tokens and agent creation include organization scoping.
  • Documentation

    • Clarified cloud polling behavior in agent docs.
  • Tests

    • Expanded multi-tenant isolation and security regression tests.


…y, and oauth state

Close four cross-org leakage paths surfaced by a multi-tenant audit:

- Egress judge cache key now includes the worker's org id so org A's
  verdict for `api.example.com` cannot satisfy org B's identical request.
  Plumbs `organizationId` onto WorkerTokenData and through the proxy.
- SecretMapping carries `organizationId`; new `lookupPlaceholderMapping`
  rejects mismatches the same way as a missing mapping (log + null).
- SlackInstallStateData now carries `organizationId`; `/slack/install`
  refuses anonymous sessions, `/slack/oauth_callback` rejects when the
  callback session's org doesn't match the install state's.
- ChatInstanceManager.addConnection rejects Telegram `mode: "polling"`
  when `LOBU_CLOUD_MODE=1`. Self-hosters (default) keep polling for
  tunnel-less dev. Documented in AGENTS.md.
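
The Telegram guard described in the last bullet could look roughly like this. The helper names isCloudMode and isPollingTelegramMode appear later in this PR; passing the environment in as a parameter, the `assertConnectionAllowed` wrapper, and the error text are assumptions made so the sketch is self-contained.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: the real isCloudMode() reads the process environment itself;
// here the environment is passed in so the sketch has no runtime dependency.
function isCloudMode(env: Env): boolean {
  return env.LOBU_CLOUD_MODE === "1";
}

function isPollingTelegramMode(config: { mode?: string }): boolean {
  return config.mode === "polling";
}

// Hypothetical wrapper around the check addConnection performs.
function assertConnectionAllowed(
  platform: string,
  config: { mode?: string },
  env: Env,
): void {
  if (platform === "telegram" && isCloudMode(env) && isPollingTelegramMode(config)) {
    // Error text is illustrative, not the message used in the repository.
    throw new Error('Telegram mode "polling" is not available in cloud mode');
  }
}

// Small helper used to observe the outcome without a test framework.
function isRejected(platform: string, config: { mode?: string }, env: Env): boolean {
  try {
    assertConnectionAllowed(platform, config, env);
    return false;
  } catch {
    return true;
  }
}
```

Only the explicit polling opt-in is blocked; `mode: "auto"` and self-hosted deployments (LOBU_CLOUD_MODE unset or 0) pass through unchanged.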

coderabbitai Bot commented May 18, 2026

📝 Walkthrough


Threads organizationId through worker tokens, verdict cache keys, proxy grant checks, secret placeholder resolution, OAuth install state, policy storage/hashing, and deployment wiring; adds a cloud-mode guard disallowing Telegram polling and updates/extends tests and reproducers for tenant isolation.

Changes

Organization-Scoped Multi-Tenant Isolation

| Layer | File(s) | Summary |
|---|---|---|
| Worker token organization context | `packages/core/src/worker/auth.ts` | WorkerTokenData gains optional organizationId field; generateWorkerToken accepts and persists it in the encrypted token payload. |
| Service integration of organization tokens | `packages/server/src/gateway/orchestration/base-deployment-manager.ts`, `packages/server/src/gateway/services/agent-threads.ts`, `packages/server/src/gateway/services/core-services.ts` | Deployment manager and thread creation pass organizationId into token generation and secret persistence; CoreServices provides agent→org resolver for secret proxy. |
| Egress judge verdict cache org isolation | `packages/server/src/gateway/proxy/egress-judge/types.ts`, `packages/server/src/gateway/proxy/egress-judge/cache.ts`, `packages/server/src/gateway/proxy/egress-judge/judge.ts` | JudgeRequest adds organizationId; VerdictCache.key includes orgId; decide() scopes verdicts per organization. |
| HTTP proxy organization forwarding to judge & grants | `packages/server/src/gateway/proxy/http-proxy.ts` | Exports setProxyGrantStore; checkDomainAccess and grant checks accept organizationId; CONNECT and HTTP handlers pass tokenData.organizationId to deny/allow checks and judge evaluation. |
| Secret placeholder organization scoping | `packages/server/src/gateway/proxy/secret-proxy.ts` | Introduces AgentOrgResolver, exported lookupPlaceholderMapping, SecretMapping.organizationId, org-aware placeholder lookup/swap, and generatePlaceholder storing organizationId. |
| OAuth install org binding and Slack routes validation | `packages/server/src/gateway/auth/oauth/state-store.ts`, `packages/server/src/gateway/routes/public/slack.ts` | Adds OAuthStateStore.peek; SlackInstallStateData stores organizationId; /slack/install binds an org to the state and /slack/oauth_callback validates the callback session org matches the saved state. |
| PolicyStore org scoping & hashing | `packages/server/src/gateway/permissions/policy-store.ts` | PolicyStore keys bundles by (organizationId, agentId); prepareBundle and hashPolicy include organizationId to isolate policy hashes per org. |
| Cloud mode Telegram polling guard | `packages/server/src/gateway/connections/chat-instance-manager.ts`, `AGENTS.md` | Adds isCloudMode() and isPollingTelegramMode(); existing polling connections are errored in cloud-mode on initialize and new polling-mode Telegram connections are rejected; AGENTS.md documents polling restriction in cloud mode. |
| Comprehensive org-scoping test coverage & reproducers | `packages/server/src/__tests__/*`, `packages/server/src/gateway/__tests__/*` | Tests updated and added to assert org-scoped verdict keys, policy store isolation, secret placeholder enforcement, grant-store scoping, OAuth org checks, agent API token stamping, and a multi-tenant isolation reproducer suite. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • lobu-ai/lobu#734: Schema/migration work adding organization_id and related indexes that underpin org-scoped storage changes.
  • lobu-ai/lobu#750: Related per-organization scoping and storage changes for grants and agent access.
  • lobu-ai/lobu#820: Overlap on worker token format and verification changes touching token payloads.

Suggested labels

skip-size-check

"🐇 I stitched each token with an org-bound thread,
caches whisper secrets kept inside their bed;
polling guarded where tenants share the sky,
installs and swaps now check who holds the tie. 🎉"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 51.85% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: scoping tenant boundaries across three major systems (egress judge, secret proxy, OAuth state). |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured with clear summary bullets, detailed test plan (mostly checked), and helpful notes; it exceeds the template requirements. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/server/src/gateway/services/agent-threads.ts (1)

72-83: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate organizationId through the session/message path, not only the initial token.

organizationId is captured and embedded in the bootstrap token, but it is dropped before enqueue (ThreadSession object + enqueueMessage payload). That leaves downstream processing without tenant context for internally-created threads.

Suggested fix
@@
   const session: ThreadSession = {
@@
     agentId,
+    organizationId,
     dryRun: false,
     isEphemeral: false,
   };
@@
   const jobId = await queueProducer.enqueueMessage({
@@
     agentId: realAgentId,
+    organizationId: session.organizationId,
     botId: "lobu-api",

If ThreadSession does not already define this field, add organizationId?: string in packages/server/src/gateway/session.ts too.

Also applies to: 89-101, 152-173

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/services/agent-threads.ts` around lines 72 - 83,
The bootstrap code currently includes organizationId only in the token (via
generateWorkerToken) but never forwards it into the ThreadSession or the payload
passed to enqueueMessage; update the creation flow to attach organizationId to
the ThreadSession and to any message/enqueue payloads so downstream workers have
tenant context. Specifically, add organizationId?: string to the ThreadSession
type in session.ts (if missing), set ThreadSession.organizationId =
organizationId where the session is constructed (the block using threadId,
conversationId, channelId, deploymentName), and include organizationId in the
object passed into enqueueMessage (and similar payloads around lines 89-101 and
152-173) so the value flows with generateWorkerToken into downstream processing.
packages/server/src/gateway/proxy/secret-proxy.ts (1)

686-706: ⚠️ Potential issue | 🔴 Critical

Test files have not been fully updated to use the new options-object API.

The production code in base-deployment-manager.ts (line 760) correctly uses the new signature with the options object including organizationId. However, multiple test files still call generatePlaceholder with the old positional signature:

  • packages/server/src/gateway/__tests__/secret-proxy.test.ts: 6 calls at lines 27, 59, 69, 79, 85, 139
  • packages/server/src/gateway/__tests__/secret-proxy-harden.test.ts: 1 call at line 630
  • packages/server/src/__tests__/unit/secret-proxy-lifecycle.test.ts: 5 calls at lines 99, 127, 152, 180, 205

These 12 test calls pass only 4 arguments and must be updated to pass the options object as the 5th parameter: { ttlSeconds?, organizationId? }. Three test cases in secret-proxy.test.ts (lines 101, 114, 126) have already been correctly updated with { organizationId: "org-a" }.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/proxy/secret-proxy.ts` around lines 686 - 706,
Tests still call generatePlaceholder using the old positional 4-argument
signature; update each call to pass the options object as the 5th argument
(options: { ttlSeconds?: number; organizationId?: string }) instead of supplying
organizationId positionally. Specifically, update the 12 failing calls in the
test suites that reference generatePlaceholder in
packages/server/src/gateway/__tests__/secret-proxy.test.ts (6 calls),
packages/server/src/gateway/__tests__/secret-proxy-harden.test.ts (1 call), and
packages/server/src/__tests__/unit/secret-proxy-lifecycle.test.ts (5 calls) so
they pass an options object (e.g., { organizationId: "org-a", ttlSeconds: X }
when needed); locate usages of generatePlaceholder and adjust arguments
accordingly to match the new signature used by storeSecretMapping and
base-deployment-manager.
🧹 Nitpick comments (1)
packages/server/src/gateway/connections/chat-instance-manager.ts (1)

112-120: 💤 Low value

Consider using isTelegramConfig type guard for consistency.

The file already uses the isTelegramConfig type guard pattern at lines 660, 780, 816, and 1121. Using it here would eliminate the type cast and the separate isPollingTelegramMode helper while maintaining the same semantics:

♻️ Refactor for consistency
-/**
- * `mode: "polling"` is the only config that forces long-polling regardless
- * of whether the gateway has a public webhook URL. `mode: "auto"` resolves
- * to webhook on cloud (publicGatewayUrl is always set there), so it's fine
- * to allow. Only the explicit polling opt-in is rejected in cloud.
- */
-function isPollingTelegramMode(config: { mode?: string }): boolean {
-  return config.mode === "polling";
-}
-
 const ADAPTER_FACTORIES: Record<string, (config: any) => Promise<any>> = {
+    // `mode: "polling"` is the only config that forces long-polling regardless
+    // of whether the gateway has a public webhook URL. `mode: "auto"` resolves
+    // to webhook on cloud (publicGatewayUrl is always set there), so it's fine
+    // to allow. Only the explicit polling opt-in is rejected in cloud.
     if (
       platform === "telegram" &&
+      isTelegramConfig(config) &&
       isCloudMode() &&
-      isPollingTelegramMode(config as { mode?: string })
+      config.mode === "polling"
     ) {

Also applies to: 241-257

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/connections/chat-instance-manager.ts` around
lines 112 - 120, Replace the custom isPollingTelegramMode helper with the
existing isTelegramConfig type guard to keep type-safety and consistency: where
the code currently calls isPollingTelegramMode(config) (and in the other
occurrence at the 241-257 block), change the check to use
isTelegramConfig(config) && config.mode === "polling" (or inline the same
condition) so TypeScript recognizes config as TelegramConfig and you can remove
the isPollingTelegramMode function altogether; search for the symbol
isPollingTelegramMode and update those call sites to reference isTelegramConfig
and the mode comparison instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/gateway/proxy/http-proxy.ts`:
- Around line 133-134: The code currently uses organizationId ?? "" as a
fallback key which groups all pre-pivot tokens into a single legacy cache
bucket; instead, change the fallback to a per-token or per-deployment identifier
so verdicts are scoped to the token/deployment (not a constant empty string).
Locate usages of organizationId (the parameter named organizationId in this file
and the cache-key formation sites around the mentioned regions) and replace the
empty-string fallback with a stable per-token/deployment fallback (for example
use the token id, token fingerprint, deploymentId, or a composed fallback like
"deployment:<deploymentId>" when organizationId is missing) so each pre-pivot
token gets its own cache scope; apply this change to the occurrences referenced
(organizationId usage at the start and the other three locations).
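
A minimal sketch of the fallback this comment proposes, under stated assumptions: `cacheScope` and the `TokenScopeInput` type are hypothetical names, and `deploymentId` stands in for whichever stable per-token or per-deployment identifier the real token payload exposes.

```typescript
// Hypothetical subset of the worker token payload.
interface TokenScopeInput {
  organizationId?: string;
  deploymentId: string; // assumption: some stable per-deployment identifier
}

// Prefixing each branch keeps the two namespaces apart, so an org id can
// never collide with a deployment id that happens to share the same text.
function cacheScope(token: TokenScopeInput): string {
  return token.organizationId
    ? `org:${token.organizationId}`
    : `deployment:${token.deploymentId}`;
}

// Two pre-pivot tokens (no org id) no longer share one legacy bucket.
const legacyScopeA = cacheScope({ deploymentId: "dep-1" });
const legacyScopeB = cacheScope({ deploymentId: "dep-2" });
const orgScope = cacheScope({ organizationId: "org-a", deploymentId: "dep-1" });
```

Compared with `organizationId ?? ""`, each legacy token lands in its own scope instead of a single shared empty-string scope.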


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 89b0f268-015a-4399-b464-722e418be3e1

📥 Commits

Reviewing files that changed from the base of the PR and between be8166c and 07ef8c7.

📒 Files selected for processing (18)
  • AGENTS.md
  • packages/core/src/worker/auth.ts
  • packages/server/src/__tests__/unit/egress-judge-timeout.test.ts
  • packages/server/src/gateway/__tests__/egress-judge-cache.test.ts
  • packages/server/src/gateway/__tests__/proxy-hardening.test.ts
  • packages/server/src/gateway/__tests__/rest-api-hardening.test.ts
  • packages/server/src/gateway/__tests__/secret-proxy.test.ts
  • packages/server/src/gateway/__tests__/slack-routes.test.ts
  • packages/server/src/gateway/auth/oauth/state-store.ts
  • packages/server/src/gateway/connections/chat-instance-manager.ts
  • packages/server/src/gateway/orchestration/base-deployment-manager.ts
  • packages/server/src/gateway/proxy/egress-judge/cache.ts
  • packages/server/src/gateway/proxy/egress-judge/judge.ts
  • packages/server/src/gateway/proxy/egress-judge/types.ts
  • packages/server/src/gateway/proxy/http-proxy.ts
  • packages/server/src/gateway/proxy/secret-proxy.ts
  • packages/server/src/gateway/routes/public/slack.ts
  • packages/server/src/gateway/services/agent-threads.ts

Comment thread packages/server/src/gateway/proxy/http-proxy.ts
…eam callers

Addresses six findings on the prior commit. Three were critical because
the new org parameter sat unused at the lookup layer — the isolation
guarantee the commit advertised never reached the call sites:

1. **secret-proxy lookupPlaceholderMapping was dead code.** Wired an
   `agentOrgResolver` (DB lookup keyed by URL agentId) into SecretProxy
   and pass `expectedOrganizationId` through `forward()` →
   `lookupPlaceholderMapping`. Worker tokens carrying `organizationId`
   are also extracted (header/query/bearer) for a signed signal that
   beats the DB lookup.

2. **checkDomainAccess dropped org on grant-store calls.** Pass
   `tokenData.organizationId` through to
   `GrantStore.isDenied/hasGrant`. Added `setProxyGrantStore` so tests
   can install a real store and exercise the call site. PolicyStore is
   now gated on `organizationId` being present — falling through to an
   unkeyed lookup would let another tenant's policy decide our verdict.

3. **PolicyStore is now keyed by `(organizationId, agentId)`.** Last
   sync no longer wins across tenants; `policyHash` includes the org id
   so the verdict cache scoping in #836 stays meaningful.

4. **Telegram cloud-mode guard also runs in `initialize()`.** Persisted
   `mode: "polling"` rows are marked errored at boot (under their own
   org context) instead of silently starting. `addConnection()` still
   rejects the same config at create time.

5. **Slack `/slack/install` self-host fallback.** When neither
   `c.get('organizationId')` nor `session.activeOrganizationId` is set,
   look up the org table — if exactly one org row exists, use it.
   Otherwise reject. Standalone deployments without the lobuApp wrapper
   stay usable; multi-tenant deployments stay strict.

6. **OAuth callback peek-before-consume.** Added `OAuthStateStore.peek()`.
   The Slack callback now validates the session org against the state's
   org *before* burning the row; a legitimate caller can retry after a
   cross-org or unauthenticated hit instead of restarting OAuth.

Adds `multi-tenant-isolation-reproducers.test.ts` with 10 red→green
tests that pin findings 1–4 (finding 6 is covered by an updated
assertion in slack-routes.test.ts that the row is preserved).
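
Finding 3 can be illustrated with a small in-memory store keyed by the (organizationId, agentId) pair. The method signatures set(orgId, agentId, bundle), clear(orgId, agentId), and resolve(orgId, agentId, hostname) come from this PR; the `PolicyBundle` shape, the `Verdict` type, and the JSON key encoding are assumptions.

```typescript
type Verdict = "allow" | "deny";

// Hypothetical bundle shape: a hostname-to-verdict table.
interface PolicyBundle {
  rules: Record<string, Verdict>;
}

class PolicyStore {
  private bundles = new Map<string, PolicyBundle>();

  // Composite key: the same agent id under two orgs maps to two slots.
  private key(orgId: string, agentId: string): string {
    return JSON.stringify([orgId, agentId]);
  }

  set(orgId: string, agentId: string, bundle: PolicyBundle): void {
    this.bundles.set(this.key(orgId, agentId), bundle);
  }

  clear(orgId: string, agentId: string): void {
    this.bundles.delete(this.key(orgId, agentId));
  }

  // Returns undefined when this org has no bundle for the agent; there is
  // no fall-through to a sibling tenant's bundle.
  resolve(orgId: string, agentId: string, hostname: string): Verdict | undefined {
    return this.bundles.get(this.key(orgId, agentId))?.rules[hostname];
  }
}

const store = new PolicyStore();
store.set("org-a", "agent-1", { rules: { "api.example.com": "allow" } });
store.set("org-b", "agent-1", { rules: { "api.example.com": "deny" } });
```

Because the second `set` writes to a different slot, the last sync no longer overwrites another tenant's policy for the same agent id.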

buremba commented May 18, 2026

pi review follow-up — fixup c9a989f0

Plumbs the org parameter through to every downstream caller pi flagged as dead, plus the four lower-severity items. All six findings are addressed in a single commit on the same branch.

Per-finding red→green

Ran the new multi-tenant-isolation-reproducers.test.ts against 07ef8c75 (pre-fix) and c9a989f0 (post-fix). Same test file, no skips.

# pre-fix (commit 07ef8c75)
Ran 10 tests across 1 file. [2.43s]
 4 pass
 5 fail

(fail) [finding 3] PolicyStore is keyed by (orgId, agentId) > org A's policy survives org B's set under the same agent id
(fail) [finding 3] PolicyStore is keyed by (orgId, agentId) > resolve refuses cross-org reads — no fall-through to a sibling tenant's bundle
(fail) [finding 3] PolicyStore is keyed by (orgId, agentId) > clear(orgA, agentId) does not affect orgB's bundle for the same agent id
  TypeError: Object.entries requires that input parameter not be null or undefined
  → PolicyStore.set(agentId, bundle) had no slot for orgId; calling set(orgId, agentId, bundle)
    passes a string where a bundle was expected.

(fail) [finding 1] lookupPlaceholderMapping enforces caller's expected org > SecretProxy.forward rejects an org-A placeholder used on an org-B agent's URL
  TypeError: proxy.setAgentOrgResolver is not a function
  → SecretProxy had no resolver hook; no production call site supplied
    expectedOrganizationId so the lookup's org guard was dead code.

(fail) [finding 2] GrantStore queries scope to caller's organization id > HTTP proxy's checkDomainAccess passes the token's orgId into GrantStore
  TypeError: setProxyGrantStore is not a function
  → No way to inject a grant store at the call site; on pre-fix code the proxy's
    grantStore reference was null and the cross-org leak was unreachable to test.
    Fix adds the setter and threads tokenData.organizationId through both
    hasGrant/isDenied calls.

# post-fix (commit c9a989f0)
Ran 10 tests across 1 file. [2.31s]
 10 pass
 0 fail

(Finding 6 — peek-before-consume — is covered by an updated assertion in slack-routes.test.ts: the install-state row is preserved across a cross-org callback so a legitimate retry doesn't restart OAuth. The pre-existing finding-3 reproducer in that file asserted the opposite.)

What changed per finding

  1. secret-proxy lookupPlaceholderMapping: Added AgentOrgResolver (agentId → orgId) plumbed at boot in CoreServices.resolveAgentOrgId. SecretProxy.forward() derives the caller's expected org from the verified worker token first (header x-lobu-worker-token, query, or bearer when it's not a placeholder) and falls back to the DB resolver keyed by URL agentId. Both URL-binding and legacy swap() paths now pass expectedOrganizationId to lookupPlaceholderMapping / resolveSecret.
  2. checkDomainAccess() → GrantStore: Pass tokenData.organizationId into both proxyGrantStore.isDenied(agentId, hostname, orgId) and hasGrant(agentId, hostname, orgId). Added setProxyGrantStore so the call site is injectable (it had no setter before, so the test couldn't even reach the leak). The judge-fallthrough branch now refuses to consult PolicyStore without an org id, because falling through to an unkeyed lookup would let another tenant's policy decide our verdict.
  3. PolicyStore re-keyed: set(orgId, agentId, bundle), clear(orgId, agentId), resolve(orgId, agentId, hostname). policyHash includes the org id. Sync via BaseDeploymentManager.syncEgressPolicy reads messageData.organizationId (already on every payload); refuses to sync without it rather than collapsing into a shared bucket. All four existing test files updated to the new signatures.
  4. Telegram boot guard: ChatInstanceManager.initialize() now applies the same isCloudMode() && isPollingTelegramMode() check before startInstance() for persisted rows. Self-binds the connection's org via orgContext.run() before marking the row errored so the Postgres-backed update doesn't trip the "no org context" guard.
  5. Slack /slack/install self-host fallback: When c.get('organizationId') and session.activeOrganizationId are both empty (standalone gateway without the lobuApp wrapper), resolveSingleTenantOrgId() queries organization. If exactly one row exists, that's the install org. Two or zero rows → still 401. Self-host stays usable, multi-tenant stays strict.
  6. Peek-before-consume: Added OAuthStateStore.peek() (non-destructive). /slack/oauth_callback now: peek → validate org → consume only on match. Cross-org or unauthenticated hits no longer burn the install link, so a legitimate caller can retry without restarting OAuth.
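
The peek → validate → consume order in item 6 can be sketched with an in-memory store. Only `peek()`, the 403 on org mismatch, and the preserved row come from this PR; the `put`/`consume` method names, the `handleCallback` function, the 400 for an unknown state, and treating a missing session org as a 403 are assumptions for illustration.

```typescript
interface SlackInstallStateData {
  organizationId: string;
}

// In-memory stand-in for the real (persistent) state store.
class OAuthStateStore<T> {
  private rows = new Map<string, T>();

  put(state: string, data: T): void {
    this.rows.set(state, data);
  }

  // Non-destructive read.
  peek(state: string): T | undefined {
    return this.rows.get(state);
  }

  // Destructive read: the row can be used only once.
  consume(state: string): T | undefined {
    const data = this.rows.get(state);
    this.rows.delete(state);
    return data;
  }
}

function handleCallback(
  store: OAuthStateStore<SlackInstallStateData>,
  state: string,
  sessionOrgId: string | undefined,
): { status: number } {
  const pending = store.peek(state);
  if (!pending) return { status: 400 }; // assumed status for unknown state
  if (!sessionOrgId || sessionOrgId !== pending.organizationId) {
    // Reject WITHOUT consuming, so the legitimate caller can retry.
    return { status: 403 };
  }
  store.consume(state);
  return { status: 200 };
}

const stateStore = new OAuthStateStore<SlackInstallStateData>();
stateStore.put("state-1", { organizationId: "org-a" });
```

Validating before consuming means a cross-org hit costs the attacker nothing useful and costs the legitimate installer nothing at all.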

Validation

  • make typecheck
  • make build-packages
  • bun test src/gateway/__tests__/ — 808 pass / 0 fail (10 new reproducers + 0 regressions)
  • bun test src/__tests__/unit/ — 169 pass / 0 fail / 16 pre-existing skips

Items I'd flag (not blockers, not changed here)

  • setProxyPolicyStore / new setProxyGrantStore exist only for the test injection path; production still doesn't wire them. That's a pre-existing condition — outside this PR's scope per the "no scope expansion" rule.
  • ProviderOAuthStateData (Claude/ChatGPT/etc.) carries userId + agentId but no organizationId. The agent id is per-org-unique today so it isn't an active leak; flagged for a follow-up review pass.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/gateway/connections/chat-instance-manager.ts`:
- Around line 281-297: Existing guard that rejects Telegram polling mode in the
create path must also be enforced in other start/update flows; call the same
check (use isPollingTelegramMode(config) together with platform === "telegram"
and isCloudMode()) at the start of startInstance() and inside updateConnection()
before persisting any config changes (and likewise before restartConnection()
invokes startInstance()), and throw the same Error message to ensure any attempt
to switch or boot an existing connection into mode: "polling" is rejected in
Lobu Cloud.

In `@packages/server/src/gateway/orchestration/base-deployment-manager.ts`:
- Around line 421-425: The grantSyncCache is currently keyed only by agentId
which allows cross-tenant cache hits; update all cache lookups, inserts and
invalidations in syncNetworkConfigGrants (and any helpers) to key by the pair
(organizationId, agentId) instead of agentId alone—e.g., build a composite key
using organizationId and agentId (or a tuple) wherever grantSyncCache is
referenced; ensure the fast-path check that returns an unchanged-set uses this
composite key and that cache population/invalidation also uses the same
composite key so each org+agent has its own bucket (refer to variables
organizationId, agentId, grantSyncCache and method syncNetworkConfigGrants).

In `@packages/server/src/gateway/proxy/secret-proxy.ts`:
- Around line 533-551: When extractWorkerTokenOrg(c) returns a non-empty org and
urlAgentId is present, also call agentOrgResolver(urlAgentId) to obtain the
URL's org and compare them; if they differ, reject the request with a 403 rather
than silently using the token org. Update the logic around
expectedOrganizationId and the agentOrgResolver call (related symbols:
extractWorkerTokenOrg, agentOrgResolver, urlAgentId, expectedOrganizationId, and
the later mapping.agentId check) so that: 1) you always resolve urlAgentId when
present, 2) if extractWorkerTokenOrg returned an org and agentOrgResolver
returns a conflicting org you short-circuit with a 403, and 3) only set
expectedOrganizationId when the resolved orgs agree (or when no token org exists
and resolver returns one).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b3da2398-db90-45d7-912f-1c34a441e672

📥 Commits

Reviewing files that changed from the base of the PR and between 07ef8c7 and c9a989f.

📒 Files selected for processing (15)
  • packages/server/src/gateway/__tests__/base-deployment-grants.test.ts
  • packages/server/src/gateway/__tests__/http-proxy-judge.test.ts
  • packages/server/src/gateway/__tests__/multi-tenant-isolation-reproducers.test.ts
  • packages/server/src/gateway/__tests__/policy-store.test.ts
  • packages/server/src/gateway/__tests__/proxy-hardening.test.ts
  • packages/server/src/gateway/__tests__/slack-routes.test.ts
  • packages/server/src/gateway/auth/oauth/state-store.ts
  • packages/server/src/gateway/connections/chat-instance-manager.ts
  • packages/server/src/gateway/orchestration/base-deployment-manager.ts
  • packages/server/src/gateway/permissions/__tests__/policy-store.test.ts
  • packages/server/src/gateway/permissions/policy-store.ts
  • packages/server/src/gateway/proxy/http-proxy.ts
  • packages/server/src/gateway/proxy/secret-proxy.ts
  • packages/server/src/gateway/routes/public/slack.ts
  • packages/server/src/gateway/services/core-services.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/server/src/gateway/__tests__/base-deployment-grants.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/server/src/gateway/routes/public/slack.ts
  • packages/server/src/gateway/__tests__/slack-routes.test.ts

Comment on lines +281 to +297
// `mode: "polling"` long-polls Telegram's edge from the gateway pod and
// bypasses the per-tenant webhook URL we issue. On Lobu Cloud — where
// the same gateway serves many tenants — that means one org's connection
// can starve every other tenant's webhook delivery (and produces no
// audit trail tied to the inbound HTTP request). Refuse the explicit
// polling opt-in up front; self-hosters (LOBU_CLOUD_MODE unset/0) still
// get polling for tunnel-less dev. `mode: "auto"` is fine — it resolves
// to webhook whenever `publicGatewayUrl` is set, which cloud always has.
if (
  platform === "telegram" &&
  isCloudMode() &&
  isPollingTelegramMode(config as { mode?: string })
) {
  throw new Error(
    "Polling mode is not supported in Lobu Cloud — use webhook mode, or self-host."
  );
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce the polling ban from the shared start path.

Lines 289-297 only block new connections. An existing Telegram row can still be switched to mode: "polling" via updateConnection(), and restartConnection() / startInstance() will boot it in cloud mode. That leaves a straightforward bypass for the guard this PR is adding.

Suggested direction
+  private assertTelegramModeAllowed(
+    platform: string,
+    config: PlatformAdapterConfig
+  ): void {
+    if (
+      platform === "telegram" &&
+      isCloudMode() &&
+      isPollingTelegramMode(config as { mode?: string })
+    ) {
+      throw new Error(
+        "Polling mode is not supported in Lobu Cloud — use webhook mode, or self-host."
+      );
+    }
+  }
+
   async addConnection(
     platform: string,
     agentId: string | undefined,
     config: PlatformAdapterConfig,
@@
-    if (
-      platform === "telegram" &&
-      isCloudMode() &&
-      isPollingTelegramMode(config as { mode?: string })
-    ) {
-      throw new Error(
-        "Polling mode is not supported in Lobu Cloud — use webhook mode, or self-host."
-      );
-    }
+    this.assertTelegramModeAllowed(platform, config);

And then call the same helper before any start path, e.g. in startInstance() and before persisting config changes in updateConnection().

As per coding guidelines, mode: "polling" is rejected at connection-create time when LOBU_CLOUD_MODE=1.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/connections/chat-instance-manager.ts` around
lines 281 - 297, Existing guard that rejects Telegram polling mode in the create
path must also be enforced in other start/update flows; call the same check (use
isPollingTelegramMode(config) together with platform === "telegram" and
isCloudMode()) at the start of startInstance() and inside updateConnection()
before persisting any config changes (and likewise before restartConnection()
invokes startInstance()), and throw the same Error message to ensure any attempt
to switch or boot an existing connection into mode: "polling" is rejected in
Lobu Cloud.
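
The shared-guard idea above can be sketched in isolation. This is a minimal illustration, not the code in `chat-instance-manager.ts`: `ConnectionConfig`, `isPollingMode`, and `ToyConnectionManager` are hypothetical stand-ins, cloud mode is passed in as a constructor flag rather than read from `LOBU_CLOUD_MODE`, and the error message is abbreviated.

```typescript
// Hypothetical, simplified config shape; the real PlatformAdapterConfig is richer.
type ConnectionConfig = { mode?: string };

// Assumed semantics: only an explicit "polling" opt-in counts as polling.
function isPollingMode(config: ConnectionConfig): boolean {
  return config.mode === "polling";
}

// One guard, intended to be called from every create/update/start path.
function assertTelegramModeAllowed(
  platform: string,
  config: ConnectionConfig,
  cloudMode: boolean
): void {
  if (platform === "telegram" && cloudMode && isPollingMode(config)) {
    // Message abbreviated for this sketch.
    throw new Error("Polling mode is not supported in Lobu Cloud");
  }
}

// Toy manager: the guard runs before anything is stored, on both paths.
class ToyConnectionManager {
  private readonly cloudMode: boolean;
  private readonly connections = new Map<
    string,
    { platform: string; config: ConnectionConfig }
  >();

  constructor(cloudMode: boolean) {
    this.cloudMode = cloudMode;
  }

  addConnection(id: string, platform: string, config: ConnectionConfig): void {
    assertTelegramModeAllowed(platform, config, this.cloudMode);
    this.connections.set(id, { platform, config });
  }

  updateConnection(id: string, patch: ConnectionConfig): void {
    const existing = this.connections.get(id);
    if (!existing) throw new Error(`Unknown connection: ${id}`);
    const next = { ...existing.config, ...patch };
    // Validate the merged config before persisting it.
    assertTelegramModeAllowed(existing.platform, next, this.cloudMode);
    this.connections.set(id, { platform: existing.platform, config: next });
  }

  getConfig(id: string): ConnectionConfig | undefined {
    return this.connections.get(id)?.config;
  }
}
```

Because the check runs against the merged config before the write, a rejected update leaves the stored row untouched, which is the property the review comment asks for on the update path.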

Comment on lines +421 to +425
const organizationId = messageData.organizationId;
// PolicyStore is keyed by `(orgId, agentId)` to prevent cross-tenant
// policy clobbering — refuse to sync without an org id rather than
// collapsing into a shared bucket.
if (!this.policyStore || !agentId || !organizationId) {

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Scope grantSyncCache by (organizationId, agentId) too.

This method now treats agent-id-only buckets as unsafe, but syncNetworkConfigGrants() still caches by agentId alone. If two orgs reuse the same agentId and converge on the same grant set, the second org hits the unchanged-set fast path at Line 551 and never persists its own grants.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/orchestration/base-deployment-manager.ts` around
lines 421 - 425, The grantSyncCache is currently keyed only by agentId which
allows cross-tenant cache hits; update all cache lookups, inserts and
invalidations in syncNetworkConfigGrants (and any helpers) to key by the pair
(organizationId, agentId) instead of agentId alone—e.g., build a composite key
using organizationId and agentId (or a tuple) wherever grantSyncCache is
referenced; ensure the fast-path check that returns an unchanged-set uses this
composite key and that cache population/invalidation also uses the same
composite key so each org+agent has its own bucket (refer to variables
organizationId, agentId, grantSyncCache and method syncNetworkConfigGrants).
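
A composite cache key of the kind described above might look like the following. This is a sketch under assumptions: `grantCacheKey` and the `GrantSyncCache` class are hypothetical, and the real `grantSyncCache` may store something other than a fingerprint string.

```typescript
// Hypothetical helper. JSON-encoding the pair avoids delimiter collisions
// such as ("a:b", "c") versus ("a", "b:c").
function grantCacheKey(organizationId: string, agentId: string): string {
  return JSON.stringify([organizationId, agentId]);
}

class GrantSyncCache {
  private readonly lastSynced = new Map<string, string>();

  // Order-insensitive, duplicate-insensitive fingerprint of a grant set.
  private static fingerprint(grants: string[]): string {
    return JSON.stringify(Array.from(new Set(grants)).sort());
  }

  // True only when this exact (org, agent) pair already synced the same set.
  isUnchanged(organizationId: string, agentId: string, grants: string[]): boolean {
    const key = grantCacheKey(organizationId, agentId);
    return this.lastSynced.get(key) === GrantSyncCache.fingerprint(grants);
  }

  record(organizationId: string, agentId: string, grants: string[]): void {
    const key = grantCacheKey(organizationId, agentId);
    this.lastSynced.set(key, GrantSyncCache.fingerprint(grants));
  }

  invalidate(organizationId: string, agentId: string): void {
    this.lastSynced.delete(grantCacheKey(organizationId, agentId));
  }
}
```

With this shape, two orgs that reuse the same `agentId` and converge on the same grant set no longer share a bucket, so the second org misses the fast path and persists its own grants.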

Comment on lines +533 to +551
// Derive the caller's expected org from the verified worker token
// (preferred — it's signed) and fall back to a DB lookup keyed by the
// URL agentId. Either source becomes the `expectedOrganizationId`
// we hand to placeholder + secret lookups so a worker bearing org A's
// placeholder cannot resolve it under org B's URL.
const callerToken = this.extractCallerToken(c);
let expectedOrganizationId: string | undefined =
  this.extractWorkerTokenOrg(c);
if (!expectedOrganizationId && urlAgentId && this.agentOrgResolver) {
  try {
    const orgId = await this.agentOrgResolver(urlAgentId);
    if (orgId) expectedOrganizationId = orgId;
  } catch (err) {
    logger.warn(
      { urlAgentId, err: String(err) },
      "agentOrgResolver failed — falling through without org expectation"
    );
  }
}

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Reject worker-token org / URL-agent org mismatches.

When extractWorkerTokenOrg() succeeds, this path skips agentOrgResolver(urlAgentId) entirely. If two orgs can reuse the same raw agentId, a worker from org A can present its own placeholder, target /a/<same-agent-id> for org B, pass the mapping.agentId === urlAgentId check at Line 572, and get org B's upstream credentials. Always resolve urlAgentId when present and 403 if its org disagrees with the verified token org.

🔒 Suggested fix
-    let expectedOrganizationId: string | undefined =
-      this.extractWorkerTokenOrg(c);
-    if (!expectedOrganizationId && urlAgentId && this.agentOrgResolver) {
+    const tokenOrganizationId = this.extractWorkerTokenOrg(c);
+    let expectedOrganizationId: string | undefined = tokenOrganizationId;
+    if (urlAgentId && this.agentOrgResolver) {
       try {
-        const orgId = await this.agentOrgResolver(urlAgentId);
-        if (orgId) expectedOrganizationId = orgId;
+        const urlOrganizationId = await this.agentOrgResolver(urlAgentId);
+        if (
+          tokenOrganizationId &&
+          urlOrganizationId &&
+          tokenOrganizationId !== urlOrganizationId
+        ) {
+          logger.warn(
+            { urlAgentId, tokenOrganizationId, urlOrganizationId },
+            "Rejecting proxy request: worker token org does not match URL agent org"
+          );
+          return c.json({ error: "Forbidden" }, 403);
+        }
+        expectedOrganizationId ??= urlOrganizationId ?? undefined;
       } catch (err) {
         logger.warn(
           { urlAgentId, err: String(err) },
           "agentOrgResolver failed — falling through without org expectation"
         );
       }
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/proxy/secret-proxy.ts` around lines 533 - 551,
When extractWorkerTokenOrg(c) returns a non-empty org and urlAgentId is present,
also call agentOrgResolver(urlAgentId) to obtain the URL's org and compare them;
if they differ, reject the request with a 403 rather than silently using the
token org. Update the logic around expectedOrganizationId and the
agentOrgResolver call (related symbols: extractWorkerTokenOrg, agentOrgResolver,
urlAgentId, expectedOrganizationId, and the later mapping.agentId check) so
that: 1) you always resolve urlAgentId when present, 2) if extractWorkerTokenOrg
returned an org and agentOrgResolver returns a conflicting org you short-circuit
with a 403, and 3) only set expectedOrganizationId when the resolved orgs agree
(or when no token org exists and resolver returns one).
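
The three rules in the prompt above reduce to a small pure decision, sketched here outside any HTTP framework. `OrgDecision` and `reconcileOrgs` are hypothetical names; in the real proxy the "forbidden" branch would map to the 403 response, and resolver errors are out of scope for this sketch.

```typescript
// Hypothetical result type: either an expected org to enforce downstream,
// or a rejection that the caller would turn into a 403.
type OrgDecision =
  | { kind: "allow"; expectedOrganizationId: string | undefined }
  | { kind: "forbidden"; reason: string };

// tokenOrg: org from the verified worker token (may be absent).
// urlAgentOrg: org resolved from the URL agent id (may be absent).
function reconcileOrgs(
  tokenOrg: string | undefined,
  urlAgentOrg: string | null | undefined
): OrgDecision {
  // Both known and different: reject rather than trust either side.
  if (tokenOrg && urlAgentOrg && tokenOrg !== urlAgentOrg) {
    return {
      kind: "forbidden",
      reason: "worker token org does not match URL agent org",
    };
  }
  // Prefer the signed token org; fall back to the resolved URL org.
  return {
    kind: "allow",
    expectedOrganizationId: tokenOrg || urlAgentOrg || undefined,
  };
}
```

Keeping the comparison separate from the lookup makes the mismatch case testable without a database or a request context.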

…ic Agent API

The chat-platform spawn path (base-deployment-manager,
agent-threads.createThreadForAgent) already passes organizationId into
generateWorkerToken. The public Agent API entry point (POST /api/v1/agents)
did NOT — every worker spawned via 'lobu chat', 'lobu eval', or the JS SDK
landed with tokenData.organizationId === undefined and the egress proxy's
new per-tenant gates short-circuited to unscoped checks for that worker.

The route now looks the agent's owning org up via the ownership metadata
store and stamps the token. To make this work, AgentMetadata gains an
optional organizationId field populated by the postgres-backed store
(in-memory test stores leave it undefined — back-compat by design).

Ephemeral agents (no preexisting metadata) still mint without orgId;
that narrower case is tracked as a follow-up — needs an auth-session-
driven derivation path.

Reproducers in multi-tenant-isolation-reproducers.test.ts pin the
contract for both the metadata-driven org-stamped path and the
ephemeral undefined path.
@buremba
Member Author

buremba commented May 18, 2026

Follow-up fix (ada4b16) — public Agent API mint sites

E2E code-read verification (driven by the main agent, not pi) surfaced one
unclosed gap on top of the round-2 fixes: the public Agent API
(POST /api/v1/agents) was minting worker tokens without organizationId.

The chat-platform spawn path (base-deployment-manager,
agent-threads.createThreadForAgent) already passed organizationId
into generateWorkerToken. But the API entry point — the main path for
lobu chat, lobu eval, and the JS SDK — did not. Result: every API-
spawned worker landed with tokenData.organizationId === undefined, and
the new per-tenant gates in http-proxy.checkDomainAccess() short-
circuited via if (agentId && organizationId) — falling through to the
legacy unscoped checks for that worker.

Change

  • AgentMetadata (in @lobu/core) gains optional organizationId?: string.
  • Postgres-backed getMetadata / listAgents SELECT organization_id; rowToMetadata maps it. In-memory stores leave it undefined (back-compat by design).
  • POST /api/v1/agents looks up the agent's owning org via the ownership metadata store and stamps the token. Two generateWorkerToken call sites updated (resume-session + create-session).
  • Ephemeral agents (requestedAgentId empty → freshly minted UUID, no preexisting metadata) still mint without orgId. Tracked as a follow-up — needs an auth-session-driven derivation.

Red→green reproducers (appended to multi-tenant-isolation-reproducers.test.ts)

| Test | Pre-fix | Post-fix |
| --- | --- | --- |
| metadata-driven lookup propagates org into worker token | decoded.organizationId === undefined (no organizationId: in route's options bag) | decoded.organizationId === "org-a" |
| ephemeral agent mints without organizationId | (same, contract) | decoded.organizationId === undefined |

Verification

  • make typecheck — clean
  • make build-packages — clean
  • bun test src/gateway/__tests__/: 810 pass / 0 fail (was 808; +2 new reproducers)

Known follow-ups (not in this PR)

  1. watchers/automation.ts:678 (preflightWatcherMemoryTools) still mints without orgId. The token only authenticates a localhost call to /lobu/mcp/lobu-memory/tools (tool-listing); no egress proxy involved. Cosmetic consistency only — not blocking.
  2. Ephemeral-agent API mint still has organizationId === undefined. Needs a path from the auth session to the worker token.
  3. PolicyStore + GrantStore wiring (setProxyPolicyStore / setProxyGrantStore) — preventative as-is; wire when ready.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/server/src/lobu/stores/postgres-stores.ts (1)

265-284: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fail closed on org-less metadata lookups.

agents is keyed by (organization_id, id) now, but the fallback branch still does WHERE id = ${agentId} and returns rows[0]. Any caller outside orgContext can now pick an arbitrary tenant's metadata for a reused agentId, which is enough to mis-stamp organizationId on worker tokens and route requests under the wrong tenant. Require an explicit org id here, or refuse id-only lookups unless they resolve to exactly one row.

Suggested safety guard
     async getMetadata(agentId) {
       const sql = getDb();
       const orgId = tryGetOrgId();
       const rows = orgId
         ? await sql`
             SELECT id, organization_id, name, description, owner_platform, owner_user_id,
                    is_workspace_agent, workspace_id,
                    created_at, last_used_at
             FROM agents
             WHERE id = ${agentId} AND organization_id = ${orgId}
           `
         : await sql`
             SELECT id, organization_id, name, description, owner_platform, owner_user_id,
                    is_workspace_agent, workspace_id,
                    created_at, last_used_at
             FROM agents
             WHERE id = ${agentId}
+            LIMIT 2
           `;
       if (rows.length === 0) return null;
+      if (!orgId && rows.length !== 1) return null;
       return rowToMetadata(rows[0]);
     },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/lobu/stores/postgres-stores.ts` around lines 265 - 284,
getMetadata currently falls back to an id-only query when tryGetOrgId() is
falsy, which can return an arbitrary tenant's agent; change getMetadata to
refuse ambiguous org-less lookups: when orgId is absent, run the id-only query
but only accept the result if exactly one row is returned (rows.length === 1);
otherwise return null or throw an error to prevent cross-tenant leakage. Update
the logic around getMetadata (and still use rowToMetadata(rows[0]) only after
the unique-check) so callers no longer receive metadata for ambiguous agentId
lookups.
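
The uniqueness rule above can be illustrated against an in-memory row list. This is only a sketch: `AgentRow` and `selectAgentRow` are hypothetical, and the real `getMetadata` runs SQL rather than filtering an array.

```typescript
// Hypothetical, trimmed-down row shape for the agents table.
type AgentRow = { id: string; organization_id: string; name: string };

// Returns a row only when the lookup is unambiguous:
//  - with an org id, match on (organization_id, id);
//  - without one, accept an id-only match only if exactly one row exists.
function selectAgentRow(
  rows: AgentRow[],
  agentId: string,
  orgId: string | undefined
): AgentRow | null {
  const matches = rows.filter(
    (row) =>
      row.id === agentId &&
      (orgId === undefined || row.organization_id === orgId)
  );
  if (matches.length === 0) return null;
  // Fail closed: an org-less lookup that matches several tenants returns nothing.
  if (orgId === undefined && matches.length !== 1) return null;
  return matches[0] ?? null;
}
```

Returning `null` for the ambiguous case keeps the caller on its existing "no metadata" path instead of stamping an arbitrary tenant's org onto a worker token.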
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/gateway/routes/public/agent.ts`:
- Around line 636-645: tokenOrganizationId is left undefined for public POST
/api/v1/agents when requestedAgentId is absent, allowing org-less worker tokens
to bypass org-scoped checks; change the logic in the agent creation flow (where
tokenOrganizationId, isEphemeral, ownershipMetadataStore, requestedAgentId and
agentId are used) to resolve the caller's active organization from the
authenticated session (e.g., extract the session/org id from the request auth
context used by the route) and set tokenOrganizationId to that value when no
ownership metadata exists, and if no org can be resolved from the session then
fail closed (return an error) instead of minting an org-less token. Ensure you
update the branching that currently only consults ownershipMetadataStore so
session-derived organizationId is authoritative for new agent minting.

---

Outside diff comments:
In `@packages/server/src/lobu/stores/postgres-stores.ts`:
- Around line 265-284: getMetadata currently falls back to an id-only query when
tryGetOrgId() is falsy, which can return an arbitrary tenant's agent; change
getMetadata to refuse ambiguous org-less lookups: when orgId is absent, run the
id-only query but only accept the result if exactly one row is returned
(rows.length === 1); otherwise return null or throw an error to prevent
cross-tenant leakage. Update the logic around getMetadata (and still use
rowToMetadata(rows[0]) only after the unique-check) so callers no longer receive
metadata for ambiguous agentId lookups.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b172434f-469f-47b2-8d18-2ccfa2507553

📥 Commits

Reviewing files that changed from the base of the PR and between c9a989f and ada4b16.

📒 Files selected for processing (4)
  • packages/core/src/agent-store.ts
  • packages/server/src/gateway/__tests__/multi-tenant-isolation-reproducers.test.ts
  • packages/server/src/gateway/routes/public/agent.ts
  • packages/server/src/lobu/stores/postgres-stores.ts

Comment on lines +636 to +645
// Stamp the worker token with the agent's owning org so the egress
// proxy's per-tenant gates (grant/deny, judge cache, judge policy)
// can scope decisions by org. Ephemeral agents have no preexisting
// metadata; their token mints without orgId and the proxy falls
// through to unscoped checks for that worker — flagged for a
// future fix that derives org from the auth session.
const tokenOrganizationId =
  !isEphemeral && ownershipMetadataStore
    ? (await ownershipMetadataStore.getMetadata(agentId))?.organizationId
    : undefined;

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't mint org-less worker tokens on the public API path.

When requestedAgentId is absent, tokenOrganizationId is guaranteed to stay undefined, so POST /api/v1/agents still emits tokens that bypass the new org-scoped grant/judge/secret checks. That leaves the default ephemeral API flow outside the isolation guarantees this PR is adding. Resolve the caller's active org from the authenticated session before minting, or fail closed when the request has no org context.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/routes/public/agent.ts` around lines 636 - 645,
tokenOrganizationId is left undefined for public POST /api/v1/agents when
requestedAgentId is absent, allowing org-less worker tokens to bypass org-scoped
checks; change the logic in the agent creation flow (where tokenOrganizationId,
isEphemeral, ownershipMetadataStore, requestedAgentId and agentId are used) to
resolve the caller's active organization from the authenticated session (e.g.,
extract the session/org id from the request auth context used by the route) and
set tokenOrganizationId to that value when no ownership metadata exists, and if
no org can be resolved from the session then fail closed (return an error)
instead of minting an org-less token. Ensure you update the branching that
currently only consults ownershipMetadataStore so session-derived organizationId
is authoritative for new agent minting.
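
One way to read the fail-closed suggestion above is as a small resolution step that runs before minting. The sketch below is an assumption-laden illustration: `MintOrgResult` and `resolveTokenOrganization` are hypothetical, how the session org is obtained from the request is not shown, and the case where metadata and session disagree is left out because the review does not specify it.

```typescript
// Hypothetical result type for the pre-mint org resolution.
type MintOrgResult =
  | { ok: true; organizationId: string }
  | { ok: false; error: string };

// metadataOrganizationId: org from ownership metadata (existing agents).
// sessionOrganizationId: active org from the authenticated session
// (the only source available for ephemeral agents).
function resolveTokenOrganization(
  metadataOrganizationId: string | undefined,
  sessionOrganizationId: string | undefined
): MintOrgResult {
  const organizationId = metadataOrganizationId || sessionOrganizationId;
  if (!organizationId) {
    // Fail closed: never mint a token that skips org-scoped checks.
    return {
      ok: false,
      error: "No organization context; refusing to mint an org-less worker token",
    };
  }
  return { ok: true, organizationId };
}
```

The route would then return an error response on `ok: false` and pass `organizationId` into the token options otherwise, so the ephemeral path gets the same scoping as the metadata-driven one.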

@buremba buremba merged commit de4c238 into main May 18, 2026
23 of 26 checks passed
@buremba buremba deleted the fix/multi-tenant-isolation branch May 18, 2026 01:53
buremba added a commit that referenced this pull request May 18, 2026
…interactions (#867)

* fix(server): post-review cleanup of multi-tenant isolation + pending interactions

Closes 6 issues surfaced by a 6-agent review of #834, #836, #848, and the
#836 followup:

1. secret-proxy: fail-closed on agentOrgResolver DB error (was warn +
   fall-through with undefined expectedOrganizationId — a DB hiccup
   window let downstream org checks silently downgrade).
2. secret-proxy: legacy-mapping bypass closed. Pre-fix
   `lookupPlaceholderMapping` skipped the org check when
   `mapping.organizationId` was unset (legacy rows pre-org-pivot); a
   worker from org B could resolve a legacy unscoped mapping owned by
   org A. Now: if the caller supplies `expectedOrganizationId`, the
   mapping must match it. Emit a WARN on every legacy unscoped access
   to plan the deprecation.
3. pending-interaction-store: drop `created_at = now()` from ON CONFLICT.
   Webhook retries no longer reset the 24h TTL clock, so a misbehaving
   retry loop cannot keep a row alive indefinitely. `claimed_at = NULL`
   reset is preserved so legitimate retries are still claimable.
4. egress-judge VerdictCache key already includes orgId
   (`cache.ts:30-43`) — no code change needed, documented for the
   reviewers.
5. interaction-bridge: drop the per-bridge `sweepStalePendingInteractions`
   setInterval. The global `coreServices.sweepEphemeralTables`
   (scheduled in `src/scheduled/jobs.ts`) already covers it; the
   per-bridge call was N-times-per-pod wasted DB work. The local sweep
   timer is retained but now only evicts the in-memory
   `pendingSentMessages` cache by TTL.
6. pending-interaction-store: cap `sweepStalePendingInteractions` at 1000
   rows/call (configurable). An unbounded DELETE under a stale-row
   backlog could lock the table; remaining rows drain across subsequent
   5-minute cycles.
7. pending-interaction-store: add `deletePendingQuestion(id, org, conn,
   user)` and use it in the post-failure drop path. Pre-fix the bridge
   called `claimPendingQuestion` (UPDATE setting claimed_at) to "drop"
   the row, leaving a phantom row sitting around until the 24h sweep.
   `deletePendingQuestion` carries the same four-field scoping as the
   claim path, so safety invariants are identical.
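
The legacy-mapping rule in item 2 can be sketched as a pure lookup. This is an illustration only: the `SecretMapping` shape here is trimmed, `lookupMapping` is a hypothetical stand-in for `lookupPlaceholderMapping`, and the store and logger are passed in so the example stays self-contained.

```typescript
// Hypothetical, trimmed mapping shape; legacy rows have no organizationId.
type SecretMapping = {
  agentId: string;
  secretRef: string;
  organizationId?: string;
};

// Returns null for a missing mapping or an org mismatch; never throws.
function lookupMapping(
  store: Map<string, SecretMapping>,
  placeholder: string,
  expectedOrganizationId: string | undefined,
  warn: (message: string) => void
): SecretMapping | null {
  const mapping = store.get(placeholder);
  if (!mapping) return null;

  if (expectedOrganizationId !== undefined) {
    // A legacy row (organizationId unset) no longer passes this check.
    if (mapping.organizationId !== expectedOrganizationId) {
      warn("placeholder mapping rejected: organization mismatch");
      return null;
    }
    return mapping;
  }

  // No expected org supplied: still served, but legacy access is logged.
  if (mapping.organizationId === undefined) {
    warn("legacy unscoped placeholder mapping accessed");
  }
  return mapping;
}
```

Treating "unset" as a mismatch whenever the caller knows its org is what closes the bypass; the warning on the remaining unscoped path gives a signal for when legacy rows can be retired.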

Tests:

- `multi-tenant-isolation-reproducers.test.ts` gets 3 new cases under
  `[finding 1]`: legacy-mapping bypass rejected, legacy-mapping warn
  path with no expected org, and SecretProxy.forward returns 503 when
  the resolver throws.
- New `pending-interaction-cleanup.test.ts` covers retry-preserves-
  `created_at`, sweep LIMIT honoured + remainder drains, post-failure
  drop is a DELETE (not just claim), and `deletePendingQuestion`
  scoping invariant.
- `secret-proxy.test.ts` updated to reflect the closed legacy bypass —
  the old "falls through (legacy)" case now expects a null return when
  the caller supplies an expected org; a new "no expected org" case
  documents the WARN path for un-org-scoped callers.

Red→green proof captured per-fix in the PR body.

* fix(server): address codex review on PR #867

Two issues flagged by codex review:

1. Provider credential lookup had an unscoped fail-open DB-error path.
   The PR's new 503 in `secret-proxy.ts` only covered the resolver-throws
   branch — when the worker token already carried `organizationId`, the
   resolver was skipped and downstream `authProfilesManager.getBestProfile`
   ran without org context. Inside `AuthProfilesManager.resolveAgentOrgId`,
   a DB error logged a warning and returned undefined, falling through to
   unscoped credential reads (`auth-profiles-manager.ts:251-275`). Now we
   wrap the credential lookup in `orgContext.run({organizationId:
   expectedOrganizationId}, ...)` so `AuthProfilesManager.listProfiles`
   short-circuits via `tryGetOrgId()` and never invokes its own resolver
   when we already know the org. A DB hiccup in the upstream resolver
   cannot downgrade scoping for these requests.

2. The post-failure cleanup test exercised `deletePendingQuestion()`
   directly but didn't drive `registerInteractionBridge`. A regression
   that swapped `deletePendingQuestion` back to `claimPendingQuestion`
   in the bridge would not have failed the test. Added an integration
   test that emits a `question:created` event, stubs the thread post to
   throw, and asserts the row is GONE from `pending_interactions` (count
   0). Verified red→green by reverting the bridge to claim — the test
   times out at 5s because the row is never deleted.

* chore(submodule): bump owletto to dcb2172 to clear drift check