fix(server): scope tenant boundaries across egress judge, secret proxy, and oauth state#836
Conversation
…y, and oauth state Close four cross-org leakage paths surfaced by a multi-tenant audit: - Egress judge cache key now includes the worker's org id so org A's verdict for `api.example.com` cannot satisfy org B's identical request. Plumbs `organizationId` onto WorkerTokenData and through the proxy. - SecretMapping carries `organizationId`; new `lookupPlaceholderMapping` rejects mismatches the same way as a missing mapping (log + null). - SlackInstallStateData now carries `organizationId`; `/slack/install` refuses anonymous sessions, `/slack/oauth_callback` rejects when the callback session's org doesn't match the install state's. - ChatInstanceManager.addConnection rejects Telegram `mode: "polling"` when `LOBU_CLOUD_MODE=1`. Self-hosters (default) keep polling for tunnel-less dev. Documented in AGENTS.md.
📝 WalkthroughWalkthroughThreads organizationId through worker tokens, verdict cache keys, proxy grant checks, secret placeholder resolution, OAuth install state, policy storage/hashing, and deployment wiring; adds a cloud-mode guard disallowing Telegram polling and updates/extends tests and reproducers for tenant isolation. ChangesOrganization-Scoped Multi-Tenant Isolation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
packages/server/src/gateway/services/agent-threads.ts (1)
72-83:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPropagate
organizationIdthrough the session/message path, not only the initial token.
organizationIdis captured and embedded in the bootstrap token, but it is dropped before enqueue (ThreadSessionobject +enqueueMessagepayload). That leaves downstream processing without tenant context for internally-created threads.Suggested fix
@@ const session: ThreadSession = { @@ agentId, + organizationId, dryRun: false, isEphemeral: false, }; @@ const jobId = await queueProducer.enqueueMessage({ @@ agentId: realAgentId, + organizationId: session.organizationId, botId: "lobu-api",If
ThreadSessiondoes not already define this field, addorganizationId?: stringinpackages/server/src/gateway/session.tstoo.Also applies to: 89-101, 152-173
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/services/agent-threads.ts` around lines 72 - 83, The bootstrap code currently includes organizationId only in the token (via generateWorkerToken) but never forwards it into the ThreadSession or the payload passed to enqueueMessage; update the creation flow to attach organizationId to the ThreadSession and to any message/enqueue payloads so downstream workers have tenant context. Specifically, add organizationId?: string to the ThreadSession type in session.ts (if missing), set ThreadSession.organizationId = organizationId where the session is constructed (the block using threadId, conversationId, channelId, deploymentName), and include organizationId in the object passed into enqueueMessage (and similar payloads around lines 89-101 and 152-173) so the value flows with generateWorkerToken into downstream processing.packages/server/src/gateway/proxy/secret-proxy.ts (1)
686-706:⚠️ Potential issue | 🔴 CriticalTest files have not been fully updated to use the new options-object API.
The production code in
base-deployment-manager.ts(line 760) correctly uses the new signature with the options object includingorganizationId. However, multiple test files still callgeneratePlaceholderwith the old positional signature:
packages/server/src/gateway/__tests__/secret-proxy.test.ts: 6 calls at lines 27, 59, 69, 79, 85, 139packages/server/src/gateway/__tests__/secret-proxy-harden.test.ts: 1 call at line 630packages/server/src/__tests__/unit/secret-proxy-lifecycle.test.ts: 5 calls at lines 99, 127, 152, 180, 205These 12 test calls pass only 4 arguments and must be updated to pass the options object as the 5th parameter:
{ ttlSeconds?, organizationId? }. Three test cases insecret-proxy.test.ts(lines 101, 114, 126) have already been correctly updated with{ organizationId: "org-a" }.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/proxy/secret-proxy.ts` around lines 686 - 706, Tests still call generatePlaceholder using the old positional 4-argument signature; update each call to pass the options object as the 5th argument (options: { ttlSeconds?: number; organizationId?: string }) instead of supplying organizationId positionally. Specifically, update the 12 failing calls in the test suites that reference generatePlaceholder in packages/server/src/gateway/__tests__/secret-proxy.test.ts (6 calls), packages/server/src/gateway/__tests__/secret-proxy-harden.test.ts (1 call), and packages/server/src/__tests__/unit/secret-proxy-lifecycle.test.ts (5 calls) so they pass an options object (e.g., { organizationId: "org-a", ttlSeconds: X } when needed); locate usages of generatePlaceholder and adjust arguments accordingly to match the new signature used by storeSecretMapping and base-deployment-manager.
🧹 Nitpick comments (1)
packages/server/src/gateway/connections/chat-instance-manager.ts (1)
112-120: 💤 Low valueConsider using
isTelegramConfigtype guard for consistency.The file already uses the
isTelegramConfigtype guard pattern at lines 660, 780, 816, and 1121. Using it here would eliminate the type cast and the separateisPollingTelegramModehelper while maintaining the same semantics:♻️ Refactor for consistency
-/** - * `mode: "polling"` is the only config that forces long-polling regardless - * of whether the gateway has a public webhook URL. `mode: "auto"` resolves - * to webhook on cloud (publicGatewayUrl is always set there), so it's fine - * to allow. Only the explicit polling opt-in is rejected in cloud. - */ -function isPollingTelegramMode(config: { mode?: string }): boolean { - return config.mode === "polling"; -} - const ADAPTER_FACTORIES: Record<string, (config: any) => Promise<any>> = {+ // `mode: "polling"` is the only config that forces long-polling regardless + // of whether the gateway has a public webhook URL. `mode: "auto"` resolves + // to webhook on cloud (publicGatewayUrl is always set there), so it's fine + // to allow. Only the explicit polling opt-in is rejected in cloud. if ( platform === "telegram" && + isTelegramConfig(config) && isCloudMode() && - isPollingTelegramMode(config as { mode?: string }) + config.mode === "polling" ) {Also applies to: 241-257
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/connections/chat-instance-manager.ts` around lines 112 - 120, Replace the custom isPollingTelegramMode helper with the existing isTelegramConfig type guard to keep type-safety and consistency: where the code currently calls isPollingTelegramMode(config) (and in the other occurrence at the 241-257 block), change the check to use isTelegramConfig(config) && config.mode === "polling" (or inline the same condition) so TypeScript recognizes config as TelegramConfig and you can remove the isPollingTelegramMode function altogether; search for the symbol isPollingTelegramMode and update those call sites to reference isTelegramConfig and the mode comparison instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/gateway/proxy/http-proxy.ts`:
- Around line 133-134: The code currently uses organizationId ?? "" as a
fallback key which groups all pre-pivot tokens into a single legacy cache
bucket; instead, change the fallback to a per-token or per-deployment identifier
so verdicts are scoped to the token/deployment (not a constant empty string).
Locate usages of organizationId (the parameter named organizationId in this file
and the cache-key formation sites around the mentioned regions) and replace the
empty-string fallback with a stable per-token/deployment fallback (for example
use the token id, token fingerprint, deploymentId, or a composed fallback like
"deployment:<deploymentId>" when organizationId is missing) so each pre-pivot
token gets its own cache scope; apply this change to the occurrences referenced
(organizationId usage at the start and the other three locations).
---
Outside diff comments:
In `@packages/server/src/gateway/proxy/secret-proxy.ts`:
- Around line 686-706: Tests still call generatePlaceholder using the old
positional 4-argument signature; update each call to pass the options object as
the 5th argument (options: { ttlSeconds?: number; organizationId?: string })
instead of supplying organizationId positionally. Specifically, update the 12
failing calls in the test suites that reference generatePlaceholder in
packages/server/src/gateway/__tests__/secret-proxy.test.ts (6 calls),
packages/server/src/gateway/__tests__/secret-proxy-harden.test.ts (1 call), and
packages/server/src/__tests__/unit/secret-proxy-lifecycle.test.ts (5 calls) so
they pass an options object (e.g., { organizationId: "org-a", ttlSeconds: X }
when needed); locate usages of generatePlaceholder and adjust arguments
accordingly to match the new signature used by storeSecretMapping and
base-deployment-manager.
In `@packages/server/src/gateway/services/agent-threads.ts`:
- Around line 72-83: The bootstrap code currently includes organizationId only
in the token (via generateWorkerToken) but never forwards it into the
ThreadSession or the payload passed to enqueueMessage; update the creation flow
to attach organizationId to the ThreadSession and to any message/enqueue
payloads so downstream workers have tenant context. Specifically, add
organizationId?: string to the ThreadSession type in session.ts (if missing),
set ThreadSession.organizationId = organizationId where the session is
constructed (the block using threadId, conversationId, channelId,
deploymentName), and include organizationId in the object passed into
enqueueMessage (and similar payloads around lines 89-101 and 152-173) so the
value flows with generateWorkerToken into downstream processing.
---
Nitpick comments:
In `@packages/server/src/gateway/connections/chat-instance-manager.ts`:
- Around line 112-120: Replace the custom isPollingTelegramMode helper with the
existing isTelegramConfig type guard to keep type-safety and consistency: where
the code currently calls isPollingTelegramMode(config) (and in the other
occurrence at the 241-257 block), change the check to use
isTelegramConfig(config) && config.mode === "polling" (or inline the same
condition) so TypeScript recognizes config as TelegramConfig and you can remove
the isPollingTelegramMode function altogether; search for the symbol
isPollingTelegramMode and update those call sites to reference isTelegramConfig
and the mode comparison instead.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 89b0f268-015a-4399-b464-722e418be3e1
📒 Files selected for processing (18)
AGENTS.mdpackages/core/src/worker/auth.tspackages/server/src/__tests__/unit/egress-judge-timeout.test.tspackages/server/src/gateway/__tests__/egress-judge-cache.test.tspackages/server/src/gateway/__tests__/proxy-hardening.test.tspackages/server/src/gateway/__tests__/rest-api-hardening.test.tspackages/server/src/gateway/__tests__/secret-proxy.test.tspackages/server/src/gateway/__tests__/slack-routes.test.tspackages/server/src/gateway/auth/oauth/state-store.tspackages/server/src/gateway/connections/chat-instance-manager.tspackages/server/src/gateway/orchestration/base-deployment-manager.tspackages/server/src/gateway/proxy/egress-judge/cache.tspackages/server/src/gateway/proxy/egress-judge/judge.tspackages/server/src/gateway/proxy/egress-judge/types.tspackages/server/src/gateway/proxy/http-proxy.tspackages/server/src/gateway/proxy/secret-proxy.tspackages/server/src/gateway/routes/public/slack.tspackages/server/src/gateway/services/agent-threads.ts
…eam callers Addresses six findings on the prior commit. Three were critical because the new org parameter sat unused at the lookup layer — the isolation guarantee the commit advertised never reached the call sites: 1. **secret-proxy lookupPlaceholderMapping was dead code.** Wired an `agentOrgResolver` (DB lookup keyed by URL agentId) into SecretProxy and pass `expectedOrganizationId` through `forward()` → `lookupPlaceholderMapping`. Worker tokens carrying `organizationId` are also extracted (header/query/bearer) for a signed signal that beats the DB lookup. 2. **checkDomainAccess dropped org on grant-store calls.** Pass `tokenData.organizationId` through to `GrantStore.isDenied/hasGrant`. Added `setProxyGrantStore` so tests can install a real store and exercise the call site. PolicyStore is now gated on `organizationId` being present — falling through to an unkeyed lookup would let another tenant's policy decide our verdict. 3. **PolicyStore is now keyed by `(organizationId, agentId)`.** Last sync no longer wins across tenants; `policyHash` includes the org id so the verdict cache scoping in #836 stays meaningful. 4. **Telegram cloud-mode guard also runs in `initialize()`.** Persisted `mode: "polling"` rows are marked errored at boot (under their own org context) instead of silently starting. `addConnection()` still rejects the same config at create time. 5. **Slack `/slack/install` self-host fallback.** When neither `c.get('organizationId')` nor `session.activeOrganizationId` is set, look up the org table — if exactly one org row exists, use it. Otherwise reject. Standalone deployments without the lobuApp wrapper stay usable; multi-tenant deployments stay strict. 6. **OAuth callback peek-before-consume.** Added `OAuthStateStore.peek()`. The Slack callback now validates the session org against the state's org *before* burning the row; a legitimate caller can retry after a cross-org or unauthenticated hit instead of restarting OAuth. Adds `multi-tenant-isolation-reproducers.test.ts` with 10 red→green tests that pin findings 1–4 (finding 6 is covered by an updated assertion in slack-routes.test.ts that the row is preserved).
pi review follow-up — fixup
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/gateway/connections/chat-instance-manager.ts`:
- Around line 281-297: Existing guard that rejects Telegram polling mode in the
create path must also be enforced in other start/update flows; call the same
check (use isPollingTelegramMode(config) together with platform === "telegram"
and isCloudMode()) at the start of startInstance() and inside updateConnection()
before persisting any config changes (and likewise before restartConnection()
invokes startInstance()), and throw the same Error message to ensure any attempt
to switch or boot an existing connection into mode: "polling" is rejected in
Lobu Cloud.
In `@packages/server/src/gateway/orchestration/base-deployment-manager.ts`:
- Around line 421-425: The grantSyncCache is currently keyed only by agentId
which allows cross-tenant cache hits; update all cache lookups, inserts and
invalidations in syncNetworkConfigGrants (and any helpers) to key by the pair
(organizationId, agentId) instead of agentId alone—e.g., build a composite key
using organizationId and agentId (or a tuple) wherever grantSyncCache is
referenced; ensure the fast-path check that returns an unchanged-set uses this
composite key and that cache population/invalidation also uses the same
composite key so each org+agent has its own bucket (refer to variables
organizationId, agentId, grantSyncCache and method syncNetworkConfigGrants).
In `@packages/server/src/gateway/proxy/secret-proxy.ts`:
- Around line 533-551: When extractWorkerTokenOrg(c) returns a non-empty org and
urlAgentId is present, also call agentOrgResolver(urlAgentId) to obtain the
URL's org and compare them; if they differ, reject the request with a 403 rather
than silently using the token org. Update the logic around
expectedOrganizationId and the agentOrgResolver call (related symbols:
extractWorkerTokenOrg, agentOrgResolver, urlAgentId, expectedOrganizationId, and
the later mapping.agentId check) so that: 1) you always resolve urlAgentId when
present, 2) if extractWorkerTokenOrg returned an org and agentOrgResolver
returns a conflicting org you short-circuit with a 403, and 3) only set
expectedOrganizationId when the resolved orgs agree (or when no token org exists
and resolver returns one).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: b3da2398-db90-45d7-912f-1c34a441e672
📒 Files selected for processing (15)
packages/server/src/gateway/__tests__/base-deployment-grants.test.tspackages/server/src/gateway/__tests__/http-proxy-judge.test.tspackages/server/src/gateway/__tests__/multi-tenant-isolation-reproducers.test.tspackages/server/src/gateway/__tests__/policy-store.test.tspackages/server/src/gateway/__tests__/proxy-hardening.test.tspackages/server/src/gateway/__tests__/slack-routes.test.tspackages/server/src/gateway/auth/oauth/state-store.tspackages/server/src/gateway/connections/chat-instance-manager.tspackages/server/src/gateway/orchestration/base-deployment-manager.tspackages/server/src/gateway/permissions/__tests__/policy-store.test.tspackages/server/src/gateway/permissions/policy-store.tspackages/server/src/gateway/proxy/http-proxy.tspackages/server/src/gateway/proxy/secret-proxy.tspackages/server/src/gateway/routes/public/slack.tspackages/server/src/gateway/services/core-services.ts
✅ Files skipped from review due to trivial changes (1)
- packages/server/src/gateway/tests/base-deployment-grants.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/server/src/gateway/routes/public/slack.ts
- packages/server/src/gateway/tests/slack-routes.test.ts
| // `mode: "polling"` long-polls Telegram's edge from the gateway pod and | ||
| // bypasses the per-tenant webhook URL we issue. On Lobu Cloud — where | ||
| // the same gateway serves many tenants — that means one org's connection | ||
| // can starve every other tenant's webhook delivery (and produces no | ||
| // audit trail tied to the inbound HTTP request). Refuse the explicit | ||
| // polling opt-in up front; self-hosters (LOBU_CLOUD_MODE unset/0) still | ||
| // get polling for tunnel-less dev. `mode: "auto"` is fine — it resolves | ||
| // to webhook whenever `publicGatewayUrl` is set, which cloud always has. | ||
| if ( | ||
| platform === "telegram" && | ||
| isCloudMode() && | ||
| isPollingTelegramMode(config as { mode?: string }) | ||
| ) { | ||
| throw new Error( | ||
| "Polling mode is not supported in Lobu Cloud — use webhook mode, or self-host." | ||
| ); | ||
| } |
There was a problem hiding this comment.
Enforce the polling ban from the shared start path.
Lines 289-297 only block new connections. An existing Telegram row can still be switched to mode: "polling" via updateConnection(), and restartConnection() / startInstance() will boot it in cloud mode. That leaves a straightforward bypass for the guard this PR is adding.
Suggested direction
+ private assertTelegramModeAllowed(
+ platform: string,
+ config: PlatformAdapterConfig
+ ): void {
+ if (
+ platform === "telegram" &&
+ isCloudMode() &&
+ isPollingTelegramMode(config as { mode?: string })
+ ) {
+ throw new Error(
+ "Polling mode is not supported in Lobu Cloud — use webhook mode, or self-host."
+ );
+ }
+ }
+
async addConnection(
platform: string,
agentId: string | undefined,
config: PlatformAdapterConfig,
@@
- if (
- platform === "telegram" &&
- isCloudMode() &&
- isPollingTelegramMode(config as { mode?: string })
- ) {
- throw new Error(
- "Polling mode is not supported in Lobu Cloud — use webhook mode, or self-host."
- );
- }
+ this.assertTelegramModeAllowed(platform, config);And then call the same helper before any start path, e.g. in startInstance() and before persisting config changes in updateConnection().
As per coding guidelines, mode: "polling" is rejected at connection-create time when LOBU_CLOUD_MODE=1.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/gateway/connections/chat-instance-manager.ts` around
lines 281 - 297, Existing guard that rejects Telegram polling mode in the create
path must also be enforced in other start/update flows; call the same check (use
isPollingTelegramMode(config) together with platform === "telegram" and
isCloudMode()) at the start of startInstance() and inside updateConnection()
before persisting any config changes (and likewise before restartConnection()
invokes startInstance()), and throw the same Error message to ensure any attempt
to switch or boot an existing connection into mode: "polling" is rejected in
Lobu Cloud.
| const organizationId = messageData.organizationId; | ||
| // PolicyStore is keyed by `(orgId, agentId)` to prevent cross-tenant | ||
| // policy clobbering — refuse to sync without an org id rather than | ||
| // collapsing into a shared bucket. | ||
| if (!this.policyStore || !agentId || !organizationId) { |
There was a problem hiding this comment.
Scope grantSyncCache by (organizationId, agentId) too.
This method now treats agent-id-only buckets as unsafe, but syncNetworkConfigGrants() still caches by agentId alone. If two orgs reuse the same agentId and converge on the same grant set, the second org hits the unchanged-set fast path at Line 551 and never persists its own grants.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/gateway/orchestration/base-deployment-manager.ts` around
lines 421 - 425, The grantSyncCache is currently keyed only by agentId which
allows cross-tenant cache hits; update all cache lookups, inserts and
invalidations in syncNetworkConfigGrants (and any helpers) to key by the pair
(organizationId, agentId) instead of agentId alone—e.g., build a composite key
using organizationId and agentId (or a tuple) wherever grantSyncCache is
referenced; ensure the fast-path check that returns an unchanged-set uses this
composite key and that cache population/invalidation also uses the same
composite key so each org+agent has its own bucket (refer to variables
organizationId, agentId, grantSyncCache and method syncNetworkConfigGrants).
| // Derive the caller's expected org from the verified worker token | ||
| // (preferred — it's signed) and fall back to a DB lookup keyed by the | ||
| // URL agentId. Either source becomes the `expectedOrganizationId` | ||
| // we hand to placeholder + secret lookups so a worker bearing org A's | ||
| // placeholder cannot resolve it under org B's URL. | ||
| const callerToken = this.extractCallerToken(c); | ||
| let expectedOrganizationId: string | undefined = | ||
| this.extractWorkerTokenOrg(c); | ||
| if (!expectedOrganizationId && urlAgentId && this.agentOrgResolver) { | ||
| try { | ||
| const orgId = await this.agentOrgResolver(urlAgentId); | ||
| if (orgId) expectedOrganizationId = orgId; | ||
| } catch (err) { | ||
| logger.warn( | ||
| { urlAgentId, err: String(err) }, | ||
| "agentOrgResolver failed — falling through without org expectation" | ||
| ); | ||
| } | ||
| } |
There was a problem hiding this comment.
Reject worker-token org / URL-agent org mismatches.
When extractWorkerTokenOrg() succeeds, this path skips agentOrgResolver(urlAgentId) entirely. If two orgs can reuse the same raw agentId, a worker from org A can present its own placeholder, target /a/<same-agent-id> for org B, pass the mapping.agentId === urlAgentId check at Line 572, and get org B's upstream credentials. Always resolve urlAgentId when present and 403 if its org disagrees with the verified token org.
🔒 Suggested fix
- let expectedOrganizationId: string | undefined =
- this.extractWorkerTokenOrg(c);
- if (!expectedOrganizationId && urlAgentId && this.agentOrgResolver) {
+ const tokenOrganizationId = this.extractWorkerTokenOrg(c);
+ let expectedOrganizationId: string | undefined = tokenOrganizationId;
+ if (urlAgentId && this.agentOrgResolver) {
try {
- const orgId = await this.agentOrgResolver(urlAgentId);
- if (orgId) expectedOrganizationId = orgId;
+ const urlOrganizationId = await this.agentOrgResolver(urlAgentId);
+ if (
+ tokenOrganizationId &&
+ urlOrganizationId &&
+ tokenOrganizationId !== urlOrganizationId
+ ) {
+ logger.warn(
+ { urlAgentId, tokenOrganizationId, urlOrganizationId },
+ "Rejecting proxy request: worker token org does not match URL agent org"
+ );
+ return c.json({ error: "Forbidden" }, 403);
+ }
+ expectedOrganizationId ??= urlOrganizationId ?? undefined;
} catch (err) {
logger.warn(
{ urlAgentId, err: String(err) },
"agentOrgResolver failed — falling through without org expectation"
);
}
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/gateway/proxy/secret-proxy.ts` around lines 533 - 551,
When extractWorkerTokenOrg(c) returns a non-empty org and urlAgentId is present,
also call agentOrgResolver(urlAgentId) to obtain the URL's org and compare them;
if they differ, reject the request with a 403 rather than silently using the
token org. Update the logic around expectedOrganizationId and the
agentOrgResolver call (related symbols: extractWorkerTokenOrg, agentOrgResolver,
urlAgentId, expectedOrganizationId, and the later mapping.agentId check) so
that: 1) you always resolve urlAgentId when present, 2) if extractWorkerTokenOrg
returned an org and agentOrgResolver returns a conflicting org you short-circuit
with a 403, and 3) only set expectedOrganizationId when the resolved orgs agree
(or when no token org exists and resolver returns one).
…ic Agent API The chat-platform spawn path (base-deployment-manager, agent-threads.createThreadForAgent) already passes organizationId into generateWorkerToken. The public Agent API entry point (POST /api/v1/agents) did NOT — every worker spawned via 'lobu chat', 'lobu eval', or the JS SDK landed with tokenData.organizationId === undefined and the egress proxy's new per-tenant gates short-circuited to unscoped checks for that worker. The route now looks the agent's owning org up via the ownership metadata store and stamps the token. To make this work, AgentMetadata gains an optional organizationId field populated by the postgres-backed store (in-memory test stores leave it undefined — back-compat by design). Ephemeral agents (no preexisting metadata) still mint without orgId; that narrower case is tracked as a follow-up — needs an auth-session- driven derivation path. Reproducers in multi-tenant-isolation-reproducers.test.ts pin the contract for both the metadata-driven org-stamped path and the ephemeral undefined path.
Follow-up fix (ada4b16) — public Agent API mint sitesE2E code-read verification (driven by the main agent, not pi) surfaced one The chat-platform spawn path ( Change
Red→green reproducers (appended to
|
| Test | Pre-fix | Post-fix |
|---|---|---|
| metadata-driven lookup propagates org into worker token | decoded.organizationId === undefined (no organizationId: in route's options bag) |
decoded.organizationId === "org-a" |
| ephemeral agent mints without organizationId | (same, contract) | decoded.organizationId === undefined |
Verification
make typecheck— cleanmake build-packages— cleanbun test src/gateway/__tests__/— 810 pass / 0 fail (was 808; +2 new reproducers)
Known follow-ups (not in this PR)
watchers/automation.ts:678(preflightWatcherMemoryTools) still mints without orgId. The token only authenticates a localhost call to/lobu/mcp/lobu-memory/tools(tool-listing); no egress proxy involved. Cosmetic consistency only — not blocking.- Ephemeral-agent API mint still has
organizationId === undefined. Needs a path from the auth session to the worker token. - PolicyStore + GrantStore wiring (
setProxyPolicyStore/setProxyGrantStore) — preventative as-is; wire when ready.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/server/src/lobu/stores/postgres-stores.ts (1)
265-284:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winFail closed on org-less metadata lookups.
agentsis keyed by(organization_id, id)now, but the fallback branch still doesWHERE id = ${agentId}and returnsrows[0]. Any caller outsideorgContextcan now pick an arbitrary tenant's metadata for a reusedagentId, which is enough to mis-stamporganizationIdon worker tokens and route requests under the wrong tenant. Require an explicit org id here, or refuse id-only lookups unless they resolve to exactly one row.Suggested safety guard
async getMetadata(agentId) { const sql = getDb(); const orgId = tryGetOrgId(); const rows = orgId ? await sql` SELECT id, organization_id, name, description, owner_platform, owner_user_id, is_workspace_agent, workspace_id, created_at, last_used_at FROM agents WHERE id = ${agentId} AND organization_id = ${orgId} ` : await sql` SELECT id, organization_id, name, description, owner_platform, owner_user_id, is_workspace_agent, workspace_id, created_at, last_used_at FROM agents WHERE id = ${agentId} + LIMIT 2 `; if (rows.length === 0) return null; + if (!orgId && rows.length !== 1) return null; return rowToMetadata(rows[0]); },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/lobu/stores/postgres-stores.ts` around lines 265 - 284, getMetadata currently falls back to an id-only query when tryGetOrgId() is falsy, which can return an arbitrary tenant's agent; change getMetadata to refuse ambiguous org-less lookups: when orgId is absent, run the id-only query but only accept the result if exactly one row is returned (rows.length === 1); otherwise return null or throw an error to prevent cross-tenant leakage. Update the logic around getMetadata (and still use rowToMetadata(rows[0]) only after the unique-check) so callers no longer receive metadata for ambiguous agentId lookups.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/gateway/routes/public/agent.ts`:
- Around line 636-645: tokenOrganizationId is left undefined for public POST
/api/v1/agents when requestedAgentId is absent, allowing org-less worker tokens
to bypass org-scoped checks; change the logic in the agent creation flow (where
tokenOrganizationId, isEphemeral, ownershipMetadataStore, requestedAgentId and
agentId are used) to resolve the caller's active organization from the
authenticated session (e.g., extract the session/org id from the request auth
context used by the route) and set tokenOrganizationId to that value when no
ownership metadata exists, and if no org can be resolved from the session then
fail closed (return an error) instead of minting an org-less token. Ensure you
update the branching that currently only consults ownershipMetadataStore so
session-derived organizationId is authoritative for new agent minting.
---
Outside diff comments:
In `@packages/server/src/lobu/stores/postgres-stores.ts`:
- Around line 265-284: getMetadata currently falls back to an id-only query when
tryGetOrgId() is falsy, which can return an arbitrary tenant's agent; change
getMetadata to refuse ambiguous org-less lookups: when orgId is absent, run the
id-only query but only accept the result if exactly one row is returned
(rows.length === 1); otherwise return null or throw an error to prevent
cross-tenant leakage. Update the logic around getMetadata (and still use
rowToMetadata(rows[0]) only after the unique-check) so callers no longer receive
metadata for ambiguous agentId lookups.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: b172434f-469f-47b2-8d18-2ccfa2507553
📒 Files selected for processing (4)
packages/core/src/agent-store.tspackages/server/src/gateway/__tests__/multi-tenant-isolation-reproducers.test.tspackages/server/src/gateway/routes/public/agent.tspackages/server/src/lobu/stores/postgres-stores.ts
| // Stamp the worker token with the agent's owning org so the egress | ||
| // proxy's per-tenant gates (grant/deny, judge cache, judge policy) | ||
| // can scope decisions by org. Ephemeral agents have no preexisting | ||
| // metadata; their token mints without orgId and the proxy falls | ||
| // through to unscoped checks for that worker — flagged for a | ||
| // future fix that derives org from the auth session. | ||
| const tokenOrganizationId = | ||
| !isEphemeral && ownershipMetadataStore | ||
| ? (await ownershipMetadataStore.getMetadata(agentId))?.organizationId | ||
| : undefined; |
There was a problem hiding this comment.
Don't mint org-less worker tokens on the public API path.
When requestedAgentId is absent, tokenOrganizationId is guaranteed to stay undefined, so POST /api/v1/agents still emits tokens that bypass the new org-scoped grant/judge/secret checks. That leaves the default ephemeral API flow outside the isolation guarantees this PR is adding. Resolve the caller's active org from the authenticated session before minting, or fail closed when the request has no org context.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/gateway/routes/public/agent.ts` around lines 636 - 645,
tokenOrganizationId is left undefined for public POST /api/v1/agents when
requestedAgentId is absent, allowing org-less worker tokens to bypass org-scoped
checks; change the logic in the agent creation flow (where tokenOrganizationId,
isEphemeral, ownershipMetadataStore, requestedAgentId and agentId are used) to
resolve the caller's active organization from the authenticated session (e.g.,
extract the session/org id from the request auth context used by the route) and
set tokenOrganizationId to that value when no ownership metadata exists, and if
no org can be resolved from the session then fail closed (return an error)
instead of minting an org-less token. Ensure you update the branching that
currently only consults ownershipMetadataStore so session-derived organizationId
is authoritative for new agent minting.
…interactions (#867) * fix(server): post-review cleanup of multi-tenant isolation + pending interactions Closes 6 issues surfaced by a 6-agent review of #834, #836, #848, and the #836 followup: 1. secret-proxy: fail-closed on agentOrgResolver DB error (was warn + fall-through with undefined expectedOrganizationId — a DB hiccup window let downstream org checks silently downgrade). 2. secret-proxy: legacy-mapping bypass closed. Pre-fix `lookupPlaceholder Mapping` skipped the org check when `mapping.organizationId` was unset (legacy rows pre-org-pivot); a worker from org B could resolve a legacy unscoped mapping owned by org A. Now: if the caller supplies `expectedOrganizationId`, the mapping must match it. Emit a WARN on every legacy unscoped access to plan the deprecation. 3. pending-interaction-store: drop `created_at = now()` from ON CONFLICT. Webhook retries no longer reset the 24h TTL clock, so a misbehaving retry loop cannot keep a row alive indefinitely. `claimed_at = NULL` reset is preserved so legitimate retries are still claimable. 4. egress-judge VerdictCache key already includes orgId (`cache.ts:30-43`) — no code change needed, documented for the reviewers. 5. interaction-bridge: drop the per-bridge `sweepStalePendingInteractions` setInterval. The global `coreServices.sweepEphemeralTables` (scheduled in `src/scheduled/jobs.ts`) already covers it; the per-bridge call was N-times-per-pod wasted DB work. The local sweep timer is retained but now only evicts the in-memory `pendingSent Messages` cache by TTL. 6. pending-interaction-store: cap `sweepStalePendingInteractions` at 1000 rows/call (configurable). An unbounded DELETE under a stale-row backlog could lock the table; remaining rows drain across subsequent 5-minute cycles. 7. pending-interaction-store: add `deletePendingQuestion(id, org, conn, user)` and use it in the post-failure drop path. Pre-fix the bridge called `claimPendingQuestion` (UPDATE setting claimed_at) to "drop" the row, leaving a phantom row sitting around until the 24h sweep. `deletePendingQuestion` carries the same four-field scoping as the claim path, so safety invariants are identical. Tests: - `multi-tenant-isolation-reproducers.test.ts` gets 3 new cases under `[finding 1]`: legacy-mapping bypass rejected, legacy-mapping warn path with no expected org, and SecretProxy.forward returns 503 when the resolver throws. - New `pending-interaction-cleanup.test.ts` covers retry-preserves- `created_at`, sweep LIMIT honoured + remainder drains, post-failure drop is a DELETE (not just claim), and `deletePendingQuestion` scoping invariant. - `secret-proxy.test.ts` updated to reflect the closed legacy bypass — the old "falls through (legacy)" case now expects a null return when the caller supplies an expected org; a new "no expected org" case documents the WARN path for un-org-scoped callers. Red→green proof captured per-fix in the PR body. * fix(server): address codex review on PR #867 Two issues flagged by codex review: 1. Provider credential lookup had an unscoped fail-open DB-error path. The PR's new 503 in `secret-proxy.ts` only covered the resolver-throws branch — when the worker token already carried `organizationId`, the resolver was skipped and downstream `authProfilesManager.getBestProfile` ran without org context. Inside `AuthProfilesManager.resolveAgentOrgId`, a DB error logged a warning and returned undefined, falling through to unscoped credential reads (`auth-profiles-manager.ts:251-275`). Now we wrap the credential lookup in `orgContext.run({organizationId: expectedOrganizationId}, ...)` so `AuthProfilesManager.listProfiles` short-circuits via `tryGetOrgId()` and never invokes its own resolver when we already know the org. A DB hiccup in the upstream resolver cannot downgrade scoping for these requests. 2. The post-failure cleanup test exercised `deletePendingQuestion()` directly but didn't drive `registerInteractionBridge`. A regression that swapped `deletePendingQuestion` back to `claimPendingQuestion` in the bridge would not have failed the test. Added an integration test that emits a `question:created` event, stubs the thread post to throw, and asserts the row is GONE from `pending_interactions` (count 0). Verified red→green by reverting the bridge to claim — the test times out at 5s because the row is never deleted. * chore(submodule): bump owletto to dcb2172 to clear drift check
Summary
Closes four cross-org leakage paths surfaced by a multi-tenant audit:
orgIdso org A's verdict cannot satisfy org B's identical request.organizationIdis plumbed ontoWorkerTokenData(and throughgenerateWorkerToken/base-deployment-manager.ts/agent-threads.ts), then read by the HTTP proxy and fed intoVerdictCache.key.SecretMappingcarriesorganizationId; the new module-levellookupPlaceholderMapping(placeholder, expectedOrganizationId?)rejects an org mismatch the same way as a missing mapping (log + return null, never throw).generatePlaceholder()accepts anorganizationIdoption, set at the deployment-manager call site.SlackInstallStateDatanow carriesorganizationId./slack/installreads the active org from the session/Hono context (rejects with 401 if absent) and stamps it into the state./slack/oauth_callbackrejects with 403 when the callback session's active org doesn't match the install-state's org.ChatInstanceManager.addConnectionrejectsmode: "polling"whenLOBU_CLOUD_MODE=1. Self-hosters (LOBU_CLOUD_MODEunset/0) keep the polling option for tunnel-less dev.mode: "auto"is unaffected — it resolves to webhook whenever the gateway has a public URL, which cloud always has. AGENTS.md updated to document the cloud constraint.Test plan
make typecheckpasses (strict — matches the prod Dockerfile)make build-packagespasses (full bundle build clean)bun test src/gateway/__tests__/— 798 tests passlookupPlaceholderMappingreturns null on org mismatch, returns mapping when org matches or no expectation is supplied, falls through when the mapping has no org tag (legacy)orgIdproduces a different key for identical(policyHash, hostname, method, path)/slack/installreturns 401 with no session org;/slack/oauth_callbackreturns 403 when callback session org differs from install-state orgLOBU_CLOUD_MODE=1returns a clear errorSummary by CodeRabbit
Bug Fixes
Improvements
Documentation
Tests