
fix(server): post-review cleanup of multi-tenant isolation + pending interactions #867

Merged
buremba merged 3 commits into main from fix/multi-tenant-cleanup on May 18, 2026

@buremba buremba commented May 18, 2026

Summary

Single PR fixing 7 issues surfaced by a 6-agent review of recently-merged commits de81ebe5 (#834), de4c238b (#836), 7fc36dca (#848), and ada4b169 (#836 followup).

Fixes

  1. secret-proxy: fail-closed on agentOrgResolver DB error (packages/server/src/gateway/proxy/secret-proxy.ts:541-563). Previously the proxy logged a warning and fell through with expectedOrganizationId = undefined; it now returns a 503 response, so a DB hiccup window no longer downgrades downstream org checks.
  2. secret-proxy: legacy-mapping bypass closed (secret-proxy.ts:195-227). Pre-fix, the org check was gated on mapping.organizationId being set, so a legacy mapping (minted before the org-id pivot) sailed through under any expected org. Now, if the caller supplies expectedOrganizationId, the mapping must match it, and a mapping whose organizationId is undefined is refused. A WARN log is emitted on every legacy unscoped access so the deprecation can be scheduled.
  3. pending-interaction-store: ON CONFLICT preserves created_at (pending-interaction-store.ts:54-65). created_at = now() is dropped from the UPDATE clause so webhook retries cannot keep a row alive past the 24h TTL by re-stashing it. The claimed_at = NULL reset is retained.
  4. policyHash org-scoping verified, no change needed: egress-judge/cache.ts:29-43 already keys by (orgId, policyHash, …). judge.ts:105-111 passes request.organizationId as the first key part, so two orgs with identical policy text get distinct cache entries.
  5. interaction-bridge: per-bridge sweep removed (interaction-bridge.ts:198-227). The global coreServices.sweepEphemeralTables (scheduled in src/scheduled/jobs.ts:110-114, every 5 min) already calls sweepStalePendingInteractions, so the per-bridge setInterval was redundant work repeated N times per pod. The local timer is retained for in-memory pendingSentMessages TTL eviction only.
  6. sweepStalePendingInteractions LIMIT (pending-interaction-store.ts:101-135). Each call deletes at most 1000 rows by default (configurable), using a DELETE … WHERE id IN (SELECT … LIMIT N) shape; remaining rows drain across subsequent cycles.
  7. Post-failure drop uses DELETE (interaction-bridge.ts:332-355, plus a new deletePendingQuestion(id, org, conn, user) in pending-interaction-store.ts:137-159). Pre-fix, the bridge called claimPendingQuestion (an UPDATE setting claimed_at) to "drop" the row, leaving a phantom row until the 24h sweep. deletePendingQuestion carries the same four-field scoping as claimPendingQuestion, so the safety invariant is identical.
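
Item 4 depends on the organization being part of the verdict cache key. A minimal sketch of that idea, assuming a hypothetical `verdictCacheKey` helper and a plain `Map`; the real key in `egress-judge/cache.ts` has further parts that the description elides, so this only illustrates the org-first ordering:

```typescript
// Hypothetical helper (not the real cache.ts code): the org id is the
// first key part, so identical policy text in two orgs cannot share an entry.
function verdictCacheKey(orgId: string, policyHash: string): string {
  // Serialising a tuple avoids delimiter collisions between the parts.
  return JSON.stringify([orgId, policyHash]);
}

const verdicts = new Map<string, "allow" | "deny">();
verdicts.set(verdictCacheKey("org-a", "hash-1"), "allow");

// Same policy hash, different org: a cache miss rather than a shared verdict.
const crossOrgHit = verdicts.get(verdictCacheKey("org-b", "hash-1"));
const sameOrgHit = verdicts.get(verdictCacheKey("org-a", "hash-1"));
```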

Red → green per fix

All red runs captured with the fix branch checked out and the production source files temporarily reverted.

Fix #1 (503 on resolver error) — pre-fix:

expect(res.status).toBe(503);
Expected: 503
Received: 200

Post-fix: passes.
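
The fail-closed behaviour asserted above can be sketched as follows. This is an illustration under stated assumptions, not the code in `secret-proxy.ts`: `forwardWithOrgCheck`, `AgentOrgResolver`, and `ProxyResult` are hypothetical names, and only the 503 status and the "failed to resolve agent organization" message come from this PR.

```typescript
type ProxyResult = { status: number; body: string };

// Hypothetical resolver shape: looks up the organization that owns an agent.
type AgentOrgResolver = (agentId: string) => Promise<string | undefined>;

async function forwardWithOrgCheck(
  agentId: string,
  resolveOrg: AgentOrgResolver,
  forwardUpstream: (expectedOrganizationId: string | undefined) => Promise<ProxyResult>
): Promise<ProxyResult> {
  let expectedOrganizationId: string | undefined;
  try {
    expectedOrganizationId = await resolveOrg(agentId);
  } catch {
    // Fail closed: a resolver error must not fall through with an
    // undefined expected org, so no upstream call is made at all.
    return { status: 503, body: "failed to resolve agent organization" };
  }
  return forwardUpstream(expectedOrganizationId);
}
```

The pre-fix shape would have logged the error and continued to `forwardUpstream(undefined)`, which is what allowed downstream org checks to be skipped.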

Fix #2 (legacy bypass) — pre-fix:

expect(lookupPlaceholderMapping(placeholder, "org-b")).toBeNull();
Expected: null
Received: { agentId: "agent-legacy", organizationId: undefined, … }

Post-fix: passes.
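
A minimal sketch of the closed bypass, assuming an in-memory `Map` in place of the real mapping store and a `warnings` array in place of the logger. Only the function name and the `agentId` / `organizationId` fields come from this PR; exactly where the WARN fires in the real code is an assumption here.

```typescript
interface PlaceholderMapping {
  agentId: string;
  organizationId?: string; // unset on legacy mappings minted before the org-id pivot
}

const placeholderMappings = new Map<string, PlaceholderMapping>();
const warnings: string[] = []; // stand-in for a WARN-level logger

function lookupPlaceholderMapping(
  placeholder: string,
  expectedOrganizationId?: string
): PlaceholderMapping | null {
  const mapping = placeholderMappings.get(placeholder);
  if (mapping === undefined) return null;

  if (expectedOrganizationId !== undefined) {
    // Strict equality: a legacy mapping with organizationId === undefined
    // can never match a supplied expected org.
    return mapping.organizationId === expectedOrganizationId ? mapping : null;
  }

  // No expected org supplied: legacy access is still served, but recorded.
  if (mapping.organizationId === undefined) {
    warnings.push(`legacy unscoped mapping accessed: ${placeholder}`);
  }
  return mapping;
}
```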

Fix #3 (created_at preserved) — pre-fix:

expect(afterTs).toBe(backdatedTs);
Expected: 1779076816752  // backdated
Received: 1779080416753  // now()

Post-fix: passes (created_at unchanged across retry).
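
The upsert semantics being tested can be modelled in memory. This is a sketch of the behaviour rather than the Postgres statement itself: `storePending` and `pendingRows` are hypothetical names, and refreshing the payload on conflict is an assumption.

```typescript
interface PendingRow {
  payload: string;
  createdAt: number; // epoch milliseconds
  claimedAt: number | null;
}

const pendingRows = new Map<string, PendingRow>();

function storePending(id: string, payload: string, now: number): void {
  const existing = pendingRows.get(id);
  if (existing === undefined) {
    // Plain insert: the TTL clock starts here.
    pendingRows.set(id, { payload, createdAt: now, claimedAt: null });
    return;
  }
  // Conflict branch: refresh the payload and make the row claimable again,
  // but leave createdAt alone so retries cannot extend the row's lifetime.
  existing.payload = payload;
  existing.claimedAt = null;
}

storePending("q1", "go?", 1000);
pendingRows.get("q1")!.claimedAt = 1500; // simulate a claim before the retry
storePending("q1", "go? (retry)", 9000); // webhook retry for the same id
```

Claiming the row before the retry matters: without it, `claimedAt` is already null and the reset is never actually exercised.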

Fixes #6 + #7 (LIMIT + deletePendingQuestion) — pre-fix the cleanup test file fails to even load:

SyntaxError: Export named 'deletePendingQuestion' not found

Post-fix: all 4 cleanup tests pass (sweep LIMIT honoured + remainder drains, deletePendingQuestion hard-deletes, scoping invariant).
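
The bounded sweep and its drain-over-cycles behaviour can be sketched with an in-memory table. `sweepStale` is a hypothetical stand-in for `sweepStalePendingInteractions`; the default of 1000 and the 24h TTL come from this PR, while the select-then-delete structure mirrors the `DELETE … WHERE id IN (SELECT … LIMIT N)` shape only loosely.

```typescript
const DEFAULT_SWEEP_LIMIT = 1000;
const TTL_MS = 24 * 60 * 60 * 1000;

// Deletes at most `limit` rows older than the TTL and returns how many went.
function sweepStale(
  createdAtById: Map<string, number>,
  now: number,
  limit: number = DEFAULT_SWEEP_LIMIT
): number {
  // "SELECT id … LIMIT N": pick a bounded batch of stale ids first.
  const staleIds: string[] = [];
  createdAtById.forEach((createdAt, id) => {
    if (staleIds.length < limit && now - createdAt > TTL_MS) staleIds.push(id);
  });
  // "DELETE … WHERE id IN (…)": remove only that batch.
  for (const id of staleIds) createdAtById.delete(id);
  return staleIds.length;
}

const sweepNow = TTL_MS + 1;
const table = new Map<string, number>();
table.set("a", 0);
table.set("b", 0);
table.set("c", 0);
table.set("fresh", sweepNow);

const firstCycle = sweepStale(table, sweepNow, 2);
const secondCycle = sweepStale(table, sweepNow, 2);
const thirdCycle = sweepStale(table, sweepNow, 2);
```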

Green (final)

bun test packages/server/src/gateway/__tests__/multi-tenant-isolation-reproducers.test.ts \
         packages/server/src/gateway/__tests__/pending-interaction-cleanup.test.ts \
         packages/server/src/gateway/__tests__/pending-interaction-store.test.ts \
         packages/server/src/gateway/__tests__/secret-proxy.test.ts \
         packages/server/src/gateway/__tests__/secret-proxy-harden.test.ts

 50 pass
 0 fail
 95 expect() calls

Full gateway test suite: 831 pass / 0 fail across 68 files.

make build-packages is clean. make typecheck shows 12 errors, all pre-existing on origin/main (the WorkerTokenData.organizationId / AgentMetadata.organizationId type definitions lag behind their callers); zero new errors introduced.

Non-goals (explicitly skipped)

Cosmetic items (composite-key PolicyStore refactor, unref?.() operator, parameter inversion, comment cleanups), PR #865's branch, helm/kustomize changes, and coverage gaps that don't map to a fix above (cascade-delete tests, PolicyStore.clear() cross-org, etc.).

Test plan

  • bun test for the 5 affected gateway test files (50/50 pass)
  • Full packages/server/src/gateway/__tests__/ run (831/831 pass)
  • make build-packages
  • make typecheck — 12 pre-existing errors, 0 new
  • Red→green captured per fix

Summary by CodeRabbit

  • Tests

    • Added regression and end-to-end tests covering tenant isolation, secret-proxy behavior, and pending-interaction cleanup.
  • Bug Fixes

    • Enforced tenant scoping for placeholder/secret resolution to prevent cross-organization access.
    • Fail-closed: requests now return 503 when agent-organization resolution fails.
    • Pending-interaction fixes: preserve original creation timestamps, bounded sweep deletions, and explicit hard-delete to avoid orphaned rows.
  • Refactor

    • In-memory pending-message cache now relies on TTL-only eviction; DB sweeping handled centrally.
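
TTL-only eviction of an in-memory cache, as described in the Refactor bullet, might look like the sketch below. `PendingSentCache` and `evictExpired` are hypothetical names; the real bridge drives eviction from a local timer, whereas here the clock is passed in explicitly to keep the example deterministic.

```typescript
class PendingSentCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  set(key: string, value: V, now: number): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }

  // Drops entries whose TTL has elapsed; touches no database rows.
  evictExpired(now: number): number {
    const expired: string[] = [];
    this.entries.forEach((entry, key) => {
      if (entry.expiresAt <= now) expired.push(key);
    });
    for (const key of expired) this.entries.delete(key);
    return expired.length;
  }

  get size(): number {
    return this.entries.size;
  }
}

const sentCache = new PendingSentCache<string>(60000);
sentCache.set("m1", "hello", 0);
sentCache.set("m2", "world", 30000);
const evicted = sentCache.evictExpired(60000);
```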

Review Change Stack

…interactions

Closes 6 issues surfaced by a 6-agent review of #834, #836, #848, and the
#836 followup:

1. secret-proxy: fail-closed on agentOrgResolver DB error (was warn +
   fall-through with undefined expectedOrganizationId — a DB hiccup
   window let downstream org checks silently downgrade).
2. secret-proxy: legacy-mapping bypass closed. Pre-fix
   `lookupPlaceholderMapping` skipped the org check when
   `mapping.organizationId` was unset (legacy rows pre-org-pivot); a
   worker from org B could resolve a legacy unscoped mapping owned by
   org A. Now: if the caller supplies `expectedOrganizationId`, the
   mapping must match it. Emit a WARN on every legacy unscoped access
   to plan the deprecation.
3. pending-interaction-store: drop `created_at = now()` from ON CONFLICT.
   Webhook retries no longer reset the 24h TTL clock, so a misbehaving
   retry loop cannot keep a row alive indefinitely. `claimed_at = NULL`
   reset is preserved so legitimate retries are still claimable.
4. egress-judge VerdictCache key already includes orgId
   (`cache.ts:30-43`) — no code change needed, documented for the
   reviewers.
5. interaction-bridge: drop the per-bridge `sweepStalePendingInteractions`
   setInterval. The global `coreServices.sweepEphemeralTables`
   (scheduled in `src/scheduled/jobs.ts`) already covers it; the
   per-bridge call was N-times-per-pod wasted DB work. The local sweep
   timer is retained but now only evicts the in-memory
   `pendingSentMessages` cache by TTL.
6. pending-interaction-store: cap `sweepStalePendingInteractions` at 1000
   rows/call (configurable). An unbounded DELETE under a stale-row
   backlog could lock the table; remaining rows drain across subsequent
   5-minute cycles.
7. pending-interaction-store: add `deletePendingQuestion(id, org, conn,
   user)` and use it in the post-failure drop path. Pre-fix the bridge
   called `claimPendingQuestion` (UPDATE setting claimed_at) to "drop"
   the row, leaving a phantom row sitting around until the 24h sweep.
   `deletePendingQuestion` carries the same four-field scoping as the
   claim path, so safety invariants are identical.
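
The difference between claiming and deleting in item 7 can be shown with an in-memory model. `claimPendingRow` and `deletePendingRow` are hypothetical stand-ins that take the row array explicitly; the four scoping fields follow the `(id, org, conn, user)` signature named above.

```typescript
interface PendingQuestionRow {
  id: string;
  organizationId: string;
  connectionId: string;
  expectedUserId: string;
  claimedAt: number | null;
}

// The same four-field match is used by both operations.
function inScope(
  row: PendingQuestionRow,
  id: string,
  org: string,
  conn: string,
  user: string
): boolean {
  return (
    row.id === id &&
    row.organizationId === org &&
    row.connectionId === conn &&
    row.expectedUserId === user
  );
}

// Claim: marks the row, but the row itself stays in the table.
function claimPendingRow(
  rows: PendingQuestionRow[],
  id: string,
  org: string,
  conn: string,
  user: string,
  now: number
): PendingQuestionRow | null {
  const row = rows.find((r) => inScope(r, id, org, conn, user) && r.claimedAt === null);
  if (row === undefined) return null;
  row.claimedAt = now;
  return row;
}

// Delete: removes the row outright, so nothing waits for the 24h sweep.
function deletePendingRow(
  rows: PendingQuestionRow[],
  id: string,
  org: string,
  conn: string,
  user: string
): boolean {
  const index = rows.findIndex((r) => inScope(r, id, org, conn, user));
  if (index === -1) return false;
  rows.splice(index, 1);
  return true;
}
```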

Tests:

- `multi-tenant-isolation-reproducers.test.ts` gets 3 new cases under
  `[finding 1]`: legacy-mapping bypass rejected, legacy-mapping warn
  path with no expected org, and SecretProxy.forward returns 503 when
  the resolver throws.
- New `pending-interaction-cleanup.test.ts` covers retry-preserves-
  `created_at`, sweep LIMIT honoured + remainder drains, post-failure
  drop is a DELETE (not just claim), and `deletePendingQuestion`
  scoping invariant.
- `secret-proxy.test.ts` updated to reflect the closed legacy bypass —
  the old "falls through (legacy)" case now expects a null return when
  the caller supplies an expected org; a new "no expected org" case
  documents the WARN path for un-org-scoped callers.

Red→green proof captured per-fix in the PR body.

coderabbitai Bot commented May 18, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 6a27ddfe-169c-461c-afc7-691e45dbe073

📥 Commits

Reviewing files that changed from the base of the PR and between effc163 and 9b09853.

📒 Files selected for processing (1)
  • packages/owletto

📝 Walkthrough


This PR enforces org scoping in secret-proxy, makes agent org resolution failures return HTTP 503, preserves pending-interaction created_at on retry, adds bounded sweep/delete APIs, updates interaction-bridge cleanup to hard-delete, and adds tests covering these behaviors.

Changes

Multi-tenant isolation & pending interaction cleanup

| Layer / File(s) | Summary |
| --- | --- |
| Organization scoping and resolver error handling: `packages/server/src/gateway/proxy/secret-proxy.ts`, `packages/server/src/gateway/__tests__/secret-proxy.test.ts`, `packages/server/src/gateway/__tests__/multi-tenant-isolation-reproducers.test.ts` | lookupPlaceholderMapping rejects mappings missing organizationId when caller provides expectedOrganizationId; logs legacy/unscoped accesses; SecretProxy.forward fails closed with HTTP 503 on agentOrgResolver errors; auth/profile resolution runs in orgContext when expectedOrganizationId is present. Tests added/updated to assert fail-closed and legacy-no-org behaviors. |
| Pending interaction persistence layer: `packages/server/src/gateway/connections/pending-interaction-store.ts` | storePendingQuestion upsert preserves original created_at on conflict while resetting only claimed_at; sweepStalePendingInteractions gains a limit parameter and DEFAULT_SWEEP_LIMIT to bound deletions; new deletePendingQuestion hard-deletes a pending row scoped by (id, organization_id, connection_id, expected_user_id). |
| Interaction-bridge cleanup integration: `packages/server/src/gateway/connections/interaction-bridge.ts` | Imports and uses deletePendingQuestion for post-failure cleanup instead of claim-style updates; simplifies pending-sent cache sweep to in-memory TTL eviction only; updates comments to reflect global DB-row sweeping ownership. |
| Comprehensive pending interaction cleanup test coverage: `packages/server/src/gateway/__tests__/pending-interaction-cleanup.test.ts` | Adds tests verifying retry-on-conflict preserves created_at while resetting claimed_at, bounded sweep removes up to LIMIT items and drains remainder on subsequent runs, deletePendingQuestion hard-deletes and is scoped to the correct tuple, and an end-to-end bridge test confirms hard-delete on post failure. |
| Submodule pointer: `packages/owletto` | Bump recorded subproject commit SHA. |

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant SecretProxy
  participant AgentOrgResolver
  participant Upstream
  Client->>SecretProxy: request requiring agent org resolution
  SecretProxy->>AgentOrgResolver: resolve agent organization
  AgentOrgResolver-->>SecretProxy: throws error
  SecretProxy->>Client: HTTP 503 "failed to resolve agent organization"
  note over Upstream: no call made to upstream when resolver fails

sequenceDiagram
  participant InteractionBridge
  participant ThreadService
  participant PendingStore
  InteractionBridge->>PendingStore: store pending interaction
  InteractionBridge->>ThreadService: thread.post (on question:created)
  ThreadService-->>InteractionBridge: post fails
  InteractionBridge->>PendingStore: deletePendingQuestion(id, org, conn, user)
  PendingStore-->>InteractionBridge: deleted (row removed)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • lobu-ai/lobu#836: Overlaps with tenant-scoped secret-proxy changes (strict lookupPlaceholderMapping behavior and fail-closed SecretProxy.forward).
  • lobu-ai/lobu#834: Related Postgres pending-interaction persistence refactor that this PR builds on and modifies (store/sweep/delete behaviors).

Suggested labels

skip-size-check

Poem

🐰 I hopped through tenants, kept legacy in sight,

Timestamps held steady, retries kept light,
Sweeps trimmed in batches, deletes clear and neat,
Secrets bound to orgs, resolver won't cheat,
A rabbit cheers clean code — tiny, steady, bright!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ⚠️ Warning | The linked issues (#1, #2, #3, #6) concern GitHub Actions, Kubernetes integration, Claude Code workflow, and local Docker deployment; none relate to multi-tenant isolation or pending interactions cleanup that this PR addresses. | Verify the correct linked issues are referenced. This PR appears to address issues from a different feature area than those currently linked. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and specifically summarizes the primary change: post-review cleanup addressing multi-tenant isolation and pending interactions issues. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured. It includes a detailed summary of all seven fixes with technical explanations, red→green test results for each fix, final test results, non-goals, and a complete test plan with verification steps. |
| Out of Scope Changes check | ✅ Passed | All changes (secret-proxy org-scoping, agentOrgResolver error handling, pending-interaction-store TTL/cleanup, sweepStalePendingInteractions LIMIT, interaction-bridge cleanup) are focused on multi-tenant isolation and pending interactions, directly aligned with PR objectives. |


@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


Two issues flagged by codex review:

1. Provider credential lookup had an unscoped fail-open DB-error path.
   The PR's new 503 in `secret-proxy.ts` only covered the resolver-throws
   branch — when the worker token already carried `organizationId`, the
   resolver was skipped and downstream `authProfilesManager.getBestProfile`
   ran without org context. Inside `AuthProfilesManager.resolveAgentOrgId`,
   a DB error logged a warning and returned undefined, falling through to
   unscoped credential reads (`auth-profiles-manager.ts:251-275`). Now we
   wrap the credential lookup in `orgContext.run({organizationId:
   expectedOrganizationId}, ...)` so `AuthProfilesManager.listProfiles`
   short-circuits via `tryGetOrgId()` and never invokes its own resolver
   when we already know the org. A DB hiccup in the upstream resolver
   cannot downgrade scoping for these requests.

2. The post-failure cleanup test exercised `deletePendingQuestion()`
   directly but didn't drive `registerInteractionBridge`. A regression
   that swapped `deletePendingQuestion` back to `claimPendingQuestion`
   in the bridge would not have failed the test. Added an integration
   test that emits a `question:created` event, stubs the thread post to
   throw, and asserts the row is GONE from `pending_interactions` (count
   0). Verified red→green by reverting the bridge to claim — the test
   times out at 5s because the row is never deleted.

buremba commented May 18, 2026

Codex review iteration

First codex pass flagged two issues; both addressed in commit effc1636:

  1. Provider credential lookup had an unscoped fail-open DB-error path. The 503 only covered the resolver-throws branch in secret-proxy.ts. When the worker token already carried organizationId, downstream authProfilesManager.getBestProfile ran without org context, and AuthProfilesManager.resolveAgentOrgId swallowed DB errors and fell through to unscoped reads (auth-profiles-manager.ts:251-275). Fix: wrap the credential lookup in orgContext.run({organizationId: expectedOrganizationId}, ...) so the resolver short-circuits via tryGetOrgId() and a transient DB error cannot downgrade scoping.

  2. Post-failure cleanup test was unit-level, not bridge-level. Added an integration test that emits question:created, stubs the thread post to throw, drives registerInteractionBridge end-to-end, and asserts the row is deleted from pending_interactions. Red→green verified: with deletePendingQuestion reverted to claimPendingQuestion in the bridge, the test times out at 5s because the row is never deleted.
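
The org-context wrap from item 1 can be sketched as below. This is a simplified synchronous stand-in: the names `orgContext.run` and `tryGetOrgId` come from the commit, but their implementation here is assumed, `listProfilesScope` is hypothetical, and a real ambient context would also have to survive async boundaries, which this sketch does not model.

```typescript
type OrgContext = { organizationId: string };

// Assumed implementation: a single ambient slot that run() sets and restores.
let activeOrgContext: OrgContext | undefined;

const orgContext = {
  run<T>(ctx: OrgContext, fn: () => T): T {
    const previous = activeOrgContext;
    activeOrgContext = ctx;
    try {
      return fn();
    } finally {
      activeOrgContext = previous;
    }
  },
};

function tryGetOrgId(): string | undefined {
  return activeOrgContext?.organizationId;
}

// Hypothetical credential lookup: consults the ambient org first and only
// falls back to its own (possibly failing) resolver when none is set.
function listProfilesScope(
  agentId: string,
  resolveAgentOrg: (agentId: string) => string | undefined
): string {
  const orgId = tryGetOrgId() ?? resolveAgentOrg(agentId);
  return orgId === undefined ? "unscoped" : `scoped:${orgId}`;
}
```

Because `??` short-circuits, the fallback resolver is never invoked inside `orgContext.run`, so an error in it cannot downgrade the lookup to an unscoped read.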

Second codex pass:

No blocking findings. […] The PR appears to address the two prior Codex concerns. […] The interaction-bridge test drives registerInteractionBridge() via question:created, forces thread.post failure through postWithFallback, and verifies the persisted row is deleted; it is a stubbed chat boundary, but not a mock-on-mock of deletePendingQuestion(). […] Targeted tests reported 31 pass, 0 fail.

Verdict: SHIP IT — confidence 92%


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/gateway/__tests__/pending-interaction-cleanup.test.ts`:
- Around line 93-111: The test currently never exercises the ON CONFLICT path
that resets claimed_at because claimed_at is NULL from the initial insert;
modify the test around storePendingQuestion/retry so that after the initial
storePendingQuestion(q.id, ...) you mark the row as claimed (e.g. set claimed_at
to a non-null value via an update against the pending_interactions row for q.id
or by invoking the claim codepath), then call storePendingQuestion(retry.id,
ORG_A, CONN_A, USER_A, { question: retry }) and assert that the subsequent
SELECT on pending_interactions for q.id returns claimed_at === null while still
asserting created_at remained unchanged; reference storePendingQuestion and the
pending_interactions.claimed_at field in your changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 9f621861-b657-432d-a3c6-10abdf118720

📥 Commits

Reviewing files that changed from the base of the PR and between 3626a0b and effc163.

📒 Files selected for processing (2)
  • packages/server/src/gateway/__tests__/pending-interaction-cleanup.test.ts
  • packages/server/src/gateway/proxy/secret-proxy.ts

Comment on lines +93 to +111
// Webhook retry — same id, same scope, slightly different payload.
const retry = { ...q, question: "go? (retry)" };
await storePendingQuestion(retry.id, ORG_A, CONN_A, USER_A, {
question: retry,
});

const after = await sql<{ created_at: Date; claimed_at: Date | null }>`
SELECT created_at, claimed_at
FROM pending_interactions WHERE id = ${q.id}
`;
const afterTs = new Date(after[0]!.created_at).getTime();

// Pre-fix: this assertion fails — the ON CONFLICT clause moved
// created_at to now() and `afterTs` ≈ now() ≫ `backdatedTs`.
// Post-fix: created_at is unchanged across retries.
expect(afterTs).toBe(backdatedTs);
// claimed_at is still reset on conflict so a legitimate retry can
// be claimed.
expect(after[0]!.claimed_at).toBeNull();

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

This test doesn't actually exercise the claimed_at reset.

claimed_at is still NULL from the initial insert, so the final assertion passes even if the conflict path stops clearing it. Make the row claimed first, then retry the store and assert it goes back to NULL.

Suggested test fix
     const backdatedTs = new Date(backdated[0]!.created_at).getTime();
     expect(backdatedTs).toBeLessThan(originalCreatedAt);
 
+    expect(
+      await claimPendingQuestion(q.id, ORG_A, CONN_A, USER_A)
+    ).not.toBeNull();
+
     // Webhook retry — same id, same scope, slightly different payload.
     const retry = { ...q, question: "go? (retry)" };
     await storePendingQuestion(retry.id, ORG_A, CONN_A, USER_A, {
       question: retry,
     });

@buremba buremba merged commit 907bdd8 into main May 18, 2026
17 of 18 checks passed
@buremba buremba deleted the fix/multi-tenant-cleanup branch May 18, 2026 05:18