fix(server): post-review cleanup of multi-tenant isolation + pending interactions#867
Conversation
…interactions Closes 6 issues surfaced by a 6-agent review of #834, #836, #848, and the #836 followup: 1. secret-proxy: fail-closed on agentOrgResolver DB error (was warn + fall-through with undefined expectedOrganizationId — a DB hiccup window let downstream org checks silently downgrade). 2. secret-proxy: legacy-mapping bypass closed. Pre-fix `lookupPlaceholder Mapping` skipped the org check when `mapping.organizationId` was unset (legacy rows pre-org-pivot); a worker from org B could resolve a legacy unscoped mapping owned by org A. Now: if the caller supplies `expectedOrganizationId`, the mapping must match it. Emit a WARN on every legacy unscoped access to plan the deprecation. 3. pending-interaction-store: drop `created_at = now()` from ON CONFLICT. Webhook retries no longer reset the 24h TTL clock, so a misbehaving retry loop cannot keep a row alive indefinitely. `claimed_at = NULL` reset is preserved so legitimate retries are still claimable. 4. egress-judge VerdictCache key already includes orgId (`cache.ts:30-43`) — no code change needed, documented for the reviewers. 5. interaction-bridge: drop the per-bridge `sweepStalePendingInteractions` setInterval. The global `coreServices.sweepEphemeralTables` (scheduled in `src/scheduled/jobs.ts`) already covers it; the per-bridge call was N-times-per-pod wasted DB work. The local sweep timer is retained but now only evicts the in-memory `pendingSent Messages` cache by TTL. 6. pending-interaction-store: cap `sweepStalePendingInteractions` at 1000 rows/call (configurable). An unbounded DELETE under a stale-row backlog could lock the table; remaining rows drain across subsequent 5-minute cycles. 7. pending-interaction-store: add `deletePendingQuestion(id, org, conn, user)` and use it in the post-failure drop path. Pre-fix the bridge called `claimPendingQuestion` (UPDATE setting claimed_at) to "drop" the row, leaving a phantom row sitting around until the 24h sweep. `deletePendingQuestion` carries the same four-field scoping as the claim path, so safety invariants are identical. Tests: - `multi-tenant-isolation-reproducers.test.ts` gets 3 new cases under `[finding 1]`: legacy-mapping bypass rejected, legacy-mapping warn path with no expected org, and SecretProxy.forward returns 503 when the resolver throws. - New `pending-interaction-cleanup.test.ts` covers retry-preserves- `created_at`, sweep LIMIT honoured + remainder drains, post-failure drop is a DELETE (not just claim), and `deletePendingQuestion` scoping invariant. - `secret-proxy.test.ts` updated to reflect the closed legacy bypass — the old "falls through (legacy)" case now expects a null return when the caller supplies an expected org; a new "no expected org" case documents the WARN path for un-org-scoped callers. Red→green proof captured per-fix in the PR body.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR enforces org scoping in secret-proxy, makes agent org resolution failures return HTTP 503, preserves pending-interaction created_at on retry, adds bounded sweep/delete APIs, updates interaction-bridge cleanup to hard-delete, and adds tests covering these behaviors. ChangesMulti-tenant isolation & pending interaction cleanup
Sequence Diagram(s)sequenceDiagram
participant Client
participant SecretProxy
participant AgentOrgResolver
participant Upstream
Client->>SecretProxy: request requiring agent org resolution
SecretProxy->>AgentOrgResolver: resolve agent organization
AgentOrgResolver-->>SecretProxy: throws error
SecretProxy->>Client: HTTP 503 "failed to resolve agent organization"
note over Upstream: no call made to upstream when resolver fails
sequenceDiagram
participant InteractionBridge
participant ThreadService
participant PendingStore
InteractionBridge->>PendingStore: store pending interaction
InteractionBridge->>ThreadService: thread.post (on question:created)
ThreadService-->>InteractionBridge: post fails
InteractionBridge->>PendingStore: deletePendingQuestion(id, org, conn, user)
PendingStore-->>InteractionBridge: deleted (row removed)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Two issues flagged by codex review:
1. Provider credential lookup had an unscoped fail-open DB-error path.
The PR's new 503 in `secret-proxy.ts` only covered the resolver-throws
branch — when the worker token already carried `organizationId`, the
resolver was skipped and downstream `authProfilesManager.getBestProfile`
ran without org context. Inside `AuthProfilesManager.resolveAgentOrgId`,
a DB error logged a warning and returned undefined, falling through to
unscoped credential reads (`auth-profiles-manager.ts:251-275`). Now we
wrap the credential lookup in `orgContext.run({organizationId:
expectedOrganizationId}, ...)` so `AuthProfilesManager.listProfiles`
short-circuits via `tryGetOrgId()` and never invokes its own resolver
when we already know the org. A DB hiccup in the upstream resolver
cannot downgrade scoping for these requests.
2. The post-failure cleanup test exercised `deletePendingQuestion()`
directly but didn't drive `registerInteractionBridge`. A regression
that swapped `deletePendingQuestion` back to `claimPendingQuestion`
in the bridge would not have failed the test. Added an integration
test that emits a `question:created` event, stubs the thread post to
throw, and asserts the row is GONE from `pending_interactions` (count
0). Verified red→green by reverting the bridge to claim — the test
times out at 5s because the row is never deleted.
Codex review iterationFirst codex pass flagged two issues; both addressed in commit
Second codex pass:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/gateway/__tests__/pending-interaction-cleanup.test.ts`:
- Around line 93-111: The test currently never exercises the ON CONFLICT path
that resets claimed_at because claimed_at is NULL from the initial insert;
modify the test around storePendingQuestion/retry so that after the initial
storePendingQuestion(q.id, ...) you mark the row as claimed (e.g. set claimed_at
to a non-null value via an update against the pending_interactions row for q.id
or by invoking the claim codepath), then call storePendingQuestion(retry.id,
ORG_A, CONN_A, USER_A, { question: retry }) and assert that the subsequent
SELECT on pending_interactions for q.id returns claimed_at === null while still
asserting created_at remained unchanged; reference storePendingQuestion and the
pending_interactions.claimed_at field in your changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 9f621861-b657-432d-a3c6-10abdf118720
📒 Files selected for processing (2)
packages/server/src/gateway/__tests__/pending-interaction-cleanup.test.tspackages/server/src/gateway/proxy/secret-proxy.ts
| // Webhook retry — same id, same scope, slightly different payload. | ||
| const retry = { ...q, question: "go? (retry)" }; | ||
| await storePendingQuestion(retry.id, ORG_A, CONN_A, USER_A, { | ||
| question: retry, | ||
| }); | ||
|
|
||
| const after = await sql<{ created_at: Date; claimed_at: Date | null }>` | ||
| SELECT created_at, claimed_at | ||
| FROM pending_interactions WHERE id = ${q.id} | ||
| `; | ||
| const afterTs = new Date(after[0]!.created_at).getTime(); | ||
|
|
||
| // Pre-fix: this assertion fails — the ON CONFLICT clause moved | ||
| // created_at to now() and `afterTs` ≈ now() ≫ `backdatedTs`. | ||
| // Post-fix: created_at is unchanged across retries. | ||
| expect(afterTs).toBe(backdatedTs); | ||
| // claimed_at is still reset on conflict so a legitimate retry can | ||
| // be claimed. | ||
| expect(after[0]!.claimed_at).toBeNull(); |
There was a problem hiding this comment.
This test doesn't actually exercise the claimed_at reset.
claimed_at is still NULL from the initial insert, so the final assertion passes even if the conflict path stops clearing it. Make the row claimed first, then retry the store and assert it goes back to NULL.
Suggested test fix
const backdatedTs = new Date(backdated[0]!.created_at).getTime();
expect(backdatedTs).toBeLessThan(originalCreatedAt);
+ expect(
+ await claimPendingQuestion(q.id, ORG_A, CONN_A, USER_A)
+ ).not.toBeNull();
+
// Webhook retry — same id, same scope, slightly different payload.
const retry = { ...q, question: "go? (retry)" };
await storePendingQuestion(retry.id, ORG_A, CONN_A, USER_A, {
question: retry,
});📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Webhook retry — same id, same scope, slightly different payload. | |
| const retry = { ...q, question: "go? (retry)" }; | |
| await storePendingQuestion(retry.id, ORG_A, CONN_A, USER_A, { | |
| question: retry, | |
| }); | |
| const after = await sql<{ created_at: Date; claimed_at: Date | null }>` | |
| SELECT created_at, claimed_at | |
| FROM pending_interactions WHERE id = ${q.id} | |
| `; | |
| const afterTs = new Date(after[0]!.created_at).getTime(); | |
| // Pre-fix: this assertion fails — the ON CONFLICT clause moved | |
| // created_at to now() and `afterTs` ≈ now() ≫ `backdatedTs`. | |
| // Post-fix: created_at is unchanged across retries. | |
| expect(afterTs).toBe(backdatedTs); | |
| // claimed_at is still reset on conflict so a legitimate retry can | |
| // be claimed. | |
| expect(after[0]!.claimed_at).toBeNull(); | |
| const backdatedTs = new Date(backdated[0]!.created_at).getTime(); | |
| expect(backdatedTs).toBeLessThan(originalCreatedAt); | |
| expect( | |
| await claimPendingQuestion(q.id, ORG_A, CONN_A, USER_A) | |
| ).not.toBeNull(); | |
| // Webhook retry — same id, same scope, slightly different payload. | |
| const retry = { ...q, question: "go? (retry)" }; | |
| await storePendingQuestion(retry.id, ORG_A, CONN_A, USER_A, { | |
| question: retry, | |
| }); | |
| const after = await sql<{ created_at: Date; claimed_at: Date | null }>` | |
| SELECT created_at, claimed_at | |
| FROM pending_interactions WHERE id = ${q.id} | |
| `; | |
| const afterTs = new Date(after[0]!.created_at).getTime(); | |
| // Pre-fix: this assertion fails — the ON CONFLICT clause moved | |
| // created_at to now() and `afterTs` ≈ now() ≫ `backdatedTs`. | |
| // Post-fix: created_at is unchanged across retries. | |
| expect(afterTs).toBe(backdatedTs); | |
| // claimed_at is still reset on conflict so a legitimate retry can | |
| // be claimed. | |
| expect(after[0]!.claimed_at).toBeNull(); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/gateway/__tests__/pending-interaction-cleanup.test.ts`
around lines 93 - 111, The test currently never exercises the ON CONFLICT path
that resets claimed_at because claimed_at is NULL from the initial insert;
modify the test around storePendingQuestion/retry so that after the initial
storePendingQuestion(q.id, ...) you mark the row as claimed (e.g. set claimed_at
to a non-null value via an update against the pending_interactions row for q.id
or by invoking the claim codepath), then call storePendingQuestion(retry.id,
ORG_A, CONN_A, USER_A, { question: retry }) and assert that the subsequent
SELECT on pending_interactions for q.id returns claimed_at === null while still
asserting created_at remained unchanged; reference storePendingQuestion and the
pending_interactions.claimed_at field in your changes.
Summary
Single PR fixing 7 issues surfaced by a 6-agent review of recently-merged commits
de81ebe5(#834),de4c238b(#836),7fc36dca(#848), andada4b169(#836 followup).Fixes
agentOrgResolverDB error —packages/server/src/gateway/proxy/secret-proxy.ts:541-563. Was warn + fall-through withexpectedOrganizationId = undefined; now a 503 response. A DB hiccup window no longer downgrades downstream org checks.secret-proxy.ts:195-227. Pre-fix the org check gated onmapping.organizationIdbeing set, so a legacy mapping (minted before the org-id pivot) sailed through under any expected org. Now: if caller suppliesexpectedOrganizationId, the mapping must match — including refusing to matchundefined. WARN log emitted on every legacy unscoped access to schedule the deprecation.created_at—pending-interaction-store.ts:54-65. Dropcreated_at = now()from the UPDATE clause so webhook retries cannot keep a row alive past the 24h TTL by re-stashing it.claimed_at = NULLreset retained.egress-judge/cache.ts:29-43already keys by(orgId, policyHash, …).judge.ts:105-111passesrequest.organizationIdas the first key part. Two orgs with identical policy text get distinct cache entries.interaction-bridge.ts:198-227. GlobalcoreServices.sweepEphemeralTables(scheduled insrc/scheduled/jobs.ts:110-114, every 5 min) already callssweepStalePendingInteractions. Per-bridgesetIntervalwas N-times-per-pod redundant work. Local timer retained for in-memorypendingSentMessagesTTL eviction only.pending-interaction-store.ts:101-135. Default 1000 rows/call (configurable).DELETE … WHERE id IN (SELECT … LIMIT N)shape; remaining rows drain across subsequent cycles.interaction-bridge.ts:332-355+ newdeletePendingQuestion(id, org, conn, user)inpending-interaction-store.ts:137-159. Pre-fix the bridge calledclaimPendingQuestion(UPDATE settingclaimed_at) to "drop" the row, leaving a phantom row until the 24h sweep.deletePendingQuestioncarries the same four-field scoping asclaimPendingQuestion, so safety is identical.Red → green per fix
All red runs captured with the fix branch checked out and the production source files temporarily reverted.
Fix #1 (503 on resolver error) — pre-fix:
Post-fix: passes.
Fix #2 (legacy bypass) — pre-fix:
Post-fix: passes.
Fix #3 (created_at preserved) — pre-fix:
Post-fix: passes (created_at unchanged across retry).
Fixes #6 + #7 (LIMIT + deletePendingQuestion) — pre-fix the cleanup test file fails to even load:
Post-fix: all 4 cleanup tests pass (sweep LIMIT honoured + remainder drains, deletePendingQuestion hard-deletes, scoping invariant).
Green (final)
Full gateway test suite:
831 pass / 0 failacross 68 files.make build-packagesclean.make typecheckshows 12 errors — all pre-existing onorigin/main(WorkerTokenData.organizationId/AgentMetadata.organizationIdtype definitions lag behind their callers); zero new errors introduced.Non-goals (explicitly skipped)
Cosmetic items (composite-key PolicyStore refactor,
unref?.()operator, parameter inversion, comment cleanups), PR #865's branch, helm/kustomize changes, and coverage gaps that don't map to a fix above (cascade-delete tests, PolicyStore.clear() cross-org, etc.).Test plan
bun testfor the 5 affected gateway test files (50/50 pass)packages/server/src/gateway/__tests__/run (831/831 pass)make build-packagesmake typecheck— 12 pre-existing errors, 0 newSummary by CodeRabbit
Tests
Bug Fixes
Refactor