refactor(server): move pending interactions and SSE fan-out to Postgres#834
Conversation
Replaces the in-process `Map<questionId, PendingQuestionEntry>` in `interaction-bridge.ts` with a `public.pending_interactions` table so a button click landing on pod B can claim a question registered on pod A. `claimPendingQuestion` is an atomic `UPDATE … SET claimed_at = now() WHERE id = $1 AND claimed_at IS NULL RETURNING entry_payload` — Slack webhook retries and racing clicks serialize on the row, the winner gets the payload, losers see null and no-op. The non-serializable platform `SentMessage` handle (used to edit the original card after a click) stays in a small per-pod cache. Losing it cross-pod only degrades card-edit UX; the answer still routes correctly. A 24h stale-row sweep piggybacks on the existing scheduled `sweep-ephemeral-tables` task via `sweepEphemeralTables` in `core-services.ts`. Migration is additive (one new table, two indexes). Single-pod behavior is unchanged. SSE fan-out (sse-manager.ts) is still per-pod and left for a follow-up PR.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds Postgres persistence for pending chat interactions: a new ChangesPersistent Pending Interactions
Test formatting-only changes
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/server/src/gateway/connections/interaction-bridge.ts (1)
291-297:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPersist the pending row before making the card clickable.
thread.post()exposes the buttons beforestorePendingQuestion()completes. If that insert is slow or fails, the first click sees “no pending entry” and gets dropped even though the card was visible. Store first, then post, or clean up the row if posting fails.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/connections/interaction-bridge.ts` around lines 291 - 297, The pending-question row must be persisted before making the card clickable: call the function that inserts the pending entry (storePendingQuestion or the code path that creates the pending row) before invoking postWithFallback(thread, { card, fallbackText }, connectionId, "question interaction"), and pass the pending ID/metadata through to trackQuestion so the saved row is associated with the post; if postWithFallback throws or returns a failure, delete/rollback the pending row to avoid orphaned rows visible to users. Ensure thread.post()/postWithFallback is only called after the insert completes successfully and add error handling to clean up the pending entry on posting failure.
🧹 Nitpick comments (1)
packages/server/src/gateway/__tests__/pending-interaction-store.test.ts (1)
48-57: ⚡ Quick winAdd a truly concurrent claim test for race coverage.
This case is sequential and would still pass with a non-atomic select-then-update implementation. Running two claims concurrently would better lock in the intended single-winner guarantee.
🧪 Suggested test shape
test("claim returns the stored payload exactly once", async () => { const q = buildQuestion("q-1"); await storePendingQuestion(q.id, undefined, { question: q }); - const first = await claimPendingQuestion(q.id); - expect(first?.question.id).toBe("q-1"); - - const second = await claimPendingQuestion(q.id); - expect(second).toBeNull(); + const [a, b] = await Promise.all([ + claimPendingQuestion(q.id), + claimPendingQuestion(q.id), + ]); + const winners = [a, b].filter((x) => x !== null); + expect(winners).toHaveLength(1); + expect(winners[0]?.question.id).toBe("q-1"); });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/__tests__/pending-interaction-store.test.ts` around lines 48 - 57, Add a concurrent test that verifies only one caller wins when two claims run at the same time: use buildQuestion to create a question, storePendingQuestion to insert it, then start two simultaneous calls to claimPendingQuestion (e.g., via Promise.all or Promise.allSettled) and assert that exactly one result is non-null and has question.id "q-1" while the other is null; reference the existing test name and functions claimPendingQuestion, storePendingQuestion, and buildQuestion to locate where to add this new test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/gateway/connections/interaction-bridge.ts`:
- Around line 203-206: The pendingSentMessages map currently never evicts
entries; modify trackQuestion and restashQuestion to call a shared helper
trackPendingSentMessage(questionId, SentMessage) that adds the entry to
pendingSentMessages and starts a TTL timer (match DB sweep horizon, e.g. 24
hours) to remove it when expired; ensure trackPendingSentMessage replaces any
existing timer for the same questionId. Also update the unregister/teardown path
to clear any outstanding timers and remove entries from pendingSentMessages so
timers don’t leak on connection teardown.
- Around line 463-468: The code currently calls claimQuestion(questionId) before
checking manager.getInstance(connectionId), which can leave a question
permanently claimed if this pod can't service it; move the live-instance check
(manager.getInstance(connectionId)) before calling claimQuestion(questionId), or
if you prefer to keep the current order, ensure you call the unclaim/restash
path (the function that releases the DB claim) when manager.getInstance(...)
returns a miss so the row is released; update the handler in the async
(questionId, value, thread, author) => block (and the similar block at lines
~477-484) to either perform manager.getInstance first or call the
release/unclaim helper on the missed path before returning.
In `@packages/server/src/gateway/connections/pending-interaction-store.ts`:
- Around line 82-85: The restashPendingQuestion SQL clears claimed_at but
doesn't update created_at, so a row restashed near the 24h cutoff can be
immediately swept; modify the UPSERT in restashPendingQuestion to also refresh
created_at (e.g., set created_at = NOW() or created_at = EXCLUDED.created_at) in
the ON CONFLICT ... DO UPDATE clause so restashed rows get a fresh timestamp and
won't be immediately deleted by the sweeper.
---
Outside diff comments:
In `@packages/server/src/gateway/connections/interaction-bridge.ts`:
- Around line 291-297: The pending-question row must be persisted before making
the card clickable: call the function that inserts the pending entry
(storePendingQuestion or the code path that creates the pending row) before
invoking postWithFallback(thread, { card, fallbackText }, connectionId,
"question interaction"), and pass the pending ID/metadata through to
trackQuestion so the saved row is associated with the post; if postWithFallback
throws or returns a failure, delete/rollback the pending row to avoid orphaned
rows visible to users. Ensure thread.post()/postWithFallback is only called
after the insert completes successfully and add error handling to clean up the
pending entry on posting failure.
---
Nitpick comments:
In `@packages/server/src/gateway/__tests__/pending-interaction-store.test.ts`:
- Around line 48-57: Add a concurrent test that verifies only one caller wins
when two claims run at the same time: use buildQuestion to create a question,
storePendingQuestion to insert it, then start two simultaneous calls to
claimPendingQuestion (e.g., via Promise.all or Promise.allSettled) and assert
that exactly one result is non-null and has question.id "q-1" while the other is
null; reference the existing test name and functions claimPendingQuestion,
storePendingQuestion, and buildQuestion to locate where to add this new test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: cb9e7dea-504f-44e0-a486-ab641681094b
📒 Files selected for processing (6)
db/migrations/20260518000000_pending_interactions.sqldb/schema.sqlpackages/server/src/gateway/__tests__/pending-interaction-store.test.tspackages/server/src/gateway/connections/interaction-bridge.tspackages/server/src/gateway/connections/pending-interaction-store.tspackages/server/src/gateway/services/core-services.ts
…entMessage TTL, persist-then-post Addresses the four pi-review findings on PR #834. [HIGH #1] Cross-tenant claim hole — keying by `id` alone let a leaked or forged question id be consumed by any connection in any org. The claim SQL now scopes by `(id, organization_id, connection_id, expected_user_id)` and the migration adds those columns as NOT NULL. Cross-tenant and cross-connection clicks return null without consuming the row. [MED #2] Unbounded pendingSentMessages — the old 5-minute timer was removed by mistake. A per-bridge `setInterval` now sweeps SentMessage handles older than 24h (matching the DB row TTL) and also drops handles for any row the scheduled DB sweep deleted. The sweep is unref'd and cleared on bridge unregister. [MED #3] Claim-then-auth race — folded into the SQL claim. The `expected_user_id = $4` predicate means wrong-user clicks never set `claimed_at`, so process death after the claim can no longer leave the row in a half-consumed state. The old "claim → check author → async restash" three-step (and `restashPendingQuestion` itself) is gone. [LOW #4] Post-then-persist ordering — `onQuestionCreated` now persists the row first, then posts the card. If the persist fails the card is never shown; if the post fails after a successful persist the row is dropped immediately rather than waiting for the 24h sweep. Red→green coverage: the new test file asserts the scoped claim rejects cross-tenant, cross-connection, and wrong-user clicks AND that the row remains claimable by the rightful owner. These same assertions fail on the pre-fix commit (verified locally against `claimPendingQuestion(id)`).
|
Pushed fix-up commit Per finding[HIGH #1] Cross-tenant claim hole — fixed. UPDATE pending_interactions
SET claimed_at = now()
WHERE id = $1
AND organization_id = $2
AND connection_id = $3
AND expected_user_id = $4
AND claimed_at IS NULL
RETURNING entry_payloadThe partial unclaimed-predicate index was widened to [MED #2] Unbounded [MED #3] Claim-then-auth race — fixed by going with option (a). [LOW #4] Post-then-persist — fixed. Red → greenWrote a temporary Then deleted the red-proof file (it only made sense against the old API), restored the fix, and ran the post-fix tests with the new scoped signatures: The fix-branch Validation
Pushed back onNothing. The author-userId check in the click handler is gone (it was redundant once the SQL claim does the work), but I did NOT touch the existing |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/server/src/gateway/connections/interaction-bridge.ts (1)
351-377:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't keep a pending row when the card fell back to plain text.
postWithFallback()returns aSentMessagefor both the interactive card and the plain-text fallback, so this branch only cleans up on total post failure. If card rendering fails but the fallback text posts, the DB row stays claimable even though there are no action buttons to produce aquestion:*click, leaving an unanswerable interaction behind.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/connections/interaction-bridge.ts` around lines 351 - 377, postWithFallback can return a SentMessage even when it only posted the plain-text fallback, leaving a pending DB row that can never receive a question:* click; update the logic after the postWithFallback call to detect when the returned SentMessage represents a fallback (e.g. check a property like sent.isFallback / sent.type / absence of renderedCard or actions) and, in that case, call claimPendingQuestion(...) to drop the pending row (same cleanup as the total-post-failure path) instead of calling rememberSentMessage(event.id, sent); keep the existing try/catch pattern and logging used in the failure branch when performing this cleanup.
🧹 Nitpick comments (1)
packages/server/src/gateway/__tests__/pending-interaction-store.test.ts (1)
70-79: ⚡ Quick winExercise the actual race here, not just sequential re-reads.
This test proves "claim once, then null", but it doesn't cover the concurrent single-winner guarantee the SQL is supposed to provide. Racing two
claimPendingQuestion(...)calls withPromise.allwould catch regressions in the atomic-claim path.Example update
- const first = await claimPendingQuestion(q.id, ORG_A, CONN_A, USER_A); - expect(first?.question.id).toBe("q-1"); - - const second = await claimPendingQuestion(q.id, ORG_A, CONN_A, USER_A); - expect(second).toBeNull(); + const [first, second] = await Promise.all([ + claimPendingQuestion(q.id, ORG_A, CONN_A, USER_A), + claimPendingQuestion(q.id, ORG_A, CONN_A, USER_A), + ]); + + const winners = [first, second].filter( + (result): result is NonNullable<typeof result> => result !== null + ); + + expect(winners).toHaveLength(1); + expect(winners[0].question.id).toBe("q-1");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/__tests__/pending-interaction-store.test.ts` around lines 70 - 79, The test currently calls claimPendingQuestion sequentially; change it to exercise a real race by invoking two concurrent claimPendingQuestion(...) calls (e.g., via Promise.all) after storing the pending question from storePendingQuestion(...) for the built question (buildQuestion("q-1")), then assert that exactly one of the results is non-null with question.id "q-1" and the other is null to validate the SQL atomic single-winner behavior of claimPendingQuestion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/server/src/gateway/connections/interaction-bridge.ts`:
- Around line 216-246: sweepPendingSent() running per-connection is calling the
global sweepStalePendingInteractions() which issues DELETEs across the whole
pending_interactions table; remove that DB-wide cleanup from the per-connection
path and only perform the local cache eviction there. Specifically, in the
function sweepPendingSent (invoked by pendingSentSweepTimer, interval
PENDING_SENT_SWEEP_INTERVAL_MS) remove the try/catch and await
sweepStalePendingInteractions() call and the deletedIds handling so the function
only prunes pendingSentMessages by registeredAt/ttlCutoff; if you need DB->cache
synchronization instead, have the single scheduled global sweeper publish
deleted IDs and wire that into each bridge to delete those ids from
pendingSentMessages rather than calling sweepStalePendingInteractions() from
sweepPendingSent().
---
Outside diff comments:
In `@packages/server/src/gateway/connections/interaction-bridge.ts`:
- Around line 351-377: postWithFallback can return a SentMessage even when it
only posted the plain-text fallback, leaving a pending DB row that can never
receive a question:* click; update the logic after the postWithFallback call to
detect when the returned SentMessage represents a fallback (e.g. check a
property like sent.isFallback / sent.type / absence of renderedCard or actions)
and, in that case, call claimPendingQuestion(...) to drop the pending row (same
cleanup as the total-post-failure path) instead of calling
rememberSentMessage(event.id, sent); keep the existing try/catch pattern and
logging used in the failure branch when performing this cleanup.
---
Nitpick comments:
In `@packages/server/src/gateway/__tests__/pending-interaction-store.test.ts`:
- Around line 70-79: The test currently calls claimPendingQuestion sequentially;
change it to exercise a real race by invoking two concurrent
claimPendingQuestion(...) calls (e.g., via Promise.all) after storing the
pending question from storePendingQuestion(...) for the built question
(buildQuestion("q-1")), then assert that exactly one of the results is non-null
with question.id "q-1" and the other is null to validate the SQL atomic
single-winner behavior of claimPendingQuestion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 2319e1c1-368e-4ca6-9507-9b1e577ada37
📒 Files selected for processing (6)
db/migrations/20260518000000_pending_interactions.sqldb/schema.sqlpackages/server/src/gateway/__tests__/pending-interaction-store.test.tspackages/server/src/gateway/connections/interaction-bridge.tspackages/server/src/gateway/connections/pending-interaction-store.tspackages/server/src/gateway/services/core-services.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- db/migrations/20260518000000_pending_interactions.sql
- packages/server/src/gateway/services/core-services.ts
| const pendingSentSweepTimer = setInterval(() => { | ||
| sweepPendingSent().catch((error) => { | ||
| logger.warn( | ||
| { connectionId, error: String(error) }, | ||
| "pendingSentMessages sweep failed" | ||
| ); | ||
| }); | ||
| }, PENDING_SENT_SWEEP_INTERVAL_MS); | ||
| pendingSentSweepTimer.unref?.(); | ||
| async function sweepPendingSent(): Promise<void> { | ||
| const ttlCutoff = Date.now() - PENDING_SENT_TTL_MS; | ||
| for (const [id, entry] of pendingSentMessages) { | ||
| if (entry.registeredAt <= ttlCutoff) { | ||
| pendingSentMessages.delete(id); | ||
| } | ||
| } | ||
| // Also drop local handles for any DB rows the scheduled sweeper just | ||
| // deleted — keeps the local cache from outliving its DB row. | ||
| let deletedIds: string[] = []; | ||
| try { | ||
| deletedIds = await sweepStalePendingInteractions(); | ||
| } catch (error) { | ||
| // The store logs its own DB errors; treat as best-effort here. | ||
| logger.debug( | ||
| { connectionId, error: String(error) }, | ||
| "sweepStalePendingInteractions failed during local sweep" | ||
| ); | ||
| } | ||
| for (const id of deletedIds) { | ||
| pendingSentMessages.delete(id); | ||
| } |
There was a problem hiding this comment.
Keep DB stale-row cleanup out of the per-connection cache sweep.
This interval is created once per registered bridge, but sweepPendingSent() calls sweepStalePendingInteractions(), which deletes across the whole pending_interactions table. With many live connections, that turns into many identical global DELETE sweeps every hour on top of the scheduled ephemeral-table sweep from this PR.
Suggested direction
async function sweepPendingSent(): Promise<void> {
const ttlCutoff = Date.now() - PENDING_SENT_TTL_MS;
for (const [id, entry] of pendingSentMessages) {
if (entry.registeredAt <= ttlCutoff) {
pendingSentMessages.delete(id);
}
}
- // Also drop local handles for any DB rows the scheduled sweeper just
- // deleted — keeps the local cache from outliving its DB row.
- let deletedIds: string[] = [];
- try {
- deletedIds = await sweepStalePendingInteractions();
- } catch (error) {
- // The store logs its own DB errors; treat as best-effort here.
- logger.debug(
- { connectionId, error: String(error) },
- "sweepStalePendingInteractions failed during local sweep"
- );
- }
- for (const id of deletedIds) {
- pendingSentMessages.delete(id);
- }
}If you still want exact cache/DB synchronization, wire deleted ids in from the single scheduled sweep instead of having every bridge issue its own global delete.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/gateway/connections/interaction-bridge.ts` around lines
216 - 246, sweepPendingSent() running per-connection is calling the global
sweepStalePendingInteractions() which issues DELETEs across the whole
pending_interactions table; remove that DB-wide cleanup from the per-connection
path and only perform the local cache eviction there. Specifically, in the
function sweepPendingSent (invoked by pendingSentSweepTimer, interval
PENDING_SENT_SWEEP_INTERVAL_MS) remove the try/catch and await
sweepStalePendingInteractions() call and the deletedIds handling so the function
only prunes pendingSentMessages by registeredAt/ttlCutoff; if you need DB->cache
synchronization instead, have the single scheduled global sweeper publish
deleted IDs and wire that into each bridge to delete those ids from
pendingSentMessages rather than calling sweepStalePendingInteractions() from
sweepPendingSent().
…s to match dbmate dump
…orm-isolation The bridge's new per-tenant guard returns before resolveThread when connection.organizationId is missing — three tests asserting instanceChat.channel() got called fell to 0 calls. Fixture now sets organizationId: "test-org" so the path reaches resolveThread; the subsequent persist still fails on FK in CI (no org row), but is caught by the bridge's persist-then-post catch block — which is what the bridge does in production when persistence drops.
…ckiness (#848) * feat(chart): expose service.sessionAffinity so multi-replica deploys can pin SSE clients When app.replicaCount > 1, kube-proxy round-robins traffic across pods. The SSE manager keeps connections + a 100-event/2-min backlog in memory per pod (packages/server/src/gateway/services/sse-manager.ts), so a client whose stream lands on pod A while the worker run is claimed by pod B silently misses events. PR #834 fixed the AskUser click path (now DB-backed) but SSE chunks still need to flow through the pod that holds the connection. ClientIP affinity at the Service level pins each client IP to a single pod for the affinity window. No ingress-controller config required — the chart stays controller-agnostic. Disabled by default; helmrelease opts in alongside replicaCount: 2. Verified with helm template: - default (None): no sessionAffinity stanza emitted - --set service.sessionAffinity=ClientIP: full stanza with 10800s timeout * feat(chart): NOTES warning when replicaCount > 1 but sticky is off The chart change only exposes the knob; an operator can still scale replicaCount past 1 without enabling sessionAffinity and silently degrade SSE delivery. Make the foot-gun loud at install/upgrade time so it's caught before the symptom shows up in production. Verified with helm install --dry-run: - default (replicaCount=1, sessionAffinity unset): no warning - replicaCount=2 + sessionAffinity=None: WARNING emitted - replicaCount=2 + sessionAffinity=ClientIP: no warning
#855) Both caches were "lazy refresh on read" — they update an entry's expiresAt when the same agentId is looked up again, but never delete entries for agentIds that are never re-queried. Net: cache size grows monotonically with distinct agentIds the gateway has ever seen, bounded only by the pod's lifetime. In practice the growth rate is small (~200 bytes per distinct agentId, hundreds-to-thousands of agents per day) and almost certainly NOT the cause of the 1 Gi OOM that prompted #782 — but it's still a genuine bound-less Map that ideally cleans up after itself. Adds a tiny cacheSet helper with a 1024-entry cap that evicts the oldest insertion (Maps iterate in insertion order, so size-1 peek-and-delete is O(1)). Test exercises 2048 distinct lookups and asserts both caches stay <= 1024. Refs #782 — hardening, not root-cause fix. SSE keepalive teardown (#833) and in-memory pending-interactions removal (#834) remain the most likely actual OOM fixes.
* fix(runs): add heartbeat-lost reaper for connector lanes
When a connector worker crashes, OOM-kills, or scales down mid-run the
`runs` row sits in `claimed`/`running` forever and the feed never gets a
retry. The legacy `checkStalledExecutions` cron caught some of this on a
5-minute cadence but only covered `sync` + `embed_backfill`, did a full
table scan, and had no cross-pod coordination.
`reapStaleRuns()` (packages/server/src/scheduled/check-stalled-executions.ts)
covers all four connector lanes (`sync`, `action`, `embed_backfill`,
`auth`), wraps the sweep in `pg_try_advisory_lock` for multi-pod safety,
and reads the threshold from `RUNS_REAPER_STALE_AFTER_SECONDS` (default
120s). It writes the failure as `error_message='worker_heartbeat_lost'`
and re-queues stalled `sync` runs so the feed self-heals.
Wired on a 30s `setInterval` from the gateway boot path (server.ts and
start-local.ts) with explicit teardown on SIGTERM. The legacy 5-minute
TaskScheduler cron stays as a backstop and now delegates to the same
function; the advisory lock keeps the two cadences from double-failing
rows. Watcher lane stays out of scope — it already has its own 2h sweep
in watchers/automation.ts.
Adds the missing partial index `idx_runs_heartbeat_inflight` on
`runs(last_heartbeat_at) WHERE status IN ('claimed','running') AND
run_type IN ('sync','action','embed_backfill','auth')` so the sweeper
query is index-only. Also sets `last_heartbeat_at = current_timestamp`
on the worker poll claim path so a freshly-claimed row has a sane
initial timestamp instead of relying on the first 30s worker heartbeat
to land.
Slice extracted from #615 — placement-tagging, organization-default-device,
and scale-from-zero remain out of scope and the bigger design is
unchanged by this PR.
* fix(schema): regenerate db/schema.sql after heartbeat index migration
* fix(migration): rename to 20260518010000 to avoid collision with #834 pending_interactions
* fix(schema): add 20260518010000 to schema_migrations after rebase
…interactions (#867) * fix(server): post-review cleanup of multi-tenant isolation + pending interactions Closes 6 issues surfaced by a 6-agent review of #834, #836, #848, and the #836 followup: 1. secret-proxy: fail-closed on agentOrgResolver DB error (was warn + fall-through with undefined expectedOrganizationId — a DB hiccup window let downstream org checks silently downgrade). 2. secret-proxy: legacy-mapping bypass closed. Pre-fix `lookupPlaceholder Mapping` skipped the org check when `mapping.organizationId` was unset (legacy rows pre-org-pivot); a worker from org B could resolve a legacy unscoped mapping owned by org A. Now: if the caller supplies `expectedOrganizationId`, the mapping must match it. Emit a WARN on every legacy unscoped access to plan the deprecation. 3. pending-interaction-store: drop `created_at = now()` from ON CONFLICT. Webhook retries no longer reset the 24h TTL clock, so a misbehaving retry loop cannot keep a row alive indefinitely. `claimed_at = NULL` reset is preserved so legitimate retries are still claimable. 4. egress-judge VerdictCache key already includes orgId (`cache.ts:30-43`) — no code change needed, documented for the reviewers. 5. interaction-bridge: drop the per-bridge `sweepStalePendingInteractions` setInterval. The global `coreServices.sweepEphemeralTables` (scheduled in `src/scheduled/jobs.ts`) already covers it; the per-bridge call was N-times-per-pod wasted DB work. The local sweep timer is retained but now only evicts the in-memory `pendingSent Messages` cache by TTL. 6. pending-interaction-store: cap `sweepStalePendingInteractions` at 1000 rows/call (configurable). An unbounded DELETE under a stale-row backlog could lock the table; remaining rows drain across subsequent 5-minute cycles. 7. pending-interaction-store: add `deletePendingQuestion(id, org, conn, user)` and use it in the post-failure drop path. Pre-fix the bridge called `claimPendingQuestion` (UPDATE setting claimed_at) to "drop" the row, leaving a phantom row sitting around until the 24h sweep. `deletePendingQuestion` carries the same four-field scoping as the claim path, so safety invariants are identical. Tests: - `multi-tenant-isolation-reproducers.test.ts` gets 3 new cases under `[finding 1]`: legacy-mapping bypass rejected, legacy-mapping warn path with no expected org, and SecretProxy.forward returns 503 when the resolver throws. - New `pending-interaction-cleanup.test.ts` covers retry-preserves- `created_at`, sweep LIMIT honoured + remainder drains, post-failure drop is a DELETE (not just claim), and `deletePendingQuestion` scoping invariant. - `secret-proxy.test.ts` updated to reflect the closed legacy bypass — the old "falls through (legacy)" case now expects a null return when the caller supplies an expected org; a new "no expected org" case documents the WARN path for un-org-scoped callers. Red→green proof captured per-fix in the PR body. * fix(server): address codex review on PR #867 Two issues flagged by codex review: 1. Provider credential lookup had an unscoped fail-open DB-error path. The PR's new 503 in `secret-proxy.ts` only covered the resolver-throws branch — when the worker token already carried `organizationId`, the resolver was skipped and downstream `authProfilesManager.getBestProfile` ran without org context. Inside `AuthProfilesManager.resolveAgentOrgId`, a DB error logged a warning and returned undefined, falling through to unscoped credential reads (`auth-profiles-manager.ts:251-275`). Now we wrap the credential lookup in `orgContext.run({organizationId: expectedOrganizationId}, ...)` so `AuthProfilesManager.listProfiles` short-circuits via `tryGetOrgId()` and never invokes its own resolver when we already know the org. A DB hiccup in the upstream resolver cannot downgrade scoping for these requests. 2. The post-failure cleanup test exercised `deletePendingQuestion()` directly but didn't drive `registerInteractionBridge`. A regression that swapped `deletePendingQuestion` back to `claimPendingQuestion` in the bridge would not have failed the test. Added an integration test that emits a `question:created` event, stubs the thread post to throw, and asserts the row is GONE from `pending_interactions` (count 0). Verified red→green by reverting the bridge to claim — the test times out at 5s because the row is never deleted. * chore(submodule): bump owletto to dcb2172 to clear drift check
Summary
Removes the in-process
Map<questionId, PendingQuestionEntry>ininteraction-bridge.tsthat pinned the chat interaction bridge to a single replica. Pending questions now live inpublic.pending_interactions;claimPendingQuestionis an atomicUPDATE … SET claimed_at = now() WHERE id = $1 AND claimed_at IS NULL RETURNING entry_payloadso a button click landing on pod B can claim a question registered on pod A. Slack webhook retries and racing clicks serialize on the row — the winner gets the payload, losers see null and no-op.The non-serializable platform
SentMessagehandle (used to edit the original card after a click) stays in a small per-pod cache. Losing it cross-pod only degrades card-edit UX; the answer still routes correctly through the worker.24h stale-row sweep piggybacks on the existing scheduled
sweep-ephemeral-tablestask (registered incore-services.ts).Scoped down from the original ask
LISTEN/NOTIFY(REQUIRED if time): deferred to a follow-up PR. A correct implementation needs payload-size handling (Postgres NOTIFY caps at 8 KB, agentoutputchunks can exceed that), which adds either chunking logic or a spill table — neither was a one-line change, so per the "we'd rather have two small PRs than one big one" rule I'm shipping P1 alone. The single-pod SSE behavior is unchanged and the bridge fix is the bigger HA wedge.Migration
Additive: one new table
pending_interactionswith(id text PK, organization_id text FK → organization(id) ON DELETE CASCADE, entry_payload jsonb, created_at, claimed_at), plus a partial index on the unclaimed predicate and one oncreated_atfor the sweep.db/schema.sqlupdated by hand to match.Test plan
make typecheckpasses (strict — matches the Dockerfile)make build-packagespassespending-interaction-store.test.tscovers atomic claim (second claim returns null), restash, and stale-row sweep against PGliteinteraction-bridge-action-handlers.test.tsandinteraction-bridge-slack-webhook.test.tsstill passquestion:createdposted on pod A, clicked on pod B, routes back through the workerSummary by CodeRabbit
New Features
Bug Fixes
Tests
Chores