fix(server): wrap connection-boot secret resolution in orgContext to fix Slack #881
Two Slack connections in prod (`slack-lobu-preview`, `1b91933131464c95`) were stuck in `status=error` since 2026-05-13 14:22 UTC with `Failed to resolve secret ref for connection X field "botToken"`. The underlying secret rows were intact in the right org; the failure was a transient boot-time issue caused by the encryption-key parser regression in #692 (a 43-char unpadded URL-safe base64 key was rejected for ~2 days until #735 added the urlsafe branch). But once #735 deployed, the connections did not self-heal: the boot loop only retried `status=active` rows and there was no other path that would flip them back.

Three changes to the boot path:

1. `initialize()` now retries `status=error` rows alongside `active` ones. Transient deploy-time failures self-heal on the next boot; on success the error marker is cleared (under the connection's own org context). `stopped` stays operator-driven.
2. The catch block wraps `connectionStore.updateConnection` in `orgContext.run(connection.organizationId, ...)`. Boot has no ALS org id and the postgres store's `saveConnection()` calls `getOrgId()` strict; without the wrap the error-marker write itself throws and the row stays `active`, hiding the failure.
3. `startInstance()` always rebinds to the connection's `organizationId` instead of only when the caller has no org bound. A caller's org that happens to differ from the connection's used to silently win and the secret lookup would miss the right tenant bucket. Webhooks resolved by team_id reach this path with whatever ALS context the public route leaves in place; rebinding unconditionally is the safe default.

Tests: `chat-instance-manager-boot.test.ts` drives three red→green cases against PGlite: recovery from a previous boot's `error` state, the error-marking branch persisting the status under the right org, and the cross-tenant rebind in `startInstance()`. All three failed before the fix and pass after.

Followup manual SQL flips the existing errored rows back to `active` on prod once the new image rolls (no migration: this is a data fix tied to an already-fixed transient regression, not a schema change).
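The first two boot-path changes can be sketched roughly as follows. This is a minimal illustration, not the actual `ChatInstanceManager` code: `InMemoryConnectionStore`, the `start` callback, and the shape of `Connection` are hypothetical stand-ins, and `orgContext` is assumed here to be a Node `AsyncLocalStorage` holding the org id.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

type Status = "active" | "error" | "stopped";

// Hypothetical row shape; the real connection record has more fields.
interface Connection {
  id: string;
  organizationId: string;
  status: Status;
  errorMessage?: string;
}

// Assumption: the org context is an AsyncLocalStorage carrying the org id.
const orgContext = new AsyncLocalStorage<string>();

// Strict lookup: throws when no org is bound, as the store is described to do.
function getOrgId(): string {
  const orgId = orgContext.getStore();
  if (orgId === undefined) throw new Error("no org bound in async context");
  return orgId;
}

// Hypothetical in-memory store whose writes are scoped to the bound org.
class InMemoryConnectionStore {
  constructor(private readonly rows: Map<string, Connection>) {}

  list(): Connection[] {
    return [...this.rows.values()];
  }

  get(id: string): Connection | undefined {
    return this.rows.get(id);
  }

  updateConnection(id: string, patch: Partial<Connection>): void {
    const orgId = getOrgId(); // throws at boot unless the caller wrapped the write
    const row = this.rows.get(id);
    if (!row || row.organizationId !== orgId) {
      throw new Error(`connection ${id} not found in org ${orgId}`);
    }
    this.rows.set(id, { ...row, ...patch });
  }
}

async function initialize(
  store: InMemoryConnectionStore,
  start: (connection: Connection) => Promise<void>,
): Promise<void> {
  // Retry rows a previous boot marked `error`; `stopped` stays operator-driven.
  const candidates = store
    .list()
    .filter((c) => c.status === "active" || c.status === "error");

  for (const connection of candidates) {
    try {
      await start(connection);
      if (connection.status === "error") {
        // Clear the error marker under the connection's own org.
        orgContext.run(connection.organizationId, () =>
          store.updateConnection(connection.id, {
            status: "active",
            errorMessage: undefined,
          }),
        );
      }
    } catch (err) {
      // Without this wrap the strict getOrgId() would throw and hide the failure.
      orgContext.run(connection.organizationId, () =>
        store.updateConnection(connection.id, {
          status: "error",
          errorMessage: String(err),
        }),
      );
    }
  }
}
```

The point of the sketch is that boot runs with no org bound, so every store write in the loop has to establish the connection's org itself rather than rely on the caller.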
Codecov Report: all modified and coverable lines are covered by tests.
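Symptom
Two Slack connections in prod (`slack-lobu-preview`, `1b91933131464c95`) were stuck in `status=error` since 2026-05-13 14:22 UTC with `Failed to resolve secret ref for connection X field "botToken"`.

Root cause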
Three layers:

1. Transient regression (already fixed): PR #692 (sec: hardening sweep) tightened `ENCRYPTION_KEY` parsing to require canonical base64 with a clean round-trip. The prod env's 43-char unpadded URL-safe base64 key was rejected and `getEncryptionKey()` threw. Every secret resolution at boot failed. The encryption parser was patched two days later in #735 (fix(core): accept URL-safe base64 in ENCRYPTION_KEY validator, prod hotfix), so decryption works again.
2. No self-healing. `ChatInstanceManager.initialize()` only retried `status=active` rows. The rows the #692 regression marked `error` were never retried after #735 deployed, so they stayed broken forever even though the underlying bug was gone.
3. Latent org-context gap. `startInstance()` only self-bound the connection's owning org when the caller's ALS had no org id set. A caller bound to another org (e.g. an admin in org B triggering a webhook flow that hits an org A connection) would silently miss the per-tenant secret-store predicate.

Fix
`packages/server/src/gateway/connections/chat-instance-manager.ts`:

- `initialize()` retries `status=error` rows. On success the error marker is cleared under the connection's own org context. `stopped` stays operator-driven.
- The catch block wraps the `updateConnection` error-marker write in `orgContext.run(connection.organizationId, ...)`. The postgres store's `saveConnection()` requires `getOrgId()` strict and would otherwise throw and leave the row `active`, hiding the failure.
- `startInstance()` always rebinds to `connection.organizationId` (when present), not just when the caller has no org bound. The caller's org no longer silently wins for a connection that lives in a different tenant.

Tests
`packages/server/src/gateway/__tests__/chat-instance-manager-boot.test.ts` (PGlite, no network):

- Recovers a connection left in `error` by a previous boot.
- Marks the connection `error` when secret resolution fails (verifies the org-context wrap by driving a real per-tenant write).
- `startInstance()` resolves the connection's org's secret even when the ALS caller is bound to a different org.

Red→green verified: all three failed before the fix and pass after.

Existing 97 tests in `chat-instance-manager-slack.test.ts` + `connections-platform-isolation.test.ts` + `secret-store.test.ts` still pass. `make typecheck && make build-packages` both clean.

Manual step after deploy
This won't auto-fix the existing errored rows because the new boot path only takes effect after the new image rolls. Two options:

A. Wait for the next pod restart after rollout. The new boot loop will retry both rows automatically (now that #735 fixed the encryption parser, secret resolution will succeed). Pod logs should show `"Recovered previously-errored connection"` for each id.

B. Force it immediately with manual SQL that flips the existing errored rows back to `active`.

No migration: this is a data fix tied to an already-fixed transient regression, not a schema change.
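Both options depend on the #735 parser fix being in place. As a rough illustration of why a 43-character unpadded URL-safe key fails a strict canonical round-trip yet decodes fine once normalised, consider the sketch below. `isCanonicalBase64` and `decodeKey` are hypothetical names, the strict check is only one plausible form of what #692 enforced, and the 32-byte key length is an assumption (43 unpadded base64 characters decode to 32 bytes).

```typescript
// One plausible strict check (assumption): the input must re-encode to itself
// as standard, padded base64.
function isCanonicalBase64(raw: string): boolean {
  return Buffer.from(raw, "base64").toString("base64") === raw;
}

// Hypothetical tolerant decoder: accepts standard or URL-safe alphabets,
// with or without padding, and still rejects input that does not round-trip.
function decodeKey(raw: string): Buffer {
  const trimmed = raw.trim();
  if (!/^[A-Za-z0-9+\/_-]+={0,2}$/.test(trimmed)) {
    throw new Error("key is not base64");
  }
  // Map the URL-safe alphabet onto the standard one and drop padding.
  const standard = trimmed
    .replace(/-/g, "+")
    .replace(/_/g, "/")
    .replace(/=+$/, "");
  const bytes = Buffer.from(standard, "base64");
  // Compare against the unpadded form so missing padding is not an error.
  if (bytes.toString("base64").replace(/=+$/, "") !== standard) {
    throw new Error("key does not round-trip");
  }
  if (bytes.length !== 32) {
    // Assumed key size for this sketch.
    throw new Error(`expected a 32-byte key, got ${bytes.length} bytes`);
  }
  return bytes;
}
```

Node's `"base64"` decoder already accepts the URL-safe alphabet, so in this sketch the strict check rejects the key only at the re-encode comparison, where `-`, `_`, and the missing `=` no longer match.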
Reproducer
The third test in the new file (`startInstance rebinds to the connection's org even when caller's org differs`) is the closest direct repro of the original symptom: with the previous `startInstance` logic that test fails with the exact prod error string `"Failed to resolve secret ref for connection X field \"botToken\""`. The first two tests pin the boot-loop guarantees (retry + error-mark-under-org-context).
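The cross-tenant case can be reproduced in miniature with the sketch below. It is an assumption-laden illustration, not the project's code: `resolveSecret`, the in-memory `secrets` map, and the two `startInstance*` variants are hypothetical, and `orgContext` is assumed to be a Node `AsyncLocalStorage` holding the org id.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Assumption: the org context is an AsyncLocalStorage carrying the org id.
const orgContext = new AsyncLocalStorage<string>();

interface Connection {
  id: string;
  organizationId?: string;
}

// Hypothetical per-tenant secret store: lookups are keyed by the bound org.
const secrets = new Map<string, string>([["org-a/botToken", "token-for-org-a"]]);

function resolveSecret(connectionId: string, field: string): string {
  const orgId = orgContext.getStore();
  const value = orgId === undefined ? undefined : secrets.get(`${orgId}/${field}`);
  if (value === undefined) {
    throw new Error(
      `Failed to resolve secret ref for connection ${connectionId} field "${field}"`,
    );
  }
  return value;
}

// Previous behaviour: bind the connection's org only if the caller has none.
function startInstanceOld(connection: Connection): string {
  const callerOrg = orgContext.getStore();
  if (callerOrg === undefined && connection.organizationId) {
    return orgContext.run(connection.organizationId, () =>
      resolveSecret(connection.id, "botToken"),
    );
  }
  return resolveSecret(connection.id, "botToken"); // caller's org silently wins
}

// New behaviour: always rebind to the connection's org when it is present.
function startInstanceNew(connection: Connection): string {
  if (connection.organizationId) {
    return orgContext.run(connection.organizationId, () =>
      resolveSecret(connection.id, "botToken"),
    );
  }
  return resolveSecret(connection.id, "botToken");
}
```

Called inside `orgContext.run("org-b", ...)` with a connection owned by `org-a`, the old variant throws the secret-resolution error while the new one returns the `org-a` token. Because `run` scopes the binding to its callback, the caller's own `org-b` context is untouched afterwards.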