Skip to content

fix(server): wrap connection-boot secret resolution in orgContext to fix Slack#881

Merged
buremba merged 1 commit into
mainfrom
fix/slack-secret-ref-resolution
May 18, 2026
Merged

fix(server): wrap connection-boot secret resolution in orgContext to fix Slack#881
buremba merged 1 commit into
mainfrom
fix/slack-secret-ref-resolution

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 18, 2026

Symptom

Two Slack connections in prod were stuck in status=error since 2026-05-13 14:22 UTC:

id=slack-lobu-preview status=error error_message="Startup failed: Failed to resolve secret ref for connection slack-lobu-preview field \"botToken\""
id=1b91933131464c95   status=error error_message="Startup failed: Failed to resolve secret ref for connection 1b91933131464c95 field \"botToken\""

Root cause

Three layers:

  1. Transient regression (already fixed): PR sec: hardening sweep — webhook sigs, SSRF, nix injection, secret-proxy, token revocation, egress timeout, sandbox, encryption-key #692 tightened ENCRYPTION_KEY parsing to require canonical base64 with a clean round-trip. The prod env's 43-char unpadded URL-safe base64 key was rejected and getEncryptionKey() threw. Every secret resolution at boot failed. The encryption parser was patched in fix(core): accept URL-safe base64 in ENCRYPTION_KEY validator (prod hotfix) #735 two days later — decryption works again.

  2. No self-healing. ChatInstanceManager.initialize() only retried status=active rows. The rows the sec: hardening sweep — webhook sigs, SSRF, nix injection, secret-proxy, token revocation, egress timeout, sandbox, encryption-key #692 regression marked error were never retried after fix(core): accept URL-safe base64 in ENCRYPTION_KEY validator (prod hotfix) #735 deployed, so they stayed broken forever even though the underlying bug was gone.

  3. Latent org-context gap. startInstance() only self-bound the connection's owning org when the caller's ALS had no org id set. A caller (e.g. an admin in org B triggering a webhook flow that hits an org A connection) would silently miss the per-tenant secret-store predicate.

Fix

packages/server/src/gateway/connections/chat-instance-manager.ts:

  • initialize() retries status=error rows. On success the error marker is cleared under the connection's own org context. stopped stays operator-driven.
  • The boot catch block wraps the updateConnection error-marker write in orgContext.run(connection.organizationId, ...) — the postgres store's saveConnection() requires getOrgId() strict and would otherwise throw and leave the row active, hiding the failure.
  • startInstance() always rebinds to connection.organizationId (when present), not just when the caller has no org bound. The caller's org no longer silently wins for a connection that lives in a different tenant.

Tests

packages/server/src/gateway/__tests__/chat-instance-manager-boot.test.ts (PGlite, no network):

  • Boot retries a previously-errored connection and clears the error marker on success.
  • The catch path correctly marks a row error when secret resolution fails (verifies the org-context wrap by driving a real per-tenant write).
  • startInstance() resolves the connection's org's secret even when the ALS caller is bound to a different org.

Red→green verified:

without fix: 1 pass, 2 fail
with fix:    3 pass, 0 fail

Existing 97 tests in chat-instance-manager-slack.test.ts + connections-platform-isolation.test.ts + secret-store.test.ts still pass.

make typecheck && make build-packages both clean.

Manual step after deploy

This won't auto-fix the existing errored rows because the new boot path only takes effect after the new image rolls. Two options:

A. Wait for the next pod restart after rollout — the new boot loop will retry both rows automatically (now that #735 fixed the encryption parser, secret resolution will succeed). Pod logs should show "Recovered previously-errored connection" for each id.

B. Force it immediately:

UPDATE agent_connections
SET status = 'active', error_message = NULL
WHERE platform = 'slack' AND status = 'error';

No migration — this is a data fix tied to an already-fixed transient regression, not a schema change.

Reproducer

The third test in the new file (startInstance rebinds to the connection's org even when caller's org differs) is the closest direct repro of the original symptom: with the previous startInstance logic that test fails with the exact prod error string "Failed to resolve secret ref for connection X field \"botToken\"". The first two tests pin the boot-loop guarantees (retry + error-mark-under-org-context).

Summary by CodeRabbit

  • Bug Fixes
    • Failed connections now retry during initialization and recover when the underlying issue is resolved.
    • Improved organization context handling during connection recovery for correct multi-tenant operations.

Review Change Stack

…fix Slack

Two Slack connections in prod (`slack-lobu-preview`, `1b91933131464c95`) were
stuck in `status=error` since 2026-05-13 14:22 UTC with `Failed to resolve
secret ref for connection X field "botToken"`. The underlying secret rows
were intact in the right org; the failure was a transient boot-time issue
caused by the encryption-key parser regression in #692 (a 43-char unpadded
URL-safe base64 key was rejected for ~2 days until #735 added the urlsafe
branch). But once #735 deployed, the connections did not self-heal — the
boot loop only retried `status=active` rows and there was no other path
that would flip them back.

Three changes to the boot path:

1. `initialize()` now retries `status=error` rows alongside `active` ones.
   Transient deploy-time failures self-heal on the next boot; on success
   the error marker is cleared (under the connection's own org context).
   `stopped` stays operator-driven.

2. The catch block wraps `connectionStore.updateConnection` in
   `orgContext.run(connection.organizationId, ...)`. Boot has no ALS org
   id and the postgres store's `saveConnection()` calls `getOrgId()`
   strict — without the wrap the error-marker write itself throws and the
   row stays `active`, hiding the failure.

3. `startInstance()` always rebinds to the connection's `organizationId`
   instead of only when the caller has no org bound. A caller's org that
   happens to differ from the connection's used to silently win and the
   secret lookup would miss the right tenant bucket. Webhooks resolved
   by team_id reach this path with whatever ALS context the public route
   leaves in place; rebinding unconditionally is the safe default.

Tests: `chat-instance-manager-boot.test.ts` drives three red→green cases
against PGlite — recovery from a previous boot's `error` state, the
error-marking branch persisting the status under the right org, and the
cross-tenant rebind in `startInstance()`. All three failed before the fix
and pass after.

Followup manual SQL flips the existing errored rows back to `active` on
prod once the new image rolls (no migration — this is a data fix tied
to an already-fixed transient regression, not a schema change).
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

This PR modifies ChatInstanceManager to retry previously errored connections during boot initialization, clear error state after successful recovery, and unconditionally rebind startInstance() to the connection's own organization context. It includes comprehensive regression test coverage validating boot recovery, error persistence under correct org scope, and cross-tenant org rebinding.

Changes

Boot Recovery and Org-Scoped Rebinding

Layer / File(s) Summary
Boot recovery and org-scoping implementation
packages/server/src/gateway/connections/chat-instance-manager.ts
Boot initialization now treats error rows as candidates alongside active, retries them via startInstance(), and clears error state back to active when recovery succeeds, using the connection's organizationId for org-scoped updates. startInstance() unconditionally rebinds to connection.organizationId when present, ensuring secret resolution and other operations execute in the correct tenant scope.
Test coverage for boot recovery and cross-tenant org rebinding
packages/server/src/gateway/__tests__/chat-instance-manager-boot.test.ts
Test suite with PGlite database setup and helpers validates three boot/startInstance guarantees: error-row retry and error-state clearing on success, error-state persistence under the connection's org context with the "Failed to resolve secret ref" sentinel, and cross-tenant org rebinding in startInstance() to prevent secret-resolution failures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Boot once threw errors and let them sleep,
But now retries and secrets to keep,
Each org's context binds just right,
Cross-tenant calls find the light,
Recovery clears the error night!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title concisely describes the main fix: wrapping connection-boot secret resolution in orgContext to resolve Slack connection issues. It is specific, clear, and accurately reflects the primary objective of the changeset.
Description check ✅ Passed The description is comprehensive, covering symptom, root cause, fix details, tests, and post-deploy steps. It addresses the required template sections (Summary, Test plan, Notes) and provides substantial context, though test plan checkboxes are not explicitly marked.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/slack-secret-ref-resolution

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@buremba buremba merged commit ad75eb9 into main May 18, 2026
21 of 23 checks passed
@buremba buremba deleted the fix/slack-secret-ref-resolution branch May 18, 2026 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants