fix(server): wrap connection-boot secret resolution in orgContext to fix Slack by buremba · Pull Request #881 · lobu-ai/lobu

buremba · 2026-05-18T16:02:20Z

Symptom

Two Slack connections in prod were stuck in status=error since 2026-05-13 14:22 UTC:

id=slack-lobu-preview status=error error_message="Startup failed: Failed to resolve secret ref for connection slack-lobu-preview field \"botToken\""
id=1b91933131464c95   status=error error_message="Startup failed: Failed to resolve secret ref for connection 1b91933131464c95 field \"botToken\""

Root cause

Three layers:

Transient regression (already fixed): PR sec: hardening sweep — webhook sigs, SSRF, nix injection, secret-proxy, token revocation, egress timeout, sandbox, encryption-key #692 tightened ENCRYPTION_KEY parsing to require canonical base64 with a clean round-trip. The prod env's 43-char unpadded URL-safe base64 key was rejected and getEncryptionKey() threw. Every secret resolution at boot failed. The encryption parser was patched in fix(core): accept URL-safe base64 in ENCRYPTION_KEY validator (prod hotfix) #735 two days later — decryption works again.
No self-healing. ChatInstanceManager.initialize() only retried status=active rows. The rows the sec: hardening sweep — webhook sigs, SSRF, nix injection, secret-proxy, token revocation, egress timeout, sandbox, encryption-key #692 regression marked error were never retried after fix(core): accept URL-safe base64 in ENCRYPTION_KEY validator (prod hotfix) #735 deployed, so they stayed broken forever even though the underlying bug was gone.
Latent org-context gap. startInstance() only self-bound the connection's owning org when the caller's ALS had no org id set. A caller (e.g. an admin in org B triggering a webhook flow that hits an org A connection) would silently miss the per-tenant secret-store predicate.

Fix

packages/server/src/gateway/connections/chat-instance-manager.ts:

initialize() retries status=error rows. On success the error marker is cleared under the connection's own org context. stopped stays operator-driven.
The boot catch block wraps the updateConnection error-marker write in orgContext.run(connection.organizationId, ...) — the postgres store's saveConnection() requires getOrgId() strict and would otherwise throw and leave the row active, hiding the failure.
startInstance() always rebinds to connection.organizationId (when present), not just when the caller has no org bound. The caller's org no longer silently wins for a connection that lives in a different tenant.

Tests

packages/server/src/gateway/__tests__/chat-instance-manager-boot.test.ts (PGlite, no network):

Boot retries a previously-errored connection and clears the error marker on success.
The catch path correctly marks a row error when secret resolution fails (verifies the org-context wrap by driving a real per-tenant write).
startInstance() resolves the connection's org's secret even when the ALS caller is bound to a different org.

Red→green verified:

without fix: 1 pass, 2 fail
with fix:    3 pass, 0 fail

Existing 97 tests in chat-instance-manager-slack.test.ts + connections-platform-isolation.test.ts + secret-store.test.ts still pass.

make typecheck && make build-packages both clean.

Manual step after deploy

This won't auto-fix the existing errored rows because the new boot path only takes effect after the new image rolls. Two options:

A. Wait for the next pod restart after rollout — the new boot loop will retry both rows automatically (now that #735 fixed the encryption parser, secret resolution will succeed). Pod logs should show "Recovered previously-errored connection" for each id.

B. Force it immediately:

UPDATE agent_connections
SET status = 'active', error_message = NULL
WHERE platform = 'slack' AND status = 'error';

No migration — this is a data fix tied to an already-fixed transient regression, not a schema change.

Reproducer

The third test in the new file (startInstance rebinds to the connection's org even when caller's org differs) is the closest direct repro of the original symptom: with the previous startInstance logic that test fails with the exact prod error string "Failed to resolve secret ref for connection X field \"botToken\"". The first two tests pin the boot-loop guarantees (retry + error-mark-under-org-context).

Summary by CodeRabbit

Bug Fixes
- Failed connections now retry during initialization and recover when the underlying issue is resolved.
- Improved organization context handling during connection recovery for correct multi-tenant operations.

…fix Slack Two Slack connections in prod (`slack-lobu-preview`, `1b91933131464c95`) were stuck in `status=error` since 2026-05-13 14:22 UTC with `Failed to resolve secret ref for connection X field "botToken"`. The underlying secret rows were intact in the right org; the failure was a transient boot-time issue caused by the encryption-key parser regression in #692 (a 43-char unpadded URL-safe base64 key was rejected for ~2 days until #735 added the urlsafe branch). But once #735 deployed, the connections did not self-heal — the boot loop only retried `status=active` rows and there was no other path that would flip them back. Three changes to the boot path: 1. `initialize()` now retries `status=error` rows alongside `active` ones. Transient deploy-time failures self-heal on the next boot; on success the error marker is cleared (under the connection's own org context). `stopped` stays operator-driven. 2. The catch block wraps `connectionStore.updateConnection` in `orgContext.run(connection.organizationId, ...)`. Boot has no ALS org id and the postgres store's `saveConnection()` calls `getOrgId()` strict — without the wrap the error-marker write itself throws and the row stays `active`, hiding the failure. 3. `startInstance()` always rebinds to the connection's `organizationId` instead of only when the caller has no org bound. A caller's org that happens to differ from the connection's used to silently win and the secret lookup would miss the right tenant bucket. Webhooks resolved by team_id reach this path with whatever ALS context the public route leaves in place; rebinding unconditionally is the safe default. Tests: `chat-instance-manager-boot.test.ts` drives three red→green cases against PGlite — recovery from a previous boot's `error` state, the error-marking branch persisting the status under the right org, and the cross-tenant rebind in `startInstance()`. All three failed before the fix and pass after. Followup manual SQL flips the existing errored rows back to `active` on prod once the new image rolls (no migration — this is a data fix tied to an already-fixed transient regression, not a schema change).

coderabbitai · 2026-05-18T16:02:35Z

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

This PR modifies ChatInstanceManager to retry previously errored connections during boot initialization, clear error state after successful recovery, and unconditionally rebind startInstance() to the connection's own organization context. It includes comprehensive regression test coverage validating boot recovery, error persistence under correct org scope, and cross-tenant org rebinding.

Changes

Boot Recovery and Org-Scoped Rebinding

Layer / File(s)	Summary
Boot recovery and org-scoping implementation `packages/server/src/gateway/connections/chat-instance-manager.ts`	Boot initialization now treats `error` rows as candidates alongside `active`, retries them via `startInstance()`, and clears error state back to `active` when recovery succeeds, using the connection's `organizationId` for org-scoped updates. `startInstance()` unconditionally rebinds to `connection.organizationId` when present, ensuring secret resolution and other operations execute in the correct tenant scope.
Test coverage for boot recovery and cross-tenant org rebinding `packages/server/src/gateway/__tests__/chat-instance-manager-boot.test.ts`	Test suite with PGlite database setup and helpers validates three boot/startInstance guarantees: error-row retry and error-state clearing on success, error-state persistence under the connection's org context with the `"Failed to resolve secret ref"` sentinel, and cross-tenant org rebinding in `startInstance()` to prevent secret-resolution failures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Boot once threw errors and let them sleep,
But now retries and secrets to keep,
Each org's context binds just right,
Cross-tenant calls find the light,
Recovery clears the error night!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title concisely describes the main fix: wrapping connection-boot secret resolution in orgContext to resolve Slack connection issues. It is specific, clear, and accurately reflects the primary objective of the changeset.
Description check	✅ Passed	The description is comprehensive, covering symptom, root cause, fix details, tests, and post-deploy steps. It addresses the required template sections (Summary, Test plan, Notes) and provides substantial context, though test plan checkboxes are not explicitly marked.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/slack-secret-ref-resolution

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov-commenter · 2026-05-18T16:04:20Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

buremba merged commit ad75eb9 into main May 18, 2026
21 of 23 checks passed

buremba deleted the fix/slack-secret-ref-resolution branch May 18, 2026 16:06

buremba mentioned this pull request May 18, 2026

chore(main): release lobu 7.2.0 #863

Merged

coderabbitai Bot mentioned this pull request May 27, 2026

fix(gateway): hydrate connection on per-connection webhook under multi-replica #1098

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(server): wrap connection-boot secret resolution in orgContext to fix Slack#881

fix(server): wrap connection-boot secret resolution in orgContext to fix Slack#881
buremba merged 1 commit into
mainfrom
fix/slack-secret-ref-resolution

buremba commented May 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

codecov-commenter commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

buremba commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Symptom

Root cause

Fix

Tests

Manual step after deploy

Reproducer

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

codecov-commenter commented May 18, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

buremba commented May 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading