Skip to content

refactor: single source of truth for agents, scope secrets per-org, webhook-only platforms#505

Closed
buremba wants to merge 9 commits into
mainfrom
claude/review-multitenant-setup-09Ow2
Closed

refactor: single source of truth for agents, scope secrets per-org, webhook-only platforms#505
buremba wants to merge 9 commits into
mainfrom
claude/review-multitenant-setup-09Ow2

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 3, 2026

Summary

Six commits cleaning up the multi-tenant setup so agents and secrets are properly scoped per organization, with a single delivery story for chat platforms.

Commits

60a7b21 — Auto-bootstrap admin PAT, retire LOBU_LOCAL_BOOTSTRAP

ensureBootstrapPat self-skips when the user table is non-empty. The env-flag gate is gone; production deployments are protected by the user-count guard. scripts/e2e-lobu-apply.sh updated.

7a751bd + 8646802 — Slack Socket Mode iterations (superseded by ab742bc)

Two intermediate commits that first documented and then made the Socket Mode bridge multi-tenant. Both are neutralized by the final commit on the branch — included in history because they're already pushed; net effect is removal.

26a9d04 — Remove gateway boot-time lobu.toml file-loader

The big one. Gateway no longer reads lobu.toml at boot — file-declared agents enter Postgres exclusively via lobu apply. One source of truth.

Deleted: gateway/config/file-loader.ts, 5 file-loader tests, entryFromFileLoadedAgent, the system/manifest owner sentinel, loadAgentConfigFromFiles, populateStoreFromFiles, reloadFromFiles, onReloadFromFiles, getFileLoadedAgents, isFileFirstMode, getProjectPath, /api/v1/reload dev route, file-loader-driven platform connection seeding.

Kept (still used by SDK-embedded mode via GatewayConfig.agents): DeclaredAgentRegistry, InMemoryAgentStore, entryFromAgentConfig, buildRegistryMap (signature simplified to take only configAgents).

Behavior change: gateway boot throws a clear error if neither host stores nor SDK config.agents are provided.

c75f748 — Scope agent_secrets by organization_id

The agent_secrets table had a global (name) PK, so two orgs storing the same logical secret (e.g. ANTHROPIC_API_KEY) silently overwrote each other.

  • Add organization_id column with PK (organization_id, name). Migration + embedded-schema patch for existing PGlite installs.
  • Empty-string organization_id is the legacy/global bucket — used by system env-store and any pre-existing rows.
  • PostgresSecretStore reads tryGetOrgId() from AsyncLocalStorage: per-org rows take precedence over global on read; writes go into the active org if context exists; deletes and lists scope to the active org.
  • The secret://<name> URI scheme is unchanged — same URI, different value per org context.

ab742bc — Drop Slack Socket Mode entirely, stick to Chat SDK webhooks

WebSocket transports are less reliable than webhooks (reconnect drops, no retry semantics) and Slack's own docs steer production toward Events API. Socket Mode existed only to save local dev users from running a tunnel.

  • Delete startSlackSocketMode, socketModeClients map, @slack/socket-mode from both owletto-backend and cli package.json.
  • Remove the disconnect-all loop from stopLobuGateway.
  • AGENTS.md: codify "one transport per platform — webhooks via the Chat SDK adapter; local dev for webhook-only platforms uses a tunnel."

Reverts the per-connection Socket Mode work in 7a751bd and 8646802.

Net effect

Read end-to-end, this PR:

  • Removes the LOBU_LOCAL_BOOTSTRAP env-flag layer.
  • Removes the gateway's boot-time lobu.toml reader and the in-memory parallel agent store wiring.
  • Scopes agent_secrets per organization.
  • Removes the entire Slack Socket Mode runtime dependency.
  • Codifies "one transport per platform" as project policy in AGENTS.md.

Lines net: ~750 removed, ~130 added.

Test plan

  • Repo-wide bun run typecheck passes.
  • declared-agent-registry tests pass (5/5).
  • lobu run on a fresh OWLETTO_DATA_DIR mints a bootstrap PAT without LOBU_LOCAL_BOOTSTRAP=true.
  • lobu run on a data dir with existing users no longer mints a "Local Developer" user.
  • scripts/e2e-lobu-apply.sh still passes end-to-end.
  • agent_secrets migration applies cleanly to an existing PGlite install (embedded patch path).
  • Setting ANTHROPIC_API_KEY in org A and a different value in org B no longer overwrites each other.
  • Slack Events API webhook delivery still works (regression check post Socket Mode removal).

DB-backed integration tests in this branch fail in the sandbox due to no reachable Postgres — they need a real DB to verify.

https://claude.ai/code/session_01GTYRB28aN7dKNiM5vcUaBp

claude added 2 commits May 3, 2026 21:16
…LOCAL_BOOTSTRAP

`ensureBootstrapPat` now self-skips when the user table is non-empty,
making the previous LOBU_LOCAL_BOOTSTRAP=true env-flag gate redundant.
Operators no longer need to opt in for first-run local dev: an empty
PGlite data dir mints a single admin PAT (printed once, scope unchanged),
while production deployments with provisioned users are protected by the
new user-count guard.

scripts/e2e-lobu-apply.sh stops setting the flag — the bootstrap PAT is
now produced automatically.
…R BY

The Socket Mode loop has always returned after the first connection with an
appToken (`break` at line 130), making the previous `LIMIT 10` redundant
without explaining why. Replace with `ORDER BY id` so which connection wins
is deterministic, and add a doc comment naming the single-client limitation
so multi-org operators reach for the Events API webhook transport.

Behavior unchanged.
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@github-actions github-actions Bot added the triage:needs-human Triage agent escalated for human review label May 3, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 3, 2026

Triage decision: needs-human

Reasons:

  • PR is blocked (mergeStateStatus: BLOCKED) due to failing status checks
  • check-drift workflow failed with FAILURE conclusion
  • integration workflow still in progress

Next: Human review required to resolve the blocking status checks. Check the failing check-drift workflow and wait for integration tests to complete.

claude added 3 commits May 3, 2026 21:35
The gateway no longer reads lobu.toml at boot. File-declared agents now
enter Postgres exclusively via \`lobu apply\` (CLI) — there is one source
of truth for agent state, addressed by \`organization_id\`.

Deleted:
- gateway/config/file-loader.ts
- 5 file-loader test files
- entryFromFileLoadedAgent helper
- CoreServices: loadAgentConfigFromFiles boot path,
  populateStoreFromFiles, reloadFromFiles, onReloadFromFiles,
  getFileLoadedAgents, isFileFirstMode, getProjectPath
- /api/v1/reload dev route
- gateway/cli/gateway.ts file-loader-driven platform connection seeding

Kept (still used by SDK-embedded mode via GatewayConfig.agents):
- DeclaredAgentRegistry
- InMemoryAgentStore (now only the SDK-embedded backing store)
- entryFromAgentConfig + buildRegistryMap (signature simplified to take
  only configAgents)

The system/manifest owner sentinel is gone — every agent in the gateway
now has a real owner and a real organization_id.

Bun typechecks clean repo-wide. Behavior change: gateway boot will now
throw a clear error if neither host stores nor SDK config.agents are
provided (previously it would have silently picked up lobu.toml).
Previously the agent_secrets table had a global \`(name)\` primary key, so
two organizations storing the same logical secret (e.g. ANTHROPIC_API_KEY)
would silently overwrite each other on conflict.

- Add \`organization_id\` column with PK changed to \`(organization_id, name)\`.
- New ondisk migration plus an embedded-schema patch so existing PGlite
  installs pick it up on next boot.
- Empty-string organization_id is the legacy/global bucket — used by
  system env-store and any pre-existing rows.
- PostgresSecretStore reads \`tryGetOrgId()\` from AsyncLocalStorage:
  per-org rows take precedence over the global bucket on read; writes go
  into the active org if context exists, else the global bucket; deletes
  and lists scope to the active org.
- The \`secret://<name>\` URI scheme is unchanged — same URI, different
  value per org context. Stored refs across the codebase don't need
  migration.
…irst

Previously a module-level \`socketModeClient\` singleton meant only the
first agent_connections row with an appToken got a WebSocket — every
other Slack workspace (especially across orgs) was silently dropped.

Replace with a Map<connectionId, SocketModeClient> populated at gateway
boot, and disconnect each client on graceful shutdown. Errors starting an
individual client no longer take down the rest.

Lifecycle hooks for connections added/removed after gateway boot are not
wired here — restart the gateway to pick up newly added Slack rows.
That hook is a follow-up; this PR closes the cross-org silent-drop bug.
@buremba buremba changed the title refactor: retire LOBU_LOCAL_BOOTSTRAP, document Slack singleton refactor: single source of truth for agents, scope secrets/Slack per-org May 3, 2026
…at SDK

The Socket Mode bridge in lobu/gateway.ts ran a WebSocket per Slack
connection, translating events into a synthetic POST against
ChatInstanceManager.handleSlackAppWebhook. It existed only to support
local dev without a public URL.

WebSocket-based delivery is meaningfully less reliable than webhooks
(reconnect drops, no retry semantics) and Slack's own docs steer
production toward Events API. Keeping Socket Mode meant a per-platform
runtime dependency (@slack/socket-mode) and a parallel transport story
that diverged from every other platform.

- Delete startSlackSocketMode + the socketModeClients map
- Drop @slack/socket-mode from packages/owletto-backend and packages/cli
- Remove the disconnect-all loop from stopLobuGateway
- AGENTS.md: codify "one transport per platform — webhooks via the Chat
  SDK adapter; local dev for webhook-only platforms uses a tunnel"

Reverts the work in 7a751bd (ORDER BY) and 8646802 (per-connection map)
that only existed to make Socket Mode multi-tenant.
@buremba buremba changed the title refactor: single source of truth for agents, scope secrets/Slack per-org refactor: single source of truth for agents, scope secrets per-org, webhook-only platforms May 3, 2026
buremba added 3 commits May 4, 2026 00:26
PostgresSecretStore now resolves the same secret:// ref to different
values per org context (via tryGetOrgId AsyncLocalStorage), but
SecretStoreRegistry was caching by ref alone. Org A populating the
cache would serve its plaintext to Org B for the 60s TTL window.

Cache key is now `${orgId}|${ref}`. put() invalidates the ref across
every org scope; delete() walks the cache and clears any org-prefixed
entry that matches the canonical name.

Adds a regression test using an OrgAwareStore fake that mirrors
PostgresSecretStore's per-org resolution.
TokenRefreshJob ran maybeRefresh() with no AsyncLocalStorage org scope,
so PostgresSecretStore.put() (called via AuthProfilesManager.upsertProfile)
silently routed rotated refresh tokens into the empty-string GLOBAL
bucket. Reads went to GLOBAL too — when a request later tried to use the
profile, it got the wrong (or no) credential.

Both call paths are affected:
- runOnce (scheduled every 30min via TaskScheduler)
- refreshForUserAgent (lazy at-use, spawned via the scheduler queue —
  the original request's org scope is lost by the time the task runs)

Fix:
- scanAllOAuth() INNER JOINs agents to yield (userId, agentId, orgId).
- runOnce wraps each maybeRefresh in orgContext.run({organizationId}).
- refreshForUserAgent looks up the agent's org and wraps before
  delegating; logs and skips when the agent row is gone (deleted).

Tests updated: scanAllOAuth fixture seeds backing agents rows, and a
new test covers the orphaned-profile skip path.
…igration

- PostgresSecretStore: warn loudly when put()/delete() runs without an
  org context. System env-store reaches here intentionally; tenant code
  paths reaching here is a bug — the rotated secret would land in the
  cross-tenant GLOBAL bucket and shadow every org's read of the same
  name. Loud logs make the regression visible in observability.
- start-local.ts: refuse to mint a bootstrap PAT unless dbUrl resolves
  to a loopback host (127.0.0.1 / localhost / ::1). The user-count
  guard is the second layer; this catches the case where someone
  reuses ensureBootstrapPat against a fresh prod DB before the first
  signup lands.
- agent_secrets_org_scope migration: add a ROLLOUT NOTE explaining
  the non-atomic primary-key swap is safe for Lobu's single-process
  atomic-swap deployment but unsafe under rolling-deploy supervisors.
- lobu/gateway.ts: drop unused getDb import left over from Socket Mode
  removal.
@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 3, 2026

Pushed three fixup commits addressing the pi-cli review (#505 — earlier session findings):

  • 3e8feb68 fix(secrets): SecretStoreRegistry cache org-aware — cache key is now ${orgId}|${ref}. Two orgs resolving the same secret:// ref no longer share a cached value. New regression test (secret-store.test.ts) uses an OrgAwareStore fake that mirrors PostgresSecretStore's per-org resolution.
  • c7225230 fix(token-refresh): establish org context before refreshing OAuth tokensscanAllOAuth() now INNER JOINs agents to yield (userId, agentId, orgId); runOnce and refreshForUserAgent wrap each refresh in orgContext.run({organizationId}, ...) so PostgresSecretStore.put() lands in the correct tenant bucket. refreshForUserAgent looks up the agent's org and skips when the row is gone (deleted agent). Test fixture seeds backing agents rows; new test for the orphan-skip path.
  • 93ef406b chore: harden boundariesPostgresSecretStore.put/delete now warn loudly when no org context is present (system env-store reaches here intentionally; tenant code paths are bugs). start-local.ts refuses to mint a bootstrap PAT unless dbUrl is loopback (defense against future ensureBootstrapPat reuse against a real DB before first signup). Migration 20260503000000_agent_secrets_org_scope.sql carries a rollout note explaining the non-atomic PK swap is safe for Lobu's single-process atomic-swap deployment but unsafe under rolling-deploy supervisors. Also drops the unused getDb import in lobu/gateway.ts left over from the Socket Mode removal.

Verification:

  • bun run typecheck clean against the per-package packages/owletto-backend/tsconfig.json (which has noUnusedLocals: true — the root tsconfig excludes this package, so the prior PR check missed the unused import).
  • bun test src/gateway/__tests__/ — 489 pass, 0 fail (489 / 1140 expects across 62 files).

Reviewers: the architectural direction of #505 is right (PG single source of truth + org-scoped secrets + webhook-only platforms); the issues fixed here were correctness gaps that would have shown up as cross-tenant data leaks once OAuth refresh started rotating tokens at scale.

@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 4, 2026

Superseded by #516 — original branch was based pre-rename (#499), so a clean rebase wasn't viable. Cherry-picked the 7 substantive commits onto current main with paths rewritten to the renamed dirs. Skipped the two superseded Slack Socket Mode iterations (7a751bd and 8646802).

@buremba buremba closed this May 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

triage:needs-human Triage agent escalated for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants