
fix(gateway): Owletto MCP sync, Slack markdown/threading, tool-approval lifecycle, deployment coalescing #210

Merged: buremba merged 2 commits into main from refactor/auth-split-continuation on Apr 18, 2026

@buremba
Member

@buremba buremba commented Apr 18, 2026

Bundle of follow-up hardening on top of #207 (declared vs user auth split). Each theme is independent and fixes a concrete user-facing bug; kept as one PR because they share the same test surface and landed together during hardening.


Theme A — Owletto MCP global server sync

Files: services/core-services.ts, auth/mcp/config-service.ts

  • Mirror the resolved MEMORY_URL env var into McpConfigService as a global owletto streamable-http server at startup AND on lobu.toml reload. Without this, /mcp/owletto proxy lookups via getHttpServer("owletto") return undefined and the worker-side plugin fails with "MCP server 'owletto' not found".
  • New upsertGlobalServer() — replace-on-write (vs existing registerGlobalServers() which is register-once).
  • Per-agent MCP lookup: getSettings → getEffectiveSettings so MCP servers declared in lobu.toml (via the declared-agent registry from #207, "refactor(gateway): split declared agent config from user-scoped auth state") actually resolve.
  • Intentionally omit oauth: {} on the owletto entry: worker-side owletto_login owns the device-code flow; adding oauth would trigger a parallel gateway OAuth/PKCE discovery and surface two competing login links.
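
The difference between register-once and replace-on-write can be sketched as below. This is a minimal illustration, not the real McpConfigService: the `GlobalServerRegistry` class, the `HttpServerConfig` shape, and the URLs are assumptions made for the example; only the method names come from the description above.

```typescript
// Hypothetical config shape; the real one is not shown in this PR description.
interface HttpServerConfig {
  type: "streamable-http";
  url: string;
}

// Hypothetical stand-in for the global-server part of McpConfigService.
class GlobalServerRegistry {
  private servers = new Map<string, HttpServerConfig>();

  // Register-once: an entry that already exists wins; later calls are ignored.
  registerGlobalServers(entries: Record<string, HttpServerConfig>): void {
    for (const id of Object.keys(entries)) {
      if (!this.servers.has(id)) this.servers.set(id, entries[id]);
    }
  }

  // Replace-on-write: the latest config always wins (e.g. after a lobu.toml reload).
  upsertGlobalServer(id: string, config: HttpServerConfig): void {
    this.servers.set(id, config);
  }

  getHttpServer(id: string): HttpServerConfig | undefined {
    return this.servers.get(id);
  }
}

const registry = new GlobalServerRegistry();
// Startup: mirror the resolved MEMORY_URL (placeholder URL here).
registry.upsertGlobalServer("owletto", { type: "streamable-http", url: "http://memory.local/v1" });
// A register-once call cannot change the existing entry...
registry.registerGlobalServers({ owletto: { type: "streamable-http", url: "http://stale.local" } });
// ...but a reload that upserts does. Note that no `oauth` field is set on the entry.
registry.upsertGlobalServer("owletto", { type: "streamable-http", url: "http://memory.local/v2" });
```

With register-once semantics alone, a reload that changes MEMORY_URL would leave the proxy pointing at the old URL, which is presumably why the upsert variant was added.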

Theme B — Tool-approval lifecycle fixes

Files: auth/mcp/proxy.ts, cli/gateway.ts, connections/interaction-bridge.ts

  • Bump PENDING_TOOL_TTL 5 min → 24 h (proxy + in-memory card timer in interaction-bridge). Tool-approval cards sit in Slack for hours; 5 min silently expired real clicks.
  • cli/gateway.ts approveToolCall: switch GET-then-DEL to atomic GETDEL so a double-POST to /api/v1/agents/approve can't double-execute the tool. Matches the interaction-bridge path.
  • Capture channelId / conversationId / teamId / connectionId on the pending-tool Redis record so approval results route to the conversation the tool call originated from, not the thread of the clicked card (which can differ on Slack).
  • Expired-card UX: when GETDEL returns null but the in-memory approval card is still tracked, edit the card and post a "this approval expired" notice instead of silently no-op'ing. Real retries still no-op silently (no tracked card).
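
The approval flow above can be sketched as follows. This is a simplified, synchronous illustration: `FakeRedis` is an in-memory stand-in whose `getdel` mimics the read-and-delete behaviour of Redis GETDEL, and the `pending-tool:` key prefix, the `ApprovalOutcome` type, and the function signature are assumptions for the example rather than the gateway's actual code.

```typescript
// In-memory stand-in for Redis. getdel reads and deletes in one step, like GETDEL.
class FakeRedis {
  private data = new Map<string, string>();

  set(key: string, value: string): void {
    this.data.set(key, value);
  }

  getdel(key: string): string | null {
    const value = this.data.get(key);
    if (value === undefined) return null;
    this.data.delete(key);
    return value;
  }
}

type ApprovalOutcome = "executed" | "expired-notice" | "ignored";

// Hypothetical approval handler: only the caller that wins the GETDEL executes the tool.
function approveToolCall(
  redis: FakeRedis,
  trackedCards: Set<string>,
  requestId: string,
  execute: (record: string) => void,
): ApprovalOutcome {
  const record = redis.getdel(`pending-tool:${requestId}`);
  if (record !== null) {
    trackedCards.delete(requestId);
    execute(record);
    return "executed";
  }
  // No record: if a card is still tracked the approval expired, otherwise it is a retry.
  return trackedCards.delete(requestId) ? "expired-notice" : "ignored";
}

const redis = new FakeRedis();
const cards = new Set<string>(["req-1", "req-2"]);
let executions = 0;
const countExecution = (): void => {
  executions += 1;
};

// The record carries the originating conversation, so results can route back to it.
redis.set("pending-tool:req-1", JSON.stringify({ channelId: "C1", conversationId: "T1" }));
const firstClick = approveToolCall(redis, cards, "req-1", countExecution);
const doublePost = approveToolCall(redis, cards, "req-1", countExecution);
const expiredClick = approveToolCall(redis, cards, "req-2", countExecution);
```

With a separate GET followed by DEL, two concurrent requests could both read the record before either deletes it; a single read-and-delete closes that window.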

Theme C — Slack markdown rendering

Files: connections/chat-response-bridge.ts (+229 LOC)

  • Skip the SDK streaming path for Slack; buffer deltas, then at completion post a single chunked chat.postMessage with markdown_text. Fixes (a) mid-line splits from the Slack streaming API's fixed-size chunking and (b) < / > / & HTML-escaping that the text field inflicts on code snippets.
  • Split on paragraph boundaries (\n\n), then line boundaries, then hard-slice, with an 11,000-char per-post margin under Slack's 12k limit.
  • Pre-decode HTML entities (worker may have produced them upstream from MCP tool results) and strip empty markdown links [text]() (some tools emit these as unlinked citation footnotes).
  • Silent-drop fix: resolve the SDK fallback target during delta setup so a missing Slack adapter at completion still delivers via target.post() instead of silently dropping.
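
A minimal sketch of the three-level splitting strategy described above (paragraphs, then lines, then a hard slice). The function names and the greedy packing are assumptions made for illustration; only the separator order and the 11,000-character margin come from the description.

```typescript
// Per-post margin from the PR description (Slack's limit is stated as 12k).
const SLACK_POST_MARGIN = 11000;

// Split `piece` so every chunk is at most `limit` characters. Separators are tried
// in order; when none are left, the piece is hard-sliced at fixed offsets.
function splitWithFallback(piece: string, separators: string[], limit: number): string[] {
  if (piece.length <= limit) return [piece];
  if (separators.length === 0) {
    const slices: string[] = [];
    for (let i = 0; i < piece.length; i += limit) slices.push(piece.slice(i, i + limit));
    return slices;
  }
  const [sep, ...rest] = separators;
  const chunks: string[] = [];
  let current: string | null = null;
  for (const part of piece.split(sep)) {
    // Greedily pack parts into the current chunk while it still fits.
    const candidate: string = current === null ? part : current + sep + part;
    if (candidate.length <= limit) {
      current = candidate;
      continue;
    }
    if (current !== null) chunks.push(current);
    // The part may itself be too long: split it with the next, finer separator.
    const sub = splitWithFallback(part, rest, limit);
    chunks.push(...sub.slice(0, -1));
    current = sub[sub.length - 1];
  }
  if (current !== null) chunks.push(current);
  return chunks;
}

// Paragraph boundaries first, then line boundaries, then hard slices.
function chunkSlackMarkdown(text: string, limit: number = SLACK_POST_MARGIN): string[] {
  return splitWithFallback(text, ["\n\n", "\n"], limit);
}
```

Each returned chunk would then be sent as its own `chat.postMessage` call with `markdown_text`; that part is omitted here because it depends on the Slack client in use.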

Theme D — Slack session behavior

Files: connections/{slack-instruction-provider.ts (new), chat-instance-manager.ts, message-handler-bridge.ts, conversation-state-store.ts}, gateway-main.ts

  • New SlackInstructionProvider contributes Slack bot identity (@username / U0123ABC) through the InstructionProvider interface so the model knows its own Slack handle and doesn't treat its own ID as a stranger. Auto-registered for Slack platforms; gateway-main.ts now eagerly registers providers for platforms added after start().
  • chat-instance-manager: mirror adapter.botUserId onto connection.metadata alongside botUsername so mention-strip and the instruction provider can find it.
  • message-handler-bridge: strip <@Uxxx> and bare @Uxxx bot-mention tokens Slack delivers raw. Collapse to canonical thread.id for DM thread replies (previously fell back to channelId, so replies to bot messages inside a DM thread landed in the main DM pane).
  • Thread-history backfill: on first mention in a group thread, fetch up to 50 prior messages via adapter.fetchMessages (Slack → conversations.replies) and append to conversation history, so mid-thread mentions have context. Atomic one-shot-per-thread guard via new ConversationStateStore.claimThreadBackfill / releaseThreadBackfill (Redis SETNX + 24h TTL, release on failure so transient errors can retry).
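
Two of the behaviours above, mention stripping and the one-shot backfill guard, can be sketched as below. This is an illustration under stated assumptions: `InMemoryBackfillStore` replaces the Redis SETNX-based store (and ignores the 24h TTL), `fetchMessages` is a hypothetical callback standing in for `adapter.fetchMessages`, and swallowing the fetch error after releasing the claim is a choice made for the example.

```typescript
// Strip `<@U...>` and bare `@U...` mentions of the bot. Assumes the ID is purely
// alphanumeric, so it can be embedded in a RegExp without escaping.
function stripBotMention(text: string, botUserId: string): string {
  const mention = new RegExp(`<@${botUserId}>|@${botUserId}\\b`, "g");
  return text.replace(mention, "").replace(/ {2,}/g, " ").trim();
}

interface BackfillStore {
  claimThreadBackfill(threadId: string): Promise<boolean>;
  releaseThreadBackfill(threadId: string): Promise<void>;
}

// In-memory stand-in for the Redis-backed store: the first claim wins, like SETNX.
class InMemoryBackfillStore implements BackfillStore {
  private claimed = new Set<string>();

  async claimThreadBackfill(threadId: string): Promise<boolean> {
    if (this.claimed.has(threadId)) return false;
    this.claimed.add(threadId);
    return true;
  }

  async releaseThreadBackfill(threadId: string): Promise<void> {
    this.claimed.delete(threadId);
  }
}

// Fetch prior thread messages at most once per thread. On failure the claim is
// released so a later mention can retry.
async function backfillOnce(
  store: BackfillStore,
  threadId: string,
  fetchMessages: (threadId: string, limit: number) => Promise<string[]>,
): Promise<string[]> {
  if (!(await store.claimThreadBackfill(threadId))) return [];
  try {
    return await fetchMessages(threadId, 50);
  } catch {
    await store.releaseThreadBackfill(threadId);
    return []; // assumption: the current turn proceeds without history
  }
}
```

For example, `stripBotMention("<@U0123ABC> hello @U0123ABC there", "U0123ABC")` yields `"hello there"`.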

Theme E — Deployment ensure coalescing

Files: orchestration/base-deployment-manager.ts, orchestration/impl/{docker,embedded,k8s}-deployment.ts

  • BaseDeploymentManager: rename abstract createDeployment → spawnDeployment (protected) and add public ensureDeployment wrapper with an in-flight Promise map. Concurrent calls for the same deploymentName within one gateway process share a single spawn; cross-process concurrency is handled by the orchestrator's atomic name uniqueness (Docker 409, K8s 409, embedded workers.has check).
  • docker-deployment, k8s/deployment: treat 409 AlreadyExists as benign success — inspect/start existing container for Docker, no-op for K8s.
  • embedded-deployment: short-circuit on workers.has(deploymentName).
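
The in-flight Promise map can be sketched as follows. The class and method names mirror the description above, but the bodies are a simplified assumption: the real spawnDeployment takes more arguments, and `EmbeddedLikeManager` is a hypothetical subclass that only counts spawns.

```typescript
abstract class BaseDeploymentManager {
  // One pending spawn per deployment name, shared by concurrent callers.
  private inFlight = new Map<string, Promise<void>>();

  protected abstract spawnDeployment(deploymentName: string): Promise<void>;

  ensureDeployment(deploymentName: string): Promise<void> {
    const existing = this.inFlight.get(deploymentName);
    if (existing) return existing;

    const pending = this.spawnDeployment(deploymentName);
    this.inFlight.set(deploymentName, pending);
    // Clear the entry on success and on failure so later calls can spawn again.
    const clear = (): void => {
      this.inFlight.delete(deploymentName);
    };
    pending.then(clear, clear);
    return pending;
  }
}

// Hypothetical embedded-style manager: short-circuits when the worker already exists.
class EmbeddedLikeManager extends BaseDeploymentManager {
  spawnCount = 0;
  private workers = new Set<string>();

  protected async spawnDeployment(deploymentName: string): Promise<void> {
    if (this.workers.has(deploymentName)) return;
    this.spawnCount += 1;
    await Promise.resolve(); // stand-in for the real asynchronous spawn work
    this.workers.add(deploymentName);
  }
}
```

The map only coalesces calls inside one process. Across processes, the description relies on the orchestrator rejecting duplicate names, which is why a 409 is treated as success rather than an error.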

Smaller items (not a theme, bundled for proximity)

  • config/index.ts loadEnvFile: pass override: true to dotenv so .env wins over stale shell exports. Matches the single-source-of-truth contract introduced by #209 ("chore(dev): simplify compose DEPLOYMENT_MODE and opt-in OpenAI for careops"; compose no longer re-exports DEPLOYMENT_MODE). Production (NODE_ENV=production) skips loadEnvFile entirely, so prod deploys are unaffected.
  • config/index.ts isPluginInstalled: walk up parent directories for the node_modules/<pkg> folder, falling back from require.resolve(). Fixes detection of ESM-only packages whose exports map omits a require condition (e.g. @lobu/owletto-openclaw).
  • routes/public/connections.ts: bump webhook-processing log debug → info so Slack/Telegram inbound traffic is visible in prod without flipping LOG_LEVEL.
  • worker/src/openclaw/worker.ts (-62 LOC): drop the proactive owletto_login call at worker startup. Now that the gateway registers the owletto MCP server correctly (Theme A), the model can call owletto_login lazily when a memory tool first returns an auth error — more natural than pushing an unsolicited login link at every session start.
  • worker/src/gateway/sse-client.ts: signature updates matching the gateway-side changes above.
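
The isPluginInstalled fallback can be sketched as below. This is an assumption-laden illustration, not the code in config/index.ts: `findPackageDir`, the injectable `exists` parameter, and the use of `createRequire` (in place of a bare `require.resolve`) are choices made so the example runs in both CommonJS and ESM contexts.

```typescript
import * as fs from "fs";
import * as path from "path";
import { createRequire } from "module";

// Walk from startDir up to the filesystem root looking for node_modules/<pkg>.
// `exists` is injectable so the walk can be exercised without touching the disk.
function findPackageDir(
  pkg: string,
  startDir: string,
  exists: (p: string) => boolean = fs.existsSync,
): string | null {
  let dir = path.resolve(startDir);
  for (;;) {
    const candidate = path.join(dir, "node_modules", pkg);
    if (exists(candidate)) return candidate;
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the filesystem root
    dir = parent;
  }
}

// Try normal resolution first; fall back to the directory walk when it throws,
// e.g. for ESM-only packages whose `exports` map has no `require` condition.
function isPluginInstalled(pkg: string, startDir: string = process.cwd()): boolean {
  try {
    createRequire(path.join(path.resolve(startDir), "noop.js")).resolve(pkg);
    return true;
  } catch {
    return findPackageDir(pkg, startDir) !== null;
  }
}
```

The walk only proves the folder exists, not that the package loads, which seems sufficient for an "is it installed" check.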

Tests (+~1,000 LOC)

Expanded coverage:

  • base-deployment-grants: ensureDeployment coalescing, spawnDeployment rename.
  • chat-response-bridge: Slack markdown buffer-and-post path.
  • docker-deployment: 409 handling.
  • embedded-deployment: workers.has short-circuit.
  • interaction-bridge-action-handlers: expired-card UX.
  • mcp-config-service: upsertGlobalServer semantics.
  • message-handler-bridge: backfill, mention strip, DM-thread conversationId.

Test plan

  • make build-packages passes (typecheck clean locally).
  • bun test packages/gateway/src green.
  • make dev with a careops agent configured via lobu.toml — agent loads, Owletto MCP calls reach memory (no "MCP server 'owletto' not found").
  • Slack end-to-end: <@Uxxx> mention → bot responds in-thread with markdown rendering (no HTML entities visible for code blocks), replies to bot DM messages stay in the DM thread, tool-approval card expired after 24h+ shows the expiry notice.
  • Confirm single Owletto login link surfaces on first memory call (not two competing device-code + PKCE flows).
  • Concurrent-request stress: fire N parallel messages for the same thread — only one worker deployment is spawned.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0c72efe80

Comment on lines +503 to +511
// No per-message user equality check: deployment names are derived from
// the canonical conversation key (`platform:channelId:conversationId`)
// *without* userId, so a thread is intentionally shared by every user
// who posts in it. The SSE queue (`thread_message_<deployment>`) is the
// routing boundary — only messages for this deployment reach this
// worker. The per-message `data.userId` is the actual sender of *this*
// turn and is propagated downstream via `payloadToWorkerConfig`
// (response addressing, MCP namespacing, audit). `this.userId` is the
// user that originally spawned the deployment, kept for audit only.

P1: Reinstate user isolation before queuing thread messages

Dropping the per-message userId check here allows any participant in a shared thread to execute turns inside a worker that is still authenticated with the deployment-scoped WORKER_TOKEN created for the original user. In that scenario, tool approvals/execution and MCP auth context are attributed to the wrong user, which can expose one user’s grants or integrations to another user in the same thread. The new comment says data.userId is enough for namespacing, but gateway auth for MCP routes is still derived from the worker token, not this payload field.

Comment on lines +345 to +352
stream = {
  iterator: new AsyncPushIterator<string>(),
  streamPromise: Promise.resolve(),
  buffer: payload.delta,
  streamFailed: true, // Force completion to use the post-buffer path
  wasFullyReplaced: !!payload.isFullReplacement,
  target: null,
};

P2: Preserve a Slack post fallback target for completion errors

In the Slack-specific delta path, stream.target is always initialized to null, but completion tries to fall back to stream.target.post(...) when postSlackMarkdown is unavailable or throws. That makes the fallback path unreachable and can silently drop Slack responses whenever the adapter client is missing/uninitialized or markdown posting fails. Populate target (or resolve one in completion) so the documented fallback actually works.

@buremba buremba force-pushed the refactor/auth-split-continuation branch from b0c72ef to 83a780a Compare April 18, 2026 03:10
@buremba buremba changed the title from "refactor(gateway): follow-ups to declared vs user auth split" to "fix(gateway): Owletto MCP sync, Slack markdown/threading, tool-approval lifecycle, deployment coalescing" Apr 18, 2026
…al lifecycle, deployment coalescing

Bundled follow-up work on top of 283c66a (declared vs user auth split).
Each theme is independent and fixes a concrete user-facing bug; kept as
one PR because they share the same test surface and landed together
during hardening.

## Owletto MCP global server sync
- core-services: mirror the resolved `MEMORY_URL` env var into
  `McpConfigService` as a global `owletto` streamable-http server at
  startup AND on lobu.toml reload. Without this, `/mcp/owletto` proxy
  lookups via `getHttpServer("owletto")` return undefined and the
  worker-side plugin fails with "MCP server 'owletto' not found".
- config-service: new `upsertGlobalServer()` (replace-on-write, vs the
  existing `registerGlobalServers()` which is register-once). Also
  switch the per-agent MCP lookup from `getSettings` → `getEffectiveSettings`
  so MCP servers declared in lobu.toml (via the declared-agent registry
  introduced in 283c66a) actually resolve.
- Intentionally omit `oauth: {}` on the owletto entry: worker-side
  `owletto_login` owns the device-code flow; adding `oauth` would trigger
  a parallel gateway OAuth/PKCE discovery and surface two competing login
  links.

## Tool-approval lifecycle fixes
- Bump `PENDING_TOOL_TTL` 5 min → 24 h (proxy.ts + in-memory card timer
  in interaction-bridge.ts). Tool-approval cards sit in Slack for hours;
  5 min silently expired real clicks.
- cli/gateway.ts `approveToolCall`: switch GET-then-DEL to atomic GETDEL
  so a double-POST to /api/v1/agents/approve can't double-execute the
  tool. Matches the interaction-bridge path.
- Capture `channelId / conversationId / teamId / connectionId` on the
  pending-tool Redis record so approval results route to the
  conversation the tool call originated from, not the thread of the
  clicked card (which can differ on Slack).
- Expired-card UX: when GETDEL returns null but the in-memory approval
  card is still tracked, edit the card and post a "this approval
  expired" notice instead of silently no-op'ing. Real retries still
  no-op silently (no tracked card).

## Slack markdown rendering (chat-response-bridge.ts)
- Skip the SDK streaming path for Slack; buffer deltas, then at
  completion post a single chunked `chat.postMessage` with
  `markdown_text`. Fixes (a) mid-line splits from the Slack streaming
  API's fixed-size chunking and (b) `<`/`>`/`&` HTML-escaping that the
  `text` field inflicts on code snippets.
- Split on paragraph boundaries (`\n\n`), then line boundaries, then
  hard-slice, with an 11,000-char per-post margin under Slack's 12k
  limit.
- Pre-decode HTML entities (worker may have produced them upstream from
  MCP tool results) and strip empty markdown links `[text]()` (some
  tools emit these as unlinked citation footnotes).
- Resolve the SDK fallback target during delta setup so a missing Slack
  adapter at completion still delivers via `target.post()` instead of
  silently dropping.

## Slack session behavior
- New `SlackInstructionProvider` contributes Slack bot identity
  (`@username` / `U0123ABC`) through the `InstructionProvider` interface
  so the model knows its own Slack handle and doesn't treat its own ID
  as a stranger. Auto-registered for Slack platforms; `gateway-main.ts`
  now eagerly registers providers for platforms added after `start()`.
- chat-instance-manager: mirror `adapter.botUserId` onto
  `connection.metadata` alongside `botUsername` so the mention strip
  and the instruction provider can find it.
- message-handler-bridge: strip `<@Uxxx>` and bare `@Uxxx` bot-mention
  tokens Slack delivers raw. Collapse to canonical `thread.id` for DM
  thread replies (previously fell back to channelId, so replies to bot
  messages inside a DM thread landed in the main DM pane).
- Thread-history backfill: on first mention in a group thread, fetch up
  to 50 prior messages via `adapter.fetchMessages` (Slack →
  conversations.replies) and append to conversation history, so
  mid-thread mentions have context. Atomic one-shot-per-thread guard
  via new `ConversationStateStore.claimThreadBackfill` /
  `releaseThreadBackfill` (Redis SETNX + 24h TTL, release on failure
  so transient errors can retry).

## Deployment ensure coalescing
- `BaseDeploymentManager`: rename abstract `createDeployment` →
  `spawnDeployment` (protected) and add a public `ensureDeployment`
  wrapper with an in-flight Promise map. Concurrent calls for the same
  `deploymentName` within one gateway process share a single spawn;
  cross-process concurrency is handled by the orchestrator's atomic
  name uniqueness (Docker 409, K8s 409, embedded workers.has check).
- docker-deployment, k8s/deployment: treat 409 AlreadyExists as benign
  success: inspect/start the existing container for Docker, no-op for K8s.
- embedded-deployment: short-circuit on `workers.has(deploymentName)`.

## Smaller items
- config/index.ts `loadEnvFile`: pass `override: true` to dotenv so
  `.env` wins over stale shell exports. Matches the single-source-of-
  truth contract introduced by PR #209 (compose no longer re-exports
  `DEPLOYMENT_MODE`). Production skips loadEnvFile entirely, so
  production deploys are unaffected.
- config/index.ts `isPluginInstalled`: walk up parent directories for
  the `node_modules/<pkg>` folder, falling back from
  `require.resolve()`. Fixes detection of ESM-only packages whose
  `exports` map omits a `require` condition (e.g. @lobu/owletto-openclaw).
- routes/public/connections.ts: bump webhook-processing log from debug
  → info so Slack/Telegram inbound traffic is visible in prod without
  flipping LOG_LEVEL.
- worker/src/openclaw/worker.ts: drop the proactive `owletto_login`
  call at worker startup. Now that the gateway registers the owletto
  MCP server correctly (theme 1), the model can call `owletto_login`
  lazily when a memory tool first returns an auth error — more natural
  than pushing an unsolicited login link at every session start.
- worker/src/gateway/sse-client.ts: signature updates matching the
  gateway-side changes above.

## Tests (+~1,000 LOC)
Expanded coverage: base-deployment-grants (ensureDeployment coalescing,
spawnDeployment rename), chat-response-bridge (Slack markdown
buffer-and-post path), docker-deployment (409 handling), embedded-
deployment (workers.has short-circuit), interaction-bridge action
handlers (expired-card UX), mcp-config-service (upsertGlobalServer),
message-handler-bridge (backfill, mention strip, DM-thread conversationId).
@buremba buremba force-pushed the refactor/auth-split-continuation branch from 83a780a to 75cb4c8 Compare April 18, 2026 03:13
@buremba
Member Author

buremba commented Apr 18, 2026

Addressed both Codex comments in the latest push (75cb4c8):

P1 (sse-client.ts:511 — per-user isolation): Fixed. You're right — generateDeploymentName in the identity overload hashes only platform:channelId:conversationId (see base-deployment-manager.ts:94-101), so every user in a shared thread is routed to the same SSE queue, and the worker's baked-in WORKER_TOKEN encodes only the original spawning user's id. Without the equality guard, a second user's message would execute with the first user's token and their OAuth-scoped integrations.

Restored the data.userId !== this.userId drop with an updated comment that explains the isolation contract (WORKER_TOKEN is the enforcement point, deployment-name hashing is not). Long-term fix is per-message JWT minting so shared-thread workers can act on behalf of the actual sender — leaving that for a follow-up.

P2 (chat-response-bridge.ts:363 — Slack fallback target): Fixed. Same fix I flagged during my self-review in the previous round: now resolving the SDK target via this.resolveTarget(...) during delta setup on the Slack path, so when postSlackMarkdown returns false (missing adapter client) or throws, the else if (stream.target) fallback has a real non-null target and delivers via target.post(cleaned) instead of silently dropping.

@buremba
Member Author

buremba commented Apr 18, 2026

Correction on the P1 (sse-client.ts): I've reverted the userId equality check in 834d665.

Channels/threads are intentionally one-worker-per-platform:channelId:conversationId (see generateDeploymentName), shared across all posting users — that's the group-bot design. The check I added would have silently dropped every non-spawning user's message in a channel, breaking mention/subscribed flows. DMs are single-participant so the check would have been dead there anyway.

The underlying concern — WORKER_TOKEN is scoped to the original spawning user's id in a worker shared across many users — is real but a design tradeoff, not something to fix with a filter at the worker boundary. Proper fix is per-message JWT minting; that's out of scope for this PR and is now called out in the comment above the (removed) check.

@buremba buremba merged commit 92ce0eb into main Apr 18, 2026
13 of 14 checks passed
@buremba buremba deleted the refactor/auth-split-continuation branch April 18, 2026 03:30
buremba added a commit that referenced this pull request May 21, 2026
Picks up: deploy: add Postgres PITR recovery runbook + seal summaries-db-app (#210)

Before: 83c9a09
After:  bf904a7