fix(gateway): hydrate connection on per-connection webhook under multi-replica#1098
Conversation
…i-replica The per-connection webhook route (/api/v1/webhooks/:id) calls ChatInstanceManager.handleWebhook directly and 404s when the connection's Chat instance isn't warm on the receiving pod. Under app.replicaCount>1 a Slack webhook (a platform event OR a slash command) landing on a pod that hasn't warmed the connection is dropped. Events mostly survive via Slack's retries; a one-shot `/lobu link <code>` slash command does not — Slack reports "app did not respond" and the preview claim is never consumed. handleWebhook now calls ensureConnectionRunning() before giving up — the same store-backed hydration the /slack/events coordinator and postMessageToChannel already perform. It is a no-op when the instance is already running, so the coordinator's existing pre-call stays harmless. Reproducer (red->green): a cold-pod webhook now hydrates from the store and is handled (200) instead of 404; a stopped connection still 404s.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR enables multi-replica webhook delivery by making ChangesMulti-replica webhook delivery
Sequence DiagramsequenceDiagram
participant Client
participant handleWebhook
participant ensureConnectionRunning
participant ConnectionStore
participant WebhookHandler
Client->>handleWebhook: POST webhook (connectionId)
handleWebhook->>handleWebhook: lookup instance in memory
alt instance not found
handleWebhook->>ensureConnectionRunning: start connection from store
ensureConnectionRunning->>ConnectionStore: fetch and boot connection
ConnectionStore-->>ensureConnectionRunning: connection started
handleWebhook->>handleWebhook: re-fetch instance from registry
end
handleWebhook->>WebhookHandler: invoke handler with instance
WebhookHandler-->>handleWebhook: response
handleWebhook-->>Client: HTTP response
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
pi review — SHIP-WITH-NITSThe fix correctly adds the missing lazy hydration in Nits — all pre-existing, none blocking, tracked as follow-ups:
Live |
|
bug_free 84, simplicity 93, slop 0, bugs 0, 0 blockers Typecheck and unit passed. Canonical integration ran the affected chat-instance-manager Slack tests before later failing in untouched agent-routes-apply. [env] embedded Postgres init hit shared-memory exhaustion; exploratory narrow rerun of the affected test hit the same initdb shm failure before tests. Full verdict JSON{
"bug_free_confidence": 84,
"bugs": 0,
"slop": 0,
"simplicity": 93,
"blockers": [],
"change_type": "fix",
"behavior_change_risk": "medium",
"tests_adequate": true,
"suggested_fixes": [],
"notes": "Typecheck and unit passed. Canonical integration ran the affected chat-instance-manager Slack tests before later failing in untouched agent-routes-apply. [env] embedded Postgres init hit shared-memory exhaustion; exploratory narrow rerun of the affected test hit the same initdb shm failure before tests.",
"categories": {
"src": 17,
"tests": 55,
"docs": 0,
"config": 0,
"deps": 0,
"migrations": 0,
"ci": 0,
"generated": 0
}
}Local review gate — branch protection can require the |
|
Correcting the record: this PR was opened on the belief it fixed the Slack This change is still valid, as multi-replica hardening: the per-connection webhook ( |
What was broken
The Slack Preview flow (
/lobu link <code>to bind a chat to an agent) dead-ends in prod. Tested live against the working lobu-crm@Lobsterbot:oauth_states(no binding created).Root cause
The slash command (and the per-connection Event URL) hits
/api/v1/webhooks/:connectionId→ChatInstanceManager.handleWebhook(), which looks up the Chat instance in the pod-localthis.instancesmap and 404s if it isn't warm on the receiving pod. It's the only inbound path that doesn't hydrate the connection from the store first.Prod runs
app.replicaCount: 2. With ClientIP affinity, a webhook can land on the replica that never warmed this connection → instant 404. Slack events mostly survive because Slack retries 3x and eventually hits the warm pod, but a one-shot slash command doesn't tolerate that, so it surfaced as "app did not respond" + an unconsumed claim. It worked end-to-end when the app ran a single replica.The
/slack/eventscoordinator (handleAppWebhook) andpostMessageToChannelalready do the right thing —ensureConnectionRunning()before forwarding.handleWebhookjust never adopted it.Fix
handleWebhooknow callsensureConnectionRunning(connectionId)before giving up — the same store-backed lazy-start the coordinator andpostMessageToChanneluse. It's a no-op when the instance is already running, so the coordinator's existing pre-call stays harmless. Any replica can now serve any connection's inbound webhook.Reproducer (red → green)
packages/server/src/gateway/__tests__/chat-instance-manager-slack.test.ts— "ChatInstanceManager.handleWebhook (multi-replica)":restartConnectionnever called.Remaining (separate follow-ups, not in this PR)
allowedSurfacesbefore deleting the code.previewModeprovisioning seam (declare vialobu applyinstead of a manual DB edit).Live verification still owed
Prod runs the old code; the live
/lobu linkretest happens after this merges and Flux rolls out the new image.Summary by CodeRabbit
Bug Fixes
Tests