Skip to content

fix(host-service): unstrand tunnel reconnect; wire relay Sentry#4537

Merged
saddlepaddle merged 3 commits into
mainfrom
debug-relay-outage
May 14, 2026
Merged

fix(host-service): unstrand tunnel reconnect; wire relay Sentry#4537
saddlepaddle merged 3 commits into
mainfrom
debug-relay-outage

Conversation

@saddlepaddle
Copy link
Copy Markdown
Collaborator

@saddlepaddle saddlepaddle commented May 14, 2026

Summary

Investigating why my hosts stopped reconnecting after a relay deploy, found two issues — one a real bug, one missing observability.

Bug fix: host tunnel never reconnects after relay drops it (when active local channels exist).
TunnelClient.cleanupChannels calls channel.ws.close(1001, ...). 1001 is server-reserved per the WHATWG WebSocket spec; Undici (Node's built-in WebSocket) throws InvalidAccessError when clients send it. The throw bubbled out of socket.onclose before scheduleReconnect() ran, leaving the TunnelClient in a permanent dead state — socket=null, connecting=false, reconnectTimer=null, closed=false. Only relaunching the desktop app (= fresh connectRelay()) recovered it. Only hits hosts with active local channels at disconnect time, which is why idle hosts in the relay logs looked fine.

Caught via a real [host-service] uncaughtException — staying up { DOMException [InvalidAccessError]: invalid code } in ~/.superset/host/<org>/host-service.log after the deploy.

Observability: wire relay Sentry.
Relay had @sentry/node initialized but RELAY_SENTRY_DSN was never set in Fly secrets, so two recent crashes (exit_code=1, not OOM) left no stack trace. Adds:

  • consoleLoggingIntegration so every console.* (including Hono request log lines, with tokens already redacted by redactingLogger) ships to Sentry Logs.
  • onUncaughtExceptionIntegration({ exitEvenIfOtherHandlersAreRegistered: false }) + top-level process.on('uncaughtException'|'unhandledRejection') handlers — log + suppress instead of exiting, so a single uncaught throw doesn't drop every active tunnel.
  • app.onError() Hono hook to capture errors thrown out of route handlers (Hono swallows these into a default 500 otherwise).

RELAY_SENTRY_DSN has been staged + deployed to superset-relay already.

Test plan

  • bun run --filter=@superset/relay typecheck passes
  • bun run --filter=@superset/host-service typecheck passes
  • bun run lint clean
  • Relay v24 deployed; /health returns 200 in ~70ms; no Sentry init errors in fresh logs
  • Cut a desktop canary with this fix and verify: trigger a relay restart, confirm host with active terminal sessions reconnects automatically (host-service.log should show [host-service:tunnel] reconnecting in Xms followed by connected to relay for host ...)

Adjacent finds (not in this PR)

  • apps/relay/src/tunnel.ts:171 has the same clientWs.close(1001, reason) pattern. Relay uses the ws npm package via @hono/node-ws, which is more permissive than Undici, so it likely doesn't throw — but worth verifying.
  • Single-machine relay topology (sjc only). fly.toml comments reference a 6-region scale-out but actual state is one machine, so every relay deploy is a 100% outage window.
  • Defense-in-depth alternative: wrap the entire socket.onclose body in try/catch so any future throw inside it can't strand the client.

Summary by cubic

Fixes a host reconnect bug and wires relay Sentry to keep tunnels stable and errors observable. Hosts now reconnect after relay restarts, and relay errors are captured without taking the process down.

  • Bug Fixes

    • Host tunnel cleanup now uses WebSocket close code 1000 with a try/catch, preventing a throw that blocked scheduleReconnect().
    • Logs unexpected channel close errors so they surface in Sentry Logs.
  • New Features

    • Wired @sentry/node with console logging and uncaught-exception integrations; added Hono app.onError to capture route errors.
    • Process uncaughtException/unhandledRejection handlers now log only (avoids double-capture); relay stays up. Enabled RELAY_SENTRY_DSN on superset-relay.

Written for commit 9827893. Summary will update on new commits.

Summary by CodeRabbit

  • Bug Fixes

    • Centralized handling of unexpected process and request errors with standardized JSON 500 responses.
    • More robust WebSocket cleanup with graceful close and suppressed errors on failure.
  • Improvements

    • Improved error reporting and monitoring configuration.
    • Added console logging integration for clearer runtime visibility.

Review Change Stack

Two recent crashes left no trace because RELAY_SENTRY_DSN was never
deployed. Adds Sentry Logs via consoleLoggingIntegration, plus
process- and Hono-level error handlers that keep the relay alive
instead of exiting, so a single bug doesn't drop every host tunnel.
WebSocket.close(1001) is server-reserved per the WHATWG spec; Undici
throws InvalidAccessError when clients use it. The throw bubbled out
of socket.onclose before scheduleReconnect ran, stranding any host
with active local channels after a relay disconnect — users had to
relaunch the desktop app to recover.
@capy-ai
Copy link
Copy Markdown

capy-ai Bot commented May 14, 2026

Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 70f9c193-c94b-4889-8a81-6bd5638d94b9

📥 Commits

Reviewing files that changed from the base of the PR and between 6bdd3ce and 9827893.

📒 Files selected for processing (2)
  • apps/relay/src/index.ts
  • packages/host-service/src/tunnel/tunnel-client.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/host-service/src/tunnel/tunnel-client.ts

📝 Walkthrough

Walkthrough

Enables Sentry console logs and uncaught-exception integration; adds process-level suppressed logging for uncaught exceptions/rejections (no Sentry capture there); centralizes Hono request error reporting to Sentry and returns JSON 500; makes tunnel WebSocket cleanup close with code 1000 and ignore close-time errors.

Changes

Error Handling and Resilience

Layer / File(s) Summary
Sentry Configuration Enhancement
apps/relay/src/sentry.ts
Sentry.init adds _experiments.enableLogs plus console and uncaught-exception integrations (uncaught integration configured to avoid forcing process exit).
Process-Level Suppressed Handlers
apps/relay/src/index.ts
Registers process.on handlers for uncaughtException and unhandledRejection that write suppressed error/reason details to stderr and do not call Sentry from those handlers.
Hono HTTP Error Handler
apps/relay/src/index.ts
Adds app.onError that captures the exception to Sentry with op: "hono.onError" and the request pathname, then responds with { error: "Internal server error" } and HTTP 500.
WebSocket Graceful Cleanup
packages/host-service/src/tunnel/tunnel-client.ts
cleanupChannels() closes local channel WebSockets with close code 1000 and reason "Tunnel disconnected", catching and warning on any close errors before clearing localChannels.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Poem

🐰 I twitch my nose at noisy crash,
I log what's lost, suppress the splash,
Hono tells Sentry what went wrong,
WebSockets close soft, tidy, and strong,
A quiet burrow, errors logged in hash.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the two main changes: fixing a host tunnel reconnect bug and adding Sentry observability to relay.
Description check ✅ Passed The description is comprehensive and detailed, covering all required template sections including Summary, Related Issues (implicit), Type of Change (bug fix + observability), Testing, and Additional Notes with adjacent findings.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch debug-relay-outage

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR fixes a real reconnect-stranding bug in the host-side tunnel client (1001 \u2192 1000 close code, wrapped in try/catch) and adds Sentry observability to the relay. The tunnel fix is correct; the Sentry wiring has an integration-registration issue that can prevent the no-exit behaviour from taking effect.

  • Tunnel bug fix (tunnel-client.ts): changing the WebSocket close code from 1001 to 1000 and wrapping in try/catch correctly unstrands cleanupChannels so scheduleReconnect() always runs after a relay drop.
  • Relay Sentry setup (sentry.ts / index.ts): because onUncaughtExceptionIntegration and onUnhandledRejectionIntegration are auto-enabled by default in @sentry/node, passing them in the integrations: [...] array installs a second copy on top of the defaults; the default copy retains exitEvenIfOtherHandlersAreRegistered: true and will call process.exit() unconditionally, defeating the intent of keeping the relay alive.

Confidence Score: 3/5

The tunnel-client fix is safe to merge on its own, but the relay Sentry changes may not achieve their stated goal of keeping the process alive on uncaught exceptions.

The tunnel fix resolves the reconnect bug cleanly. The relay Sentry wiring has a real defect: the array form of integrations adds on top of defaults, so the default onUncaughtExceptionIntegration fires alongside the new no-exit one and may still call process.exit() unconditionally.

apps/relay/src/sentry.ts needs the function form of integrations to remove the default OnUncaughtException and OnUnhandledRejection copies before adding the customised replacements.

Important Files Changed

Filename Overview
packages/host-service/src/tunnel/tunnel-client.ts Fixes reconnect-stranding bug by changing ws.close() code from 1001 to 1000 and wrapping in try/catch; the root-cause fix is sound
apps/relay/src/sentry.ts Adds consoleLoggingIntegration and onUncaughtExceptionIntegration, but both are auto-enabled by default in @sentry/node so the array form installs duplicates rather than replacing the defaults
apps/relay/src/index.ts Adds process-level uncaughtException/unhandledRejection handlers and app.onError hook; may not prevent exit if default Sentry integrations still call process.exit() unconditionally

Sequence Diagram

sequenceDiagram
    participant R as Relay
    participant TC as TunnelClient (host)
    participant LC as LocalChannel WS

    R->>TC: drops WebSocket (relay deploy / crash)
    TC->>TC: socket.onclose fires
    TC->>TC: stopWatchdog()
    TC->>TC: cleanupChannels()
    loop each active local channel
        TC->>LC: ws.close(1000, Tunnel disconnected) [was 1001 threw InvalidAccessError]
        LC-->>TC: ok or caught exception
    end
    TC->>TC: scheduleReconnect() now always reached
    TC->>R: reconnect after backoff delay
    R-->>TC: new WebSocket accepted
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
apps/relay/src/sentry.ts:15-22
Both `onUncaughtExceptionIntegration` and `onUnhandledRejectionIntegration` are **auto-enabled by default** in `@sentry/node`. Passing them in the `integrations: [...]` array form adds a second copy on top of the defaults rather than replacing them. The default copy has `exitEvenIfOtherHandlersAreRegistered: true`, so it will still call `process.exit()` unconditionally — directly defeating the intent of this PR. Every uncaught exception will also be captured twice by Sentry (once by each instance). Use the function form to filter out the defaults first, then supply the customised replacements.

```suggestion
		integrations: (defaults) => [
			...defaults.filter(
				(i) =>
					i.name !== "OnUncaughtException" &&
					i.name !== "OnUnhandledRejection",
			),
			Sentry.consoleLoggingIntegration({
				levels: ["log", "info", "warn", "error"],
			}),
			Sentry.onUncaughtExceptionIntegration({
				exitEvenIfOtherHandlersAreRegistered: false,
			}),
			Sentry.onUnhandledRejectionIntegration({
				mode: "none",
			}),
		],
```

### Issue 2 of 2
packages/host-service/src/tunnel/tunnel-client.ts:303-310
The empty `catch {}` silently discards any error that isn't the now-fixed 1001 code issue. Since the primary fix is the `1000` code change, the try/catch is purely defensive, but a silent swallow means future `ws.close()` problems on cleanup will be invisible in logs.

```suggestion
	private cleanupChannels(): void {
		for (const channel of this.localChannels.values()) {
			try {
				channel.ws.close(1000, "Tunnel disconnected");
			} catch (err) {
				console.warn("[host-service:tunnel] error closing local channel ws", err);
			}
		}
		this.localChannels.clear();
	}
```

Reviews (1): Last reviewed commit: "fix(host-service): use valid close code ..." | Re-trigger Greptile

Comment thread apps/relay/src/sentry.ts
Comment on lines +15 to +22
integrations: [
Sentry.consoleLoggingIntegration({
levels: ["log", "info", "warn", "error"],
}),
Sentry.onUncaughtExceptionIntegration({
exitEvenIfOtherHandlersAreRegistered: false,
}),
],
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Both onUncaughtExceptionIntegration and onUnhandledRejectionIntegration are auto-enabled by default in @sentry/node. Passing them in the integrations: [...] array form adds a second copy on top of the defaults rather than replacing them. The default copy has exitEvenIfOtherHandlersAreRegistered: true, so it will still call process.exit() unconditionally — directly defeating the intent of this PR. Every uncaught exception will also be captured twice by Sentry (once by each instance). Use the function form to filter out the defaults first, then supply the customised replacements.

Suggested change
integrations: [
Sentry.consoleLoggingIntegration({
levels: ["log", "info", "warn", "error"],
}),
Sentry.onUncaughtExceptionIntegration({
exitEvenIfOtherHandlersAreRegistered: false,
}),
],
integrations: (defaults) => [
...defaults.filter(
(i) =>
i.name !== "OnUncaughtException" &&
i.name !== "OnUnhandledRejection",
),
Sentry.consoleLoggingIntegration({
levels: ["log", "info", "warn", "error"],
}),
Sentry.onUncaughtExceptionIntegration({
exitEvenIfOtherHandlersAreRegistered: false,
}),
Sentry.onUnhandledRejectionIntegration({
mode: "none",
}),
],
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/relay/src/sentry.ts
Line: 15-22

Comment:
Both `onUncaughtExceptionIntegration` and `onUnhandledRejectionIntegration` are **auto-enabled by default** in `@sentry/node`. Passing them in the `integrations: [...]` array form adds a second copy on top of the defaults rather than replacing them. The default copy has `exitEvenIfOtherHandlersAreRegistered: true`, so it will still call `process.exit()` unconditionally — directly defeating the intent of this PR. Every uncaught exception will also be captured twice by Sentry (once by each instance). Use the function form to filter out the defaults first, then supply the customised replacements.

```suggestion
		integrations: (defaults) => [
			...defaults.filter(
				(i) =>
					i.name !== "OnUncaughtException" &&
					i.name !== "OnUnhandledRejection",
			),
			Sentry.consoleLoggingIntegration({
				levels: ["log", "info", "warn", "error"],
			}),
			Sentry.onUncaughtExceptionIntegration({
				exitEvenIfOtherHandlersAreRegistered: false,
			}),
			Sentry.onUnhandledRejectionIntegration({
				mode: "none",
			}),
		],
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified against @sentry/core@10.46.0 source (integration.js: filterDuplicates + getIntegrationsToSetup): the array form does dedupe by name, and a user instance replaces a default of the same name (defaults are tagged isDefaultInstance = true; the merge rule "never want a default instance to overwrite an existing user instance" runs in our favor since user integrations come after defaults in the merged array).

Also verified @sentry/node-core@9.47.1 onUncaughtExceptionIntegration source: its built-in default for exitEvenIfOtherHandlersAreRegistered is false, not true. So even the default wouldn't unconditionally exit.

So the array form is fine here — single instance with our config, no exit unless no other handlers are attached.

You did sniff out a real adjacent issue though: Sentry's integration unconditionally calls captureException itself, and our process.on('uncaughtException'|'unhandledRejection') handlers were also calling captureSentryException — double captures. Fixed in 9827893 by slimming the handlers to just log (they still need to exist so Sentry's "other handlers attached" check stays true, keeping the process alive).

Comment on lines 303 to 310
private cleanupChannels(): void {
for (const channel of this.localChannels.values()) {
channel.ws.close(1001, "Tunnel disconnected");
try {
channel.ws.close(1000, "Tunnel disconnected");
} catch {}
}
this.localChannels.clear();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The empty catch {} silently discards any error that isn't the now-fixed 1001 code issue. Since the primary fix is the 1000 code change, the try/catch is purely defensive, but a silent swallow means future ws.close() problems on cleanup will be invisible in logs.

Suggested change
private cleanupChannels(): void {
for (const channel of this.localChannels.values()) {
channel.ws.close(1001, "Tunnel disconnected");
try {
channel.ws.close(1000, "Tunnel disconnected");
} catch {}
}
this.localChannels.clear();
}
private cleanupChannels(): void {
for (const channel of this.localChannels.values()) {
try {
channel.ws.close(1000, "Tunnel disconnected");
} catch (err) {
console.warn("[host-service:tunnel] error closing local channel ws", err);
}
}
this.localChannels.clear();
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/host-service/src/tunnel/tunnel-client.ts
Line: 303-310

Comment:
The empty `catch {}` silently discards any error that isn't the now-fixed 1001 code issue. Since the primary fix is the `1000` code change, the try/catch is purely defensive, but a silent swallow means future `ws.close()` problems on cleanup will be invisible in logs.

```suggestion
	private cleanupChannels(): void {
		for (const channel of this.localChannels.values()) {
			try {
				channel.ws.close(1000, "Tunnel disconnected");
			} catch (err) {
				console.warn("[host-service:tunnel] error closing local channel ws", err);
			}
		}
		this.localChannels.clear();
	}
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — applied in 9827893. With consoleLoggingIntegration wired, that console.warn flows to Sentry Logs automatically.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

  • ✅ Neon database branch

Thank you for your contribution! 🎉

Sentry's onUncaughtException/onUnhandledRejection integrations already
call captureException, so the process.on handlers were double-capturing.
Slim them to just log — they still exist so Sentry's "other handlers
exist" check keeps the process alive. Also surface unexpected channel
close errors via console.warn (which flows to Sentry Logs now).
@saddlepaddle saddlepaddle merged commit 0b6048b into main May 14, 2026
17 checks passed
saddlepaddle added a commit that referenced this pull request May 14, 2026
Changes since v0.2.16:

- host-service: connect-phase deadline + reconnect-guaranteed onclose so
  the tunnel can't get stranded mid-handshake; failed connects always
  schedule a retry. (#4539)
- host-service: unstrand tunnel reconnect path and wire relay Sentry so
  reconnect failures surface instead of silently looping. (#4537)
- build: bundle @mastra/duckdb and the @duckdb/node-bindings-* natives
  across darwin-arm64 / darwin-x64 / linux-x64 / linux-arm64 targets so
  duckdb-backed code paths work in the standalone CLI bundle.

Push cli-v0.2.17 after this lands to fire the release pipeline.
@saddlepaddle saddlepaddle mentioned this pull request May 14, 2026
3 tasks
saddlepaddle added a commit that referenced this pull request May 14, 2026
Changes since v0.2.16:

- host-service: connect-phase deadline + reconnect-guaranteed onclose so
  the tunnel can't get stranded mid-handshake; failed connects always
  schedule a retry. (#4539)
- host-service: unstrand tunnel reconnect path and wire relay Sentry so
  reconnect failures surface instead of silently looping. (#4537)
- build: bundle @mastra/duckdb and the @duckdb/node-bindings-* natives
  across darwin-arm64 / darwin-x64 / linux-x64 / linux-arm64 targets so
  duckdb-backed code paths work in the standalone CLI bundle.

Push cli-v0.2.17 after this lands to fire the release pipeline.
MocA-Love pushed a commit to MocA-Love/superset that referenced this pull request May 25, 2026
…rset-sh#4537)

* chore(relay): wire Sentry logs and uncaught-throw safety net

Two recent crashes left no trace because RELAY_SENTRY_DSN was never
deployed. Adds Sentry Logs via consoleLoggingIntegration, plus
process- and Hono-level error handlers that keep the relay alive
instead of exiting, so a single bug doesn't drop every host tunnel.

* fix(host-service): use valid close code in tunnel cleanupChannels

WebSocket.close(1001) is server-reserved per the WHATWG spec; Undici
throws InvalidAccessError when clients use it. The throw bubbled out
of socket.onclose before scheduleReconnect ran, stranding any host
with active local channels after a relay disconnect — users had to
relaunch the desktop app to recover.

* review: drop duplicate Sentry capture; log cleanupChannels failures

Sentry's onUncaughtException/onUnhandledRejection integrations already
call captureException, so the process.on handlers were double-capturing.
Slim them to just log — they still exist so Sentry's "other handlers
exist" check keeps the process alive. Also surface unexpected channel
close errors via console.warn (which flows to Sentry Logs now).
MocA-Love added a commit to MocA-Love/superset that referenced this pull request May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant