Skip to content

fix(host-service): catch tunnel connect errors to prevent crash on DNS failure#3861

Merged
Kitenite merged 1 commit into
mainfrom
fix-terminal-websocket-di
Apr 29, 2026
Merged

fix(host-service): catch tunnel connect errors to prevent crash on DNS failure#3861
Kitenite merged 1 commit into
mainfrom
fix-terminal-websocket-di

Conversation

@Kitenite
Copy link
Copy Markdown
Collaborator

@Kitenite Kitenite commented Apr 29, 2026

Summary

  • When the laptop wakes from sleep and DNS hasn't recovered, the tunnel's auto-reconnect path calls JwtApiAuthProvider.getJwt() which fetches api.superset.sh. The fetch throws ENOTFOUND, the rejection escapes via void this.connect() (scheduleReconnect's setTimeout), and the host-service process crashes — orphaning every PTY.
  • The coordinator respawns the host-service on a new port, but the renderer's terminal WebSockets are still pointed at the dead port → "WebSocket closed (1006). Reconnecting…" loop until max attempts → user reports "all terminals died overnight."
  • Fix: wrap TunnelClient.connect()'s body in try/catch. Any throw (DNS failure, WebSocket constructor error, etc.) now routes through the same scheduleReconnect() path that handles normal socket-level errors.

Evidence

Reproduced in ~/.superset/host/<org>/host-service.log:

[host-service:tunnel] socket error: ErrorEvent { ... }
[host-service:tunnel] reconnecting in 1000ms (attempt 1)
node:internal/process/promises:394
    triggerUncaughtException(err, true /* fromPromise */);
[TypeError: fetch failed] {
  [cause]: Error: getaddrinfo ENOTFOUND api.superset.sh
}
Node.js v24.14.0

Followed immediately by a fresh [host-service:db] Initialized line — the coordinator-driven respawn that breaks the renderer's terminal sockets. No [host-service] unhandledRejection — staying up log was emitted, so the safety net in safety.ts didn't catch it (likely Node 24 default --unhandled-rejections=throw behavior on this path).

Test plan

  • Build host-service, kill network (sudo ifconfig en0 down or unplug Ethernet) while a terminal is open
  • Confirm host-service stays up: pgrep -fl host-service.js shows the same pid before and after
  • Confirm host-service.log shows [host-service:tunnel] connect failed: ... followed by reconnecting in Nms (attempt N) — no Node crash footer
  • Restore network, confirm tunnel reconnects (connected to relay for host ...) and existing terminal WebSockets keep working
  • Sanity: typecheck passes for @superset/host-service (verified locally)

Summary by cubic

Prevent @superset/host-service from crashing on DNS/network failures during tunnel reconnect by catching connect errors and routing them through the backoff, so terminals survive sleep.

  • Bug Fixes
    • Wrapped TunnelClient.connect() in try/catch; on any error (auth token fetch, WebSocket init), log [host-service:tunnel] connect failed: ..., clear the socket, and call scheduleReconnect().
    • Avoids host-service process crashes that orphan PTYs and cause dead-port terminal reconnect loops.

Written for commit 4b16dab. Summary will update on new commits. Review in cubic

Summary by CodeRabbit

  • Bug Fixes
    • Improved tunnel connection stability: all connection failures are now properly caught, logged, and trigger automatic reconnection attempts.
    • Ensures transient errors (including token or network issues) no longer leave the tunnel in a broken state, reducing downtime and improving reliability.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

Wraps the body of TunnelClient.connect() in a try/catch so that any synchronous or async error (most importantly the ENOTFOUND thrown by getAuthToken()'s fetch after a sleep/wake DNS hiccup) is routed through scheduleReconnect() instead of escaping as an unhandled rejection that crashes the host-service process. The fix is minimal, well-commented, and correctly leverages the existing guard in scheduleReconnect() to prevent double-scheduling.

Confidence Score: 5/5

Safe to merge — the change is a targeted, low-risk crash fix with no logic regressions.

Single-file change with a clear, well-bounded fix. All error paths now funnel through the existing scheduleReconnect() which has a guard against double-scheduling. No new state is introduced. The only finding is a P2 logging suggestion.

No files require special attention.

Important Files Changed

Filename Overview
packages/host-service/src/tunnel/tunnel-client.ts Wraps connect() body in try/catch to route DNS/fetch errors through scheduleReconnect() instead of crashing the process; logic is sound and the guard in scheduleReconnect() prevents double-scheduling.

Sequence Diagram

sequenceDiagram
    participant SR as scheduleReconnect()
    participant C as connect()
    participant GA as getAuthToken()
    participant WS as WebSocket

    SR->>C: void this.connect() [after delay]
    Note over C: try {
    C->>GA: await getAuthToken()
    alt DNS failure / network error
        GA-->>C: throws ENOTFOUND
        Note over C: } catch (error) {
        C->>C: console.error("connect failed: ...")
        C->>C: this.socket = null
        C->>SR: scheduleReconnect()
    else token is null
        GA-->>C: returns null
        C->>C: console.warn("no auth token")
        C->>SR: scheduleReconnect()
    else success
        GA-->>C: returns token
        C->>WS: new WebSocket(url)
        WS-->>C: socket connected (onopen)
        C->>C: reconnectAttempts = 0
        alt socket error / close
            WS-->>C: onclose fired
            C->>SR: scheduleReconnect()
        end
    end
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/host-service/src/tunnel/tunnel-client.ts
Line: 88-92

Comment:
**Log the full error for better diagnostics**

Only `error.message` is logged here, which strips the `[cause]` chain that carries the root cause (e.g. `getaddrinfo ENOTFOUND api.superset.sh`). Passing the original error object as a second argument to `console.error` preserves the full stack and cause chain in the log file.

```suggestion
} catch (error) {
			const message = error instanceof Error ? error.message : String(error);
			console.error(`[host-service:tunnel] connect failed: ${message}`, error);
			this.socket = null;
			this.scheduleReconnect();
		}
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(host-service): catch tunnel connect ..." | Re-trigger Greptile

Comment on lines +88 to +92
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
console.error(`[host-service:tunnel] connect failed: ${message}`);
this.socket = null;
this.scheduleReconnect();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Log the full error for better diagnostics

Only error.message is logged here, which strips the [cause] chain that carries the root cause (e.g. getaddrinfo ENOTFOUND api.superset.sh). Passing the original error object as a second argument to console.error preserves the full stack and cause chain in the log file.

Suggested change
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
console.error(`[host-service:tunnel] connect failed: ${message}`);
this.socket = null;
this.scheduleReconnect();
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
console.error(`[host-service:tunnel] connect failed: ${message}`, error);
this.socket = null;
this.scheduleReconnect();
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/host-service/src/tunnel/tunnel-client.ts
Line: 88-92

Comment:
**Log the full error for better diagnostics**

Only `error.message` is logged here, which strips the `[cause]` chain that carries the root cause (e.g. `getaddrinfo ENOTFOUND api.superset.sh`). Passing the original error object as a second argument to `console.error` preserves the full stack and cause chain in the log file.

```suggestion
} catch (error) {
			const message = error instanceof Error ? error.message : String(error);
			console.error(`[host-service:tunnel] connect failed: ${message}`, error);
			this.socket = null;
			this.scheduleReconnect();
		}
```

How can I resolve this? If you propose a fix, please make it concise.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 29, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 885642de-fd22-4129-bb9a-d5902dcd3b73

📥 Commits

Reviewing files that changed from the base of the PR and between 9443b87 and 4b16dab.

📒 Files selected for processing (1)
  • packages/host-service/src/tunnel/tunnel-client.ts

📝 Walkthrough

Walkthrough

The connect() method in the tunnel client now wraps asynchronous auth-token retrieval, relay URL construction, WebSocket creation, and event handler setup inside a try/catch. Any exception logs a failure, clears the socket reference, and schedules a reconnect; the explicit "no auth token" retry path is preserved.

Changes

Cohort / File(s) Summary
Tunnel Client Error Handling
packages/host-service/src/tunnel/tunnel-client.ts
Wrapped the entire connect() async flow (token fetch, URL build, WebSocket creation, handler assignment) in a try/catch. On error: log "connect failed", set socket to null, and call scheduleReconnect(). Maintains existing behavior for null auth token.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A wobble in wires, a missing reply,
I wrapped the connect flow so errors won't fly.
Now failures get logged and the socket stays neat,
Reconnects hop back on reliable feet. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main change: adding error handling to prevent crashes when DNS or network failures occur during tunnel reconnection.
Description check ✅ Passed The description covers all required template sections with substantial detail: a clear summary of the problem, evidence with reproduction logs, a concrete fix explanation, and a comprehensive test plan.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-terminal-websocket-di

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/host-service/src/tunnel/tunnel-client.ts`:
- Around line 50-64: After awaiting getAuthToken(), re-check the instance closed
flag so you don't open a socket after close() runs: inside the same method (the
block that calls getAuthToken()), add an immediate guard like "if (this.closed)
return;" after the await (before constructing the URL / new WebSocket) and
ensure you do not call scheduleReconnect or create this.socket when closed; this
prevents creating a live WebSocket on a client that was shut down (references:
getAuthToken, close, scheduleReconnect, this.closed, this.socket).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: da4a21aa-f6b3-4bec-82dc-612c0d2945fe

📥 Commits

Reviewing files that changed from the base of the PR and between 382ac7f and 9443b87.

📒 Files selected for processing (1)
  • packages/host-service/src/tunnel/tunnel-client.ts

Comment on lines +50 to +64
try {
const token = await this.getAuthToken();
if (!token) {
console.warn("[host-service:tunnel] no auth token available, retrying");
this.scheduleReconnect();
return;
}
};

socket.onerror = (event) => {
console.error("[host-service:tunnel] socket error:", event);
};
const url = new URL("/tunnel", this.relayUrl);
url.protocol = url.protocol === "https:" ? "wss:" : "ws:";
url.searchParams.set("hostId", this.hostId);
url.searchParams.set("token", token);

const socket = new WebSocket(url.toString());
this.socket = socket;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Re-check this.closed after the awaited token fetch.

close() can run while getAuthToken() is in flight. In that case this method still creates a new relay socket after shutdown, because the only closed guard is before Line 51. That leaves a live connection behind a client that was already closed.

Suggested fix
 		try {
 			const token = await this.getAuthToken();
+			if (this.closed) {
+				return;
+			}
+
 			if (!token) {
 				console.warn("[host-service:tunnel] no auth token available, retrying");
 				this.scheduleReconnect();
 				return;
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/host-service/src/tunnel/tunnel-client.ts` around lines 50 - 64,
After awaiting getAuthToken(), re-check the instance closed flag so you don't
open a socket after close() runs: inside the same method (the block that calls
getAuthToken()), add an immediate guard like "if (this.closed) return;" after
the await (before constructing the URL / new WebSocket) and ensure you do not
call scheduleReconnect or create this.socket when closed; this prevents creating
a live WebSocket on a client that was shut down (references: getAuthToken,
close, scheduleReconnect, this.closed, this.socket).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 29, 2026

🚀 Preview Deployment

🔗 Preview Links

Service Status Link
Neon Database (Neon) View Branch
Vercel API (Vercel) Open Preview
Vercel Web (Vercel) Open Preview
Vercel Marketing (Vercel) Open Preview
Vercel Admin (Vercel) Open Preview
Vercel Docs (Vercel) Open Preview

Preview updates automatically with new commits

…S failure

When the laptop wakes from sleep and DNS hasn't recovered, the tunnel's
auto-reconnect path calls getAuthToken() which fetches api.superset.sh.
The fetch throws ENOTFOUND, the rejection escapes via `void this.connect()`,
and the host-service process crashes — orphaning every PTY. The coordinator
respawns on a new port, but the renderer keeps reconnecting to the dead
port until it gives up.

Wrap connect()'s body in try/catch and route any throw back through
scheduleReconnect, so transient network failures behave the same as
WebSocket socket-level errors.
@Kitenite Kitenite force-pushed the fix-terminal-websocket-di branch from 9443b87 to 4b16dab Compare April 29, 2026 17:58
@Kitenite Kitenite merged commit 4de08b5 into main Apr 29, 2026
5 of 7 checks passed
@Kitenite Kitenite deleted the fix-terminal-websocket-di branch April 29, 2026 17:59
MocA-Love pushed a commit to MocA-Love/superset that referenced this pull request May 21, 2026
…S failure (superset-sh#3861)

When the laptop wakes from sleep and DNS hasn't recovered, the tunnel's
auto-reconnect path calls getAuthToken() which fetches api.superset.sh.
The fetch throws ENOTFOUND, the rejection escapes via `void this.connect()`,
and the host-service process crashes — orphaning every PTY. The coordinator
respawns on a new port, but the renderer keeps reconnecting to the dead
port until it gives up.

Wrap connect()'s body in try/catch and route any throw back through
scheduleReconnect, so transient network failures behave the same as
WebSocket socket-level errors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant