harden(host-service): connect-phase deadline + reconnect-guaranteed onclose#4539

Merged
saddlepaddle merged 1 commit into main from harden-tunnel-connect-deadline
May 14, 2026

@saddlepaddle
Collaborator

@saddlepaddle saddlepaddle commented May 14, 2026

Summary

Follow-up to #4537. That PR fixed the one known stuck-state path (1001 close code throwing inside cleanupChannels). This closes two more paths to the same socket=null, connecting=true, reconnectTimer=null wedge:

  • getAuthToken() hangs without an upstream timeout. The await never settles, connecting stays true, and every subsequent connect() call early-returns on the if (this.connecting) return; guard. No retry is ever scheduled, so the client stays stuck permanently.
  • WebSocket stalls in CONNECTING. With a captive portal, a NAT rebind mid-handshake, or a dead server, onopen/onclose/onerror never fire. The watchdog can't help because it is only started inside onopen, so the result is the same wedge.

Both collapse into one mechanism: a 20s connect-phase deadline. On timeout we force-close any in-flight socket (code 4001, application-reserved), reset state, and schedule a reconnect. The stale onclose for the abandoned socket no-ops via the existing this.socket !== socket identity check, so no double-scheduling.
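A minimal sketch of the deadline mechanism described above, under stated assumptions. `SocketLike`, `Timers`, `DeadlineSketch`, and the `reconnectsScheduled` counter are hypothetical stand-ins so the example runs without a network; only `CONNECT_TIMEOUT_MS`, `connecting`, `closed`, `timedOut`, and the 4001 close code come from this PR. It is not the actual tunnel-client.ts code.

```typescript
// Sketch only. SocketLike, Timers, DeadlineSketch and reconnectsScheduled are
// hypothetical; the other names follow the PR description.
const CONNECT_TIMEOUT_MS = 20_000;

interface SocketLike {
  onopen: (() => void) | null;
  onclose: (() => void) | null;
  close(code?: number, reason?: string): void;
}

// Injected so the sketch can be driven without real timers;
// in production these would wrap setTimeout / clearTimeout.
interface Timers {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

class DeadlineSketch {
  socket: SocketLike | null = null;
  connecting = false;
  closed = false;
  reconnectsScheduled = 0; // stand-in for the real backoff timer

  private readonly getAuthToken: () => Promise<string>;
  private readonly openSocket: (token: string) => SocketLike;
  private readonly timers: Timers;

  constructor(
    getAuthToken: () => Promise<string>,
    openSocket: (token: string) => SocketLike,
    timers: Timers,
  ) {
    this.getAuthToken = getAuthToken;
    this.openSocket = openSocket;
    this.timers = timers;
  }

  private scheduleReconnect(): void {
    this.reconnectsScheduled += 1;
  }

  async connect(): Promise<void> {
    if (this.closed || this.connecting) return;
    this.connecting = true;
    let timedOut = false;

    // Armed before the first await, so a hung getAuthToken() is covered too.
    const deadline = this.timers.set(() => {
      if (this.closed) return;
      timedOut = true;
      try {
        this.socket?.close(4001, "Connect timeout");
      } catch {
        // A throwing close() must not block the state reset below.
      }
      this.socket = null;
      this.connecting = false;
      this.scheduleReconnect();
    }, CONNECT_TIMEOUT_MS);

    try {
      const token = await this.getAuthToken();
      if (timedOut || this.closed) {
        // The deadline (or close()) already took over; a late token is a no-op.
        this.timers.clear(deadline);
        if (this.closed) this.connecting = false;
        return;
      }
      const socket = this.openSocket(token);
      this.socket = socket;
      socket.onopen = () => {
        this.timers.clear(deadline);
        this.connecting = false;
      };
      socket.onclose = () => {
        this.timers.clear(deadline);
        if (this.socket !== socket) return; // stale socket, retry already scheduled
        this.socket = null;
        this.connecting = false;
        this.scheduleReconnect();
      };
    } catch {
      this.timers.clear(deadline);
      if (timedOut) return; // the deadline already scheduled the retry
      this.connecting = false;
      this.scheduleReconnect();
    }
  }
}
```

Because the timer is armed before the first await, a single deadline covers both the hung token fetch and the stalled handshake.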

Also wraps the onclose body in try/finally so any future throw inside it still routes through scheduleReconnect() — defense-in-depth against the class of bug we just fixed (synchronous throw inside the close handler that bypasses the retry call).
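The try/finally guarantee can be sketched in isolation. `makeOnClose` and its three callback parameters are hypothetical names chosen for illustration; the real handler lives inside the client and does more (it also stops the watchdog and checks socket identity).

```typescript
// Sketch only: makeOnClose and its parameters are hypothetical.
function makeOnClose(
  cleanupChannels: () => void,
  scheduleReconnect: () => void,
  isClosed: () => boolean,
): () => void {
  return () => {
    try {
      cleanupChannels(); // may throw synchronously
    } finally {
      // Runs whether or not cleanup threw, so the retry is never skipped.
      if (!isClosed()) scheduleReconnect();
    }
  };
}
```

With a plain try/finally the original exception still propagates to the caller after the reconnect has been scheduled; adding a catch to log and swallow it is a separate design choice.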

What's left unaddressed

The multi-boolean state (closed / connecting / socket / reconnectTimer / watchdogTimer) is the canonical setup for these bugs. A single state enum ('idle' | 'connecting' | 'open' | 'reconnecting' | 'closed') with explicit transitions would prevent whole classes of wedge — that's the "robust WebSocket client" refactor pattern. Out of scope for this PR; worth a follow-up if anyone feels like a from-scratch rewrite.
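As a rough illustration of the single-state idea, here is a sketch that uses the five states named above. The `ALLOWED` transition table and the `TunnelStateMachine` class are assumptions made for the example; which transitions are actually legal would be decided in the follow-up.

```typescript
type TunnelState = "idle" | "connecting" | "open" | "reconnecting" | "closed";

// Hypothetical transition table; the real set of legal moves is a design decision.
const ALLOWED: Record<TunnelState, readonly TunnelState[]> = {
  idle: ["connecting", "closed"],
  connecting: ["open", "reconnecting", "closed"],
  open: ["reconnecting", "closed"],
  reconnecting: ["connecting", "closed"],
  closed: [], // terminal
};

class TunnelStateMachine {
  private current: TunnelState = "idle";

  get state(): TunnelState {
    return this.current;
  }

  // Every state change goes through here, so an illegal combination such as
  // "connecting with no timer armed" cannot be reached by setting flags piecemeal.
  transition(next: TunnelState): void {
    if (!ALLOWED[this.current].includes(next)) {
      throw new Error(`illegal transition ${this.current} -> ${next}`);
    }
    this.current = next;
  }
}
```

In this shape, a connect timeout would be the `connecting` to `reconnecting` transition, and arming the retry timer would be a side effect of entering `reconnecting` rather than a separate call that can be skipped.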

Test plan

  • bun run --filter=@superset/host-service typecheck passes
  • bun run lint clean
  • In a desktop canary: run host-service offline, confirm [host-service:tunnel] connect did not complete within 20000ms, forcing retry appears every ~20s in ~/.superset/host/<org>/host-service.log and [host-service:tunnel] reconnecting in Xms (attempt N) follows each one
  • In a desktop canary: connect normally, confirm connected to relay for host ... appears within ~3s (deadline cleared in onopen)
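The first canary check can be done mechanically on the log lines. The sketch below assumes that X and N in the reconnect message are plain integers, which the PR does not confirm; `unmatchedTimeouts` is a hypothetical helper, not something in the repo.

```typescript
// Patterns follow the two log messages quoted in the test plan.
const TIMEOUT_RE =
  /\[host-service:tunnel\] connect did not complete within \d+ms, forcing retry/;
// Assumption: "Xms" and "attempt N" are rendered as integers.
const RECONNECT_RE = /\[host-service:tunnel\] reconnecting in \d+ms \(attempt \d+\)/;

// Returns the 0-based indexes of timeout lines that are not followed by a
// reconnect line before the next timeout line (or the end of the log).
function unmatchedTimeouts(lines: readonly string[]): number[] {
  const unmatched: number[] = [];
  let pending = -1; // index of a timeout line still waiting for its reconnect line
  lines.forEach((line, i) => {
    if (TIMEOUT_RE.test(line)) {
      if (pending >= 0) unmatched.push(pending);
      pending = i;
    } else if (pending >= 0 && RECONNECT_RE.test(line)) {
      pending = -1;
    }
  });
  if (pending >= 0) unmatched.push(pending);
  return unmatched;
}
```

An empty result means every timeout was followed by a scheduled reconnect. A trailing timeout line is reported as unmatched, so the log is best captured shortly after a reconnect line.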

Summary by cubic

Prevents the tunnel client from getting stuck in “connecting” by adding a 20s connect-phase deadline and guaranteeing a reconnect even if onclose cleanup throws. On timeout, we force-close the socket (4001), reset state, and schedule a retry.

  • Bug Fixes
    • Added a 20s connect deadline to cover hanging getAuthToken() and stalled WebSocket handshakes; on timeout, close with 4001, reset, and reconnect.
    • Clear the deadline in all paths (onopen, auth failure, onclose, and catch) and ignore stale onclose via socket identity check.
    • Wrapped onclose cleanup in try/finally so scheduleReconnect() always runs; also stops the watchdog and cleans up channels safely.

Written for commit 7d9b7ec. Summary will update on new commits.

Summary by CodeRabbit

  • Bug Fixes
    • Added connection timeout enforcement for tunnel establishment to prevent hanging connections.
    • Improved socket cleanup and reconnection handling during connection failures.
    • Enhanced logging for better visibility into connection timeout events.

…nclose

Two stuck-state paths besides the one we already fixed:
- getAuthToken() can hang indefinitely (no upstream timeout) → connecting
  stays true → all future connect() calls early-return forever
- WebSocket can stall in CONNECTING (captive portal, NAT rebind mid-
  handshake) → onopen/onclose/onerror never fire → same wedge

A 20s connect-phase deadline collapses both into one mechanism. On
timeout we force-close any in-flight socket, reset state, and schedule
reconnect. Stale onclose for the abandoned socket no-ops via the
existing this.socket-identity check.

Also wraps onclose body in try/finally so any future throw inside it
still routes through scheduleReconnect — defense-in-depth against the
class of bug we just fixed (1001-close throwing inside cleanupChannels).
@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 40d775f7-b83b-4479-807d-181617dc5ca3

📥 Commits

Reviewing files that changed from the base of the PR and between 0b6048b and 7d9b7ec.

📒 Files selected for processing (1)
  • packages/host-service/src/tunnel/tunnel-client.ts

📝 Walkthrough

Walkthrough

The PR adds connection timeout enforcement to TunnelClient.connect(). A timeout constant is introduced and a deadline mechanism is established during connection attempts. The deadline is properly cleaned up across success, close, and error paths with conditional reconnect suppression.

Changes

Connection Timeout

| Layer / File(s) | Summary |
| --- | --- |
| Timeout constant and connection deadline (packages/host-service/src/tunnel/tunnel-client.ts) | CONNECT_TIMEOUT_MS constant is added. The connect() method now creates a deadline timer that enforces a timeout on token acquisition and connection; on timeout, the socket is closed with a specific code, the timedOut flag is set, and a reconnect is scheduled. After token acquisition, connect() returns early if the deadline has already fired or the client was closed. |
| Deadline cleanup on success and failure paths (packages/host-service/src/tunnel/tunnel-client.ts) | The deadline is cleared when the WebSocket successfully opens. The onclose handler is reorganized to clear the deadline, verify socket identity, run cleanup in a try/catch/finally block, and only schedule reconnect when the client is not fully closed. The outer catch handler clears the deadline and suppresses reconnect when the failure was triggered by timeout. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • superset-sh/superset#4537: Modifies tunnel lifecycle and WebSocket close handling in the same file, with direct relationship around reconnect state management on disconnect.

Poem

🐰 A timeout is born, so precise and so keen,
With deadlines and cleanups, a safety routine,
When sockets grow restless and won't play along,
The tunnel reconnects and keeps moving along! 🌀

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'harden(host-service): connect-phase deadline + reconnect-guaranteed onclose' directly and clearly summarizes the main changes: adding a connect deadline and ensuring reconnects even when onclose cleanup fails. |
| Description check | ✅ Passed | The PR description covers the motivation, implementation details, and testing strategy; it matches the template structure with clear sections on summary, related context, and test plan. All critical information is present. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions
Contributor

github-actions Bot commented May 14, 2026

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

  • ✅ Neon database branch

Thank you for your contribution! 🎉

@greptile-apps
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR adds a 20 s connect-phase deadline to TunnelClient.connect() that closes two previously unguarded wedge paths into the socket=null, connecting=true, reconnectTimer=null stuck state: an unbounded getAuthToken() await and a WebSocket that stalls in CONNECTING. It also wraps the onclose cleanup in try/finally so scheduleReconnect() fires even if cleanupChannels() throws.

  • Connect-phase deadline: a setTimeout(20 s) forces socket.close(4001), resets state, and calls scheduleReconnect(); existing socket identity checks in onclose prevent double-scheduling when the stale close event fires afterwards.
  • onclose try/finally: moves scheduleReconnect() into a finally block so any synchronous throw inside cleanup cannot bypass the retry path.
  • timedOut flag: gates all post-getAuthToken continuation paths so that a late-resolving/rejecting token future becomes a safe no-op after the deadline has already taken over.

Confidence Score: 4/5

Safe to merge; the deadline mechanism is logically sound and the socket identity guard correctly prevents double-scheduling across all traced paths.

The two findings are minor: the deadline timer is not tracked on the instance so close() cannot cancel it, and the timedOut early-return branch omits an explicit connecting = false that is implicitly guaranteed by the deadline side-effect. Neither causes misbehaviour today, but both create subtle implicit contracts that could bite a future refactor.

packages/host-service/src/tunnel/tunnel-client.ts — the new deadline timer lifetime and the asymmetric connecting reset in the timedOut early-return branch deserve a second look.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/host-service/src/tunnel/tunnel-client.ts | Adds a 20 s connect-phase deadline to prevent stuck connecting=true states caused by a hanging getAuthToken() or a stalled WebSocket handshake; wraps onclose cleanup in try/finally so scheduleReconnect() fires even if cleanup throws. Two minor concerns: the deadline timer is not stored on this (preventing cancellation from close()), and the timedOut early-return path skips an explicit connecting reset that relies on the deadline side-effect. |

Sequence Diagram

sequenceDiagram
    participant App
    participant TunnelClient
    participant Deadline as Deadline Timer (20 s)
    participant Auth as getAuthToken()
    participant WS as WebSocket

    App->>TunnelClient: connect()
    TunnelClient->>Deadline: setTimeout(20 s)
    TunnelClient->>Auth: await getAuthToken()

    alt "Auth hangs > 20 s"
        Deadline-->>TunnelClient: "fires: timedOut=true, socket.close(4001), socket=null, connecting=false"
        TunnelClient->>TunnelClient: scheduleReconnect()
        Auth-->>TunnelClient: eventually resolves/rejects
        TunnelClient->>TunnelClient: if (timedOut) early return (no-op)
    else Auth resolves in time
        Auth-->>TunnelClient: token
        TunnelClient->>WS: new WebSocket(url)
        alt "WS stalls in CONNECTING > 20 s"
            Deadline-->>TunnelClient: "fires: socket.close(4001), socket=null, connecting=false"
            TunnelClient->>TunnelClient: scheduleReconnect()
            WS-->>TunnelClient: onclose fires (stale identity check, no-op)
        else WS opens normally
            WS-->>TunnelClient: onopen, clearTimeout(deadline), startWatchdog()
        end
        alt WS closes for any reason
            WS-->>TunnelClient: onclose
            TunnelClient->>Deadline: clearTimeout(deadline)
            TunnelClient->>TunnelClient: try cleanup, finally scheduleReconnect()
        end
    end
Reviews (1): Last reviewed commit: "harden(host-service): connect-phase dead..."

Comment on lines +64 to +76
    const deadline = setTimeout(() => {
      if (this.closed) return;
      timedOut = true;
      console.warn(
        `[host-service:tunnel] connect did not complete within ${CONNECT_TIMEOUT_MS}ms, forcing retry`,
      );
      try {
        this.socket?.close(4001, "Connect timeout");
      } catch {}
      this.socket = null;
      this.connecting = false;
      this.scheduleReconnect();
    }, CONNECT_TIMEOUT_MS);

P2 Deadline timer not tracked on this, so close() cannot cancel it

deadline is a local variable, unlike reconnectTimer and watchdogTimer which are stored on the instance and cleared in close(). When close() is called while a connect() is mid-flight, onclose fires with the identity check returning early (because close() already nulled this.socket), so clearTimeout(deadline) is never reached. The timer then lives for up to 20 s before firing and short-circuiting on if (this.closed) return. No incorrect behaviour results, but it is an orphaned async resource that is inconsistent with the rest of the timer-management pattern and could confuse tests that assert no pending timers remain after close().
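One possible shape for the fix, sketched under assumptions: the timer handle is stored on the instance next to the other timers so close() can cancel it. `TrackedDeadlineSketch`, `connectDeadline`, `beginConnect`, and the injected `Timers` interface are hypothetical names, not identifiers from tunnel-client.ts.

```typescript
// Injected so the sketch can be exercised without real timers;
// in production these would wrap setTimeout / clearTimeout.
interface Timers {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

class TrackedDeadlineSketch {
  closed = false;
  connecting = false;
  // Hypothetical field: kept on the instance like reconnectTimer and watchdogTimer.
  private connectDeadline: unknown = null;
  private readonly timers: Timers;

  constructor(timers: Timers) {
    this.timers = timers;
  }

  beginConnect(timeoutMs: number, onTimeout: () => void): void {
    this.clearConnectDeadline(); // never leave two deadlines armed
    this.connecting = true;
    this.connectDeadline = this.timers.set(() => {
      this.connectDeadline = null;
      if (this.closed) return;
      this.connecting = false;
      onTimeout();
    }, timeoutMs);
  }

  private clearConnectDeadline(): void {
    if (this.connectDeadline !== null) {
      this.timers.clear(this.connectDeadline);
      this.connectDeadline = null;
    }
  }

  close(): void {
    this.closed = true;
    this.connecting = false;
    this.clearConnectDeadline(); // no orphaned timer survives close()
  }
}
```

With the handle on the instance, a test that asserts no pending timers after close() passes without waiting out the 20 s window.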


Comment on lines +82 to 86
    if (timedOut || this.closed) {
      clearTimeout(deadline);
      if (this.closed) this.connecting = false;
      return;
    }

P2 Asymmetric connecting reset creates an implicit dependency on the deadline side-effect

When timedOut is true but this.closed is false, this.connecting is not explicitly reset here — the assumption is that the deadline callback already set it to false. This differs from the this.closed branch directly below, which resets it explicitly. While correct today (the deadline atomically sets timedOut = true and then this.connecting = false with no yield between them), a future refactor that separates those two operations — or moves toward a state-machine model — could silently re-introduce the wedge. An explicit this.connecting = false for the timedOut branch would make the invariant self-contained and consistent.


@saddlepaddle saddlepaddle merged commit d249399 into main May 14, 2026
17 checks passed
saddlepaddle added a commit that referenced this pull request May 14, 2026
Changes since v0.2.16:

- host-service: connect-phase deadline + reconnect-guaranteed onclose so
  the tunnel can't get stranded mid-handshake; failed connects always
  schedule a retry. (#4539)
- host-service: unstrand tunnel reconnect path and wire relay Sentry so
  reconnect failures surface instead of silently looping. (#4537)
- build: bundle @mastra/duckdb and the @duckdb/node-bindings-* natives
  across darwin-arm64 / darwin-x64 / linux-x64 / linux-arm64 targets so
  duckdb-backed code paths work in the standalone CLI bundle.

Push cli-v0.2.17 after this lands to fire the release pipeline.
@saddlepaddle saddlepaddle mentioned this pull request May 14, 2026
3 tasks
MocA-Love pushed a commit to MocA-Love/superset that referenced this pull request May 25, 2026
…nclose (superset-sh#4539)