harden(host-service): connect-phase deadline + reconnect-guaranteed onclose#4539
Conversation
…nclose Two stuck-state paths besides the one we already fixed: - getAuthToken() can hang indefinitely (no upstream timeout) → connecting stays true → all future connect() calls early-return forever - WebSocket can stall in CONNECTING (captive portal, NAT rebind mid- handshake) → onopen/onclose/onerror never fire → same wedge A 20s connect-phase deadline collapses both into one mechanism. On timeout we force-close any in-flight socket, reset state, and schedule reconnect. Stale onclose for the abandoned socket no-ops via the existing this.socket-identity check. Also wraps onclose body in try/finally so any future throw inside it still routes through scheduleReconnect — defense-in-depth against the class of bug we just fixed (1001-close throwing inside cleanupChannels).
|
Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe PR adds connection timeout enforcement to ChangesConnection Timeout
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🧹 Preview Cleanup CompleteThe following preview resources have been cleaned up:
Thank you for your contribution! 🎉 |
Greptile SummaryThis PR adds a 20 s connect-phase deadline to
Confidence Score: 4/5Safe to merge; the deadline mechanism is logically sound and the socket identity guard correctly prevents double-scheduling across all traced paths. The deadline mechanism is logically sound and the socket identity guard correctly prevents double-scheduling across all traced paths. The two findings are minor: the deadline timer is not tracked on the instance so close() cannot cancel it, and the timedOut early-return branch omits an explicit connecting = false that is implicitly guaranteed by the deadline side-effect. Neither causes misbehaviour today, but both create subtle implicit contracts that could bite a future refactor. packages/host-service/src/tunnel/tunnel-client.ts — the new deadline timer lifetime and the asymmetric connecting reset in the timedOut early-return branch deserve a second look.
|
| Filename | Overview |
|---|---|
| packages/host-service/src/tunnel/tunnel-client.ts | Adds a 20 s connect-phase deadline to prevent stuck connecting=true states caused by a hanging getAuthToken() or a stalled WebSocket handshake; wraps onclose cleanup in try/finally so scheduleReconnect() fires even if cleanup throws. Two minor concerns: the deadline timer is not stored on this (preventing cancellation from close()), and the timedOut early-return path skips an explicit connecting reset that relies on the deadline side-effect. |
Sequence Diagram
sequenceDiagram
participant App
participant TunnelClient
participant Deadline as Deadline Timer (20 s)
participant Auth as getAuthToken()
participant WS as WebSocket
App->>TunnelClient: connect()
TunnelClient->>Deadline: setTimeout(20 s)
TunnelClient->>Auth: await getAuthToken()
alt "Auth hangs > 20 s"
Deadline-->>TunnelClient: "fires: timedOut=true, socket.close(4001), socket=null, connecting=false"
TunnelClient->>TunnelClient: scheduleReconnect()
Auth-->>TunnelClient: eventually resolves/rejects
TunnelClient->>TunnelClient: if (timedOut) early return (no-op)
else Auth resolves in time
Auth-->>TunnelClient: token
TunnelClient->>WS: new WebSocket(url)
alt "WS stalls in CONNECTING > 20 s"
Deadline-->>TunnelClient: "fires: socket.close(4001), socket=null, connecting=false"
TunnelClient->>TunnelClient: scheduleReconnect()
WS-->>TunnelClient: onclose fires (stale identity check, no-op)
else WS opens normally
WS-->>TunnelClient: onopen, clearTimeout(deadline), startWatchdog()
end
alt WS closes for any reason
WS-->>TunnelClient: onclose
TunnelClient->>Deadline: clearTimeout(deadline)
TunnelClient->>TunnelClient: try cleanup, finally scheduleReconnect()
end
end
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
packages/host-service/src/tunnel/tunnel-client.ts:64-76
**Deadline timer not tracked on `this`, so `close()` cannot cancel it**
`deadline` is a local variable, unlike `reconnectTimer` and `watchdogTimer` which are stored on the instance and cleared in `close()`. When `close()` is called while a `connect()` is mid-flight, `onclose` fires with the identity check returning early (because `close()` already nulled `this.socket`), so `clearTimeout(deadline)` is never reached. The timer then lives for up to 20 s before firing and short-circuiting on `if (this.closed) return`. No incorrect behaviour results, but it is an orphaned async resource that is inconsistent with the rest of the timer-management pattern and could confuse tests that assert no pending timers remain after `close()`.
### Issue 2 of 2
packages/host-service/src/tunnel/tunnel-client.ts:82-86
**Asymmetric `connecting` reset creates an implicit dependency on the deadline side-effect**
When `timedOut` is true but `this.closed` is false, `this.connecting` is not explicitly reset here — the assumption is that the deadline callback already set it to `false`. This differs from the `this.closed` branch directly below, which resets it explicitly. While correct today (the deadline atomically sets `timedOut = true` and then `this.connecting = false` with no yield between them), a future refactor that separates those two operations — or moves toward a state-machine model — could silently re-introduce the wedge. An explicit `this.connecting = false` for the `timedOut` branch would make the invariant self-contained and consistent.
Reviews (1): Last reviewed commit: "harden(host-service): connect-phase dead..." | Re-trigger Greptile
| const deadline = setTimeout(() => { | ||
| if (this.closed) return; | ||
| timedOut = true; | ||
| console.warn( | ||
| `[host-service:tunnel] connect did not complete within ${CONNECT_TIMEOUT_MS}ms, forcing retry`, | ||
| ); | ||
| try { | ||
| this.socket?.close(4001, "Connect timeout"); | ||
| } catch {} | ||
| this.socket = null; | ||
| this.connecting = false; | ||
| this.scheduleReconnect(); | ||
| }, CONNECT_TIMEOUT_MS); |
There was a problem hiding this comment.
Deadline timer not tracked on
this, so close() cannot cancel it
deadline is a local variable, unlike reconnectTimer and watchdogTimer which are stored on the instance and cleared in close(). When close() is called while a connect() is mid-flight, onclose fires with the identity check returning early (because close() already nulled this.socket), so clearTimeout(deadline) is never reached. The timer then lives for up to 20 s before firing and short-circuiting on if (this.closed) return. No incorrect behaviour results, but it is an orphaned async resource that is inconsistent with the rest of the timer-management pattern and could confuse tests that assert no pending timers remain after close().
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/host-service/src/tunnel/tunnel-client.ts
Line: 64-76
Comment:
**Deadline timer not tracked on `this`, so `close()` cannot cancel it**
`deadline` is a local variable, unlike `reconnectTimer` and `watchdogTimer` which are stored on the instance and cleared in `close()`. When `close()` is called while a `connect()` is mid-flight, `onclose` fires with the identity check returning early (because `close()` already nulled `this.socket`), so `clearTimeout(deadline)` is never reached. The timer then lives for up to 20 s before firing and short-circuiting on `if (this.closed) return`. No incorrect behaviour results, but it is an orphaned async resource that is inconsistent with the rest of the timer-management pattern and could confuse tests that assert no pending timers remain after `close()`.
How can I resolve this? If you propose a fix, please make it concise.| if (timedOut || this.closed) { | ||
| clearTimeout(deadline); | ||
| if (this.closed) this.connecting = false; | ||
| return; | ||
| } |
There was a problem hiding this comment.
Asymmetric
connecting reset creates an implicit dependency on the deadline side-effect
When timedOut is true but this.closed is false, this.connecting is not explicitly reset here — the assumption is that the deadline callback already set it to false. This differs from the this.closed branch directly below, which resets it explicitly. While correct today (the deadline atomically sets timedOut = true and then this.connecting = false with no yield between them), a future refactor that separates those two operations — or moves toward a state-machine model — could silently re-introduce the wedge. An explicit this.connecting = false for the timedOut branch would make the invariant self-contained and consistent.
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/host-service/src/tunnel/tunnel-client.ts
Line: 82-86
Comment:
**Asymmetric `connecting` reset creates an implicit dependency on the deadline side-effect**
When `timedOut` is true but `this.closed` is false, `this.connecting` is not explicitly reset here — the assumption is that the deadline callback already set it to `false`. This differs from the `this.closed` branch directly below, which resets it explicitly. While correct today (the deadline atomically sets `timedOut = true` and then `this.connecting = false` with no yield between them), a future refactor that separates those two operations — or moves toward a state-machine model — could silently re-introduce the wedge. An explicit `this.connecting = false` for the `timedOut` branch would make the invariant self-contained and consistent.
How can I resolve this? If you propose a fix, please make it concise.Changes since v0.2.16: - host-service: connect-phase deadline + reconnect-guaranteed onclose so the tunnel can't get stranded mid-handshake; failed connects always schedule a retry. (#4539) - host-service: unstrand tunnel reconnect path and wire relay Sentry so reconnect failures surface instead of silently looping. (#4537) - build: bundle @mastra/duckdb and the @duckdb/node-bindings-* natives across darwin-arm64 / darwin-x64 / linux-x64 / linux-arm64 targets so duckdb-backed code paths work in the standalone CLI bundle. Push cli-v0.2.17 after this lands to fire the release pipeline.
Changes since v0.2.16: - host-service: connect-phase deadline + reconnect-guaranteed onclose so the tunnel can't get stranded mid-handshake; failed connects always schedule a retry. (#4539) - host-service: unstrand tunnel reconnect path and wire relay Sentry so reconnect failures surface instead of silently looping. (#4537) - build: bundle @mastra/duckdb and the @duckdb/node-bindings-* natives across darwin-arm64 / darwin-x64 / linux-x64 / linux-arm64 targets so duckdb-backed code paths work in the standalone CLI bundle. Push cli-v0.2.17 after this lands to fire the release pipeline.
…nclose (superset-sh#4539) Two stuck-state paths besides the one we already fixed: - getAuthToken() can hang indefinitely (no upstream timeout) → connecting stays true → all future connect() calls early-return forever - WebSocket can stall in CONNECTING (captive portal, NAT rebind mid- handshake) → onopen/onclose/onerror never fire → same wedge A 20s connect-phase deadline collapses both into one mechanism. On timeout we force-close any in-flight socket, reset state, and schedule reconnect. Stale onclose for the abandoned socket no-ops via the existing this.socket-identity check. Also wraps onclose body in try/finally so any future throw inside it still routes through scheduleReconnect — defense-in-depth against the class of bug we just fixed (1001-close throwing inside cleanupChannels).
Summary
Follow-up to #4537. That PR fixed the one known stuck-state path (1001 close code throwing inside
cleanupChannels). This closes two more paths to the samesocket=null, connecting=true, reconnectTimer=nullwedge:getAuthToken()hangs without an upstream timeout. Theawaitsits forever,connectingstaystrue, every subsequentconnect()call early-returns on theif (this.connecting) return;guard. No retry ever scheduled. Permanent stuck.CONNECTING. Captive portal, NAT rebind mid-handshake, dead server —onopen/onclose/onerrornever fire. Watchdog can't help because it's only started insideonopen. Same wedge.Both collapse into one mechanism: a 20s connect-phase deadline. On timeout we force-close any in-flight socket (code 4001, application-reserved), reset state, and schedule a reconnect. The stale
onclosefor the abandoned socket no-ops via the existingthis.socket !== socketidentity check, so no double-scheduling.Also wraps the
onclosebody intry/finallyso any future throw inside it still routes throughscheduleReconnect()— defense-in-depth against the class of bug we just fixed (synchronous throw inside the close handler that bypasses the retry call).What's left unaddressed
The multi-boolean state (
closed/connecting/socket/reconnectTimer/watchdogTimer) is the canonical setup for these bugs. A singlestateenum ('idle' | 'connecting' | 'open' | 'reconnecting' | 'closed') with explicit transitions would prevent whole classes of wedge — that's the "robust WebSocket client" refactor pattern. Out of scope for this PR; worth a follow-up if anyone feels like a from-scratch rewrite.Test plan
bun run --filter=@superset/host-service typecheckpassesbun run lintclean[host-service:tunnel] connect did not complete within 20000ms, forcing retryappears every ~20s in~/.superset/host/<org>/host-service.logand[host-service:tunnel] reconnecting in Xms (attempt N)follows each oneconnected to relay for host ...appears within ~3s (deadline cleared inonopen)Summary by cubic
Prevents the tunnel client from getting stuck in “connecting” by adding a 20s connect-phase deadline and guaranteeing a reconnect even if
onclosecleanup throws. On timeout, we force-close the socket (4001), reset state, and schedule a retry.getAuthToken()and stalled WebSocket handshakes; on timeout, close with 4001, reset, and reconnect.onopen, auth failure,onclose, and catch) and ignore staleonclosevia socket identity check.onclosecleanup in try/finally soscheduleReconnect()always runs; also stops the watchdog and cleans up channels safely.Written for commit 7d9b7ec. Summary will update on new commits.
Summary by CodeRabbit