WebSocket: release pending-activity ref when close()/terminate() called during CONNECTING#29918
Conversation
…ed during CONNECTING connect() takes a pending-activity ref that is only released by didConnect / didFailWithErrorCode / didClose. When close() or terminate() is called while still CONNECTING, the Zig upgrade client's cancel() clears its back-pointer without calling any of those, so the WebSocket stayed in CLOSING with m_pendingActivityCount > 0 forever — hasPendingActivity() permanently true, never GC'd, and no close event fired. failConnectingWebSocket() now cancels the upgrade and queues a task that transitions to CLOSED, fires the close event (code 1006, wasClean=false — matching browsers/Node ws), and calls disablePendingActivity().
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (3)
WalkthroughIntroduces a dedicated Changes
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Review rate limit: 4/5 reviews remaining, refill in 12 minutes. Comment |
|
Found 1 issue this PR may fix:
🤖 Generated with Claude Code |
Per the WHATWG WebSocket spec, close() while CONNECTING runs 'fail the WebSocket connection', which requires an error event before the close event. Mirrors the wasConnecting && isConnectionError branch in didReceiveClose(). Matches Chrome/Firefox and npm ws.
websocket.test.js has several pre-existing tests (300× TLS handshakes, 1000× connect+close, external echo host) that run near the 5s default timeout under debug+ASAN regardless of this change. Keeping the new tests in a dedicated file — consistent with the other per-topic test files in this directory — lets them be evaluated without those unrelated borderline tests.
There was a problem hiding this comment.
Thanks for addressing both nits — the error→close ordering now matches spec/Chrome/Firefox/ws and the test timeout is gone. The change looks correct to me, but since it alters user-visible event behavior and touches the pending-activity refcounting path (and CI is currently red on build-zig/build-bun), I'd like a human to give it a final look.
Extended reasoning...
Overview
This PR fixes a leak where calling close() or terminate() on a WebSocket while readyState === CONNECTING left the socket permanently in CLOSING with m_pendingActivityCount > 0, pinning the JS wrapper as a GC root and never firing a close event. The fix extracts the duplicated CONNECTING-branch logic from close()/terminate() into a new failConnectingWebSocket() helper (WebSocket.cpp:923-972) that cancels the upgrade client, then posts a task which transitions to CLOSED, dispatches error → close (per WHATWG "fail the WebSocket connection"), and calls disablePendingActivity() to balance the ref taken in connect(). A new test file covers event ordering, close-event properties, and a 64-socket heap-stats leak check for both close() and terminate().
I previously left two inline nits on an earlier revision — (1) missing error event before close, and (2) an explicit test timeout — both of which the author has addressed and resolved. I verified the Zig-side cancel() (WebSocketUpgradeClient.zig:439-459) clears outgoing_websocket before closing the TCP socket, so no completion callback can race the posted task; the m_state == CLOSED early-return guard in the task is defensive and effectively unreachable.
Security risks
None identified. No new untrusted-input parsing, no auth/crypto/permissions surface. The error message string is a fixed literal. The change is purely internal lifecycle/event sequencing.
Level of scrutiny
Moderate-to-high. This is core C++ in the WebSocket client touching refcounting (disablePendingActivity() performs a deref()) and introducing a user-visible behavior change: code that previously saw no events when closing during CONNECTING will now receive error then close. That's spec-correct and matches browsers/npm ws, but it's the kind of observable change that benefits from a maintainer's sign-off. The postTask + Ref { *this } + disablePendingActivity() pattern mirrors the existing m_upgradeClient == nullptr failure path in connect() and the tail of didFailWithErrorCode(), which gives confidence it's idiomatic for this file.
Other factors
- CI (Build #48938) is red on
build-zig/build-bunacross several targets. The C++ looks syntactically consistent with surrounding code (e.g.,createErrorEventInit(protectedThis, ...)matches the existing call in theconnect()failure path), so the failures may be infra-related, but I can't confirm without the logs. - New tests are well-targeted and would have caught the original leak (they time out on the unpatched build per the PR description).
- The github-actions bot suggests this may also fix #17111 (crash on
terminate()during CONNECTING), which raises the value of human review to confirm that linkage.
…ed during CONNECTING (oven-sh#29918) ## Repro ```js const net = require('net'); const server = net.createServer(() => {}).listen(0, '127.0.0.1', () => { const ws = new WebSocket(`ws://127.0.0.1:${server.address().port}`); ws.onclose = () => { server.close(); console.log('closed'); }; ws.close(); }); ``` On current main this script never prints `closed` and never exits. `ws.readyState` stays `CLOSING` (2) forever and the WebSocket is never garbage-collected — `heapStats()` shows every such instance pinned indefinitely. Same for `terminate()`. ## Cause `connect()` takes a pending-activity ref (`incPendingActivityCount()`, [WebSocket.cpp:667](https://github.com/oven-sh/bun/blob/main/src/bun.js/bindings/webcore/WebSocket.cpp#L667)) that is balanced only by `didConnect` / `didFailWithErrorCode` / `didClose`. When `close()`/`terminate()` is called while `m_state == CONNECTING`, the handler sets `m_state = CLOSING`, calls `Bun__WebSocketHTTP{S,}Client__cancel` on the upgrade client, and returns. The Zig `cancel()` ([WebSocketUpgradeClient.zig:439](https://github.com/oven-sh/bun/blob/main/src/http/websocket_client/WebSocketUpgradeClient.zig#L439)) clears `outgoing_websocket` **without** calling `didAbruptClose` — by design, since C++ initiated the cancel. But C++ never finishes the close either, so none of the completion callbacks fire, `m_state` never reaches `CLOSED`, `m_pendingActivityCount` stays ≥ 1, and `hasPendingActivity()` returns `true` permanently — the JS wrapper is a GC root forever and the native `ref()` is never `deref()`ed. As a visible side effect, the existing `oven-sh#16995` test leaves ~4096 `WebSocket` instances in `heapStats()` on current main. ## Fix Extract the CONNECTING branch of `close()`/`terminate()` into `failConnectingWebSocket()`. After cancelling the upgrade client it posts a task that: 1. Sets `m_state = CLOSED` 2. Fires an `error` event followed by a `close` event with code `1006`, `wasClean = false` — per the spec's "fail the WebSocket connection" steps; matches Chrome/Firefox and Node `ws` 3. Calls `disablePendingActivity()` to release the ref taken in `connect()` The native-callback path (`m_native.onClose`) is handled the same way as `didClose`. ## Verification New tests in `test/js/web/websocket/websocket-close-connecting.test.ts` for both `close()` and `terminate()`: - `fires error + close events and transitions to CLOSED` — TCP server that never responds keeps the socket in CONNECTING; after `close()` an `error` event fires, then a `close` event with `{code: 1006, wasClean: false}`, and `readyState === CLOSED`. - `does not leak` — creates 64 sockets, closes each during CONNECTING, verifies via `heapStats().objectTypeCounts.WebSocket` that they are collected. All four time out on the unpatched build (close event never fires). --------- Co-authored-by: robobun <robobun@users.noreply.github.com> Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Co-authored-by: Jarred Sumner <jarred@jarredsumner.com>
Repro
On current main this script never prints
closedand never exits.ws.readyStatestaysCLOSING(2) forever and the WebSocket is never garbage-collected —heapStats()shows every such instance pinned indefinitely. Same forterminate().Cause
connect()takes a pending-activity ref (incPendingActivityCount(), WebSocket.cpp:667) that is balanced only bydidConnect/didFailWithErrorCode/didClose.When
close()/terminate()is called whilem_state == CONNECTING, the handler setsm_state = CLOSING, callsBun__WebSocketHTTP{S,}Client__cancelon the upgrade client, and returns. The Zigcancel()(WebSocketUpgradeClient.zig:439) clearsoutgoing_websocketwithout callingdidAbruptClose— by design, since C++ initiated the cancel. But C++ never finishes the close either, so none of the completion callbacks fire,m_statenever reachesCLOSED,m_pendingActivityCountstays ≥ 1, andhasPendingActivity()returnstruepermanently — the JS wrapper is a GC root forever and the nativeref()is neverderef()ed.As a visible side effect, the existing
#16995test leaves ~4096WebSocketinstances inheapStats()on current main.Fix
Extract the CONNECTING branch of
close()/terminate()intofailConnectingWebSocket(). After cancelling the upgrade client it posts a task that:m_state = CLOSEDerrorevent followed by acloseevent with code1006,wasClean = false— per the spec's "fail the WebSocket connection" steps; matches Chrome/Firefox and NodewsdisablePendingActivity()to release the ref taken inconnect()The native-callback path (
m_native.onClose) is handled the same way asdidClose.Verification
New tests in
test/js/web/websocket/websocket-close-connecting.test.tsfor bothclose()andterminate():fires error + close events and transitions to CLOSED— TCP server that never responds keeps the socket in CONNECTING; afterclose()anerrorevent fires, then acloseevent with{code: 1006, wasClean: false}, andreadyState === CLOSED.does not leak— creates 64 sockets, closes each during CONNECTING, verifies viaheapStats().objectTypeCounts.WebSocketthat they are collected.All four time out on the unpatched build (close event never fires).