net: emit 'close' when a backpressured socket's peer dies#31654

Open
robobun wants to merge 1 commit into main from farm/265bd52f/fix-uds-backpressure-close
robobun commented Jun 1, 2026
Collaborator

Symptom

A node:net socket whose peer disappears while the socket still has a stuck half never emits 'close'. The process hangs forever (spinning on the half-closed fd when the peer keeps it hot), and server.close() never completes. Calling socket.destroy() from JS immediately clears it.

Two stuck-half shapes trigger it:

  1. Write backpressure toward a dead peer. The survivor has unflushed outbound data when the peer is SIGKILLed. This is a race: most peer kills still deliver 'close' normally. (This is the originally reported case: node:net over a Unix domain socket, for both accepted and connecting sockets, on macOS and Linux.)
  2. A paused readable with nothing buffered. The peer sends FIN; the socket's readable is paused so the queued EOF is never consumed. This is the server.close() / process-exit hang behind several Express/Supertest/proxy reports.

Cause

A net.Socket is a Duplex constructed with { emitClose: false, autoDestroy: true }, so 'close' is emitted only from _destroy, and autoDestroy only runs _destroy once both halves of the Duplex finish.
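
The both-halves requirement can be reproduced with a plain node:stream Duplex, without any of the net.ts code. What follows is a minimal sketch under stated assumptions: makeStuckDuplex and observe are hypothetical names, the never-called write callback stands in for a dead peer, and the stream keeps the default emitClose so that destroy() itself emits 'close' (net.Socket instead emits it from _destroy).

```javascript
const { Duplex } = require("node:stream");

// Hypothetical helper: a Duplex whose writable half can never finish,
// because the write callback is never called (a stand-in for a dead peer).
function makeStuckDuplex() {
  return new Duplex({
    autoDestroy: true,
    read() {},
    write(_chunk, _encoding, _callback) {
      // Intentionally never call _callback: the write stays in flight forever.
    },
  });
}

// Ends the readable half, leaves the writable half stuck, then reports the
// state before and after an explicit destroy().
function observe(done) {
  const duplex = makeStuckDuplex();
  const events = [];
  duplex.on("end", () => events.push("end"));
  duplex.on("close", () => events.push("close"));

  duplex.write("unflushed");
  duplex.end();      // 'finish' cannot fire: the write above never completes
  duplex.push(null); // readable half: EOF
  duplex.resume();   // consume the EOF so 'end' fires

  setTimeout(() => {
    const before = { events: [...events], destroyed: duplex.destroyed };
    duplex.destroy(); // explicit teardown, like socket.destroy() in the report
    setTimeout(() => done(before, [...events]), 10);
  }, 10);
}

observe((before, after) => console.log(before, after));
```

Run under Node, this should show the readable half ending while destroyed stays false, with 'close' arriving only after the explicit destroy().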

When the native layer reports the connection closed, SocketEmitEndNT ends the readable half with push(null), but:

  • a paused readable with nothing buffered never schedules 'end' from that alone (read(0) is what kicks it, and SocketEmitEndNT wasn't calling it), so the readable half never finishes;
  • a writable holding backpressure toward a dead peer can never flush, so the writable half never finishes.

Either way autoDestroy waits forever, _destroy never runs, 'close' is never emitted, and server._connections never decrements (so server.close() hangs).
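
The paused-readable half of this can be seen in isolation with a bare node:stream Readable. This is an illustrative sketch, not code from net.ts: endFires is a hypothetical helper, and the 10 ms timer is just an arbitrary window for observing whether 'end' was scheduled.

```javascript
const { Readable } = require("node:stream");

// Builds a paused, empty Readable, signals EOF, and optionally kicks it with
// read(0). Calls done(true) if 'end' fired shortly afterwards, else done(false).
function endFires(kickWithReadZero, done) {
  const readable = new Readable({ read() {} });
  let ended = false;
  readable.on("end", () => { ended = true; });

  readable.pause();    // paused: nothing will call read() on our behalf
  readable.push(null); // EOF with nothing buffered
  if (kickWithReadZero) readable.read(0);

  setTimeout(() => done(ended), 10);
}

endFires(false, ended => console.log("push(null) only:", ended));
endFires(true, ended => console.log("push(null) + read(0):", ended));
```

Only the second call is expected to report that 'end' fired, which is the behaviour the read(0) kick relies on.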

Fix

Two small changes in src/js/node/net.ts:

Verification

Two regression tests in test/js/node/net/node-net.test.ts:

  • backpressure: two Bun processes over a UDS; the survivor floods the peer, the peer is SIGKILLed mid-flight, and the survivor must receive 'close' and exit;
  • paused readable: a paused server socket whose peer ends must fire 'close' (after 'end') and let server.close() complete.
USE_SYSTEM_BUN=1 bun test test/js/node/net/node-net.test.ts -t "UDS peer"       → hangs, fails
USE_SYSTEM_BUN=1 bun test test/js/node/net/node-net.test.ts -t "paused socket"  → hangs, fails
bun bd test test/js/node/net/node-net.test.ts -t "UDS peer"                     → passes (~3s)
bun bd test test/js/node/net/node-net.test.ts -t "paused socket"                → passes (~0.5s)

All 141 test-net-* Node parallel tests pass under bun bd (ASAN), including test-net-socket-close-after-end.js and test-net-socket-write-after-close.js which earlier iterations of this PR regressed. node-net-server.test.ts (18/18), the http/tls parallel tests that previous CI runs flagged (test-https-eof-for-eom, test-http-1.0-keep-alive, test-tls-wrap-econnreset-socket, test-tls-reuse-host-from-socket, …), and regression/issue/12117 all pass.

Rebased onto #31155, which substantially reworked these close handlers (ECONNRESET-shaped destroy in SocketEmitEndNT, kwriteCallback flush). Those changes are kept; destroyAfterClose sits after them as the stuck-writable fallback they leave open. This also covers the net-level fix of #28350 / #28732 (server.close() hang on a paused socket), with the readableLength/setImmediate guard so 'end'-before-'close' ordering is preserved. Happy to rebase/defer however a maintainer prefers.

Fixes #24808

@github-actions github-actions Bot added the claude label Jun 1, 2026
@robobun

robobun commented Jun 1, 2026

Collaborator Author
Updated 7:15 AM PT - Jun 17th, 2026

@robobun, your commit 299f240 has 1 failure in Build #63134 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 31654

That installs a local version of the PR into your bun-31654 executable, so you can run:

bun-31654 --bun

@robobun

robobun commented Jun 1, 2026

Collaborator Author

Reproduced and fixed.

A node:net socket with unflushed outbound data whose peer is SIGKILLed never emitted 'close', so the process hung forever spinning on the half-closed fd (the reported symptom; socket.destroy() cleared it). Root cause: the socket is a Duplex with emitClose:false, autoDestroy:true, so 'close' comes only from _destroy, which autoDestroy runs only once both halves finish, and the writable half never finishes when its buffered bytes can't reach the dead peer. Fixed by force-destroying in the native close handlers when there's an error or writableLength > 0; the already-drained path is untouched, so it still emits 'end' and then 'close' naturally.

Matches stock Node v24 across clean FIN, RST-with-error-listener, and RST-without-listener.

Regression test: two Bun processes over a UDS, survivor builds backpressure, peer SIGKILLed mid-flight → survivor must get exactly one 'close' and exit. Hangs 15s on the baked bun; passes (~3s) with this change. Waiting on CI.
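
For readers unfamiliar with how "builds backpressure" shows up on the stream API, here is a small sketch using a generic node:stream Writable instead of a real socket. makeDeadPeerWritable is a hypothetical stand-in and the 4-byte highWaterMark is chosen only to make the effect immediate; the actual fixtures are not reproduced here.

```javascript
const { Writable } = require("node:stream");

// Hypothetical stand-in for a socket whose peer has stopped reading (or died):
// the write callback is never invoked, so nothing is ever flushed.
function makeDeadPeerWritable() {
  return new Writable({
    highWaterMark: 4,
    write(_chunk, _encoding, _callback) {
      // Never call _callback.
    },
  });
}

const sink = makeDeadPeerWritable();
const accepted = sink.write("12345678"); // 8 bytes against a 4-byte high-water mark

console.log(accepted);               // false: backpressure engaged
console.log(sink.writableLength);    // 8: bytes still counted as unflushed
console.log(sink.writableNeedDrain); // true: waiting for a 'drain' that never comes
```

A writableLength greater than zero at the moment the native layer reports the close is the condition this iteration of the fix keyed on.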

@github-actions

github-actions Bot commented Jun 1, 2026

Contributor

Found 2 issues this PR may fix:

  1. node:http is broken for proxies — 4 PRs fixing createConnection, upgrade sockets, connection close, and socket cleanup #28396 - Umbrella issue whose sub-issue #3 describes "Socket not destroyed on native close" where autoDestroy never fires, leaving zombie sockets. This is the same root cause this PR addresses
  2. The 'node::net' module correctly emitted events. #24808 - node:net server writing large buffers never gets a close event when the client disconnects, causing the process to hang — exactly the backpressure + peer-death scenario this PR fixes

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #28396
Fixes #24808

🤖 Generated with Claude Code

@github-actions

github-actions Bot commented Jun 1, 2026

Contributor

This PR may be a duplicate of:

  1. fix(net): flush pending write callback on socket close to prevent stalled streams #27161 - fixes the same bug (socket with pending kwriteCallback never emits 'close' when peer disconnects) by flushing the write callback in SocketHandlers.close, SocketHandlers2.close, and ServerHandlers.close
  2. fix(net): destroy socket on native close to prevent server.close() hang #28350 - fixes the same close handler paths (SocketHandlers, SocketHandlers2, ServerHandlers) by adding process.nextTick(destroyNT, self) to ensure destroy() is called and 'close' is emitted when the readable is paused or autoDestroy doesn't fire
  3. fix(net): destroy socket on native close to prevent server.close() hang #28732 - same fix as #28350 (a robobun re-attempt), modifying the same three close handlers with identical process.nextTick(destroyNT, self) additions

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Contributor

Walkthrough

This PR introduces a socket teardown fix to prevent hangs when a peer closes during backpressure on Unix Domain Sockets. A new destroyAfterClose() helper conditionally forces socket destruction when the native layer reports closure, and it is wired into three close handlers. New survivor/peer fixtures and two non-Windows tests reproduce and validate the regression.

Changes

Socket teardown fix and regression test

| Layer / File(s) | Summary |
| --- | --- |
| Socket teardown helper and wiring (src/js/node/net.ts) | New destroyAfterClose() helper prevents lingering sockets by forcing destruction when closure is reported and the socket isn't destroyed; integrated into SocketHandlers.close, ServerHandlers.close, and SocketHandlers2.close. |
| Backpressure peer test fixture (test/js/node/net/node-hup-backpressure-peer-fixture.js) | Client fixture that connects to UDS, logs CONNECTED, pauses to stop consuming data and generate backpressure, then remains idle until killed by the test harness. |
| Backpressure survivor test fixture (test/js/node/net/node-hup-backpressure-survivor-fixture.js) | Server fixture that accepts connection, floods client until backpressure engages, validates exactly one close event, and fails with distinct exit codes if close is not observed within 15 seconds. |
| Regression test harness (test/js/node/net/node-net.test.ts) | Adds two non-Windows tests: one that spawns survivor and peer subprocesses on a shared UDS path, waits for LISTENING/ACCEPTED, kills peer with SIGKILL during backpressure and asserts survivor exits cleanly; a second test verifies server.close() completes when server socket is paused. |

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: fixing the issue where 'close' is not emitted when a backpressured socket's peer dies. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request provides a detailed description covering symptoms, root causes, and the fix implemented, with verification steps and test results. |


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/node/net/node-net.test.ts`:
- Around line 753-756: Remove the timer-based watchdog: delete the setTimeout
that assigns to watchdog and the subsequent clearTimeout call; rely on awaiting
survivor.exited (and, if needed, adjust the test harness timeout via test
framework config rather than using setTimeout). Ensure references are to the
existing survivor.exited promise and survivor.kill usage remains untouched.
- Line 710: Replace the use of tmpdirSync() when constructing the UDS path for
udsPath and instead use the test harness tempDir fixture (call tempDir() to get
a harness-managed temporary directory) so the line creating udsPath uses
join(tempDir(), `hup-${randomUUID()}.sock`); also ensure the test
imports/receives the tempDir fixture from 'harness' if it's not already
available.
- Around line 712-727: Replace the tmpdirSync-based socket directory with
harness.tempDir by creating udsPath from tempDir(...) (use the harness.tempDir
helper instead of tmpdirSync), remove the ad-hoc setTimeout watchdog logic
(delete the setTimeout-based timer and instead rely on the test framework
timeout or an AbortController-based cancellation), and revert any unnecessary
changes to stdout handling—keep waitForLine as a simple reader that releases the
lock after each call and do not attempt to preserve reader state between calls;
target the udsPath variable, the setTimeout/watchdog code, and the waitForLine
function when making these edits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7cf53707-8231-4abe-bd71-f5aa333f875f

📥 Commits

Reviewing files that changed from the base of the PR and between 5ac120c and cca2876.

📒 Files selected for processing (4)
  • src/js/node/net.ts
  • test/js/node/net/node-hup-backpressure-peer-fixture.js
  • test/js/node/net/node-hup-backpressure-survivor-fixture.js
  • test/js/node/net/node-net.test.ts

Comment thread test/js/node/net/node-net.test.ts
Comment thread test/js/node/net/node-net.test.ts
Comment thread test/js/node/net/node-net.test.ts

@claude claude Bot left a comment

Contributor

I didn't find any bugs, but this changes socket lifecycle semantics in three close-handler paths (including newly propagating err into destroy() where it was previously dropped), so it's worth a human look at the event-ordering and downstream http/tls implications.

Extended reasoning...

Overview

This PR adds a destroyAfterClose helper to src/js/node/net.ts and wires it into all three native close handlers (SocketHandlers.close, ServerHandlers.close, SocketHandlers2.close). When the native layer reports the connection closed and the writable side still has buffered data (or an error is present), it now force-destroy()s the Duplex so 'close' fires instead of hanging on autoDestroy waiting for a writable half that can never drain. Two new fixture files and a UDS-based regression test are added.

Security risks

None identified — this is stream-lifecycle / event-emission logic with no auth, crypto, or input-parsing surface.

Level of scrutiny

Medium-high. net.Socket close/destroy semantics sit underneath node:http, node:https, node:tls, and node:http2. The fix is small but it's a behavioral change in three code paths simultaneously, and one of those paths (SocketHandlers2.close) previously dropped err entirely (there was a TODO about it) — it now feeds err into destroy(err), which will surface an 'error' event where none was emitted before. That's the intended Node-compat fix per the PR table, but it's the kind of change where a maintainer familiar with the http/tls layers should sanity-check that nothing upstream was relying on the old swallow-the-error behavior, and that the interaction with SocketHandlers.error (which already emit('error', ...)s directly) doesn't double-emit.

Other factors

  • The regression test is solid (subprocess-isolated, watchdog-guarded, checks exactly-once semantics).
  • The PR description verifies against Node v24 and lists the related test suites that still pass, but CI was still building at review time.
  • Stream end/destroy ordering is notoriously subtle; the helper deliberately skips the writableLength === 0 && !err case to avoid racing 'end', which is the right instinct but also signals this is delicate territory.
  • No prior reviews from me on this PR.

@robobun

robobun commented Jun 1, 2026

Collaborator Author

Linked #24808 (directly reproduced + tested here — the write-backpressure variant).

Not auto-closing #28396: it's an umbrella for four separate node:http proxy bugs (createConnection, upgrade-socket write, connection close, socket cleanup) with their own PRs — this change only addresses the "socket not destroyed on native close" piece, so Fixes #28396 would wrongly close the other three. That piece overlaps #28350 at the net level; see the "Relationship to #28350" section in the description — this PR also covers the backpressure case and keeps correct error semantics (RST → 'error' + 'close'(hadError=true) instead of a swallowed 'end'), and passes #28350's own regression tests.

@robobun

robobun commented Jun 1, 2026

Collaborator Author

Updated: broadened the fix to cover both zombie-socket shapes on native close — write-backpressure (the original report) and a paused/unpiped readable whose EOF is never consumed (the server.close()/process-exit hang, issues #13184/#19563/#23648, #28350's scenarios). Error semantics stay Node-faithful: a peer RST surfaces as 'error' + 'close'(hadError=true), not a swallowed 'end'.

Two regression tests (backpressure via SIGKILL over a UDS; paused socket via server.close()), both hang on the baked bun and pass here. node-net-server 18/18, server.spec 38/38, allowHalfOpen, TLS close, and http backpressure all green. Passes #28350's own regression/issue/13184.test.ts cases too.

Consolidation: closed my earlier #28732 (strict subset of this); cross-linked #28350 and #27161 for a maintainer to pick. Waiting on CI.

Comment thread src/js/node/net.ts Outdated
@robobun

robobun commented Jun 1, 2026

Collaborator Author

CI on the previous push caught a real regression (thanks, annotations): propagating the native close's error into destroy(err) made every passive peer close fatal, so benign post-response RSTs / keep-alive idle closes broke fetch/ws/h2/undici/grpc/valkey across all platforms, plus an ASAN ENOENT double-close.

Fixed by dropping the error — destroyAfterClose now just defers destroy() (no error), which still fixes the hang ('close' fires, server._connections decrements, spin stops) and leaves error delivery to the native error handler where it belongs. Also reverted the hadError tweak.
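
The difference between the two destroy shapes can be illustrated on a generic node:stream Duplex. This is a sketch only: eventsAfterDestroy is a hypothetical helper, the Error is a made-up placeholder, and a plain Duplex emits 'close' through the default emitClose path, not through net.Socket's _destroy.

```javascript
const { Duplex } = require("node:stream");

// Records which lifecycle events a fresh Duplex emits after destroy(err).
function eventsAfterDestroy(err, done) {
  const duplex = new Duplex({
    read() {},
    write(_chunk, _encoding, callback) { callback(); },
  });
  const events = [];
  duplex.on("error", () => events.push("error"));
  duplex.on("close", () => events.push("close"));
  duplex.destroy(err);
  setTimeout(() => done(events), 10);
}

eventsAfterDestroy(undefined, events => console.log("destroy():", events));
eventsAfterDestroy(new Error("placeholder reset"), events => console.log("destroy(err):", events));
```

With no argument only 'close' is recorded; with an error, 'error' precedes 'close'. Without an 'error' listener the second form would surface as an uncaught exception, which gives a sense of why propagating the error on every passive peer close was disruptive.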

Re-verified under bun bd (ASAN): every test CI flagged now passes — 108 test-net-*, the http/tls parallel tests (eof-for-eom, keep-alive, econnreset-socket, reuse-host-from-socket), issue/12117, both regression tests. Pushed ef80fdc.

@robobun

robobun commented Jun 17, 2026

Collaborator Author

Rebased onto main (133 commits, conflict in src/js/node/net.ts from #31155's close-handler rework) and refined the fix:

All 141 test-net-* Node parallel tests pass under bun bd (ASAN), plus node-net-server (18/18), regression/issue/12117, and the http/tls parallel tests earlier iterations regressed. Both regression tests still hang on the baked bun and pass here; 'end'-before-'close' ordering matches Node. Squashed to a single commit, force-pushed 9d3e856.

@robobun

robobun commented Jun 17, 2026

Collaborator Author

Diff is green; CI red is unrelated flake.

Four consecutive CI runs each failed only on tests unrelated to this change, with zero node:net/tls/http test failures in any run:

| Build | Failing test | Lane | Reason |
| --- | --- | --- | --- |
| #63124 | test/bake/deinitialization.test.ts | 13 x64-asan | native DevServer WebSocket UAF on thread T9 (known class, see #29949; no node:net) |
| #63124 | test/cli/install/bun-install.test.ts | 14 aarch64 | bun install lockfile assertion |
| #63127 | test/cli/install/bun-install.test.ts | 11 aarch64 | same bun install lockfile assertion |
| #63127 | test/cli/hot/hot.test.ts | 2019 x64-baseline | Windows hot-reload ENOENT race (warning, retried) |
| #63133 | test/js/bun/test/parallel/test-integration-rspack.ts | 25.04 x64 | integration timeout |
| #63133 | test/js/web/websocket/autobahn.test.ts | 14 aarch64 | autobahn harness |
| #63133 | test/cli/run/cpu-prof.test.ts | 2019 x64 | Windows cpu-prof (warning, retried) |
| #63134 | test/integration/next-pages/test/dev-server.test.ts | 26 aarch64 | puppeteer/chrome download failure |
| #63134 | test/bake/dev-and-prod.test.ts | 2019 x64-baseline | Windows HMR socket disconnect (warning, retried) |

This PR touches only src/js/node/net.ts and test/js/node/net/. Locally under bun bd (ASAN): all 141 test-net-* Node parallel tests, node-net-server (18/18), server.spec (38/38), regression/issue/12117, the http/tls parallel tests previous iterations had regressed (test-tls-reuse-host-from-socket, test-https-eof-for-eom, test-http-1.0-keep-alive, test-net-socket-close-after-end, test-net-socket-write-after-close, …), and both new regression tests all pass. bake/deinitialization.test.ts passes 9/9 locally.

The review bot's final pass on 299f240 found no issues (all three edge cases it flagged earlier are addressed) and defers to a maintainer for sign-off given the overlap with #27161/#28350.

Ready for a maintainer to review/merge past the unrelated flakes.

Comment thread src/js/node/net.ts
@robobun robobun force-pushed the farm/265bd52f/fix-uds-backpressure-close branch from d6837b7 to 341c164 Compare June 17, 2026 12:44
Comment thread src/js/node/net.ts
A node:net socket whose peer disappears while the socket has a stuck half
(write-backpressure that can never flush, or a paused readable with nothing
buffered that never consumes the queued EOF) never emitted 'close'. The
process hung forever, spinning on the half-closed fd; server.close() never
completed.

The socket is a Duplex with { emitClose:false, autoDestroy:true }, so 'close'
only fires from _destroy, and autoDestroy only runs _destroy once both halves
finish. When a half is stuck it never does: the socket lingers as a zombie
(destroyed=false), server._connections never decrements, and a peer that keeps
the fd hot spins the loop.

Changes:

- SocketEmitEndNT now follows push(null) with read(0) (as SocketHandlers2.close
  already does), so a paused stream with nothing buffered still schedules
  'end' and can auto-destroy instead of stalling.

- SocketEmitEndNT's pending-write flush now also gates on self[kclosed] so a
  clean close (no error) still fails the in-flight write — otherwise a paused
  reader with buffered data plus a backpressured write leaves kWriting set and
  autoDestroy can never fire even after the reader drains.

- The native close handlers force teardown via destroyAfterClose(): when
  there is nothing left to read (readableLength === 0) and the socket is not
  already destroyed, schedule destroy() on setImmediate. Deferring past the
  nextTick queue lets the pending 'end' (from push(null)+read(0)) and any
  "write EBADF" callback _write scheduled for a write that raced the close
  fire first (test-net-socket-close-after-end.js,
  test-net-socket-write-after-close.js). Readers with data still buffered are
  left alone so they can consume it and emit 'end' before 'close', as Node
  does. No error is passed: real read errors already surface through the
  dedicated error paths in SocketEmitEndNT, and the passive peer close that
  lands here is benign.
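
The ordering argument in the last bullet can be checked on a generic node:stream Duplex. The sketch below is a hypothetical approximation: destroyAfterCloseSketch only mirrors the gating described above (not destroyed, readableLength === 0, destroy() on setImmediate with no error), the real helper in src/js/node/net.ts is not reproduced here and may differ, and run simulates the native close by hand.

```javascript
const { Duplex } = require("node:stream");

// Hypothetical approximation of the helper described above.
function destroyAfterCloseSketch(stream) {
  if (stream.destroyed || stream.readableLength > 0) return false;
  setImmediate(() => {
    if (!stream.destroyed) stream.destroy(); // no error is passed
  });
  return true;
}

function run(done) {
  const events = [];
  const stream = new Duplex({
    autoDestroy: true,
    read() {},
    write(_chunk, _encoding, _callback) {
      // Stuck writable: the callback is never called.
    },
  });
  stream.on("end", () => events.push("end"));
  stream.on("close", () => events.push("close"));

  stream.write("unflushed"); // backpressure toward a dead peer
  stream.pause();

  // Simulated native close: EOF, read(0) kick, then the deferred destroy.
  stream.push(null);
  stream.read(0);
  destroyAfterCloseSketch(stream);

  setTimeout(() => done(events), 20);
}

run(events => console.log(events));
```

Because 'end' is scheduled on the nextTick queue and the destroy on setImmediate, the recorded order should be 'end' followed by 'close', even though the writable half never finishes.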

Two regression tests: (1) two Bun processes over a UDS, survivor builds
backpressure, peer SIGKILLed mid-flight, survivor must get one 'close' and
exit; (2) a paused server socket whose peer ends, 'close' fires and
server.close() completes. Both hang on the baked bun. All 141 test-net-*
Node parallel tests pass under bun bd (ASAN), along with node-net-server
(18/18), the http/tls parallel tests that previous iterations regressed,
and regression/issue/12117.

Rebased onto #31155, which substantially reworked these close handlers
(ECONNRESET-shaped destroy in SocketEmitEndNT, kwriteCallback flush); those
changes are kept and destroyAfterClose sits after them as the stuck-writable
fallback they leave open.
@robobun robobun force-pushed the farm/265bd52f/fix-uds-backpressure-close branch from 341c164 to 299f240 Compare June 17, 2026 13:33

@claude claude Bot left a comment

Contributor

No issues found on the latest revision — all three edge cases I flagged earlier are addressed. Deferring to a maintainer since this touches core node:net socket-teardown ordering across all three close handlers and overlaps with #27161/#28350, which someone with the full picture should reconcile.

Extended reasoning...

Overview

The PR modifies src/js/node/net.ts — specifically the three native close handlers (SocketHandlers.close, ServerHandlers.close, SocketHandlers2.close) and SocketEmitEndNT — to ensure a net.Socket whose peer dies while one Duplex half is stuck (write backpressure, or a paused readable with nothing buffered) still reaches _destroy and emits 'close'. It adds a destroyAfterClose() helper (deferred destroy() via setImmediate, gated on readableLength === 0 and a kclosed re-check), a read(0) kick after push(null), and broadens the kwriteCallback flush gate to include self[kclosed]. Two new regression tests plus two subprocess fixtures cover both stuck-half shapes.

Security risks

None identified. This is socket-lifecycle event ordering; no auth, crypto, parsing of untrusted input, or permission surfaces are touched.

Level of scrutiny

High. This is the teardown path for every node:net socket in Bun's Node compat layer, sitting underneath http/https/tls/undici/grpc/etc. The change hinges on subtle event-loop ordering invariants — setImmediate vs nextTick, 'end'-before-'close', autoDestroy's both-halves-finished requirement, and the detached-handle vs attached-handle close paths. The PR's own history shows how easy this is to get wrong: an earlier iteration that propagated the close error into destroy(err) broke fetch/ws/h2/undici/grpc/valkey across all platforms; the readableLength guard and setImmediate defer were each added to un-break specific Node parallel tests; and two of my own inline comments flagged narrow races (reconnect-from-'close', clean-close + buffered-readable + pending write) that required further amendments. The current revision looks correct and well-guarded, but this is exactly the kind of change where a maintainer who knows the uSockets/native-handler contract should sign off.

Other factors

  • The PR overlaps three other PRs targeting the same handlers (#27161, #28350, #28732 — the last closed by the author as a subset of this one), and was rebased onto #31155 which substantially reworked the same code. The author explicitly invites a maintainer to pick/consolidate.
  • All prior review threads (CodeRabbit's and mine) are resolved; the bug-hunting system found nothing on the current revision.
  • Test coverage is solid: two targeted regression tests that hang on the unfixed binary, plus the author reports all 141 test-net-* Node parallel tests, node-net-server, and the previously-regressed http/tls parallels passing under ASAN. CI failures on the last two builds are unrelated flakes (bake DevServer UAF, bun install lockfile, Windows hot-reload).
  • Net change to runtime code is small (~30 lines), but the blast radius if the ordering is subtly wrong is large.


Successfully merging this pull request may close these issues.

The 'node::net' module correctly emitted events.