udp: drain IP_RECVERR error queue to fix 100% CPU busy-loop#29473
udp: drain IP_RECVERR error queue to fix 100% CPU busy-loop#29473robobun wants to merge 3 commits into
Conversation
|
Updated 12:08 AM PT - May 13th, 2026
❌ @robobun, your commit 3c78ddc has 3 failures in
🧪 To try this PR locally: bunx bun-pr 29473That installs a local version of the PR into your bun-29473 --bun |
|
Found 1 issue this PR may fix:
🤖 Generated with Claude Code |
|
This PR may be a duplicate of:
🤖 Generated with Claude Code |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdded a Linux-only UDP error-queue draining helper and integrated it into the UDP-ready dispatch path to consume kernel ICMP/port-unreachable errors without causing an event-loop busy loop; added Linux-only regression tests validating a single ECONNREFUSED emission and no CPU busy-looping. Changes
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
LGTM — correct fix for the IP_RECVERR busy-loop, mirrors libuv's approach, and includes a solid regression test.
Extended reasoning...
Overview
The PR modifies three files in bun-usockets: adds bsd_udp_drain_errqueue() in bsd.c (Linux-only, guarded by #if defined(__linux__)), declares it in bsd.h, updates the POLL_TYPE_UDP dispatch in loop.c to call it when EPOLLERR fires, and adds a regression test.
Security risks
None. The change is confined to Linux UDP error queue draining, uses standard POSIX APIs (recvmsg with MSG_ERRQUEUE), and does not touch auth, crypto, or permission-sensitive code paths.
Level of scrutiny
Medium. This touches the core event-loop dispatch for UDP sockets, but the fix is narrowly scoped to Linux and only activates on EPOLLERR. It mirrors what libuv does (uv__udp_io calls recvmsg(MSG_ERRQUEUE) on POLLERR), which is strong precedent. The old code had two correctness bugs: it didn't actually drain the error queue (only plain recvmmsg), and it used complex recv_error_surfaced/recv_would_block_only flags that were hard to reason about. The replacement is simpler and correctly handles the closed-socket check after each callback.
Other factors
The commit (87baff6) is already present in the main branch. No bugs were found by the automated hunting system. The regression test is well-designed: it isolates the subprocess, measures CPU usage over a 1-second idle window, and asserts both that the error fires exactly once and that CPU stays below 75% of wall time.
|
Re: the bot suggestions above —
|
87baff6 to
22a03f4
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/regression/issue/29436.test.ts`:
- Line 20: Add an explicit test timeout (e.g. 15000 ms) to the skipped Linux UDP
test by passing an options object with timeout to the test invocation (the call
that currently reads test.skipIf(!isLinux)("Bun.udpSocket: ICMP error does not
busy-loop the event loop", async () => { ... }). Update that invocation and the
two other similar tests at the same pattern (the tests referenced at the other
occurrences) to include { timeout: 15000 } so the spawned subprocesses have
enough time on slow CI.
- Around line 45-46: The test currently closes the socket immediately after
triggering the ICMP error (socket.close()), which prevents verifying the socket
remains usable; add a post-error usability check by performing a send (e.g.,
socket.send or the same send API used earlier) after the transient ECONNREFUSED
and before calling socket.close(), assert the send succeeds or that the socket's
state indicates it can still send, and then proceed to close and log; apply the
same additional send/usability check in the other similar spot referenced around
the second occurrence.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: fed81336-1505-49c1-9af3-e4a4e8f766ce
📥 Commits
Reviewing files that changed from the base of the PR and between 4f4062ff46f3fd366150dd8f651864bf5c1bd8bb and b05fdbda9abf59779e0fe7e9033c0e1d8f4ccfa5.
📒 Files selected for processing (1)
test/regression/issue/29436.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/regression/issue/29436.test.ts`:
- Around line 51-55: After the transient-error sequence that checks
socket.closed and then calls socket.close(), perform an explicit post-error send
on the same socket (e.g. call the socket.send or sendTo method used elsewhere)
and capture its return/response into a new field like sendAfterError, include
that value in the JSON output alongside
errorCount/errorCode/closed/cpuMs/wallMs, and ensure the parent test asserts
that result.sendAfterError indicates success (e.g. result.sendAfterError >= 0);
reference the existing socket, closed, socket.close(), and the JSON logging to
add the new capture and assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e96f0a35-b21b-4612-9a23-c519480e16a8
📥 Commits
Reviewing files that changed from the base of the PR and between b05fdbda9abf59779e0fe7e9033c0e1d8f4ccfa5 and 83c577affea40720fde5dabe2c992848e4dbb930.
📒 Files selected for processing (2)
packages/bun-usockets/src/loop.ctest/regression/issue/29436.test.ts
83c577a to
a2ad918
Compare
There was a problem hiding this comment.
All previously reported bugs have been addressed in the final commits.
Extended reasoning...
Overview
The PR adds bsd_udp_drain_errqueue() to drain the Linux IP_RECVERR error queue via recvmsg(MSG_ERRQUEUE) and integrates it into the POLL_TYPE_UDP dispatch path, replacing the old approach that left EPOLLERR asserted after a port-unreachable ICMP error. Changes touch bsd.c, bsd.h, loop.c (Linux-only paths), and a new regression test.
Security risks
No security concerns — this is a bug fix for an event-loop busy-loop in the UDP error-handling path. No auth, crypto, or permission code is touched.
Level of scrutiny
This required careful scrutiny because it modifies low-level C networking logic on a hot path. Three iterative review cycles identified and resolved: (1) the ENETDOWN-style sk_err-without-error-queue busy-loop, (2) the budget-exhaustion SO_ERROR double-report (guarded with if (drained == 0)), (3) the silent close without on_recv_error on drain failure, and (4) the TOCTOU port-selection race in tests (switched to port 1). All four issues are fixed in the current HEAD.
Other factors
The test covers unconnected, connected, and node:dgram cases; uses port 1 to avoid TOCTOU; has a 20s timeout for slow CI; and asserts exactly one error callback fires with ECONNREFUSED, the socket remains open post-error, and CPU usage stays low. The implementation mirrors libuv's approach. All existing UDP tests continue to pass per the PR description.
There was a problem hiding this comment.
All previously raised issues have been addressed — error-queue draining is correct, SO_ERROR fallback is properly guarded, and the musl portability fix landed. LGTM.
Extended reasoning...
Overview
The PR adds in and rewires the / handler in to drain the Linux IP_RECVERR error queue with , fixing a 100% CPU busy-loop regression introduced in #28827. A regression test is added in .
Security risks
No security-sensitive code is touched. The change is scoped to the UDP error path on Linux. The syscalls used (, ) are standard and well-understood.
Level of scrutiny
The PR went through four rounds of iterative fixes driven by my prior inline comments. All five flagged issues — ENETDOWN busy-loop, budget-exhaustion SO_ERROR double-report, missing on drain failure, TOCTOU test port race, and musl / build failure — were addressed in commits 83c577a, b05fdbda9a, and 1d7afda. The final implementation mirrors libuv's approach.
Other factors
The diff in the preloaded context is the post-fix state. The fallback is correctly guarded by , the 32-entry budget prevents loop starvation, and the test uses privileged port 1 (never auto-assigned) with a 20s timeout. Existing coverage was confirmed passing.
|
CI build 46446 failures are unrelated to this PR:
Neither touches UDP; all changes in this PR are Linux-only |
0a2b057 to
2f32c79
Compare
On Linux with IP_RECVERR enabled, an ICMP error (e.g. port unreachable) is queued on the socket's error queue and EPOLLERR is raised. Plain recvmsg/recvmmsg reports the pending error once but does not remove it from the error queue, so EPOLLERR stays level-triggered and epoll_wait busy-loops at 100% CPU. Drain the error queue with recvmsg(MSG_ERRQUEUE) (capped at 32 entries per dispatch), surfacing each sock_extended_err.ee_errno via on_recv_error and keeping the socket open. When the queue drains to empty, also read-and-clear sk_err via SO_ERROR to handle non-ICMP async errors that set sk_err without an error-queue entry. If the drain itself fails, surface that errno and close to avoid spinning on a stuck EPOLLERR. Fixes #29436
2f32c79 to
4752d70
Compare
|
Build 51115 failed on every lane because the branch was 26 commits behind main — notably the WebKit bump in #30527 ( Note: the core busy-loop in #29436 is already fixed on current canary via the inline Locally: 3/3 regression tests pass, all 209 |
|
CI is red on unrelated flakes only (re-rolled once, same result):
This PR's diff is 4 files in |
|
Friendly ping - this fixes a 100% CPU busy-loop regression. Would appreciate a review when you have a moment! The CI failures on Windows (test-http-should-emit-close-when-connection-is-aborted) and macOS aarch64 (s3-storage-class) appear to be pre-existing issues unrelated to this PR, as noted in earlier comments. |
What does this PR do?
Fixes #29436 — sending a UDP datagram to an unreachable port on Linux pinned CPU at 100% forever.
Reproduction
Affects
node:dgramthe same way. Regression from #28827 which enabledIP_RECVERR.Root cause
With
IP_RECVERR/IPV6_RECVERRenabled, an ICMP "port unreachable" is queued on the socket's error queue and the kernel raisesEPOLLERR(level-triggered on the error queue being non-empty). Plainrecvmsg/recvmmsgreports the pending error once but does not dequeue it — onlyrecvmsg(..., MSG_ERRQUEUE)does that. SoEPOLLERRstayed asserted andepoll_waitreturned immediately on every iteration.Fix
Add
bsd_udp_drain_errqueue()which reads one entry from the error queue viarecvmsg(MSG_ERRQUEUE | MSG_DONTWAIT)and extractsee_errnofrom thesock_extended_errcmsg.In the
POLL_TYPE_UDPdispatch, whenEPOLLERRfires on Linux:on_recv_error.sk_errviaSO_ERROR—EPOLLERRis also asserted whensk_erris set without an error-queue entry (non-ICMP async errors), andMSG_ERRQUEUEdoesn't consume it. Skipped when budget ran out, sincesock_dequeue_err_skbwrites the next entry's errno intosk_errand reading it now would double-report.recvmsg(MSG_ERRQUEUE)itself fails, surface that errno and close to avoid spinning on a stuckEPOLLERR.errorso the socket isn't closed by the generic handler; it stays open for subsequent sends.This mirrors libuv's approach (
uv__udp_io→uv__udp_recvmsg(handle, MSG_ERRQUEUE)onPOLLERR).Conflict resolution vs main
#29768 (HTTP/3) landed an inline
MSG_ERRQUEUEdrain inloop.cwhile this PR was open. This PR replaces that with thebsd_udp_drain_errqueue()helper and adds the budget cap,SO_ERRORfallback for non-queuesk_err, and drain-failure close path. It also drops the now-unusedrecv_error_surfaced/recv_would_block_onlyflags and the<linux/errqueue.h>include fromloop.c(moved tobsd.c). The QUICWRITABLE-block changes from #29768 are preserved.Verification
test/regression/issue/29436.test.tsspawns a subprocess that sends to port 1 (privileged, never auto-assigned), waits for the error callback, then measuresprocess.cpuUsage()over a 1-second sleep. Before the fix CPU time ≈ wall time (~1000ms); after the fix it's idle. Also asserts the error callback fires exactly once withECONNREFUSEDand the socket remains open. Covers unconnectedBun.udpSocket, connectedBun.udpSocket, andnode:dgram.All existing
test/js/bun/udp/tests pass (198 tests).