udp: drain IP_RECVERR error queue to fix 100% CPU busy-loop#29473

Open
robobun wants to merge 3 commits into main from farm/e52dd978/fix-udp-errqueue-busy-loop

Conversation

@robobun

@robobun robobun commented Apr 19, 2026

Collaborator

What does this PR do?

Fixes #29436 — sending a UDP datagram to an unreachable port on Linux pinned CPU at 100% forever.

Reproduction

const socket = await Bun.udpSocket({
  socket: { error(err) { console.error(err); } },
});
socket.send("Hello", 41234, "127.0.0.1"); // nothing listening
// → "ECONNREFUSED" printed once, then CPU stays at 100%

Affects node:dgram the same way. Regression from #28827 which enabled IP_RECVERR.

Root cause

With IP_RECVERR/IPV6_RECVERR enabled, an ICMP "port unreachable" is queued on the socket's error queue and the kernel raises EPOLLERR (level-triggered on the error queue being non-empty). Plain recvmsg/recvmmsg reports the pending error once but does not dequeue it — only recvmsg(..., MSG_ERRQUEUE) does that. So EPOLLERR stayed asserted and epoll_wait returned immediately on every iteration.
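
The level-triggered condition described above can be observed from user space. The following is a minimal sketch, not code from this PR: `enable_recverr` and `error_pending` are hypothetical helper names, and it assumes an IPv4 UDP socket on Linux. It only shows how the error-queue option is switched on and how a zero-timeout poll() reveals whether the kernel is still reporting POLLERR.

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to queue ICMP errors for this
 * IPv4 UDP socket (IPv6 sockets would use IPPROTO_IPV6 / IPV6_RECVERR). */
int enable_recverr(int fd) {
    int on = 1;
    return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on));
}

/* Hypothetical helper: returns 1 if the kernel currently reports POLLERR
 * for fd, 0 if not, -1 if poll() itself failed. POLLERR is reported even
 * when no events are requested, and it stays set until the pending error
 * is consumed, which is why an event loop that never consumes it spins. */
int error_pending(int fd) {
    struct pollfd p = { .fd = fd, .events = 0, .revents = 0 };
    int n = poll(&p, 1, 0); /* zero timeout: just sample the current state */
    if (n < 0) return -1;
    return (p.revents & POLLERR) ? 1 : 0;
}
```

Calling `error_pending` before and after a plain recvmsg on a socket with a queued ICMP error is one way to check the behavior the paragraph describes.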

Fix

Add bsd_udp_drain_errqueue() which reads one entry from the error queue via recvmsg(MSG_ERRQUEUE | MSG_DONTWAIT) and extracts ee_errno from the sock_extended_err cmsg.

In the POLL_TYPE_UDP dispatch, when EPOLLERR fires on Linux:

  • Drain the error queue (capped at 32 entries per dispatch to avoid loop starvation), surfacing each errno via on_recv_error.
  • If the queue drained to empty, read-and-clear sk_err via SO_ERROR. EPOLLERR is also asserted when sk_err is set without an error-queue entry (non-ICMP async errors), and MSG_ERRQUEUE doesn't consume it. This step is skipped when the budget ran out, since sock_dequeue_err_skb writes the next entry's errno into sk_err and reading it now would double-report.
  • If recvmsg(MSG_ERRQUEUE) itself fails, surface that errno and close to avoid spinning on a stuck EPOLLERR.
  • Clear error so the socket isn't closed by the generic handler; it stays open for subsequent sends.

This mirrors libuv's approach (uv__udp_io → uv__udp_recvmsg(handle, MSG_ERRQUEUE) on POLLERR).

Conflict resolution vs main

#29768 (HTTP/3) landed an inline MSG_ERRQUEUE drain in loop.c while this PR was open. This PR replaces that with the bsd_udp_drain_errqueue() helper and adds the budget cap, SO_ERROR fallback for non-queue sk_err, and drain-failure close path. It also drops the now-unused recv_error_surfaced/recv_would_block_only flags and the <linux/errqueue.h> include from loop.c (moved to bsd.c). The QUIC WRITABLE-block changes from #29768 are preserved.

Verification

test/regression/issue/29436.test.ts spawns a subprocess that sends to port 1 (privileged, never auto-assigned), waits for the error callback, then measures process.cpuUsage() over a 1-second sleep. Before the fix CPU time ≈ wall time (~1000ms); after the fix it's idle. Also asserts the error callback fires exactly once with ECONNREFUSED and the socket remains open. Covers unconnected Bun.udpSocket, connected Bun.udpSocket, and node:dgram.
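
The CPU-versus-wall-time comparison is independent of the runtime. As an illustration only (the actual test is TypeScript and uses process.cpuUsage()), the same measurement can be sketched in C; `seconds_now` and `cpu_fraction_while_idle` are hypothetical names.

```c
#define _GNU_SOURCE
#include <time.h>

/* Read a clock as fractional seconds. */
double seconds_now(clockid_t clock) {
    struct timespec ts;
    clock_gettime(clock, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Sleep for `ms` milliseconds and return (process CPU time) / (wall time)
 * over that window. A process that is truly idle yields a value near 0;
 * a process whose event loop is spinning on another thread yields a value
 * near 1 (or higher with several busy threads). */
double cpu_fraction_while_idle(long ms) {
    struct timespec req = { .tv_sec = ms / 1000, .tv_nsec = (ms % 1000) * 1000000L };
    double cpu0 = seconds_now(CLOCK_PROCESS_CPUTIME_ID);
    double wall0 = seconds_now(CLOCK_MONOTONIC);
    nanosleep(&req, NULL);
    double cpu1 = seconds_now(CLOCK_PROCESS_CPUTIME_ID);
    double wall1 = seconds_now(CLOCK_MONOTONIC);
    return (cpu1 - cpu0) / (wall1 - wall0);
}
```

Comparing the ratio against a threshold well below 1 keeps the check tolerant of scheduler noise on slow CI machines.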

All existing test/js/bun/udp/ tests pass (198 tests).

@robobun

robobun commented Apr 19, 2026

Collaborator Author
Updated 12:08 AM PT - May 13th, 2026

@robobun, your commit 3c78ddc has 3 failures in Build #53968 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 29473

That installs a local version of the PR into your bun-29473 executable, so you can run:

bun-29473 --bun

@github-actions

Contributor

Found 1 issue this PR may fix:

  1. node:dgram emits ECONNREFUSED recv and exits in Bun 1.3.12, but not 1.3.11 #29116 - Same 1.3.12 regression: node:dgram emits ECONNREFUSED on recv and exits when sending to unreachable ports, breaking WebRTC/ICE stacks like werift. This PR's error-queue draining and proper on_recv_error surfacing should resolve it.

If this is helpful, copy the block below into the PR description to auto-close this issue on merge.

Fixes #29116

🤖 Generated with Claude Code

@github-actions

Contributor

This PR may be a duplicate of:

  1. Fix UDP socket closing when sending to unreachable destination #28690 - Also fixes UDP socket error handling in bsd.c and loop.c to prevent ICMP errors from causing improper behavior (socket closure / CPU busy-loop) after IP_RECVERR was enabled

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Apr 19, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Added a Linux-only UDP error-queue draining helper and integrated it into the UDP-ready dispatch path to consume kernel ICMP/port-unreachable errors without causing an event-loop busy loop; added Linux-only regression tests validating a single ECONNREFUSED emission and no CPU busy-looping.

Changes

Cohort / File(s) and summary:

  • Linux UDP error drain implementation (packages/bun-usockets/src/bsd.c): Added int bsd_udp_drain_errqueue(LIBUS_SOCKET_DESCRIPTOR fd, int *err_out) (Linux-only). Non-blocking `recvmsg(..., MSG_ERRQUEUE
  • Linux UDP API header (packages/bun-usockets/src/internal/networking/bsd.h): Declared bsd_udp_drain_errqueue(...) behind #if defined(__linux__), exposing the symbol only on Linux builds.
  • UDP socket event-loop handling (packages/bun-usockets/src/loop.c): On Linux POLL_TYPE_UDP EPOLLERR, drain the UDP error queue (budgeted loop), invoke u->on_recv_error(u, err) per drained errno, close the socket if draining fails to avoid an EPOLLERR busy-loop, clear residual SO_ERROR, and suppress the generic close-on-error for drained cases. Tightened the recv loop to run only when LIBUS_SOCKET_READABLE; on Linux, surface recv errno via on_recv_error instead of treating it as fatal.
  • Regression tests, Linux-only (test/regression/issue/29436.test.ts): Added a Linux-gated test validating unconnected and connected Bun udpSocket and Node dgram behavior when sending to a privileged dead port: asserts exactly one ECONNREFUSED callback, socket not closed prematurely, CPU usage low during a 1s idle window, and clean child exit.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

  • Title check (✅ Passed): The title clearly and concisely describes the main fix: draining the IP_RECVERR error queue to resolve a CPU busy-loop issue on UDP sockets.
  • Linked Issues check (✅ Passed): The PR successfully addresses all requirements from issue #29436: eliminates the CPU busy-loop, ensures error callback fires exactly once, keeps UDP socket usable, and provides comprehensive regression test coverage.
  • Out of Scope Changes check (✅ Passed): All changes directly address the linked issue #29436: implementation of error queue draining for Linux UDP sockets, related header declarations, and regression testing with no extraneous modifications.
  • Description check (✅ Passed): The PR description is comprehensive and well-structured, including what the PR does, root cause analysis, the fix approach, conflict resolution, and verification details.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@claude claude Bot left a comment

Contributor

LGTM — correct fix for the IP_RECVERR busy-loop, mirrors libuv's approach, and includes a solid regression test.

Extended reasoning...

Overview

The PR modifies three files in bun-usockets: adds bsd_udp_drain_errqueue() in bsd.c (Linux-only, guarded by #if defined(__linux__)), declares it in bsd.h, updates the POLL_TYPE_UDP dispatch in loop.c to call it when EPOLLERR fires, and adds a regression test.

Security risks

None. The change is confined to Linux UDP error queue draining, uses the standard recvmsg call with the Linux-specific MSG_ERRQUEUE flag, and does not touch auth, crypto, or permission-sensitive code paths.

Level of scrutiny

Medium. This touches the core event-loop dispatch for UDP sockets, but the fix is narrowly scoped to Linux and only activates on EPOLLERR. It mirrors what libuv does (uv__udp_io calls recvmsg(MSG_ERRQUEUE) on POLLERR), which is strong precedent. The old code had two correctness bugs: it didn't actually drain the error queue (only plain recvmmsg), and it used complex recv_error_surfaced/recv_would_block_only flags that were hard to reason about. The replacement is simpler and correctly handles the closed-socket check after each callback.

Other factors

The commit (87baff6) is already present in the main branch. No bugs were found by the automated hunting system. The regression test is well-designed: it isolates the subprocess, measures CPU usage over a 1-second idle window, and asserts both that the error fires exactly once and that CPU stays below 75% of wall time.

@robobun

robobun commented Apr 19, 2026

Collaborator Author

Re: the bot suggestions above —

@robobun robobun force-pushed the farm/e52dd978/fix-udp-errqueue-busy-loop branch from 87baff6 to 22a03f4 Compare April 19, 2026 03:25
Comment thread packages/bun-usockets/src/loop.c
Comment thread test/regression/issue/29436.test.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/regression/issue/29436.test.ts`:
- Line 20: Add an explicit test timeout (e.g. 15000 ms) to the skipped Linux UDP
test by passing an options object with timeout to the test invocation (the call
that currently reads test.skipIf(!isLinux)("Bun.udpSocket: ICMP error does not
busy-loop the event loop", async () => { ... }). Update that invocation and the
two other similar tests at the same pattern (the tests referenced at the other
occurrences) to include { timeout: 15000 } so the spawned subprocesses have
enough time on slow CI.
- Around line 45-46: The test currently closes the socket immediately after
triggering the ICMP error (socket.close()), which prevents verifying the socket
remains usable; add a post-error usability check by performing a send (e.g.,
socket.send or the same send API used earlier) after the transient ECONNREFUSED
and before calling socket.close(), assert the send succeeds or that the socket's
state indicates it can still send, and then proceed to close and log; apply the
same additional send/usability check in the other similar spot referenced around
the second occurrence.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fed81336-1505-49c1-9af3-e4a4e8f766ce

📥 Commits

Reviewing files that changed from the base of the PR and between 4f4062ff46f3fd366150dd8f651864bf5c1bd8bb and b05fdbda9abf59779e0fe7e9033c0e1d8f4ccfa5.

📒 Files selected for processing (1)
  • test/regression/issue/29436.test.ts

Comment thread test/regression/issue/29436.test.ts Outdated
Comment thread test/regression/issue/29436.test.ts Outdated
Comment thread packages/bun-usockets/src/loop.c Outdated
Comment thread packages/bun-usockets/src/loop.c Outdated

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/regression/issue/29436.test.ts`:
- Around line 51-55: After the transient-error sequence that checks
socket.closed and then calls socket.close(), perform an explicit post-error send
on the same socket (e.g. call the socket.send or sendTo method used elsewhere)
and capture its return/response into a new field like sendAfterError, include
that value in the JSON output alongside
errorCount/errorCode/closed/cpuMs/wallMs, and ensure the parent test asserts
that result.sendAfterError indicates success (e.g. result.sendAfterError >= 0);
reference the existing socket, closed, socket.close(), and the JSON logging to
add the new capture and assertion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e96f0a35-b21b-4612-9a23-c519480e16a8

📥 Commits

Reviewing files that changed from the base of the PR and between b05fdbda9abf59779e0fe7e9033c0e1d8f4ccfa5 and 83c577affea40720fde5dabe2c992848e4dbb930.

📒 Files selected for processing (2)
  • packages/bun-usockets/src/loop.c
  • test/regression/issue/29436.test.ts

Comment thread test/regression/issue/29436.test.ts Outdated
@robobun robobun force-pushed the farm/e52dd978/fix-udp-errqueue-busy-loop branch from 83c577a to a2ad918 Compare April 19, 2026 05:27

@claude claude Bot left a comment

Contributor

All previously reported bugs have been addressed in the final commits.

Extended reasoning...

Overview

The PR adds bsd_udp_drain_errqueue() to drain the Linux IP_RECVERR error queue via recvmsg(MSG_ERRQUEUE) and integrates it into the POLL_TYPE_UDP dispatch path, replacing the old approach that left EPOLLERR asserted after a port-unreachable ICMP error. Changes touch bsd.c, bsd.h, loop.c (Linux-only paths), and a new regression test.

Security risks

No security concerns — this is a bug fix for an event-loop busy-loop in the UDP error-handling path. No auth, crypto, or permission code is touched.

Level of scrutiny

This required careful scrutiny because it modifies low-level C networking logic on a hot path. Three iterative review cycles identified and resolved: (1) the ENETDOWN-style sk_err-without-error-queue busy-loop, (2) the budget-exhaustion SO_ERROR double-report (guarded with if (drained == 0)), (3) the silent close without on_recv_error on drain failure, and (4) the TOCTOU port-selection race in tests (switched to port 1). All four issues are fixed in the current HEAD.

Other factors

The test covers unconnected, connected, and node:dgram cases; uses port 1 to avoid TOCTOU; has a 20s timeout for slow CI; and asserts exactly one error callback fires with ECONNREFUSED, the socket remains open post-error, and CPU usage stays low. The implementation mirrors libuv's approach. All existing UDP tests continue to pass per the PR description.

Comment thread packages/bun-usockets/src/bsd.c Outdated

@claude claude Bot left a comment

Contributor

All previously raised issues have been addressed — error-queue draining is correct, SO_ERROR fallback is properly guarded, and the musl portability fix landed. LGTM.

Extended reasoning...

Overview

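The PR adds bsd_udp_drain_errqueue() in bsd.c and rewires the UDP error handler in loop.c to drain the Linux IP_RECVERR error queue with recvmsg(MSG_ERRQUEUE), fixing a 100% CPU busy-loop regression introduced in #28827. A regression test is added in test/regression/issue/29436.test.ts.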

Security risks

No security-sensitive code is touched. The change is scoped to the UDP error path on Linux. The syscalls used (recvmsg with MSG_ERRQUEUE, getsockopt with SO_ERROR) are standard and well-understood.

Level of scrutiny

The PR went through four rounds of iterative fixes driven by my prior inline comments. All five flagged issues (ENETDOWN busy-loop, budget-exhaustion SO_ERROR double-report, missing on_recv_error on drain failure, TOCTOU test port race, and musl build failure) were addressed in commits 83c577a, b05fdbda9a, and 1d7afda. The final implementation mirrors libuv's approach.

Other factors

The diff in the preloaded context is the post-fix state. The SO_ERROR fallback is correctly guarded by if (drained == 0), the 32-entry budget prevents loop starvation, and the test uses privileged port 1 (never auto-assigned) with a 20s timeout. Existing UDP test coverage was confirmed passing.

@robobun

robobun commented Apr 19, 2026

Collaborator Author

CI build 46446 failures are unrelated to this PR:

  • test/bake/dev/stress.test.ts on Windows x64-baseline — HMR stress test, Subprocess.send() cannot be used after the process has exited
  • test/js/bun/cron/in-process-cron.test.ts on Debian x64-asan — worker-termination race, panic: EventLoop.enqueueTaskConcurrent: VM has terminated

Neither touches UDP; all changes in this PR are Linux-only EPOLLERR handling in bun-usockets. The regression test 29436.test.ts passed on every Linux lane, and all 62 individual job statuses reported to GitHub as pass. Main is also intermittently failing (e.g. build 46439).

Comment thread packages/bun-usockets/src/loop.c Outdated
@robobun robobun force-pushed the farm/e52dd978/fix-udp-errqueue-busy-loop branch 2 times, most recently from 0a2b057 to 2f32c79 Compare April 27, 2026 20:28
On Linux with IP_RECVERR enabled, an ICMP error (e.g. port unreachable)
is queued on the socket's error queue and EPOLLERR is raised. Plain
recvmsg/recvmmsg reports the pending error once but does not remove it
from the error queue, so EPOLLERR stays level-triggered and epoll_wait
busy-loops at 100% CPU.

Drain the error queue with recvmsg(MSG_ERRQUEUE) (capped at 32 entries
per dispatch), surfacing each sock_extended_err.ee_errno via
on_recv_error and keeping the socket open. When the queue drains to
empty, also read-and-clear sk_err via SO_ERROR to handle non-ICMP async
errors that set sk_err without an error-queue entry. If the drain
itself fails, surface that errno and close to avoid spinning on a
stuck EPOLLERR.

Fixes #29436
@Jarred-Sumner Jarred-Sumner force-pushed the farm/e52dd978/fix-udp-errqueue-busy-loop branch from 2f32c79 to 4752d70 Compare May 4, 2026 10:37
@robobun

robobun commented May 13, 2026

Collaborator Author

Build 51115 failed on every lane because the branch was 26 commits behind main — notably the WebKit bump in #30527 (398620405488984d), so builds couldn't fetch the matching headers. Merged main in 2a3dcc5; diff vs main is unchanged (bsd.c/bsd.h/loop.c + the regression test).

Note: the core busy-loop in #29436 is already fixed on current canary via the inline MSG_ERRQUEUE drain that landed with #29768. This PR is now the extraction into bsd_udp_drain_errqueue(), the 32-entry budget cap, the SO_ERROR fallback for sk_err without an error-queue entry, the drain-failure close path, and the regression test pinning #29436.
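
To make the interaction between the budget cap, the SO_ERROR fallback, and the drain-failure path concrete, here is a hedged sketch of the control flow. It is not the code in loop.c: `dispatch_pollerr`, the `drain_fn` and `error_cb` types, and the three stub drainers are hypothetical, the drain step is passed in as a function pointer so the flow can be exercised without real ICMP traffic, and the closed-socket check after each callback is omitted.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/socket.h>

enum { DRAIN_BUDGET = 32 }; /* per-dispatch cap, as stated in the PR description */

/* Hypothetical drain step: 1 = one entry dequeued into *err_out,
 * 0 = error queue empty, -1 = drain failed (errno set). */
typedef int (*drain_fn)(int fd, int *err_out);
typedef void (*error_cb)(int err, void *user);

/* Returns the number of errors surfaced, or -1 if the drain failed and
 * the caller should close the socket. */
int dispatch_pollerr(int fd, drain_fn drain_one, error_cb on_error, void *user) {
    int reported = 0;
    int queue_empty = 0;

    for (int i = 0; i < DRAIN_BUDGET; i++) {
        int err = 0;
        int r = drain_one(fd, &err);
        if (r < 0) {              /* drain itself failed: surface and bail out */
            on_error(errno, user);
            return -1;
        }
        if (r == 0) {             /* queue fully drained */
            queue_empty = 1;
            break;
        }
        on_error(err, user);
        reported++;
    }

    /* Only consult SO_ERROR when the queue is known to be empty. If the
     * budget ran out, sk_err may mirror the next queued entry, and reading
     * it here would report that entry twice. */
    if (queue_empty) {
        int so_err = 0;
        socklen_t len = sizeof(so_err);
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_err, &len) == 0 && so_err != 0) {
            on_error(so_err, user);
            reported++;
        }
    }
    return reported;
}

/* Illustration-only stubs standing in for a real drain step. */
int stub_always_full(int fd, int *err_out) { (void)fd; *err_out = ECONNREFUSED; return 1; }
int stub_empty(int fd, int *err_out) { (void)fd; (void)err_out; return 0; }
int stub_fail(int fd, int *err_out) { (void)fd; (void)err_out; errno = EBADF; return -1; }

/* Illustration-only callback that counts how many errors were surfaced. */
void count_errors(int err, void *user) { (void)err; (*(int *)user)++; }
```

With `stub_always_full` the loop stops at the budget and skips SO_ERROR; with `stub_empty` on a healthy socket nothing is surfaced; with `stub_fail` exactly one error is surfaced before the caller is told to close.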

Locally: 3/3 regression tests pass, all 209 test/js/bun/udp/ tests pass.

@robobun

robobun commented May 13, 2026

Collaborator Author

CI is red on unrelated flakes only (re-rolled once, same result):

  • test/js/web/fetch/fetch-http2-client.test.ts — WebKit AtomStringImpl::remove assertion / internal assertion in the HTTP/2 fetch client on x64-asan. This PR doesn't touch fetch, HTTP/2, or WebKit.
  • test/js/bun/test/parallel/test-http-should-emit-close-when-connection-is-aborted.ts — Windows timeout, failing on builds 53958–53962 across the tree.
  • test/cli/install/bun-install-registry.test.ts — lockfile snapshot flake (retried).

This PR's diff is 4 files: three in packages/bun-usockets/ (Linux-only EPOLLERR error-queue drain) plus one Linux-gated regression test, which passes on every Linux lane. No CI annotation references UDP, 29436, or bun-usockets. Ready for merge.

@Raghuboi

Friendly ping - this fixes a 100% CPU busy-loop regression. Would appreciate a review when you have a moment!

The CI failures on Windows (test-http-should-emit-close-when-connection-is-aborted) and macOS aarch64 (s3-storage-class) appear to be pre-existing issues unrelated to this PR, as noted in earlier comments.

Development

Successfully merging this pull request may close these issues.

CPU 100% on UDP error !!!!

3 participants