Fix panic when FilePoll unregister fails on macOS#31701
Conversation
kevent reports per-change failures by writing the errno value into the eventlist entry's `data` field (EV_ERROR set). Those values were fed through `errno_sys`, which decodes the -1-sentinel return-code convention and therefore returns `None` for every real errno — so the `.unwrap()` at the EV_ERROR call sites panicked whenever a change failed. A failed EV_DELETE is a tolerated race on the teardown path (closing an fd removes its kevents, so a FilePoll force-unregistered after its fd was closed gets EBADF/ENOENT) and every unregister caller already discards the error. Build the error from the errno value directly via a checked SystemErrno::init, narrow errno_sys to the platforms that still use the return-code convention, and wire the FilePoll debug logs to the real bun_sys::syslog now that the macro is exported.
|
Warning Review limit reached
More reviews will be available in 3 minutes and 55 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
Comment |
|
Status: fix complete and CI-verified on macOS — every remaining red lane is broken docker infra on the darwin-13/14 agents. Needs a maintainer merge-call.
Final CI triage, build 59860 (after lane retries; 69/73 GitHub contexts green — all Linux incl. ASAN, Windows, musl, alpine, and all darwin builds):
None of these touch this diff (kevent error construction + a spawn teardown test). One
|
|
Updated 5:14 AM PT - Jun 2nd, 2026
❌ @robobun, your commit bdb56c8 has 8 failures in
🧪 To try this PR locally: bunx bun-pr 31701That installs a local version of the PR into your bun-31701 --bun |
There was a problem hiding this comment.
I didn't find any issues — the root-cause analysis is convincing and the fix is clearly correct — but this touches the core kqueue FilePoll register/unregister path plus a side-along enable of the real syslog! macro, so it's worth a human sanity-check before merging.
Extended reasoning...
Overview
This PR fixes Sentry crash BUN-38SJ, an Option::unwrap() panic in FilePoll::unregister_with_fd_impl on macOS. The change is in src/io/posix_event_loop.rs plus a new integration test and a unit test.
The core fix replaces three errno_sys(changelist[i].data, Tag::kevent).unwrap() calls with a new kevent_change_error(data) helper. The root cause is well-explained: kevent's EV_ERROR reports the errno value in data, but errno_sys decodes the -1-sentinel return-code convention (only reads thread-local errno when rc == -1), so it returned None for every real errno and the .unwrap() panicked. The new helper constructs the error directly from the errno value via SystemErrno::init(data).unwrap_or(EINVAL), which I verified exists in src/errno/lib.rs and follows the same unwrap_or(EINVAL) pattern used elsewhere (e.g. WindowsWatcher.rs).
Three additional changes ride along: (1) errno_sys is narrowed from cfg(not(windows)) to linux/android/freebsd only — correct, since macOS no longer calls it and FreeBSD/Linux still do; (2) the no-op local syslog! shim is replaced with the real bun_sys::syslog (verified #[macro_export] at src/sys/lib.rs:4980), enabling FilePoll debug logging under BUN_DEBUG_SYS=1; (3) a unit test and an integration test that deterministically reproduces the stale-fd unregister race.
Security risks
None. This is error-path handling in event-loop fd deregistration; no auth, crypto, parsing, or untrusted input. The integration test fixture closes fds behind the runtime's back, but only in a child process spawned by the test itself.
Level of scrutiny
Medium-high. FilePoll::register_with_fd_impl / unregister_with_fd_impl are the kqueue/epoll registration primitives underlying all polled fds (subprocess pipes, sockets, process watchers) on POSIX. While the actual logic change is narrow and obviously correct — replacing a guaranteed-panic with a properly constructed error that all callers already discard via let _ = self.unregister(…) — this is core runtime infrastructure where regressions would be widely felt, and the patch is macOS-gated so Linux CI alone won't catch a mistake.
Other factors
- Side-along behavior change: enabling the real
syslog!is debug-build-only and was explicitly TODO'd in the shim, but it's worth a maintainer confirming they want it bundled here rather than split out. - Test fixture reliability: the integration test discovers the polled fd by diffing
/proc/self/fd///dev/fdbefore/after spawn and asserts exactly one new pipe/socket fd appears. This relies on implementation details (no dup, only stdout piped) that the comments document, but a maintainer should confirm this won't be flaky in CI. - No CODEOWNERS entry for
src/io/. - Bug hunter found no issues;
cargo check/clippyreported clean across linux/darwin/windows targets per the PR body.
Given this is critical event-loop code fixing a production crash, deferring so a human can confirm the fix matches their read of the Sentry trace and that the test fixture is acceptable for CI.
There was a problem hiding this comment.
The fix itself looks correct and well-reasoned, but this touches the core macOS kevent path in the event loop and the darwin CI lanes haven't produced a green run yet (expired/retriggered) — worth a human glance before merge, especially with the bundled syslog! shim → real macro swap.
Extended reasoning...
Overview
This PR fixes a Sentry crash (BUN-38SJ) in src/io/posix_event_loop.rs: an Option::unwrap() panic in FilePoll::unregister_with_fd_impl on macOS. The change replaces three errno_sys(changelist[N].data, …).unwrap() call sites with a new kevent_change_error(data) helper that correctly interprets kevent's EV_ERROR convention (errno value in data, not a -1 sentinel return code). It also narrows the errno_sys cfg gate to linux/android/freebsd, swaps the local no-op syslog! shim for the now-exported bun_sys::syslog macro, and adds both an in-crate unit test and a JS integration test.
Security risks
None identified. This is error-handling plumbing in fd deregistration; no auth, crypto, untrusted input parsing, or permission logic is involved. The new helper uses a checked SystemErrno::init with an EINVAL fallback, so out-of-range kernel data cannot panic.
Level of scrutiny
Medium-high. The diff is small (~40 net LOC in the .rs file) and the logic is straightforward — the root-cause analysis is excellent and the fix is the obvious correct one (build sys::Error directly from the errno value rather than round-tripping through the -1-sentinel decoder). However, this is the core POSIX event loop's kqueue registration/deregistration path, which underlies every piped subprocess, socket poll, and process watcher on macOS. Subtle mistakes here have broad blast radius.
Other factors
- macOS CI not green yet: The bug and fix are both macOS-only (
#[cfg(target_os = "macos")]). Per the robobun comment, the first CI run's darwin test lanes expired (no agent picked them up; zero tests ran) and were retriggered. I don't see a passing darwin run in the timeline yet, so the fix hasn't been CI-verified on its target platform. - Bundled incidental change: The
syslog!no-op shim →use bun_sys::syslogswap means debug builds now actually emit FilePoll register/unregister logs underBUN_DEBUG_SYS=1. This is debug-only (gated ondebug_assertionsper src/sys/lib.rs:4981+) and matches the Zig original, but it's a behavior change orthogonal to the crash fix that a human should be aware of. - All callers discard the error: I verified the PR's claim that
unregistercallers (deinit_possibly_deferetc.) uselet _ = self.unregister(…), so the only user-visible effect of the fix is removing the panic — the returned error was never acted on. - No CODEOWNERS coverage for
src/io/ortest/js/bun/spawn/. - The bug hunting system found no issues.
Given the critical-path location and the absence of a green darwin CI run, I'm deferring rather than approving.
…ate on aarch64 (#31794) ### Symptom On aarch64, a bun process that hits the crash handler with a JS `SIGTRAP` listener registered does not die — it spins forever in kernel signal delivery, pinning one core, with `crash` → `_sigtramp` at the top of every sample. Observed in the wild as bun-based CLI processes surviving their closed terminal tab at 100% CPU: the revoked tty fd triggered the `FilePoll::unregister_with_fd` unwrap panic (fixed separately in #31701), and the panic then failed to terminate the process. Any crash funnels into this path (`rust_panic_hook` → `crash()`, `crash_handler()` → `crash()`), so every panic/segfault on aarch64 becomes an immortal core-pinning process when such a listener exists. ### Cause `crash()` resets SIGSEGV/SIGILL/SIGBUS/SIGABRT/SIGFPE/SIGHUP/SIGTERM to `SIG_DFL`, then executes `core::intrinsics::abort()` (Zig heritage: `@trap()`). That trap is `ud2` → SIGILL on x86_64 — covered by the list — but `brk` → **SIGTRAP** on aarch64, which the list omits. `process.on("SIGTRAP")` installs a real sigaction (`BunProcess.cpp`) whose handler enqueues to the JS thread and returns. Returning from a handler for a *synchronous* trap re-executes the trapping instruction with PC unchanged: `brk` → SIGTRAP → handler returns → `brk` → … forever. The JS listener never observes anything because the crashing thread never returns to the event loop. npm's `signal-exit` package — ubiquitous in CLI apps (npm itself vendors its signal list, which is quoted in our own `c-bindings.cpp`) — registers exactly this listener set, SIGTRAP included, so real-world exposure is broad. ### Fix Add `libc::SIGTRAP` to the `SIG_DFL` reset list in `crash()` (`src/crash_handler/lib.rs`), so the first trap is lethal regardless of what JS registered. One-line behavior change; x86_64 (`ud2`/SIGILL) was already covered and is unchanged. ### Test `test/cli/run/run-crash-handler.test.ts`: child registers `process.on("SIGTRAP")` **and** `process.on("SIGILL")` (the invariant is arch-independent; the trap signal isn't), triggers a real panic via `bun:internal-for-testing`, and asserts the process dies through the crash handler rather than spinning — the wait is bounded and the child is SIGKILLed on failure so a buggy build fails cleanly instead of leaking a core-pinning process. **Gate note:** the missing signal is only raised by the trap on aarch64, so the test can only fail-before on aarch64 lanes (Linux and macOS); on x86_64 it passes with and without the patch. Same situation as #31701's darwin-only fail-before. The mechanism and the test's sensitivity were verified on x86_64 by temporarily removing SIGILL (x86_64's trap signal) from the reset list: the test then catches the spin at its 8s bound and fails — byte-for-byte the aarch64 failure mode. ### Verification - `bun bd test test/cli/run/run-crash-handler.test.ts` (debug+ASAN, x86_64 Linux): 6 pass, 1 skip (darwin-only), 0 fail; new test completes in ~1s - Mechanism proof above (remove-SIGILL build → test fails at the spin bound; restored → passes) - `cargo check` + `cargo clippy -p bun_crash_handler` clean on x86_64-unknown-linux-gnu, aarch64-unknown-linux-gnu, aarch64-apple-darwin
…31821) ### What `FilePoll::unregister_with_fd_impl` returns an error when its `EV_DELETE` (kqueue) or `EPOLL_CTL_DEL` fails with ENOENT/EBADF. Those errnos mean the kernel-side registration is already gone — which is a routine teardown outcome, not a failure. This PR counts them as successful deregistration on all three platform branches (macOS kevent, Linux epoll, FreeBSD kevent). ### Why Two ordinary ways a registration disappears underneath a live `FilePoll`: - **fd closed while registered**: `close()` removes an fd's kevents, and epoll drops closed fds automatically. The forced unregister during reader teardown then gets EBADF/ENOENT. This is the scenario from #31701. - **macOS pty teardown**: closing a pty master marks the slave's knotes `EV_EOF|EV_ONESHOT`, so the kernel deletes them when the hangup event is delivered (verified: the delivered event carries `flags=0x8095`, i.e. `EV_EOF|EV_ONESHOT|EV_DISPATCH|EV_ENABLE|EV_ADD`, even though we registered with `EV_DISPATCH`). The reader's teardown `EV_DELETE` then finds nothing → ENOENT. This fires on **every terminal window/tab close** while a tty is polled, e.g. raw-mode stdin. #31701 fixed the errno decoding for these entries (they previously fed an errno *value* through the −1-sentinel decoder and panicked on the unwrap), but still surfaces them as errors. That has a real cost: the early error return skips the registration-flag clearing at the end of `unregister_with_fd_impl`, so the poll still claims `poll_readable` and a later teardown call re-issues another doomed `EV_DELETE` (observed via instrumentation: the same poll failing the delete twice during one pty teardown). Every in-tree caller discards the unregister result, so nothing recovers the state. libuv handles this the same way: `uv__kqueue_delete` (src/unix/kqueue.c) returns silently for exactly EBADF/ENOENT on `EV_DELETE` (and aborts on anything else), and the epoll side ignores `EPOLL_CTL_DEL` failures for closed fds (`uv__platform_invalidate_fd` discards the return value; the io_uring batch path comments "Failed submissions are either EPOLL_CTL_DEL commands for file descriptors that have been closed ... Ignore the former"). ### How verified - pty harness: bun under `script(1)` polling raw-mode stdin (plus subprocess churn), then SIGHUP + `revoke(2)` + pty-master SIGKILL — 5/5 runs tear down cleanly, stdin delivers `end`/`close`, process exits normally - `test/js/bun/spawn/spawn-pipe-stale-fd-unregister.test.ts` passes (exercises the modified macOS kevent and Linux epoll branches) - `test/cli/run/run-crash-handler.test.ts` passes No new test: callers discard the unregister result, so the change has no JS-observable surface; the modified branches are covered by the existing stale-fd fixture above.
Fixes Sentry crash BUN-38SJ:
Option::unwrap()panic inFilePoll::unregister_with_fd_implwhile aPosixBufferedReadertears down a subprocess pipe (read_pipe→close_without_reporting→PollOrFd::close_impl→deinit_force_unregister→unregister). macOS only, 1.4 canary.Cause
With
KEVENT_FLAG_ERROR_EVENTS, kevent reports a failed changelist entry by writing it back to the eventlist withEV_ERRORset inflagsand the errno value indata(e.g. ENOENT=2, EBADF=9). Thosedatavalues were fed througherrno_sys, which decodes the -1-sentinel return-code convention: it reads the thread-local errno only whenrcis the all-ones failure value and returnsNoneotherwise. A positive errno value is never-1, soerrno_sys(data, …)returnedNonefor every real error and the.unwrap()at theEV_ERRORcall sites panicked.The panic fires whenever an
EV_DELETEfails, which is a legitimate race on the teardown path:close()removes an fd's kevents, so aFilePollforce-unregistered after its fd was closed out from under it gets EBADF/ENOENT. libuv tolerates exactly these two errnos when deleting kevents, and everyunregistercaller in-tree already discards the returned error (let _ = self.unregister(…)indeinit_possibly_deferetc.) — the only bad outcome was the unwrap.Why the Zig original never crashed here: it tested
changelist[0].flags == EV.ERROR(strict equality), but xnu ORsEV_ERRORinto the existing action bits (EV_DELETE|EV_ERROR= 0x4002), so the check never fired and changelist errors were silently swallowed. The port fixed the flag test ("test the bit, not equality"), which made the path live with the broken errno round-trip behind it.Fix
src/io/posix_event_loop.rs:kevent_change_error(data), which builds thesys::Errorfrom the errno value via the checkedSystemErrno::init(EINVAL fallback for out-of-range data, so the never-happens case can't panic either), and use it at all threeEV_ERRORsites (unregister ×2 — the live crash; register ×1 — latent, itsdatastays 0 because it passesnevents=0).errno_systo linux/android/freebsd, the platforms that still use the return-code convention (macOS no longer references it;-D dead-codeenforces this), and document the convention it decodes.syslog!shim with the realbun_sys::syslogper the shim's own TODO (the macro is exported now). This is the Zig original'sconst log = bun.sys.syslog;— debug builds now log FilePoll register/unregister underBUN_DEBUG_SYS=1, which is how the regression fixture below was validated.Test
test/js/bun/spawn/spawn-pipe-stale-fd-unregister.test.tsmakes the race deterministic:Bun.spawn({stdout: "pipe"})registers aFilePollon the parent's read end of the stdout socketpair at spawn (used as-is, no dup, so it's discoverable via/proc/self/fd///dev/fd). The fixture closes that fd behind the runtime's back (kernel drops the kevent/epoll registration), re-occupies the fd number with/dev/nullso the reader's own deferredclose()stays valid, then cancels the stream — teardown force-unregisters the stale poll. Debug logs confirm the sequence:register: FilePoll(…) readable (11[socket:…])at spawn, thenunregister: FilePoll(…) readable (11[/dev/null])at cancel.Gate note: the buggy code is inside
#[cfg(target_os = "macos")]— only the kevent path feeds an errno value intoerrno_sys; Linux's epoll branch decodes theepoll_ctlreturn code, which is the conventionerrno_sysexpects and always handled the same race correctly. So the test can only fail-before on macOS (where the unfixed build aborts with the unwrap panic); on Linux it passes both with and without the patch while still exercising the EPOLL_CTL_DEL-on-stale-fd teardown. Same situation as the darwin-only statfs64 fix (753f1c4).In-crate unit test
kevent_change_error_decodes_errno_value_not_return_codepins the decode behavior.cargo test -p bun_iodoesn't link on Linux (pre-existing:bun_windows_sys'#[link]directives leak into test binaries), but it runs undercargo miri test -p bun_io kevent_change_error— passes with the fix, and panics exactly like the crash signature if the olderrno_sys(data, …).unwrap()expression is substituted back.Verification
cargo check -p bun_ioon x86_64-linux, aarch64-apple-darwin, x86_64-pc-windows-msvccargo clippy -p bun_io --all-targetsclean on linux + darwin targetscargo miri test -p bun_io kevent_change_errorpassesbun bd test test/js/bun/spawn/spawn-pipe-stale-fd-unregister.test.ts(debug+ASAN) passes; alsospawn-pipe-read-error-leak.test.ts,process-stdin-stale-hup.test.ts,spawn-streaming-stdout.test.ts, and a partialspawn.test.tsrun (35 tests, 0 failures)