Skip to content

Fix panic when FilePoll unregister fails on macOS#31701

Merged
Jarred-Sumner merged 2 commits into
mainfrom
farm/ae9da7b3/fix-kevent-unregister-panic
Jun 2, 2026
Merged

Fix panic when FilePoll unregister fails on macOS#31701
Jarred-Sumner merged 2 commits into
mainfrom
farm/ae9da7b3/fix-kevent-unregister-panic

Conversation

@robobun

@robobun robobun commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Fixes Sentry crash BUN-38SJ: Option::unwrap() panic in FilePoll::unregister_with_fd_impl while a PosixBufferedReader tears down a subprocess pipe (read_pipeclose_without_reportingPollOrFd::close_impldeinit_force_unregisterunregister). macOS only, 1.4 canary.

Cause

With KEVENT_FLAG_ERROR_EVENTS, kevent reports a failed changelist entry by writing it back to the eventlist with EV_ERROR set in flags and the errno value in data (e.g. ENOENT=2, EBADF=9). Those data values were fed through errno_sys, which decodes the -1-sentinel return-code convention: it reads the thread-local errno only when rc is the all-ones failure value and returns None otherwise. A positive errno value is never -1, so errno_sys(data, …) returned None for every real error and the .unwrap() at the EV_ERROR call sites panicked.

The panic fires whenever an EV_DELETE fails, which is a legitimate race on the teardown path: close() removes an fd's kevents, so a FilePoll force-unregistered after its fd was closed out from under it gets EBADF/ENOENT. libuv tolerates exactly these two errnos when deleting kevents, and every unregister caller in-tree already discards the returned error (let _ = self.unregister(…) in deinit_possibly_defer etc.) — the only bad outcome was the unwrap.

Why the Zig original never crashed here: it tested changelist[0].flags == EV.ERROR (strict equality), but xnu ORs EV_ERROR into the existing action bits (EV_DELETE|EV_ERROR = 0x4002), so the check never fired and changelist errors were silently swallowed. The port fixed the flag test ("test the bit, not equality"), which made the path live with the broken errno round-trip behind it.

Fix

src/io/posix_event_loop.rs:

  • Add kevent_change_error(data), which builds the sys::Error from the errno value via the checked SystemErrno::init (EINVAL fallback for out-of-range data, so the never-happens case can't panic either), and use it at all three EV_ERROR sites (unregister ×2 — the live crash; register ×1 — latent, its data stays 0 because it passes nevents=0).
  • Narrow errno_sys to linux/android/freebsd, the platforms that still use the return-code convention (macOS no longer references it; -D dead-code enforces this), and document the convention it decodes.
  • Replace the local no-op syslog! shim with the real bun_sys::syslog per the shim's own TODO (the macro is exported now). This is the Zig original's const log = bun.sys.syslog; — debug builds now log FilePoll register/unregister under BUN_DEBUG_SYS=1, which is how the regression fixture below was validated.

Test

  • test/js/bun/spawn/spawn-pipe-stale-fd-unregister.test.ts makes the race deterministic: Bun.spawn({stdout: "pipe"}) registers a FilePoll on the parent's read end of the stdout socketpair at spawn (used as-is, no dup, so it's discoverable via /proc/self/fd // /dev/fd). The fixture closes that fd behind the runtime's back (kernel drops the kevent/epoll registration), re-occupies the fd number with /dev/null so the reader's own deferred close() stays valid, then cancels the stream — teardown force-unregisters the stale poll. Debug logs confirm the sequence: register: FilePoll(…) readable (11[socket:…]) at spawn, then unregister: FilePoll(…) readable (11[/dev/null]) at cancel.

    Gate note: the buggy code is inside #[cfg(target_os = "macos")] — only the kevent path feeds an errno value into errno_sys; Linux's epoll branch decodes the epoll_ctl return code, which is the convention errno_sys expects and always handled the same race correctly. So the test can only fail-before on macOS (where the unfixed build aborts with the unwrap panic); on Linux it passes both with and without the patch while still exercising the EPOLL_CTL_DEL-on-stale-fd teardown. Same situation as the darwin-only statfs64 fix (753f1c4).

  • In-crate unit test kevent_change_error_decodes_errno_value_not_return_code pins the decode behavior. cargo test -p bun_io doesn't link on Linux (pre-existing: bun_windows_sys' #[link] directives leak into test binaries), but it runs under cargo miri test -p bun_io kevent_change_error — passes with the fix, and panics exactly like the crash signature if the old errno_sys(data, …).unwrap() expression is substituted back.

Verification

  • cargo check -p bun_io on x86_64-linux, aarch64-apple-darwin, x86_64-pc-windows-msvc
  • cargo clippy -p bun_io --all-targets clean on linux + darwin targets
  • cargo miri test -p bun_io kevent_change_error passes
  • bun bd test test/js/bun/spawn/spawn-pipe-stale-fd-unregister.test.ts (debug+ASAN) passes; also spawn-pipe-read-error-leak.test.ts, process-stdin-stale-hup.test.ts, spawn-streaming-stdout.test.ts, and a partial spawn.test.ts run (35 tests, 0 failures)

kevent reports per-change failures by writing the errno value into the
eventlist entry's `data` field (EV_ERROR set). Those values were fed
through `errno_sys`, which decodes the -1-sentinel return-code
convention and therefore returns `None` for every real errno — so the
`.unwrap()` at the EV_ERROR call sites panicked whenever a change
failed. A failed EV_DELETE is a tolerated race on the teardown path
(closing an fd removes its kevents, so a FilePoll force-unregistered
after its fd was closed gets EBADF/ENOENT) and every unregister caller
already discards the error.

Build the error from the errno value directly via a checked
SystemErrno::init, narrow errno_sys to the platforms that still use
the return-code convention, and wire the FilePoll debug logs to the
real bun_sys::syslog now that the macro is exported.
@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Warning

Review limit reached

@robobun, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 3 minutes and 55 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: cd5319ee-032b-4d4c-9321-af201a70a828

📥 Commits

Reviewing files that changed from the base of the PR and between 64ae83c and bdb56c8.

📒 Files selected for processing (2)
  • src/io/posix_event_loop.rs
  • test/js/bun/spawn/spawn-pipe-stale-fd-unregister.test.ts

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added the claude label Jun 2, 2026
@robobun

robobun commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

Status: fix complete and CI-verified on macOS — every remaining red lane is broken docker infra on the darwin-13/14 agents. Needs a maintainer merge-call.

  • Crash: BUN-38SJ — unwrap panic in FilePoll::unregister_with_fd_impl during pipe-reader teardown, macOS canary.
  • Root cause: kevent EV_ERROR entries carry the errno value in data; the code decoded it with errno_sys, which expects a -1-sentinel return code and returns None for every real errno → .unwrap() panic whenever an EV_DELETE failed (EBADF/ENOENT when the fd was closed while the poll was still registered — a tolerated teardown race; all callers discard the error).
  • Fix: build the error from the errno value (kevent_change_error, checked SystemErrno::init + EINVAL fallback). Verified via cargo check/clippy (linux+darwin+windows targets), a miri unit test, and debug-build FilePoll logs of the fixture's stale-fd unregister sequence.
  • Regression test on macOS CI: spawn-pipe-stale-fd-unregister.test.ts passed on darwin-26-aarch64 (both shards) and darwin-14-x64 (1 pass, 0 fail [66ms], job 019e8794-b66e).

Final CI triage, build 59860 (after lane retries; 69/73 GitHub contexts green — all Linux incl. ASAN, Windows, musl, alpine, and all darwin builds):

  • darwin-26-aarch64-test-bun: both shards passed.
  • darwin-14-aarch64-test-bun: one shard passed; the other failed only on test-docker-build-{alpine,debian}.ts — the agent's docker has no buildx (unknown flag: --progress → exit 125).
  • darwin-14-x64-test-bun shard 1: failed only on websocket/autobahn.test.ts — autobahn container can't start (IOError: Is a directory: '/config/fuzzingserver.json').
  • darwin-14-x64-test-bun shard 2: failed only on squid/postgres-dependent files (websocket-proxy, sql.test, sql-postgres-datetime-roundtrip, 21311) — squid container mount fails (mount squid.conf … not a directory) and postgres connections die on the same agent; plus one serve-protocols.test.ts HTTP3StreamReset flake that passed on retry.

None of these touch this diff (kevent error construction + a spawn teardown test). One ci: retrigger was already spent; further pushes would only cancel lane retries. The darwin-13/14 agents' docker services (bind-mounts + buildx) need an infra fix, but this PR is verified on its target platform.

  • Reviews: claude[bot] ×2 — no issues found, fix "clearly correct", deferred for human merge-glance.

@robobun

robobun commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author
Updated 5:14 AM PT - Jun 2nd, 2026

@robobun, your commit bdb56c8 has 8 failures in Build #59860 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 31701

That installs a local version of the PR into your bun-31701 executable, so you can run:

bun-31701 --bun

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find any issues — the root-cause analysis is convincing and the fix is clearly correct — but this touches the core kqueue FilePoll register/unregister path plus a side-along enable of the real syslog! macro, so it's worth a human sanity-check before merging.

Extended reasoning...

Overview

This PR fixes Sentry crash BUN-38SJ, an Option::unwrap() panic in FilePoll::unregister_with_fd_impl on macOS. The change is in src/io/posix_event_loop.rs plus a new integration test and a unit test.

The core fix replaces three errno_sys(changelist[i].data, Tag::kevent).unwrap() calls with a new kevent_change_error(data) helper. The root cause is well-explained: kevent's EV_ERROR reports the errno value in data, but errno_sys decodes the -1-sentinel return-code convention (only reads thread-local errno when rc == -1), so it returned None for every real errno and the .unwrap() panicked. The new helper constructs the error directly from the errno value via SystemErrno::init(data).unwrap_or(EINVAL), which I verified exists in src/errno/lib.rs and follows the same unwrap_or(EINVAL) pattern used elsewhere (e.g. WindowsWatcher.rs).

Three additional changes ride along: (1) errno_sys is narrowed from cfg(not(windows)) to linux/android/freebsd only — correct, since macOS no longer calls it and FreeBSD/Linux still do; (2) the no-op local syslog! shim is replaced with the real bun_sys::syslog (verified #[macro_export] at src/sys/lib.rs:4980), enabling FilePoll debug logging under BUN_DEBUG_SYS=1; (3) a unit test and an integration test that deterministically reproduces the stale-fd unregister race.

Security risks

None. This is error-path handling in event-loop fd deregistration; no auth, crypto, parsing, or untrusted input. The integration test fixture closes fds behind the runtime's back, but only in a child process spawned by the test itself.

Level of scrutiny

Medium-high. FilePoll::register_with_fd_impl / unregister_with_fd_impl are the kqueue/epoll registration primitives underlying all polled fds (subprocess pipes, sockets, process watchers) on POSIX. While the actual logic change is narrow and obviously correct — replacing a guaranteed-panic with a properly constructed error that all callers already discard via let _ = self.unregister(…) — this is core runtime infrastructure where regressions would be widely felt, and the patch is macOS-gated so Linux CI alone won't catch a mistake.

Other factors

  • Side-along behavior change: enabling the real syslog! is debug-build-only and was explicitly TODO'd in the shim, but it's worth a maintainer confirming they want it bundled here rather than split out.
  • Test fixture reliability: the integration test discovers the polled fd by diffing /proc/self/fd // /dev/fd before/after spawn and asserts exactly one new pipe/socket fd appears. This relies on implementation details (no dup, only stdout piped) that the comments document, but a maintainer should confirm this won't be flaky in CI.
  • No CODEOWNERS entry for src/io/.
  • Bug hunter found no issues; cargo check/clippy reported clean across linux/darwin/windows targets per the PR body.

Given this is critical event-loop code fixing a production crash, deferring so a human can confirm the fix matches their read of the Sentry trace and that the test fixture is acceptable for CI.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix itself looks correct and well-reasoned, but this touches the core macOS kevent path in the event loop and the darwin CI lanes haven't produced a green run yet (expired/retriggered) — worth a human glance before merge, especially with the bundled syslog! shim → real macro swap.

Extended reasoning...

Overview

This PR fixes a Sentry crash (BUN-38SJ) in src/io/posix_event_loop.rs: an Option::unwrap() panic in FilePoll::unregister_with_fd_impl on macOS. The change replaces three errno_sys(changelist[N].data, …).unwrap() call sites with a new kevent_change_error(data) helper that correctly interprets kevent's EV_ERROR convention (errno value in data, not a -1 sentinel return code). It also narrows the errno_sys cfg gate to linux/android/freebsd, swaps the local no-op syslog! shim for the now-exported bun_sys::syslog macro, and adds both an in-crate unit test and a JS integration test.

Security risks

None identified. This is error-handling plumbing in fd deregistration; no auth, crypto, untrusted input parsing, or permission logic is involved. The new helper uses a checked SystemErrno::init with an EINVAL fallback, so out-of-range kernel data cannot panic.

Level of scrutiny

Medium-high. The diff is small (~40 net LOC in the .rs file) and the logic is straightforward — the root-cause analysis is excellent and the fix is the obvious correct one (build sys::Error directly from the errno value rather than round-tripping through the -1-sentinel decoder). However, this is the core POSIX event loop's kqueue registration/deregistration path, which underlies every piped subprocess, socket poll, and process watcher on macOS. Subtle mistakes here have broad blast radius.

Other factors

  • macOS CI not green yet: The bug and fix are both macOS-only (#[cfg(target_os = "macos")]). Per the robobun comment, the first CI run's darwin test lanes expired (no agent picked them up; zero tests ran) and were retriggered. I don't see a passing darwin run in the timeline yet, so the fix hasn't been CI-verified on its target platform.
  • Bundled incidental change: The syslog! no-op shim → use bun_sys::syslog swap means debug builds now actually emit FilePoll register/unregister logs under BUN_DEBUG_SYS=1. This is debug-only (gated on debug_assertions per src/sys/lib.rs:4981+) and matches the Zig original, but it's a behavior change orthogonal to the crash fix that a human should be aware of.
  • All callers discard the error: I verified the PR's claim that unregister callers (deinit_possibly_defer etc.) use let _ = self.unregister(…), so the only user-visible effect of the fix is removing the panic — the returned error was never acted on.
  • No CODEOWNERS coverage for src/io/ or test/js/bun/spawn/.
  • The bug hunting system found no issues.

Given the critical-path location and the absence of a green darwin CI run, I'm deferring rather than approving.

@Jarred-Sumner Jarred-Sumner merged commit b7fa3cb into main Jun 2, 2026
76 of 78 checks passed
@Jarred-Sumner Jarred-Sumner deleted the farm/ae9da7b3/fix-kevent-unregister-panic branch June 2, 2026 21:35
Jarred-Sumner pushed a commit that referenced this pull request Jun 4, 2026
…ate on aarch64 (#31794)

### Symptom

On aarch64, a bun process that hits the crash handler with a JS
`SIGTRAP` listener registered does not die — it spins forever in kernel
signal delivery, pinning one core, with `crash` → `_sigtramp` at the top
of every sample. Observed in the wild as bun-based CLI processes
surviving their closed terminal tab at 100% CPU: the revoked tty fd
triggered the `FilePoll::unregister_with_fd` unwrap panic (fixed
separately in #31701), and the panic then failed to terminate the
process. Any crash funnels into this path (`rust_panic_hook` →
`crash()`, `crash_handler()` → `crash()`), so every panic/segfault on
aarch64 becomes an immortal core-pinning process when such a listener
exists.

### Cause

`crash()` resets SIGSEGV/SIGILL/SIGBUS/SIGABRT/SIGFPE/SIGHUP/SIGTERM to
`SIG_DFL`, then executes `core::intrinsics::abort()` (Zig heritage:
`@trap()`). That trap is `ud2` → SIGILL on x86_64 — covered by the list
— but `brk` → **SIGTRAP** on aarch64, which the list omits.

`process.on("SIGTRAP")` installs a real sigaction (`BunProcess.cpp`)
whose handler enqueues to the JS thread and returns. Returning from a
handler for a *synchronous* trap re-executes the trapping instruction
with PC unchanged: `brk` → SIGTRAP → handler returns → `brk` → …
forever. The JS listener never observes anything because the crashing
thread never returns to the event loop. npm's `signal-exit` package —
ubiquitous in CLI apps (npm itself vendors its signal list, which is
quoted in our own `c-bindings.cpp`) — registers exactly this listener
set, SIGTRAP included, so real-world exposure is broad.

### Fix

Add `libc::SIGTRAP` to the `SIG_DFL` reset list in `crash()`
(`src/crash_handler/lib.rs`), so the first trap is lethal regardless of
what JS registered. One-line behavior change; x86_64 (`ud2`/SIGILL) was
already covered and is unchanged.

### Test

`test/cli/run/run-crash-handler.test.ts`: child registers
`process.on("SIGTRAP")` **and** `process.on("SIGILL")` (the invariant is
arch-independent; the trap signal isn't), triggers a real panic via
`bun:internal-for-testing`, and asserts the process dies through the
crash handler rather than spinning — the wait is bounded and the child
is SIGKILLed on failure so a buggy build fails cleanly instead of
leaking a core-pinning process.

**Gate note:** the missing signal is only raised by the trap on aarch64,
so the test can only fail-before on aarch64 lanes (Linux and macOS); on
x86_64 it passes with and without the patch. Same situation as #31701's
darwin-only fail-before. The mechanism and the test's sensitivity were
verified on x86_64 by temporarily removing SIGILL (x86_64's trap signal)
from the reset list: the test then catches the spin at its 8s bound and
fails — byte-for-byte the aarch64 failure mode.

### Verification

- `bun bd test test/cli/run/run-crash-handler.test.ts` (debug+ASAN,
x86_64 Linux): 6 pass, 1 skip (darwin-only), 0 fail; new test completes
in ~1s
- Mechanism proof above (remove-SIGILL build → test fails at the spin
bound; restored → passes)
- `cargo check` + `cargo clippy -p bun_crash_handler` clean on
x86_64-unknown-linux-gnu, aarch64-unknown-linux-gnu,
aarch64-apple-darwin
dylan-conway added a commit that referenced this pull request Jun 4, 2026
…31821)

### What

`FilePoll::unregister_with_fd_impl` returns an error when its
`EV_DELETE` (kqueue) or `EPOLL_CTL_DEL` fails with ENOENT/EBADF. Those
errnos mean the kernel-side registration is already gone — which is a
routine teardown outcome, not a failure. This PR counts them as
successful deregistration on all three platform branches (macOS kevent,
Linux epoll, FreeBSD kevent).

### Why

Two ordinary ways a registration disappears underneath a live
`FilePoll`:

- **fd closed while registered**: `close()` removes an fd's kevents, and
epoll drops closed fds automatically. The forced unregister during
reader teardown then gets EBADF/ENOENT. This is the scenario from
#31701.
- **macOS pty teardown**: closing a pty master marks the slave's knotes
`EV_EOF|EV_ONESHOT`, so the kernel deletes them when the hangup event is
delivered (verified: the delivered event carries `flags=0x8095`, i.e.
`EV_EOF|EV_ONESHOT|EV_DISPATCH|EV_ENABLE|EV_ADD`, even though we
registered with `EV_DISPATCH`). The reader's teardown `EV_DELETE` then
finds nothing → ENOENT. This fires on **every terminal window/tab
close** while a tty is polled, e.g. raw-mode stdin.

#31701 fixed the errno decoding for these entries (they previously fed
an errno *value* through the −1-sentinel decoder and panicked on the
unwrap), but still surfaces them as errors. That has a real cost: the
early error return skips the registration-flag clearing at the end of
`unregister_with_fd_impl`, so the poll still claims `poll_readable` and
a later teardown call re-issues another doomed `EV_DELETE` (observed via
instrumentation: the same poll failing the delete twice during one pty
teardown). Every in-tree caller discards the unregister result, so
nothing recovers the state.

libuv handles this the same way: `uv__kqueue_delete` (src/unix/kqueue.c)
returns silently for exactly EBADF/ENOENT on `EV_DELETE` (and aborts on
anything else), and the epoll side ignores `EPOLL_CTL_DEL` failures for
closed fds (`uv__platform_invalidate_fd` discards the return value; the
io_uring batch path comments "Failed submissions are either
EPOLL_CTL_DEL commands for file descriptors that have been closed ...
Ignore the former").

### How verified

- pty harness: bun under `script(1)` polling raw-mode stdin (plus
subprocess churn), then SIGHUP + `revoke(2)` + pty-master SIGKILL — 5/5
runs tear down cleanly, stdin delivers `end`/`close`, process exits
normally
- `test/js/bun/spawn/spawn-pipe-stale-fd-unregister.test.ts` passes
(exercises the modified macOS kevent and Linux epoll branches)
- `test/cli/run/run-crash-handler.test.ts` passes

No new test: callers discard the unregister result, so the change has no
JS-observable surface; the modified branches are covered by the existing
stale-fd fixture above.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants