fs_events(darwin): fix broken double-checked locking on fsevents_default_loop by robobun · Pull Request #30111 · oven-sh/bun

robobun · 2026-05-02T15:45:32Z

Problem

FSEvents.watch() is called from Darwin.addWatch (path_watcher.zig:708) without holding manager.mutex — it's released first (path_watcher.zig:325) to keep lock order one-way (fsevents_loop.mutex → manager.mutex). Two Workers can therefore enter FSEvents.watch() concurrently.

The old code:

pub fn watch(...) !*FSEventsWatcher {
    if (fsevents_default_loop) |loop| {                    // ← unlocked read, no acquire
        return FSEventsWatcher.init(loop, ...);
    } else {
        fsevents_default_loop_mutex.lock();
        defer fsevents_default_loop_mutex.unlock();
        if (fsevents_default_loop == null) {
            fsevents_default_loop = try FSEventsLoop.init();
        }
        return FSEventsWatcher.init(fsevents_default_loop.?, ...);
    }
}

This is the textbook broken DCLP that PathWatcherManager.get() (path_watcher.zig:51-55) explicitly fixed for its own default_manager with a comment explaining why. On ARM64, Worker A's store to fsevents_default_loop can become visible to Worker B before the stores inside FSEventsLoop.init() (this.* = fs_loop). Worker B then calls FSEventsWatcher.init(loop, ...) → loop.registerWatcher(this) → this.mutex.lock() on a Mutex whose bytes may still be uninitialized/garbage, and reads a garbage this.watchers BabyList — UB / crash.

CoreFoundation.get(), CoreServices.get(), and closeAndWait() have the identical unlocked-fast-path pattern on fsevents_cf / fsevents_cs / fsevents_default_loop.

Fix

Drop the unlocked fast path everywhere — always take the mutex first. This is exactly what PathWatcherManager.get() already does. These run once per fs.watch() call; the mutex is uncontended after initialization.

In watch(), release fsevents_default_loop_mutex before calling FSEventsWatcher.init() (which takes loop.mutex) so we never nest the two — loop is stable once published since only closeAndWait() at process exit ever clears it.

Test

test/js/node/watch/fs.watch.worker-init-race.test.ts (macOS-only, like the neighboring fs.watch.events-cb-race.test.ts): spawns a fresh process per iteration (so fsevents_default_loop starts null every time) with N Workers that each call fs.watch() as their very first statement on distinct directories, so multiple threads hit FSEvents.watch() before any has published the loop.

The race is memory-ordering-dependent (ARM64 store reordering) so it's low-probability even on Apple Silicon; the stress test exercises the concurrent path every iteration. Passes with the fix.
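The shape of the stress fixture described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual test file: `workerSource` and `planWorkers` are hypothetical names, and the generated worker code assumes CommonJS `require` is available inside an eval'd Worker.

```typescript
// Hypothetical helpers; the real fixture in the PR is not shown in this thread.

interface WorkerPlan {
  dir: string; // directory this worker watches (distinct per worker)
  source: string; // JavaScript the worker would evaluate
}

// Source for one worker: fs.watch() is the first call it makes, so every
// worker reaches the watcher's first-time initialization as early as possible.
function workerSource(dir: string): string {
  return [
    `const watcher = require("node:fs").watch(${JSON.stringify(dir)}, () => {});`,
    `watcher.close();`,
  ].join("\n");
}

// One plan per worker, each with its own directory under `root`.
function planWorkers(root: string, count: number): WorkerPlan[] {
  const plans: WorkerPlan[] = [];
  for (let i = 0; i < count; i++) {
    const dir = `${root}/w${i}`;
    plans.push({ dir, source: workerSource(dir) });
  }
  return plans;
}
```

In a real fixture each `source` would be handed to `new Worker(source, { eval: true })` from `node:worker_threads`, inside a process spawned fresh for every iteration. Creating the directories beforehand is left out of this sketch.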

Verification

bun run zig:check-all — compiles on all targets including aarch64-macos / x86_64-macos.
bun bd test test/js/node/watch/ — all existing tests pass on Linux (the 2 permission failures are pre-existing on main; container runs as root so chmod 0 has no effect).

robobun · 2026-05-02T15:45:42Z

Updated 11:46 AM PT - May 4th, 2026

❌ @robobun, your commit 9fea19c has 1 failure in Build #51152 (All Failures):

🧪 To try this PR locally:

bunx bun-pr 30111

That installs a local version of the PR into your bun-30111 executable, so you can run:

bun-30111 --bun

coderabbitai · 2026-05-02T15:48:18Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 57652769-8830-4b5e-af84-99b138f9b222

📥 Commits

Reviewing files that changed from the base of the PR and between be2a71d and 9ca0bf8.

📒 Files selected for processing (1)

test/js/node/watch/fs.watch.worker-init-race.test.ts

Walkthrough

Synchronized FSEvents initialization on macOS by removing unlocked fast-path reads and always guarding default-loop access with mutexes; added a macOS-only regression test that stress-tests concurrent first-time fs.watch() initialization across worker threads and fresh processes.

Changes

FSEvents Concurrency Safety

Layer / File(s)	Summary
Data Shape / State `src/bun.js/node/fs_events.zig`	Removes unlocked cached-handle reads for `fsevents_cf` and `fsevents_cs` (no API signature changes).
Core Implementation `src/bun.js/node/fs_events.zig`	Moves CF/CS checks under `fsevents_mutex`; makes `watch()` always acquire `fsevents_default_loop_mutex` before init/use; `closeAndWait()` now locks the same mutex before checking/clearing the default loop; adds concurrency comments.
Tests / Regression `test/js/node/watch/fs.watch.worker-init-race.test.ts`	Adds macOS-only stress test that spawns 8 worker threads calling `fs.watch()` concurrently, repeated across 20 fresh Bun process runs; asserts `stdout` is `OK`, `stderr` empty, and exit code 0.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically identifies the main change: fixing broken double-checked locking (DCLP) in the fsevents_default_loop, which is the primary objective of this PR.
Description check	✅ Passed	The PR description comprehensively covers the problem (unlocked fast-path reads and memory ordering issues), the fix (always take mutex first), and verification (compilation and testing), exceeding the template's basic requirements.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/node/watch/fs.watch.worker-init-race.test.ts`:
- Around line 54-69: The test creates N Worker instances and only listens for
"message" and "error", so if a worker silently exits the test can hang; add a
listener for w.on("exit", (code) => { ... }) that treats non-zero exit codes as
a failure (set failed = true and process.exit(1) or log) and for any exit
increments done and runs the same completion check used in the "message" handler
(if (++done === N && !failed) { console.log("OK"); process.exit(0); }) to ensure
the test completes or fails deterministically; update references to Worker,
w.on("message"), w.on("error"), the done/failed counters and N accordingly.
- Around line 84-86: The stderr assertion can fail due to known ASAN startup
noise; normalize stderr before asserting by splitting proc.stderr.text() result
into lines, filtering out lines that start with "WARNING: ASAN interferes"
(e.g., stderrLines.filter(line => !line.startsWith("WARNING: ASAN
interferes"))), then join/trim the remaining lines and assert that the filtered
stderr is empty instead of raw stderr; update the code around the Promise.all
usage and the expect on stderr to use this filtered value (refer to variables
proc, stderr and the expect call).
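The two suggestions above can be sketched as a pair of small helpers. This is only an illustration of what the comments describe, under the assumption that the fixture counts workers with plain counters. `createExitTracker` and `filterKnownStderrNoise` are hypothetical names, not code from the PR.

```typescript
// Hypothetical helpers illustrating the two review suggestions.

type Outcome = "pending" | "ok" | "failed";

// Tracks N workers by their 'exit' events, so a worker that dies without
// posting a message still counts toward completion.
function createExitTracker(total: number) {
  let exited = 0;
  let failed = false;
  return {
    // Intended to be wired to worker.on("error", ...).
    onError(): void {
      failed = true;
    },
    // Intended to be wired to worker.on("exit", ...); non-zero is a failure.
    onExit(code: number): void {
      if (code !== 0) failed = true;
      exited++;
    },
    // "failed" as soon as anything went wrong, "ok" once all workers exited.
    outcome(): Outcome {
      if (failed) return "failed";
      return exited === total ? "ok" : "pending";
    },
  };
}

// Drops the known ASAN startup warning before stderr is compared to "".
function filterKnownStderrNoise(stderr: string): string {
  return stderr
    .split("\n")
    .filter(line => !line.startsWith("WARNING: ASAN interferes"))
    .join("\n")
    .trim();
}
```

With a tracker like this, the fixture would print "OK" and exit 0 only when `outcome()` returns "ok", and exit non-zero as soon as it returns "failed".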

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 72d5b91f-1712-49ff-b26a-1e98f365e005

📥 Commits

Reviewing files that changed from the base of the PR and between 9a5bf31 and be2a71d.

📒 Files selected for processing (1)

test/js/node/watch/fs.watch.worker-init-race.test.ts

claude

Additional findings (outside current diff — PR may have been updated during review):

🟡 test/js/node/watch/fs.watch.worker-init-race.test.ts:93-94 — nit: test/CLAUDE.md says "Do not set a timeout on tests. Bun already has timeouts." — consider dropping the explicit 60_000 here. (Acknowledged that the neighboring fs.watch.events-cb-race.test.ts and fs.watch.deadlock.test.ts do the same thing, so feel free to keep it for consistency with this directory.)
Extended reasoning...

What this is

test/CLAUDE.md (line 120) states:

CRITICAL: Do not set a timeout on tests. Bun already has timeouts.

The new test passes 60_000 as the third argument to test.skipIf(!isMacOS)(name, fn, 60_000) at line 94. By the letter of the repo's test conventions, that's a violation.

Why it's only a nit

This is a style/convention concern, not a functional bug. There is overwhelming local precedent in test/js/node/watch/ for explicit timeouts:
- fs.watch.events-cb-race.test.ts:100 → 60000 (the file this PR explicitly cites as its model)
- fs.watch.deadlock.test.ts:70 → 30000
- fs.watch.test.ts → 10000 in several places
- fs.watchFile.test.ts:201 → 20000
So the author followed the existing pattern in this directory exactly.

Is the timeout actually needed?

Possibly. The test spawns 20 sequential subprocesses, each launching 8 Workers that call fs.watch(). Bun's default per-test timeout is 5s, which is plausibly too tight for 20× process spawn + Worker init on a loaded macOS CI runner. If the timeout is dropped and CI starts flaking on this test, that would be a regression — so there's a reasonable argument for keeping it.
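As a rough check on that argument, assume the 20 iterations run strictly one after another and share the timeout evenly. `perIterationBudgetMs` is a hypothetical helper, and the 5s default is taken from the comment above.

```typescript
// Per-iteration time budget if the iterations run back to back.
function perIterationBudgetMs(timeoutMs: number, iterations: number): number {
  return timeoutMs / iterations;
}

// Default 5s timeout: 250 ms per spawn + 8-Worker init + teardown.
const defaultBudget = perIterationBudgetMs(5_000, 20);

// Explicit 60s timeout: 3000 ms per iteration.
const explicitBudget = perIterationBudgetMs(60_000, 20);
```

Under these assumptions the default leaves a quarter of a second per fresh process, which is the basis for calling it plausibly too tight on a loaded runner.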

Step-by-step
1. test/CLAUDE.md:120 marks "Do not set a timeout on tests" as CRITICAL with no listed exceptions.
2. fs.watch.worker-init-race.test.ts:94 passes 60_000 as the third arg to test().
3. → Letter-of-the-rule violation.
4. But every neighbor in test/js/node/watch/ does the same, and the PR was modeled on one of them.

Suggested fix

Either drop the 60_000 argument (relying on Bun's harness-level timeout), or leave it as-is for consistency with the rest of test/js/node/watch/. Either choice is defensible; this is non-blocking.

claude

The fix looks correct and mirrors the established PathWatcherManager.get() pattern, but since this reworks lock acquisition/ordering in the macOS fs.watch() runtime path (notably now releasing fsevents_default_loop_mutex before FSEventsWatcher.init(), which the old else-branch did not), it's worth a human sign-off.

Extended reasoning...

Overview

Two files changed:

src/bun.js/node/fs_events.zig: removes unlocked fast-path reads of fsevents_cf / fsevents_cs / fsevents_default_loop in four places (CoreFoundation.get(), CoreServices.get(), watch(), closeAndWait()), replacing them with always-lock-first. In watch(), also restructures to capture loop under fsevents_default_loop_mutex via a labeled block and release it before calling FSEventsWatcher.init() → registerWatcher() (which takes loop.mutex).
test/js/node/watch/fs.watch.worker-init-race.test.ts: new macOS-only stress test (20 fresh subprocesses × 8 Workers each racing on first fs.watch()).

I verified the precedent at path_watcher.zig:50-58 — PathWatcherManager.get() uses the identical "no unlocked fast path / textbook broken DCLP" pattern and comment, and path_watcher.zig:704-709 confirms Darwin.addWatch calls FSEvents.watch() without holding manager.mutex, so the concurrent-entry premise is accurate.

Security risks

None. No auth/crypto/permissions/user-input handling; this is internal lock-ordering on lazy-initialized process globals.

Level of scrutiny

Medium-high. The transformation itself is the most conservative possible (drop the optimization, always lock) and is strictly safer for the DCLP concern. However, this is concurrency code in the runtime's macOS file-watching path, and there is a behavioral delta beyond "just add a lock": the old else-branch held fsevents_default_loop_mutex across FSEventsWatcher.init() (and thus across loop.mutex acquisition), whereas the new code releases it first. The PR's reasoning (loop is stable once published; only closeAndWait() at process exit clears it; avoids nesting the two mutexes) is sound and well-documented in the comment, but lock-ordering changes in production runtime code are exactly the kind of thing a maintainer should glance at rather than merge on bot approval alone.

Other factors

All CodeRabbit review comments are resolved (worker exit handling addressed in 51feedf; ASAN-stderr suggestion declined with a reasonable rationale matching sibling tests).
The one inline nit (vestigial parentPort.postMessage in the test fixture after the 51feedf refactor) is cosmetic and non-blocking.
No CODEOWNERS cover these paths.
CI shows build failures on several non-macOS shards, but they look like infra issues ("Failed to create agent", build-zig on freebsd/android/musl) rather than anything this macOS-only logic change would cause; still, a human should confirm green CI before merge.

claude

LGTM — straightforward DCLP fix mirroring the existing PathWatcherManager.get() pattern; my earlier nit was addressed in 9ca0bf8.

Extended reasoning...

Overview

This PR removes the unlocked fast-path reads on fsevents_default_loop, fsevents_cf, and fsevents_cs in src/bun.js/node/fs_events.zig (macOS-only FSEvents code), replacing them with always-take-the-mutex-first. It also restructures watch() to release fsevents_default_loop_mutex before FSEventsWatcher.init() so the two mutexes never nest. A new macOS-only stress test (fs.watch.worker-init-race.test.ts) spawns 8 Workers per fresh process across 20 iterations to exercise the concurrent first-init path.

Security risks

None. No auth, crypto, permissions, or user-controlled input handling is touched — this is purely internal initialization synchronization for the FSEvents run loop.

Level of scrutiny

Low-to-moderate. The change is small (~20 lines of logic + comments), strictly additive in safety (adds locking, never removes it), and is a near-verbatim copy of the pattern already applied in PathWatcherManager.get() (path_watcher.zig:50-58) with the same explanatory comment. The mutex is taken once per fs.watch() call and is uncontended after init, so there's no meaningful performance concern. The lock-nesting avoidance in watch() is correct: loop is captured under the mutex and is stable thereafter (only closeAndWait() at process exit clears it).

Other factors

No CODEOWNERS coverage for these paths.
All CodeRabbit comments are resolved; the author addressed the worker-exit-tracking concern in 51feedf and gave a sound rationale (matching sibling tests + centralized bunEnv ASAN handling) for keeping expect(stderr).toBe("").
My own prior nit (vestigial parentPort.postMessage after the exit-based refactor) was addressed in 9ca0bf8.
The bug hunting system found no issues.
The test follows the established convention for test/js/node/watch/ race tests: standalone file, tempDir + bunEnv, fresh subprocess per iteration, exit code asserted last.

robobun · 2026-05-02T16:54:20Z

CI status

Build #50251 (9ca0bf8): all 6 darwin test lanes ✅ passed (darwin-26-aarch64 ×2, darwin-14-aarch64 ×2, darwin-14-x64 ×2) — fs.watch.worker-init-race.test.ts ran green on every macOS target. Only failure was test/bake/dev-and-prod.test.ts on windows-2019-x64, a known flake hitting 9/10 recent builds.

Build #51152 (9fea19c, after rebase onto src/ restructure): 261 jobs passed. Darwin shards that ran all ✅ passed. Failures:

Infra: :windows: x64 - build-bun expired (no agent) → all windows-2019-x64 test shards never ran. Several darwin-aarch64 / debian-asan shards also expired.
test-http-should-emit-close-when-connection-is-aborted.ts timeout on Windows — pre-existing flake also on builds 51148/51150/51151 and latest main (#50956 is red).

No failure touches fs_events.zig (macOS-only, not compiled on Windows) or the skipIf(!isMacOS) test.

… fsevents_cf / fsevents_cs

Darwin.addWatch (path_watcher.zig) calls FSEvents.watch() without holding manager.mutex (released first to keep lock order fsevents -> manager), so two Workers can enter FSEvents.watch() concurrently. The old code read fsevents_default_loop with no lock and no acquire fence; only the else-branch took fsevents_default_loop_mutex. On ARM64 Worker A's store of the pointer can become visible to Worker B before the stores inside FSEventsLoop.init() (this.* = fs_loop), so Worker B would call registerWatcher() on a partially-visible loop and lock a garbage loop.mutex / read a garbage loop.watchers BabyList.

CoreFoundation.get() / CoreServices.get() / closeAndWait() had the identical unlocked-fast-path pattern on fsevents_cf / fsevents_cs / fsevents_default_loop.

Fix: always take the mutex first - exactly what PathWatcherManager.get() already does for its own default_manager (with the same explanatory comment). These run once per fs.watch() call; the mutex is uncontended after initialization. In watch(), release the init mutex before calling FSEventsWatcher.init() so we never nest fsevents_default_loop_mutex and loop.mutex.

Adds a macOS-only regression stress test that spawns N Workers which each call fs.watch() as their very first statement on distinct directories, in a fresh process per iteration so the global starts null every time.

claude

The fix looks correct and mirrors the established PathWatcherManager.get() pattern, but since this is native concurrency/lock-ordering code in the macOS fs.watch() path — and watch() now releases fsevents_default_loop_mutex before FSEventsWatcher.init() where the old else-branch held it — it's worth a quick human sanity check on the lock-scope change.

Extended reasoning...

Overview

Two files: src/runtime/node/fs_events.zig (macOS FSEvents backend for fs.watch()) and a new macOS-only stress test. The Zig change removes the unlocked fast-path read of three globals (fsevents_cf, fsevents_cs, fsevents_default_loop) in CoreFoundation.get(), CoreServices.get(), watch(), and closeAndWait(), so the mutex is always acquired before reading. In watch(), the loop pointer is captured under the lock via a labeled block and the lock is released before calling FSEventsWatcher.init() (which takes loop.mutex via registerWatcher()), avoiding nesting the two mutexes.

Security risks

None. No user-controlled input, auth, crypto, or external I/O is involved — this is purely internal mutex acquisition ordering around process-global singletons.

Level of scrutiny

Moderate-to-high. The diff is small (~30 lines of real change) and exactly mirrors the already-landed pattern in PathWatcherManager.get() (path_watcher.zig:50–70, verified), and the change is strictly more locking, not less. CI is green on all six darwin lanes including Apple Silicon. However, this is concurrency code in the native runtime on a production-critical path (fs.watch()), and there is one non-mechanical decision: the old else-branch held fsevents_default_loop_mutex across FSEventsWatcher.init(); the new code releases it first. The reasoning (loop is stable once published; only closeAndWait() at process exit clears it; avoids a two-level lock nest) is sound, but lock-scope changes in native code are exactly where a second pair of eyes is cheap insurance.

Other factors

All prior review feedback is resolved: CodeRabbit's worker-exit handling suggestion was applied in 51feedf, my vestigial-postMessage nit was applied in 9ca0bf8, and my stale-path-comment nit was applied in 9fea19c. No CODEOWNERS entry covers src/runtime/node/. The bug-hunting system found nothing. The new test follows the established standalone-race-test convention for test/js/node/watch/. I'd approve if this were a pure "always take the lock" change with no scope adjustment, but the lock-release-before-init refinement nudges it just over my threshold for auto-approval.

github-actions Bot added the claude label May 2, 2026

coderabbitai Bot reviewed May 2, 2026

Comment thread test/js/node/watch/fs.watch.worker-init-race.test.ts

Comment thread test/js/node/watch/fs.watch.worker-init-race.test.ts

claude Bot reviewed May 2, 2026

Comment thread test/js/node/watch/fs.watch.worker-init-race.test.ts Outdated

claude Bot reviewed May 2, 2026

robobun and others added 4 commits May 4, 2026 10:24

[autofix.ci] apply automated fixes

b267506

test: track worker completion via 'exit' so a silent worker death can't hang the test

91223d9

test: drop vestigial parentPort.postMessage now that completion is tracked via 'exit'

f5d25eb

Jarred-Sumner force-pushed the farm/10e14f47/fsevents-dclp branch from 9ca0bf8 to f5d25eb Compare May 4, 2026 10:24

claude Bot reviewed May 4, 2026

Comment thread test/js/node/watch/fs.watch.worker-init-race.test.ts Outdated

test: update fs_events.zig path in comment after src/ restructure

9fea19c

claude Bot reviewed May 4, 2026

robobun mentioned this pull request May 7, 2026

fmt: route pure-{s} allocPrint through shared concat (-287 KB .text) #30358

Closed

robobun mentioned this pull request May 19, 2026

Fix unchecked exception when Bun.plugin target coercion throws #31017

Closed

Jarred-Sumner mentioned this pull request May 21, 2026

Make ArrayBuffer::from_bytes unsafe and add owning constructors #31174

Open

robobun wants to merge 5 commits into main from farm/10e14f47/fsevents-dclp
