fs_events(darwin): fix broken double-checked locking on fsevents_default_loop#30111
fs_events(darwin): fix broken double-checked locking on fsevents_default_loop#30111robobun wants to merge 5 commits into
Conversation
|
Updated 11:46 AM PT - May 4th, 2026
❌ @robobun, your commit 9fea19c has 1 failures in
🧪 To try this PR locally: bunx bun-pr 30111That installs a local version of the PR into your bun-30111 --bun |
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughSynchronized FSEvents initialization on macOS by removing unlocked fast-path reads and always guarding default-loop access with mutexes; added a macOS-only regression test that stress-tests concurrent first-time ChangesFSEvents Concurrency Safety
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Review rate limit: 0/5 reviews remaining, refill in 1 minute and 42 seconds. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/node/watch/fs.watch.worker-init-race.test.ts`:
- Around line 54-69: The test creates N Worker instances and only listens for
"message" and "error", so if a worker silently exits the test can hang; add a
listener for w.on("exit", (code) => { ... }) that treats non-zero exit codes as
a failure (set failed = true and process.exit(1) or log) and for any exit
increments done and runs the same completion check used in the "message" handler
(if (++done === N && !failed) { console.log("OK"); process.exit(0); }) to ensure
the test completes or fails deterministically; update references to Worker,
w.on("message"), w.on("error"), the done/failed counters and N accordingly.
- Around line 84-86: The stderr assertion can fail due to known ASAN startup
noise; normalize stderr before asserting by splitting proc.stderr.text() result
into lines, filtering out lines that start with "WARNING: ASAN interferes"
(e.g., stderrLines.filter(line => !line.startsWith("WARNING: ASAN
interferes"))), then join/trim the remaining lines and assert that the filtered
stderr is empty instead of raw stderr; update the code around the Promise.all
usage and the expect on stderr to use this filtered value (refer to variables
proc, stderr and the expect call).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 72d5b91f-1712-49ff-b26a-1e98f365e005
📒 Files selected for processing (1)
test/js/node/watch/fs.watch.worker-init-race.test.ts
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🟡
test/js/node/watch/fs.watch.worker-init-race.test.ts:93-94— nit:test/CLAUDE.mdsays "Do not set a timeout on tests. Bun already has timeouts." — consider dropping the explicit60_000here. (Acknowledged that the neighboringfs.watch.events-cb-race.test.tsandfs.watch.deadlock.test.tsdo the same thing, so feel free to keep it for consistency with this directory.)Extended reasoning...
What this is
test/CLAUDE.md(line 120) states:CRITICAL: Do not set a timeout on tests. Bun already has timeouts.
The new test passes
60_000as the third argument totest.skipIf(!isMacOS)(name, fn, 60_000)at line 94. By the letter of the repo's test conventions, that's a violation.Why it's only a nit
This is a style/convention concern, not a functional bug. There is overwhelming local precedent in
test/js/node/watch/for explicit timeouts:fs.watch.events-cb-race.test.ts:100→60000(the file this PR explicitly cites as its model)fs.watch.deadlock.test.ts:70→30000fs.watch.test.ts→10000in several placesfs.watchFile.test.ts:201→20000
So the author followed the existing pattern in this directory exactly.
Is the timeout actually needed?
Possibly. The test spawns 20 sequential subprocesses, each launching 8 Workers that call
fs.watch(). Bun's default per-test timeout is 5s, which is plausibly too tight for 20× process spawn + Worker init on a loaded macOS CI runner. If the timeout is dropped and CI starts flaking on this test, that would be a regression — so there's a reasonable argument for keeping it.Step-by-step
test/CLAUDE.md:120marks "Do not set a timeout on tests" as CRITICAL with no listed exceptions.fs.watch.worker-init-race.test.ts:94passes60_000as the third arg totest().- → Letter-of-the-rule violation.
- But every neighbor in
test/js/node/watch/does the same, and the PR was modeled on one of them.
Suggested fix
Either drop the
60_000argument (relying on Bun's harness-level timeout), or leave it as-is for consistency with the rest oftest/js/node/watch/. Either choice is defensible; this is non-blocking.
There was a problem hiding this comment.
The fix looks correct and mirrors the established PathWatcherManager.get() pattern, but since this reworks lock acquisition/ordering in the macOS fs.watch() runtime path (notably now releasing fsevents_default_loop_mutex before FSEventsWatcher.init(), which the old else-branch did not), it's worth a human sign-off.
Extended reasoning...
Overview
Two files changed:
src/bun.js/node/fs_events.zig: removes unlocked fast-path reads offsevents_cf/fsevents_cs/fsevents_default_loopin four places (CoreFoundation.get(),CoreServices.get(),watch(),closeAndWait()), replacing them with always-lock-first. Inwatch(), also restructures to captureloopunderfsevents_default_loop_mutexvia a labeled block and release it before callingFSEventsWatcher.init()→registerWatcher()(which takesloop.mutex).test/js/node/watch/fs.watch.worker-init-race.test.ts: new macOS-only stress test (20 fresh subprocesses × 8 Workers each racing on firstfs.watch()).
I verified the precedent at path_watcher.zig:50-58 — PathWatcherManager.get() uses the identical "no unlocked fast path / textbook broken DCLP" pattern and comment, and path_watcher.zig:704-709 confirms Darwin.addWatch calls FSEvents.watch() without holding manager.mutex, so the concurrent-entry premise is accurate.
Security risks
None. No auth/crypto/permissions/user-input handling; this is internal lock-ordering on lazy-initialized process globals.
Level of scrutiny
Medium-high. The transformation itself is the most conservative possible (drop the optimization, always lock) and is strictly safer for the DCLP concern. However, this is concurrency code in the runtime's macOS file-watching path, and there is a behavioral delta beyond "just add a lock": the old else-branch held fsevents_default_loop_mutex across FSEventsWatcher.init() (and thus across loop.mutex acquisition), whereas the new code releases it first. The PR's reasoning (loop is stable once published; only closeAndWait() at process exit clears it; avoids nesting the two mutexes) is sound and well-documented in the comment, but lock-ordering changes in production runtime code are exactly the kind of thing a maintainer should glance at rather than merge on bot approval alone.
Other factors
- All CodeRabbit review comments are resolved (worker
exithandling addressed in 51feedf; ASAN-stderr suggestion declined with a reasonable rationale matching sibling tests). - The one inline nit (vestigial
parentPort.postMessagein the test fixture after the 51feedf refactor) is cosmetic and non-blocking. - No CODEOWNERS cover these paths.
- CI shows build failures on several non-macOS shards, but they look like infra issues ("Failed to create agent", build-zig on freebsd/android/musl) rather than anything this macOS-only logic change would cause; still, a human should confirm green CI before merge.
There was a problem hiding this comment.
LGTM — straightforward DCLP fix mirroring the existing PathWatcherManager.get() pattern; my earlier nit was addressed in 9ca0bf8.
Extended reasoning...
Overview
This PR removes the unlocked fast-path reads on fsevents_default_loop, fsevents_cf, and fsevents_cs in src/bun.js/node/fs_events.zig (macOS-only FSEvents code), replacing them with always-take-the-mutex-first. It also restructures watch() to release fsevents_default_loop_mutex before FSEventsWatcher.init() so the two mutexes never nest. A new macOS-only stress test (fs.watch.worker-init-race.test.ts) spawns 8 Workers per fresh process across 20 iterations to exercise the concurrent first-init path.
Security risks
None. No auth, crypto, permissions, or user-controlled input handling is touched — this is purely internal initialization synchronization for the FSEvents run loop.
Level of scrutiny
Low-to-moderate. The change is small (~20 lines of logic + comments), strictly additive in safety (adds locking, never removes it), and is a near-verbatim copy of the pattern already applied in PathWatcherManager.get() (path_watcher.zig:50-58) with the same explanatory comment. The mutex is taken once per fs.watch() call and is uncontended after init, so there's no meaningful performance concern. The lock-nesting avoidance in watch() is correct: loop is captured under the mutex and is stable thereafter (only closeAndWait() at process exit clears it).
Other factors
- No CODEOWNERS coverage for these paths.
- All CodeRabbit comments are resolved; the author addressed the worker-exit-tracking concern in 51feedf and gave a sound rationale (matching sibling tests + centralized
bunEnvASAN handling) for keepingexpect(stderr).toBe(""). - My own prior nit (vestigial
parentPort.postMessageafter the exit-based refactor) was addressed in 9ca0bf8. - The bug hunting system found no issues.
- The test follows the established convention for
test/js/node/watch/race tests: standalone file,tempDir+bunEnv, fresh subprocess per iteration, exit code asserted last.
CI statusBuild #50251 (9ca0bf8): all 6 darwin test lanes ✅ passed ( Build #51152 (9fea19c, after rebase onto src/ restructure): 261 jobs passed. Darwin shards that ran all ✅ passed. Failures:
No failure touches |
… fsevents_cf / fsevents_cs Darwin.addWatch (path_watcher.zig) calls FSEvents.watch() without holding manager.mutex (released first to keep lock order fsevents -> manager), so two Workers can enter FSEvents.watch() concurrently. The old code read fsevents_default_loop with no lock and no acquire fence; only the else-branch took fsevents_default_loop_mutex. On ARM64 Worker A's store of the pointer can become visible to Worker B before the stores inside FSEventsLoop.init() (this.* = fs_loop), so Worker B would call registerWatcher() on a partially-visible loop and lock a garbage loop.mutex / read a garbage loop.watchers BabyList. CoreFoundation.get() / CoreServices.get() / closeAndWait() had the identical unlocked-fast-path pattern on fsevents_cf / fsevents_cs / fsevents_default_loop. Fix: always take the mutex first - exactly what PathWatcherManager.get() already does for its own default_manager (with the same explanatory comment). These run once per fs.watch() call; the mutex is uncontended after initialization. In watch(), release the init mutex before calling FSEventsWatcher.init() so we never nest fsevents_default_loop_mutex and loop.mutex. Adds a macOS-only regression stress test that spawns N Workers which each call fs.watch() as their very first statement on distinct directories, in a fresh process per iteration so the global starts null every time.
9ca0bf8 to
f5d25eb
Compare
There was a problem hiding this comment.
The fix looks correct and mirrors the established PathWatcherManager.get() pattern, but since this is native concurrency/lock-ordering code in the macOS fs.watch() path — and watch() now releases fsevents_default_loop_mutex before FSEventsWatcher.init() where the old else-branch held it — it's worth a quick human sanity check on the lock-scope change.
Extended reasoning...
Overview
Two files: src/runtime/node/fs_events.zig (macOS FSEvents backend for fs.watch()) and a new macOS-only stress test. The Zig change removes the unlocked fast-path read of three globals (fsevents_cf, fsevents_cs, fsevents_default_loop) in CoreFoundation.get(), CoreServices.get(), watch(), and closeAndWait(), so the mutex is always acquired before reading. In watch(), the loop pointer is captured under the lock via a labeled block and the lock is released before calling FSEventsWatcher.init() (which takes loop.mutex via registerWatcher()), avoiding nesting the two mutexes.
Security risks
None. No user-controlled input, auth, crypto, or external I/O is involved — this is purely internal mutex acquisition ordering around process-global singletons.
Level of scrutiny
Moderate-to-high. The diff is small (~30 lines of real change) and exactly mirrors the already-landed pattern in PathWatcherManager.get() (path_watcher.zig:50–70, verified), and the change is strictly more locking, not less. CI is green on all six darwin lanes including Apple Silicon. However, this is concurrency code in the native runtime on a production-critical path (fs.watch()), and there is one non-mechanical decision: the old else-branch held fsevents_default_loop_mutex across FSEventsWatcher.init(); the new code releases it first. The reasoning (loop is stable once published; only closeAndWait() at process exit clears it; avoids a two-level lock nest) is sound, but lock-scope changes in native code are exactly where a second pair of eyes is cheap insurance.
Other factors
All prior review feedback is resolved: CodeRabbit's worker-exit handling suggestion was applied in 51feedf, my vestigial-postMessage nit was applied in 9ca0bf8, and my stale-path-comment nit was applied in 9fea19c. No CODEOWNERS entry covers src/runtime/node/. The bug-hunting system found nothing. The new test follows the established standalone-race-test convention for test/js/node/watch/. I'd approve if this were a pure "always take the lock" change with no scope adjustment, but the lock-release-before-init refinement nudges it just over my threshold for auto-approval.
Problem
FSEvents.watch()is called fromDarwin.addWatch(path_watcher.zig:708) without holdingmanager.mutex— it's released first (path_watcher.zig:325) to keep lock order one-way (fsevents_loop.mutex → manager.mutex). Two Workers can therefore enterFSEvents.watch()concurrently.The old code:
This is the textbook broken DCLP that
PathWatcherManager.get()(path_watcher.zig:51-55) explicitly fixed for its owndefault_managerwith a comment explaining why. On ARM64, Worker A's store tofsevents_default_loopcan become visible to Worker B before the stores insideFSEventsLoop.init()(this.* = fs_loop). Worker B then callsFSEventsWatcher.init(loop, ...)→loop.registerWatcher(this)→this.mutex.lock()on aMutexwhose bytes may still be uninitialized/garbage, and reads a garbagethis.watchersBabyList — UB / crash.CoreFoundation.get(),CoreServices.get(), andcloseAndWait()have the identical unlocked-fast-path pattern onfsevents_cf/fsevents_cs/fsevents_default_loop.Fix
Drop the unlocked fast path everywhere — always take the mutex first. This is exactly what
PathWatcherManager.get()already does. These run once perfs.watch()call; the mutex is uncontended after initialization.In
watch(), releasefsevents_default_loop_mutexbefore callingFSEventsWatcher.init()(which takesloop.mutex) so we never nest the two —loopis stable once published since onlycloseAndWait()at process exit ever clears it.Test
test/js/node/watch/fs.watch.worker-init-race.test.ts(macOS-only, like the neighboringfs.watch.events-cb-race.test.ts): spawns a fresh process per iteration (sofsevents_default_loopstartsnullevery time) with N Workers that each callfs.watch()as their very first statement on distinct directories, so multiple threads hitFSEvents.watch()before any has published the loop.The race is memory-ordering-dependent (ARM64 store reordering) so it's low-probability even on Apple Silicon; the stress test exercises the concurrent path every iteration. Passes with the fix.
Verification
bun run zig:check-all— compiles on all targets including aarch64-macos / x86_64-macos.bun bd test test/js/node/watch/— all existing tests pass on Linux (the 2permissionfailures are pre-existing onmain; container runs as root sochmod 0has no effect).