
fs_events(darwin): fix broken double-checked locking on fsevents_default_loop #30111

Open
robobun wants to merge 5 commits into main from farm/10e14f47/fsevents-dclp

Conversation

@robobun

@robobun robobun commented May 2, 2026

Collaborator

Problem

FSEvents.watch() is called from Darwin.addWatch (path_watcher.zig:708) without holding manager.mutex — it's released first (path_watcher.zig:325) to keep lock order one-way (fsevents_loop.mutex → manager.mutex). Two Workers can therefore enter FSEvents.watch() concurrently.

The old code:

pub fn watch(...) !*FSEventsWatcher {
    if (fsevents_default_loop) |loop| {                    // ← unlocked read, no acquire
        return FSEventsWatcher.init(loop, ...);
    } else {
        fsevents_default_loop_mutex.lock();
        defer fsevents_default_loop_mutex.unlock();
        if (fsevents_default_loop == null) {
            fsevents_default_loop = try FSEventsLoop.init();
        }
        return FSEventsWatcher.init(fsevents_default_loop.?, ...);
    }
}

This is the textbook broken DCLP that PathWatcherManager.get() (path_watcher.zig:51-55) explicitly fixed for its own default_manager with a comment explaining why. On ARM64, Worker A's store to fsevents_default_loop can become visible to Worker B before the stores inside FSEventsLoop.init() (this.* = fs_loop). Worker B then calls FSEventsWatcher.init(loop, ...) → loop.registerWatcher(this) → this.mutex.lock() on a Mutex whose bytes may still be uninitialized/garbage, and reads a garbage this.watchers BabyList, resulting in UB or a crash.

CoreFoundation.get(), CoreServices.get(), and closeAndWait() have the identical unlocked-fast-path pattern on fsevents_cf / fsevents_cs / fsevents_default_loop.

Fix

Drop the unlocked fast path everywhere — always take the mutex first. This is exactly what PathWatcherManager.get() already does. These run once per fs.watch() call; the mutex is uncontended after initialization.

In watch(), release fsevents_default_loop_mutex before calling FSEventsWatcher.init() (which takes loop.mutex) so we never nest the two — loop is stable once published since only closeAndWait() at process exit ever clears it.
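
The Zig diff itself is not reproduced in this description. As a rough model of the lock-first pattern described above, here is a TypeScript sketch over shared memory. Everything in it (`SpinMutex`, `getDefaultLoop`, the slot layout) is hypothetical and is not Bun's code; it only mirrors the shape of the fix: take the mutex, initialize if needed, capture the handle, release, and only then use the handle.

```typescript
// Hypothetical model only; this is not Bun's Zig code or its Mutex type.
// Slot layout in the shared Int32Array (all names are assumptions):
const LOCK = 0;       // 0 = unlocked, 1 = locked
const LOOP = 1;       // 0 = "null" loop, otherwise a published loop handle
const LOOP_FIELD = 2; // stands in for state written by the loop's init
const INIT_COUNT = 3; // how many times init ran (for demonstration)

class SpinMutex {
  private readonly cells: Int32Array;

  constructor(cells: Int32Array) {
    this.cells = cells;
  }

  lock(): void {
    // Spin until this caller flips the lock cell from 0 to 1.
    while (Atomics.compareExchange(this.cells, LOCK, 0, 1) !== 0) {
      // Busy-wait: acceptable for a sketch, not for production code.
    }
  }

  unlock(): void {
    Atomics.store(this.cells, LOCK, 0);
  }
}

// Always lock first: there is no unlocked fast-path read of the LOOP slot.
function getDefaultLoop(cells: Int32Array): number {
  const mutex = new SpinMutex(cells);
  mutex.lock();
  try {
    if (Atomics.load(cells, LOOP) === 0) {
      Atomics.store(cells, LOOP_FIELD, 42); // initialize the loop's state first
      Atomics.add(cells, INIT_COUNT, 1);
      Atomics.store(cells, LOOP, 1);        // then publish, still under the lock
    }
    // Capture the handle under the lock; `finally` releases the lock
    // before the caller does any further work with the handle.
    return Atomics.load(cells, LOOP);
  } finally {
    mutex.unlock();
  }
}

const shared = new Int32Array(new SharedArrayBuffer(16));
const first = getDefaultLoop(shared);
const second = getDefaultLoop(shared);
console.log(first, second, Atomics.load(shared, INIT_COUNT)); // prints: 1 1 1
```

Both calls return the same handle and the initializer runs once. The caller only touches the handle after `getDefaultLoop` has released the lock, which is the same reason the fix can avoid nesting the init mutex with the loop's own mutex.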

Test

test/js/node/watch/fs.watch.worker-init-race.test.ts (macOS-only, like the neighboring fs.watch.events-cb-race.test.ts): spawns a fresh process per iteration (so fsevents_default_loop starts null every time) with N Workers that each call fs.watch() as their very first statement on distinct directories, so multiple threads hit FSEvents.watch() before any has published the loop.

The race is memory-ordering-dependent (ARM64 store reordering) so it's low-probability even on Apple Silicon; the stress test exercises the concurrent path every iteration. Passes with the fix.

Verification

  • bun run zig:check-all — compiles on all targets including aarch64-macos / x86_64-macos.
  • bun bd test test/js/node/watch/ — all existing tests pass on Linux (the 2 permission failures are pre-existing on main; container runs as root so chmod 0 has no effect).

@github-actions github-actions Bot added the claude label May 2, 2026
@robobun

robobun commented May 2, 2026

Collaborator Author
Updated 11:46 AM PT - May 4th, 2026

@robobun, your commit 9fea19c has 1 failure in Build #51152 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 30111

That installs a local version of the PR into your bun-30111 executable, so you can run:

bun-30111 --bun

@coderabbitai

coderabbitai Bot commented May 2, 2026

Contributor


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 57652769-8830-4b5e-af84-99b138f9b222

📥 Commits

Reviewing files that changed from the base of the PR and between be2a71d and 9ca0bf8.

📒 Files selected for processing (1)
  • test/js/node/watch/fs.watch.worker-init-race.test.ts

Walkthrough

Synchronized FSEvents initialization on macOS by removing unlocked fast-path reads and always guarding default-loop access with mutexes; added a macOS-only regression test that stress-tests concurrent first-time fs.watch() initialization across worker threads and fresh processes.

Changes

FSEvents Concurrency Safety

| Layer / File(s) | Summary |
|---|---|
| Data Shape / State: src/bun.js/node/fs_events.zig | Removes unlocked cached-handle reads for fsevents_cf and fsevents_cs (no API signature changes). |
| Core Implementation: src/bun.js/node/fs_events.zig | Moves CF/CS checks under fsevents_mutex; makes watch() always acquire fsevents_default_loop_mutex before init/use; closeAndWait() now locks the same mutex before checking/clearing the default loop; adds concurrency comments. |
| Tests / Regression: test/js/node/watch/fs.watch.worker-init-race.test.ts | Adds macOS-only stress test that spawns 8 worker threads calling fs.watch() concurrently, repeated across 20 fresh Bun process runs; asserts stdout is OK, stderr empty, and exit code 0. |
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: fixing broken double-checked locking (DCLP) in the fsevents_default_loop, which is the primary objective of this PR. |
| Description check | ✅ Passed | The PR description comprehensively covers the problem (unlocked fast-path reads and memory ordering issues), the fix (always take mutex first), and verification (compilation and testing), exceeding the template's basic requirements. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/node/watch/fs.watch.worker-init-race.test.ts`:
- Around line 54-69: The test creates N Worker instances and only listens for
"message" and "error", so if a worker silently exits the test can hang; add a
listener for w.on("exit", (code) => { ... }) that treats non-zero exit codes as
a failure (set failed = true and process.exit(1) or log) and for any exit
increments done and runs the same completion check used in the "message" handler
(if (++done === N && !failed) { console.log("OK"); process.exit(0); }) to ensure
the test completes or fails deterministically; update references to Worker,
w.on("message"), w.on("error"), the done/failed counters and N accordingly.
- Around line 84-86: The stderr assertion can fail due to known ASAN startup
noise; normalize stderr before asserting by splitting proc.stderr.text() result
into lines, filtering out lines that start with "WARNING: ASAN interferes"
(e.g., stderrLines.filter(line => !line.startsWith("WARNING: ASAN
interferes"))), then join/trim the remaining lines and assert that the filtered
stderr is empty instead of raw stderr; update the code around the Promise.all
usage and the expect on stderr to use this filtered value (refer to variables
proc, stderr and the expect call).
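
The two suggestions above can be sketched roughly as follows. This is a minimal illustration, not the PR's test code: `WorkerTracker` and `filterAsanNoise` are hypothetical names, and the tracker is written as a plain class so that it can be exercised without spawning real Workers. Only the `"WARNING: ASAN interferes"` prefix and the exit-code rule are taken from the review text.

```typescript
// Hypothetical helpers; names and shapes are assumptions, not the PR's code.

type Outcome = "pending" | "ok" | "failed";

// Tracks N workers. Intended wiring in a fixture (not executed here):
//   w.on("error", () => tracker.onError());
//   w.on("exit", (code) => tracker.onExit(code));
class WorkerTracker {
  private readonly total: number;
  private exited = 0;
  private failed = false;

  constructor(total: number) {
    this.total = total;
  }

  onError(): void {
    this.failed = true;
  }

  // Every exit counts toward completion; a non-zero code marks a failure,
  // so a worker that dies silently can no longer leave the run hanging.
  onExit(code: number): void {
    if (code !== 0) this.failed = true;
    this.exited += 1;
  }

  get outcome(): Outcome {
    if (this.failed) return "failed";
    return this.exited === this.total ? "ok" : "pending";
  }
}

// Drops the known ASAN startup warning lines before asserting on stderr.
function filterAsanNoise(stderr: string): string {
  return stderr
    .split("\n")
    .filter((line) => !line.startsWith("WARNING: ASAN interferes"))
    .join("\n")
    .trim();
}

const tracker = new WorkerTracker(2);
tracker.onExit(0);
console.log(tracker.outcome); // prints: pending
tracker.onExit(0);
console.log(tracker.outcome); // prints: ok
```

A fixture would print "OK" and exit 0 once `outcome` is `"ok"`, and exit non-zero as soon as it is `"failed"`.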

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 72d5b91f-1712-49ff-b26a-1e98f365e005

📥 Commits

Reviewing files that changed from the base of the PR and between 9a5bf31 and be2a71d.

📒 Files selected for processing (1)
  • test/js/node/watch/fs.watch.worker-init-race.test.ts

Comment thread test/js/node/watch/fs.watch.worker-init-race.test.ts
Comment thread test/js/node/watch/fs.watch.worker-init-race.test.ts

@claude claude Bot left a comment

Contributor

Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 test/js/node/watch/fs.watch.worker-init-race.test.ts:93-94 — nit: test/CLAUDE.md says "Do not set a timeout on tests. Bun already has timeouts." — consider dropping the explicit 60_000 here. (Acknowledged that the neighboring fs.watch.events-cb-race.test.ts and fs.watch.deadlock.test.ts do the same thing, so feel free to keep it for consistency with this directory.)

    Extended reasoning...

    What this is

    test/CLAUDE.md (line 120) states:

    CRITICAL: Do not set a timeout on tests. Bun already has timeouts.

    The new test passes 60_000 as the third argument to test.skipIf(!isMacOS)(name, fn, 60_000) at line 94. By the letter of the repo's test conventions, that's a violation.

    Why it's only a nit

    This is a style/convention concern, not a functional bug. There is overwhelming local precedent in test/js/node/watch/ for explicit timeouts:

    • fs.watch.events-cb-race.test.ts:100 → 60000 (the file this PR explicitly cites as its model)
    • fs.watch.deadlock.test.ts:70 → 30000
    • fs.watch.test.ts → 10000 in several places
    • fs.watchFile.test.ts:20120000

    So the author followed the existing pattern in this directory exactly.

    Is the timeout actually needed?

    Possibly. The test spawns 20 sequential subprocesses, each launching 8 Workers that call fs.watch(). Bun's default per-test timeout is 5s, which is plausibly too tight for 20× process spawn + Worker init on a loaded macOS CI runner. If the timeout is dropped and CI starts flaking on this test, that would be a regression — so there's a reasonable argument for keeping it.

    Step-by-step

    1. test/CLAUDE.md:120 marks "Do not set a timeout on tests" as CRITICAL with no listed exceptions.
    2. fs.watch.worker-init-race.test.ts:94 passes 60_000 as the third arg to test().
    3. → Letter-of-the-rule violation.
    4. But every neighbor in test/js/node/watch/ does the same, and the PR was modeled on one of them.

    Suggested fix

    Either drop the 60_000 argument (relying on Bun's harness-level timeout), or leave it as-is for consistency with the rest of test/js/node/watch/. Either choice is defensible; this is non-blocking.
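
To make the timing concern from "Is the timeout actually needed?" concrete, here is a small back-of-the-envelope calculation. The figures (5s default, 60_000 explicit, 20 sequential subprocesses) come from the comment above; the variable names are illustrative only.

```typescript
// Figures are taken from the review comment; names are illustrative.
const defaultTimeoutMs = 5_000;   // Bun's default per-test timeout, per the comment
const explicitTimeoutMs = 60_000; // value passed by the new test
const iterations = 20;            // sequential subprocess spawns per test

// Average wall-clock budget available to each subprocess iteration.
const budgetPerIteration = (timeoutMs: number): number => timeoutMs / iterations;

console.log(budgetPerIteration(defaultTimeoutMs));  // prints: 250
console.log(budgetPerIteration(explicitTimeoutMs)); // prints: 3000
```

Under the default, each iteration (process spawn plus 8 Worker startups) would have to average 250 ms, which is why the explicit timeout is arguably defensible on a loaded CI runner.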

@claude claude Bot left a comment

Contributor

The fix looks correct and mirrors the established PathWatcherManager.get() pattern, but since this reworks lock acquisition/ordering in the macOS fs.watch() runtime path (notably now releasing fsevents_default_loop_mutex before FSEventsWatcher.init(), which the old else-branch did not), it's worth a human sign-off.

Extended reasoning...

Overview

Two files changed:

  • src/bun.js/node/fs_events.zig: removes unlocked fast-path reads of fsevents_cf / fsevents_cs / fsevents_default_loop in four places (CoreFoundation.get(), CoreServices.get(), watch(), closeAndWait()), replacing them with always-lock-first. In watch(), also restructures to capture loop under fsevents_default_loop_mutex via a labeled block and release it before calling FSEventsWatcher.init() → registerWatcher() (which takes loop.mutex).
  • test/js/node/watch/fs.watch.worker-init-race.test.ts: new macOS-only stress test (20 fresh subprocesses × 8 Workers each racing on first fs.watch()).

I verified the precedent at path_watcher.zig:50-58: PathWatcherManager.get() uses the identical "no unlocked fast path / textbook broken DCLP" pattern and comment, and path_watcher.zig:704-709 confirms Darwin.addWatch calls FSEvents.watch() without holding manager.mutex, so the concurrent-entry premise is accurate.

Security risks

None. No auth/crypto/permissions/user-input handling; this is internal lock-ordering on lazy-initialized process globals.

Level of scrutiny

Medium-high. The transformation itself is the most conservative possible (drop the optimization, always lock) and is strictly safer for the DCLP concern. However, this is concurrency code in the runtime's macOS file-watching path, and there is a behavioral delta beyond "just add a lock": the old else-branch held fsevents_default_loop_mutex across FSEventsWatcher.init() (and thus across loop.mutex acquisition), whereas the new code releases it first. The PR's reasoning (loop is stable once published; only closeAndWait() at process exit clears it; avoids nesting the two mutexes) is sound and well-documented in the comment, but lock-ordering changes in production runtime code are exactly the kind of thing a maintainer should glance at rather than merge on bot approval alone.

Other factors

  • All CodeRabbit review comments are resolved (worker exit handling addressed in 51feedf; ASAN-stderr suggestion declined with a reasonable rationale matching sibling tests).
  • The one inline nit (vestigial parentPort.postMessage in the test fixture after the 51feedf refactor) is cosmetic and non-blocking.
  • No CODEOWNERS cover these paths.
  • CI shows build failures on several non-macOS shards, but they look like infra issues ("Failed to create agent", build-zig on freebsd/android/musl) rather than anything this macOS-only logic change would cause; still, a human should confirm green CI before merge.

Comment thread test/js/node/watch/fs.watch.worker-init-race.test.ts Outdated

@claude claude Bot left a comment

Contributor

LGTM — straightforward DCLP fix mirroring the existing PathWatcherManager.get() pattern; my earlier nit was addressed in 9ca0bf8.

Extended reasoning...

Overview

This PR removes the unlocked fast-path reads on fsevents_default_loop, fsevents_cf, and fsevents_cs in src/bun.js/node/fs_events.zig (macOS-only FSEvents code), replacing them with always-take-the-mutex-first. It also restructures watch() to release fsevents_default_loop_mutex before FSEventsWatcher.init() so the two mutexes never nest. A new macOS-only stress test (fs.watch.worker-init-race.test.ts) spawns 8 Workers per fresh process across 20 iterations to exercise the concurrent first-init path.

Security risks

None. No auth, crypto, permissions, or user-controlled input handling is touched — this is purely internal initialization synchronization for the FSEvents run loop.

Level of scrutiny

Low-to-moderate. The change is small (~20 lines of logic + comments), strictly additive in safety (adds locking, never removes it), and is a near-verbatim copy of the pattern already applied in PathWatcherManager.get() (path_watcher.zig:50-58) with the same explanatory comment. The mutex is taken once per fs.watch() call and is uncontended after init, so there's no meaningful performance concern. The lock-nesting avoidance in watch() is correct: loop is captured under the mutex and is stable thereafter (only closeAndWait() at process exit clears it).

Other factors

  • No CODEOWNERS coverage for these paths.
  • All CodeRabbit comments are resolved; the author addressed the worker-exit-tracking concern in 51feedf and gave a sound rationale (matching sibling tests + centralized bunEnv ASAN handling) for keeping expect(stderr).toBe("").
  • My own prior nit (vestigial parentPort.postMessage after the exit-based refactor) was addressed in 9ca0bf8.
  • The bug hunting system found no issues.
  • The test follows the established convention for test/js/node/watch/ race tests: standalone file, tempDir + bunEnv, fresh subprocess per iteration, exit code asserted last.

@robobun

robobun commented May 2, 2026

Collaborator Author

CI status

Build #50251 (9ca0bf8): all 6 darwin test lanes ✅ passed (darwin-26-aarch64 ×2, darwin-14-aarch64 ×2, darwin-14-x64 ×2) — fs.watch.worker-init-race.test.ts ran green on every macOS target. Only failure was test/bake/dev-and-prod.test.ts on windows-2019-x64, a known flake hitting 9/10 recent builds.

Build #51152 (9fea19c, after rebase onto src/ restructure): 261 jobs passed. Darwin shards that ran all ✅ passed. Failures:

  • Infra: :windows: x64 - build-bun expired (no agent) → all windows-2019-x64 test shards never ran. Several darwin-aarch64 / debian-asan shards also expired.
  • test-http-should-emit-close-when-connection-is-aborted.ts timeout on Windows — pre-existing flake also on builds 51148/51150/51151 and latest main (#50956 is red).

No failure touches fs_events.zig (macOS-only, not compiled on Windows) or the skipIf(!isMacOS) test.

robobun and others added 4 commits May 4, 2026 10:24
… fsevents_cf / fsevents_cs

Darwin.addWatch (path_watcher.zig) calls FSEvents.watch() without holding
manager.mutex (released first to keep lock order fsevents -> manager), so
two Workers can enter FSEvents.watch() concurrently.

The old code read fsevents_default_loop with no lock and no acquire fence;
only the else-branch took fsevents_default_loop_mutex. On ARM64 Worker A's
store of the pointer can become visible to Worker B before the stores
inside FSEventsLoop.init() (this.* = fs_loop), so Worker B would call
registerWatcher() on a partially-visible loop and lock a garbage
loop.mutex / read a garbage loop.watchers BabyList.

CoreFoundation.get() / CoreServices.get() / closeAndWait() had the
identical unlocked-fast-path pattern on fsevents_cf / fsevents_cs /
fsevents_default_loop.

Fix: always take the mutex first - exactly what PathWatcherManager.get()
already does for its own default_manager (with the same explanatory
comment). These run once per fs.watch() call; the mutex is uncontended
after initialization. In watch(), release the init mutex before calling
FSEventsWatcher.init() so we never nest fsevents_default_loop_mutex and
loop.mutex.

Adds a macOS-only regression stress test that spawns N Workers which each
call fs.watch() as their very first statement on distinct directories, in
a fresh process per iteration so the global starts null every time.
@Jarred-Sumner Jarred-Sumner force-pushed the farm/10e14f47/fsevents-dclp branch from 9ca0bf8 to f5d25eb Compare May 4, 2026 10:24
Comment thread test/js/node/watch/fs.watch.worker-init-race.test.ts Outdated

@claude claude Bot left a comment

Contributor

The fix looks correct and mirrors the established PathWatcherManager.get() pattern, but since this is native concurrency/lock-ordering code in the macOS fs.watch() path — and watch() now releases fsevents_default_loop_mutex before FSEventsWatcher.init() where the old else-branch held it — it's worth a quick human sanity check on the lock-scope change.

Extended reasoning...

Overview

Two files: src/runtime/node/fs_events.zig (macOS FSEvents backend for fs.watch()) and a new macOS-only stress test. The Zig change removes the unlocked fast-path read of three globals (fsevents_cf, fsevents_cs, fsevents_default_loop) in CoreFoundation.get(), CoreServices.get(), watch(), and closeAndWait(), so the mutex is always acquired before reading. In watch(), the loop pointer is captured under the lock via a labeled block and the lock is released before calling FSEventsWatcher.init() (which takes loop.mutex via registerWatcher()), avoiding nesting the two mutexes.

Security risks

None. No user-controlled input, auth, crypto, or external I/O is involved — this is purely internal mutex acquisition ordering around process-global singletons.

Level of scrutiny

Moderate-to-high. The diff is small (~30 lines of real change) and exactly mirrors the already-landed pattern in PathWatcherManager.get() (path_watcher.zig:50–70, verified), and the change is strictly more locking, not less. CI is green on all six darwin lanes including Apple Silicon. However, this is concurrency code in the native runtime on a production-critical path (fs.watch()), and there is one non-mechanical decision: the old else-branch held fsevents_default_loop_mutex across FSEventsWatcher.init(); the new code releases it first. The reasoning (loop is stable once published; only closeAndWait() at process exit clears it; avoids a two-level lock nest) is sound, but lock-scope changes in native code are exactly where a second pair of eyes is cheap insurance.

Other factors

All prior review feedback is resolved: CodeRabbit's worker-exit handling suggestion was applied in 51feedf, my vestigial-postMessage nit was applied in 9ca0bf8, and my stale-path-comment nit was applied in 9fea19c. No CODEOWNERS entry covers src/runtime/node/. The bug-hunting system found nothing. The new test follows the established standalone-race-test convention for test/js/node/watch/. I'd approve if this were a pure "always take the lock" change with no scope adjustment, but the lock-release-before-init refinement nudges it just over my threshold for auto-approval.
