fs.watch: fix PathWatcher double-free race between close() and DirectoryRegisterTask#29936
fs.watch: fix PathWatcher double-free race between close() and DirectoryRegisterTask#29936robobun wants to merge 4 commits into
Conversation
…oryRegisterTask
PathWatcher.deinit() did:
setClosed(); // lock; closed=true; unlock
if (hasPendingDirectories()) // lock-free atomic load
return;
...destroy(this)
On Linux/FreeBSD, watching a directory schedules a DirectoryRegisterTask
on the work pool. When its unrefPendingDirectory() landed in the gap
between setClosed()'s unlock and the hasPendingDirectories() load, it
observed closed==true, stored has_pending_directories=false, and
scheduled its own deinit(). Both the main thread and the worker then
saw has_pending_directories==false and both proceeded to
bun.default_allocator.destroy(this) — a double-free.
In release builds this corrupted mimalloc's cross-thread free list; on
alpine aarch64 CI the next PathWatcher.init() allocation segfaulted at
0x75622F706D742F (ASCII "/tmp/bu") walking that list. Under ASAN it
reports use-after-poison reading this.file_paths from the freed struct.
Merge closed=true and the has_pending_directories check into a single
critical section so the worker cannot interleave: the main-thread
deinit() now always observes has_pending_directories==true (the worker
needs the same lock plus closed==true to clear it) and returns early;
the worker's deinit() is the sole owner of teardown. File watches and
macOS FSEvents (no DirectoryRegisterTask, atomic never set) proceed
directly as before.
|
Updated 3:40 PM PT - Apr 29th, 2026
❌ @robobun, your commit 8725d03 has 4 failures in
🧪 To try this PR locally: bunx bun-pr 29936That installs a local version of the PR into your bun-29936 --bun |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughPathWatcher teardown in Changes
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Review rate limit: 2/5 reviews remaining, refill in 33 minutes and 39 seconds. Comment |
|
Found 1 issue this PR may fix:
🤖 Generated with Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/node/watch/fs.watch.close-race.test.ts`:
- Around line 1-77: This new regression test "close() racing
DirectoryRegisterTask completion does not double-free PathWatcher" should be
moved into the existing fs.watch test file rather than created as a new file;
copy the entire test.skipIf(...) block (including the tempDir usage, fixture,
ATTEMPTS loop and final expect) and append it to the existing fs.watch test
suite, ensuring you keep the imports used (expect, test, bunEnv, bunExe,
isMacOS, isWindows, tempDir) and preserve the test name and timeout (60000) and
local symbols (using dir, fixture, ATTEMPTS, results) so the test runs in the
same module context as the other fs.watch tests; delete this new standalone file
after adding the test to avoid duplicate declarations.
- Around line 69-75: Replace the brittle aggregate equality check on the
collected results with per-attempt assertions: for each entry in results
(collected from proc.stdout.text(), proc.stderr.text(), proc.exited where
ATTEMPTS and proc come from the test), assert stdout.trim() === "ok 3000" and
exitCode === 0 individually, and only assert or surface stderr when exitCode !==
0 (e.g., include stderr in the failure message or call expect(stderr).toBe("")
conditionally) so that bunExe()/bunEnv spawned subprocess stderr is used as
diagnostic rather than a strict empty-string requirement.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b14c11de-ecd6-4998-9c76-0b006be001e8
📒 Files selected for processing (1)
test/js/node/watch/fs.watch.close-race.test.ts
|
Not a match for #18919 — that issue is about re-watching a file after This PR only fixes the |
…er ASAN startup warning from subprocess stderr
…hwatcher-double-free
There was a problem hiding this comment.
The fix looks correct and minimal — merging closed=true and the has_pending_directories check into one critical section closes the interleaving the PR description lays out, and my earlier nit about the now-dead helpers was addressed. I'm deferring only because cross-thread teardown ownership in this file has a track record of subtle adjacent issues (per the "Not fixed here" section and prior PRs), so a second pair of human eyes on the concurrency reasoning is worth the cost.
Extended reasoning...
Overview
Two files: src/bun.js/node/path_watcher.zig (rewrites the opening of PathWatcher.deinit() to set closed=true and check has_pending_directories under a single this.mutex hold instead of two separate operations with a gap; removes the now-unused setClosed()/hasPendingDirectories() helpers; updates a comment in unrefPendingDirectory()) and a new Linux/FreeBSD-only stress test test/js/node/watch/fs.watch.close-race.test.ts that runs 4 subprocesses × 3000 watch/close iterations.
Security risks
None. This is internal lifecycle/teardown logic for fs.watch directory watchers; no auth, crypto, parsing, or untrusted-input surface is touched.
Level of scrutiny
High. The change is small (~30 net lines in the .zig file, mostly comments), but it governs which of two threads owns destruction of a heap object. I traced the three relevant interleavings (main wins lock first, worker wins lock first, no contention) and the fix correctly eliminates the double-destroy: under the new code the main-thread deinit() can only proceed past the early-return when has_pending_directories is false, and the worker can only clear that flag while holding the same lock after observing closed==true — so at most one caller falls through. The pre-existing leak case (worker finishes before close(), flag never cleared) is unchanged and explicitly documented as out of scope. However, this file has a documented history (#27469, #28104, #29391) where seemingly-correct adjustments to this exact atomic exposed latent UAFs in the main_watcher watchlist path-string ownership, so concurrency changes here merit human sign-off.
Other factors
The PR description includes a precise interleaving table and ASAN repro; verification covers the new test (3/3 fail unpatched, 3/3 pass patched), the full fs.watch.test.ts suite (15/15 clean), and the sibling deadlock/events-cb-race tests. My earlier review feedback (delete dead helpers) and CodeRabbit's stderr-filtering suggestion were both addressed in 224a696; all inline threads are resolved. The standalone test file follows the established pattern in test/js/node/watch/ for race-condition regressions.
|
Closing — obsoleted by #29952, which rewrote The double-free race this PR fixed was between The merge conflict with |
What
Fixes a double-free of
PathWatcherwhenfs.watch(<directory>).close()races the work-pool directory scan, surfacing in CI as a segfault infs.watch.test.tsonalpine-3.23-aarch64(build #49108).Reproduction
Under
bun bd(ASAN) on Linux this crashes ~90% of runs with:In release builds the double-free corrupts mimalloc's cross-thread free list; on alpine aarch64 the next
PathWatcher.init()allocation segfaulted at0x75622F706D742F(ASCII/tmp/bu) walking that list — this is the CI crash in the newclosed FSWatcher is collectabletest (64× watch→close in a fresh subprocess) from #29907.Root cause
PathWatcher.deinit()did:On Linux/FreeBSD, watching a directory schedules a
DirectoryRegisterTaskon the work pool (refPendingDirectory()→has_pending_directories = true). When the task completes it callsunrefPendingDirectory(), which — ifclosedis already true — storeshas_pending_directories = falseand schedules its owndeinit():setClosed()→ lock;closed=true; unlockunrefPendingDirectory()→ lock;pending=0; seesclosed==true→has_pending=false,should_deinit=true; unlockhasPendingDirectories()→ false → proceedunregisterWatcher,file_paths.deinit,destroy(this)deinit()→has_pending==false→ proceedunregisterWatcher(no-op),file_paths.deinit(UAF),destroy(this)← double-freeThe gap between
setClosed()'s unlock and the lock-freehasPendingDirectories()load lets the worker interleave and clear the atomic, so both callers pass the check.This race predates #29907 — the new test's tight watch→close loop in a fresh subprocess just made it reproducible.
Fix
Merge
closed = trueand thehas_pending_directoriescheck into a singlethis.mutexcritical section indeinit(). With the gap closed, the main-threaddeinit()(for a directory watcher) always observeshas_pending_directories == true— the worker needs the same lock andclosed == trueto clear it, and the main thread is the one that just setclosedunder this lock hold — so it returns early, and the worker's deferreddeinit()is the sole owner of teardown. File watches and macOS FSEvents (noDirectoryRegisterTask, so the atomic was never set) proceed directly as before.Verification
test/js/node/watch/fs.watch.close-race.test.tsruns 4× subprocesses each doing 3000fs.watch(dir).close()on a 1-file directory. Without the fix under ASAN: fails 3/3. With the fix: passes 3/3; targeted repro 60/60.fs.watch.test.tssuite: 15/15 clean runs with the fix, no ASAN errors.fs.watch.deadlock.test.tsandfs.watch.events-cb-race.test.tsstill pass.bun run zig:check-allpasses on all targets.Not fixed here
Two adjacent pre-existing issues noticed while investigating, intentionally left for follow-up to keep this change minimal:
unrefPendingDirectory()beforeclose()(soclosed == falseat its check),has_pending_directoriesis never cleared and the main thread'sdeinit()returns early forever → thePathWatcherleaks. Fixing this naively (clearing the atomic unconditionally / checkingpending_directoriesdirectly) exposes a latent UAF:_decrementPathRefNoLock()frees the path string whilemain_watcher.watchliststill borrows it across theremove()→flushEvictions()gap, and the File-Watcher thread'sonFileUpdate()then reads freed memory.Watcher.evict_list(8096 entries) is only drained byflushEvictions()fromonFileUpdate(); manyclose()calls without any fs events eventually overflow it (index out of bounds: index 8096, len 8096).