
fs.watch: fix PathWatcher double-free race between close() and DirectoryRegisterTask#29936

Closed
robobun wants to merge 4 commits into main from farm/e59010f8/fix-pathwatcher-double-free


@robobun

@robobun robobun commented Apr 29, 2026

Collaborator

What

Fixes a double-free of PathWatcher when fs.watch(<directory>).close() races the work-pool directory scan, surfacing in CI as a segfault in fs.watch.test.ts on alpine-3.23-aarch64 (build #49108).

Reproduction

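const fs = require('fs');
const os = require('os');
const path = require('path');
// Temporary directory containing one file.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'watch-race-'));
fs.writeFileSync(path.join(dir, 'file.txt'), '');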
for (let i = 0; i < 3000; i++) {
  const w = fs.watch(dir, { persistent: false }, () => {});
  w.close();
}

Under bun bd (ASAN) on Linux this crashes ~90% of runs with:

AddressSanitizer: use-after-poison
READ in BabyList([:0]const u8).assertOwned
  from PathWatcher.deinit → this.file_paths.deinit()   ← main thread T0
  from FSWatcher.close

In release builds the double-free corrupts mimalloc's cross-thread free list. On alpine aarch64 the next PathWatcher.init() allocation segfaulted at 0x75622F706D742F (ASCII /tmp/bu) while walking that list. This is the CI crash in the new "closed FSWatcher is collectable" test (64× watch→close in a fresh subprocess) from #29907.
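To see why that fault address reads as path text, the sketch below decodes the hex value as the bytes it would occupy in memory, assuming a little-endian machine. The helper name `littleEndianAscii` is made up for illustration and is not part of Bun.

```typescript
// Hypothetical helper: interpret a pointer-sized value (given as hex) as the
// bytes it occupies in little-endian memory, rendered as ASCII.
function littleEndianAscii(hex: string): string {
  const digits = hex.replace(/^0x/i, "");
  // Pad to whole bytes, then split into two-digit pairs.
  const padded = digits.length % 2 === 0 ? digits : "0" + digits;
  const bytes: number[] = [];
  for (let i = 0; i < padded.length; i += 2) {
    bytes.push(parseInt(padded.slice(i, i + 2), 16));
  }
  // The lowest-order byte sits first in memory on a little-endian machine.
  return bytes.reverse().map((b) => String.fromCharCode(b)).join("");
}

const decodedFaultAddress = littleEndianAscii("0x75622F706D742F");
console.log(decodedFaultAddress); // prints /tmp/bu
```

A free-list "next" pointer that decodes to the start of a temp-directory path suggests the freed block had been reused for a path string before the allocator walked the list.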

Root cause

PathWatcher.deinit() did:

setClosed();                    // lock; closed=true; unlock
if (hasPendingDirectories())    // lock-free atomic load
    return;
...destroy(this)

On Linux/FreeBSD, watching a directory schedules a DirectoryRegisterTask on the work pool (refPendingDirectory() → has_pending_directories = true). When the task completes it calls unrefPendingDirectory(), which, if closed is already true, stores has_pending_directories = false and schedules its own deinit():

| main thread | worker thread |
| --- | --- |
| setClosed() → lock; closed=true; unlock | |
| | unrefPendingDirectory() → lock; pending=0; sees closed==true → has_pending=false, should_deinit=true; unlock |
| hasPendingDirectories() → false → proceed | |
| unregisterWatcher, file_paths.deinit, destroy(this) | deinit() → has_pending==false → proceed |
| | unregisterWatcher (no-op), file_paths.deinit (UAF), destroy(this) ← double-free |

The gap between setClosed()'s unlock and the lock-free hasPendingDirectories() load lets the worker interleave and clear the atomic, so both callers pass the check.
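The interleaving can be replayed deterministically in a single-threaded model. This is only a sketch: each function stands for one step that runs atomically, the call order stands for the thread schedule, and every name here (`WatcherModel`, `checkAndDestroy`, and so on) is hypothetical rather than taken from path_watcher.zig.

```typescript
// Hypothetical model of the pre-fix protocol. Not Bun's code.
interface WatcherModel {
  closed: boolean;
  pendingDirectories: number;
  hasPendingDirectories: boolean; // stands in for the atomic flag
  destroyCalls: number;
}

function newDirectoryWatcher(): WatcherModel {
  // Watching a directory has already scheduled one DirectoryRegisterTask.
  return { closed: false, pendingDirectories: 1, hasPendingDirectories: true, destroyCalls: 0 };
}

// Main thread, step 1: take the lock, set closed, release the lock.
function setClosed(w: WatcherModel): void {
  w.closed = true;
}

// Worker: runs under the lock. Returns should_deinit.
function unrefPendingDirectory(w: WatcherModel): boolean {
  w.pendingDirectories -= 1;
  if (w.pendingDirectories === 0 && w.closed) {
    w.hasPendingDirectories = false;
    return true;
  }
  return false;
}

// Either thread: the lock-free flag check followed by teardown.
function checkAndDestroy(w: WatcherModel): void {
  if (w.hasPendingDirectories) return;
  w.destroyCalls += 1;
}

const racy = newDirectoryWatcher();
setClosed(racy);                                         // main
const workerShouldDeinit = unrefPendingDirectory(racy);  // worker lands in the gap
checkAndDestroy(racy);                                   // main: flag is false, proceeds
if (workerShouldDeinit) checkAndDestroy(racy);           // worker's deferred deinit()
console.log(racy.destroyCalls); // 2
```

Both callers reach the teardown step, which in the real code corresponds to two destroy(this) calls on the same object.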

This race predates #29907 — the new test's tight watch→close loop in a fresh subprocess just made it reproducible.

Fix

Merge closed = true and the has_pending_directories check into a single this.mutex critical section in deinit(). With the gap closed, the main-thread deinit() for a directory watcher always observes has_pending_directories == true: the worker can clear the flag only while holding the same lock and only after seeing closed == true, and the main thread sets closed inside the very lock hold in which it reads the flag. The main thread therefore returns early, and the worker's deferred deinit() is the sole owner of teardown. File watches and macOS FSEvents (no DirectoryRegisterTask, so the atomic was never set) proceed directly as before.
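A small single-threaded model can illustrate the merged critical section described above. It is a sketch with hypothetical names (`FixedWatcherModel`, `deinitFixed`, `unrefFixed`), where each function body stands for one lock hold, and it is not the code in path_watcher.zig.

```typescript
// Hypothetical model of the post-fix protocol. Not Bun's code.
interface FixedWatcherModel {
  closed: boolean;
  pendingDirectories: number;
  hasPendingDirectories: boolean;
  destroyCalls: number;
}

function makeWatcher(isDirectory: boolean): FixedWatcherModel {
  // Only directory watches schedule a register task and set the flag.
  return {
    closed: false,
    pendingDirectories: isDirectory ? 1 : 0,
    hasPendingDirectories: isDirectory,
    destroyCalls: 0,
  };
}

// deinit(): setting closed and reading the flag share one critical section.
function deinitFixed(w: FixedWatcherModel): void {
  w.closed = true;
  const pending = w.hasPendingDirectories;
  if (pending) return; // a pending worker owns teardown
  w.destroyCalls += 1;
}

// Worker: runs under the same lock. Returns should_deinit.
function unrefFixed(w: FixedWatcherModel): boolean {
  w.pendingDirectories -= 1;
  if (w.pendingDirectories === 0 && w.closed) {
    w.hasPendingDirectories = false;
    return true;
  }
  return false;
}

// Order A: main takes the lock first, worker second.
const mainFirst = makeWatcher(true);
deinitFixed(mainFirst);
if (unrefFixed(mainFirst)) deinitFixed(mainFirst);

// Order B: worker takes the lock first, main second.
const workerFirst = makeWatcher(true);
if (unrefFixed(workerFirst)) deinitFixed(workerFirst);
deinitFixed(workerFirst);

// File watch: no register task, so the flag was never set.
const fileWatch = makeWatcher(false);
deinitFixed(fileWatch);

console.log(mainFirst.destroyCalls, workerFirst.destroyCalls, fileWatch.destroyCalls); // 1 0 1
```

No ordering reaches teardown twice. Order B never reaches it at all in this model, which corresponds to the pre-existing leak this PR leaves unfixed.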

Verification

  • New test test/js/node/watch/fs.watch.close-race.test.ts runs 4 subprocesses, each doing 3000 fs.watch(dir).close() calls on a 1-file directory. Without the fix under ASAN: fails 3/3. With the fix: passes 3/3; targeted repro 60/60.
  • Full fs.watch.test.ts suite: 15/15 clean runs with the fix, no ASAN errors.
  • fs.watch.deadlock.test.ts and fs.watch.events-cb-race.test.ts still pass.
  • bun run zig:check-all passes on all targets.

Not fixed here

Two adjacent pre-existing issues noticed while investigating, intentionally left for follow-up to keep this change minimal:

  1. When the worker finishes unrefPendingDirectory() before close() (so closed == false at its check), has_pending_directories is never cleared and the main thread's deinit() returns early forever → the PathWatcher leaks. Fixing this naively (clearing the atomic unconditionally / checking pending_directories directly) exposes a latent UAF: _decrementPathRefNoLock() frees the path string while main_watcher.watchlist still borrows it across the remove() → flushEvictions() gap, and the File-Watcher thread's onFileUpdate() then reads freed memory.
  2. Watcher.evict_list (8096 entries) is only drained by flushEvictions() from onFileUpdate(); many close() calls without any fs events eventually overflow it (index out of bounds: index 8096, len 8096).
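The second issue can be pictured with a fixed-capacity list that is appended to on close and drained only on the event path. The class below is a hypothetical model written for illustration. Only the capacity and the error text come from the description above, and the real Watcher.evict_list is not assumed to look like this.

```typescript
// Hypothetical model of a bounded eviction list. Not Bun's code.
class EvictListModel {
  private entries: number[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Models close(): queue a watch index for eviction.
  push(index: number): void {
    if (this.entries.length >= this.capacity) {
      throw new RangeError(
        `index out of bounds: index ${this.entries.length}, len ${this.capacity}`,
      );
    }
    this.entries.push(index);
  }

  // Models flushEvictions(): in this sketch it runs only when an fs event arrives.
  flush(): number {
    const drained = this.entries.length;
    this.entries = [];
    return drained;
  }
}

const evictList = new EvictListModel(8096);
let overflowMessage = "";
try {
  // 8097 closes in a row with no fs event in between.
  for (let i = 0; i <= 8096; i++) evictList.push(i);
} catch (e) {
  overflowMessage = (e as Error).message;
}
console.log(overflowMessage); // index out of bounds: index 8096, len 8096
```

In this model a single flush() between closes resets the list, which is why the overflow needs many close() calls without any intervening fs event.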

…oryRegisterTask

PathWatcher.deinit() did:
    setClosed();                    // lock; closed=true; unlock
    if (hasPendingDirectories())    // lock-free atomic load
        return;
    ...destroy(this)

On Linux/FreeBSD, watching a directory schedules a DirectoryRegisterTask
on the work pool. When its unrefPendingDirectory() landed in the gap
between setClosed()'s unlock and the hasPendingDirectories() load, it
observed closed==true, stored has_pending_directories=false, and
scheduled its own deinit(). Both the main thread and the worker then
saw has_pending_directories==false and both proceeded to
bun.default_allocator.destroy(this) — a double-free.

In release builds this corrupted mimalloc's cross-thread free list; on
alpine aarch64 CI the next PathWatcher.init() allocation segfaulted at
0x75622F706D742F (ASCII "/tmp/bu") walking that list. Under ASAN it
reports use-after-poison reading this.file_paths from the freed struct.

Merge closed=true and the has_pending_directories check into a single
critical section so the worker cannot interleave: the main-thread
deinit() now always observes has_pending_directories==true (the worker
needs the same lock plus closed==true to clear it) and returns early;
the worker's deinit() is the sole owner of teardown. File watches and
macOS FSEvents (no DirectoryRegisterTask, atomic never set) proceed
directly as before.
@robobun

robobun commented Apr 29, 2026

Collaborator Author
Updated 3:40 PM PT - Apr 29th, 2026

@robobun, your commit 8725d03 has 4 failures in Build #49213 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 29936

That installs a local version of the PR into your bun-29936 executable, so you can run:

bun-29936 --bun

@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 90d429f3-fda6-4d3a-9e42-81c0b7f71d49

📥 Commits

Reviewing files that changed from the base of the PR and between cd1d311 and 8725d03.

📒 Files selected for processing (2)
  • src/bun.js/node/path_watcher.zig
  • test/js/node/watch/fs.watch.close-race.test.ts

Walkthrough

PathWatcher teardown in src/bun.js/node/path_watcher.zig was changed to use a single mutex-protected critical section that atomically sets closed and checks pending directories; a platform-gated regression test test/js/node/watch/fs.watch.close-race.test.ts was added for Linux/FreeBSD fs.watch close races.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| PathWatcher Race Condition & API removal (src/bun.js/node/path_watcher.zig) | Rewrote PathWatcher.deinit() to lock the mutex, set closed=true, and check has_pending_directories inside the same critical section to avoid teardown races. Removed pub fn hasPendingDirectories(...) and pub fn setClosed(...). Updated comments in unrefPendingDirectory() to reflect the new control flow. |
| Regression Tests (test/js/node/watch/fs.watch.close-race.test.ts) | Added a new Linux/FreeBSD-only regression test that spawns subprocesses which repeatedly create non-persistent fs.watch(dir) and immediately call close() to reproduce/guard against a race causing double-free/crash; skips Windows/macOS and asserts each subprocess prints ok 3000, has empty stderr, and exits 0. |
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: fixing a PathWatcher double-free race between close() and DirectoryRegisterTask, which is exactly the core issue addressed by the PR. |
| Description check | ✅ Passed | The description comprehensively covers both required template sections: 'What does this PR do?' (detailed explanation of the double-free bug, root cause analysis, and fix) and 'How did you verify your code works?' (new test, existing test suite results, zig validation). Includes reproduction steps, root cause analysis, and verification details. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions

Contributor

Found 1 issue this PR may fix:

  1. #18919, "fs.watch does not work after previous .close()": the PR's PathWatcher lifecycle fix (pre-existing leak where has_pending_directories is never cleared, so deinit() returns early forever) directly explains why re-watching the same path after .close() silently stops working on Linux.

If this is helpful, copy the block below into the PR description to auto-close this issue on merge.

Fixes #18919

🤖 Generated with Claude Code

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/node/watch/fs.watch.close-race.test.ts`:
- Around line 1-77: This new regression test "close() racing
DirectoryRegisterTask completion does not double-free PathWatcher" should be
moved into the existing fs.watch test file rather than created as a new file;
copy the entire test.skipIf(...) block (including the tempDir usage, fixture,
ATTEMPTS loop and final expect) and append it to the existing fs.watch test
suite, ensuring you keep the imports used (expect, test, bunEnv, bunExe,
isMacOS, isWindows, tempDir) and preserve the test name and timeout (60000) and
local symbols (using dir, fixture, ATTEMPTS, results) so the test runs in the
same module context as the other fs.watch tests; delete this new standalone file
after adding the test to avoid duplicate declarations.
- Around line 69-75: Replace the brittle aggregate equality check on the
collected results with per-attempt assertions: for each entry in results
(collected from proc.stdout.text(), proc.stderr.text(), proc.exited where
ATTEMPTS and proc come from the test), assert stdout.trim() === "ok 3000" and
exitCode === 0 individually, and only assert or surface stderr when exitCode !==
0 (e.g., include stderr in the failure message or call expect(stderr).toBe("")
conditionally) so that bunExe()/bunEnv spawned subprocess stderr is used as
diagnostic rather than a strict empty-string requirement.
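The second suggestion above could be sketched as a small pure helper that checks each attempt on its own and attaches stderr only when the exit code is nonzero. The names `AttemptResult` and `describeFailures`, and the sample data, are invented for illustration; the actual test would feed in values collected from its spawned subprocesses.

```typescript
// Hypothetical shape of one subprocess result.
interface AttemptResult {
  stdout: string;
  stderr: string;
  exitCode: number;
}

// Returns one message per failed attempt; an empty array means all passed.
function describeFailures(results: AttemptResult[], expected: string): string[] {
  const failures: string[] = [];
  results.forEach((r, i) => {
    const okOutput = r.stdout.trim() === expected;
    const okExit = r.exitCode === 0;
    if (okOutput && okExit) return;
    // stderr is a diagnostic, included only when the process exited nonzero.
    const detail = okExit ? "" : ` stderr: ${r.stderr.trim()}`;
    failures.push(
      `attempt ${i}: exit ${r.exitCode}, stdout ${JSON.stringify(r.stdout.trim())}${detail}`,
    );
  });
  return failures;
}

// Invented sample data: a clean run, a clean run with a stderr warning,
// and a crashed run.
const sampleResults: AttemptResult[] = [
  { stdout: "ok 3000\n", stderr: "", exitCode: 0 },
  { stdout: "ok 3000\n", stderr: "some startup warning\n", exitCode: 0 },
  { stdout: "", stderr: "AddressSanitizer: use-after-poison\n", exitCode: 134 },
];
const attemptFailures = describeFailures(sampleResults, "ok 3000");
console.log(attemptFailures);
```

With this shape, a warning on stderr from an otherwise successful subprocess does not fail the test, while a crash reports its attempt index and stderr together.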

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b14c11de-ecd6-4998-9c76-0b006be001e8

📥 Commits

Reviewing files that changed from the base of the PR and between 7bcb021 and cd1d311.

📒 Files selected for processing (1)
  • test/js/node/watch/fs.watch.close-race.test.ts

@robobun

robobun commented Apr 29, 2026

Collaborator Author

Not a match for #18919 — that issue is about re-watching a file after close() and the second watcher never firing. Tested the repro with this PR applied: still reproduces (File changed → timeout, second watcher never fires).

This PR only fixes the PathWatcher double-free race for directory watches. The has_pending_directories leak the bot references only applies to directory watches (file watches never schedule a DirectoryRegisterTask), and is intentionally left unfixed here — see the "Not fixed here" section.

…er ASAN startup warning from subprocess stderr

@claude claude Bot left a comment


The fix looks correct and minimal — merging closed=true and the has_pending_directories check into one critical section closes the interleaving the PR description lays out, and my earlier nit about the now-dead helpers was addressed. I'm deferring only because cross-thread teardown ownership in this file has a track record of subtle adjacent issues (per the "Not fixed here" section and prior PRs), so a second pair of human eyes on the concurrency reasoning is worth the cost.

Extended reasoning...

Overview

Two files: src/bun.js/node/path_watcher.zig (rewrites the opening of PathWatcher.deinit() to set closed=true and check has_pending_directories under a single this.mutex hold instead of two separate operations with a gap; removes the now-unused setClosed()/hasPendingDirectories() helpers; updates a comment in unrefPendingDirectory()) and a new Linux/FreeBSD-only stress test test/js/node/watch/fs.watch.close-race.test.ts that runs 4 subprocesses × 3000 watch/close iterations.

Security risks

None. This is internal lifecycle/teardown logic for fs.watch directory watchers; no auth, crypto, parsing, or untrusted-input surface is touched.

Level of scrutiny

High. The change is small (~30 net lines in the .zig file, mostly comments), but it governs which of two threads owns destruction of a heap object. I traced the three relevant interleavings (main wins lock first, worker wins lock first, no contention) and the fix correctly eliminates the double-destroy: under the new code the main-thread deinit() can only proceed past the early-return when has_pending_directories is false, and the worker can only clear that flag while holding the same lock after observing closed==true — so at most one caller falls through. The pre-existing leak case (worker finishes before close(), flag never cleared) is unchanged and explicitly documented as out of scope. However, this file has a documented history (#27469, #28104, #29391) where seemingly-correct adjustments to this exact atomic exposed latent UAFs in the main_watcher watchlist path-string ownership, so concurrency changes here merit human sign-off.

Other factors

The PR description includes a precise interleaving table and ASAN repro; verification covers the new test (3/3 fail unpatched, 3/3 pass patched), the full fs.watch.test.ts suite (15/15 clean), and the sibling deadlock/events-cb-race tests. My earlier review feedback (delete dead helpers) and CodeRabbit's stderr-filtering suggestion were both addressed in 224a696; all inline threads are resolved. The standalone test file follows the established pattern in test/js/node/watch/ for race-condition regressions.

@robobun

robobun commented May 1, 2026

Collaborator Author

Closing — obsoleted by #29952, which rewrote path_watcher.zig to own inotify/FSEvents/kqueue directly instead of going through bun.Watcher.

The double-free race this PR fixed was between PathWatcher.deinit() (main thread) and the work-pool DirectoryRegisterTask's unrefPendingDirectory() (worker thread), both passing the lock-free has_pending_directories check. #29952 removed DirectoryRegisterTask, unrefPendingDirectory, has_pending_directories, and the three-mutex architecture entirely — the new design has one mutex and no work-pool task, so the race cannot exist.

The merge conflict with main is the entire old PathWatcher struct vs the new one; there's nothing meaningful to carry over.

@robobun robobun closed this May 1, 2026