Skip to content

Fix PathWatcherManager deadlock and UAF in deferred deinit#28088

Closed
robobun wants to merge 8 commits into
mainfrom
farm/c3c48d40/fix-pathwatcher-deadlock
Closed

Fix PathWatcherManager deadlock and UAF in deferred deinit#28088
robobun wants to merge 8 commits into
mainfrom
farm/c3c48d40/fix-pathwatcher-deadlock

Conversation

@robobun

@robobun robobun commented Mar 14, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?

Fixes multiple concurrency bugs in PathWatcherManager and PathWatcher that cause deadlocks, hangs, and use-after-free when fs.watch() watchers are rapidly created and destroyed.

This is a minimal, focused port of #27469 (by @chrislloyd) onto current main. Previous attempts (#27957) added too much scope and introduced new issues. This PR applies only the defer-deinit pattern fixes.

Root causes

  1. unregisterWatcher self-deadlockdeinit() called inside a defer that fires while this.mutex is still held. deinit() re-acquires the same non-recursive mutex → self-deadlock in __ulock_wait2.

  2. unrefPendingTask UAF — Same pattern. If deinit() completes and calls destroy(this), the subsequent defer this.mutex.unlock() writes to freed memory.

  3. processWatcher AB/BA lock inversion — Worker thread holds watcher.mutex → calls _decrementPathRef → acquires manager.mutex. Main thread unregisterWatcher holds manager.mutex → acquires watcher.mutex. Classic AB/BA.

  4. unrefPendingDirectory self-deadlockdeinit() calls setClosed() which re-locks this.mutex while it's already held.

  5. onError lock ordering — Acquired this.mutex then default_manager_mutex, but deinit() acquires them in reverse order.

Fix

All fixes use the same pattern: set a should_deinit flag under the lock, defer actual deinit() call to after the mutex is released. For processWatcher, unlock watcher.mutex before calling _decrementPathRef. For onError, split into two separate lock scopes.

How did you verify your code works?

  • bun bd test test/js/node/watch/fs.watch.deadlock.test.ts passes
  • Existing fs.watch.test.ts crashes at unrefPendingDirectory on unmodified main (confirming the bug), no longer crashes with this fix
  • The deadlock is race-condition dependent so the test exercises the code path but may not reliably trigger the hang on every run

Based on #27469 by @chrislloyd.


Verification (b674d12): CI green (56/58 pass; 1 infra flake upload-benchmark.mjs also fails on main; 1 windows-aarch64 still running). ASAN build+test passed — critical for this UAF fix. Baked binary (main) deadlocks on the new test; all CI test suites pass on PR binary. 4/4 bot review threads resolved. Diff matches claimed root causes.

Four fixes in the deferred-deinit pattern used by PathWatcherManager,
all triggered by rapid fs.watch() churn:

1. unregisterWatcher self-deadlock: deinit() was called while holding
   this.mutex. deinit() re-acquires the same non-recursive mutex when
   hasPendingTasks() is true, causing self-deadlock. Fix: hoist to a
   should_deinit flag checked by a defer registered before the lock.

2. unrefPendingTask UAF: same shape — deinit() called while holding
   mutex. If deinit() completes and calls destroy(this), the deferred
   mutex.unlock() writes to freed memory. Same fix pattern.

3. processWatcher AB/BA lock inversion: worker thread holds
   watcher.mutex then calls _decrementPathRef which acquires
   manager.mutex. Main thread unregisterWatcher holds manager.mutex
   then wants watcher.mutex. Fix: unlock watcher.mutex before calling
   _decrementPathRef.

4. unrefPendingDirectory self-deadlock: deinit() calls setClosed()
   which re-locks this.mutex. Same should_deinit pattern.

5. onError lock ordering: acquired this.mutex then default_manager_mutex,
   but deinit() acquires them in reverse order. Fix: release this.mutex
   before acquiring default_manager_mutex.

Based on #27469 by @chrislloyd.
@robobun

robobun commented Mar 14, 2026

Copy link
Copy Markdown
Collaborator Author

✅ Opened fresh PR #28104 with the targeted fix from #27469 (rebased onto current main).

Three fixes in path_watcher.zig:

  1. unregisterWatcher self-deadlock (defer ordering)
  2. unrefPendingTask UAF (same pattern)
  3. processWatcher AB/BA lock inversion (unlock watcher.mutex before _decrementPathRef)

Debug build compiles, waiting for CI.

@coderabbitai

coderabbitai Bot commented Mar 14, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds deferred deinitialization guards, TOCTOU protections, and mutex-ordering fixes to PathWatcher and Watcher cleanup paths to prevent double-deinit/UAF and deadlocks; introduces a skip-thread-cleanup flag for owner-led teardown and a subprocess test that rapidly creates/closes recursive fs.watch watchers to detect deadlocks.

Changes

Cohort / File(s) Summary
Path watcher implementation
src/bun.js/node/path_watcher.zig
Adds guarded, deferred deinit flow (e.g., should_deinit/needs_deinit), preserves lock ordering by acquiring default_manager_mutex after releasing watcher mutex, adds deinit_started: std.atomic.Value(bool), and avoids destroying while holding/unlocking mutexes across error/unregister/close paths.
Watcher thread cleanup flag
src/Watcher.zig
Adds public field skip_thread_cleanup: bool = false and updates threadMain and deinit so thread-owned cleanup (fd close, watchlist/trace/allocator deinit) can be skipped when an external owner manages teardown, preventing use-after-free races.
Deadlock detection test
test/js/node/watch/fs.watch.deadlock.test.ts
Adds a subprocess-based test that creates a deep temp directory tree and repeatedly opens/closes recursive fs.watch watchers, enforcing a 10s timeout and asserting the subprocess prints OK and exits 0 to detect deadlocks.
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main fix: addressing deadlock and use-after-free (UAF) bugs in PathWatcherManager through the defer-deinit pattern.
Description check ✅ Passed The description comprehensively explains what the PR does, its root causes, the fix applied, and verification steps; it is directly relevant to all changes in the pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@robobun

robobun commented Mar 14, 2026

Copy link
Copy Markdown
Collaborator Author

🔍 Verifying PR #28088 (599c9b5) — push 2.

Gate 1 (CI): Build 39553: All build steps ✅. 11/12 test suites ✅ (including ASAN ✅). windows-11-aarch64 still pending. Only failure is upload-benchmark.mjs (pipeline infra, not code).
Gate 2 (Classification): Bug fix — test proof required.
Gate 3 (Test proof):

  • Step 1 ✅ new test: test/js/node/watch/fs.watch.deadlock.test.ts
  • Step 2 ✅ baked binary deadlocks: build/debug/bun-debug test → times out at 5s (this test timed out after 5000ms)
  • Step 3 ✅ PR binary passes: CI test suites pass on debian-13-x64, ubuntu-25.04-x64, alpine-x64, alpine-aarch64, debian-13-aarch64, windows-2019-x64 + baseline, etc.
  • Step 4 ✅ ASAN: debian-13-x64-asan-test-bun passed
    Gate 4 (Diff): No TODO/FIXME/HACK. Defer-deinit pattern correct across all 5 call sites. Lock ordering fix in processWatcher and onError verified. ✅
    Gate 5 (Bots): CodeRabbit reviewed 599c9b5. All threads resolved. ✅
    Gate 6 (Hygiene): Branch farm/c3c48d40/fix-pathwatcher-deadlock ✅. PR body thorough. ✅

⏳ Waiting for windows-11-aarch64-test-bun to finish, then posting final verdict.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/bun.js/node/path_watcher.zig (1)

323-334: ⚠️ Potential issue | 🟠 Major

Let deinit() own the global manager reset.

PathWatcherManager.deinit() already locks default_manager_mutex and only clears the slot when default_manager == this (Lines 673-677). This extra block is redundant and less safe: if an already-retired manager hits onError() again after a replacement manager has been installed, the unconditional default_manager = null here will wipe out the replacement.

🧹 Minimal fix
-        // Acquire default_manager_mutex AFTER releasing this.mutex to maintain
-        // consistent lock ordering (default_manager_mutex before this.mutex).
-        // deinit() acquires default_manager_mutex first; holding this.mutex
-        // while acquiring default_manager_mutex would invert that order.
-        {
-            default_manager_mutex.lock();
-            defer default_manager_mutex.unlock();
-            default_manager = null;
-        }
-
         // deinit manager when all watchers are closed
         this.deinit();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bun.js/node/path_watcher.zig` around lines 323 - 334, Remove the
redundant global reset by deleting the explicit block that locks
default_manager_mutex and sets default_manager = null; instead rely on
PathWatcherManager.deinit() (which already locks default_manager_mutex and
clears default_manager only when default_manager == this) to perform that
responsibility—keep the this.deinit() call but remove the manual
default_manager_mutex.lock()/unlock() and default_manager = null statements to
avoid wiping a replacement manager.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/node/watch/fs.watch.deadlock.test.ts`:
- Around line 48-53: The stress loop currently allows transient errors from
fs.watch and watcher.close to escape and flake the test; wrap the creation and
immediate close of the watcher inside a try/catch that silently ignores any
thrown errors so the loop only probes for deadlocks/crashes, e.g., surround the
fs.watch(dir, { recursive: true }, ...) call and the subsequent w.close() with a
try/catch block referencing the loop variables (total, dir, done) and the
watcher variable (w) to ensure transient watcher failures do not cause test
failures.
- Line 5: The watchdog timeout that currently runs inside the child process (the
child setTimeout used to emit the DEADLOCK diagnostic) must be removed and
instead implemented in the parent test that spawns the child for the "rapid
create/close of recursive watchers should not hang" case; ensure the parent test
installs a 10s watchdog timer that kills the child and emits the DEADLOCK
diagnostic if the child hasn’t exited, await proc.exited with a larger test
timeout (parent test timeout > 10s) so the test can fail fast, and similarly
move/replace any other child-side watchdogs referenced around the child helper
code (lines handling fs.watch/w.close in the child, and the code paths at the
sections referenced 38-43 and 66) into the parent harness so native-thread
deadlocks in the child don’t prevent the watchdog from running.

---

Outside diff comments:
In `@src/bun.js/node/path_watcher.zig`:
- Around line 323-334: Remove the redundant global reset by deleting the
explicit block that locks default_manager_mutex and sets default_manager = null;
instead rely on PathWatcherManager.deinit() (which already locks
default_manager_mutex and clears default_manager only when default_manager ==
this) to perform that responsibility—keep the this.deinit() call but remove the
manual default_manager_mutex.lock()/unlock() and default_manager = null
statements to avoid wiping a replacement manager.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8d31a444-ef91-45a5-b912-576c9e743b21

📥 Commits

Reviewing files that changed from the base of the PR and between 685b934 and 599c9b5.

📒 Files selected for processing (2)
  • src/bun.js/node/path_watcher.zig
  • test/js/node/watch/fs.watch.deadlock.test.ts

Comment thread test/js/node/watch/fs.watch.deadlock.test.ts
Comment thread test/js/node/watch/fs.watch.deadlock.test.ts
Comment thread src/bun.js/node/path_watcher.zig Outdated
Comment thread src/bun.js/node/path_watcher.zig Outdated
@robobun

robobun commented Mar 14, 2026

Copy link
Copy Markdown
Collaborator Author

🔄 Push 4 (8299d81): Fixer addressed both UAF/double-free findings. All review threads resolved. CI running (build #39585). Checking diff hygiene...

onError is called from the watcher thread's threadMain, which frees the
Watcher struct after onError returns. The previous code called this.deinit()
unconditionally from onError, and when watcher_count==0 with no pending
tasks, deinit() reached main_watcher.deinit(false). Since threadMain had
already set watchloop_handle=null, Watcher.deinit took the else branch and
freed the Watcher. Control then returned to threadMain which accessed the
freed struct: UAF on close_descriptors, double-free of watchlist and the
Watcher itself.

Fix: Add main_watcher_thread_owns_cleanup flag. onError sets it and defers
cleanup to unregisterWatcher (via deinit_on_last_watcher). When no watchers
exist, onError calls deinit() but deinit() now skips main_watcher.deinit()
when the flag is set.

Also fix: default_manager=null in onError was unconditional. Now guarded
with if (default_manager == this) to match the pattern in deinit().

https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c
@robobun robobun requested a review from dylan-conway March 14, 2026 08:19
Comment thread src/bun.js/node/path_watcher.zig Outdated
Comment thread src/bun.js/node/path_watcher.zig
Bug 1 (UAF): After onError, threadMain unconditionally freed the Watcher
at lines 244-257, but PathWatcherManager still holds main_watcher and
dereferences it in _decrementPathRefNoLock, processWatcher, and
_addDirectory. Add skip_thread_cleanup flag to Watcher that
PathWatcherManager's onError handler sets, preventing threadMain from
freeing the Watcher while the manager still needs it. Remove the
main_watcher_thread_owns_cleanup field — PathWatcherManager.deinit now
always calls main_watcher.deinit(false).

Bug 2 (double-free TOCTOU): PathWatcher.deinit() calls setClosed()
which releases its mutex, then separately reads hasPendingDirectories().
Between these, unrefPendingDirectory on the work pool thread can
decrement to zero, see isClosed=true, and also call deinit(). Both
threads destroy the same PathWatcher. Add deinit_started atomic flag;
the second caller bails out via atomic swap.

https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/bun.js/node/path_watcher.zig (1)

717-724: ⚠️ Potential issue | 🟡 Minor

Dead code: watcher unlinking block can never execute.

Lines 701-704 return early if this.watcher_count > 0, so the condition at line 717 if (this.watcher_count > 0) can never be true when this code is reached. This block appears to be dead code.

🧹 Suggested removal
         this.main_watcher.deinit(false);

-        if (this.watcher_count > 0) {
-            while (this.watchers.pop()) |watcher| {
-                if (watcher) |w| {
-                    // unlink watcher
-                    w.manager = null;
-                }
-            }
-        }
-
         // close all file descriptors and free paths
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bun.js/node/path_watcher.zig` around lines 717 - 724, The conditional
block checking this.watcher_count > 0 and the loop that pops and unlinks
watchers is dead code because earlier logic returns when this.watcher_count > 0;
remove the entire unreachable block that uses this.watchers.pop() and sets
w.manager = null (or, if the intention was to run when watcher_count == 0,
change the condition accordingly), locating it by the symbols watcher_count,
watchers.pop(), and w.manager to ensure you modify the correct code region.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Watcher.zig`:
- Around line 248-266: The trace file is never closed when skip_thread_cleanup
is true because WatcherTrace.deinit() is only called in the thread cleanup path;
update Watcher.deinit() (the else branch that runs when watchloop_handle is
null) to call WatcherTrace.deinit() as well so the trace is cleaned up when
PathWatcherManager.deinit() calls main_watcher.deinit(false); ensure the call is
safe to run multiple times (i.e., idempotent or guarded by the same
open/initialized check used elsewhere).

---

Outside diff comments:
In `@src/bun.js/node/path_watcher.zig`:
- Around line 717-724: The conditional block checking this.watcher_count > 0 and
the loop that pops and unlinks watchers is dead code because earlier logic
returns when this.watcher_count > 0; remove the entire unreachable block that
uses this.watchers.pop() and sets w.manager = null (or, if the intention was to
run when watcher_count == 0, change the condition accordingly), locating it by
the symbols watcher_count, watchers.pop(), and w.manager to ensure you modify
the correct code region.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: dbfb79ac-325e-4adf-943b-b658afb458ab

📥 Commits

Reviewing files that changed from the base of the PR and between b674d12 and 8299d81.

📒 Files selected for processing (2)
  • src/Watcher.zig
  • src/bun.js/node/path_watcher.zig

Comment thread src/Watcher.zig
@robobun

robobun commented Mar 14, 2026

Copy link
Copy Markdown
Collaborator Author

🔍 Verification status (push 8: 823908a)

Gate 1 (CI): ⏳ Buildkite build running. Lint JS ✅. Waiting for full build.
Gate 2 (Classification): Bug fix — test proof required.
Gate 3 (Test proof): Previous push confirmed baked ASAN binary passes fs-watch tests (no ASAN errors), PR introduced them. Fixer narrowed mutex scope + added self-join guard. Awaiting CI to confirm ASAN clean.
Gate 4 (Diff): Reviewed incremental diff — changes are correct and targeted.
Gate 5 (Bots): No unresolved threads. Coderabbit paused (active development).
Gate 6 (Hygiene): Branch farm/c3c48d40/fix-pathwatcher-deadlock ✅.

When watchloop_handle is null (skip_thread_cleanup path),
Watcher.deinit skipped WatcherTrace.deinit(), leaking the
trace file descriptor. WatcherTrace.deinit() is idempotent
so it's safe to call in both branches.

https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c
Comment thread src/bun.js/node/path_watcher.zig
Comment thread src/bun.js/node/path_watcher.zig Outdated
Comment thread src/bun.js/node/path_watcher.zig
1. Guard main_watcher.deinit(false) with skip_thread_cleanup check
   to prevent freeing the Watcher while threadMain still holds a ref.

2. Move deinit_started atomic guard to top of PathWatcher.deinit(),
   before setClosed() acquires mutex, to prevent concurrent re-entry.
   Reset on hasPendingDirectories early-return path.

3. Check pending_tasks under mutex in PathWatcherManager.deinit()
   instead of lock-free hasPendingTasks() to close TOCTOU window
   between checking and setting deinit_on_last_task.

https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c
Comment thread src/bun.js/node/path_watcher.zig
Comment thread src/bun.js/node/path_watcher.zig Outdated
1. Move watcher_count check under mutex alongside pending_tasks
   to close TOCTOU window where a watcher unregisters between the
   unlocked check and setting deinit_on_last_watcher.

2. Join the watcher thread before calling main_watcher.deinit(false)
   when skip_thread_cleanup is true, instead of skipping deinit
   entirely. This ensures threadMain has exited (it skips its own
   cleanup) so the Watcher can be safely freed by deinit().

https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c
Comment thread src/bun.js/node/path_watcher.zig
Two deadlocks fixed:

1. Cross-thread deadlock: deinit() held default_manager_mutex across
   thread.join(), but the watcher thread's onError path also acquires
   default_manager_mutex. Fix: narrow the mutex scope to just the
   default_manager null-out, releasing it before join.

2. Self-join deadlock: when onError is called with watcher_count == 0,
   it calls deinit() from the watcher thread. deinit() then tries to
   join the calling thread (EDEADLK). Fix: compare current thread ID
   against watchloop_handle and skip join when they match.

https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c
@robobun

robobun commented Mar 14, 2026

Copy link
Copy Markdown
Collaborator Author

🔍 Verification status823908a

CI: ❌ Build #39618 failed — 6 steps failed (darwin-13-aarch64, darwin-14-aarch64, darwin-14-x64, debian-13-x64-asan).

Direction change: Per @alii's request, the simpler approach from #27469 was preferred over this comprehensive rewrite. CI is failing and no further fixes will be applied to this PR.

Closing — this PR is superseded. A replacement PR with the targeted #27469 approach will be opened separately.

Comment on lines +723 to 734
// When skip_thread_cleanup is true (error path), threadMain is still
// running but will skip its own cleanup. Join to ensure it has exited
// before we free the Watcher — unless we ARE the watcher thread
// (onError calls deinit() directly when watcher_count == 0, and
// joining ourselves would deadlock with EDEADLK).
if (this.main_watcher.skip_thread_cleanup) {
const watcher_tid = this.main_watcher.watchloop_handle;
if (watcher_tid == null or watcher_tid.? != std.Thread.getCurrentId()) {
this.main_watcher.thread.join();
}
}
this.main_watcher.deinit(false);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The self-join guard in deinit() (lines 729-730) is broken: threadMain sets watchloop_handle = null (Watcher.zig:240) BEFORE calling onError (line 243), so when onError calls deinit() with watcher_count == 0, the guard reads null, watcher_tid == null short-circuits to true, and thread.join() executes from the watcher thread itself — a deterministic self-join (POSIX EDEADLK / Zig unreachable panic). Furthermore, even if the self-join guard were fixed to skip the join, line 734 unconditionally calls main_watcher.deinit(false) which frees the Watcher struct (watchloop_handle is null → else branch at Watcher.zig:125 → allocator.destroy), and control returns to threadMain line 252 which reads this.skip_thread_cleanup from freed memory — UAF. Fix: save the thread ID before clearing watchloop_handle, and when deinit detects it is on the watcher thread, skip both thread.join() AND main_watcher.deinit(false), letting threadMain handle its own cleanup.

Extended reasoning...

The Self-Join Deadlock

The self-join guard at lines 728-732 attempts to detect whether deinit() is being called from the watcher thread by reading this.main_watcher.watchloop_handle. However, threadMain (Watcher.zig:240) sets this.watchloop_handle = null before calling this.onError() at line 243. When onError calls deinit() (which happens when watcher_count == 0 at line 331), watchloop_handle has already been cleared.

The guard condition at line 730 is: if (watcher_tid == null or watcher_tid.? != std.Thread.getCurrentId()). Since watcher_tid is null, the or short-circuits to true, and the code enters the if-block and calls this.main_watcher.thread.join() at line 731 — from the watcher thread itself. POSIX pthread_join on the calling thread returns EDEADLK; Zig std.Thread.join() treats this as unreachable, causing a panic in safe builds or undefined behavior in release builds.

Step-by-step proof of the self-join

  1. threadMain (Watcher.zig:240): this.watchloop_handle = null
  2. threadMain (Watcher.zig:243): calls this.onError(this.ctx, err)
  3. onError (path_watcher.zig:326): sets this.main_watcher.skip_thread_cleanup = true
  4. onError (line 331): watcher_count == 0needs_deinit = true
  5. onError (line 354): calls this.deinit()
  6. deinit (line 728): skip_thread_cleanup is true → enters block
  7. deinit (line 729): watcher_tid = watchloop_handle = null
  8. deinit (line 730): null == null evaluates to true via short-circuit → enters if-body
  9. deinit (line 731): this.main_watcher.thread.join()SELF-JOIN DEADLOCK

This is deterministic, not a race condition — it triggers whenever all JS fs.watch() handles are already closed when the OS watcher reports a system error.

The UAF (even if the self-join were fixed)

Even if the self-join guard were correctly fixed to skip the join when called from the watcher thread, line 734 unconditionally calls this.main_watcher.deinit(false). Since watchloop_handle was set to null at Watcher.zig:240, Watcher.deinit(false) takes the else branch (Watcher.zig:125-136): it calls this.watchlist.deinit(this.allocator) at line 133 freeing the watchlist, then allocator.destroy(this) at line 135 freeing the entire Watcher struct.

Control then returns through deinit()onError()threadMain(). At Watcher.zig:252, threadMain reads this.skip_thread_cleanup from the freed Watcher struct — a use-after-free. In the current code this UAF is masked by the self-join panic/deadlock occurring first at line 731, but fixing the self-join without fixing this would expose the UAF.

Impact

The self-join is a deterministic crash/deadlock on the error path. The UAF is a memory safety violation that would be exposed by any correct fix of the self-join. Both must be fixed together.

Suggested Fix

Save the thread ID in a separate field before clearing watchloop_handle, or use std.Thread.getCurrentId() compared against thread.getHandle() instead of relying on watchloop_handle. When deinit detects it is on the watcher thread, skip both thread.join() and main_watcher.deinit(false), letting threadMain handle its own Watcher cleanup after onError returns (the if (!this.skip_thread_cleanup) block at Watcher.zig:252 would need to be inverted to perform cleanup when on the error path).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified: threadMain sets watchloop_handle = null (line 241) before calling onError (line 244). When watcher_count == 0, onError calls deinit() where the null watchloop_handle short-circuits the self-join guard → thread.join() from the watcher thread itself = EDEADLK. Briefing fixer.

Comment on lines +948 to 953
if (this.deinit_started.swap(true, .acq_rel)) return;
this.setClosed();
if (this.hasPendingDirectories()) {
// will be freed on last directory
this.deinit_started.store(false, .release);
return;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The deinit_started.store(false) reset at line 952 creates a narrow TOCTOU window where neither thread performs cleanup, causing a permanent PathWatcher leak. If Thread A reads hasPendingDirectories() as true (line 950) and Thread B completes unrefPendingDirectory (decrementing to 0, seeing closed=true) and enters deferred deinit() before Thread A reaches line 952, Thread B's swap returns true and bails, then Thread A resets deinit_started to false and returns — no future trigger exists to free the PathWatcher. Fix: keep deinit_started true in the early-return path and have unrefPendingDirectory check deinit_started directly instead of calling deinit().

Extended reasoning...

What the bug is

The deinit_started atomic guard in PathWatcher.deinit() (lines 948-953) has a residual TOCTOU race between the atomic swap and the reset that causes a permanent PathWatcher resource leak. The deinit_started.store(false, .release) at line 952 is intended to allow unrefPendingDirectory to later trigger cleanup when the last pending directory completes, but there is a window where both threads bail without cleaning up.

The specific race condition

The race occurs between hasPendingDirectories() returning true (line 950) and deinit_started.store(false) (line 952):

  1. Thread A (main thread calling PathWatcher.deinit()): deinit_started.swap(true) returns false at line 948 → proceeds. Calls setClosed() (line 949). Reads hasPendingDirectories() → returns true (line 950).
  2. Thread B (work pool in unrefPendingDirectory): Acquires this.mutex (line 898). Decrements pending_directories to 0 (line 900). Stores has_pending_directories = false (line 902). Reads isClosed() → returns true (from step 1). Sets should_deinit = true. Releases mutex. Deferred this.deinit() fires.
  3. Thread B in deinit(): deinit_started.swap(true) returns true (set by Thread A in step 1) → returns immediately at line 948.
  4. Thread A continues: Executes deinit_started.store(false) (line 952) → resets the flag. Returns (line 953).

Why neither thread performs cleanup

After this interleaving:

  • Thread A gave up because hasPendingDirectories() was true (step 1)
  • Thread B gave up because deinit_started was true (step 3)
  • Thread A reset deinit_started to false (step 4)

The final state is: pending_directories == 0, closed == true, deinit_started == false. No future trigger exists — unrefPendingDirectory will never be called again (pending_directories is already 0), and no other code path calls deinit() on this PathWatcher. The PathWatcher struct, its file_paths, and associated allocations are permanently leaked.

Timing analysis

The window is narrow: Thread B must execute steps 2-3 (the entire unrefPendingDirectory critical section plus the deferred deinit() entry including the atomic swap) between Thread A's lines 950 and 952, which are essentially consecutive instructions (branch + store). On a multi-core system, Thread B runs concurrently on another core and only needs to complete ~15 instructions (mutex lock/decrement/store/check/unlock/function call/swap) during Thread A's ~3 instructions (branch evaluation + store + return). While unlikely per invocation, it is theoretically triggerable under load, especially on ARM with weaker memory ordering.

Impact

The consequence is a resource leak only — the PathWatcher struct plus associated allocations (file_paths, etc.) are never freed. This is not a crash, UAF, or security issue. The trigger scenario (main thread calling deinit() racing with the very last unrefPendingDirectory completing on the work pool) is itself uncommon, and the timing window within that scenario is extremely tight.

How to fix

Instead of resetting deinit_started to false in the hasPendingDirectories() early-return path, keep it true. Then modify unrefPendingDirectory to check deinit_started directly (similar to how deinit_on_last_task/deinit_on_last_watcher work elsewhere in this file). When unrefPendingDirectory sees pending_directories == 0 and deinit_started == true, it performs the cleanup directly rather than calling deinit() which would bail on the swap. Alternatively, perform the hasPendingDirectories() check under the mutex to coordinate with unrefPendingDirectory atomically.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid finding — narrow TOCTOU window where neither thread frees the PathWatcher. Low severity (leak only, not crash), but should be fixed. Briefing fixer.

@robobun

robobun commented Mar 14, 2026

Copy link
Copy Markdown
Collaborator Author

Closing: CI failed (build #39618, 6 steps) and this approach is superseded per maintainer feedback. A replacement PR with the simpler #27469 approach will be opened separately.

@robobun robobun closed this Mar 14, 2026
@alii

alii commented Apr 16, 2026

Copy link
Copy Markdown
Member

@robobun pick this back up and remake your replacement PR you said would exist

@robobun

robobun commented Apr 16, 2026

Copy link
Copy Markdown
Collaborator Author

✅ Replacement PR opened → #29391

Applies the minimal #27469 defer-deinit fix onto current main. Verified locally:

  • Without fix: panic: Deadlock detected on rapid fs.watch({recursive:true}).close() churn
  • With fix: passes; no ASAN regressions on test-fs-watch-recursive-* node-parallel tests

Rebased onto latest main to pick up the process.test.js mimalloc-hash fix. Also closed stale #28104 as superseded.

Jarred-Sumner pushed a commit that referenced this pull request Apr 17, 2026
## What does this PR do?

Targeted fix for four concurrency bugs in
`src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock
or crash. This is the minimal port of #27469 (by @chrislloyd) onto
current main — previous attempts (#27957, #28088, #28104) added scope
and introduced ASAN regressions. This PR applies only the defer-ordering
fix.

## Root causes

### `unregisterWatcher` self-deadlock (primary)

The deinit-on-last-watcher `defer` was registered *after* the
`mutex.lock()`/`defer mutex.unlock()` pair:

```zig
this.mutex.lock();
defer this.mutex.unlock();           // fires last
var watchers = this.watchers.slice();
defer {                               // fires BEFORE unlock
    if (this.deinit_on_last_watcher and this.watcher_count == 0) {
        this.deinit();                // called holding this.mutex
    }
}
```

Zig defers fire LIFO, so `deinit()` ran while still holding
`this.mutex`. `deinit()` re-acquires `this.mutex` when
`hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock
(observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock
detected` on Linux).

Also UAF: if `deinit()` completes it calls `destroy(this)`, then the
deferred `mutex.unlock()` writes to freed memory.

### `unrefPendingTask` UAF

Same shape: `deinit()` called while holding `this.mutex`. If it
completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF.

### `unrefPendingDirectory` self-deadlock

`deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while
holding → self-deadlock.

### `processWatcher` AB/BA lock inversion

Worker thread holds `watcher.mutex` → calls
`manager._decrementPathRef()` on the OOM error path, which acquires
`manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` →
wants `watcher.mutex`. Classic AB/BA deadlock.

## Fix

All four use the same pattern: set a `should_deinit` flag under the
lock, checked by a `defer` registered *before* the lock/unlock pair.
LIFO ordering then yields unlock → deinit. For `processWatcher`, capture
the append result and release `watcher.mutex` before calling
`_decrementPathRef`.

No semantic changes to when the `has_pending_*` atomics are cleared —
the conditions remain exactly as before, only `deinit()` is moved
outside the lock.

## How did you verify your code works?

- `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes
- Without the fix (src/ stashed), the same test fails with `panic:
Deadlock detected` from the debug-build mutex checker
- All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under
ASAN (these regressed in #28104)
- `test/regression/issue/3657.test.ts` passes under ASAN
- `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing
root-permission failures (same on main)

Supersedes #27469, #27957, #28088, #28104.

---

**Re: duplicate-bot flagging #26385** — that PR fixes a different lock
inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around
`main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex`
self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex`
inversion. They're complementary, not duplicates.

**Re: find-issues-bot flagging #18919** — tested the repro; it hits a
separate pre-existing use-after-poison in the File Watcher thread that
also reproduces on main without this change. Not claiming that fix here.
structwafel pushed a commit to structwafel/bun that referenced this pull request Apr 25, 2026
…9391)

## What does this PR do?

Targeted fix for four concurrency bugs in
`src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock
or crash. This is the minimal port of oven-sh#27469 (by @chrislloyd) onto
current main — previous attempts (oven-sh#27957, oven-sh#28088, oven-sh#28104) added scope
and introduced ASAN regressions. This PR applies only the defer-ordering
fix.

## Root causes

### `unregisterWatcher` self-deadlock (primary)

The deinit-on-last-watcher `defer` was registered *after* the
`mutex.lock()`/`defer mutex.unlock()` pair:

```zig
this.mutex.lock();
defer this.mutex.unlock();           // fires last
var watchers = this.watchers.slice();
defer {                               // fires BEFORE unlock
    if (this.deinit_on_last_watcher and this.watcher_count == 0) {
        this.deinit();                // called holding this.mutex
    }
}
```

Zig defers fire LIFO, so `deinit()` ran while still holding
`this.mutex`. `deinit()` re-acquires `this.mutex` when
`hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock
(observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock
detected` on Linux).

Also UAF: if `deinit()` completes it calls `destroy(this)`, then the
deferred `mutex.unlock()` writes to freed memory.

### `unrefPendingTask` UAF

Same shape: `deinit()` called while holding `this.mutex`. If it
completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF.

### `unrefPendingDirectory` self-deadlock

`deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while
holding → self-deadlock.

### `processWatcher` AB/BA lock inversion

Worker thread holds `watcher.mutex` → calls
`manager._decrementPathRef()` on the OOM error path, which acquires
`manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` →
wants `watcher.mutex`. Classic AB/BA deadlock.

## Fix

All four use the same pattern: set a `should_deinit` flag under the
lock, checked by a `defer` registered *before* the lock/unlock pair.
LIFO ordering then yields unlock → deinit. For `processWatcher`, capture
the append result and release `watcher.mutex` before calling
`_decrementPathRef`.

No semantic changes to when the `has_pending_*` atomics are cleared —
the conditions remain exactly as before, only `deinit()` is moved
outside the lock.

## How did you verify your code works?

- `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes
- Without the fix (src/ stashed), the same test fails with `panic:
Deadlock detected` from the debug-build mutex checker
- All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under
ASAN (these regressed in oven-sh#28104)
- `test/regression/issue/3657.test.ts` passes under ASAN
- `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing
root-permission failures (same on main)

Supersedes oven-sh#27469, oven-sh#27957, oven-sh#28088, oven-sh#28104.

---

**Re: duplicate-bot flagging oven-sh#26385** — that PR fixes a different lock
inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around
`main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex`
self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex`
inversion. They're complementary, not duplicates.

**Re: find-issues-bot flagging oven-sh#18919** — tested the repro; it hits a
separate pre-existing use-after-poison in the File Watcher thread that
also reproduces on main without this change. Not claiming that fix here.
xhjkl pushed a commit to xhjkl/bun that referenced this pull request May 14, 2026
…9391)

## What does this PR do?

Targeted fix for four concurrency bugs in
`src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock
or crash. This is the minimal port of oven-sh#27469 (by @chrislloyd) onto
current main — previous attempts (oven-sh#27957, oven-sh#28088, oven-sh#28104) added scope
and introduced ASAN regressions. This PR applies only the defer-ordering
fix.

## Root causes

### `unregisterWatcher` self-deadlock (primary)

The deinit-on-last-watcher `defer` was registered *after* the
`mutex.lock()`/`defer mutex.unlock()` pair:

```zig
this.mutex.lock();
defer this.mutex.unlock();           // fires last
var watchers = this.watchers.slice();
defer {                               // fires BEFORE unlock
    if (this.deinit_on_last_watcher and this.watcher_count == 0) {
        this.deinit();                // called holding this.mutex
    }
}
```

Zig defers fire LIFO, so `deinit()` ran while still holding
`this.mutex`. `deinit()` re-acquires `this.mutex` when
`hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock
(observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock
detected` on Linux).

Also UAF: if `deinit()` completes it calls `destroy(this)`, then the
deferred `mutex.unlock()` writes to freed memory.

### `unrefPendingTask` UAF

Same shape: `deinit()` called while holding `this.mutex`. If it
completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF.

### `unrefPendingDirectory` self-deadlock

`deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while
holding → self-deadlock.

### `processWatcher` AB/BA lock inversion

Worker thread holds `watcher.mutex` → calls
`manager._decrementPathRef()` on the OOM error path, which acquires
`manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` →
wants `watcher.mutex`. Classic AB/BA deadlock.

## Fix

All four use the same pattern: set a `should_deinit` flag under the
lock, checked by a `defer` registered *before* the lock/unlock pair.
LIFO ordering then yields unlock → deinit. For `processWatcher`, capture
the append result and release `watcher.mutex` before calling
`_decrementPathRef`.

No semantic changes to when the `has_pending_*` atomics are cleared —
the conditions remain exactly as before, only `deinit()` is moved
outside the lock.

## How did you verify your code works?

- `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes
- Without the fix (src/ stashed), the same test fails with `panic:
Deadlock detected` from the debug-build mutex checker
- All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under
ASAN (these regressed in oven-sh#28104)
- `test/regression/issue/3657.test.ts` passes under ASAN
- `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing
root-permission failures (same on main)

Supersedes oven-sh#27469, oven-sh#27957, oven-sh#28088, oven-sh#28104.

---

**Re: duplicate-bot flagging oven-sh#26385** — that PR fixes a different lock
inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around
`main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex`
self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex`
inversion. They're complementary, not duplicates.

**Re: find-issues-bot flagging oven-sh#18919** — tested the repro; it hits a
separate pre-existing use-after-poison in the File Watcher thread that
also reproduces on main without this change. Not claiming that fix here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants