Fix PathWatcherManager deadlock and UAF in deferred deinit#28088
Fix PathWatcherManager deadlock and UAF in deferred deinit#28088robobun wants to merge 8 commits into
Conversation
Four fixes in the deferred-deinit pattern used by PathWatcherManager, all triggered by rapid fs.watch() churn: 1. unregisterWatcher self-deadlock: deinit() was called while holding this.mutex. deinit() re-acquires the same non-recursive mutex when hasPendingTasks() is true, causing self-deadlock. Fix: hoist to a should_deinit flag checked by a defer registered before the lock. 2. unrefPendingTask UAF: same shape — deinit() called while holding mutex. If deinit() completes and calls destroy(this), the deferred mutex.unlock() writes to freed memory. Same fix pattern. 3. processWatcher AB/BA lock inversion: worker thread holds watcher.mutex then calls _decrementPathRef which acquires manager.mutex. Main thread unregisterWatcher holds manager.mutex then wants watcher.mutex. Fix: unlock watcher.mutex before calling _decrementPathRef. 4. unrefPendingDirectory self-deadlock: deinit() calls setClosed() which re-locks this.mutex. Same should_deinit pattern. 5. onError lock ordering: acquired this.mutex then default_manager_mutex, but deinit() acquires them in reverse order. Fix: release this.mutex before acquiring default_manager_mutex. Based on #27469 by @chrislloyd.
|
✅ Opened fresh PR #28104 with the targeted fix from #27469 (rebased onto current main). Three fixes in
Debug build compiles, waiting for CI. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds deferred deinitialization guards, TOCTOU protections, and mutex-ordering fixes to PathWatcher and Watcher cleanup paths to prevent double-deinit/UAF and deadlocks; introduces a skip-thread-cleanup flag for owner-led teardown and a subprocess test that rapidly creates/closes recursive fs.watch watchers to detect deadlocks. Changes
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. 📝 Coding Plan
Comment |
|
🔍 Verifying PR #28088 (599c9b5) — push 2. Gate 1 (CI): Build 39553: All build steps ✅. 11/12 test suites ✅ (including ASAN ✅). windows-11-aarch64 still pending. Only failure is
⏳ Waiting for windows-11-aarch64-test-bun to finish, then posting final verdict. |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/bun.js/node/path_watcher.zig (1)
323-334:⚠️ Potential issue | 🟠 MajorLet
deinit()own the global manager reset.
PathWatcherManager.deinit()already locksdefault_manager_mutexand only clears the slot whendefault_manager == this(Lines 673-677). This extra block is redundant and less safe: if an already-retired manager hitsonError()again after a replacement manager has been installed, the unconditionaldefault_manager = nullhere will wipe out the replacement.🧹 Minimal fix
- // Acquire default_manager_mutex AFTER releasing this.mutex to maintain - // consistent lock ordering (default_manager_mutex before this.mutex). - // deinit() acquires default_manager_mutex first; holding this.mutex - // while acquiring default_manager_mutex would invert that order. - { - default_manager_mutex.lock(); - defer default_manager_mutex.unlock(); - default_manager = null; - } - // deinit manager when all watchers are closed this.deinit();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bun.js/node/path_watcher.zig` around lines 323 - 334, Remove the redundant global reset by deleting the explicit block that locks default_manager_mutex and sets default_manager = null; instead rely on PathWatcherManager.deinit() (which already locks default_manager_mutex and clears default_manager only when default_manager == this) to perform that responsibility—keep the this.deinit() call but remove the manual default_manager_mutex.lock()/unlock() and default_manager = null statements to avoid wiping a replacement manager.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/node/watch/fs.watch.deadlock.test.ts`:
- Around line 48-53: The stress loop currently allows transient errors from
fs.watch and watcher.close to escape and flake the test; wrap the creation and
immediate close of the watcher inside a try/catch that silently ignores any
thrown errors so the loop only probes for deadlocks/crashes, e.g., surround the
fs.watch(dir, { recursive: true }, ...) call and the subsequent w.close() with a
try/catch block referencing the loop variables (total, dir, done) and the
watcher variable (w) to ensure transient watcher failures do not cause test
failures.
- Line 5: The watchdog timeout that currently runs inside the child process (the
child setTimeout used to emit the DEADLOCK diagnostic) must be removed and
instead implemented in the parent test that spawns the child for the "rapid
create/close of recursive watchers should not hang" case; ensure the parent test
installs a 10s watchdog timer that kills the child and emits the DEADLOCK
diagnostic if the child hasn’t exited, await proc.exited with a larger test
timeout (parent test timeout > 10s) so the test can fail fast, and similarly
move/replace any other child-side watchdogs referenced around the child helper
code (lines handling fs.watch/w.close in the child, and the code paths at the
sections referenced 38-43 and 66) into the parent harness so native-thread
deadlocks in the child don’t prevent the watchdog from running.
---
Outside diff comments:
In `@src/bun.js/node/path_watcher.zig`:
- Around line 323-334: Remove the redundant global reset by deleting the
explicit block that locks default_manager_mutex and sets default_manager = null;
instead rely on PathWatcherManager.deinit() (which already locks
default_manager_mutex and clears default_manager only when default_manager ==
this) to perform that responsibility—keep the this.deinit() call but remove the
manual default_manager_mutex.lock()/unlock() and default_manager = null
statements to avoid wiping a replacement manager.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8d31a444-ef91-45a5-b912-576c9e743b21
📒 Files selected for processing (2)
src/bun.js/node/path_watcher.zigtest/js/node/watch/fs.watch.deadlock.test.ts
|
🔄 Push 4 ( |
onError is called from the watcher thread's threadMain, which frees the Watcher struct after onError returns. The previous code called this.deinit() unconditionally from onError, and when watcher_count==0 with no pending tasks, deinit() reached main_watcher.deinit(false). Since threadMain had already set watchloop_handle=null, Watcher.deinit took the else branch and freed the Watcher. Control then returned to threadMain which accessed the freed struct: UAF on close_descriptors, double-free of watchlist and the Watcher itself. Fix: Add main_watcher_thread_owns_cleanup flag. onError sets it and defers cleanup to unregisterWatcher (via deinit_on_last_watcher). When no watchers exist, onError calls deinit() but deinit() now skips main_watcher.deinit() when the flag is set. Also fix: default_manager=null in onError was unconditional. Now guarded with if (default_manager == this) to match the pattern in deinit(). https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c
Bug 1 (UAF): After onError, threadMain unconditionally freed the Watcher at lines 244-257, but PathWatcherManager still holds main_watcher and dereferences it in _decrementPathRefNoLock, processWatcher, and _addDirectory. Add skip_thread_cleanup flag to Watcher that PathWatcherManager's onError handler sets, preventing threadMain from freeing the Watcher while the manager still needs it. Remove the main_watcher_thread_owns_cleanup field — PathWatcherManager.deinit now always calls main_watcher.deinit(false). Bug 2 (double-free TOCTOU): PathWatcher.deinit() calls setClosed() which releases its mutex, then separately reads hasPendingDirectories(). Between these, unrefPendingDirectory on the work pool thread can decrement to zero, see isClosed=true, and also call deinit(). Both threads destroy the same PathWatcher. Add deinit_started atomic flag; the second caller bails out via atomic swap. https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/bun.js/node/path_watcher.zig (1)
717-724:⚠️ Potential issue | 🟡 MinorDead code: watcher unlinking block can never execute.
Lines 701-704 return early if
this.watcher_count > 0, so the condition at line 717if (this.watcher_count > 0)can never be true when this code is reached. This block appears to be dead code.🧹 Suggested removal
this.main_watcher.deinit(false); - if (this.watcher_count > 0) { - while (this.watchers.pop()) |watcher| { - if (watcher) |w| { - // unlink watcher - w.manager = null; - } - } - } - // close all file descriptors and free paths🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bun.js/node/path_watcher.zig` around lines 717 - 724, The conditional block checking this.watcher_count > 0 and the loop that pops and unlinks watchers is dead code because earlier logic returns when this.watcher_count > 0; remove the entire unreachable block that uses this.watchers.pop() and sets w.manager = null (or, if the intention was to run when watcher_count == 0, change the condition accordingly), locating it by the symbols watcher_count, watchers.pop(), and w.manager to ensure you modify the correct code region.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/Watcher.zig`:
- Around line 248-266: The trace file is never closed when skip_thread_cleanup
is true because WatcherTrace.deinit() is only called in the thread cleanup path;
update Watcher.deinit() (the else branch that runs when watchloop_handle is
null) to call WatcherTrace.deinit() as well so the trace is cleaned up when
PathWatcherManager.deinit() calls main_watcher.deinit(false); ensure the call is
safe to run multiple times (i.e., idempotent or guarded by the same
open/initialized check used elsewhere).
---
Outside diff comments:
In `@src/bun.js/node/path_watcher.zig`:
- Around line 717-724: The conditional block checking this.watcher_count > 0 and
the loop that pops and unlinks watchers is dead code because earlier logic
returns when this.watcher_count > 0; remove the entire unreachable block that
uses this.watchers.pop() and sets w.manager = null (or, if the intention was to
run when watcher_count == 0, change the condition accordingly), locating it by
the symbols watcher_count, watchers.pop(), and w.manager to ensure you modify
the correct code region.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: dbfb79ac-325e-4adf-943b-b658afb458ab
📒 Files selected for processing (2)
src/Watcher.zigsrc/bun.js/node/path_watcher.zig
|
🔍 Verification status (push 8: Gate 1 (CI): ⏳ Buildkite build running. Lint JS ✅. Waiting for full build. |
When watchloop_handle is null (skip_thread_cleanup path), Watcher.deinit skipped WatcherTrace.deinit(), leaking the trace file descriptor. WatcherTrace.deinit() is idempotent so it's safe to call in both branches. https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c
1. Guard main_watcher.deinit(false) with skip_thread_cleanup check to prevent freeing the Watcher while threadMain still holds a ref. 2. Move deinit_started atomic guard to top of PathWatcher.deinit(), before setClosed() acquires mutex, to prevent concurrent re-entry. Reset on hasPendingDirectories early-return path. 3. Check pending_tasks under mutex in PathWatcherManager.deinit() instead of lock-free hasPendingTasks() to close TOCTOU window between checking and setting deinit_on_last_task. https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c
1. Move watcher_count check under mutex alongside pending_tasks to close TOCTOU window where a watcher unregisters between the unlocked check and setting deinit_on_last_watcher. 2. Join the watcher thread before calling main_watcher.deinit(false) when skip_thread_cleanup is true, instead of skipping deinit entirely. This ensures threadMain has exited (it skips its own cleanup) so the Watcher can be safely freed by deinit(). https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c
Two deadlocks fixed: 1. Cross-thread deadlock: deinit() held default_manager_mutex across thread.join(), but the watcher thread's onError path also acquires default_manager_mutex. Fix: narrow the mutex scope to just the default_manager null-out, releasing it before join. 2. Self-join deadlock: when onError is called with watcher_count == 0, it calls deinit() from the watcher thread. deinit() then tries to join the calling thread (EDEADLK). Fix: compare current thread ID against watchloop_handle and skip join when they match. https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c
|
🔍 Verification status — CI: ❌ Build #39618 failed — 6 steps failed (darwin-13-aarch64, darwin-14-aarch64, darwin-14-x64, debian-13-x64-asan). Direction change: Per @alii's request, the simpler approach from #27469 was preferred over this comprehensive rewrite. CI is failing and no further fixes will be applied to this PR. Closing — this PR is superseded. A replacement PR with the targeted #27469 approach will be opened separately. |
| // When skip_thread_cleanup is true (error path), threadMain is still | ||
| // running but will skip its own cleanup. Join to ensure it has exited | ||
| // before we free the Watcher — unless we ARE the watcher thread | ||
| // (onError calls deinit() directly when watcher_count == 0, and | ||
| // joining ourselves would deadlock with EDEADLK). | ||
| if (this.main_watcher.skip_thread_cleanup) { | ||
| const watcher_tid = this.main_watcher.watchloop_handle; | ||
| if (watcher_tid == null or watcher_tid.? != std.Thread.getCurrentId()) { | ||
| this.main_watcher.thread.join(); | ||
| } | ||
| } | ||
| this.main_watcher.deinit(false); |
There was a problem hiding this comment.
🔴 The self-join guard in deinit() (lines 729-730) is broken: threadMain sets watchloop_handle = null (Watcher.zig:240) BEFORE calling onError (line 243), so when onError calls deinit() with watcher_count == 0, the guard reads null, watcher_tid == null short-circuits to true, and thread.join() executes from the watcher thread itself — a deterministic self-join (POSIX EDEADLK / Zig unreachable panic). Furthermore, even if the self-join guard were fixed to skip the join, line 734 unconditionally calls main_watcher.deinit(false) which frees the Watcher struct (watchloop_handle is null → else branch at Watcher.zig:125 → allocator.destroy), and control returns to threadMain line 252 which reads this.skip_thread_cleanup from freed memory — UAF. Fix: save the thread ID before clearing watchloop_handle, and when deinit detects it is on the watcher thread, skip both thread.join() AND main_watcher.deinit(false), letting threadMain handle its own cleanup.
Extended reasoning...
The Self-Join Deadlock
The self-join guard at lines 728-732 attempts to detect whether deinit() is being called from the watcher thread by reading this.main_watcher.watchloop_handle. However, threadMain (Watcher.zig:240) sets this.watchloop_handle = null before calling this.onError() at line 243. When onError calls deinit() (which happens when watcher_count == 0 at line 331), watchloop_handle has already been cleared.
The guard condition at line 730 is: if (watcher_tid == null or watcher_tid.? != std.Thread.getCurrentId()). Since watcher_tid is null, the or short-circuits to true, and the code enters the if-block and calls this.main_watcher.thread.join() at line 731 — from the watcher thread itself. POSIX pthread_join on the calling thread returns EDEADLK; Zig std.Thread.join() treats this as unreachable, causing a panic in safe builds or undefined behavior in release builds.
Step-by-step proof of the self-join
threadMain(Watcher.zig:240):this.watchloop_handle = nullthreadMain(Watcher.zig:243): callsthis.onError(this.ctx, err)onError(path_watcher.zig:326): setsthis.main_watcher.skip_thread_cleanup = trueonError(line 331):watcher_count == 0→needs_deinit = trueonError(line 354): callsthis.deinit()deinit(line 728):skip_thread_cleanupis true → enters blockdeinit(line 729):watcher_tid = watchloop_handle = nulldeinit(line 730):null == nullevaluates to true via short-circuit → enters if-bodydeinit(line 731):this.main_watcher.thread.join()— SELF-JOIN DEADLOCK
This is deterministic, not a race condition — it triggers whenever all JS fs.watch() handles are already closed when the OS watcher reports a system error.
The UAF (even if the self-join were fixed)
Even if the self-join guard were correctly fixed to skip the join when called from the watcher thread, line 734 unconditionally calls this.main_watcher.deinit(false). Since watchloop_handle was set to null at Watcher.zig:240, Watcher.deinit(false) takes the else branch (Watcher.zig:125-136): it calls this.watchlist.deinit(this.allocator) at line 133 freeing the watchlist, then allocator.destroy(this) at line 135 freeing the entire Watcher struct.
Control then returns through deinit() → onError() → threadMain(). At Watcher.zig:252, threadMain reads this.skip_thread_cleanup from the freed Watcher struct — a use-after-free. In the current code this UAF is masked by the self-join panic/deadlock occurring first at line 731, but fixing the self-join without fixing this would expose the UAF.
Impact
The self-join is a deterministic crash/deadlock on the error path. The UAF is a memory safety violation that would be exposed by any correct fix of the self-join. Both must be fixed together.
Suggested Fix
Save the thread ID in a separate field before clearing watchloop_handle, or use std.Thread.getCurrentId() compared against thread.getHandle() instead of relying on watchloop_handle. When deinit detects it is on the watcher thread, skip both thread.join() and main_watcher.deinit(false), letting threadMain handle its own Watcher cleanup after onError returns (the if (!this.skip_thread_cleanup) block at Watcher.zig:252 would need to be inverted to perform cleanup when on the error path).
There was a problem hiding this comment.
Verified: threadMain sets watchloop_handle = null (line 241) before calling onError (line 244). When watcher_count == 0, onError calls deinit() where the null watchloop_handle short-circuits the self-join guard → thread.join() from the watcher thread itself = EDEADLK. Briefing fixer.
| if (this.deinit_started.swap(true, .acq_rel)) return; | ||
| this.setClosed(); | ||
| if (this.hasPendingDirectories()) { | ||
| // will be freed on last directory | ||
| this.deinit_started.store(false, .release); | ||
| return; |
There was a problem hiding this comment.
🟡 The deinit_started.store(false) reset at line 952 creates a narrow TOCTOU window where neither thread performs cleanup, causing a permanent PathWatcher leak. If Thread A reads hasPendingDirectories() as true (line 950) and Thread B completes unrefPendingDirectory (decrementing to 0, seeing closed=true) and enters deferred deinit() before Thread A reaches line 952, Thread B's swap returns true and bails, then Thread A resets deinit_started to false and returns — no future trigger exists to free the PathWatcher. Fix: keep deinit_started true in the early-return path and have unrefPendingDirectory check deinit_started directly instead of calling deinit().
Extended reasoning...
What the bug is
The deinit_started atomic guard in PathWatcher.deinit() (lines 948-953) has a residual TOCTOU race between the atomic swap and the reset that causes a permanent PathWatcher resource leak. The deinit_started.store(false, .release) at line 952 is intended to allow unrefPendingDirectory to later trigger cleanup when the last pending directory completes, but there is a window where both threads bail without cleaning up.
The specific race condition
The race occurs between hasPendingDirectories() returning true (line 950) and deinit_started.store(false) (line 952):
- Thread A (main thread calling
PathWatcher.deinit()):deinit_started.swap(true)returns false at line 948 → proceeds. CallssetClosed()(line 949). ReadshasPendingDirectories()→ returns true (line 950). - Thread B (work pool in
unrefPendingDirectory): Acquiresthis.mutex(line 898). Decrementspending_directoriesto 0 (line 900). Storeshas_pending_directories = false(line 902). ReadsisClosed()→ returns true (from step 1). Setsshould_deinit = true. Releases mutex. Deferredthis.deinit()fires. - Thread B in
deinit():deinit_started.swap(true)returns true (set by Thread A in step 1) → returns immediately at line 948. - Thread A continues: Executes
deinit_started.store(false)(line 952) → resets the flag. Returns (line 953).
Why neither thread performs cleanup
After this interleaving:
- Thread A gave up because
hasPendingDirectories()was true (step 1) - Thread B gave up because
deinit_startedwas true (step 3) - Thread A reset
deinit_startedto false (step 4)
The final state is: pending_directories == 0, closed == true, deinit_started == false. No future trigger exists — unrefPendingDirectory will never be called again (pending_directories is already 0), and no other code path calls deinit() on this PathWatcher. The PathWatcher struct, its file_paths, and associated allocations are permanently leaked.
Timing analysis
The window is narrow: Thread B must execute steps 2-3 (the entire unrefPendingDirectory critical section plus the deferred deinit() entry including the atomic swap) between Thread A's lines 950 and 952, which are essentially consecutive instructions (branch + store). On a multi-core system, Thread B runs concurrently on another core and only needs to complete ~15 instructions (mutex lock/decrement/store/check/unlock/function call/swap) during Thread A's ~3 instructions (branch evaluation + store + return). While unlikely per invocation, it is theoretically triggerable under load, especially on ARM with weaker memory ordering.
Impact
The consequence is a resource leak only — the PathWatcher struct plus associated allocations (file_paths, etc.) are never freed. This is not a crash, UAF, or security issue. The trigger scenario (main thread calling deinit() racing with the very last unrefPendingDirectory completing on the work pool) is itself uncommon, and the timing window within that scenario is extremely tight.
How to fix
Instead of resetting deinit_started to false in the hasPendingDirectories() early-return path, keep it true. Then modify unrefPendingDirectory to check deinit_started directly (similar to how deinit_on_last_task/deinit_on_last_watcher work elsewhere in this file). When unrefPendingDirectory sees pending_directories == 0 and deinit_started == true, it performs the cleanup directly rather than calling deinit() which would bail on the swap. Alternatively, perform the hasPendingDirectories() check under the mutex to coordinate with unrefPendingDirectory atomically.
There was a problem hiding this comment.
Valid finding — narrow TOCTOU window where neither thread frees the PathWatcher. Low severity (leak only, not crash), but should be fixed. Briefing fixer.
|
Closing: CI failed (build #39618, 6 steps) and this approach is superseded per maintainer feedback. A replacement PR with the simpler #27469 approach will be opened separately. |
|
@robobun pick this back up and remake your replacement PR you said would exist |
|
✅ Replacement PR opened → #29391 Applies the minimal #27469 defer-deinit fix onto current main. Verified locally:
Rebased onto latest main to pick up the |
## What does this PR do? Targeted fix for four concurrency bugs in `src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock or crash. This is the minimal port of #27469 (by @chrislloyd) onto current main — previous attempts (#27957, #28088, #28104) added scope and introduced ASAN regressions. This PR applies only the defer-ordering fix. ## Root causes ### `unregisterWatcher` self-deadlock (primary) The deinit-on-last-watcher `defer` was registered *after* the `mutex.lock()`/`defer mutex.unlock()` pair: ```zig this.mutex.lock(); defer this.mutex.unlock(); // fires last var watchers = this.watchers.slice(); defer { // fires BEFORE unlock if (this.deinit_on_last_watcher and this.watcher_count == 0) { this.deinit(); // called holding this.mutex } } ``` Zig defers fire LIFO, so `deinit()` ran while still holding `this.mutex`. `deinit()` re-acquires `this.mutex` when `hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock (observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock detected` on Linux). Also UAF: if `deinit()` completes it calls `destroy(this)`, then the deferred `mutex.unlock()` writes to freed memory. ### `unrefPendingTask` UAF Same shape: `deinit()` called while holding `this.mutex`. If it completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF. ### `unrefPendingDirectory` self-deadlock `deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while holding → self-deadlock. ### `processWatcher` AB/BA lock inversion Worker thread holds `watcher.mutex` → calls `manager._decrementPathRef()` on the OOM error path, which acquires `manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` → wants `watcher.mutex`. Classic AB/BA deadlock. ## Fix All four use the same pattern: set a `should_deinit` flag under the lock, checked by a `defer` registered *before* the lock/unlock pair. LIFO ordering then yields unlock → deinit. For `processWatcher`, capture the append result and release `watcher.mutex` before calling `_decrementPathRef`. No semantic changes to when the `has_pending_*` atomics are cleared — the conditions remain exactly as before, only `deinit()` is moved outside the lock. ## How did you verify your code works? - `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes - Without the fix (src/ stashed), the same test fails with `panic: Deadlock detected` from the debug-build mutex checker - All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under ASAN (these regressed in #28104) - `test/regression/issue/3657.test.ts` passes under ASAN - `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing root-permission failures (same on main) Supersedes #27469, #27957, #28088, #28104. --- **Re: duplicate-bot flagging #26385** — that PR fixes a different lock inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around `main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex` self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex` inversion. They're complementary, not duplicates. **Re: find-issues-bot flagging #18919** — tested the repro; it hits a separate pre-existing use-after-poison in the File Watcher thread that also reproduces on main without this change. Not claiming that fix here.
…9391) ## What does this PR do? Targeted fix for four concurrency bugs in `src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock or crash. This is the minimal port of oven-sh#27469 (by @chrislloyd) onto current main — previous attempts (oven-sh#27957, oven-sh#28088, oven-sh#28104) added scope and introduced ASAN regressions. This PR applies only the defer-ordering fix. ## Root causes ### `unregisterWatcher` self-deadlock (primary) The deinit-on-last-watcher `defer` was registered *after* the `mutex.lock()`/`defer mutex.unlock()` pair: ```zig this.mutex.lock(); defer this.mutex.unlock(); // fires last var watchers = this.watchers.slice(); defer { // fires BEFORE unlock if (this.deinit_on_last_watcher and this.watcher_count == 0) { this.deinit(); // called holding this.mutex } } ``` Zig defers fire LIFO, so `deinit()` ran while still holding `this.mutex`. `deinit()` re-acquires `this.mutex` when `hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock (observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock detected` on Linux). Also UAF: if `deinit()` completes it calls `destroy(this)`, then the deferred `mutex.unlock()` writes to freed memory. ### `unrefPendingTask` UAF Same shape: `deinit()` called while holding `this.mutex`. If it completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF. ### `unrefPendingDirectory` self-deadlock `deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while holding → self-deadlock. ### `processWatcher` AB/BA lock inversion Worker thread holds `watcher.mutex` → calls `manager._decrementPathRef()` on the OOM error path, which acquires `manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` → wants `watcher.mutex`. Classic AB/BA deadlock. ## Fix All four use the same pattern: set a `should_deinit` flag under the lock, checked by a `defer` registered *before* the lock/unlock pair. LIFO ordering then yields unlock → deinit. For `processWatcher`, capture the append result and release `watcher.mutex` before calling `_decrementPathRef`. No semantic changes to when the `has_pending_*` atomics are cleared — the conditions remain exactly as before, only `deinit()` is moved outside the lock. ## How did you verify your code works? - `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes - Without the fix (src/ stashed), the same test fails with `panic: Deadlock detected` from the debug-build mutex checker - All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under ASAN (these regressed in oven-sh#28104) - `test/regression/issue/3657.test.ts` passes under ASAN - `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing root-permission failures (same on main) Supersedes oven-sh#27469, oven-sh#27957, oven-sh#28088, oven-sh#28104. --- **Re: duplicate-bot flagging oven-sh#26385** — that PR fixes a different lock inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around `main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex` self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex` inversion. They're complementary, not duplicates. **Re: find-issues-bot flagging oven-sh#18919** — tested the repro; it hits a separate pre-existing use-after-poison in the File Watcher thread that also reproduces on main without this change. Not claiming that fix here.
…9391) ## What does this PR do? Targeted fix for four concurrency bugs in `src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock or crash. This is the minimal port of oven-sh#27469 (by @chrislloyd) onto current main — previous attempts (oven-sh#27957, oven-sh#28088, oven-sh#28104) added scope and introduced ASAN regressions. This PR applies only the defer-ordering fix. ## Root causes ### `unregisterWatcher` self-deadlock (primary) The deinit-on-last-watcher `defer` was registered *after* the `mutex.lock()`/`defer mutex.unlock()` pair: ```zig this.mutex.lock(); defer this.mutex.unlock(); // fires last var watchers = this.watchers.slice(); defer { // fires BEFORE unlock if (this.deinit_on_last_watcher and this.watcher_count == 0) { this.deinit(); // called holding this.mutex } } ``` Zig defers fire LIFO, so `deinit()` ran while still holding `this.mutex`. `deinit()` re-acquires `this.mutex` when `hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock (observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock detected` on Linux). Also UAF: if `deinit()` completes it calls `destroy(this)`, then the deferred `mutex.unlock()` writes to freed memory. ### `unrefPendingTask` UAF Same shape: `deinit()` called while holding `this.mutex`. If it completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF. ### `unrefPendingDirectory` self-deadlock `deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while holding → self-deadlock. ### `processWatcher` AB/BA lock inversion Worker thread holds `watcher.mutex` → calls `manager._decrementPathRef()` on the OOM error path, which acquires `manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` → wants `watcher.mutex`. Classic AB/BA deadlock. ## Fix All four use the same pattern: set a `should_deinit` flag under the lock, checked by a `defer` registered *before* the lock/unlock pair. LIFO ordering then yields unlock → deinit. For `processWatcher`, capture the append result and release `watcher.mutex` before calling `_decrementPathRef`. No semantic changes to when the `has_pending_*` atomics are cleared — the conditions remain exactly as before, only `deinit()` is moved outside the lock. ## How did you verify your code works? - `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes - Without the fix (src/ stashed), the same test fails with `panic: Deadlock detected` from the debug-build mutex checker - All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under ASAN (these regressed in oven-sh#28104) - `test/regression/issue/3657.test.ts` passes under ASAN - `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing root-permission failures (same on main) Supersedes oven-sh#27469, oven-sh#27957, oven-sh#28088, oven-sh#28104. --- **Re: duplicate-bot flagging oven-sh#26385** — that PR fixes a different lock inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around `main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex` self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex` inversion. They're complementary, not duplicates. **Re: find-issues-bot flagging oven-sh#18919** — tested the repro; it hits a separate pre-existing use-after-poison in the File Watcher thread that also reproduces on main without this change. Not claiming that fix here.
What does this PR do?
Fixes multiple concurrency bugs in
PathWatcherManagerandPathWatcherthat cause deadlocks, hangs, and use-after-free whenfs.watch()watchers are rapidly created and destroyed.This is a minimal, focused port of #27469 (by @chrislloyd) onto current main. Previous attempts (#27957) added too much scope and introduced new issues. This PR applies only the defer-deinit pattern fixes.
Root causes
unregisterWatcherself-deadlock —deinit()called inside adeferthat fires whilethis.mutexis still held.deinit()re-acquires the same non-recursive mutex → self-deadlock in__ulock_wait2.unrefPendingTaskUAF — Same pattern. Ifdeinit()completes and callsdestroy(this), the subsequentdefer this.mutex.unlock()writes to freed memory.processWatcherAB/BA lock inversion — Worker thread holdswatcher.mutex→ calls_decrementPathRef→ acquiresmanager.mutex. Main threadunregisterWatcherholdsmanager.mutex→ acquireswatcher.mutex. Classic AB/BA.unrefPendingDirectoryself-deadlock —deinit()callssetClosed()which re-locksthis.mutexwhile it's already held.onErrorlock ordering — Acquiredthis.mutexthendefault_manager_mutex, butdeinit()acquires them in reverse order.Fix
All fixes use the same pattern: set a
should_deinitflag under the lock, defer actualdeinit()call to after the mutex is released. ForprocessWatcher, unlockwatcher.mutexbefore calling_decrementPathRef. ForonError, split into two separate lock scopes.How did you verify your code works?
bun bd test test/js/node/watch/fs.watch.deadlock.test.tspassesfs.watch.test.tscrashes atunrefPendingDirectoryon unmodified main (confirming the bug), no longer crashes with this fixBased on #27469 by @chrislloyd.
Verification (b674d12): CI green (56/58 pass; 1 infra flake
upload-benchmark.mjsalso fails on main; 1 windows-aarch64 still running). ASAN build+test passed — critical for this UAF fix. Baked binary (main) deadlocks on the new test; all CI test suites pass on PR binary. 4/4 bot review threads resolved. Diff matches claimed root causes.