fix(watcher): resolve lock ordering inversion in PathWatcherManager#26385
Open
Jarred-Sumner wants to merge 3 commits into
Open
fix(watcher): resolve lock ordering inversion in PathWatcherManager#26385Jarred-Sumner wants to merge 3 commits into
Jarred-Sumner wants to merge 3 commits into
Conversation
The watcher thread acquires locks in the order Watcher.mutex -> PathWatcherManager.mutex, but several PathWatcherManager methods were calling main_watcher.remove() (which acquires Watcher.mutex) while already holding PathWatcherManager.mutex. This lock ordering inversion could cause a deadlock. Fix by deferring main_watcher.remove() calls until after the manager mutex is released: - _decrementPathRefNoLock now returns the hash instead of calling remove - _decrementPathRef releases the mutex before calling remove - getNext in DirectoryRegisterTask releases the mutex before calling remove - unregisterWatcher collects hashes into a stack-backed list and removes them after releasing the mutex Co-Authored-By: Claude <noreply@anthropic.com>
Collaborator
Contributor
WalkthroughDefers removal of Changes
Suggested reviewers
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
sosukesuzuki
left a comment
Contributor
There was a problem hiding this comment.
does this work as tests for this dead lock?
// Reproduction test for PathWatcherManager deadlock.
//
// Bug: Lock order inversion between two mutexes causes deadlock:
// - File Watcher thread: Watcher.mutex (A) -> PathWatcherManager.mutex (B)
// (watchLoopCycle holds A, then onFileUpdate acquires B)
// - Main thread: PathWatcherManager.mutex (B) -> Watcher.mutex (A)
// (unregisterWatcher holds B, then _decrementPathRefNoLock -> main_watcher.remove acquires A)
//
// Reproduction strategy:
// 1. A "writer" subprocess continuously writes to watched files, generating
// kqueue events that keep the file watcher thread busy (holding Watcher.mutex).
// 2. A "watcher" subprocess rapidly creates and closes fs.watch watchers,
// triggering unregisterWatcher on the main thread (holding PathWatcherManager.mutex).
// 3. The writer runs in a separate process so it doesn't share the event loop,
// ensuring truly concurrent pressure on the file watcher thread.
import { expect, test } from "bun:test";
import { bunEnv, bunExe, tempDirWithFiles } from "harness";
import path from "node:path";
test("PathWatcherManager should not deadlock when closing watchers during rapid file events", async () => {
const fileCount = 100;
const files: Record<string, string> = {};
for (let i = 0; i < fileCount; i++) {
files[`watch-${i}.txt`] = `initial ${i}`;
}
const attempts = 5;
for (let attempt = 0; attempt < attempts; attempt++) {
const testDir = tempDirWithFiles("deadlock-repro-" + attempt, {
...files,
// Writer process: writes files in a tight loop to generate kqueue events
"writer.js": `
const fs = require("node:fs");
const path = require("node:path");
const dir = process.argv[2];
const fileCount = 100;
let iteration = 0;
// Write continuously until parent kills us
while (true) {
for (let i = 0; i < fileCount; i++) {
try {
fs.writeFileSync(path.join(dir, "watch-" + i + ".txt"), "i" + iteration + "f" + i);
} catch {}
}
iteration++;
}
`,
// Watcher process: rapidly creates and closes watchers
"watcher.js": `
const fs = require("node:fs");
const path = require("node:path");
const dir = process.argv[2];
const fileCount = 100;
const maxCycles = 500;
let cycles = 0;
function doCycle() {
if (cycles >= maxCycles) {
console.log("OK");
process.exit(0);
return;
}
// Create watchers for all files (non-recursive = kqueue path on macOS)
const watchers = [];
for (let i = 0; i < fileCount; i++) {
try {
const w = fs.watch(path.join(dir, "watch-" + i + ".txt"));
w.on("error", () => {});
watchers.push(w);
} catch {}
}
// Close all immediately — triggers unregisterWatcher which locks
// PathWatcherManager.mutex, then _decrementPathRefNoLock calls
// main_watcher.remove() which tries to lock Watcher.mutex.
// If file watcher thread holds Watcher.mutex (processing kqueue events
// from writer) and tries PathWatcherManager.mutex -> DEADLOCK
for (const w of watchers) {
w.close();
}
cycles++;
setImmediate(doCycle);
}
doCycle();
`,
});
// Start writer process (runs in tight loop, no event loop needed)
const writer = Bun.spawn({
cmd: [bunExe(), path.join(testDir, "writer.js"), testDir],
env: bunEnv,
stdout: "ignore",
stderr: "ignore",
});
// Give writer time to start generating events
await Bun.sleep(100);
// Start watcher process
await using watcher = Bun.spawn({
cmd: [bunExe(), path.join(testDir, "watcher.js"), testDir],
env: bunEnv,
stdout: "pipe",
stderr: "pipe",
});
// If the watcher process deadlocks, it won't exit within timeout
const timeoutMs = 10_000;
const result = await Promise.race([
watcher.exited.then(code => ({ status: "exited" as const, code })),
Bun.sleep(timeoutMs).then(() => ({ status: "timeout" as const, code: -1 })),
]);
// Kill writer regardless of outcome
writer.kill();
await writer.exited;
if (result.status === "timeout") {
watcher.kill(9);
await watcher.exited;
throw new Error(
`Deadlock detected: watcher process did not exit within ${timeoutMs}ms ` +
`(attempt ${attempt + 1}/${attempts})`,
);
}
const stderr = await watcher.stderr.text();
const stdout = await watcher.stdout.text();
if (result.code !== 0) {
throw new Error(
`Watcher process crashed with exit code ${result.code} ` +
`(attempt ${attempt + 1}/${attempts})` +
(stderr ? `\nstderr: ${stderr.slice(0, 1000)}` : ""),
);
}
expect(stdout.trim()).toBe("OK");
}
}, 90_000);…erWatcher StackFallbackAllocator.get() resets internal state and asserts it is only called once per instance. The previous code called sfb.get() multiple times: once in the defer statement (which eagerly evaluates the argument) and again in each hashes.append() call, triggering a debug assertion failure. Store the allocator from sfb.get() in a local variable and reuse it.
Jarred-Sumner
pushed a commit
that referenced
this pull request
Apr 17, 2026
## What does this PR do? Targeted fix for four concurrency bugs in `src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock or crash. This is the minimal port of #27469 (by @chrislloyd) onto current main — previous attempts (#27957, #28088, #28104) added scope and introduced ASAN regressions. This PR applies only the defer-ordering fix. ## Root causes ### `unregisterWatcher` self-deadlock (primary) The deinit-on-last-watcher `defer` was registered *after* the `mutex.lock()`/`defer mutex.unlock()` pair: ```zig this.mutex.lock(); defer this.mutex.unlock(); // fires last var watchers = this.watchers.slice(); defer { // fires BEFORE unlock if (this.deinit_on_last_watcher and this.watcher_count == 0) { this.deinit(); // called holding this.mutex } } ``` Zig defers fire LIFO, so `deinit()` ran while still holding `this.mutex`. `deinit()` re-acquires `this.mutex` when `hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock (observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock detected` on Linux). Also UAF: if `deinit()` completes it calls `destroy(this)`, then the deferred `mutex.unlock()` writes to freed memory. ### `unrefPendingTask` UAF Same shape: `deinit()` called while holding `this.mutex`. If it completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF. ### `unrefPendingDirectory` self-deadlock `deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while holding → self-deadlock. ### `processWatcher` AB/BA lock inversion Worker thread holds `watcher.mutex` → calls `manager._decrementPathRef()` on the OOM error path, which acquires `manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` → wants `watcher.mutex`. Classic AB/BA deadlock. ## Fix All four use the same pattern: set a `should_deinit` flag under the lock, checked by a `defer` registered *before* the lock/unlock pair. LIFO ordering then yields unlock → deinit. For `processWatcher`, capture the append result and release `watcher.mutex` before calling `_decrementPathRef`. No semantic changes to when the `has_pending_*` atomics are cleared — the conditions remain exactly as before, only `deinit()` is moved outside the lock. ## How did you verify your code works? - `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes - Without the fix (src/ stashed), the same test fails with `panic: Deadlock detected` from the debug-build mutex checker - All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under ASAN (these regressed in #28104) - `test/regression/issue/3657.test.ts` passes under ASAN - `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing root-permission failures (same on main) Supersedes #27469, #27957, #28088, #28104. --- **Re: duplicate-bot flagging #26385** — that PR fixes a different lock inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around `main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex` self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex` inversion. They're complementary, not duplicates. **Re: find-issues-bot flagging #18919** — tested the repro; it hits a separate pre-existing use-after-poison in the File Watcher thread that also reproduces on main without this change. Not claiming that fix here.
structwafel
pushed a commit
to structwafel/bun
that referenced
this pull request
Apr 25, 2026
…9391) ## What does this PR do? Targeted fix for four concurrency bugs in `src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock or crash. This is the minimal port of oven-sh#27469 (by @chrislloyd) onto current main — previous attempts (oven-sh#27957, oven-sh#28088, oven-sh#28104) added scope and introduced ASAN regressions. This PR applies only the defer-ordering fix. ## Root causes ### `unregisterWatcher` self-deadlock (primary) The deinit-on-last-watcher `defer` was registered *after* the `mutex.lock()`/`defer mutex.unlock()` pair: ```zig this.mutex.lock(); defer this.mutex.unlock(); // fires last var watchers = this.watchers.slice(); defer { // fires BEFORE unlock if (this.deinit_on_last_watcher and this.watcher_count == 0) { this.deinit(); // called holding this.mutex } } ``` Zig defers fire LIFO, so `deinit()` ran while still holding `this.mutex`. `deinit()` re-acquires `this.mutex` when `hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock (observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock detected` on Linux). Also UAF: if `deinit()` completes it calls `destroy(this)`, then the deferred `mutex.unlock()` writes to freed memory. ### `unrefPendingTask` UAF Same shape: `deinit()` called while holding `this.mutex`. If it completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF. ### `unrefPendingDirectory` self-deadlock `deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while holding → self-deadlock. ### `processWatcher` AB/BA lock inversion Worker thread holds `watcher.mutex` → calls `manager._decrementPathRef()` on the OOM error path, which acquires `manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` → wants `watcher.mutex`. Classic AB/BA deadlock. ## Fix All four use the same pattern: set a `should_deinit` flag under the lock, checked by a `defer` registered *before* the lock/unlock pair. LIFO ordering then yields unlock → deinit. For `processWatcher`, capture the append result and release `watcher.mutex` before calling `_decrementPathRef`. No semantic changes to when the `has_pending_*` atomics are cleared — the conditions remain exactly as before, only `deinit()` is moved outside the lock. ## How did you verify your code works? - `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes - Without the fix (src/ stashed), the same test fails with `panic: Deadlock detected` from the debug-build mutex checker - All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under ASAN (these regressed in oven-sh#28104) - `test/regression/issue/3657.test.ts` passes under ASAN - `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing root-permission failures (same on main) Supersedes oven-sh#27469, oven-sh#27957, oven-sh#28088, oven-sh#28104. --- **Re: duplicate-bot flagging oven-sh#26385** — that PR fixes a different lock inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around `main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex` self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex` inversion. They're complementary, not duplicates. **Re: find-issues-bot flagging oven-sh#18919** — tested the repro; it hits a separate pre-existing use-after-poison in the File Watcher thread that also reproduces on main without this change. Not claiming that fix here.
This was referenced Apr 29, 2026
xhjkl
pushed a commit
to xhjkl/bun
that referenced
this pull request
May 14, 2026
…9391) ## What does this PR do? Targeted fix for four concurrency bugs in `src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock or crash. This is the minimal port of oven-sh#27469 (by @chrislloyd) onto current main — previous attempts (oven-sh#27957, oven-sh#28088, oven-sh#28104) added scope and introduced ASAN regressions. This PR applies only the defer-ordering fix. ## Root causes ### `unregisterWatcher` self-deadlock (primary) The deinit-on-last-watcher `defer` was registered *after* the `mutex.lock()`/`defer mutex.unlock()` pair: ```zig this.mutex.lock(); defer this.mutex.unlock(); // fires last var watchers = this.watchers.slice(); defer { // fires BEFORE unlock if (this.deinit_on_last_watcher and this.watcher_count == 0) { this.deinit(); // called holding this.mutex } } ``` Zig defers fire LIFO, so `deinit()` ran while still holding `this.mutex`. `deinit()` re-acquires `this.mutex` when `hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock (observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock detected` on Linux). Also UAF: if `deinit()` completes it calls `destroy(this)`, then the deferred `mutex.unlock()` writes to freed memory. ### `unrefPendingTask` UAF Same shape: `deinit()` called while holding `this.mutex`. If it completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF. ### `unrefPendingDirectory` self-deadlock `deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while holding → self-deadlock. ### `processWatcher` AB/BA lock inversion Worker thread holds `watcher.mutex` → calls `manager._decrementPathRef()` on the OOM error path, which acquires `manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` → wants `watcher.mutex`. Classic AB/BA deadlock. ## Fix All four use the same pattern: set a `should_deinit` flag under the lock, checked by a `defer` registered *before* the lock/unlock pair. LIFO ordering then yields unlock → deinit. For `processWatcher`, capture the append result and release `watcher.mutex` before calling `_decrementPathRef`. No semantic changes to when the `has_pending_*` atomics are cleared — the conditions remain exactly as before, only `deinit()` is moved outside the lock. ## How did you verify your code works? - `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes - Without the fix (src/ stashed), the same test fails with `panic: Deadlock detected` from the debug-build mutex checker - All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under ASAN (these regressed in oven-sh#28104) - `test/regression/issue/3657.test.ts` passes under ASAN - `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing root-permission failures (same on main) Supersedes oven-sh#27469, oven-sh#27957, oven-sh#28088, oven-sh#28104. --- **Re: duplicate-bot flagging oven-sh#26385** — that PR fixes a different lock inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around `main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex` self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex` inversion. They're complementary, not duplicates. **Re: find-issues-bot flagging oven-sh#18919** — tested the repro; it hits a separate pre-existing use-after-poison in the File Watcher thread that also reproduces on main without this change. Not claiming that fix here.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PathWatcherManagercaused by lock ordering inversion betweenPathWatcherManager.mutexandWatcher.mutexWatcher.mutexthenPathWatcherManager.mutex, but several manager methods were callingmain_watcher.remove()(which acquiresWatcher.mutex) while already holdingPathWatcherManager.mutexmain_watcher.remove()calls until after the manager mutex is released in_decrementPathRefNoLock,_decrementPathRef,getNext, andunregisterWatcherTest plan
fs.watchandfs.watchFilestill function correctly withbun bd test test/js/node/watchbun bd test test/js/bun/watcherChangelog
Fixed a potential deadlock in
fs.watch/ path watcher caused by lock ordering inversion when unregistering watchers or decrementing path references.🤖 Generated with Claude Code (0% 8-shotted by claude)