Skip to content

fix(watcher): resolve lock ordering inversion in PathWatcherManager#26385

Open
Jarred-Sumner wants to merge 3 commits into
mainfrom
jarred/fix-path-watcher-race
Open

fix(watcher): resolve lock ordering inversion in PathWatcherManager#26385
Jarred-Sumner wants to merge 3 commits into
mainfrom
jarred/fix-path-watcher-race

Conversation

@Jarred-Sumner

Copy link
Copy Markdown
Collaborator

Summary

  • Fix a potential deadlock in PathWatcherManager caused by lock ordering inversion between PathWatcherManager.mutex and Watcher.mutex
  • The watcher thread acquires Watcher.mutex then PathWatcherManager.mutex, but several manager methods were calling main_watcher.remove() (which acquires Watcher.mutex) while already holding PathWatcherManager.mutex
  • Defer all main_watcher.remove() calls until after the manager mutex is released in _decrementPathRefNoLock, _decrementPathRef, getNext, and unregisterWatcher

Test plan

  • Verify fs.watch and fs.watchFile still function correctly with bun bd test test/js/node/watch
  • Stress-test concurrent file watching scenarios to confirm no deadlocks occur
  • Run existing path watcher tests: bun bd test test/js/bun/watcher

Changelog

Fixed a potential deadlock in fs.watch / path watcher caused by lock ordering inversion when unregistering watchers or decrementing path references.

🤖 Generated with Claude Code (0% 8-shotted by claude)

The watcher thread acquires locks in the order Watcher.mutex ->
PathWatcherManager.mutex, but several PathWatcherManager methods were
calling main_watcher.remove() (which acquires Watcher.mutex) while
already holding PathWatcherManager.mutex. This lock ordering inversion
could cause a deadlock.

Fix by deferring main_watcher.remove() calls until after the manager
mutex is released:

- _decrementPathRefNoLock now returns the hash instead of calling remove
- _decrementPathRef releases the mutex before calling remove
- getNext in DirectoryRegisterTask releases the mutex before calling remove
- unregisterWatcher collects hashes into a stack-backed list and removes
  them after releasing the mutex

Co-Authored-By: Claude <noreply@anthropic.com>
@robobun

robobun commented Jan 23, 2026

Copy link
Copy Markdown
Collaborator
Updated 12:53 AM PT - Jan 23rd, 2026

@Jarred-Sumner, your commit 7f0e291 has 4 failures in Build #35642 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 26385

That installs a local version of the PR into your bun-26385 executable, so you can run:

bun-26385 --bun

@coderabbitai

coderabbitai Bot commented Jan 23, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

Defers removal of main_watcher entries until after releasing PathWatcherManager.mutex, changes _decrementPathRefNoLock to return an optional Watcher.HashType, and restructures getNext and unregisterWatcher to collect removals under the lock and perform cleanup post-mutex. Adds ownership tracking for watcher file_path strings.

Changes

Cohort / File(s) Change Summary
Path watcher core
src/bun.js/node/path_watcher.zig
_decrementPathRefNoLock now returns ?Watcher.HashType (was void) so callers can remove main_watcher entries after releasing PathWatcherManager.mutex. getNext computes optional hashes in a short block and defers main_watcher removals post-mutex. unregisterWatcher collects hashes while holding the mutex and performs removals and deinit checks after releasing it. Added detailed lock-order and post-mutex removal comments. Net change: +93/-47 lines.
Watcher ownership
src/Watcher.zig
Added owns_file_path: bool to WatchItem; propagate ownership through appendFileAssumeCapacity / appendDirectoryAssumeCapacity. Free owned file_path allocations in flushEvictions, deinit, and threadMain deallocation paths to ensure proper cleanup on eviction/deinit. Net change: +36/-0 lines.

Suggested reviewers

  • taylordotfish
  • nektro
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main change: fixing a lock ordering inversion in PathWatcherManager that causes potential deadlocks.
Description check ✅ Passed The description covers the required template sections with clear explanations of the problem, solution, and test plan. It explains the deadlock issue, the lock ordering inversion details, and provides specific test commands.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@sosukesuzuki sosukesuzuki left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work as tests for this dead lock?

// Reproduction test for PathWatcherManager deadlock.
//
// Bug: Lock order inversion between two mutexes causes deadlock:
//   - File Watcher thread: Watcher.mutex (A) -> PathWatcherManager.mutex (B)
//     (watchLoopCycle holds A, then onFileUpdate acquires B)
//   - Main thread: PathWatcherManager.mutex (B) -> Watcher.mutex (A)
//     (unregisterWatcher holds B, then _decrementPathRefNoLock -> main_watcher.remove acquires A)
//
// Reproduction strategy:
//   1. A "writer" subprocess continuously writes to watched files, generating
//      kqueue events that keep the file watcher thread busy (holding Watcher.mutex).
//   2. A "watcher" subprocess rapidly creates and closes fs.watch watchers,
//      triggering unregisterWatcher on the main thread (holding PathWatcherManager.mutex).
//   3. The writer runs in a separate process so it doesn't share the event loop,
//      ensuring truly concurrent pressure on the file watcher thread.

import { expect, test } from "bun:test";
import { bunEnv, bunExe, tempDirWithFiles } from "harness";
import path from "node:path";

test("PathWatcherManager should not deadlock when closing watchers during rapid file events", async () => {
  const fileCount = 100;
  const files: Record<string, string> = {};
  for (let i = 0; i < fileCount; i++) {
    files[`watch-${i}.txt`] = `initial ${i}`;
  }

  const attempts = 5;
  for (let attempt = 0; attempt < attempts; attempt++) {
    const testDir = tempDirWithFiles("deadlock-repro-" + attempt, {
      ...files,
      // Writer process: writes files in a tight loop to generate kqueue events
      "writer.js": `
const fs = require("node:fs");
const path = require("node:path");
const dir = process.argv[2];
const fileCount = 100;
let iteration = 0;
// Write continuously until parent kills us
while (true) {
  for (let i = 0; i < fileCount; i++) {
    try {
      fs.writeFileSync(path.join(dir, "watch-" + i + ".txt"), "i" + iteration + "f" + i);
    } catch {}
  }
  iteration++;
}
`,
      // Watcher process: rapidly creates and closes watchers
      "watcher.js": `
const fs = require("node:fs");
const path = require("node:path");
const dir = process.argv[2];
const fileCount = 100;
const maxCycles = 500;

let cycles = 0;

function doCycle() {
  if (cycles >= maxCycles) {
    console.log("OK");
    process.exit(0);
    return;
  }

  // Create watchers for all files (non-recursive = kqueue path on macOS)
  const watchers = [];
  for (let i = 0; i < fileCount; i++) {
    try {
      const w = fs.watch(path.join(dir, "watch-" + i + ".txt"));
      w.on("error", () => {});
      watchers.push(w);
    } catch {}
  }

  // Close all immediately — triggers unregisterWatcher which locks
  // PathWatcherManager.mutex, then _decrementPathRefNoLock calls
  // main_watcher.remove() which tries to lock Watcher.mutex.
  // If file watcher thread holds Watcher.mutex (processing kqueue events
  // from writer) and tries PathWatcherManager.mutex -> DEADLOCK
  for (const w of watchers) {
    w.close();
  }

  cycles++;
  setImmediate(doCycle);
}

doCycle();
`,
    });

    // Start writer process (runs in tight loop, no event loop needed)
    const writer = Bun.spawn({
      cmd: [bunExe(), path.join(testDir, "writer.js"), testDir],
      env: bunEnv,
      stdout: "ignore",
      stderr: "ignore",
    });

    // Give writer time to start generating events
    await Bun.sleep(100);

    // Start watcher process
    await using watcher = Bun.spawn({
      cmd: [bunExe(), path.join(testDir, "watcher.js"), testDir],
      env: bunEnv,
      stdout: "pipe",
      stderr: "pipe",
    });

    // If the watcher process deadlocks, it won't exit within timeout
    const timeoutMs = 10_000;
    const result = await Promise.race([
      watcher.exited.then(code => ({ status: "exited" as const, code })),
      Bun.sleep(timeoutMs).then(() => ({ status: "timeout" as const, code: -1 })),
    ]);

    // Kill writer regardless of outcome
    writer.kill();
    await writer.exited;

    if (result.status === "timeout") {
      watcher.kill(9);
      await watcher.exited;
      throw new Error(
        `Deadlock detected: watcher process did not exit within ${timeoutMs}ms ` +
          `(attempt ${attempt + 1}/${attempts})`,
      );
    }

    const stderr = await watcher.stderr.text();
    const stdout = await watcher.stdout.text();

    if (result.code !== 0) {
      throw new Error(
        `Watcher process crashed with exit code ${result.code} ` +
          `(attempt ${attempt + 1}/${attempts})` +
          (stderr ? `\nstderr: ${stderr.slice(0, 1000)}` : ""),
      );
    }

    expect(stdout.trim()).toBe("OK");
  }
}, 90_000);

…erWatcher

StackFallbackAllocator.get() resets internal state and asserts it is
only called once per instance. The previous code called sfb.get()
multiple times: once in the defer statement (which eagerly evaluates
the argument) and again in each hashes.append() call, triggering a
debug assertion failure.

Store the allocator from sfb.get() in a local variable and reuse it.
Jarred-Sumner pushed a commit that referenced this pull request Apr 17, 2026
## What does this PR do?

Targeted fix for four concurrency bugs in
`src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock
or crash. This is the minimal port of #27469 (by @chrislloyd) onto
current main — previous attempts (#27957, #28088, #28104) added scope
and introduced ASAN regressions. This PR applies only the defer-ordering
fix.

## Root causes

### `unregisterWatcher` self-deadlock (primary)

The deinit-on-last-watcher `defer` was registered *after* the
`mutex.lock()`/`defer mutex.unlock()` pair:

```zig
this.mutex.lock();
defer this.mutex.unlock();           // fires last
var watchers = this.watchers.slice();
defer {                               // fires BEFORE unlock
    if (this.deinit_on_last_watcher and this.watcher_count == 0) {
        this.deinit();                // called holding this.mutex
    }
}
```

Zig defers fire LIFO, so `deinit()` ran while still holding
`this.mutex`. `deinit()` re-acquires `this.mutex` when
`hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock
(observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock
detected` on Linux).

Also UAF: if `deinit()` completes it calls `destroy(this)`, then the
deferred `mutex.unlock()` writes to freed memory.

### `unrefPendingTask` UAF

Same shape: `deinit()` called while holding `this.mutex`. If it
completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF.

### `unrefPendingDirectory` self-deadlock

`deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while
holding → self-deadlock.

### `processWatcher` AB/BA lock inversion

Worker thread holds `watcher.mutex` → calls
`manager._decrementPathRef()` on the OOM error path, which acquires
`manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` →
wants `watcher.mutex`. Classic AB/BA deadlock.

## Fix

All four use the same pattern: set a `should_deinit` flag under the
lock, checked by a `defer` registered *before* the lock/unlock pair.
LIFO ordering then yields unlock → deinit. For `processWatcher`, capture
the append result and release `watcher.mutex` before calling
`_decrementPathRef`.

No semantic changes to when the `has_pending_*` atomics are cleared —
the conditions remain exactly as before, only `deinit()` is moved
outside the lock.

## How did you verify your code works?

- `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes
- Without the fix (src/ stashed), the same test fails with `panic:
Deadlock detected` from the debug-build mutex checker
- All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under
ASAN (these regressed in #28104)
- `test/regression/issue/3657.test.ts` passes under ASAN
- `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing
root-permission failures (same on main)

Supersedes #27469, #27957, #28088, #28104.

---

**Re: duplicate-bot flagging #26385** — that PR fixes a different lock
inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around
`main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex`
self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex`
inversion. They're complementary, not duplicates.

**Re: find-issues-bot flagging #18919** — tested the repro; it hits a
separate pre-existing use-after-poison in the File Watcher thread that
also reproduces on main without this change. Not claiming that fix here.
structwafel pushed a commit to structwafel/bun that referenced this pull request Apr 25, 2026
…9391)

## What does this PR do?

Targeted fix for four concurrency bugs in
`src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock
or crash. This is the minimal port of oven-sh#27469 (by @chrislloyd) onto
current main — previous attempts (oven-sh#27957, oven-sh#28088, oven-sh#28104) added scope
and introduced ASAN regressions. This PR applies only the defer-ordering
fix.

## Root causes

### `unregisterWatcher` self-deadlock (primary)

The deinit-on-last-watcher `defer` was registered *after* the
`mutex.lock()`/`defer mutex.unlock()` pair:

```zig
this.mutex.lock();
defer this.mutex.unlock();           // fires last
var watchers = this.watchers.slice();
defer {                               // fires BEFORE unlock
    if (this.deinit_on_last_watcher and this.watcher_count == 0) {
        this.deinit();                // called holding this.mutex
    }
}
```

Zig defers fire LIFO, so `deinit()` ran while still holding
`this.mutex`. `deinit()` re-acquires `this.mutex` when
`hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock
(observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock
detected` on Linux).

Also UAF: if `deinit()` completes it calls `destroy(this)`, then the
deferred `mutex.unlock()` writes to freed memory.

### `unrefPendingTask` UAF

Same shape: `deinit()` called while holding `this.mutex`. If it
completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF.

### `unrefPendingDirectory` self-deadlock

`deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while
holding → self-deadlock.

### `processWatcher` AB/BA lock inversion

Worker thread holds `watcher.mutex` → calls
`manager._decrementPathRef()` on the OOM error path, which acquires
`manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` →
wants `watcher.mutex`. Classic AB/BA deadlock.

## Fix

All four use the same pattern: set a `should_deinit` flag under the
lock, checked by a `defer` registered *before* the lock/unlock pair.
LIFO ordering then yields unlock → deinit. For `processWatcher`, capture
the append result and release `watcher.mutex` before calling
`_decrementPathRef`.

No semantic changes to when the `has_pending_*` atomics are cleared —
the conditions remain exactly as before, only `deinit()` is moved
outside the lock.

## How did you verify your code works?

- `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes
- Without the fix (src/ stashed), the same test fails with `panic:
Deadlock detected` from the debug-build mutex checker
- All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under
ASAN (these regressed in oven-sh#28104)
- `test/regression/issue/3657.test.ts` passes under ASAN
- `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing
root-permission failures (same on main)

Supersedes oven-sh#27469, oven-sh#27957, oven-sh#28088, oven-sh#28104.

---

**Re: duplicate-bot flagging oven-sh#26385** — that PR fixes a different lock
inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around
`main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex`
self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex`
inversion. They're complementary, not duplicates.

**Re: find-issues-bot flagging oven-sh#18919** — tested the repro; it hits a
separate pre-existing use-after-poison in the File Watcher thread that
also reproduces on main without this change. Not claiming that fix here.
xhjkl pushed a commit to xhjkl/bun that referenced this pull request May 14, 2026
…9391)

## What does this PR do?

Targeted fix for four concurrency bugs in
`src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock
or crash. This is the minimal port of oven-sh#27469 (by @chrislloyd) onto
current main — previous attempts (oven-sh#27957, oven-sh#28088, oven-sh#28104) added scope
and introduced ASAN regressions. This PR applies only the defer-ordering
fix.

## Root causes

### `unregisterWatcher` self-deadlock (primary)

The deinit-on-last-watcher `defer` was registered *after* the
`mutex.lock()`/`defer mutex.unlock()` pair:

```zig
this.mutex.lock();
defer this.mutex.unlock();           // fires last
var watchers = this.watchers.slice();
defer {                               // fires BEFORE unlock
    if (this.deinit_on_last_watcher and this.watcher_count == 0) {
        this.deinit();                // called holding this.mutex
    }
}
```

Zig defers fire LIFO, so `deinit()` ran while still holding
`this.mutex`. `deinit()` re-acquires `this.mutex` when
`hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock
(observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock
detected` on Linux).

Also UAF: if `deinit()` completes it calls `destroy(this)`, then the
deferred `mutex.unlock()` writes to freed memory.

### `unrefPendingTask` UAF

Same shape: `deinit()` called while holding `this.mutex`. If it
completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF.

### `unrefPendingDirectory` self-deadlock

`deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while
holding → self-deadlock.

### `processWatcher` AB/BA lock inversion

Worker thread holds `watcher.mutex` → calls
`manager._decrementPathRef()` on the OOM error path, which acquires
`manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` →
wants `watcher.mutex`. Classic AB/BA deadlock.

## Fix

All four use the same pattern: set a `should_deinit` flag under the
lock, checked by a `defer` registered *before* the lock/unlock pair.
LIFO ordering then yields unlock → deinit. For `processWatcher`, capture
the append result and release `watcher.mutex` before calling
`_decrementPathRef`.

No semantic changes to when the `has_pending_*` atomics are cleared —
the conditions remain exactly as before, only `deinit()` is moved
outside the lock.

## How did you verify your code works?

- `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes
- Without the fix (src/ stashed), the same test fails with `panic:
Deadlock detected` from the debug-build mutex checker
- All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under
ASAN (these regressed in oven-sh#28104)
- `test/regression/issue/3657.test.ts` passes under ASAN
- `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing
root-permission failures (same on main)

Supersedes oven-sh#27469, oven-sh#27957, oven-sh#28088, oven-sh#28104.

---

**Re: duplicate-bot flagging oven-sh#26385** — that PR fixes a different lock
inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around
`main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex`
self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex`
inversion. They're complementary, not duplicates.

**Re: find-issues-bot flagging oven-sh#18919** — tested the repro; it hits a
separate pre-existing use-after-poison in the File Watcher thread that
also reproduces on main without this change. Not claiming that fix here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants