Skip to content

Fix PathWatcherManager deadlock and UAF in deferred deinit#27954

Closed
robobun wants to merge 9 commits into
mainfrom
claude/fix-http-sslconfig-sharedptr
Closed

Fix PathWatcherManager deadlock and UAF in deferred deinit#27954
robobun wants to merge 9 commits into
mainfrom
claude/fix-http-sslconfig-sharedptr

Conversation

@robobun

@robobun robobun commented Mar 9, 2026

Copy link
Copy Markdown
Collaborator

Summary

One-line change

-        if (props.server_name) |sn| {
+        if (props.get().server_name) |sn| {

Resolves all 13 failing build-zig CI checks on #27469.

https://claude.ai/code/session_013XzvW9VsevSPURzRLixE7G

chrislloyd and others added 8 commits March 2, 2026 10:59
Three issues in the deferred-deinit pattern used by
PathWatcherManager, all triggered by rapid fs.watch() churn.

## unregisterWatcher self-deadlock (primary)

The deinit-on-last-watcher defer was registered AFTER the
mutex.lock()/defer mutex.unlock() pair:

    this.mutex.lock();
    defer this.mutex.unlock();           // fires last
    var watchers = this.watchers.slice();
    defer {                               // fires BEFORE unlock
        if (this.deinit_on_last_watcher and this.watcher_count == 0) {
            this.deinit();                // called holding this.mutex
        }
    }

Zig defers fire LIFO, so deinit() ran while still holding
this.mutex. deinit() then re-acquires this.mutex when
hasPendingTasks() is true (a DirectoryRegisterTask running on
the workpool). os_unfair_lock is non-recursive -> the thread
blocks in __ulock_wait2 forever.

Trigger: onError() called (sets deinit_on_last_watcher), then
last watcher closed while a workpool task is mid-flight.

Also UAF: if deinit() completes it calls destroy(this), then
defer this.mutex.unlock() fires on freed memory.

Fix: hoist to a should_deinit flag checked by a defer
registered before the lock. LIFO ordering now yields
unlock -> deinit.

## unrefPendingTask UAF

Same shape: this.deinit() called while holding this.mutex. In
this path has_pending_tasks is stored false just before, so
deinit() doesn't re-lock -- but if it completes, destroy(this)
means the defer this.mutex.unlock() is a UAF. Same fix pattern.

## processWatcher AB/BA lock inversion

Worker thread holds watcher.mutex (processWatcher) -> calls
manager._decrementPathRef() on the OOM error path, which
acquires manager.mutex. Main thread unregisterWatcher holds
manager.mutex -> wants watcher.mutex. Classic AB/BA. Only on
allocation failure during file_paths.append, but real under
memory pressure.

Fix: capture append result, unlock watcher.mutex BEFORE
calling _decrementPathRef.

## Test plan

- zig fmt --check + zig ast-check (Bun's zig 0.15.2)
- bun run zig:check (full semantic analysis, 5/5)
- Full compile to bun-zig.o (asan + noasan), symbols verified
- No JS regression test: the trigger (onError() firing) requires
  platform-level watch loop failure which isn't reproducible
  via rmSync of watched paths (inotify/kqueue return deletion
  events, not errors). Stress tests pass on unpatched Linux.

:house: Remote-Dev: homespace
This reverts commit 5c57817.
Merge with main pulled in #27838 (SSLConfig SharedPtr refactor) which
changed tls_props from a raw pointer to ?SSLConfig.SharedPtr. The
getTlsHostname function (from #27891) needs .get() to access the inner
struct.

https://claude.ai/code/session_013XzvW9VsevSPURzRLixE7G
@robobun

robobun commented Mar 9, 2026

Copy link
Copy Markdown
Collaborator Author
Updated 3:14 PM PT - Mar 9th, 2026

@claude, your commit 147e058 has 16 failures in Build #39099 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 27954

That installs a local version of the PR into your bun-27954 executable, so you can run:

bun-27954 --bun

@coderabbitai

coderabbitai Bot commented Mar 9, 2026

Copy link
Copy Markdown
Contributor

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

Deferred deinitialization and mutex-unlock ordering were added to PathWatcher to prevent unlocking freed mutexes and deadlocks; deinit calls are deferred until after relevant mutexes are released. HTTP TLS hostname lookup now reads server_name via props.get().server_name when tls_props is present.

Changes

Cohort / File(s) Summary
Path Watcher deinit & mutex safety
src/bun.js/node/path_watcher.zig
Adds a should_deinit deferred-deinit pattern and rearranges mutex lock/unlock order to avoid unlock-after-free and AB/BA deadlocks. Updates unrefPendingTask, unrefPendingDirectory, unregisterWatcher, and PathWatcher.deinit to clear pending flags unconditionally when counts hit zero and defer actual deinit() until after releasing mutexes. Adds explanatory comments; no public signatures changed.
TLS hostname access
src/http.zig
Modifies getTlsHostname() to obtain the explicit TLS server_name via props.get().server_name when tls_props exists, replacing direct props.server_name access.
🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The PR description covers the summary and one-line change but lacks explicit verification details as required by the template's 'How did you verify your code works?' section. Add a 'How did you verify your code works?' section documenting testing approach, CI results, or other verification methods used.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: fixing a PathWatcherManager deadlock and UAF issue through deferred deinit logic.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@alii alii changed the title fix: dereference SSLConfig SharedPtr in getTlsHostname Fix PathWatcherManager deadlock and UAF in deferred deinit Mar 9, 2026
Comment thread src/bun.js/node/path_watcher.zig
@robobun

robobun commented Mar 9, 2026

Copy link
Copy Markdown
Collaborator Author

Race fix landed in 147e058. All review threads resolved. Full diff reviewed — 5 fixes all look correct (3 deadlocks, 1 FD leak, 1 SharedPtr accessor). Waiting on CI (23 checks still running)...

PathWatcher.deinit() previously called setClosed() (which locks/unlocks
the mutex) followed by hasPendingDirectories() (lockless atomic load).
Between these two operations, a worker thread in unrefPendingDirectory()
could acquire the mutex, see closed=true and pending_directories==0,
and set should_deinit=true — causing both threads to call destroy().

Fix by combining the closed flag store and pending_directories check
under a single mutex hold in deinit(). Since unrefPendingDirectory()
already checks isClosed() inside its own mutex hold, the two threads
can no longer both decide to proceed with cleanup.

Apply the same pattern to PathWatcherManager.deinit() where
hasPendingTasks() was checked outside the mutex after the watcher_count
gate, allowing the last task to complete unobserved.
@robobun

robobun commented Mar 9, 2026

Copy link
Copy Markdown
Collaborator Author

Fixed the double-deinit race in PathWatcher.deinit() and the analogous issue in PathWatcherManager.deinit().

PathWatcher.deinit(): Previously called setClosed() (lock → store closed=true → unlock) then hasPendingDirectories() (lockless atomic load). Between these two operations, a worker in unrefPendingDirectory() could lock the mutex, see isClosed()=true and pending_directories==0, and set should_deinit=true — both threads then proceed to destroy(). Fixed by inlining the closed store and pending_directories check under a single mutex hold. Since unrefPendingDirectory() already checks isClosed() inside its own mutex hold, exactly one thread can now decide to proceed with cleanup.

PathWatcherManager.deinit(): Same class of bug — hasPendingTasks() was a lockless atomic load outside the mutex, so the last task could complete unobserved between the check and the deinit_on_last_task = true assignment. Fixed by checking pending_tasks > 0 under the mutex.

setClosed() is left unchanged since other callers still use it.

@robobun

robobun commented Mar 9, 2026

Copy link
Copy Markdown
Collaborator Author

Superseded by #27957 which targets main directly per @alii's request.

@robobun robobun closed this Mar 9, 2026
Comment on lines +684 to 696
// Combine checking pending_tasks and setting deinit_on_last_task
// under a single mutex hold to prevent a race where the last task
// completes between the lockless hasPendingTasks() check and the
// mutex acquisition, causing neither thread to proceed with cleanup.
{
this.mutex.lock();
defer this.mutex.unlock();
// deinit when all tasks are done
this.deinit_on_last_task = true;
return;
if (this.pending_tasks > 0) {
this.deinit_on_last_task = true;
return;
}
this.has_pending_tasks.store(false, .release);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 AB/BA deadlock: onError acquires this.mutex (line 311) then default_manager_mutex (line 326), while deferred deinit() acquires default_manager_mutex (line 671) then this.mutex (line 689). When deinit_on_last_task=true and main_watcher is still running, a concurrent onError and last-task completion will deadlock. Additionally, deinit() reads watcher_count and writes deinit_on_last_watcher (lines 678-680) under default_manager_mutex only, while unregisterWatcher accesses them under this.mutex (line 645, 662) — a data race with no mutual exclusion, which the PR fixed for pending_tasks but not for watcher_count.

Extended reasoning...

AB/BA Deadlock Between onError and Deferred deinit

The PR correctly fixes a self-deadlock in unrefPendingTask by deferring deinit() until after this.mutex is released. However, this introduces a lock ordering inversion with onError.

Lock ordering conflict:

  • onError (lines 311-326): acquires this.mutex first (line 311), then default_manager_mutex second (line 326)
  • deinit() (lines 671-689): acquires default_manager_mutex first (line 671), then this.mutex second (line 689)

This is a textbook AB/BA deadlock.

Step-by-step proof

  1. A prior deinit() call sees pending_tasks > 0 at line 691, sets deinit_on_last_task = true at line 692, and returns at line 693. Crucially, main_watcher is NOT deinited (line 698 is never reached), so it can still fire callbacks including onError.

  2. File Watcher thread: onError fires (main_watcher is still running), acquires this.mutex at line 311, then attempts to acquire default_manager_mutex at line 326 — blocks if held by another thread.

  3. Work pool thread: The last task completes. unrefPendingTask decrements pending_tasks to 0, sees deinit_on_last_task=true, sets should_deinit=true, releases this.mutex. The deferred deinit() fires, acquires default_manager_mutex at line 671, then attempts to acquire this.mutex at line 689 — blocks because the File Watcher thread holds it.

  4. Deadlock: File Watcher thread holds this.mutex, waits on default_manager_mutex. Work pool thread holds default_manager_mutex, waits on this.mutex. Neither can proceed.

In the old code, unrefPendingTask called deinit() while STILL HOLDING this.mutex, making the lock sequence this.mutex -> default_manager_mutex (same order as onError). The self-deadlock was avoided because hasPendingTasks() returned false inside the old deinit. The PR correctly fixes that self-deadlock but reverses the lock ordering.

Data Race on watcher_count / deinit_on_last_watcher

deinit() reads watcher_count and writes deinit_on_last_watcher at lines 678-680 under default_manager_mutex only. Meanwhile, unregisterWatcher writes watcher_count at line 645 and reads deinit_on_last_watcher at line 662 under this.mutex only. These are different locks providing no mutual exclusion.

Before this PR, deinit() from unrefPendingTask was called while this.mutex was held, providing implicit serialization. The PR defers deinit() to after mutex release, breaking that protection. On ARM (which Bun supports), the main thread in unregisterWatcher could miss the deinit_on_last_watcher=true write, causing a leaked PathWatcherManager. Note that the PR correctly fixed the analogous pending_tasks race by moving the check under this.mutex (lines 689-696), but the same fix was not applied to watcher_count.

Fix

For the deadlock: restructure onError to release this.mutex before acquiring default_manager_mutex, matching the deinit lock ordering (default_manager_mutex -> this.mutex).

For the data race: move the watcher_count > 0 check and deinit_on_last_watcher = true assignment inside this.mutex in deinit(), mirroring the fix already applied to pending_tasks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants