Fix PathWatcherManager deadlock and UAF in deferred deinit#27954
Fix PathWatcherManager deadlock and UAF in deferred deinit#27954robobun wants to merge 9 commits into
Conversation
Three issues in the deferred-deinit pattern used by
PathWatcherManager, all triggered by rapid fs.watch() churn.
## unregisterWatcher self-deadlock (primary)
The deinit-on-last-watcher defer was registered AFTER the
mutex.lock()/defer mutex.unlock() pair:
this.mutex.lock();
defer this.mutex.unlock(); // fires last
var watchers = this.watchers.slice();
defer { // fires BEFORE unlock
if (this.deinit_on_last_watcher and this.watcher_count == 0) {
this.deinit(); // called holding this.mutex
}
}
Zig defers fire LIFO, so deinit() ran while still holding
this.mutex. deinit() then re-acquires this.mutex when
hasPendingTasks() is true (a DirectoryRegisterTask running on
the workpool). os_unfair_lock is non-recursive -> the thread
blocks in __ulock_wait2 forever.
Trigger: onError() called (sets deinit_on_last_watcher), then
last watcher closed while a workpool task is mid-flight.
Also UAF: if deinit() completes it calls destroy(this), then
defer this.mutex.unlock() fires on freed memory.
Fix: hoist to a should_deinit flag checked by a defer
registered before the lock. LIFO ordering now yields
unlock -> deinit.
## unrefPendingTask UAF
Same shape: this.deinit() called while holding this.mutex. In
this path has_pending_tasks is stored false just before, so
deinit() doesn't re-lock -- but if it completes, destroy(this)
means the defer this.mutex.unlock() is a UAF. Same fix pattern.
## processWatcher AB/BA lock inversion
Worker thread holds watcher.mutex (processWatcher) -> calls
manager._decrementPathRef() on the OOM error path, which
acquires manager.mutex. Main thread unregisterWatcher holds
manager.mutex -> wants watcher.mutex. Classic AB/BA. Only on
allocation failure during file_paths.append, but real under
memory pressure.
Fix: capture append result, unlock watcher.mutex BEFORE
calling _decrementPathRef.
## Test plan
- zig fmt --check + zig ast-check (Bun's zig 0.15.2)
- bun run zig:check (full semantic analysis, 5/5)
- Full compile to bun-zig.o (asan + noasan), symbols verified
- No JS regression test: the trigger (onError() firing) requires
platform-level watch loop failure which isn't reproducible
via rmSync of watched paths (inotify/kqueue return deletion
events, not errors). Stress tests pass on unpatched Linux.
:house: Remote-Dev: homespace
This reverts commit 5c57817.
Merge with main pulled in #27838 (SSLConfig SharedPtr refactor) which changed tls_props from a raw pointer to ?SSLConfig.SharedPtr. The getTlsHostname function (from #27891) needs .get() to access the inner struct. https://claude.ai/code/session_013XzvW9VsevSPURzRLixE7G
|
Caution Review failedPull request was closed or merged during review WalkthroughDeferred deinitialization and mutex-unlock ordering were added to PathWatcher to prevent unlocking freed mutexes and deadlocks; deinit calls are deferred until after relevant mutexes are released. HTTP TLS hostname lookup now reads Changes
🚥 Pre-merge checks | ✅ 1 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
Race fix landed in 147e058. All review threads resolved. Full diff reviewed — 5 fixes all look correct (3 deadlocks, 1 FD leak, 1 SharedPtr accessor). Waiting on CI (23 checks still running)... |
PathWatcher.deinit() previously called setClosed() (which locks/unlocks the mutex) followed by hasPendingDirectories() (lockless atomic load). Between these two operations, a worker thread in unrefPendingDirectory() could acquire the mutex, see closed=true and pending_directories==0, and set should_deinit=true — causing both threads to call destroy(). Fix by combining the closed flag store and pending_directories check under a single mutex hold in deinit(). Since unrefPendingDirectory() already checks isClosed() inside its own mutex hold, the two threads can no longer both decide to proceed with cleanup. Apply the same pattern to PathWatcherManager.deinit() where hasPendingTasks() was checked outside the mutex after the watcher_count gate, allowing the last task to complete unobserved.
|
Fixed the double-deinit race in PathWatcher.deinit(): Previously called PathWatcherManager.deinit(): Same class of bug —
|
| // Combine checking pending_tasks and setting deinit_on_last_task | ||
| // under a single mutex hold to prevent a race where the last task | ||
| // completes between the lockless hasPendingTasks() check and the | ||
| // mutex acquisition, causing neither thread to proceed with cleanup. | ||
| { | ||
| this.mutex.lock(); | ||
| defer this.mutex.unlock(); | ||
| // deinit when all tasks are done | ||
| this.deinit_on_last_task = true; | ||
| return; | ||
| if (this.pending_tasks > 0) { | ||
| this.deinit_on_last_task = true; | ||
| return; | ||
| } | ||
| this.has_pending_tasks.store(false, .release); | ||
| } |
There was a problem hiding this comment.
🔴 AB/BA deadlock: onError acquires this.mutex (line 311) then default_manager_mutex (line 326), while deferred deinit() acquires default_manager_mutex (line 671) then this.mutex (line 689). When deinit_on_last_task=true and main_watcher is still running, a concurrent onError and last-task completion will deadlock. Additionally, deinit() reads watcher_count and writes deinit_on_last_watcher (lines 678-680) under default_manager_mutex only, while unregisterWatcher accesses them under this.mutex (line 645, 662) — a data race with no mutual exclusion, which the PR fixed for pending_tasks but not for watcher_count.
Extended reasoning...
AB/BA Deadlock Between onError and Deferred deinit
The PR correctly fixes a self-deadlock in unrefPendingTask by deferring deinit() until after this.mutex is released. However, this introduces a lock ordering inversion with onError.
Lock ordering conflict:
onError(lines 311-326): acquiresthis.mutexfirst (line 311), thendefault_manager_mutexsecond (line 326)deinit()(lines 671-689): acquiresdefault_manager_mutexfirst (line 671), thenthis.mutexsecond (line 689)
This is a textbook AB/BA deadlock.
Step-by-step proof
-
A prior
deinit()call seespending_tasks > 0at line 691, setsdeinit_on_last_task = trueat line 692, and returns at line 693. Crucially,main_watcheris NOT deinited (line 698 is never reached), so it can still fire callbacks includingonError. -
File Watcher thread:
onErrorfires (main_watcher is still running), acquiresthis.mutexat line 311, then attempts to acquiredefault_manager_mutexat line 326 — blocks if held by another thread. -
Work pool thread: The last task completes.
unrefPendingTaskdecrementspending_tasksto 0, seesdeinit_on_last_task=true, setsshould_deinit=true, releasesthis.mutex. The deferreddeinit()fires, acquiresdefault_manager_mutexat line 671, then attempts to acquirethis.mutexat line 689 — blocks because the File Watcher thread holds it. -
Deadlock: File Watcher thread holds
this.mutex, waits ondefault_manager_mutex. Work pool thread holdsdefault_manager_mutex, waits onthis.mutex. Neither can proceed.
In the old code, unrefPendingTask called deinit() while STILL HOLDING this.mutex, making the lock sequence this.mutex -> default_manager_mutex (same order as onError). The self-deadlock was avoided because hasPendingTasks() returned false inside the old deinit. The PR correctly fixes that self-deadlock but reverses the lock ordering.
Data Race on watcher_count / deinit_on_last_watcher
deinit() reads watcher_count and writes deinit_on_last_watcher at lines 678-680 under default_manager_mutex only. Meanwhile, unregisterWatcher writes watcher_count at line 645 and reads deinit_on_last_watcher at line 662 under this.mutex only. These are different locks providing no mutual exclusion.
Before this PR, deinit() from unrefPendingTask was called while this.mutex was held, providing implicit serialization. The PR defers deinit() to after mutex release, breaking that protection. On ARM (which Bun supports), the main thread in unregisterWatcher could miss the deinit_on_last_watcher=true write, causing a leaked PathWatcherManager. Note that the PR correctly fixed the analogous pending_tasks race by moving the check under this.mutex (lines 689-696), but the same fix was not applied to watcher_count.
Fix
For the deadlock: restructure onError to release this.mutex before acquiring default_manager_mutex, matching the deinit lock ordering (default_manager_mutex -> this.mutex).
For the data race: move the watcher_count > 0 check and deinit_on_last_watcher = true assignment inside this.mutex in deinit(), mirroring the fix already applied to pending_tasks.
Summary
tls_propschanged from raw pointer to?SSLConfig.SharedPtrin fix: SSLConfig intern/deref race causing segfault in proxy tunnel setup #27838, sogetTlsHostname(from fix(http): customlookupfunction breaking TLS certificate verification #27891) needs.get()to access the inner structOne-line change
Resolves all 13 failing
build-zigCI checks on #27469.https://claude.ai/code/session_013XzvW9VsevSPURzRLixE7G