fix use-after-free when terminating a worker after it exited#28840
fix use-after-free when terminating a worker after it exited#28840robobun wants to merge 1 commit into
Conversation
|
Found 1 issue this PR may fix:
🤖 Generated with Claude Code |
WalkthroughAdds a lock protecting Changes
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
25742ff to
823037b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/web/workers/worker.test.ts`:
- Around line 414-428: The test uses the Web Worker API (Worker) while sitting
in the worker_threads describe block that uses wt.Worker; change the test to
instantiate wt.Worker instead of Worker (update the variable creation lines like
new Worker("does-not-exist-" + i) to new wt.Worker(...)) or move the entire test
into the web worker describe block to keep API consistency, and add a short
assertion/comment at the end of the test (e.g., a no-op expect or a comment) to
indicate the test completed without crashing so non-ASAN runs still signal
success.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 622d95c0-4715-4cfa-84b1-d4f391302e5e
📥 Commits
Reviewing files that changed from the base of the PR and between 25742ff6b0225e4ec40294b7c01e6b83dfa31165 and 823037ba47204e10cbcac1d185c038fde9a677c1.
📒 Files selected for processing (4)
src/bun.js/bindings/webcore/Worker.cppsrc/bun.js/bindings/webcore/Worker.hsrc/bun.js/web_worker.zigtest/js/web/workers/worker.test.ts
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🔴
src/bun.js/web_worker.zig:626-636— The PR introduces a UAF: inexitAndDeinit,WebWorker__clearImpl(cpp_worker)is called afterWebWorker__dispatchExit(), butWebWorker__dispatchExitcallsworker->deref()which can drop the C++ Worker's refcount to zero and destroy the object ifscriptExecutionContext()is null (skipping theRef{*this}capture indispatchExit). The fix is to moveWebWorker__clearImpl(cpp_worker)to beforeWebWorker__dispatchExit()so the Worker is still alive when itsimpl_is cleared.Extended reasoning...
What the bug is and how it manifests
In
exitAndDeinit(web_worker.zig:626–636), the new code callsWebWorker__dispatchExit(globalObject, cpp_worker, exit_code)first, then later callsWebWorker__clearImpl(cpp_worker). TheWebWorker__dispatchExitfunction (Worker.cpp) callsworker->dispatchExit(exitCode)followed byworker->deref(), releasing the Zig reference to the C++Worker. ThedispatchExitmethod only captures aRef{*this}guard (which would keep the object alive pastderef()) whenscriptExecutionContext()is non-null. When it is null, the function returns early with no ref captured.The specific code path that triggers it
The scenario is a worker spawned by another worker (worker-in-worker). When the parent worker exits, its
ScriptExecutionContextis destroyed, makingscriptExecutionContext()return null for any child workers it spawned. BecauseisContextStopped()returns true (\!scriptExecutionContext()), the JSWorker JS wrapper in JSWorker.cpp line 736 is no longer kept alive byhasPendingActivity(), so the GC can and does collect it. JSWorker's destructor callsderef(), dropping the C++ Worker's refcount from 2 to 1. Now only the Zig-held reference remains.Why existing code doesn't prevent it
When the child worker thread then calls
exitAndDeinit→WebWorker__dispatchExit:worker->dispatchExit()sees null ctx, returns immediately, noRef{*this}is captured. Thenworker->deref()drops the refcount from 1 to 0 — the C++ Worker object is freed by its destructor. Control returns toexitAndDeinit, which then callsWebWorker__clearImpl(cpp_worker). This callsworker->clearImpl(), which acquiresm_implLockand writes toimpl_— both of which are in already-freed memory. This is a use-after-free.Addressing the refutation
One verifier argued this only happens during process shutdown, when the main ScriptExecutionContext is alive for the entire test run. This is true for the top-level parent context, but the bug specifically manifests with worker-in-worker: when a worker (not the main thread) creates a child worker and then exits, the child's
scriptExecutionContext()— which points to the parent worker's context — becomes null when the parent's context tears down. This is a normal, realistic scenario (e.g., a pool of nested workers where the parent exits), not just during process shutdown.Step-by-step proof
- Main thread creates Worker A. Worker A creates Worker B (worker-in-worker).
- Worker A exits normally; its
ScriptExecutionContextis destroyed. Worker B'sscriptExecutionContext()now returns null. - Worker B's JSWorker wrapper in Worker A's VM heap is no longer kept reachable (isContextStopped() == true), GC collects it. JSWorker destructor calls
worker->deref()→ Worker B's C++ refcount: 2 → 1. - Worker B finishes and calls
exitAndDeinit→WebWorker__dispatchExit(globalObject, cpp_worker, exit_code). - Inside:
worker->dispatchExit(exitCode)sees null ctx → early return, noRef{*this}captured. Thenworker->deref()→ refcount: 1 → 0 → Worker B's C++ object is destroyed. - Back in
exitAndDeinit:WebWorker__clearImpl(cpp_worker)is called on the now-freed Worker B → UAF.
How to fix it
Move
WebWorker__clearImpl(cpp_worker)to beforeWebWorker__dispatchExit(globalObject, cpp_worker, exit_code)inexitAndDeinit. This ensuresimpl_is cleared under the lock while the Worker is still guaranteed alive (the Zig ref has not yet been released viaderef()). The main thread's lock-protected checkif (impl_)interminate(),setKeepAlive(), etc. will then correctly see a null pointer after clearImpl runs.
823037b to
35f086f
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/web/workers/worker.test.ts`:
- Around line 414-428: The test uses a fixed Bun.sleep(20) to wait for the
worker to fail which is flaky; replace the timed wait with an explicit wait for
the worker failure event (use the same pattern as elsewhere: await once(w,
"error") or await once(worker, "error")) so the main thread only calls
w.terminate() after the worker has emitted its error and run exitAndDeinit;
remove the Bun.sleep call and keep the two w.terminate() calls to exercise the
freed-impl_ path (refer to the test function and the Worker instance variable
w).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 70638404-00e6-44a8-b3de-5599306ab70f
📥 Commits
Reviewing files that changed from the base of the PR and between 823037ba47204e10cbcac1d185c038fde9a677c1 and 35f086f7560704bc7470e332adc276141eda0f11.
📒 Files selected for processing (4)
src/bun.js/bindings/webcore/Worker.cppsrc/bun.js/bindings/webcore/Worker.hsrc/bun.js/web_worker.zigtest/js/web/workers/worker.test.ts
Worker::terminate called WebWorker__notifyNeedTermination(impl_) without checking whether the worker thread had already freed its WebWorker. When the worker thread's exitAndDeinit ran before the main thread observed the close event, terminate() dereferenced freed memory. setKeepAlive and updatePtr had the same pattern. Guard impl_ with a mutex and clear it from the worker thread right before freeing, so callers from the main thread see nullptr and skip the call.
35f086f to
8eaae51
Compare
There was a problem hiding this comment.
This is a focused, correct fix for an ASAN-detected use-after-free, but it touches safety-critical threading code in the Worker implementation and the CI shows build failures on three musl targets — those warrant a human look before merge.
Extended reasoning...
Overview
The PR adds Lock m_implLock to the C++ Worker class protecting the impl_ (Zig WebWorker*) pointer. It adds a clearImpl() C++ method and a WebWorker__clearImpl C export, called from exitAndDeinit() in Zig before deinit() frees the object. All main-thread callers (terminate, setKeepAlive, updatePtr) now snapshot or null-check impl_ while holding the lock. The fix is accompanied by a regression test that repeatedly creates workers pointing at nonexistent files and calls terminate() twice.
Security risks
The bug being fixed is itself a memory-safety vulnerability (use-after-free), so the fix improves the security posture. No new security risks are introduced: the lock is local to the Worker object, there is no privilege escalation path, and the null checks are conservative (skipping the call rather than crashing).
Level of scrutiny
This is production-critical threading code. The fix is conceptually simple (null-check under a lock), but the correctness depends on the ordering of WebWorker__clearImpl vs WebWorker__dispatchExit in exitAndDeinit, and on all paths that dereference impl_ being covered. The CI report for this PR's commit shows three build failures on musl targets (aarch64-musl, x64-musl, x64-musl-baseline) listed as scripts/build/ci.ts failures — not mere clang warnings — which need to be confirmed as pre-existing before merge.
Other factors
The regression test is placed inside the worker_threads describe block but uses the Web Worker API directly; coderabbitai flagged this as a trivial inconsistency. Functionally the test exercises the same C++ code path, so it correctly validates the fix. The musl build failures are the main blocker for automated approval.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/bun.js/bindings/webcore/Worker.cpp (1)
147-158:⚠️ Potential issue | 🔴 Critical
updatePtr()still leavesimpl_dangling.Line 151 drops
m_implLockbefore Line 154 calls into Zig, so the worker thread can runclearImpl()+deinit()in that gap and makeimplstale. Separately, ifWebWorker__updatePtr()returnsfalse, Lines 76-78 insrc/bun.js/web_worker.zighave already freed the ZigWebWorker, but this function never nullsimpl_, so later callers can still see a freed pointer.🔧 Proposed fix
bool Worker::updatePtr() { - void* impl; - { - Locker lock { m_implLock }; - impl = impl_; - } - if (!impl || !WebWorker__updatePtr(impl, this)) { + Locker lock { m_implLock }; + if (!impl_ || !WebWorker__updatePtr(impl_, this)) { + impl_ = nullptr; m_onlineClosingFlags = ClosingFlag; m_terminationFlags.fetch_or(TerminatedFlag); return false; } return true; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bun.js/bindings/webcore/Worker.cpp` around lines 147 - 158, Worker::updatePtr currently releases m_implLock before calling WebWorker__updatePtr, allowing the worker to call clearImpl()/deinit() and make impl stale; also when WebWorker__updatePtr returns false the Zig WebWorker is freed but impl_ is not nulled. Fix by holding m_implLock across the WebWorker__updatePtr call (do not copy impl out and unlock early), and if WebWorker__updatePtr returns false, set impl_ = nullptr while still holding the lock and then set m_onlineClosingFlags = ClosingFlag and m_terminationFlags.fetch_or(TerminatedFlag) as currently done; ensure references to impl_ are protected by m_implLock and mention Worker::updatePtr, impl_, m_implLock, WebWorker__updatePtr, clearImpl(), deinit(), m_onlineClosingFlags, and m_terminationFlags in the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/bun.js/bindings/webcore/Worker.cpp`:
- Around line 147-158: Worker::updatePtr currently releases m_implLock before
calling WebWorker__updatePtr, allowing the worker to call clearImpl()/deinit()
and make impl stale; also when WebWorker__updatePtr returns false the Zig
WebWorker is freed but impl_ is not nulled. Fix by holding m_implLock across the
WebWorker__updatePtr call (do not copy impl out and unlock early), and if
WebWorker__updatePtr returns false, set impl_ = nullptr while still holding the
lock and then set m_onlineClosingFlags = ClosingFlag and
m_terminationFlags.fetch_or(TerminatedFlag) as currently done; ensure references
to impl_ are protected by m_implLock and mention Worker::updatePtr, impl_,
m_implLock, WebWorker__updatePtr, clearImpl(), deinit(), m_onlineClosingFlags,
and m_terminationFlags in the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 70e506c1-bce2-42a4-a93d-ebf2176027ef
📥 Commits
Reviewing files that changed from the base of the PR and between 35f086f7560704bc7470e332adc276141eda0f11 and 8eaae51.
📒 Files selected for processing (4)
src/bun.js/bindings/webcore/Worker.cppsrc/bun.js/bindings/webcore/Worker.hsrc/bun.js/web_worker.zigtest/js/web/workers/worker.test.ts
| return Exception { TypeError, errorMessage.toWTFString(BunString::ZeroCopy) }; | ||
| } | ||
|
|
||
| worker->impl_ = impl; | ||
| { | ||
| Locker lock { worker->m_implLock }; | ||
| worker->impl_ = impl; | ||
| } | ||
| worker->m_workerCreationTime = MonotonicTime::now(); | ||
|
|
||
| return worker; |
There was a problem hiding this comment.
🟣 Pre-existing ref leak in Worker::create(): worker->ref() is called unconditionally before the \!impl null check, so when WebWorker__create() returns null (e.g., a preload module path fails to resolve), the local Ref<Worker> destructs and drops the refcount from 2→1 with no owner for the remaining reference—the Zig WebWorker was never created, so the balancing worker->deref() inside WebWorker__dispatchExit is never called, permanently leaking the C++ Worker object. Fix: move worker->ref() to after the null check, or call worker->deref() before returning the Exception.
Extended reasoning...
What the bug is. In Worker::create() (Worker.cpp), worker->ref() is called immediately after WebWorker__create() with the comment "now referenced by Zig". This call is unconditional—it happens regardless of whether WebWorker__create() actually succeeded. Three lines later there is a null check on impl, and if null, the function returns an Exception. At that point the local Ref<Worker> worker falls out of scope, calling deref(), which drops the refcount from 2 to 1. No other entity holds the remaining reference, so the Worker object is permanently leaked.
The code path that triggers it. WebWorker__create() returns null whenever the Zig create() function returns early with an error—most concretely when one of the preload module paths fails resolveEntryPointSpecifier (web_worker.zig ~line 233). In that case the C++ code executes: adoptRef (refcount = 1) → WebWorker__create (returns null, no Zig WebWorker struct is ever created) → worker->ref() (refcount = 2) → return Exception{} (local Ref destructs → deref() → refcount = 1). The caller receives an Exception, not a Ref<Worker>, so it never holds the remaining reference.
Why existing code doesn't prevent it. The only place that calls the balancing worker->deref() is WebWorker__dispatchExit at line ~495, which is reached only when a Zig WebWorker thread actually runs to completion and calls exitAndDeinit(). Since WebWorker__create() returned null, no WebWorker struct exists and no thread was spawned, so exitAndDeinit() is never called and the deref never happens. The allWorkers() hashmap stores a raw pointer to the Worker, so the leaked object also stays in that map forever—its destructor, which removes the entry, is never called.
Impact. Every failed new Worker(path) call where preload module resolution fails (or any other early-exit in WebWorker__create) leaks one C++ Worker object indefinitely. The leak accumulates across the process lifetime. Additionally the orphaned entry in allWorkers() means Worker::forEachWorker will attempt to post tasks to a dead ScriptExecutionContextIdentifier that is never cleaned up.
How to fix it. Option A (minimal): add worker->deref(); immediately before return Exception { ... }; on the null-check path. Option B (cleaner): move worker->ref() to after the null check—only call it when we know Zig actually took ownership of the pointer. Option B is preferred because it keeps the invariant that ref() is called only when Zig holds the reference.
Step-by-step proof.
auto worker = adoptRef(*new Worker(context, WTF::move(options)));— refcount = 1.void* impl = WebWorker__create(worker.ptr(), ...);— Zig'screate()hits a preload resolution failure, setserror_message, returnsnull. NoWebWorkerstruct is allocated;cpp_workerpointer is never stored anywhere.worker->ref();— refcount = 2. Comment says "now referenced by Zig", but Zig never took the reference.if (\!impl) { return Exception { TypeError, ... }; }— theExceptionOr<Ref<Worker>>return type means the localRef<Worker> workerdestructs here, callingderef()→ refcount = 1.- Caller receives
Exception, not aRef<Worker>. Nobody holds refcount = 1. WebWorker__dispatchExit(which contains the only otherworker->deref()) is never called because there is no Zig worker thread to callexitAndDeinit().- The
Workerobject leaks with refcount = 1; its destructor never runs; its entry inallWorkers()is never removed.
|
CI failures on Windows are pre-existing and unrelated to this PR:
My |
Found by Fuzzilli with ASAN:
Address:use-after-poisoninsideWebWorker__notifyNeedTermination.Repro
What happens
Worker::terminate()in C++ callsWebWorker__notifyNeedTermination(impl_)unconditionally. When the worker thread'sexitAndDeinithas already run,impl_points to freed memory, so the call is a use-after-free.setKeepAliveandupdatePtrreadimpl_the same way.The worker thread's
exitAndDeinitposts a close task to the main thread, then callsthis.deinit()which frees theWebWorker. The close task — which is what marks the Worker terminated on the main thread — runs later, so there is a window where the main thread still sees the worker as alive whileimpl_is already freed.Fix
Protect
impl_with a lock inWorker. The worker thread calls a newWebWorker__clearImplbeforedeinit()to null it out under the lock. Callers on the main thread (terminate,setKeepAlive,updatePtr) check for nullptr while holding the lock before dereferencing.Stack (before fix)