fix use-after-free when terminating a worker after it exited #28840

Open
robobun wants to merge 1 commit into main from farm/63a708f6/fix-worker-terminate-uaf

@robobun robobun commented Apr 4, 2026

Collaborator

Found by Fuzzilli with ASAN: Address:use-after-poison inside WebWorker__notifyNeedTermination.

Repro

for (let i = 0; i < 100; i++) {
  const w = new Worker("does-not-exist-" + i);
  w.onerror = () => {};
  w.terminate();
  w.terminate();
}

What happens

Worker::terminate() in C++ calls WebWorker__notifyNeedTermination(impl_) unconditionally. When the worker thread's exitAndDeinit has already run, impl_ points to freed memory, so the call is a use-after-free. setKeepAlive and updatePtr read impl_ the same way.

The worker thread's exitAndDeinit posts a close task to the main thread, then calls this.deinit() which frees the WebWorker. The close task — which is what marks the Worker terminated on the main thread — runs later, so there is a window where the main thread still sees the worker as alive while impl_ is already freed.

Fix

Protect impl_ with a lock in Worker. The worker thread calls a new WebWorker__clearImpl before deinit() to null it out under the lock. Callers on the main thread (terminate, setKeepAlive, updatePtr) check for nullptr while holding the lock before dereferencing.
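
The locking scheme described above can be sketched in standard C++. This is a simplified model, not the code in this PR: WorkerModel, FakeImpl, and forwardedAfterExit are hypothetical names, std::mutex stands in for the Lock used in Worker, and incrementing a counter stands in for the call to WebWorker__notifyNeedTermination.

```cpp
#include <cassert>
#include <mutex>

// Stand-in for the Zig-side WebWorker object (hypothetical, for illustration only).
struct FakeImpl {
    int terminationRequests = 0;
};

// Simplified model of the C++ Worker: impl_ is only read or written under m_implLock.
class WorkerModel {
public:
    explicit WorkerModel(FakeImpl* impl) : impl_(impl) {}

    // Called from the worker thread right before the impl is freed.
    void clearImpl()
    {
        std::lock_guard<std::mutex> lock(m_implLock);
        impl_ = nullptr;
    }

    // Called from the owning thread. Returns true if the request was forwarded.
    bool terminate()
    {
        std::lock_guard<std::mutex> lock(m_implLock);
        if (!impl_)
            return false; // worker already exited; nothing left to notify
        ++impl_->terminationRequests; // stands in for WebWorker__notifyNeedTermination
        return true;
    }

private:
    std::mutex m_implLock;
    FakeImpl* impl_;
};

// Replays the repro's sequence on one thread: the worker exits, then
// terminate() is called twice. Returns how many calls reached the impl.
int forwardedAfterExit()
{
    auto* impl = new FakeImpl;
    WorkerModel worker(impl);

    assert(worker.terminate()); // worker still alive: the request is forwarded
    worker.clearImpl();         // worker thread: null the pointer under the lock...
    delete impl;                // ...and only then free the object

    int forwarded = 0;
    forwarded += worker.terminate() ? 1 : 0; // sees nullptr and skips the call
    forwarded += worker.terminate() ? 1 : 0;
    return forwarded;
}
```

Because terminate() keeps the lock held while it touches the impl, and clearImpl() must take the same lock before the free, the free cannot land between the null check and the dereference in this model.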

Stack (before fix)

atomic.Value(bun.js.web_worker.Status).load
WebWorker__notifyNeedTermination        src/bun.js/web_worker.zig:571
WebCore::Worker::terminate()            src/bun.js/bindings/webcore/Worker.cpp:266
jsWorkerPrototypeFunction_terminate     src/bun.js/bindings/webcore/JSWorker.cpp:554

@github-actions github-actions Bot added the claude label Apr 4, 2026
robobun commented Apr 4, 2026

Collaborator Author
Updated 11:57 PM PT - Apr 3rd, 2026

@robobun, your commit 8eaae51 has 6 failures in Build #43577 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 28840

That installs a local version of the PR into your bun-28840 executable, so you can run:

bun-28840 --bun

github-actions Bot commented Apr 4, 2026

Contributor

Found 1 issue this PR may fix:

  1. Segfault / RangeError: invalid time zone when workers using Intl are repeatedly spawned and terminated #28415 - Segfault when workers using Intl are repeatedly spawned and terminated, matching the exact spawn-terminate race condition this PR's m_implLock mutex fixes

If this is helpful, consider adding Fixes #28415 to the PR description to auto-close the issue on merge.

🤖 Generated with Claude Code

coderabbitai Bot commented Apr 4, 2026

Contributor

Walkthrough

Adds a lock protecting Worker::impl_, moves reads/writes of impl_ under that lock, adds Worker::clearImpl and exported WebWorker__clearImpl, updates setKeepAlive, updatePtr, terminate, and create to use the lock, calls the clear wrapper from Zig during exitAndDeinit, and adds a termination-on-failed-startup test.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Worker core synchronization: src/bun.js/bindings/webcore/Worker.h, src/bun.js/bindings/webcore/Worker.cpp | Added Lock m_implLock and void clearImpl(); serialize accesses to impl_ under m_implLock; setKeepAlive, updatePtr, terminate, and create now snapshot/check impl_ while holding the lock; added exported extern "C" void WebWorker__clearImpl(Worker*). |
| Zig integration: src/bun.js/web_worker.zig | Declared extern fn WebWorker__clearImpl(cpp_worker: *anyopaque) void; and call it from exitAndDeinit before dispatching exit/deinit so C++ impl_ is cleared prior to worker teardown. |
| Regression test: test/js/web/workers/worker.test.ts | Added test that repeatedly creates workers with an invalid entry, awaits the "error" event, and calls terminate() twice to exercise shutdown/cleanup races on failed startup. |
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main fix: resolving a use-after-free bug when terminating a worker after it has exited, which directly matches the changeset's purpose. |
| Description check | ✅ Passed | The description includes both required sections: a detailed explanation of what the PR does (root cause, fix approach, and implementation details) and verification method (reproducer and stack trace), comprehensively addressing the template. |

@robobun robobun force-pushed the farm/63a708f6/fix-worker-terminate-uaf branch from 25742ff to 823037b Compare April 4, 2026 04:39

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/web/workers/worker.test.ts`:
- Around line 414-428: The test uses the Web Worker API (Worker) while sitting
in the worker_threads describe block that uses wt.Worker; change the test to
instantiate wt.Worker instead of Worker (update the variable creation lines like
new Worker("does-not-exist-" + i) to new wt.Worker(...)) or move the entire test
into the web worker describe block to keep API consistency, and add a short
assertion/comment at the end of the test (e.g., a no-op expect or a comment) to
indicate the test completed without crashing so non-ASAN runs still signal
success.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 622d95c0-4715-4cfa-84b1-d4f391302e5e

📥 Commits

Reviewing files that changed from the base of the PR and between 25742ff6b0225e4ec40294b7c01e6b83dfa31165 and 823037ba47204e10cbcac1d185c038fde9a677c1.

📒 Files selected for processing (4)
  • src/bun.js/bindings/webcore/Worker.cpp
  • src/bun.js/bindings/webcore/Worker.h
  • src/bun.js/web_worker.zig
  • test/js/web/workers/worker.test.ts

@claude claude Bot left a comment

Contributor

Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 src/bun.js/web_worker.zig:626-636 — The PR introduces a UAF: in exitAndDeinit, WebWorker__clearImpl(cpp_worker) is called after WebWorker__dispatchExit(), but WebWorker__dispatchExit calls worker->deref() which can drop the C++ Worker's refcount to zero and destroy the object if scriptExecutionContext() is null (skipping the Ref{*this} capture in dispatchExit). The fix is to move WebWorker__clearImpl(cpp_worker) to before WebWorker__dispatchExit() so the Worker is still alive when its impl_ is cleared.

    Extended reasoning...

    What the bug is and how it manifests

    In exitAndDeinit (web_worker.zig:626–636), the new code calls WebWorker__dispatchExit(globalObject, cpp_worker, exit_code) first, then later calls WebWorker__clearImpl(cpp_worker). The WebWorker__dispatchExit function (Worker.cpp) calls worker->dispatchExit(exitCode) followed by worker->deref(), releasing the Zig reference to the C++ Worker. The dispatchExit method only captures a Ref{*this} guard (which would keep the object alive past deref()) when scriptExecutionContext() is non-null. When it is null, the function returns early with no ref captured.

    The specific code path that triggers it

    The scenario is a worker spawned by another worker (worker-in-worker). When the parent worker exits, its ScriptExecutionContext is destroyed, making scriptExecutionContext() return null for any child workers it spawned. Because isContextStopped() returns true (!scriptExecutionContext()), the JSWorker JS wrapper in JSWorker.cpp line 736 is no longer kept alive by hasPendingActivity(), so the GC can and does collect it. JSWorker's destructor calls deref(), dropping the C++ Worker's refcount from 2 to 1. Now only the Zig-held reference remains.

    Why existing code doesn't prevent it

    When the child worker thread then calls exitAndDeinit → WebWorker__dispatchExit: worker->dispatchExit() sees a null context and returns immediately, so no Ref{*this} is captured. Then worker->deref() drops the refcount from 1 to 0 and the C++ Worker object is destroyed. Control returns to exitAndDeinit, which then calls WebWorker__clearImpl(cpp_worker). This calls worker->clearImpl(), which acquires m_implLock and writes to impl_, both of which now live in freed memory. This is a use-after-free.

    Addressing the refutation

    One verifier argued that this can only happen during process shutdown, because the main ScriptExecutionContext stays alive for the entire test run. That holds for the top-level parent context, but the bug specifically manifests with worker-in-worker: when a worker (not the main thread) creates a child worker and then exits, the child's scriptExecutionContext(), which points to the parent worker's context, becomes null when the parent's context tears down. This is a normal, realistic scenario (e.g., a pool of nested workers where the parent exits), not something confined to process shutdown.

    Step-by-step proof

    1. Main thread creates Worker A. Worker A creates Worker B (worker-in-worker).
    2. Worker A exits normally; its ScriptExecutionContext is destroyed. Worker B's scriptExecutionContext() now returns null.
    3. Worker B's JSWorker wrapper in Worker A's VM heap is no longer kept reachable (isContextStopped() == true), GC collects it. JSWorker destructor calls worker->deref() → Worker B's C++ refcount: 2 → 1.
    4. Worker B finishes and calls exitAndDeinit → WebWorker__dispatchExit(globalObject, cpp_worker, exit_code).
    5. Inside: worker->dispatchExit(exitCode) sees null ctx → early return, no Ref{*this} captured. Then worker->deref() → refcount: 1 → 0 → Worker B's C++ object is destroyed.
    6. Back in exitAndDeinit: WebWorker__clearImpl(cpp_worker) is called on the now-freed Worker B → UAF.

    How to fix it

    Move WebWorker__clearImpl(cpp_worker) to before WebWorker__dispatchExit(globalObject, cpp_worker, exit_code) in exitAndDeinit. This ensures impl_ is cleared under the lock while the Worker is still guaranteed alive (the Zig ref has not yet been released via deref()). The main thread's lock-protected check if (impl_) in terminate(), setKeepAlive(), etc. will then correctly see a null pointer after clearImpl runs.
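
    The ordering argument above can be modelled in a few lines of standard C++. This is a sketch under stated assumptions, not the real teardown path: RefCountedWorker, dispatchExitModel, and exitAndDeinitModel are hypothetical names, the refcount is a plain int rather than an atomic, and dispatchExitModel only covers the case where scriptExecutionContext() is null and no protecting Ref is taken.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Minimal intrusive-refcounted stand-in for the C++ Worker (hypothetical model).
struct RefCountedWorker {
    std::mutex implLock;
    void* impl;
    int refCount;
    std::vector<std::string>* log; // outlives the worker so the order can be inspected

    RefCountedWorker(void* i, std::vector<std::string>* l)
        : impl(i), refCount(1), log(l) {}
    ~RefCountedWorker() { log->push_back("worker destroyed"); }

    void clearImpl()
    {
        std::lock_guard<std::mutex> lock(implLock);
        impl = nullptr;
        log->push_back("impl cleared");
    }

    void deref()
    {
        assert(refCount > 0);
        if (--refCount == 0)
            delete this;
    }
};

// Models WebWorker__dispatchExit when scriptExecutionContext() is null:
// no protecting Ref is taken, so deref() may destroy the worker.
void dispatchExitModel(RefCountedWorker* worker)
{
    worker->deref();
}

// Safe ordering: clear impl while the last (Zig-held) reference still keeps
// the worker alive, then release that reference.
std::vector<std::string> exitAndDeinitModel()
{
    std::vector<std::string> log;
    int fakeImpl = 0;
    auto* worker = new RefCountedWorker(&fakeImpl, &log); // refCount == 1
    worker->clearImpl();       // must come first
    dispatchExitModel(worker); // may free the worker; it is not touched afterwards
    return log;
}
```

    In this model the log always reads "impl cleared" followed by "worker destroyed". Swapping the two calls in exitAndDeinitModel would make clearImpl() lock a mutex inside an already deleted object, which is the use-after-free described in the steps above.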

@robobun robobun force-pushed the farm/63a708f6/fix-worker-terminate-uaf branch from 823037b to 35f086f Compare April 4, 2026 04:46

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/web/workers/worker.test.ts`:
- Around line 414-428: The test uses a fixed Bun.sleep(20) to wait for the
worker to fail which is flaky; replace the timed wait with an explicit wait for
the worker failure event (use the same pattern as elsewhere: await once(w,
"error") or await once(worker, "error")) so the main thread only calls
w.terminate() after the worker has emitted its error and run exitAndDeinit;
remove the Bun.sleep call and keep the two w.terminate() calls to exercise the
freed-impl_ path (refer to the test function and the Worker instance variable
w).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 70638404-00e6-44a8-b3de-5599306ab70f

📥 Commits

Reviewing files that changed from the base of the PR and between 823037ba47204e10cbcac1d185c038fde9a677c1 and 35f086f7560704bc7470e332adc276141eda0f11.

Comment thread test/js/web/workers/worker.test.ts
Worker::terminate called WebWorker__notifyNeedTermination(impl_) without
checking whether the worker thread had already freed its WebWorker. When
the worker thread's exitAndDeinit ran before the main thread observed
the close event, terminate() dereferenced freed memory. setKeepAlive
and updatePtr had the same pattern.

Guard impl_ with a mutex and clear it from the worker thread right
before freeing, so callers from the main thread see nullptr and skip
the call.
@robobun robobun force-pushed the farm/63a708f6/fix-worker-terminate-uaf branch from 35f086f to 8eaae51 Compare April 4, 2026 04:59

@claude claude Bot left a comment

Contributor

This is a focused, correct fix for an ASAN-detected use-after-free, but it touches safety-critical threading code in the Worker implementation and the CI shows build failures on three musl targets — those warrant a human look before merge.

Extended reasoning...

Overview

The PR adds Lock m_implLock to the C++ Worker class protecting the impl_ (Zig WebWorker*) pointer. It adds a clearImpl() C++ method and a WebWorker__clearImpl C export, called from exitAndDeinit() in Zig before deinit() frees the object. All main-thread callers (terminate, setKeepAlive, updatePtr) now snapshot or null-check impl_ while holding the lock. The fix is accompanied by a regression test that repeatedly creates workers pointing at nonexistent files and calls terminate() twice.

Security risks

The bug being fixed is itself a memory-safety vulnerability (use-after-free), so the fix improves the security posture. No new security risks are introduced: the lock is local to the Worker object, there is no privilege escalation path, and the null checks are conservative (skipping the call rather than crashing).

Level of scrutiny

This is production-critical threading code. The fix is conceptually simple (null-check under a lock), but the correctness depends on the ordering of WebWorker__clearImpl vs WebWorker__dispatchExit in exitAndDeinit, and on all paths that dereference impl_ being covered. The CI report for this PR's commit shows three build failures on musl targets (aarch64-musl, x64-musl, x64-musl-baseline) listed as scripts/build/ci.ts failures — not mere clang warnings — which need to be confirmed as pre-existing before merge.

Other factors

The regression test is placed inside the worker_threads describe block but uses the Web Worker API directly; coderabbitai flagged this as a trivial inconsistency. Functionally the test exercises the same C++ code path, so it correctly validates the fix. The musl build failures are the main blocker for automated approval.

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/bun.js/bindings/webcore/Worker.cpp (1)

147-158: ⚠️ Potential issue | 🔴 Critical

updatePtr() still leaves impl_ dangling.

Line 151 drops m_implLock before Line 154 calls into Zig, so the worker thread can run clearImpl() + deinit() in that gap and make impl stale. Separately, if WebWorker__updatePtr() returns false, Lines 76-78 in src/bun.js/web_worker.zig have already freed the Zig WebWorker, but this function never nulls impl_, so later callers can still see a freed pointer.

🔧 Proposed fix
 bool Worker::updatePtr()
 {
-    void* impl;
-    {
-        Locker lock { m_implLock };
-        impl = impl_;
-    }
-    if (!impl || !WebWorker__updatePtr(impl, this)) {
+    Locker lock { m_implLock };
+    if (!impl_ || !WebWorker__updatePtr(impl_, this)) {
+        impl_ = nullptr;
         m_onlineClosingFlags = ClosingFlag;
         m_terminationFlags.fetch_or(TerminatedFlag);
         return false;
     }
 
     return true;
 }
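
The proposed change can be exercised outside the Bun tree with a small standard C++ model. Everything here is an assumption made for illustration: UpdatePtrModel and failedUpdateClearsImpl are hypothetical names, std::mutex replaces m_implLock's Lock type, a std::function replaces WebWorker__updatePtr, and a plain bool replaces the closing/termination flags.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

class UpdatePtrModel {
public:
    using UpdateFn = std::function<bool(void*)>; // stand-in for WebWorker__updatePtr

    UpdatePtrModel(void* impl, UpdateFn update)
        : impl_(impl), m_update(std::move(update))
    {
        assert(m_update); // the model needs a callable to forward to
    }

    // The lock is held across the call, and impl_ is nulled on failure so
    // later callers never see a pointer the callee has already freed.
    bool updatePtr()
    {
        std::lock_guard<std::mutex> lock(m_implLock);
        if (!impl_ || !m_update(impl_)) {
            impl_ = nullptr;
            m_terminated = true; // stands in for the closing/terminated flags
            return false;
        }
        return true;
    }

    bool hasImpl()
    {
        std::lock_guard<std::mutex> lock(m_implLock);
        return impl_ != nullptr;
    }

    bool isTerminated() const { return m_terminated; }

private:
    std::mutex m_implLock;
    void* impl_;
    UpdateFn m_update;
    bool m_terminated = false;
};

// Returns true when a successful update keeps impl_ and a failed one clears it.
bool failedUpdateClearsImpl()
{
    int fakeImpl = 0;

    UpdatePtrModel healthy(&fakeImpl, [](void*) { return true; });
    const bool healthyOk = healthy.updatePtr() && healthy.hasImpl();

    UpdatePtrModel failing(&fakeImpl, [](void*) { return false; });
    const bool updated = failing.updatePtr();

    return healthyOk && !updated && !failing.hasImpl() && failing.isTerminated();
}
```

One caveat of this shape: std::mutex is not recursive, so holding it across the callback is only safe if the callee never calls back into a method that takes the same lock. Whether that holds for the real WebWorker__updatePtr is not established by this sketch.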
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bun.js/bindings/webcore/Worker.cpp` around lines 147 - 158,
Worker::updatePtr currently releases m_implLock before calling
WebWorker__updatePtr, allowing the worker to call clearImpl()/deinit() and make
impl stale; also when WebWorker__updatePtr returns false the Zig WebWorker is
freed but impl_ is not nulled. Fix by holding m_implLock across the
WebWorker__updatePtr call (do not copy impl out and unlock early), and if
WebWorker__updatePtr returns false, set impl_ = nullptr while still holding the
lock and then set m_onlineClosingFlags = ClosingFlag and
m_terminationFlags.fetch_or(TerminatedFlag) as currently done; ensure references
to impl_ are protected by m_implLock and mention Worker::updatePtr, impl_,
m_implLock, WebWorker__updatePtr, clearImpl(), deinit(), m_onlineClosingFlags,
and m_terminationFlags in the change.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 70e506c1-bce2-42a4-a93d-ebf2176027ef

📥 Commits

Reviewing files that changed from the base of the PR and between 35f086f7560704bc7470e332adc276141eda0f11 and 8eaae51.

Comment on lines 229 to 238
         return Exception { TypeError, errorMessage.toWTFString(BunString::ZeroCopy) };
     }

-    worker->impl_ = impl;
+    {
+        Locker lock { worker->m_implLock };
+        worker->impl_ = impl;
+    }
     worker->m_workerCreationTime = MonotonicTime::now();

     return worker;

Contributor

🟣 Pre-existing ref leak in Worker::create(): worker->ref() is called unconditionally before the !impl null check, so when WebWorker__create() returns null (e.g., a preload module path fails to resolve), the local Ref<Worker> destructs and drops the refcount from 2→1 with no owner for the remaining reference. The Zig WebWorker was never created, so the balancing worker->deref() inside WebWorker__dispatchExit is never called, permanently leaking the C++ Worker object. Fix: move worker->ref() to after the null check, or call worker->deref() before returning the Exception.

Extended reasoning...

What the bug is. In Worker::create() (Worker.cpp), worker->ref() is called immediately after WebWorker__create() with the comment "now referenced by Zig". This call is unconditional—it happens regardless of whether WebWorker__create() actually succeeded. Three lines later there is a null check on impl, and if null, the function returns an Exception. At that point the local Ref<Worker> worker falls out of scope, calling deref(), which drops the refcount from 2 to 1. No other entity holds the remaining reference, so the Worker object is permanently leaked.

The code path that triggers it. WebWorker__create() returns null whenever the Zig create() function returns early with an error—most concretely when one of the preload module paths fails resolveEntryPointSpecifier (web_worker.zig ~line 233). In that case the C++ code executes: adoptRef (refcount = 1) → WebWorker__create (returns null, no Zig WebWorker struct is ever created) → worker->ref() (refcount = 2) → return Exception{} (local Ref destructs → deref() → refcount = 1). The caller receives an Exception, not a Ref<Worker>, so it never holds the remaining reference.

Why existing code doesn't prevent it. The only place that calls the balancing worker->deref() is WebWorker__dispatchExit at line ~495, which is reached only when a Zig WebWorker thread actually runs to completion and calls exitAndDeinit(). Since WebWorker__create() returned null, no WebWorker struct exists and no thread was spawned, so exitAndDeinit() is never called and the deref never happens. The allWorkers() hashmap stores a raw pointer to the Worker, so the leaked object also stays in that map forever—its destructor, which removes the entry, is never called.

Impact. Every failed new Worker(path) call where preload module resolution fails (or any other early-exit in WebWorker__create) leaks one C++ Worker object indefinitely. The leak accumulates across the process lifetime. Additionally the orphaned entry in allWorkers() means Worker::forEachWorker will attempt to post tasks to a dead ScriptExecutionContextIdentifier that is never cleaned up.

How to fix it. Option A (minimal): add worker->deref(); immediately before return Exception { ... }; on the null-check path. Option B (cleaner): move worker->ref() to after the null check—only call it when we know Zig actually took ownership of the pointer. Option B is preferred because it keeps the invariant that ref() is called only when Zig holds the reference.

Step-by-step proof.

  1. auto worker = adoptRef(*new Worker(context, WTF::move(options))); — refcount = 1.
  2. void* impl = WebWorker__create(worker.ptr(), ...); — Zig's create() hits a preload resolution failure, sets error_message, returns null. No WebWorker struct is allocated; cpp_worker pointer is never stored anywhere.
  3. worker->ref(); — refcount = 2. Comment says "now referenced by Zig", but Zig never took the reference.
  4. if (!impl) { return Exception { TypeError, ... }; }: because the return type is ExceptionOr<Ref<Worker>>, the local Ref<Worker> worker destructs here, calling deref() → refcount = 1.
  5. Caller receives Exception, not a Ref<Worker>. Nobody holds refcount = 1.
  6. WebWorker__dispatchExit (which contains the only other worker->deref()) is never called because there is no Zig worker thread to call exitAndDeinit().
  7. The Worker object leaks with refcount = 1; its destructor never runs; its entry in allWorkers() is never removed.
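
The refcount trace above can be replayed with a minimal standard C++ model. This is a sketch, not Bun's code: CountedWorker and failedCreateFreesWorker are hypothetical names, the refcount is a plain int, and the function only models the path where WebWorker__create() has already returned null.

```cpp
#include <cassert>

// Hypothetical intrusive-refcounted stand-in for the C++ Worker.
struct CountedWorker {
    int refCount = 1; // adoptRef: the creating scope owns one reference
    bool* destroyed;  // set by the destructor so callers can observe it

    explicit CountedWorker(bool* d) : destroyed(d) {}
    ~CountedWorker() { *destroyed = true; }

    void ref() { ++refCount; }
    void deref()
    {
        assert(refCount > 0);
        if (--refCount == 0)
            delete this;
    }
};

// Models the error return of Worker::create() after WebWorker__create()
// returned null. Returns true if refcounting alone destroyed the worker.
bool failedCreateFreesWorker(bool refBeforeNullCheck)
{
    bool destroyed = false;
    auto* worker = new CountedWorker(&destroyed); // refCount == 1

    if (refBeforeNullCheck)
        worker->ref(); // original order: refCount == 2 although Zig holds nothing

    worker->deref(); // error return: the local Ref<Worker> goes out of scope

    const bool freed = destroyed;
    if (!freed)
        delete worker; // the model frees its own leak so the sketch stays clean
    return freed;
}
```

Calling failedCreateFreesWorker(true) models the current ordering and reports that the worker was not freed, while failedCreateFreesWorker(false) models Option B, where ref() is skipped on the failure path and the single local reference destroys the worker on return.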

robobun commented Apr 4, 2026

Collaborator Author

CI failures on Windows are pre-existing and unrelated to this PR:

  • test/js/bun/webview/webview.test.ts: Cannot call .todoIf() on test.skip at line 493. This is the same bug pattern that #28733 (Fix .todoIf() on test.skip chain in webview.test.ts) already fixed at line 470, but line 493 wasn't covered. Present on main.
  • test/js/node/net/node-net-server.test.ts — flaky no data received on a single echo test.

My test/js/web/workers/worker.test.ts doesn't run on these shards (only message-event.test.ts and message-port-closed-leak.test.ts from test/js/web/workers/ did, both pass).
