Skip to content

Guard MessagePortChannel state with a per-channel Lock#29991

Closed
robobun wants to merge 1 commit into
mainfrom
farm/d098c4c7/messageportchannel-lock
Closed

Guard MessagePortChannel state with a per-channel Lock#29991
robobun wants to merge 1 commit into
mainfrom
farm/d098c4c7/messageportchannel-lock

Conversation

@robobun

@robobun robobun commented Apr 30, 2026

Copy link
Copy Markdown
Collaborator

Problem

MessagePortChannel is shared between the main thread and worker threads in Bun. Upstream WebKit serializes all access to it with ASSERT(isMainThread()), but Bun commented those out and calls directly from worker threads (see the note in MessagePortChannelRegistry.cpp: "we totally are calling these off the main thread in many cases in Bun, so ........").

This leaves two unsynchronized hazards inside MessagePortChannel itself:

  1. m_pendingMessages Vector racepostMessageToRemote() on the sender thread calls m_pendingMessages[i].append(...) while takeAllMessagesForPort() on the receiver thread does result.swap(m_pendingMessages[i]). When append() reallocs the backing buffer mid-swap, ASAN reports container-overflow / SEGV in Vector::appendSlowCase → expandCapacity.
  2. Non-atomic refcountMessagePortChannel derives from RefCountedAndCanMakeWeakPtr (non-atomic RefCounted). Concurrent RefPtr channel = m_openChannels.get(...) from the registry on two threads can lose an increment → premature destruction → ASSERTION FAILED: m_openChannels.get(channel.port1()) == &channel in messagePortChannelDestroyed().

Repro

Worker floods 20 000 messages through a transferred MessagePort while the main thread's onmessage handler echoes each one back, so both threads are continuously appending to and draining from the same channel. Without the fix, under the debug ASAN build:

ERROR: AddressSanitizer: container-overflow ... thread T9 (Worker)
  #9 WebCore::MessagePortChannel::postMessageToRemote(...) MessagePortChannel.cpp:127
  #8 WTF::Vector<MessageWithMessagePorts>::append(...)
  #7 ...::appendSlowCase → expandCapacity → reserveCapacity

or

ASSERTION FAILED: m_openChannels.get(channel.port1()) == &channel
MessagePortChannelRegistry.cpp(72) : messagePortChannelDestroyed

9/10 runs crash. With the fix: 15/15 pass.

Fix

  • Add Lock m_lock to MessagePortChannel and take it in every method that touches mutable state (postMessageToRemote, takeAllMessagesForPort, tryTakeMessageForPort, closePort, disentanglePort, entanglePortWithProcess, processForPort). This matches the Locker locker { m_lock } pattern used in EventListenerMap, BroadcastChannel, AbortSignal.
  • Drop the lock before invoking the takeAllMessagesForPort completion callback (which can re-enter via dispatchMessages → postMessage) and before the last self-protector RefPtr is released (so the destructor never runs while holding its own lock). The formerly-vestigial "guarantee that the lock is unlocked before the last ref is released" comments are now accurate again.
  • Change the base class to ThreadSafeRefCounted<MessagePortChannel>, CanMakeWeakPtr<MessagePortChannel> so ref()/deref() are atomic. Drop the explicit relaxAdoptionRequirement() since ThreadSafeRefCountedBase already calls it in its constructor.

Verification

# Without fix (src/ stashed)
bun bd test test/js/web/workers/message-channel.test.ts -t 'concurrent MessagePort'
  (fail) ... AddressSanitizer: container-overflow ...

# With fix
bun bd test test/js/web/workers/message-channel.test.ts -t 'concurrent MessagePort'
  (pass) ... [17098ms]

# Full file
bun bd test test/js/web/workers/message-channel.test.ts
  13 pass, 0 fail

Related

Upstream WebKit serializes all MessagePortChannel access to the main
thread via ASSERT(isMainThread()). Bun removed those asserts and calls
postMessageToRemote() / takeAllMessagesForPort() from both the worker
thread and the main thread directly, so m_pendingMessages[i].append()
races with result.swap(m_pendingMessages[i]) and the non-atomic
RefCounted refcount is corrupted by concurrent ref()/deref() in the
registry lookup path.

Under ASAN this surfaces as container-overflow / SEGV in
Vector::appendSlowCase -> expandCapacity, or as a premature destructor
firing messagePortChannelDestroyed with a stale registry entry.

Add a per-channel Lock guarding all mutable state (m_pendingMessages,
m_pendingMessageProtectors, m_entangledToProcessProtectors, m_processes,
m_isClosed, m_pendingMessagePortTransfers, m_messageBatchesInFlight) and
switch the base class from RefCountedAndCanMakeWeakPtr to
ThreadSafeRefCounted + CanMakeWeakPtr so ref/deref are atomic. The lock
is dropped before invoking the takeAllMessagesForPort callback and
before the last self-protector RefPtr is released, so there is no
re-entrancy deadlock and no self-destruction under the lock.
@coderabbitai

coderabbitai Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Warning

Rate limit exceeded

@robobun has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 5 minutes and 18 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5b530003-ef55-4917-8f1f-3f654b7138ca

📥 Commits

Reviewing files that changed from the base of the PR and between 360bbb5 and fcfb9db.

📒 Files selected for processing (3)
  • src/bun.js/bindings/webcore/MessagePortChannel.cpp
  • src/bun.js/bindings/webcore/MessagePortChannel.h
  • test/js/web/workers/message-channel.test.ts

Review rate limit: 0/5 reviews remaining, refill in 5 minutes and 18 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@robobun

robobun commented Apr 30, 2026

Copy link
Copy Markdown
Collaborator Author
Updated 4:18 AM PT - Apr 30th, 2026

@robobun, your commit fcfb9db has 2 failures in Build #49377 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 29991

That installs a local version of the PR into your bun-29991 executable, so you can run:

bun-29991 --bun

@github-actions

Copy link
Copy Markdown
Contributor

Found 3 issues this PR may fix:

  1. Consistent SEGFAULT when using MessagePort #16186 - Consistent SEGFAULT when using MessagePort; the crash occurs during heavy concurrent worker MessagePort usage, matching the postMessage/takeAllMessages data race this PR guards with a per-channel lock.
  2. Segmentation fault at address 0x00000018 #22471 - Segmentation fault at address 0x00000018; stack trace points directly to WebCore::MessagePort::postMessage (MessagePort.cpp:186), the exact code path now protected by the new lock.
  3. Worker messages and close events often lost if worker calls process.exit immediately after postMessage #14144 - Worker messages and close events often lost if worker calls process.exit immediately after postMessage; the race between posting a message and channel teardown is the concurrent access scenario this PR's lock guards against.

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #16186
Fixes #22471
Fixes #14144

🤖 Generated with Claude Code

@github-actions

Copy link
Copy Markdown
Contributor

This PR may be a duplicate of:

  1. MessagePort: serialize MessagePortChannelRegistry with a Lock (fix worker_threads data race) #29442 - Also fixes MessageChannel between Workers causes crash/unexpected exit #25805 by adding locking to MessagePortChannel (with ThreadSafeRefCounted) and MessagePortChannelRegistry; modifies the same files
  2. fix(worker): add thread safety to MessageChannel to avoid crashes #25806 - Adds a Lock to MessagePortChannel protecting m_pendingMessages and m_processes for thread safety; same root cause and approach
  3. Replace MessagePort/BroadcastChannel registries with MessagePortPipe primitive #29937 - Replaces the entire MessagePortChannel/MessagePortChannelRegistry stack with a thread-safe MessagePortPipe primitive; fixes MessageChannel between Workers causes crash/unexpected exit #25805 via rewrite

🤖 Generated with Claude Code

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The locking and ThreadSafeRefCounted changes look correct to me, but since this is cross-thread C++ lifetime/locking code and you've explicitly flagged #29442 as an alternative design, I'd like a human to make the call on per-channel vs. registry-wide locking (and to sanity-check that the non-thread-safe CanMakeWeakPtr base is still OK for the registry's WeakRef map now that destruction can race across threads).

Extended reasoning...

Overview

This PR adds a per-instance WTF::Lock to MessagePortChannel and switches its base from RefCountedAndCanMakeWeakPtr to ThreadSafeRefCounted + CanMakeWeakPtr, guarding m_pendingMessages, m_processes, m_isClosed, the self-protector RefPtrs, and m_messageBatchesInFlight in every mutator. It carefully moves protectors and pending containers out to locals before the Locker scope ends so the destructor never runs while holding its own lock, and drops the lock before invoking the takeAllMessagesForPort completion callback to avoid re-entrancy deadlocks. A new ASAN/debug-only stress test floods 20k messages bidirectionally through a transferred port across three subprocess attempts.

Security risks

None in the traditional sense — no parsing, auth, or untrusted-input handling changed. The change removes a memory-safety hazard (Vector realloc race / non-atomic refcount UAF) rather than introducing one. The remaining risk is a subtler concurrency bug (deadlock or a missed lock site) rather than an exploitable surface.

Level of scrutiny

High. This is hand-written cross-thread synchronization in C++ touching object lifetime (ThreadSafeRefCounted, self-referential RefPtr protectors, WeakRef consumers in the registry). Mistakes here manifest as rare crashes that are hard to reproduce. Two specific things I'd want a human to confirm:

  • The base is now ThreadSafeRefCounted but still the non-thread-safe CanMakeWeakPtr; MessagePortChannelRegistry::m_openChannels holds WeakRef<MessagePortChannel>, and #29442 went with ThreadSafeRefCountedAndCanMakeThreadSafeWeakPtr instead. If weak-ref invalidation can now race with m_openChannels.get() on another thread, that may need the thread-safe variant (or rely on #29972's registry lock to serialize it).
  • Dropping relaxAdoptionRequirement() relies on ThreadSafeRefCountedBase not enforcing the adoption assertion that RefCountedBase does — that's true in current WebKit, but worth a glance since the constructor still creates RefPtr-to-this before adoptRef.

Other factors

The author explicitly offers to close this in favor of #29442's broader single-registry-lock design, which is a maintainer-level architectural choice I shouldn't make. The only inline finding is a style nit about the explicit 120_000 test timeout, which is informational and consistent with many other slow tests in test/js/ — not a blocker. No CODEOWNERS cover this path.

Comment thread test/js/web/workers/message-channel.test.ts
@robobun

robobun commented Apr 30, 2026

Copy link
Copy Markdown
Collaborator Author

CI build #49377 finished with 2 failed jobs, both pre-existing flakes unrelated to MessagePortChannel:

Test Job Also failing on
test/js/node/watch/fs.watch.test.ts — "closed FSWatcher is collectable > persistent: false", segfault at 0x0 ubuntu 25.04 x64 main #49200, #49300
test/js/bun/s3/s3-storage-class.test.tsS3Error: UnknownError in "should work with writer + options on big file" darwin 14 aarch64 #49300 (unrelated lockfile branch)

All 20 x64-asan test shards — where the new concurrency test in message-channel.test.ts runs — passed. No MessagePort/MessageChannel/worker test failures on any platform.

@robobun

robobun commented Apr 30, 2026

Copy link
Copy Markdown
Collaborator Author

Re the review's CanMakeWeakPtr concern:

CanMakeWeakPtr<T> with the default DefaultWeakPtrImpl already uses ThreadSafeRefCounted for the impl's own refcount (WeakPtrImplBase : public ThreadSafeRefCounted<Derived> in WeakPtrImpl.h), and the factory is eagerly initialized during messagePortChannelCreated() in the constructor before the channel is published to other threads — so there's no lazy-init race.

The remaining hazard — thread A in ~MessagePortChannel() removing the map entry while thread B does m_openChannels.get(port) and upgrades the WeakRef to a RefPtr — is the Registry's unsynchronized m_openChannels HashMap, which is exactly what #29972 guards. That race exists today on main independent of this PR (destruction could already land on any thread once both ports close); switching to atomic refcounting here doesn't introduce it, it just makes the refcount itself stop being one of the corruption sources.

If #29972's Registry lock turns out to be insufficient and the weak→strong upgrade needs ThreadSafeWeakPtr semantics, that would be a Registry-side change to m_openChannels' value type rather than to this class's base — happy to follow up there or defer to #29442's broader approach.

@Jarred-Sumner

Copy link
Copy Markdown
Collaborator

Closing in favor of #29937

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MessageChannel between Workers causes crash/unexpected exit

2 participants