Skip to content

MessagePort: serialize MessagePortChannelRegistry with a Lock (fix worker_threads data race)#29442

Closed
dylan-conway wants to merge 8 commits into
mainfrom
claude/lucid-cerf-6930a4
Closed

MessagePort: serialize MessagePortChannelRegistry with a Lock (fix worker_threads data race)#29442
dylan-conway wants to merge 8 commits into
mainfrom
claude/lucid-cerf-6930a4

Conversation

@dylan-conway

@dylan-conway dylan-conway commented Apr 18, 2026

Copy link
Copy Markdown
Member

What

Replaces the commented-out ASSERT(isMainThread()) guards in MessagePortChannelRegistry with a real WTF::Lock, and makes MessagePortChannel thread-safe-refcounted so the registry can be safely used from worker threads.

Fixes #29458
Fixes #25805
Fixes #22471
Fixes #16186

Why

WebKit's MessagePortChannelRegistry is main-thread-only by design — every method asserts isMainThread() and worker access is routed through WorkerMessagePortChannelProvidercallOnMainThread. Bun stripped the proxy and commented the assertions out (with a // we totally are calling these off the main thread in many cases in Bun, so ........ note), so worker threads were directly mutating m_openChannels (an unguarded HashMap) and per-channel m_pendingMessages Vectors concurrently with the main thread.

The new stress test reproduces this as bmalloc heap corruption (pas panic: deallocation did fail / bitfit allocation error) on every run against current main.

The linked issues are all the same underlying race surfacing at different points in the call path:

Issue Crash site Faulting address What was torn
#29458 SerializedScriptValue::deserializetryTakeMessagereceiveMessageOnPort 0x40 m_pendingMessages[i].first() returned a torn slot with a null RefPtr<SerializedScriptValue> while another thread reallocated the vector
#25805 MessageChannel transferred between two workers, hammered with postMessage 0x8 concurrent m_openChannels rehash + per-channel state mutation
#22471 MessagePort::postMessage 0x18 corrupted m_openChannels.get() / channel state under workers_spawned
#16186 MessagePort::postMessage 0x30 same as above, 17 workers

Why a lock instead of WebKit's thread-hop

WebKit bounces every registry call through callOnMainThread. Bun cannot do that directly because Node's synchronous receiveMessageOnPort() (tryTakeMessageForPort) needs to read the queue from the calling thread without round-tripping through the main loop. A lock gives the same single-writer guarantee WebKit's main-thread invariant provides, while keeping receiveMessageOnPort synchronous.

Details

  • MessagePortChannelRegistry gains Lock m_lock; m_openChannels becomes HashMap<…, ThreadSafeWeakPtr<MessagePortChannel>> and is WTF_GUARDED_BY_LOCK.
  • Every registry entry point takes the lock. The looked-up RefPtr<MessagePortChannel> is hoisted outside the locked scope so the channel destructor (which re-enters via messagePortChannelDestroyed) cannot deadlock.
  • MessagePortChannel is now ThreadSafeRefCountedAndCanMakeThreadSafeWeakPtr so ref/weak ops are atomic across threads.
  • MessagePortChannel::takeAllMessagesForPort now returns the message vector instead of invoking the callback directly; the registry invokes the callback after releasing the lock (the callback re-enters the registry via MessagePort::entanglePorts).
  • MessagePortChannel::tryTakeMessageForPort now clears m_pendingMessageProtectors[i] when it drains the queue (pre-existing self-ref leak on the receiveMessageOnPort path).
  • Removed dead m_messageBatchesInFlight / hasAnyMessagesPendingOrInFlight / existingChannelContainingPort (declared but never used in Bun).
  • Performance::timing()'s commented isMainThread() guard was a Window-only check in WebKit; in Bun each context has its own Performance/m_timing, so it's deleted rather than restored.

Testing

New test/js/web/workers/message-channel-concurrent.test.ts spawns 6 workers + main, each creating/transferring/posting/closing 8 channels × 400 iterations in a tight loop with no yields. Against current main this corrupts the heap on every run.

… of disabling isMainThread asserts

WebKit's MessagePortChannelRegistry is main-thread-only — every method has
ASSERT(isMainThread()) and worker access is routed through
WorkerMessagePortChannelProvider via callOnMainThread. Bun stripped that
proxy and commented the assertions out, so worker threads were mutating
m_openChannels (an unguarded HashMap) and per-channel m_pendingMessages
Vectors concurrently with the main thread. The new stress test reproduces
this as a bmalloc bitfit / pas_segregated_page double-free on every run.

Bun cannot adopt WebKit's thread-hop model directly because Node's
synchronous receiveMessageOnPort() needs to read the queue from any thread
without bouncing through the main loop, so instead the registry now owns a
WTF::Lock and every entry point takes it. The channel RefPtr is hoisted
outside the locked scope so the destructor (which re-enters the registry
via messagePortChannelDestroyed) cannot deadlock, takeAllMessagesForPort
returns the message vector and the callback is invoked after the lock is
released (it re-enters the registry via entanglePorts), and
MessagePortChannel becomes ThreadSafeRefCountedAndCanMakeThreadSafeWeakPtr
so RefPtr/WeakPtr operations are atomic across threads. The dead
m_messageBatchesInFlight tracking and the unused
existingChannelContainingPort/hasAnyMessagesPendingOrInFlight declarations
are removed along the way.

Performance::timing()'s commented isMainThread() guard was a different
case: in WebKit it gates a Window-only API, but in Bun each context has
its own Performance instance with its own m_timing, so the guard is simply
inapplicable and is deleted rather than restored.
@robobun

robobun commented Apr 18, 2026

Copy link
Copy Markdown
Collaborator
Updated 5:06 PM PT - Apr 18th, 2026

@dylan-conway, your commit de0fd56 has 2 failures in Build #46343 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 29442

That installs a local version of the PR into your bun-29442 executable, so you can run:

bun-29442 --bun

@github-actions

Copy link
Copy Markdown
Contributor

Found 3 issues this PR may fix:

  1. MessageChannel between Workers causes crash/unexpected exit #25805 - MessageChannel between Workers causes crash/unexpected exit — directly reproduces concurrent MessagePort usage across workers causing segfault
  2. Consistent SEGFAULT when using MessagePort #16186 - Consistent SEGFAULT when using MessagePort — segfault in MessagePort::postMessage with 17 concurrent workers
  3. Segmentation fault at address 0x00000018 #22471 - Segmentation fault at address 0x00000018 — near-null dereference in MessagePort::postMessage, consistent with use-after-free from unserialized registry access

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #25805
Fixes #16186
Fixes #22471

🤖 Generated with Claude Code

@github-actions

Copy link
Copy Markdown
Contributor

This PR may be a duplicate of:

  1. fix(worker): add thread safety to MessageChannel to avoid crashes #25806 - Adds WTF::Lock to MessagePortChannel and MessagePortChannelRegistry to fix the same thread-safety data race with the same approach
  2. fix(worker_threads): add thread-safety to MessagePortChannelRegistry #26016 - Adds a Lock to MessagePortChannelRegistry to protect m_openChannels from concurrent access, which is the core change in this PR

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Apr 18, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

MessagePortChannel and its registry were made thread-safe (thread-safe refcounting, thread-safe weak pointers, and a Lock). The message-draining API changed from an async CompletionHandler to a synchronous return. Message-in-flight telemetry/counters were removed. A concurrent worker stress test and minor provider/header cleanups were added.

Changes

Cohort / File(s) Summary
MessagePortChannel core
src/bun.js/bindings/webcore/MessagePortChannel.h, src/bun.js/bindings/webcore/MessagePortChannel.cpp
Switched class to ThreadSafeRefCountedAndCanMakeThreadSafeWeakPtr. Replaced async takeAllMessagesForPort(... CompletionHandler<...>) with synchronous Vector<MessageWithMessagePorts> takeAllMessagesForPort(const MessagePortIdentifier&). Removed hasAnyMessagesPendingOrInFlight(), beingTransferredCount(), and m_messageBatchesInFlight. Removed local keep-alive guards and now clears pending-message protectors directly; also removed relaxAdoptionRequirement() call.
Registry thread-safety
src/bun.js/bindings/webcore/MessagePortChannelRegistry.h, src/bun.js/bindings/webcore/MessagePortChannelRegistry.cpp
Added wtf::Lock m_lock and WTF_GUARDED_BY_LOCK(m_lock); changed stored weak refs to ThreadSafeWeakPtr<MessagePortChannel>. Serialized access: lock around registry mutations and lookups, copy out channels/messages under lock, then invoke channel methods or callbacks after unlocking. Removed existingChannelContainingPort() declaration.
Provider plumbing & forwarding
src/bun.js/bindings/webcore/MessagePortChannelProvider.cpp, src/bun.js/bindings/webcore/MessagePortChannelProviderImpl.cpp
Removed a commented-out ASSERT(isMainThread()). Simplified takeAllMessagesForPort forwarding by renaming the parameter and passing the callback directly to the registry (removed wrapper lambda).
Headers / includes
src/bun.js/bindings/webcore/MessagePortChannel.h, src/bun.js/bindings/webcore/MessagePortChannelRegistry.h
Added includes for ThreadSafeWeakPtr.h and wtf/Lock.h; removed now-unneeded non-thread-safe RefCounted includes.
Performance cleanup
src/bun.js/bindings/webcore/Performance.cpp
Removed previously commented-out scriptExecutionContext/main-thread checks; behavior unchanged.
Tests — concurrent stress
test/js/web/workers/message-channel-concurrent-fixture.js, test/js/web/workers/message-channel-concurrent.test.ts
Added a worker-thread stress fixture that concurrently creates/posts/transfers/closes MessagePorts and drains messages; added a test that spawns the fixture, validates PASS stdout and exit code 0, and filters sanitizer warnings from stderr.
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding a Lock to serialize MessagePortChannelRegistry to fix a worker_threads data race.
Description check ✅ Passed PR description covers both required sections: detailed 'What' explaining the fix and comprehensive 'Why' with technical rationale and design decisions.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/web/workers/message-channel-concurrent-fixture.js`:
- Around line 32-61: Add a subprocess watchdog that enforces a timeout when
running in the parent (isMainThread) so the fixture fails fast if a worker never
posts back: create a timer (e.g., watchdog = setTimeout(..., TIMEOUT_MS))
started before spawning/aggregating workers, have the timeout handler log an
error, terminate all Worker instances in the workers array, and call
process.exit(1); ensure the timer is cleared (clearTimeout(watchdog)) inside
finish() so successful completion cancels the watchdog; reference the existing
isMainThread block, WORKERS loop, workers array, finish() function, and worker
termination logic when implementing this.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a210a353-36ad-49f8-8630-5ff5976ce112

📥 Commits

Reviewing files that changed from the base of the PR and between 8f2519f and a3a7721.

📒 Files selected for processing (9)
  • src/bun.js/bindings/webcore/MessagePortChannel.cpp
  • src/bun.js/bindings/webcore/MessagePortChannel.h
  • src/bun.js/bindings/webcore/MessagePortChannelProvider.cpp
  • src/bun.js/bindings/webcore/MessagePortChannelProviderImpl.cpp
  • src/bun.js/bindings/webcore/MessagePortChannelRegistry.cpp
  • src/bun.js/bindings/webcore/MessagePortChannelRegistry.h
  • src/bun.js/bindings/webcore/Performance.cpp
  • test/js/web/workers/message-channel-concurrent-fixture.js
  • test/js/web/workers/message-channel-concurrent.test.ts
💤 Files with no reviewable changes (2)
  • src/bun.js/bindings/webcore/Performance.cpp
  • src/bun.js/bindings/webcore/MessagePortChannelProvider.cpp

Comment thread test/js/web/workers/message-channel-concurrent-fixture.js Outdated

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a well-motivated and correctly structured threading fix, but the scope (ref-counting model change, lock protocol, callback → return-value refactor) warrants a human eye before merging — especially given the pre-existing protector-leak found in tryTakeMessageForPort.

Extended reasoning...

Overview

This PR replaces the previously unguarded (comment-only) main-thread assertions in MessagePortChannelRegistry with a real WTF::Lock, upgrades MessagePortChannel to ThreadSafeRefCountedAndCanMakeThreadSafeWeakPtr, changes m_openChannels to use ThreadSafeWeakPtr, and converts takeAllMessagesForPort from a callback to a return-value interface. It also removes dead code (m_messageBatchesInFlight, existingChannelContainingPort, hasAnyMessagesPendingOrInFlight) and adds a concurrent stress test.

Security risks

No authentication, crypto, or permissions code is involved. The change affects object lifetime and heap safety — the pre-existing issue in tryTakeMessageForPort not clearing m_pendingMessageProtectors[i] after draining the last message produces a temporary self-referential retain cycle. This is a memory-safety concern (delayed deallocation window in long-lived workers) rather than a security vulnerability.

Level of scrutiny

This merits careful human review. The change is not mechanical: it alters the ref-counting model for a class that interacts with the GC/finalizer path, introduces a lock protocol that must be consistently followed across all future callers, and changes the callback contract for takeAllMessagesForPort. Lock discipline bugs in C++ (use-after-free, deadlock) can be hard to catch in code review and catastrophic at runtime.

Other factors

The stress test is thorough and the PR description accurately explains the design trade-offs. The bug found by the hunter (tryTakeMessageForPort leaving a stale self-ref protector) is pre-existing and not introduced here, but the PR makes that path more prominent via receiveMessageOnPort. A trivial one-liner fix in the same file would eliminate the asymmetry cleanly. CI results are also still pending.

Comment thread src/bun.js/bindings/webcore/MessagePortChannel.cpp
Comment thread test/js/web/workers/message-channel-concurrent.test.ts Outdated
…; harden stress fixture

tryTakeMessageForPort (the receiveMessageOnPort path) never cleared
m_pendingMessageProtectors[i] after draining the queue, leaving a
self-reference that delayed channel destruction until close(). Clear it
when the queue empties, matching takeAllMessagesForPort.

The stress fixture is reworked to wait on each worker's exit event and let
the main loop drain naturally instead of calling worker.terminate() then
process.exit() — the abrupt teardown was racing a separate, pre-existing
postTaskConcurrently-after-terminate check on x64-asan and obscuring the
result. A 30s subprocess watchdog and a 60s outer timeout are added so a
deadlock regression fails fast with a useful message instead of hanging.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/web/workers/message-channel-concurrent-fixture.js`:
- Around line 45-74: The test only checks total !== 0 so a failing worker can be
hidden; compute expectedPerThread = ITERATIONS * CHANNELS_PER_ITERATION * 2,
track each worker's reported count (e.g. collect messages in a workersCounts
array or object from the worker.on("message", n => ...) handler keyed by worker
index/ID), keep the main-thread hammer() result separate, and in finish() verify
the main-thread result equals expectedPerThread and every worker's reported
count equals expectedPerThread (or at least that every worker reported >0)
before printing PASS; if any worker is missing or mismatched, fail the test with
process.exit(1).

In `@test/js/web/workers/message-channel-concurrent.test.ts`:
- Around line 18-20: The test currently asserts stderr is empty directly (using
the variables stderr and expect), but on ASAN builders Bun subprocesses can emit
a known startup warning; update the assertion to filter out any stderr lines
that start with "WARNING: ASAN interferes" (split stderr into lines, remove
lines matching that prefix) and then assert the filtered stderr is empty; apply
this change where stdout, stderr, exitCode are awaited and before the
expect(stderr).toBe("") assertion so the test uses the repo's filtered-stderr
convention.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a7e33a01-fb55-464d-9d48-7bc9b46943ea

📥 Commits

Reviewing files that changed from the base of the PR and between 8bf9d4e and 6d5e1ee.

📒 Files selected for processing (3)
  • src/bun.js/bindings/webcore/MessagePortChannel.cpp
  • test/js/web/workers/message-channel-concurrent-fixture.js
  • test/js/web/workers/message-channel-concurrent.test.ts

Comment thread test/js/web/workers/message-channel-concurrent-fixture.js Outdated
Comment thread test/js/web/workers/message-channel-concurrent.test.ts Outdated
…rning

The fixture now verifies each hammer() returns exactly
ITERATIONS * CHANNELS_PER_ITERATION * 2 messages and that every worker
reports before printing PASS, so a worker-only regression cannot be hidden
by the main-thread contribution. The test wrapper filters the known
'WARNING: ASAN interferes' startup line before asserting empty stderr,
matching the convention in other subprocess tests.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/web/workers/message-channel-concurrent-fixture.js`:
- Around line 9-12: The four constants WORKERS, ITERATIONS,
CHANNELS_PER_ITERATION, and EXPECTED_PER_HAMMER use SCREAMING_SNAKE_CASE but the
repo requires camelCase for JS/TS; rename them to camelCase (e.g., workers,
iterations, channelsPerIteration, expectedPerHammer) and update all references
in this file (and other occurrences noted at lines referenced in the review) so
the variable names remain consistent and tests still compute EXPECTED_PER_HAMMER
as iterations * channelsPerIteration * 2.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f0e59577-466e-4c06-9afb-7da8d54fd0bb

📥 Commits

Reviewing files that changed from the base of the PR and between 6d5e1ee and 8fb83f9.

📒 Files selected for processing (2)
  • test/js/web/workers/message-channel-concurrent-fixture.js
  • test/js/web/workers/message-channel-concurrent.test.ts

Comment thread test/js/web/workers/message-channel-concurrent-fixture.js Outdated
Comment thread test/js/web/workers/message-channel-concurrent.test.ts
CI's runner passes --timeout=90000 (scripts/runner.node.mjs testTimeout/2),
so the fixture's own 30s watchdog already fires well within the per-test
ceiling and the explicit 60s override is redundant.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/web/workers/message-channel-concurrent.test.ts`:
- Around line 18-24: Remove the brittle empty-stderr gating and instead rely on
the subprocess output and exit code as the regression signal: drop the
filteredStderr computation and its expect assertion, and change the assertions
to verify that stdout (the variable `stdout` resolved from `proc.stdout.text()`)
contains the expected "PASS" message and that `exitCode` (from `proc.exited`) is
0; update the Promise.all usage taking `stdout` and `exitCode` into account and
remove references to `stderr`/`filteredStderr` (variables `stderr` and
`filteredStderr`) from the test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2a1ec395-1d87-4614-b019-b9e2aec4574c

📥 Commits

Reviewing files that changed from the base of the PR and between 8fb83f9 and 36248eb.

📒 Files selected for processing (1)
  • test/js/web/workers/message-channel-concurrent.test.ts

Comment thread test/js/web/workers/message-channel-concurrent.test.ts Outdated
The fixture now exports hammer() and self-executes when loaded as a Worker;
the test spawns the workers directly and asserts each returns the expected
count. On a regression the test process itself crashes, which the runner
reports as a failure — no need for the subprocess wrapper.
… bug

receiveMessageOnPort's JS wrapper uses if (!res) so a message value of 0
is treated as no-message, which short-circuited the receive loop on the
first channel (j === 0) of every iteration and made the count come up 400
short. Post {j} instead so the payload is always truthy; the wrapper bug
is pre-existing and tracked separately.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — both previously flagged issues have been addressed: the protector leak is fixed (commit 6d5e1ee) and the spurious test timeout was removed (commit 36248eb). The locking approach is sound.

Extended reasoning...

Overview

This PR replaces the commented-out ASSERT(isMainThread()) guards in MessagePortChannelRegistry with a WTF::Lock, and upgrades MessagePortChannel to ThreadSafeRefCountedAndCanMakeThreadSafeWeakPtr. It touches 7 C++ source files and 2 new test files.

Security risks

No security-sensitive code paths (auth, crypto, permissions) are involved. The change narrows a race condition; it does not introduce new attack surface.

Level of scrutiny

This is a concurrency fix in production-critical runtime code, but the approach is straightforward: a single coarse-grained lock around an otherwise-unguarded HashMap. The key correctness details — hoisting RefPtr outside lock scope before the destructor can re-enter via messagePortChannelDestroyed, calling the takeAllMessagesForPort callback after releasing the lock to avoid deadlock on re-entrant registry access — are clearly handled. Both previously raised concerns (protector leak in tryTakeMessageForPort and test timeout convention violation) have been resolved.

Other factors

No outstanding reviewer comments remain unaddressed. The stress test reliably reproduces the pre-fix heap corruption and validates the fix. CI build failures visible in the timeline are on the earlier commit and appear to be unrelated zig build issues, not failures in the C++ changes under review.

Jarred-Sumner pushed a commit that referenced this pull request Apr 30, 2026
…primitive (#29937)

## What

Replaces the WebKit-derived `MessagePortChannelProvider` /
`MessagePortChannelProviderImpl` / `MessagePortChannelRegistry` /
`MessagePortChannel` stack — plus the
`BroadcastChannel::MainThreadBridge` / per-context
`BunBroadcastChannelRegistry` — with a small, thread-safe concurrency
primitive (`MessagePortPipe`) that the Web-facing classes wrap thinly.

Net −1082 lines (+737 / −1819), 10 files deleted.

## Why

The existing stack is Safari's multi-*process* design carried over
verbatim: an abstract provider so UIProcess can swap in an IPC impl, a
process-global `MessagePortIdentifier → MessagePortChannel` HashMap,
`ProcessIdentifier` tracking, self-`RefPtr` "protector" members to keep
channels alive across IPC hops, and a main-thread serialization point
for `BroadcastChannel`. None of that indirection is needed in Bun's
single-process multi-thread model — and worse, the
`MessagePortChannelRegistry::m_openChannels` HashMap and
`MessagePortChannel::m_pendingMessages` Vector are mutated from worker
threads with **no lock** (the WebKit `ASSERT(isMainThread())` guards
were just commented out). Concurrent `new MessageChannel()` across
workers corrupts the HashMap:

```
HashTable.h:1574:17: runtime error: member access within null pointer of type
  'const HashTable<MessagePortIdentifier, KeyValuePair<MessagePortIdentifier,
   WeakRef<MessagePortChannel>>, ...>'
```

This is the cause of #16186, #22471, #25805 and likely #29458.

## Design

**`MessagePortPipe`** — `ThreadSafeRefCounted`, two sides. Each side
has:
- a `Lock` + `Deque<MessageWithMessagePorts>` inbox
- one `std::atomic<uint64_t>` state word: `Closed | DrainScheduled |
Attached` in the low byte, queued-count in the high bits

`send()` locks the destination side, enqueues, and if `Attached &&
!DrainScheduled` flips `DrainScheduled` and posts one task to the
receiver's `ScriptExecutionContext`. A burst of N sends schedules at
most one cross-thread wakeup. `takeAll()` clears `DrainScheduled` before
swapping the deque, so a racing send reschedules — at most one extra
no-op drain, never a stranded message. `attach()`/`detach()` handle
transfer: a detached side buffers; attach flushes the backlog with one
wakeup. `hasPendingActivity` is two atomic loads (my queued count +
peer's `Closed`), safe from the GC thread with no lock.

**`MessagePort`** — holds `RefPtr<MessagePortPipe>` + `uint8_t side`.
`postMessage` → `pipe->send`. `start()` → `pipe->attach`. `close()` →
`pipe->close`. Transfer → `pipe->detach`, hand `{pipe, side}` through
`TransferredMessagePort`, re-create on the other side. No global maps,
no identifiers, no per-context port iteration.

The drain task takes the whole inbox in one go (sender-side batching),
then posts each message as its own local task so microtasks checkpoint
between deliveries — matching the HTML "port message queue is a task
source" semantics and Node's observable ordering.

**`BroadcastChannel`** — one process-global `Lock` + `HashMap<String,
Vector<{ctxId, weak channel}>>`. `postMessage` snapshots subscribers
under the lock, releases, then posts one task per `(message,
subscriber)` directly to each subscriber's context — no bounce through
the main thread, no `MainThreadBridge`, no second `allBroadcastChannels`
map. Per-channel inbox batching was prototyped and reverted: the spec's
same-event-loop `(message-major, creation-minor)` delivery-order test
(WPT `broadcast-channel.test.ts::"messages are delivered in port
creation order"`) fails if subscribers drain channel-major. A
per-`(context, name)` inbox could restore batching without breaking
order — left as future work.

## Deleted

`MessagePortChannel.{h,cpp}`, `MessagePortChannelProvider.{h,cpp}`,
`MessagePortChannelProviderImpl.{h,cpp}`,
`MessagePortChannelRegistry.{h,cpp}`, `MessagePortIdentifier.h`,
`BroadcastChannelRegistry.h`, the `allMessagePorts()` /
`portToContextIdentifier()` / `allBroadcastChannels()` /
`channelToContextIdentifier()` global maps,
`ScriptExecutionContext::{m_messagePorts,
processMessageWithMessagePortsSoon, dispatchMessagePortEvents,
createdMessagePort, destroyedMessagePort, m_broadcastChannelRegistry}`,
and `WorkerGlobalScope::messagePortChannelProvider()`.

## Verification

- `test/js/web/workers/message-channel.test.ts` — 12/12
- `test/js/web/broadcastchannel/broadcast-channel.test.ts` — 11/11
- `test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts` —
3/3
- `test/js/web/workers/` suite — 228 pass / 6 fail; all 6 fail
identically on `main` (container-slowness timeouts), zero new failures
- `test/js/node/worker_threads/worker_threads.test.ts` — 25 pass / 2
fail; both fail identically on `main`
- 10 node `parallel/test-worker-message-port-*.js` +
`test-broadcastchannel-*.js` — all pass
- New `test/js/web/workers/message-port-pipe.test.ts`:
- `concurrent MessageChannel creation across workers is race-free` —
**fails 3/3 on main** with the HashTable UBSan null-deref shown above;
**passes 5/5** on this branch
- microtask ordering, buffered-before-start, FIFO
`receiveMessageOnPort`, 1000-message cross-thread burst

## Relationship to other PRs

#29832 / #29442 patch the race by adding a lock to the existing
registry. This PR removes the registry instead — there's nothing left to
race on. Either approach fixes the crash; this one also sheds the
layering.

Fixes #16186
Fixes #22471
Fixes #25805

---------

Co-authored-by: robobun <robobun@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
xhjkl pushed a commit to xhjkl/bun that referenced this pull request May 14, 2026
…primitive (oven-sh#29937)

## What

Replaces the WebKit-derived `MessagePortChannelProvider` /
`MessagePortChannelProviderImpl` / `MessagePortChannelRegistry` /
`MessagePortChannel` stack — plus the
`BroadcastChannel::MainThreadBridge` / per-context
`BunBroadcastChannelRegistry` — with a small, thread-safe concurrency
primitive (`MessagePortPipe`) that the Web-facing classes wrap thinly.

Net −1082 lines (+737 / −1819), 10 files deleted.

## Why

The existing stack is Safari's multi-*process* design carried over
verbatim: an abstract provider so UIProcess can swap in an IPC impl, a
process-global `MessagePortIdentifier → MessagePortChannel` HashMap,
`ProcessIdentifier` tracking, self-`RefPtr` "protector" members to keep
channels alive across IPC hops, and a main-thread serialization point
for `BroadcastChannel`. None of that indirection is needed in Bun's
single-process multi-thread model — and worse, the
`MessagePortChannelRegistry::m_openChannels` HashMap and
`MessagePortChannel::m_pendingMessages` Vector are mutated from worker
threads with **no lock** (the WebKit `ASSERT(isMainThread())` guards
were just commented out). Concurrent `new MessageChannel()` across
workers corrupts the HashMap:

```
HashTable.h:1574:17: runtime error: member access within null pointer of type
  'const HashTable<MessagePortIdentifier, KeyValuePair<MessagePortIdentifier,
   WeakRef<MessagePortChannel>>, ...>'
```

This is the cause of oven-sh#16186, oven-sh#22471, oven-sh#25805 and likely oven-sh#29458.

## Design

**`MessagePortPipe`** — `ThreadSafeRefCounted`, two sides. Each side
has:
- a `Lock` + `Deque<MessageWithMessagePorts>` inbox
- one `std::atomic<uint64_t>` state word: `Closed | DrainScheduled |
Attached` in the low byte, queued-count in the high bits

`send()` locks the destination side, enqueues, and if `Attached &&
!DrainScheduled` flips `DrainScheduled` and posts one task to the
receiver's `ScriptExecutionContext`. A burst of N sends schedules at
most one cross-thread wakeup. `takeAll()` clears `DrainScheduled` before
swapping the deque, so a racing send reschedules — at most one extra
no-op drain, never a stranded message. `attach()`/`detach()` handle
transfer: a detached side buffers; attach flushes the backlog with one
wakeup. `hasPendingActivity` is two atomic loads (my queued count +
peer's `Closed`), safe from the GC thread with no lock.

**`MessagePort`** — holds `RefPtr<MessagePortPipe>` + `uint8_t side`.
`postMessage` → `pipe->send`. `start()` → `pipe->attach`. `close()` →
`pipe->close`. Transfer → `pipe->detach`, hand `{pipe, side}` through
`TransferredMessagePort`, re-create on the other side. No global maps,
no identifiers, no per-context port iteration.

The drain task takes the whole inbox in one go (sender-side batching),
then posts each message as its own local task so microtasks checkpoint
between deliveries — matching the HTML "port message queue is a task
source" semantics and Node's observable ordering.

**`BroadcastChannel`** — one process-global `Lock` + `HashMap<String,
Vector<{ctxId, weak channel}>>`. `postMessage` snapshots subscribers
under the lock, releases, then posts one task per `(message,
subscriber)` directly to each subscriber's context — no bounce through
the main thread, no `MainThreadBridge`, no second `allBroadcastChannels`
map. Per-channel inbox batching was prototyped and reverted: the spec's
same-event-loop `(message-major, creation-minor)` delivery-order test
(WPT `broadcast-channel.test.ts::"messages are delivered in port
creation order"`) fails if subscribers drain channel-major. A
per-`(context, name)` inbox could restore batching without breaking
order — left as future work.

## Deleted

`MessagePortChannel.{h,cpp}`, `MessagePortChannelProvider.{h,cpp}`,
`MessagePortChannelProviderImpl.{h,cpp}`,
`MessagePortChannelRegistry.{h,cpp}`, `MessagePortIdentifier.h`,
`BroadcastChannelRegistry.h`, the `allMessagePorts()` /
`portToContextIdentifier()` / `allBroadcastChannels()` /
`channelToContextIdentifier()` global maps,
`ScriptExecutionContext::{m_messagePorts,
processMessageWithMessagePortsSoon, dispatchMessagePortEvents,
createdMessagePort, destroyedMessagePort, m_broadcastChannelRegistry}`,
and `WorkerGlobalScope::messagePortChannelProvider()`.

## Verification

- `test/js/web/workers/message-channel.test.ts` — 12/12
- `test/js/web/broadcastchannel/broadcast-channel.test.ts` — 11/11
- `test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts` —
3/3
- `test/js/web/workers/` suite — 228 pass / 6 fail; all 6 fail
identically on `main` (container-slowness timeouts), zero new failures
- `test/js/node/worker_threads/worker_threads.test.ts` — 25 pass / 2
fail; both fail identically on `main`
- 10 node `parallel/test-worker-message-port-*.js` +
`test-broadcastchannel-*.js` — all pass
- New `test/js/web/workers/message-port-pipe.test.ts`:
- `concurrent MessageChannel creation across workers is race-free` —
**fails 3/3 on main** with the HashTable UBSan null-deref shown above;
**passes 5/5** on this branch
- microtask ordering, buffered-before-start, FIFO
`receiveMessageOnPort`, 1000-message cross-thread burst

## Relationship to other PRs

oven-sh#29832 / oven-sh#29442 patch the race by adding a lock to the existing
registry. This PR removes the registry instead — there's nothing left to
race on. Either approach fixes the crash; this one also sheds the
layering.

Fixes oven-sh#16186
Fixes oven-sh#22471
Fixes oven-sh#25805

---------

Co-authored-by: robobun <robobun@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
@robobun

robobun commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Superseded by #29937, which removed MessagePortChannelRegistry / MessagePortChannel entirely and replaced them with the thread-safe MessagePortPipe primitive. The files this PR locks are now empty stubs on main, so the diff no longer applies.

The regression coverage here is preserved: test/js/web/workers/message-port-pipe.test.ts has an equivalent "concurrent MessageChannel creation across workers is race-free" test plus a cross-thread burst ordering test. The linked issues (#16186, #22471, #25805, #29458) are all closed as fixed by #29937.

Closing as superseded.

@robobun robobun closed this Jun 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants