MessagePort: serialize MessagePortChannelRegistry with a Lock (fix worker_threads data race)#29442
MessagePort: serialize MessagePortChannelRegistry with a Lock (fix worker_threads data race)#29442dylan-conway wants to merge 8 commits into
Conversation
… of disabling isMainThread asserts WebKit's MessagePortChannelRegistry is main-thread-only — every method has ASSERT(isMainThread()) and worker access is routed through WorkerMessagePortChannelProvider via callOnMainThread. Bun stripped that proxy and commented the assertions out, so worker threads were mutating m_openChannels (an unguarded HashMap) and per-channel m_pendingMessages Vectors concurrently with the main thread. The new stress test reproduces this as a bmalloc bitfit / pas_segregated_page double-free on every run. Bun cannot adopt WebKit's thread-hop model directly because Node's synchronous receiveMessageOnPort() needs to read the queue from any thread without bouncing through the main loop, so instead the registry now owns a WTF::Lock and every entry point takes it. The channel RefPtr is hoisted outside the locked scope so the destructor (which re-enters the registry via messagePortChannelDestroyed) cannot deadlock, takeAllMessagesForPort returns the message vector and the callback is invoked after the lock is released (it re-enters the registry via entanglePorts), and MessagePortChannel becomes ThreadSafeRefCountedAndCanMakeThreadSafeWeakPtr so RefPtr/WeakPtr operations are atomic across threads. The dead m_messageBatchesInFlight tracking and the unused existingChannelContainingPort/hasAnyMessagesPendingOrInFlight declarations are removed along the way. Performance::timing()'s commented isMainThread() guard was a different case: in WebKit it gates a Window-only API, but in Bun each context has its own Performance instance with its own m_timing, so the guard is simply inapplicable and is deleted rather than restored.
|
Updated 5:06 PM PT - Apr 18th, 2026
❌ @dylan-conway, your commit de0fd56 has 2 failures in
🧪 To try this PR locally: bunx bun-pr 29442That installs a local version of the PR into your bun-29442 --bun |
|
Found 3 issues this PR may fix:
🤖 Generated with Claude Code |
|
This PR may be a duplicate of:
🤖 Generated with Claude Code |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughMessagePortChannel and its registry were made thread-safe (thread-safe refcounting, thread-safe weak pointers, and a Lock). The message-draining API changed from an async CompletionHandler to a synchronous return. Message-in-flight telemetry/counters were removed. A concurrent worker stress test and minor provider/header cleanups were added. Changes
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/web/workers/message-channel-concurrent-fixture.js`:
- Around line 32-61: Add a subprocess watchdog that enforces a timeout when
running in the parent (isMainThread) so the fixture fails fast if a worker never
posts back: create a timer (e.g., watchdog = setTimeout(..., TIMEOUT_MS))
started before spawning/aggregating workers, have the timeout handler log an
error, terminate all Worker instances in the workers array, and call
process.exit(1); ensure the timer is cleared (clearTimeout(watchdog)) inside
finish() so successful completion cancels the watchdog; reference the existing
isMainThread block, WORKERS loop, workers array, finish() function, and worker
termination logic when implementing this.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a210a353-36ad-49f8-8630-5ff5976ce112
📒 Files selected for processing (9)
src/bun.js/bindings/webcore/MessagePortChannel.cppsrc/bun.js/bindings/webcore/MessagePortChannel.hsrc/bun.js/bindings/webcore/MessagePortChannelProvider.cppsrc/bun.js/bindings/webcore/MessagePortChannelProviderImpl.cppsrc/bun.js/bindings/webcore/MessagePortChannelRegistry.cppsrc/bun.js/bindings/webcore/MessagePortChannelRegistry.hsrc/bun.js/bindings/webcore/Performance.cpptest/js/web/workers/message-channel-concurrent-fixture.jstest/js/web/workers/message-channel-concurrent.test.ts
💤 Files with no reviewable changes (2)
- src/bun.js/bindings/webcore/Performance.cpp
- src/bun.js/bindings/webcore/MessagePortChannelProvider.cpp
There was a problem hiding this comment.
This is a well-motivated and correctly structured threading fix, but the scope (ref-counting model change, lock protocol, callback → return-value refactor) warrants a human eye before merging — especially given the pre-existing protector-leak found in tryTakeMessageForPort.
Extended reasoning...
Overview
This PR replaces the previously unguarded (comment-only) main-thread assertions in MessagePortChannelRegistry with a real WTF::Lock, upgrades MessagePortChannel to ThreadSafeRefCountedAndCanMakeThreadSafeWeakPtr, changes m_openChannels to use ThreadSafeWeakPtr, and converts takeAllMessagesForPort from a callback to a return-value interface. It also removes dead code (m_messageBatchesInFlight, existingChannelContainingPort, hasAnyMessagesPendingOrInFlight) and adds a concurrent stress test.
Security risks
No authentication, crypto, or permissions code is involved. The change affects object lifetime and heap safety — the pre-existing issue in tryTakeMessageForPort not clearing m_pendingMessageProtectors[i] after draining the last message produces a temporary self-referential retain cycle. This is a memory-safety concern (delayed deallocation window in long-lived workers) rather than a security vulnerability.
Level of scrutiny
This merits careful human review. The change is not mechanical: it alters the ref-counting model for a class that interacts with the GC/finalizer path, introduces a lock protocol that must be consistently followed across all future callers, and changes the callback contract for takeAllMessagesForPort. Lock discipline bugs in C++ (use-after-free, deadlock) can be hard to catch in code review and catastrophic at runtime.
Other factors
The stress test is thorough and the PR description accurately explains the design trade-offs. The bug found by the hunter (tryTakeMessageForPort leaving a stale self-ref protector) is pre-existing and not introduced here, but the PR makes that path more prominent via receiveMessageOnPort. A trivial one-liner fix in the same file would eliminate the asymmetry cleanly. CI results are also still pending.
…; harden stress fixture tryTakeMessageForPort (the receiveMessageOnPort path) never cleared m_pendingMessageProtectors[i] after draining the queue, leaving a self-reference that delayed channel destruction until close(). Clear it when the queue empties, matching takeAllMessagesForPort. The stress fixture is reworked to wait on each worker's exit event and let the main loop drain naturally instead of calling worker.terminate() then process.exit() — the abrupt teardown was racing a separate, pre-existing postTaskConcurrently-after-terminate check on x64-asan and obscuring the result. A 30s subprocess watchdog and a 60s outer timeout are added so a deadlock regression fails fast with a useful message instead of hanging.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/web/workers/message-channel-concurrent-fixture.js`:
- Around line 45-74: The test only checks total !== 0 so a failing worker can be
hidden; compute expectedPerThread = ITERATIONS * CHANNELS_PER_ITERATION * 2,
track each worker's reported count (e.g. collect messages in a workersCounts
array or object from the worker.on("message", n => ...) handler keyed by worker
index/ID), keep the main-thread hammer() result separate, and in finish() verify
the main-thread result equals expectedPerThread and every worker's reported
count equals expectedPerThread (or at least that every worker reported >0)
before printing PASS; if any worker is missing or mismatched, fail the test with
process.exit(1).
In `@test/js/web/workers/message-channel-concurrent.test.ts`:
- Around line 18-20: The test currently asserts stderr is empty directly (using
the variables stderr and expect), but on ASAN builders Bun subprocesses can emit
a known startup warning; update the assertion to filter out any stderr lines
that start with "WARNING: ASAN interferes" (split stderr into lines, remove
lines matching that prefix) and then assert the filtered stderr is empty; apply
this change where stdout, stderr, exitCode are awaited and before the
expect(stderr).toBe("") assertion so the test uses the repo's filtered-stderr
convention.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a7e33a01-fb55-464d-9d48-7bc9b46943ea
📒 Files selected for processing (3)
src/bun.js/bindings/webcore/MessagePortChannel.cpptest/js/web/workers/message-channel-concurrent-fixture.jstest/js/web/workers/message-channel-concurrent.test.ts
…rning The fixture now verifies each hammer() returns exactly ITERATIONS * CHANNELS_PER_ITERATION * 2 messages and that every worker reports before printing PASS, so a worker-only regression cannot be hidden by the main-thread contribution. The test wrapper filters the known 'WARNING: ASAN interferes' startup line before asserting empty stderr, matching the convention in other subprocess tests.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/web/workers/message-channel-concurrent-fixture.js`:
- Around line 9-12: The four constants WORKERS, ITERATIONS,
CHANNELS_PER_ITERATION, and EXPECTED_PER_HAMMER use SCREAMING_SNAKE_CASE but the
repo requires camelCase for JS/TS; rename them to camelCase (e.g., workers,
iterations, channelsPerIteration, expectedPerHammer) and update all references
in this file (and other occurrences noted at lines referenced in the review) so
the variable names remain consistent and tests still compute EXPECTED_PER_HAMMER
as iterations * channelsPerIteration * 2.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f0e59577-466e-4c06-9afb-7da8d54fd0bb
📒 Files selected for processing (2)
test/js/web/workers/message-channel-concurrent-fixture.jstest/js/web/workers/message-channel-concurrent.test.ts
CI's runner passes --timeout=90000 (scripts/runner.node.mjs testTimeout/2), so the fixture's own 30s watchdog already fires well within the per-test ceiling and the explicit 60s override is redundant.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/web/workers/message-channel-concurrent.test.ts`:
- Around line 18-24: Remove the brittle empty-stderr gating and instead rely on
the subprocess output and exit code as the regression signal: drop the
filteredStderr computation and its expect assertion, and change the assertions
to verify that stdout (the variable `stdout` resolved from `proc.stdout.text()`)
contains the expected "PASS" message and that `exitCode` (from `proc.exited`) is
0; update the Promise.all usage taking `stdout` and `exitCode` into account and
remove references to `stderr`/`filteredStderr` (variables `stderr` and
`filteredStderr`) from the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2a1ec395-1d87-4614-b019-b9e2aec4574c
📒 Files selected for processing (1)
test/js/web/workers/message-channel-concurrent.test.ts
The fixture now exports hammer() and self-executes when loaded as a Worker; the test spawns the workers directly and asserts each returns the expected count. On a regression the test process itself crashes, which the runner reports as a failure — no need for the subprocess wrapper.
… bug
receiveMessageOnPort's JS wrapper uses if (!res) so a message value of 0
is treated as no-message, which short-circuited the receive loop on the
first channel (j === 0) of every iteration and made the count come up 400
short. Post {j} instead so the payload is always truthy; the wrapper bug
is pre-existing and tracked separately.
There was a problem hiding this comment.
LGTM — both previously flagged issues have been addressed: the protector leak is fixed (commit 6d5e1ee) and the spurious test timeout was removed (commit 36248eb). The locking approach is sound.
Extended reasoning...
Overview
This PR replaces the commented-out ASSERT(isMainThread()) guards in MessagePortChannelRegistry with a WTF::Lock, and upgrades MessagePortChannel to ThreadSafeRefCountedAndCanMakeThreadSafeWeakPtr. It touches 7 C++ source files and 2 new test files.
Security risks
No security-sensitive code paths (auth, crypto, permissions) are involved. The change narrows a race condition; it does not introduce new attack surface.
Level of scrutiny
This is a concurrency fix in production-critical runtime code, but the approach is straightforward: a single coarse-grained lock around an otherwise-unguarded HashMap. The key correctness details — hoisting RefPtr outside lock scope before the destructor can re-enter via messagePortChannelDestroyed, calling the takeAllMessagesForPort callback after releasing the lock to avoid deadlock on re-entrant registry access — are clearly handled. Both previously raised concerns (protector leak in tryTakeMessageForPort and test timeout convention violation) have been resolved.
Other factors
No outstanding reviewer comments remain unaddressed. The stress test reliably reproduces the pre-fix heap corruption and validates the fix. CI build failures visible in the timeline are on the earlier commit and appear to be unrelated zig build issues, not failures in the C++ changes under review.
…primitive (#29937) ## What Replaces the WebKit-derived `MessagePortChannelProvider` / `MessagePortChannelProviderImpl` / `MessagePortChannelRegistry` / `MessagePortChannel` stack — plus the `BroadcastChannel::MainThreadBridge` / per-context `BunBroadcastChannelRegistry` — with a small, thread-safe concurrency primitive (`MessagePortPipe`) that the Web-facing classes wrap thinly. Net −1082 lines (+737 / −1819), 10 files deleted. ## Why The existing stack is Safari's multi-*process* design carried over verbatim: an abstract provider so UIProcess can swap in an IPC impl, a process-global `MessagePortIdentifier → MessagePortChannel` HashMap, `ProcessIdentifier` tracking, self-`RefPtr` "protector" members to keep channels alive across IPC hops, and a main-thread serialization point for `BroadcastChannel`. None of that indirection is needed in Bun's single-process multi-thread model — and worse, the `MessagePortChannelRegistry::m_openChannels` HashMap and `MessagePortChannel::m_pendingMessages` Vector are mutated from worker threads with **no lock** (the WebKit `ASSERT(isMainThread())` guards were just commented out). Concurrent `new MessageChannel()` across workers corrupts the HashMap: ``` HashTable.h:1574:17: runtime error: member access within null pointer of type 'const HashTable<MessagePortIdentifier, KeyValuePair<MessagePortIdentifier, WeakRef<MessagePortChannel>>, ...>' ``` This is the cause of #16186, #22471, #25805 and likely #29458. ## Design **`MessagePortPipe`** — `ThreadSafeRefCounted`, two sides. Each side has: - a `Lock` + `Deque<MessageWithMessagePorts>` inbox - one `std::atomic<uint64_t>` state word: `Closed | DrainScheduled | Attached` in the low byte, queued-count in the high bits `send()` locks the destination side, enqueues, and if `Attached && !DrainScheduled` flips `DrainScheduled` and posts one task to the receiver's `ScriptExecutionContext`. A burst of N sends schedules at most one cross-thread wakeup. `takeAll()` clears `DrainScheduled` before swapping the deque, so a racing send reschedules — at most one extra no-op drain, never a stranded message. `attach()`/`detach()` handle transfer: a detached side buffers; attach flushes the backlog with one wakeup. `hasPendingActivity` is two atomic loads (my queued count + peer's `Closed`), safe from the GC thread with no lock. **`MessagePort`** — holds `RefPtr<MessagePortPipe>` + `uint8_t side`. `postMessage` → `pipe->send`. `start()` → `pipe->attach`. `close()` → `pipe->close`. Transfer → `pipe->detach`, hand `{pipe, side}` through `TransferredMessagePort`, re-create on the other side. No global maps, no identifiers, no per-context port iteration. The drain task takes the whole inbox in one go (sender-side batching), then posts each message as its own local task so microtasks checkpoint between deliveries — matching the HTML "port message queue is a task source" semantics and Node's observable ordering. **`BroadcastChannel`** — one process-global `Lock` + `HashMap<String, Vector<{ctxId, weak channel}>>`. `postMessage` snapshots subscribers under the lock, releases, then posts one task per `(message, subscriber)` directly to each subscriber's context — no bounce through the main thread, no `MainThreadBridge`, no second `allBroadcastChannels` map. Per-channel inbox batching was prototyped and reverted: the spec's same-event-loop `(message-major, creation-minor)` delivery-order test (WPT `broadcast-channel.test.ts::"messages are delivered in port creation order"`) fails if subscribers drain channel-major. A per-`(context, name)` inbox could restore batching without breaking order — left as future work. ## Deleted `MessagePortChannel.{h,cpp}`, `MessagePortChannelProvider.{h,cpp}`, `MessagePortChannelProviderImpl.{h,cpp}`, `MessagePortChannelRegistry.{h,cpp}`, `MessagePortIdentifier.h`, `BroadcastChannelRegistry.h`, the `allMessagePorts()` / `portToContextIdentifier()` / `allBroadcastChannels()` / `channelToContextIdentifier()` global maps, `ScriptExecutionContext::{m_messagePorts, processMessageWithMessagePortsSoon, dispatchMessagePortEvents, createdMessagePort, destroyedMessagePort, m_broadcastChannelRegistry}`, and `WorkerGlobalScope::messagePortChannelProvider()`. ## Verification - `test/js/web/workers/message-channel.test.ts` — 12/12 - `test/js/web/broadcastchannel/broadcast-channel.test.ts` — 11/11 - `test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts` — 3/3 - `test/js/web/workers/` suite — 228 pass / 6 fail; all 6 fail identically on `main` (container-slowness timeouts), zero new failures - `test/js/node/worker_threads/worker_threads.test.ts` — 25 pass / 2 fail; both fail identically on `main` - 10 node `parallel/test-worker-message-port-*.js` + `test-broadcastchannel-*.js` — all pass - New `test/js/web/workers/message-port-pipe.test.ts`: - `concurrent MessageChannel creation across workers is race-free` — **fails 3/3 on main** with the HashTable UBSan null-deref shown above; **passes 5/5** on this branch - microtask ordering, buffered-before-start, FIFO `receiveMessageOnPort`, 1000-message cross-thread burst ## Relationship to other PRs #29832 / #29442 patch the race by adding a lock to the existing registry. This PR removes the registry instead — there's nothing left to race on. Either approach fixes the crash; this one also sheds the layering. Fixes #16186 Fixes #22471 Fixes #25805 --------- Co-authored-by: robobun <robobun@users.noreply.github.com> Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
…primitive (oven-sh#29937) ## What Replaces the WebKit-derived `MessagePortChannelProvider` / `MessagePortChannelProviderImpl` / `MessagePortChannelRegistry` / `MessagePortChannel` stack — plus the `BroadcastChannel::MainThreadBridge` / per-context `BunBroadcastChannelRegistry` — with a small, thread-safe concurrency primitive (`MessagePortPipe`) that the Web-facing classes wrap thinly. Net −1082 lines (+737 / −1819), 10 files deleted. ## Why The existing stack is Safari's multi-*process* design carried over verbatim: an abstract provider so UIProcess can swap in an IPC impl, a process-global `MessagePortIdentifier → MessagePortChannel` HashMap, `ProcessIdentifier` tracking, self-`RefPtr` "protector" members to keep channels alive across IPC hops, and a main-thread serialization point for `BroadcastChannel`. None of that indirection is needed in Bun's single-process multi-thread model — and worse, the `MessagePortChannelRegistry::m_openChannels` HashMap and `MessagePortChannel::m_pendingMessages` Vector are mutated from worker threads with **no lock** (the WebKit `ASSERT(isMainThread())` guards were just commented out). Concurrent `new MessageChannel()` across workers corrupts the HashMap: ``` HashTable.h:1574:17: runtime error: member access within null pointer of type 'const HashTable<MessagePortIdentifier, KeyValuePair<MessagePortIdentifier, WeakRef<MessagePortChannel>>, ...>' ``` This is the cause of oven-sh#16186, oven-sh#22471, oven-sh#25805 and likely oven-sh#29458. ## Design **`MessagePortPipe`** — `ThreadSafeRefCounted`, two sides. Each side has: - a `Lock` + `Deque<MessageWithMessagePorts>` inbox - one `std::atomic<uint64_t>` state word: `Closed | DrainScheduled | Attached` in the low byte, queued-count in the high bits `send()` locks the destination side, enqueues, and if `Attached && !DrainScheduled` flips `DrainScheduled` and posts one task to the receiver's `ScriptExecutionContext`. A burst of N sends schedules at most one cross-thread wakeup. `takeAll()` clears `DrainScheduled` before swapping the deque, so a racing send reschedules — at most one extra no-op drain, never a stranded message. `attach()`/`detach()` handle transfer: a detached side buffers; attach flushes the backlog with one wakeup. `hasPendingActivity` is two atomic loads (my queued count + peer's `Closed`), safe from the GC thread with no lock. **`MessagePort`** — holds `RefPtr<MessagePortPipe>` + `uint8_t side`. `postMessage` → `pipe->send`. `start()` → `pipe->attach`. `close()` → `pipe->close`. Transfer → `pipe->detach`, hand `{pipe, side}` through `TransferredMessagePort`, re-create on the other side. No global maps, no identifiers, no per-context port iteration. The drain task takes the whole inbox in one go (sender-side batching), then posts each message as its own local task so microtasks checkpoint between deliveries — matching the HTML "port message queue is a task source" semantics and Node's observable ordering. **`BroadcastChannel`** — one process-global `Lock` + `HashMap<String, Vector<{ctxId, weak channel}>>`. `postMessage` snapshots subscribers under the lock, releases, then posts one task per `(message, subscriber)` directly to each subscriber's context — no bounce through the main thread, no `MainThreadBridge`, no second `allBroadcastChannels` map. Per-channel inbox batching was prototyped and reverted: the spec's same-event-loop `(message-major, creation-minor)` delivery-order test (WPT `broadcast-channel.test.ts::"messages are delivered in port creation order"`) fails if subscribers drain channel-major. A per-`(context, name)` inbox could restore batching without breaking order — left as future work. ## Deleted `MessagePortChannel.{h,cpp}`, `MessagePortChannelProvider.{h,cpp}`, `MessagePortChannelProviderImpl.{h,cpp}`, `MessagePortChannelRegistry.{h,cpp}`, `MessagePortIdentifier.h`, `BroadcastChannelRegistry.h`, the `allMessagePorts()` / `portToContextIdentifier()` / `allBroadcastChannels()` / `channelToContextIdentifier()` global maps, `ScriptExecutionContext::{m_messagePorts, processMessageWithMessagePortsSoon, dispatchMessagePortEvents, createdMessagePort, destroyedMessagePort, m_broadcastChannelRegistry}`, and `WorkerGlobalScope::messagePortChannelProvider()`. ## Verification - `test/js/web/workers/message-channel.test.ts` — 12/12 - `test/js/web/broadcastchannel/broadcast-channel.test.ts` — 11/11 - `test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts` — 3/3 - `test/js/web/workers/` suite — 228 pass / 6 fail; all 6 fail identically on `main` (container-slowness timeouts), zero new failures - `test/js/node/worker_threads/worker_threads.test.ts` — 25 pass / 2 fail; both fail identically on `main` - 10 node `parallel/test-worker-message-port-*.js` + `test-broadcastchannel-*.js` — all pass - New `test/js/web/workers/message-port-pipe.test.ts`: - `concurrent MessageChannel creation across workers is race-free` — **fails 3/3 on main** with the HashTable UBSan null-deref shown above; **passes 5/5** on this branch - microtask ordering, buffered-before-start, FIFO `receiveMessageOnPort`, 1000-message cross-thread burst ## Relationship to other PRs oven-sh#29832 / oven-sh#29442 patch the race by adding a lock to the existing registry. This PR removes the registry instead — there's nothing left to race on. Either approach fixes the crash; this one also sheds the layering. Fixes oven-sh#16186 Fixes oven-sh#22471 Fixes oven-sh#25805 --------- Co-authored-by: robobun <robobun@users.noreply.github.com> Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
|
Superseded by #29937, which removed The regression coverage here is preserved: Closing as superseded. |
What
Replaces the commented-out
ASSERT(isMainThread())guards inMessagePortChannelRegistrywith a realWTF::Lock, and makesMessagePortChannelthread-safe-refcounted so the registry can be safely used from worker threads.Fixes #29458
Fixes #25805
Fixes #22471
Fixes #16186
Why
WebKit's
MessagePortChannelRegistryis main-thread-only by design — every method assertsisMainThread()and worker access is routed throughWorkerMessagePortChannelProvider→callOnMainThread. Bun stripped the proxy and commented the assertions out (with a// we totally are calling these off the main thread in many cases in Bun, so ........note), so worker threads were directly mutatingm_openChannels(an unguardedHashMap) and per-channelm_pendingMessagesVectors concurrently with the main thread.The new stress test reproduces this as bmalloc heap corruption (
pas panic: deallocation did fail/bitfit allocation error) on every run against current main.The linked issues are all the same underlying race surfacing at different points in the call path:
SerializedScriptValue::deserialize←tryTakeMessage←receiveMessageOnPort0x40m_pendingMessages[i].first()returned a torn slot with a nullRefPtr<SerializedScriptValue>while another thread reallocated the vectorpostMessage0x8m_openChannelsrehash + per-channel state mutationMessagePort::postMessage0x18m_openChannels.get()/ channel state underworkers_spawnedMessagePort::postMessage0x30Why a lock instead of WebKit's thread-hop
WebKit bounces every registry call through
callOnMainThread. Bun cannot do that directly because Node's synchronousreceiveMessageOnPort()(tryTakeMessageForPort) needs to read the queue from the calling thread without round-tripping through the main loop. A lock gives the same single-writer guarantee WebKit's main-thread invariant provides, while keepingreceiveMessageOnPortsynchronous.Details
MessagePortChannelRegistrygainsLock m_lock;m_openChannelsbecomesHashMap<…, ThreadSafeWeakPtr<MessagePortChannel>>and isWTF_GUARDED_BY_LOCK.RefPtr<MessagePortChannel>is hoisted outside the locked scope so the channel destructor (which re-enters viamessagePortChannelDestroyed) cannot deadlock.MessagePortChannelis nowThreadSafeRefCountedAndCanMakeThreadSafeWeakPtrso ref/weak ops are atomic across threads.MessagePortChannel::takeAllMessagesForPortnow returns the message vector instead of invoking the callback directly; the registry invokes the callback after releasing the lock (the callback re-enters the registry viaMessagePort::entanglePorts).MessagePortChannel::tryTakeMessageForPortnow clearsm_pendingMessageProtectors[i]when it drains the queue (pre-existing self-ref leak on thereceiveMessageOnPortpath).m_messageBatchesInFlight/hasAnyMessagesPendingOrInFlight/existingChannelContainingPort(declared but never used in Bun).Performance::timing()'s commentedisMainThread()guard was a Window-only check in WebKit; in Bun each context has its ownPerformance/m_timing, so it's deleted rather than restored.Testing
New
test/js/web/workers/message-channel-concurrent.test.tsspawns 6 workers + main, each creating/transferring/posting/closing 8 channels × 400 iterations in a tight loop with no yields. Against current main this corrupts the heap on every run.