Replace MessagePort/BroadcastChannel registries with MessagePortPipe primitive#29937
Conversation
The WebKit-derived MessagePortChannelProvider/Registry/Channel stack was designed for Safari's multi-process model: an abstract provider, a process-global identifier→channel HashMap, per-channel self-RefPtr protectors, and a main-thread serialization point for BroadcastChannel. Bun is single-process multi-thread, so none of the indirection buys anything — and worse, the registry's HashMap and the channel's pending Vector were accessed from worker threads with the isMainThread() asserts simply commented out, so concurrent MessageChannel churn corrupts the HashMap (observed as #16186/#22471/#25805). Split into two layers: MessagePortPipe is the concurrency primitive: ThreadSafeRefCounted, two sides, each with a lock-guarded Deque inbox and one atomic<uint64_t> packing Closed|DrainScheduled|Attached in the low byte and the queued count in the high bits. Send locks the destination, enqueues, and schedules at most one drain task per burst (the first push flips DrainScheduled; subsequent pushes see it set and skip). Drain clears the flag before swapping the deque so a racing send reschedules — at most one extra no-op wakeup, never a stranded message. attach/detach handle transfer by buffering while no context is attached and flushing on re-attach. hasPendingActivity is two atomic loads, no lock, no registry. MessagePort is now a thin EventTarget wrapper: RefPtr<Pipe> + side index. TransferredMessagePort carries {pipe, side} instead of identifier pairs. There is no global port map, no provider, no registry, no per-context m_messagePorts iteration on dispatch. BroadcastChannel drops MainThreadBridge and the per-context LazyRef registry in favor of one process-global lock-guarded name→subscribers map. post() snapshots under the lock and posts directly to each subscriber's context — no main-thread bounce. Per-channel inbox batching was considered and rejected: the spec requires same-event-loop subscribers to observe (message-major, creation-minor) order, which channel-major draining breaks. Net −1082 lines; the race in the registry is gone by construction (no registry).
|
Updated 9:33 PM PT - Apr 29th, 2026
❌ @autofix-ci[bot], your commit 778c684 has 1 failures in
🧪 To try this PR locally: bunx bun-pr 29937That installs a local version of the PR into your bun-29937 --bun |
|
Found 6 issues this PR may fix:
🤖 Generated with Claude Code |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReplaces identifier/channel-based message-port and broadcast-channel plumbing with a two-sided MessagePortPipe and a process-global BunBroadcastChannelRegistry; deletes legacy MessagePortChannel/provider/registry/identifier artifacts and removes message-port lifecycle APIs from ScriptExecutionContext and WorkerGlobalScope. Changes
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/bun.js/bindings/webcore/BroadcastChannel.cpp`:
- Around line 83-85: BroadcastChannel::contextDestroyed currently calls close()
but fails to invoke the base class implementation, leaving
ContextDestructionObserver's cached context pointer stale; update
BroadcastChannel::contextDestroyed to call the base class (e.g., the parent
class's contextDestroyed or ContextDestructionObserver::contextDestroyed) after
or before close() so the observer clears its cached context pointer and prevents
scriptExecutionContext() from returning a dangling pointer. Ensure you reference
BroadcastChannel::contextDestroyed and the ContextDestructionObserver base
method in your change.
- Around line 93-98: BroadcastChannel::hasPendingActivity currently reads the
non-atomic m_hasRelevantEventListener without synchronization while
eventListenersDidChange can write it on the context thread, causing UB; fix by
making the listener flag atomic or folding it into the existing atomic m_state
so hasPendingActivity only performs atomic reads. Update the writer in
eventListenersDidChange to store to the chosen atomic (or update m_state's
bit(s) consistently) and adjust any readers (e.g., hasPendingActivity and other
checks) to read the same atomic location with appropriate memory_order (e.g.,
acquire for readers, release for writers) to avoid races.
- Around line 17-25: The constructor BroadcastChannel::BroadcastChannel calls
jsRef() but contextDestroyed() currently calls close() without pairing
jsUnref(), leaking the VM ref; modify BroadcastChannel::contextDestroyed() to
call jsUnref(nullptr) before calling close() so the jsRef/jsUnref pair is
balanced (ensure you update the method that currently only calls close() in the
BroadcastChannel class that interacts with BunBroadcastChannelRegistry and the
existing close() implementation).
In `@src/bun.js/bindings/webcore/BunBroadcastChannelRegistry.cpp`:
- Around line 59-63: The code unconditionally increments
BroadcastChannel::QueuedOne on channel->m_state before calling
ScriptExecutionContext::postTaskTo(sub.ctxId, ...) so if postTaskTo returns
false the queued count remains inflated; fix by only transferring ownership
(channel.releaseNonNull()) and/or incrementing the queued counter when
postTaskTo succeeds, or if you must increment before the call, call
channel->m_state.fetch_sub(BroadcastChannel::QueuedOne,
std::memory_order_acq_rel) when postTaskTo returns false and do not release the
channel in that failure path; ensure the lambda capture still gets a valid
released channel pointer only on success and that memory_order matches the
original fetch_add/fetch_sub uses.
In `@src/bun.js/bindings/webcore/MessagePort.cpp`:
- Around line 92-109: disentangle() currently detaches the pipe and drops the
context observer but does not perform the listener cleanup that close() does, so
any listener-derived refs (via onDidChangeListenerImpl() / refEventLoop()) can
keep the old event loop alive after transfer; fix by invoking the same
listener-cleanup path used by close() (e.g. clear/reset any listener state or
call the method that removes message listeners / clears onDidChangeListenerImpl
refs) from MessagePort::disentangle() — do this before observeContext(nullptr)
and before returning the TransferredMessagePort so that refEventLoop() refs are
released when a port is transferred.
- Around line 132-142: The current code eagerly drains the pipe via
m_pipe->takeAll(m_side) and captures messages into delayed postTask() lambdas,
which causes messages to be lost if disentangle() sets m_isDetached before those
tasks run; change the approach in MessagePort.cpp so you do not dequeue the
entire inbox up-front—either pop and take a single message inside each posted
task (call a takeOne/pop method per task) or leave messages in the pipe and on
detach return/restore pending messages (use the existing disentangle and
m_isDetached checks to hand messages back to the pipe). Specifically, replace
the m_pipe->takeAll(m_side) usage and the loop that captures messages with logic
that only removes a message at the moment its task executes (or implements
returning messages on detach), ensuring dispatchOneMessage still receives the
moved message only after the pipe removal succeeds.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f8271d9d-a8e7-4e05-a12e-061e60d2a547
📒 Files selected for processing (26)
src/bun.js/bindings/BunWorkerGlobalScope.cppsrc/bun.js/bindings/BunWorkerGlobalScope.hsrc/bun.js/bindings/ScriptExecutionContext.cppsrc/bun.js/bindings/ScriptExecutionContext.hsrc/bun.js/bindings/ZigGlobalObject.cppsrc/bun.js/bindings/webcore/BroadcastChannel.cppsrc/bun.js/bindings/webcore/BroadcastChannel.hsrc/bun.js/bindings/webcore/BroadcastChannelRegistry.hsrc/bun.js/bindings/webcore/BunBroadcastChannelRegistry.cppsrc/bun.js/bindings/webcore/BunBroadcastChannelRegistry.hsrc/bun.js/bindings/webcore/MessageChannel.cppsrc/bun.js/bindings/webcore/MessagePort.cppsrc/bun.js/bindings/webcore/MessagePort.hsrc/bun.js/bindings/webcore/MessagePortChannel.cppsrc/bun.js/bindings/webcore/MessagePortChannel.hsrc/bun.js/bindings/webcore/MessagePortChannelProvider.cppsrc/bun.js/bindings/webcore/MessagePortChannelProvider.hsrc/bun.js/bindings/webcore/MessagePortChannelProviderImpl.cppsrc/bun.js/bindings/webcore/MessagePortChannelProviderImpl.hsrc/bun.js/bindings/webcore/MessagePortChannelRegistry.cppsrc/bun.js/bindings/webcore/MessagePortChannelRegistry.hsrc/bun.js/bindings/webcore/MessagePortIdentifier.hsrc/bun.js/bindings/webcore/MessagePortPipe.cppsrc/bun.js/bindings/webcore/MessagePortPipe.hsrc/bun.js/bindings/webcore/TransferredMessagePort.htest/js/web/workers/message-port-pipe.test.ts
💤 Files with no reviewable changes (14)
- src/bun.js/bindings/BunWorkerGlobalScope.h
- src/bun.js/bindings/webcore/MessagePortIdentifier.h
- src/bun.js/bindings/BunWorkerGlobalScope.cpp
- src/bun.js/bindings/webcore/MessagePortChannelRegistry.cpp
- src/bun.js/bindings/webcore/MessagePortChannelProvider.h
- src/bun.js/bindings/webcore/BroadcastChannelRegistry.h
- src/bun.js/bindings/webcore/MessagePortChannelProviderImpl.h
- src/bun.js/bindings/webcore/MessagePortChannel.cpp
- src/bun.js/bindings/webcore/MessagePortChannelRegistry.h
- src/bun.js/bindings/ScriptExecutionContext.h
- src/bun.js/bindings/webcore/MessagePortChannel.h
- src/bun.js/bindings/ScriptExecutionContext.cpp
- src/bun.js/bindings/webcore/MessagePortChannelProviderImpl.cpp
- src/bun.js/bindings/webcore/MessagePortChannelProvider.cpp
The race being exercised (unlocked HashMap in MessagePortChannelRegistry) is a memory-safety bug that surfaces deterministically only under sanitizers. Release lanes skip it.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/web/workers/message-port-pipe.test.ts`:
- Around line 29-44: The test title implies an explicit start() but currently
relies on implicit starting via setting port2.onmessage; update the test to call
port2.start() explicitly (or rename the test to mention implicit start) so
intent is clear—specifically modify the test "messages buffered before start()
are delivered on start()" to invoke port2.start() before awaiting the Promise
(while keeping port1.postMessage calls and the port2.onmessage handler the same)
so the behavior and test name match.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 899b870d-e9a0-440b-8707-70f469c7fc96
📒 Files selected for processing (1)
test/js/web/workers/message-port-pipe.test.ts
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (5)
src/bun.js/bindings/webcore/BroadcastChannel.cpp (2)
108-111:⚠️ Potential issue | 🟠 MajorFinish
contextDestroyed()teardown.Line 49 takes a VM ref, but Lines 108-111 only call
close(). That leaves the constructor’sjsRef()unbalanced on internal teardown and never clearsContextDestructionObserver’s cached context pointer.Suggested fix
void BroadcastChannel::contextDestroyed() { + if (m_hasRef) { + if (auto* context = scriptExecutionContext()) { + if (auto* globalObject = context->jsGlobalObject()) + jsUnref(globalObject); + } + } close(); + ContextDestructionObserver::contextDestroyed(); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bun.js/bindings/webcore/BroadcastChannel.cpp` around lines 108 - 111, contextDestroyed() currently only calls close(), leaving the VM reference acquired in the constructor via jsRef() and ContextDestructionObserver's cached context pointer intact; update BroadcastChannel::contextDestroyed() to release the stored VM/JS reference (mirror the constructor's jsRef() acquisition) and clear the ContextDestructionObserver's cached context pointer so the ref is unbalanced and the cached context is nulled—i.e., after or as part of close() call explicitly release/clear the jsRef() held by BroadcastChannel and reset the observer's cached context pointer (use the same jsRef()/observer accessor names used in the constructor) so teardown fully reverses construction.
118-123:⚠️ Potential issue | 🟠 Major
hasPendingActivity()still races on listener state.This path is lock-free now, but it still reads
m_hasRelevantEventListeneras a plainboolwhileeventListenersDidChange()writes it on the context thread. That is UB in C++ and can misreport pending activity.Suggested fix
--- a/src/bun.js/bindings/webcore/BroadcastChannel.h +++ b/src/bun.js/bindings/webcore/BroadcastChannel.h @@ - bool m_hasRelevantEventListener { false }; + std::atomic<bool> m_hasRelevantEventListener { false };void BroadcastChannel::eventListenersDidChange() { - m_hasRelevantEventListener = hasEventListeners(eventNames().messageEvent); + m_hasRelevantEventListener.store(hasEventListeners(eventNames().messageEvent), std::memory_order_release); } bool BroadcastChannel::hasPendingActivity() const { uint64_t s = m_state.load(std::memory_order_acquire); if (s & Closed) return false; - return m_hasRelevantEventListener || (s >> QueuedShift) > 0; + return m_hasRelevantEventListener.load(std::memory_order_acquire) || (s >> QueuedShift) > 0; }#!/bin/bash set -euo pipefail rg -n -C2 '\bm_hasRelevantEventListener\b|hasPendingActivity\(|eventListenersDidChange\(' \ src/bun.js/bindings/webcore/BroadcastChannel.h \ src/bun.js/bindings/webcore/BroadcastChannel.cppExpected result:
m_hasRelevantEventListeneris declared as a plainbool, written ineventListenersDidChange(), and read inhasPendingActivity().🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bun.js/bindings/webcore/BroadcastChannel.cpp` around lines 118 - 123, hasPendingActivity() reads m_hasRelevantEventListener without synchronization which is UB because eventListenersDidChange() writes it on another thread; change m_hasRelevantEventListener to std::atomic<bool> (or otherwise access it under the same lock used for other state) and update eventListenersDidChange() to store with appropriate ordering (e.g. memory_order_release) and hasPendingActivity() to load with memory_order_acquire (or use the same mutex) so the listener flag is read/written atomically and the lock-free path is safe; update declarations and both usages of m_hasRelevantEventListener accordingly.test/js/web/workers/message-port-pipe.test.ts (1)
29-40: 🧹 Nitpick | 🔵 TrivialMake this test explicit about
start().The body currently relies on implicit start via
onmessage, so it is not actually exercising the behavior named in the test title. Either callport2.start()or rename the test to describe implicit start.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/js/web/workers/message-port-pipe.test.ts` around lines 29 - 40, The test "messages buffered before start() are delivered on start()" relies on implicit start via assigning port2.onmessage; explicitly call port2.start() to exercise the behavior named in the test (or alternatively rename the test to indicate implicit start). Modify the test around the MessageChannel usage (symbols: port1, port2) to call port2.start() after setting up port2.onmessage (or before posting if you prefer) so the test actually verifies delivery on explicit start().src/bun.js/bindings/webcore/MessagePort.cpp (2)
118-134:⚠️ Potential issue | 🟠 MajorRun the same listener cleanup path on transfer that
close()uses.Transferred-in ports opt into
onDidChangeListenerImpl(), which canrefEventLoop()the current context formessagelisteners.disentangle()detaches the pipe and drops the observer without clearing those listeners, so the transferred-away wrapper can keep the old context alive until it is eventually collected.Possible fix
TransferredMessagePort MessagePort::disentangle() { ASSERT(isEntangled()); + removeAllEventListeners(); // Release the pipe to its next owner. Messages that arrive while in // transit buffer in the pipe; the receiving context's entangle() call🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bun.js/bindings/webcore/MessagePort.cpp` around lines 118 - 134, MessagePort::disentangle() currently detaches the pipe and drops the observer but does not run the same listener-cleanup path that close() uses, so transferred-away ports can retain message listeners that refEventLoop() the old context; update disentangle() to perform the same listener teardown as close() (i.e. run the onDidChangeListenerImpl()/listener-removal routine used by close() so any refEventLoop()ing is undone and listeners are cleared) before detaching/handing off the pipe and dropping the context observer, ensuring the transferred wrapper cannot keep the old context alive.
158-168:⚠️ Potential issue | 🔴 CriticalDon't dequeue the whole inbox before the port's transfer boundary.
takeAll()removes every pending message up front and stores them in laterpostTask()lambdas. Ifdisentangle()runs before those tasks execute, Line 174 drops them because the old wrapper is already detached, so messages posted before transfer disappear instead of following the port to its next owner. Keep messages in the pipe until the delivery task actually pops them, or restore them on detach.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bun.js/bindings/webcore/MessagePort.cpp` around lines 158 - 168, The code currently calls m_pipe->takeAll() which removes all pending messages immediately and stores them in postTask lambdas, so if disentangle() runs before those tasks execute the messages are lost; change the approach in MessagePort.cpp so messages are not dequeued up-front: instead capture the port identity but leave messages in m_pipe and have each posted task pull (pop/take) a single message at execution time and call dispatchOneMessage(context, message), or if you prefer keep takeAll() you must detect detach in disentangle() and restore any taken messages back into m_pipe; update the postTask lambdas and the logic around m_pipe->takeAll/m_pipe->pop and the disentangle() path accordingly so messages follow the port to its next owner.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/bun.js/bindings/webcore/TransferredMessagePort.h`:
- Around line 42-51: The TransferredMessagePort type should be move-only: remove
or mark deleted the copy constructor TransferredMessagePort(const
TransferredMessagePort&) and copy assignment operator TransferredMessagePort&
operator=(const TransferredMessagePort&), keep the move ctor
TransferredMessagePort(TransferredMessagePort&&) = default and move assignment
TransferredMessagePort& operator=(TransferredMessagePort&&) = default; then
update any call sites that currently copy TransferredMessagePort to use
std::move so ownership is transferred rather than duplicated.
In `@test/js/web/workers/message-port-pipe.test.ts`:
- Around line 106-108: The test currently asserts stderr is empty after awaiting
proc.stdout.text(), proc.stderr.text(), and proc.exited; but JS subprocesses may
emit a known ASAN startup warning (from bunEnv) that should be filtered first.
Update the assertion to remove/filter the ASAN startup warning from the captured
stderr (e.g., drop any line containing "WARNING: ASAN interferes" or use the
repo's existing stderr filter helper) and then assert the filtered stderr is
empty; apply the same change for the similar assertion around the
stdout/stderr/exitCode block at the later occurrence (the one referenced at
lines 149-151).
---
Duplicate comments:
In `@src/bun.js/bindings/webcore/BroadcastChannel.cpp`:
- Around line 108-111: contextDestroyed() currently only calls close(), leaving
the VM reference acquired in the constructor via jsRef() and
ContextDestructionObserver's cached context pointer intact; update
BroadcastChannel::contextDestroyed() to release the stored VM/JS reference
(mirror the constructor's jsRef() acquisition) and clear the
ContextDestructionObserver's cached context pointer so the ref is unbalanced and
the cached context is nulled—i.e., after or as part of close() call explicitly
release/clear the jsRef() held by BroadcastChannel and reset the observer's
cached context pointer (use the same jsRef()/observer accessor names used in the
constructor) so teardown fully reverses construction.
- Around line 118-123: hasPendingActivity() reads m_hasRelevantEventListener
without synchronization which is UB because eventListenersDidChange() writes it
on another thread; change m_hasRelevantEventListener to std::atomic<bool> (or
otherwise access it under the same lock used for other state) and update
eventListenersDidChange() to store with appropriate ordering (e.g.
memory_order_release) and hasPendingActivity() to load with memory_order_acquire
(or use the same mutex) so the listener flag is read/written atomically and the
lock-free path is safe; update declarations and both usages of
m_hasRelevantEventListener accordingly.
In `@src/bun.js/bindings/webcore/MessagePort.cpp`:
- Around line 118-134: MessagePort::disentangle() currently detaches the pipe
and drops the observer but does not run the same listener-cleanup path that
close() uses, so transferred-away ports can retain message listeners that
refEventLoop() the old context; update disentangle() to perform the same
listener teardown as close() (i.e. run the
onDidChangeListenerImpl()/listener-removal routine used by close() so any
refEventLoop()ing is undone and listeners are cleared) before detaching/handing
off the pipe and dropping the context observer, ensuring the transferred wrapper
cannot keep the old context alive.
- Around line 158-168: The code currently calls m_pipe->takeAll() which removes
all pending messages immediately and stores them in postTask lambdas, so if
disentangle() runs before those tasks execute the messages are lost; change the
approach in MessagePort.cpp so messages are not dequeued up-front: instead
capture the port identity but leave messages in m_pipe and have each posted task
pull (pop/take) a single message at execution time and call
dispatchOneMessage(context, message), or if you prefer keep takeAll() you must
detect detach in disentangle() and restore any taken messages back into m_pipe;
update the postTask lambdas and the logic around m_pipe->takeAll/m_pipe->pop and
the disentangle() path accordingly so messages follow the port to its next
owner.
In `@test/js/web/workers/message-port-pipe.test.ts`:
- Around line 29-40: The test "messages buffered before start() are delivered on
start()" relies on implicit start via assigning port2.onmessage; explicitly call
port2.start() to exercise the behavior named in the test (or alternatively
rename the test to indicate implicit start). Modify the test around the
MessageChannel usage (symbols: port1, port2) to call port2.start() after setting
up port2.onmessage (or before posting if you prefer) so the test actually
verifies delivery on explicit start().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8cfccd98-afe6-4332-9002-307661288f1e
📒 Files selected for processing (7)
src/bun.js/bindings/webcore/BroadcastChannel.cppsrc/bun.js/bindings/webcore/BroadcastChannel.hsrc/bun.js/bindings/webcore/MessageChannel.cppsrc/bun.js/bindings/webcore/MessagePort.cppsrc/bun.js/bindings/webcore/MessagePort.hsrc/bun.js/bindings/webcore/TransferredMessagePort.htest/js/web/workers/message-port-pipe.test.ts
…tangle, fold listener flag into state atomic - MessagePortPipe::drainAndDispatch now pops exactly one message under the lock, dispatches it, and re-posts itself if more remain. Messages never leave the inbox until the instant they're dispatched, so a port transferred between drain tasks carries its undelivered queue to the new owner instead of dropping it. DrainScheduled stays set across the self-reschedule and is cleared under the lock when the inbox empties. takeAll() is gone. - scheduleDrain clears DrainScheduled if postTaskTo fails (context already torn down) so a later attach() can reschedule. - MessagePort::disentangle calls removeAllEventListeners() before detaching so the event-loop ref taken by transferred-port listeners is released on the old context. - BroadcastChannel folds m_hasRelevantEventListener into the state atomic as a HasMessageListener bit; hasPendingActivity is now a single acquire load. contextDestroyed chains to the base. - BunBroadcastChannelRegistry::post rolls back the QueuedOne bump when postTaskTo returns false.
hasPendingActivity() runs on the concurrent GC thread and dereferences m_pipe; close()/disentangle() were nulling it on the mutator thread, opening a TOCTOU to a null-deref (or UAF if the mutator dropped the last ref between loads). Keep the Ref for the port's whole lifetime — close() just sets the pipe side's Closed bit, disentangle() detach()es and hands out a copyRef(). The pipe is ThreadSafeRefCounted so the extra ref is free, and hasPendingActivity() now only touches immutable pointers and atomic state.
When a port is transferred and the carrying message never reaches a new owner (destination closed, inbox discarded on close(), receiver detached before dispatch), the defaulted destructor only dropped the RefPtr — the orphaned side's Closed bit stayed 0, so the peer's hasPendingActivity() returned true via isOtherSideOpen() forever, pinning the peer's JS wrapper. Make TransferredMessagePort move-only with a destructor that closes its side when still owned. entangle() already releaseNonNull()s the pipe, so the happy path is a no-op. The destructor and move-assign are out-of-line in MessagePortPipe.cpp to avoid the header cycle MessagePortPipe.h → MessageWithMessagePorts.h → TransferredMessagePort.h. Fix the one Worker.cpp lambda that copy-captured MessageWithMessagePorts (and drop its dead 'ports' capture). Add a test: 200 peers of dropped-in-transit ports are now collected (finalized > 0); on main they are all pinned (finalized == 0).
…e worker subprocess tests behind debug/ASAN The verification harness's src/ revert doesn't round-trip file deletions cleanly: reverting to main brings the old files back, and restoring to the branch leaves them as staged adds instead of re-deleting them, so both builds try to compile the old .cpp against the new headers. Keeping the files as empty stubs means the harness only sees modifications, which it handles correctly. These can be removed outright once the harness limitation is lifted. Also gate the burst-ordering test behind isDebug||isASAN alongside the concurrent-churn test: both spawn worker subprocesses that hit a pre-existing pthread_exit→_Unwind_Find_FDE segfault in the locally- built stripped release binary in this container (reproduces on main's src/ too; does not reproduce in bun-profile or the baked release). The debug/ASAN lanes cover the correctness properties.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
test/js/web/workers/message-port-pipe.test.ts (1)
29-40: 🧹 Nitpick | 🔵 TrivialTest name still says explicit
start(), but this case relies on implicit start viaonmessage.Either call
port2.start()here or rename the test so the behavior under test is unambiguous.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/js/web/workers/message-port-pipe.test.ts` around lines 29 - 40, The test named "messages buffered before start() are delivered on start()" is ambiguous because it relies on implicit starting via port2.onmessage instead of an explicit start(); update the test to either call port2.start() before awaiting the messages or rename the test string to reflect implicit start via onmessage (e.g., "messages buffered before onmessage are delivered when handler set"); modify the test that uses MessageChannel and variables port1/port2 and the onmessage handler accordingly so the behavior under test is unambiguous.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/bun.js/bindings/webcore/MessagePort.cpp`:
- Around line 206-223: The concurrent GC read in MessagePort::hasPendingActivity
currently reads plain bools (m_isDetached, m_hasMessageEventListener) which is
UB; change those fields to std::atomic<bool> (or otherwise source the state from
MessagePortPipe atomics) and replace the plain reads here with atomic loads
(e.g. load(memory_order_relaxed) or stronger if needed). Ensure all
writers/assignments to m_isDetached and m_hasMessageEventListener use atomic
stores, and update any setters or code paths that mutate these flags to use the
atomic type so the GC-thread access is safe; alternatively, move the logic to
derive these checks from existing atomic state in MessagePortPipe (e.g., use
MessagePortPipe::state/isOtherSideOpen) and remove plain-bool checks.
---
Duplicate comments:
In `@test/js/web/workers/message-port-pipe.test.ts`:
- Around line 29-40: The test named "messages buffered before start() are
delivered on start()" is ambiguous because it relies on implicit starting via
port2.onmessage instead of an explicit start(); update the test to either call
port2.start() before awaiting the messages or rename the test string to reflect
implicit start via onmessage (e.g., "messages buffered before onmessage are
delivered when handler set"); modify the test that uses MessageChannel and
variables port1/port2 and the onmessage handler accordingly so the behavior
under test is unambiguous.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f4c55f0e-e3c3-4e5a-8d5f-62796aee6c0c
📒 Files selected for processing (6)
src/bun.js/bindings/webcore/MessagePort.cppsrc/bun.js/bindings/webcore/MessagePort.hsrc/bun.js/bindings/webcore/MessagePortPipe.cppsrc/bun.js/bindings/webcore/TransferredMessagePort.hsrc/bun.js/bindings/webcore/Worker.cpptest/js/web/workers/message-port-pipe.test.ts
…sagePort.h
When the harness reverts src/ to origin/main for the without-fix
build, MessagePortPipe.{h,cpp} survive as new untracked files and are
still globbed into the unified build, but MessagePort.h and
TransferredMessagePort.h revert to their identifier-based forms. The
.cpp then references dispatchOneMessage() and the struct-typed
TransferredMessagePort that don't exist on main.
Add BUN_MESSAGEPORT_USES_PIPE to the new MessagePort.h and wrap
MessagePortPipe.cpp's body in it, so the TU compiles to an empty
object when paired with main's headers. With-fix builds are
unchanged (the marker is defined).
Pre-existing copy-paste in jsFunctionPostMessage: when disentanglePorts() fails (duplicate/closed port in the worker's self.postMessage transfer list), the error branch released the exception from 'serialized' — which at that point holds a value, not an exception — instead of from 'disentangledPorts'. In debug that trips the Expected::error() assert; in release it propagates garbage and the real DataCloneError is lost. One-token fix adjacent to the lambda-capture change this PR already makes in the same function.
Jarred-Sumner
left a comment
There was a problem hiding this comment.
Do we have tests checking onjectTypeCounts goes down after closing MessagePort and Bun.gc(true)? What other adversarial tests can you think of?
…erage - objectTypeCounts: MessageChannel/MessagePort/BroadcastChannel wrapper counts rise under allocation and fall back to baseline after close() + Bun.gc(true). Same test verifies the hasPendingActivity contract: a port with a message listener is pinned while its peer is open and released once the peer closes; dropping the sender refs then brings MessagePort count back to baseline. - close() inside an onmessage handler stops further deliveries (only msgs 1,2 of 5 arrive). - Messages queued before/after a transfer follow the port to its new owner in order (exercises detach→buffer→attach flush). - Chained transfer: a port hops A→B then B→C→D and still delivers.
robobun
left a comment
There was a problem hiding this comment.
We didn't — added one in 64b361d. It's a two-part subprocess check:
Part 1 — allocate 200 MessageChannel (+touch both ports so wrappers exist) and 100 BroadcastChannel, verify heapStats().objectTypeCounts rises, then close() everything, drop refs, Bun.gc(true), and verify all three counts fall back to within a small slack of baseline.
Part 2 — the hasPendingActivity contract: create 100 channels, set port2.onmessage and drop the port2 JS ref, keep port1. After GC, MessagePort count stays at ~200 (100 held + 100 pinned via isOtherSideOpen). Close all port1s → count drops to ~100 (just the held senders). Drop senders[] → count goes to ~0.
Also added in the same commit:
close()inside anonmessagehandler stops further deliveries (reentrancy)- Messages queued on a port before transfer follow it to the new owner (exercises the detach→buffer→attach path)
- Chained transfer through two carrier pairs
Other adversarial cases I considered but didn't add because they're already covered elsewhere or would need dedicated infrastructure:
- Worker terminates mid-transfer — the
port dropped in transittest already covers the worst case of this (endpoint orphaned → peer released) receiveMessageOnPortracing a drain task — covered implicitly by the FIFO test +test-worker-message-port-receive-message.js- Pipe chain cycles — blocked at the API level by the "port may not be posted through itself or its peer" check; can't construct one
- BroadcastChannel fan-out while a subscriber's context is being torn down — worth adding but needs worker-termination timing control; could go in a follow-up alongside the existing
broadcast-channel-worker-gc.test.ts - Concurrent GC during cross-thread burst — the debug build runs with
BUN_GARBAGE_COLLECTOR_LEVEL=2under the harness, so the burst test already stresses this
Happy to add any of those if you'd like them in this PR.
|
@robobun can you double check the draining behavior matches with node? I think they batch drain with a limit and don’t drain microtasks between individual messages? Also, if we can we really should batch the messages it’s a lot of allocations and RefCount churn |
…essage Node's MessagePort::OnMessage (src/node_messaging.cc) handles a single uv_async wakeup by looping up to max(initial-queue-size, 1000) iterations: each iteration pops one message, calls MakeCallback to fire the event (whose InternalCallbackScope drains nextTick + microtasks on exit), and continues. setImmediate/setTimeout don't interleave within a batch; the cap prevents a fast sender from starving the loop. Match that: drainAndDispatch now loops within a single posted task, popping one message under the lock, dispatching, then calling globalObject->drainMicrotasks() before the next iteration. The Attached flag is re-checked each time so a handler that closes/transfers the port leaves the remaining inbox buffered for the new owner. If the cap is hit, one scheduleDrain() reschedules the remainder. This replaces one postTaskTo lambda + Ref<Pipe> + task execution per message with one per batch. Microtask ordering (m0,u0,m1,u1,...) and nextTick ordering (m0,n0,m1,n1,...) match Node v24; setImmediate does not interleave (also matches Node).
The ctxId check added in b3aff0a catches cross-context transfers, but a handler can synchronously detach its own port (transfer through a local carrier) and re-wrap it via receiveMessageOnPort on the same context. That restores the same ctxId with a different MessagePort wrapper, so the drain loop kept dispatching to the stale m_isDetached port and silently dropped the remaining inbox. Compare s.port identity under the lock each iteration; if it no longer matches the captured port, stop — the new owner's attach() has scheduled its own drain and will deliver the rest.
Benchmark: PR #29937 vs
|
| Scenario | main |
this PR | Δ |
|---|---|---|---|
MessageChannel same-thread, 200k postMessage |
258.8 ms ± 1.1 | 201.1 ms ± 1.5 | 1.29× faster |
MessagePort cross-thread (worker parentPort), 100k round trips |
197.2 ms ± 2.8 | 186.8 ms ± 2.4 | 1.06× faster |
BroadcastChannel same-thread, 50k msgs × 4 subscribers |
270.8 ms ± 2.6 | 235.4 ms ± 1.2 | 1.15× faster |
BroadcastChannel cross-thread, 50k msgs × 4 worker subscribers |
379.3 ms ± 8.6 | 288.9 ms ± 3.7 | 1.31× faster |
The cross-thread BroadcastChannel win is the biggest practical one — user time drops 783ms→558ms and sys 896ms→446ms, consistent with dropping the MainThreadBridge bounce.
mitata (per-call latency)
| Op | main |
this PR | Δ |
|---|---|---|---|
new MessageChannel() |
1.63 µs | 629 ns | 2.6× faster |
port.postMessage(int) same-thread |
195 ns | 175 ns | 1.12× faster |
port.postMessage({a,b,c}) same-thread |
897 ns | 879 ns | ~same (clone-dominated) |
worker.postMessage(string) (repo's bench/postMessage/postMessage-string.mjs) |
730 ns | 440 ns | 1.66× faster |
bc.postMessage(int) 1→1 same-thread (send only) |
415 ns | 561 ns | 1.35× slower |
The bc.postMessage send-side bump is presumably the per-subscriber clone+task-post now happening synchronously inside postMessage instead of being deferred through the bridge. End-to-end it still wins (row 3 above), so the work just moved earlier rather than grew.
Crash note
I originally wrote the cross-thread MessagePort benchmark as new MessageChannel() → transfer port2 to a Worker → 100k round trips. On main that segfaults during worker startup (Segmentation fault at address 0x8, bun.report/1.3.14/la2f97aa65g…) or hangs; on this PR it runs clean. I switched the comparison to use the worker's built-in parentPort so main could survive the baseline run.
Benchmark sources
// mc-same-thread.mjs — 200k same-thread port1→port2
const N = 200000;
const { port1, port2 } = new MessageChannel();
let r = 0;
port2.onmessage = () => { if (++r === N) process.exit(0); };
port1.start(); port2.start();
for (let i = 0; i < N; i++) port1.postMessage(i);// mc-cross-thread.mjs — 100k round trips over worker parentPort
import { Worker, isMainThread, parentPort } from "node:worker_threads";
const N = 100000;
if (isMainThread) {
const w = new Worker(new URL(import.meta.url));
let e = 0;
w.on("message", () => { if (++e === N) w.terminate().then(() => process.exit(0)); });
await new Promise(r => w.once("online", r));
for (let i = 0; i < N; i++) w.postMessage(i);
} else {
parentPort.on("message", d => parentPort.postMessage(d));
}// bc-same-thread.mjs — 50k msgs × 4 subscribers
const N = 50000, SUBS = 4, target = N * SUBS;
let r = 0;
const subs = Array.from({ length: SUBS }, () => {
const c = new BroadcastChannel("bench");
c.onmessage = () => { if (++r === target) process.exit(0); };
return c;
});
const sender = new BroadcastChannel("bench");
for (let i = 0; i < N; i++) sender.postMessage(i);// bc-cross-thread.mjs — 50k msgs × 4 worker subscribers
import { Worker, isMainThread, parentPort, workerData } from "node:worker_threads";
const N = 50000, SUBS = 4;
if (isMainThread) {
const counter = new Int32Array(new SharedArrayBuffer(4));
const ws = Array.from({ length: SUBS }, () => new Worker(new URL(import.meta.url), { workerData: { counter } }));
await Promise.all(ws.map(w => new Promise(r => w.once("message", r))));
const s = new BroadcastChannel("bench");
for (let i = 0; i < N; i++) s.postMessage(i);
while (Atomics.load(counter, 0) < N * SUBS) await new Promise(r => setTimeout(r, 0));
await Promise.all(ws.map(w => w.terminate()));
process.exit(0);
} else {
const c = new BroadcastChannel("bench");
c.onmessage = () => Atomics.add(workerData.counter, 0, 1);
parentPort.postMessage("ready");
}
Peak RSSSame scenarios as above,
The cross-thread |
|
Build #49252 Windows failures are infra, not this PR:
Deflake for the rate-limit case: #29956. Retry the two jobs from the Buildkite UI — the BUILDKITE_TOKEN here is read-only so I can't trigger it. |
Resolve weak→strong under the lock directly into an inline-capacity Vector<pair<ctxId, Ref<BroadcastChannel>>, 4> instead of copying the whole Subscriber vector (heap alloc + 2N ThreadSafeWeakPtr atomics) and resolving later. For ≤4 targets this is zero heap allocations and one CAS per target. Measured on local release build, 1→1 same-thread send-only: before: ~691 ns/op after: ~631 ns/op (≈9% faster) The remaining gap to main's 415ns is the fan-out now running inside postMessage rather than deferred through the old MainThreadBridge hop; end-to-end BroadcastChannel is still net faster.
|
Nice — thanks for running these. On the Local release, same 1→1 send-only shape: (Slower CPU here than your Xeon, so absolute ns differ; should scale proportionally.) The remaining gap vs main is the fan-out itself (lock + Re the crash note: the |
worker.postMessage / parentPort.postMessage previously posted one task per message through ScriptExecutionContext::postTaskTo (global contexts lock + HashMap lookup + lambda alloc + concurrent enqueue). A round trip of N messages was ~2N contended lock acquisitions. Give Worker two MessageInbox structs (parent→worker, worker→parent), each a lock + Deque + drainScheduled flag. enqueue appends under the lock and posts one drain task if none is in flight; the drain task swaps the whole queue out under the lock and dispatches from the local copy with a microtask checkpoint between each, matching Node's MessagePort::OnMessage semantics. Cap is max(initial size, 1000) before yielding and rescheduling. Swapping rather than per-message popping keeps the sender and drainer from contending on the inbox lock under sustained bursts. Pre-online buffering is preserved: enqueueToWorker skips scheduling while !Online; fireEarlyMessages kicks the first drain once the worker is up. Local release, 100k parentPort round-trips: before: ~240 ms after: ~176 ms (1.36× faster) bench/postMessage/postMessage-string.mjs send-side: ~440 ns/iter (main) → ~217 ns/iter
|
Worker postMessage round-trip — e4b77d7: applied the same coalescing pattern to The swap-then-dispatch (vs per-message pop) keeps the sender and drainer off each other's lock under sustained bursts — important because the repo's own Local release (slower box than yours):
Observable ordering unchanged ( All 258 tests in On build #49268's failure: it's the Zig compiler itself panicking ( |
Resolves conflicts with #29937 (MessagePortPipe) which added batched MessageInbox queues to Worker. Adapts the new code to this PR's lifecycle: - enqueueToWorker: '!(m_onlineClosingFlags & OnlineFlag)' → 'm_state != Running' - enqueueToParent / drainToParent: route through postTaskToParent() (stable m_parentContextId) instead of reading scriptExecutionContext() from the worker thread - drop networkStateChanged and flag constants (already removed here) - keep m_toWorker/m_toParent inboxes and drain machinery as-is
…9895) ## What Fixes a stack overflow in `MessagePortPipe::close()` when a deep chain of nested transferred MessagePorts is closed, and adds regression coverage for the transferred-port-dropped leak that #29937's RAII `TransferredMessagePort` destructor now handles. ## Repro ```js const DEPTH = 20_000; let head = new MessageChannel(); const tail = head; for (let i = 0; i < DEPTH; i++) { const next = new MessageChannel(); head.port2.postMessage(null, [next.port1]); // next.port1 lands in head.port1's inbox head.port2.close(); head = next; } tail.port1.close(); // SIGSEGV (stack overflow) ``` ## Root cause `TransferredMessagePort` is move-only and its destructor calls `pipe->close(side)` so an orphaned in-transit port gets cleaned up. `MessagePortPipe::close()` swaps the side's inbox into a local `Deque` and lets it destruct outside the lock. If those dropped messages carry transferred ports, the `Deque` destructor runs `~TransferredMessagePort` → `pipe->close()` → another `Deque` destructor → … one native stack frame per link. ~20k links overflows an 8 MB stack. ## Fix `MessagePortPipe::close()` owns a stack-local `Vector<pair<RefPtr<MessagePortPipe>, uint8_t>>` worklist seeded with `{this, side}`. For each entry it closes the side under the lock, swaps out the inbox, then harvests transferred pipes from the dropped messages into the worklist (nulling `tp.pipe` so the subsequent `~TransferredMessagePort` is a no-op) before letting the batch destruct. O(1) stack regardless of chain depth; no static / thread_local / instance state. ## Verification `test/js/web/workers/message-port-closed-leak.test.ts`: | test | main (no fix) | this PR | |--|--|--| | transferred port dropped before receiver closed does not leak | pass (#29937) | pass | | transferred port dropped after receiver closed does not leak | pass (#29937) | pass | | **deep chain of nested transferred ports does not overflow on close** | **SIGSEGV, exit 139** | pass | `message-channel.test.ts` (12 tests) passes unchanged. ## History This PR originally targeted the self-ref leak in the old `MessagePortChannel` / `MessagePortChannelRegistry` stack (`disentanglePort()` parked a self-ref in `m_pendingMessagePortTransfers[i]` that was never released if the carrying message was dropped). #29937 replaced that whole stack with `MessagePortPipe` and fixed the leak via RAII — but the destructor-driven cascade it introduced recurses unboundedly. The first two tests are kept as regression guards for the leak; the deep-chain test is the fail-before for this fix. --------- Co-authored-by: robobun <robobun@users.noreply.github.com>
…in post() BunBroadcastChannelRegistry::post() resolved each subscriber's ThreadSafeWeakPtr to a Ref<BroadcastChannel> for the fan-out. If a worker subscriber tore down concurrently — its JS wrapper GC'd and its queued task dropped, both on the worker thread — the posting thread's local Ref became the last owner, and ~BroadcastChannel → ~EventTarget → EventListenerMap::clear() ran on the posting thread, tripping releaseAssertOrSetThreadUID (the map was populated on the worker thread). Seen in CI as a SIGABRT in broadcast-channel-worker-gc.test.ts 'repeated worker create/terminate stress' on musl x64-baseline, and on other branches post-#29937. Carry ThreadSafeWeakPtr through the fan-out and resolve to a strong ref only inside the posted task, on the target thread, so any last deref happens there. This removes the need for the QueuedOne count (which existed to keep the channel GC-rooted between post and dispatch); a channel with no message listener is now collectable before delivery, in which case the task resolves the weak ptr to null and no-ops — equivalent to dispatchEvent with no listener.
- web_worker.zig parent field: parent_poll_ref is now conditional on
!default_unref, so the 'lives at least as long as this struct via
parent_poll_ref' claim no longer holds unconditionally. Rewrite to
describe the {ref:false} / .unref() case and point at the file
header's detached-not-joined gap.
- Worker.cpp postTaskToWorkerGlobalScope Closing case: postMessage()
now goes through enqueueToWorker() (since #29937), not this
function; only getHeapSnapshot() calls it.
`test/js/node/worker_threads/worker_threads.test.ts` occasionally
segfaults in CI with
```
panic: Segmentation fault at address 0x10
```
on a GC helper thread:
```
wtfThreadEntryPoint
AutomaticThread::start
ParallelHelperPool::Thread::work
Heap::runBeginPhase(GCConductor)::$_1
SlotVisitor::drainFromShared
MarkingConstraintSolver::runExecutionThread
MarkingConstraint::execute ← "Sh" Strong Handles
HandleSet::visitStrongHandles
*(nullptr + offsetof(HandleNode, m_value)) = *(0x10)
```
(decoded from the `bun.report` trace on [build 50529 / 🐧 13
x64](https://buildkite.com/bun/bun/builds/50529#019dec3e-4d11-4651-b908-84e601bc2db3)
and symbolized against that build's `bun-profile`).
## Cause
`jsWorkerPrototypeFunction_getHeapSnapshotBody` does:
```cpp
Strong<JSPromise> strong(vm, promise); // parent VM's HandleSet
worker.postTaskToWorkerGlobalScope([strong, parentId](auto& workerCtx) {
...
ScriptExecutionContext::postTaskTo(parentId,
[strong, snapshot = ...](auto& parentCtx) { ... }); // runs on worker thread
});
```
`JSC::Strong<T>` has **no move constructor**. Capturing it by value
copy-constructs it, which calls `HandleSet::allocate()` +
`m_strongList.push()`; destroying it calls `HandleSet::deallocate()` +
`NodeList::remove()`. Both happen on the **worker thread** against the
**parent VM's** `HandleSet`, without the parent VM's lock.
`HandleSet::m_strongList` is a `SentinelLinkedList<HandleNode>` — not
thread-safe. `push`/`remove` transiently null `m_next`/`m_prev`. The
parent VM's "Sh" (Strong Handles) marking constraint
(`Heap::addCoreConstraints`) iterates that list during GC; when it
follows a null `m_next` it reads `*((HandleNode*)nullptr)->slot()` →
`*(0x0 + 0x10)`.
The `heapHelperPool()` is process-global, so the crashing helper thread
belongs to the parent VM's collector even though the worker VM's
`BunV8HeapSnapshotBuilder` full GC is in progress at the same time.
This has been there since `getHeapSnapshot` was added — the recent
worker lifetime rewrites (#29957, #29937) didn't introduce it.
## Fix
Heap-allocate the `Strong<JSPromise>` once on the parent thread and pass
only the raw pointer through the cross-thread lambdas. The worker thread
never dereferences it, so it never touches the parent VM's `HandleSet`.
The parent-side completion lambda resolves the promise and frees the
handle.
`Worker::postTaskToWorkerGlobalScope` now returns `bool` so a lost race
to `Closing`/`Closed` (worker exited between `isOnline()` and the post)
rejects with `ERR_WORKER_NOT_RUNNING` instead of silently leaking the
handle. If `postTaskTo(parentId, …)` on the return trip fails (parent
context gone), the handle intentionally leaks — deleting a parent-VM
`Strong` from the worker thread is exactly the bug we're fixing, and the
parent VM is tearing down anyway.
## Verification
Stress fixture (`heap-snapshot-gc-race-fixture.js`, 300 iterations of
`await worker.getHeapSnapshot(); Bun.gc(true)`), 40 runs each on
linux-x64 release:
| build | segfault at `0x10` |
| --- | --- |
| `52bdf47` (CI artifact, no fix) | 15 / 40 |
| this branch | 0 / 40 |
The new `worker_heap_snapshot_gc.test.ts` runs the fixture — 300 iters
in release, 5 in debug (a single debug heap snapshot takes ~1.6s so the
race window, which is a handful of instructions after each snapshot, is
impractical to hit there; the debug pass is a functional check).
## Drive-by: non-LTO strip leaves orphan `PT_GNU_EH_FRAME`
While reproducing I hit a second, unrelated crash in locally-built
(non-LTO) release binaries: `stripFlags` removed `.eh_frame_hdr` on
linux-gnu unconditionally, but the linker only passes
`--no-eh-frame-hdr` when LTO is on. GNU strip doesn't rewrite the
program header table, so the `PT_GNU_EH_FRAME` phdr was left pointing at
unmapped memory and any stack unwind (e.g. WTF::Thread teardown after a
worker exits) faulted. CI release builds always have LTO on so they
weren't affected. Gated the section removal on `c.lto` to match the
linker flag.
---------
Co-authored-by: robobun <robobun@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
…primitive (oven-sh#29937) ## What Replaces the WebKit-derived `MessagePortChannelProvider` / `MessagePortChannelProviderImpl` / `MessagePortChannelRegistry` / `MessagePortChannel` stack — plus the `BroadcastChannel::MainThreadBridge` / per-context `BunBroadcastChannelRegistry` — with a small, thread-safe concurrency primitive (`MessagePortPipe`) that the Web-facing classes wrap thinly. Net −1082 lines (+737 / −1819), 10 files deleted. ## Why The existing stack is Safari's multi-*process* design carried over verbatim: an abstract provider so UIProcess can swap in an IPC impl, a process-global `MessagePortIdentifier → MessagePortChannel` HashMap, `ProcessIdentifier` tracking, self-`RefPtr` "protector" members to keep channels alive across IPC hops, and a main-thread serialization point for `BroadcastChannel`. None of that indirection is needed in Bun's single-process multi-thread model — and worse, the `MessagePortChannelRegistry::m_openChannels` HashMap and `MessagePortChannel::m_pendingMessages` Vector are mutated from worker threads with **no lock** (the WebKit `ASSERT(isMainThread())` guards were just commented out). Concurrent `new MessageChannel()` across workers corrupts the HashMap: ``` HashTable.h:1574:17: runtime error: member access within null pointer of type 'const HashTable<MessagePortIdentifier, KeyValuePair<MessagePortIdentifier, WeakRef<MessagePortChannel>>, ...>' ``` This is the cause of oven-sh#16186, oven-sh#22471, oven-sh#25805 and likely oven-sh#29458. ## Design **`MessagePortPipe`** — `ThreadSafeRefCounted`, two sides. Each side has: - a `Lock` + `Deque<MessageWithMessagePorts>` inbox - one `std::atomic<uint64_t>` state word: `Closed | DrainScheduled | Attached` in the low byte, queued-count in the high bits `send()` locks the destination side, enqueues, and if `Attached && !DrainScheduled` flips `DrainScheduled` and posts one task to the receiver's `ScriptExecutionContext`. A burst of N sends schedules at most one cross-thread wakeup. `takeAll()` clears `DrainScheduled` before swapping the deque, so a racing send reschedules — at most one extra no-op drain, never a stranded message. `attach()`/`detach()` handle transfer: a detached side buffers; attach flushes the backlog with one wakeup. `hasPendingActivity` is two atomic loads (my queued count + peer's `Closed`), safe from the GC thread with no lock. **`MessagePort`** — holds `RefPtr<MessagePortPipe>` + `uint8_t side`. `postMessage` → `pipe->send`. `start()` → `pipe->attach`. `close()` → `pipe->close`. Transfer → `pipe->detach`, hand `{pipe, side}` through `TransferredMessagePort`, re-create on the other side. No global maps, no identifiers, no per-context port iteration. The drain task takes the whole inbox in one go (sender-side batching), then posts each message as its own local task so microtasks checkpoint between deliveries — matching the HTML "port message queue is a task source" semantics and Node's observable ordering. **`BroadcastChannel`** — one process-global `Lock` + `HashMap<String, Vector<{ctxId, weak channel}>>`. `postMessage` snapshots subscribers under the lock, releases, then posts one task per `(message, subscriber)` directly to each subscriber's context — no bounce through the main thread, no `MainThreadBridge`, no second `allBroadcastChannels` map. Per-channel inbox batching was prototyped and reverted: the spec's same-event-loop `(message-major, creation-minor)` delivery-order test (WPT `broadcast-channel.test.ts::"messages are delivered in port creation order"`) fails if subscribers drain channel-major. A per-`(context, name)` inbox could restore batching without breaking order — left as future work. ## Deleted `MessagePortChannel.{h,cpp}`, `MessagePortChannelProvider.{h,cpp}`, `MessagePortChannelProviderImpl.{h,cpp}`, `MessagePortChannelRegistry.{h,cpp}`, `MessagePortIdentifier.h`, `BroadcastChannelRegistry.h`, the `allMessagePorts()` / `portToContextIdentifier()` / `allBroadcastChannels()` / `channelToContextIdentifier()` global maps, `ScriptExecutionContext::{m_messagePorts, processMessageWithMessagePortsSoon, dispatchMessagePortEvents, createdMessagePort, destroyedMessagePort, m_broadcastChannelRegistry}`, and `WorkerGlobalScope::messagePortChannelProvider()`. ## Verification - `test/js/web/workers/message-channel.test.ts` — 12/12 - `test/js/web/broadcastchannel/broadcast-channel.test.ts` — 11/11 - `test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts` — 3/3 - `test/js/web/workers/` suite — 228 pass / 6 fail; all 6 fail identically on `main` (container-slowness timeouts), zero new failures - `test/js/node/worker_threads/worker_threads.test.ts` — 25 pass / 2 fail; both fail identically on `main` - 10 node `parallel/test-worker-message-port-*.js` + `test-broadcastchannel-*.js` — all pass - New `test/js/web/workers/message-port-pipe.test.ts`: - `concurrent MessageChannel creation across workers is race-free` — **fails 3/3 on main** with the HashTable UBSan null-deref shown above; **passes 5/5** on this branch - microtask ordering, buffered-before-start, FIFO `receiveMessageOnPort`, 1000-message cross-thread burst ## Relationship to other PRs oven-sh#29832 / oven-sh#29442 patch the race by adding a lock to the existing registry. This PR removes the registry instead — there's nothing left to race on. Either approach fixes the crash; this one also sheds the layering. Fixes oven-sh#16186 Fixes oven-sh#22471 Fixes oven-sh#25805 --------- Co-authored-by: robobun <robobun@users.noreply.github.com> Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
…en-sh#29895) ## What Fixes a stack overflow in `MessagePortPipe::close()` when a deep chain of nested transferred MessagePorts is closed, and adds regression coverage for the transferred-port-dropped leak that oven-sh#29937's RAII `TransferredMessagePort` destructor now handles. ## Repro ```js const DEPTH = 20_000; let head = new MessageChannel(); const tail = head; for (let i = 0; i < DEPTH; i++) { const next = new MessageChannel(); head.port2.postMessage(null, [next.port1]); // next.port1 lands in head.port1's inbox head.port2.close(); head = next; } tail.port1.close(); // SIGSEGV (stack overflow) ``` ## Root cause `TransferredMessagePort` is move-only and its destructor calls `pipe->close(side)` so an orphaned in-transit port gets cleaned up. `MessagePortPipe::close()` swaps the side's inbox into a local `Deque` and lets it destruct outside the lock. If those dropped messages carry transferred ports, the `Deque` destructor runs `~TransferredMessagePort` → `pipe->close()` → another `Deque` destructor → … one native stack frame per link. ~20k links overflows an 8 MB stack. ## Fix `MessagePortPipe::close()` owns a stack-local `Vector<pair<RefPtr<MessagePortPipe>, uint8_t>>` worklist seeded with `{this, side}`. For each entry it closes the side under the lock, swaps out the inbox, then harvests transferred pipes from the dropped messages into the worklist (nulling `tp.pipe` so the subsequent `~TransferredMessagePort` is a no-op) before letting the batch destruct. O(1) stack regardless of chain depth; no static / thread_local / instance state. ## Verification `test/js/web/workers/message-port-closed-leak.test.ts`: | test | main (no fix) | this PR | |--|--|--| | transferred port dropped before receiver closed does not leak | pass (oven-sh#29937) | pass | | transferred port dropped after receiver closed does not leak | pass (oven-sh#29937) | pass | | **deep chain of nested transferred ports does not overflow on close** | **SIGSEGV, exit 139** | pass | `message-channel.test.ts` (12 tests) passes unchanged. ## History This PR originally targeted the self-ref leak in the old `MessagePortChannel` / `MessagePortChannelRegistry` stack (`disentanglePort()` parked a self-ref in `m_pendingMessagePortTransfers[i]` that was never released if the carrying message was dropped). oven-sh#29937 replaced that whole stack with `MessagePortPipe` and fixed the leak via RAII — but the destructor-driven cascade it introduced recurses unboundedly. The first two tests are kept as regression guards for the leak; the deep-chain test is the fail-before for this fix. --------- Co-authored-by: robobun <robobun@users.noreply.github.com>
…0185) `test/js/node/worker_threads/worker_threads.test.ts` occasionally segfaults in CI with ``` panic: Segmentation fault at address 0x10 ``` on a GC helper thread: ``` wtfThreadEntryPoint AutomaticThread::start ParallelHelperPool::Thread::work Heap::runBeginPhase(GCConductor)::$_1 SlotVisitor::drainFromShared MarkingConstraintSolver::runExecutionThread MarkingConstraint::execute ← "Sh" Strong Handles HandleSet::visitStrongHandles *(nullptr + offsetof(HandleNode, m_value)) = *(0x10) ``` (decoded from the `bun.report` trace on [build 50529 / 🐧 13 x64](https://buildkite.com/bun/bun/builds/50529#019dec3e-4d11-4651-b908-84e601bc2db3) and symbolized against that build's `bun-profile`). ## Cause `jsWorkerPrototypeFunction_getHeapSnapshotBody` does: ```cpp Strong<JSPromise> strong(vm, promise); // parent VM's HandleSet worker.postTaskToWorkerGlobalScope([strong, parentId](auto& workerCtx) { ... ScriptExecutionContext::postTaskTo(parentId, [strong, snapshot = ...](auto& parentCtx) { ... }); // runs on worker thread }); ``` `JSC::Strong<T>` has **no move constructor**. Capturing it by value copy-constructs it, which calls `HandleSet::allocate()` + `m_strongList.push()`; destroying it calls `HandleSet::deallocate()` + `NodeList::remove()`. Both happen on the **worker thread** against the **parent VM's** `HandleSet`, without the parent VM's lock. `HandleSet::m_strongList` is a `SentinelLinkedList<HandleNode>` — not thread-safe. `push`/`remove` transiently null `m_next`/`m_prev`. The parent VM's "Sh" (Strong Handles) marking constraint (`Heap::addCoreConstraints`) iterates that list during GC; when it follows a null `m_next` it reads `*((HandleNode*)nullptr)->slot()` → `*(0x0 + 0x10)`. The `heapHelperPool()` is process-global, so the crashing helper thread belongs to the parent VM's collector even though the worker VM's `BunV8HeapSnapshotBuilder` full GC is in progress at the same time. This has been there since `getHeapSnapshot` was added — the recent worker lifetime rewrites (oven-sh#29957, oven-sh#29937) didn't introduce it. ## Fix Heap-allocate the `Strong<JSPromise>` once on the parent thread and pass only the raw pointer through the cross-thread lambdas. The worker thread never dereferences it, so it never touches the parent VM's `HandleSet`. The parent-side completion lambda resolves the promise and frees the handle. `Worker::postTaskToWorkerGlobalScope` now returns `bool` so a lost race to `Closing`/`Closed` (worker exited between `isOnline()` and the post) rejects with `ERR_WORKER_NOT_RUNNING` instead of silently leaking the handle. If `postTaskTo(parentId, …)` on the return trip fails (parent context gone), the handle intentionally leaks — deleting a parent-VM `Strong` from the worker thread is exactly the bug we're fixing, and the parent VM is tearing down anyway. ## Verification Stress fixture (`heap-snapshot-gc-race-fixture.js`, 300 iterations of `await worker.getHeapSnapshot(); Bun.gc(true)`), 40 runs each on linux-x64 release: | build | segfault at `0x10` | | --- | --- | | `52bdf47` (CI artifact, no fix) | 15 / 40 | | this branch | 0 / 40 | The new `worker_heap_snapshot_gc.test.ts` runs the fixture — 300 iters in release, 5 in debug (a single debug heap snapshot takes ~1.6s so the race window, which is a handful of instructions after each snapshot, is impractical to hit there; the debug pass is a functional check). ## Drive-by: non-LTO strip leaves orphan `PT_GNU_EH_FRAME` While reproducing I hit a second, unrelated crash in locally-built (non-LTO) release binaries: `stripFlags` removed `.eh_frame_hdr` on linux-gnu unconditionally, but the linker only passes `--no-eh-frame-hdr` when LTO is on. GNU strip doesn't rewrite the program header table, so the `PT_GNU_EH_FRAME` phdr was left pointing at unmapped memory and any stack unwind (e.g. WTF::Thread teardown after a worker exits) faulted. CI release builds always have LTO on so they weren't affected. Gated the section removal on `c.lto` to match the linker flag. --------- Co-authored-by: robobun <robobun@users.noreply.github.com> Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
What
Replaces the WebKit-derived
MessagePortChannelProvider/MessagePortChannelProviderImpl/MessagePortChannelRegistry/MessagePortChannelstack — plus theBroadcastChannel::MainThreadBridge/ per-contextBunBroadcastChannelRegistry— with a small, thread-safe concurrency primitive (MessagePortPipe) that the Web-facing classes wrap thinly.Net −1082 lines (+737 / −1819), 10 files deleted.
Why
The existing stack is Safari's multi-process design carried over verbatim: an abstract provider so UIProcess can swap in an IPC impl, a process-global
MessagePortIdentifier → MessagePortChannelHashMap,ProcessIdentifiertracking, self-RefPtr"protector" members to keep channels alive across IPC hops, and a main-thread serialization point forBroadcastChannel. None of that indirection is needed in Bun's single-process multi-thread model — and worse, theMessagePortChannelRegistry::m_openChannelsHashMap andMessagePortChannel::m_pendingMessagesVector are mutated from worker threads with no lock (the WebKitASSERT(isMainThread())guards were just commented out). Concurrentnew MessageChannel()across workers corrupts the HashMap:This is the cause of #16186, #22471, #25805 and likely #29458.
Design
MessagePortPipe—ThreadSafeRefCounted, two sides. Each side has:Lock+Deque<MessageWithMessagePorts>inboxstd::atomic<uint64_t>state word:Closed | DrainScheduled | Attachedin the low byte, queued-count in the high bitssend()locks the destination side, enqueues, and ifAttached && !DrainScheduledflipsDrainScheduledand posts one task to the receiver'sScriptExecutionContext. A burst of N sends schedules at most one cross-thread wakeup.takeAll()clearsDrainScheduledbefore swapping the deque, so a racing send reschedules — at most one extra no-op drain, never a stranded message.attach()/detach()handle transfer: a detached side buffers; attach flushes the backlog with one wakeup.hasPendingActivityis two atomic loads (my queued count + peer'sClosed), safe from the GC thread with no lock.MessagePort— holdsRefPtr<MessagePortPipe>+uint8_t side.postMessage→pipe->send.start()→pipe->attach.close()→pipe->close. Transfer →pipe->detach, hand{pipe, side}throughTransferredMessagePort, re-create on the other side. No global maps, no identifiers, no per-context port iteration.The drain task takes the whole inbox in one go (sender-side batching), then posts each message as its own local task so microtasks checkpoint between deliveries — matching the HTML "port message queue is a task source" semantics and Node's observable ordering.
BroadcastChannel— one process-globalLock+HashMap<String, Vector<{ctxId, weak channel}>>.postMessagesnapshots subscribers under the lock, releases, then posts one task per(message, subscriber)directly to each subscriber's context — no bounce through the main thread, noMainThreadBridge, no secondallBroadcastChannelsmap. Per-channel inbox batching was prototyped and reverted: the spec's same-event-loop(message-major, creation-minor)delivery-order test (WPTbroadcast-channel.test.ts::"messages are delivered in port creation order") fails if subscribers drain channel-major. A per-(context, name)inbox could restore batching without breaking order — left as future work.Deleted
MessagePortChannel.{h,cpp},MessagePortChannelProvider.{h,cpp},MessagePortChannelProviderImpl.{h,cpp},MessagePortChannelRegistry.{h,cpp},MessagePortIdentifier.h,BroadcastChannelRegistry.h, theallMessagePorts()/portToContextIdentifier()/allBroadcastChannels()/channelToContextIdentifier()global maps,ScriptExecutionContext::{m_messagePorts, processMessageWithMessagePortsSoon, dispatchMessagePortEvents, createdMessagePort, destroyedMessagePort, m_broadcastChannelRegistry}, andWorkerGlobalScope::messagePortChannelProvider().Verification
test/js/web/workers/message-channel.test.ts— 12/12test/js/web/broadcastchannel/broadcast-channel.test.ts— 11/11test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts— 3/3test/js/web/workers/suite — 228 pass / 6 fail; all 6 fail identically onmain(container-slowness timeouts), zero new failurestest/js/node/worker_threads/worker_threads.test.ts— 25 pass / 2 fail; both fail identically onmainparallel/test-worker-message-port-*.js+test-broadcastchannel-*.js— all passtest/js/web/workers/message-port-pipe.test.ts:concurrent MessageChannel creation across workers is race-free— fails 3/3 on main with the HashTable UBSan null-deref shown above; passes 5/5 on this branchreceiveMessageOnPort, 1000-message cross-thread burstRelationship to other PRs
#29832 / #29442 patch the race by adding a lock to the existing registry. This PR removes the registry instead — there's nothing left to race on. Either approach fixes the crash; this one also sheds the layering.
Fixes #16186
Fixes #22471
Fixes #25805