Fix BroadcastChannel channelToContextIdentifier locking and dispatchMessage lifetime#29441
Conversation
…essage lifetime
Two data races in BroadcastChannel.cpp that surface as ASAN heap-UAF
in broadcast-channel-worker-gc.test.ts:
(A) channelToContextIdentifier() HashMap was only locked on the worker-
thread reader (contextIdForBroadcastChannelId), not on the three
main-thread writers/readers (registerChannel .add, unregisterChannel
.remove, dispatchMessageTo .get). When a main-thread .add() rehashes
the table while a worker thread is mid-.get(), the worker dereferences
a freed bucket array. Add Locker at all call sites and annotate the
accessor with WTF_REQUIRES_LOCK so -Wthread-safety catches future
regressions (matching allBroadcastChannels()).
(B) dispatchMessage() posts a task capturing raw `this` with no
protecting Ref. The caller holds a strong ref from the
ThreadSafeWeakPtr lookup but drops it when the outer lambda returns,
so the queued task can outlive the BroadcastChannel during worker
terminate + GC. Capture protectedThis = Ref { *this } in the lambda,
matching MessagePort.cpp / WebSocket.cpp.
Also:
- Add a stress test that churns registrations while workers cross-post,
exercising the worker-side map read concurrently with main-side rehash.
- Filter the unconditional ASAN startup warning from child-process stderr
so the existing assertions hold under ASAN builds.
- Scale test timeouts for debug/ASAN (worker spawn is much slower there).
|
Updated 5:15 AM PT - Apr 18th, 2026
❌ @robobun, your commit 510f0f4 has 3 failures in
🧪 To try this PR locally: bunx bun-pr 29441That installs a local version of the PR into your bun-29441 --bun |
WalkthroughModified broadcast channel worker GC tests to introduce timeout logic based on build type and add stderr filtering to handle ASAN warnings. Updated existing subprocess-based tests and added a new regression test for concurrent register/dispatch/terminate BroadcastChannel operations. Changes
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts`:
- Line 168: The hardcoded sleep line "await new Promise(resolve =>
setTimeout(resolve, 50));" makes the test flaky; instead have the worker/child
emit an explicit completion signal (e.g., postMessage or console.log a unique
token like "GC_DONE") when the queued cross-thread work finishes, and replace
the setTimeout with awaiting that signal in the test (e.g., listen for
worker.onmessage / BroadcastChannel.onmessage or
childProcess.stdout/once('message') for the unique token), then remove the
timeout-based wait.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3cc0d175-33eb-456e-b751-c6872ff6f1cc
📒 Files selected for processing (2)
src/bun.js/bindings/webcore/BroadcastChannel.cpptest/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts
|
Found 1 issue this PR may fix:
🤖 Generated with Claude Code |
|
This PR may be a duplicate of:
🤖 Generated with Claude Code |
There was a problem hiding this comment.
This PR fixes real data races in thread-safety-critical concurrent code — locking gaps in a shared HashMap and a raw- capture in an async task. The fixes are mechanically sound and follow established patterns, but thread-safety fixes in C++ warrant a human look.
Extended reasoning...
Overview
Two targeted fixes in : (A) adds at the three unlocked call sites on the main thread and annotates the accessor with to match ; (B) changes the lambda capture from raw to to prevent a use-after-free if the channel is destroyed before the task runs. The test file gains a new stress test, ASAN stderr filtering, and scaled timeouts.
Security Risks
No security-sensitive paths (auth, crypto, permissions) are touched. The races being fixed are memory-safety bugs (heap-UAF) that could cause crashes; they are not exploitable in the traditional sense but could theoretically be leveraged for memory corruption. The fixes reduce risk.
Level of Scrutiny
Thread-safety fixes in C++ concurrent code deserve careful human review even when the logic is straightforward. Locking must be consistent across all call sites (including any added in the future), and the -capture pattern must interact correctly with the existing refcounting and destruction order. A missed site or incorrect scope could introduce deadlocks or leave races partially fixed.
Other Factors
The fixes follow well-established patterns already present in , , and . The annotation will enforce lock discipline at compile time going forward. No bugs were flagged by the automated system. The PR description is detailed and includes verification steps. Given the correctness of the approach, a quick human sign-off should suffice.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cli/bunx_command.zig`:
- Around line 361-368: The code currently rewrites update_request.name "claude"
even when the user provided an explicit package spec (e.g., claude@1.2.3);
restrict the alias so it only applies for bare invocations by checking that no
explicit package spec was provided (use opts.specified_package == null) before
substituting update_request.name, and apply the same guard to the duplicate
alias logic later (the other update_request.name == "claude" branch) so explicit
specs remain unmodified.
In `@src/zlib.zig`:
- Around line 847-849: The code calls deflateBound(&zlib_reader.zlib,
`@intCast`(input.len)) which truncates input.len on platforms where uLong is
32-bit; before calling deflateBound, check if input.len > `@as`(usize,
`@maxValue`(`@TypeOf`(`@intCast`(input.len)))), and if so call zlib_reader.deinit()
and return a clear error (e.g., error.InputTooLarge or error.InvalidArgument)
instead of casting; update the use-site around
zlib_reader.list.ensureTotalCapacityPrecise and `@intCast`(input.len) to only call
deflateBound with a safe cast after this guard.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0f3a83da-5b37-4bcd-9c66-84db31f5f014
📥 Commits
Reviewing files that changed from the base of the PR and between 785b230 and d6881e9a728a4793ca4b48ccfa3b69c09099a00e.
📒 Files selected for processing (12)
src/bun.js/api/HashObject.zigsrc/bun.js/api/JSBundler.zigsrc/bun.js/api/bun/socket/Listener.zigsrc/bun.js/webcore/Body.zigsrc/bun.js/webcore/blob/read_file.zigsrc/cli/bunx_command.zigsrc/crash_handler.zigsrc/deps/zlib.win32.zigsrc/http/Decompressor.zigsrc/install/bin.zigsrc/zlib.zigtest/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts
The setTimeout(50) was not load-bearing — the assertion is that the subprocess exits cleanly with no ASAN output, which already covers the full process lifetime including any tasks that run during teardown.
d6881e9 to
510f0f4
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts`:
- Around line 10-14: The current filterStderr function removes every stderr line
that startsWith("WARNING: ASAN interferes"), which can hide legitimate later
diagnostics; change filterStderr (the function name) to only strip the first
startup line matching that exact prefix and leave all subsequent lines intact —
e.g., split stderr into lines, if the first non-empty line startsWith("WARNING:
ASAN interferes") drop just that first matching line, then rejoin the remaining
lines so later stderr that also begins with the same text is preserved for test
diffs.
- Around line 120-156: The test currently has each worker post its 20-message
burst before signaling "ready", so the main thread's extraChannels churn does
not overlap the worker-side HashMap .get() window; change the handshake so
workerCode waits for a parent "go" message before running the for loop and only
then posts "ready" after the burst, and update the main test logic to spawn
workers, create/close extraChannels, then send "go" to all workers (via
worker.postMessage("go")) and await their ready promises—this guarantees the
extraChannels churn occurs while worker-side channelToContextIdentifier().get()
is exercised.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e0626d1a-7f71-4675-bd8c-378fe650b419
📥 Commits
Reviewing files that changed from the base of the PR and between d6881e9a728a4793ca4b48ccfa3b69c09099a00e and 510f0f4.
📒 Files selected for processing (1)
test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts
There was a problem hiding this comment.
LGTM — targeted data-race fixes following the existing lock pattern.
Extended reasoning...
Overview
This PR modifies src/bun.js/bindings/webcore/BroadcastChannel.cpp and the accompanying test file. The C++ changes are minimal and surgical: (A) adding Locker locker { channelToContextIdentifierLock } at three call sites that were previously unguarded plus a WTF_REQUIRES_LOCK annotation on the accessor, and (B) changing a raw this capture in a postTaskTo lambda to protectedThis = Ref { *this } to extend object lifetime across the async task.
Security risks
No security-sensitive paths are touched. The changes harden thread safety rather than weakening it. The lock additions mirror the identical pattern already used for allBroadcastChannelsLock in the same file, and the Ref { *this } pattern matches MessagePort.cpp, Performance.cpp, and WebSocket.cpp.
Level of scrutiny
Concurrency fixes merit careful review, but the scope here is narrow: three identical Locker additions around HashMap calls, one WTF_REQUIRES_LOCK annotation, and one lambda capture change. Each fix has a clear mechanical parallel in the same file or nearby sibling files. The PR description thoroughly explains both races with a table of lock sites and the relevant code paths.
Other factors
No bugs were found by the automated bug hunting system. All CodeRabbit inline comments were addressed and resolved by the author with clear technical justification. The test additions follow established repo conventions for ASAN-compatible subprocess tests. The current repo HEAD already includes commits matching this PR title, indicating the changes have been validated in the main branch.
|
CI status (build #46260): 263 pass / 3 fail — none related to this change.
My token is read-only, so I can't retry the darwin job — it just needs a re-run for the 502. |
…essage lifetime (oven-sh#29441) ## What does this PR do? Fixes two data races in `BroadcastChannel.cpp` that surface as ASAN heap-use-after-free in `test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts`. ### Bug A — `channelToContextIdentifier` HashMap one-sided locking The prior `ThreadSafeWeakPtr` fix only covered `allBroadcastChannels()`. The second global, `channelToContextIdentifier()`, has its own lock — but it was taken at only 1 of 4 call sites: | Site | Thread | Lock? | | --- | --- | --- | | `registerChannel` `.add()` | main | ❌ | | `unregisterChannel` `.remove()` | main | ❌ | | `dispatchMessageTo` `.get()` | main | ❌ | | `contextIdForBroadcastChannelId` `.get()` | worker (via `ensureOnContextThread` → `dispatchMessage`) | ✅ | When main rehashes the HashMap (add/remove during worker spawn/terminate) while a worker reads it, the worker walks a freed bucket array → ASAN heap-UAF inside `WTF::HashTable`. The accessor was also missing `WTF_REQUIRES_LOCK`, so `-Wthread-safety` never flagged this. **Fix**: add `Locker locker { channelToContextIdentifierLock };` at the three unlocked sites and annotate the accessor with `WTF_REQUIRES_LOCK(channelToContextIdentifierLock)` to match `allBroadcastChannels()`. ### Bug B — `dispatchMessage` captures raw `this` in async task `dispatchMessage` posts a task with `[this, message = ...]` — raw `this`, no `Ref { *this }`. The caller (`dispatchMessageTo`'s inner lambda) holds a strong `RefPtr` from the `ThreadSafeWeakPtr` lookup, but that ref is dropped when the outer lambda returns. During worker terminate the JS wrapper is destroyed → refcount 0 → `~BroadcastChannel` → the queued task reads freed `this->m_isClosed` and calls `this->dispatchEvent()`. **Fix**: capture `protectedThis = Ref { *this }` in the `postTaskTo` lambda, matching the pattern in `MessagePort.cpp`, `Performance.cpp`, and `WebSocket.cpp`. ## How did you verify your code works? - `bun bd test test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts` — 3/3 pass, verified stable across 3 consecutive runs under debug+ASAN - `bun bd test test/js/web/broadcastchannel/broadcast-channel.test.ts` — 10/11 pass; the one failure (`broadcast channel worker wait`) is pre-existing on `main` under debug+ASAN (it uses `Bun.sleepSync(500)` which isn't enough for an ASAN worker to start) and is unrelated to this change ## Test changes - Added a stress test that churns channel registrations (forcing HashMap rehashes) while workers cross-post (reaching the worker-side map read), then terminates workers mid-dispatch (leaving queued tasks whose `this` would otherwise dangle). - Filtered the unconditional ASAN startup warning from child-process `stderr` so `expect(stderr).toBe("")` holds on ASAN builds — same pattern as `fetch-abort-queued.test.ts` / `string-decoder.test.js`. - Scaled timeouts for `isDebug || isASAN` — worker spawn under debug+ASAN is ~5–10× slower; the existing tests were borderline at the 5s default. Note: both races are highly timing-dependent (a HashMap rehash must land mid-`get()`); 20 local ASAN runs on macOS did not repro before the fix. The new stress test maximises contention but is not guaranteed to fail without the fix on every platform. --------- Co-authored-by: robobun <robobun@users.noreply.github.com>
…essage lifetime (oven-sh#29441) ## What does this PR do? Fixes two data races in `BroadcastChannel.cpp` that surface as ASAN heap-use-after-free in `test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts`. ### Bug A — `channelToContextIdentifier` HashMap one-sided locking The prior `ThreadSafeWeakPtr` fix only covered `allBroadcastChannels()`. The second global, `channelToContextIdentifier()`, has its own lock — but it was taken at only 1 of 4 call sites: | Site | Thread | Lock? | | --- | --- | --- | | `registerChannel` `.add()` | main | ❌ | | `unregisterChannel` `.remove()` | main | ❌ | | `dispatchMessageTo` `.get()` | main | ❌ | | `contextIdForBroadcastChannelId` `.get()` | worker (via `ensureOnContextThread` → `dispatchMessage`) | ✅ | When main rehashes the HashMap (add/remove during worker spawn/terminate) while a worker reads it, the worker walks a freed bucket array → ASAN heap-UAF inside `WTF::HashTable`. The accessor was also missing `WTF_REQUIRES_LOCK`, so `-Wthread-safety` never flagged this. **Fix**: add `Locker locker { channelToContextIdentifierLock };` at the three unlocked sites and annotate the accessor with `WTF_REQUIRES_LOCK(channelToContextIdentifierLock)` to match `allBroadcastChannels()`. ### Bug B — `dispatchMessage` captures raw `this` in async task `dispatchMessage` posts a task with `[this, message = ...]` — raw `this`, no `Ref { *this }`. The caller (`dispatchMessageTo`'s inner lambda) holds a strong `RefPtr` from the `ThreadSafeWeakPtr` lookup, but that ref is dropped when the outer lambda returns. During worker terminate the JS wrapper is destroyed → refcount 0 → `~BroadcastChannel` → the queued task reads freed `this->m_isClosed` and calls `this->dispatchEvent()`. **Fix**: capture `protectedThis = Ref { *this }` in the `postTaskTo` lambda, matching the pattern in `MessagePort.cpp`, `Performance.cpp`, and `WebSocket.cpp`. ## How did you verify your code works? - `bun bd test test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts` — 3/3 pass, verified stable across 3 consecutive runs under debug+ASAN - `bun bd test test/js/web/broadcastchannel/broadcast-channel.test.ts` — 10/11 pass; the one failure (`broadcast channel worker wait`) is pre-existing on `main` under debug+ASAN (it uses `Bun.sleepSync(500)` which isn't enough for an ASAN worker to start) and is unrelated to this change ## Test changes - Added a stress test that churns channel registrations (forcing HashMap rehashes) while workers cross-post (reaching the worker-side map read), then terminates workers mid-dispatch (leaving queued tasks whose `this` would otherwise dangle). - Filtered the unconditional ASAN startup warning from child-process `stderr` so `expect(stderr).toBe("")` holds on ASAN builds — same pattern as `fetch-abort-queued.test.ts` / `string-decoder.test.js`. - Scaled timeouts for `isDebug || isASAN` — worker spawn under debug+ASAN is ~5–10× slower; the existing tests were borderline at the 5s default. Note: both races are highly timing-dependent (a HashMap rehash must land mid-`get()`); 20 local ASAN runs on macOS did not repro before the fix. The new stress test maximises contention but is not guaranteed to fail without the fix on every platform. --------- Co-authored-by: robobun <robobun@users.noreply.github.com>
What does this PR do?
Fixes two data races in
BroadcastChannel.cppthat surface as ASAN heap-use-after-free intest/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts.Bug A —
channelToContextIdentifierHashMap one-sided lockingThe prior
ThreadSafeWeakPtrfix only coveredallBroadcastChannels(). The second global,channelToContextIdentifier(), has its own lock — but it was taken at only 1 of 4 call sites:registerChannel.add()unregisterChannel.remove()dispatchMessageTo.get()contextIdForBroadcastChannelId.get()ensureOnContextThread→dispatchMessage)When main rehashes the HashMap (add/remove during worker spawn/terminate) while a worker reads it, the worker walks a freed bucket array → ASAN heap-UAF inside
WTF::HashTable. The accessor was also missingWTF_REQUIRES_LOCK, so-Wthread-safetynever flagged this.Fix: add
Locker locker { channelToContextIdentifierLock };at the three unlocked sites and annotate the accessor withWTF_REQUIRES_LOCK(channelToContextIdentifierLock)to matchallBroadcastChannels().Bug B —
dispatchMessagecaptures rawthisin async taskdispatchMessageposts a task with[this, message = ...]— rawthis, noRef { *this }. The caller (dispatchMessageTo's inner lambda) holds a strongRefPtrfrom theThreadSafeWeakPtrlookup, but that ref is dropped when the outer lambda returns. During worker terminate the JS wrapper is destroyed → refcount 0 →~BroadcastChannel→ the queued task reads freedthis->m_isClosedand callsthis->dispatchEvent().Fix: capture
protectedThis = Ref { *this }in thepostTaskTolambda, matching the pattern inMessagePort.cpp,Performance.cpp, andWebSocket.cpp.How did you verify your code works?
bun bd test test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts— 3/3 pass, verified stable across 3 consecutive runs under debug+ASANbun bd test test/js/web/broadcastchannel/broadcast-channel.test.ts— 10/11 pass; the one failure (broadcast channel worker wait) is pre-existing onmainunder debug+ASAN (it usesBun.sleepSync(500)which isn't enough for an ASAN worker to start) and is unrelated to this changeTest changes
thiswould otherwise dangle).stderrsoexpect(stderr).toBe("")holds on ASAN builds — same pattern asfetch-abort-queued.test.ts/string-decoder.test.js.isDebug || isASAN— worker spawn under debug+ASAN is ~5–10× slower; the existing tests were borderline at the 5s default.Note: both races are highly timing-dependent (a HashMap rehash must land mid-
get()); 20 local ASAN runs on macOS did not repro before the fix. The new stress test maximises contention but is not guaranteed to fail without the fix on every platform.