Use visitor.vm() in GlobalObject::visitChildren; make HTTPHeaderIdentifiers eager #32407

Merged
Jarred-Sumner merged 3 commits into main from farm/e4e823db/fix-visitchildren-clientdata-race
Jun 16, 2026

Conversation

@robobun

@robobun robobun commented Jun 16, 2026

Collaborator

Crash

Sentry BUN-2V1E: segfault inside WTF::TypeCastTraits<JSVMClientData>::isType reached from Zig::GlobalObject::visitChildrenImpl on a concurrent GC helper thread. 695 lifetime events (26 in the last 24h), 100% Windows x64, 1.2.17 through 1.3.14. 31% of events carry both workers_spawned=True and workers_terminated=True vs a ~3% baseline, pointing at worker-termination churn. Also seen intermittently in CI as the broadcast-channel-worker-gc flake (b03f1e6 is a rekick for it).

WTF::ParallelHelperPool::Thread::work
JSC::Heap::runBeginPhase lambda
JSC::SlotVisitor::drainFromShared
JSC::SlotVisitor::drain
JSC::SlotVisitor::visitChildren
JSC::MethodTable::visitChildren
Zig::GlobalObject::visitChildren
Zig::GlobalObject::visitChildrenImpl
WebCore::clientData(JSC::VM&)
WTF::downcast<JSVMClientData>
WTF::is<JSVMClientData>
TypeCastTraits<JSVMClientData>::isType   <-- SEGV

Cause

visitChildrenImpl ran:

WebCore::clientData(thisObject->vm())->httpHeaderIdentifiers().visit<Visitor>(visitor);

Two problems on this line:

1. thisObject->vm() dereferences cell state on the marker thread. JSGlobalObject::vm() returns *m_vm (a raw VM* const stored on the cell); clientData() then does downcast<JSVMClientData>(vm.clientData) whose RELEASE_ASSERT(!source || is<Target>(*source)) calls the virtual isWebCoreJSClientData(). The neighbouring visitGlobalObjectMember(unique_ptr) overload already guards a window where the concurrent marker visits a Zig::GlobalObject picked up via conservative stack scan while its IsoSubspace slot is being recycled; in that same window m_vm can read stale bytes, resolving to a garbage clientData whose vtable load faults. visitor.vm() (= m_heap.vm()) is guaranteed alive for the duration of marking and does not depend on the visited cell at all; this is how JSC's own visitChildren implementations (FunctionExecutable, JSWeakObjectRef, Structure) fetch the VM on the marker thread.

2. httpHeaderIdentifiers() was an unlocked lazy std::optional::emplace() called from both the mutator (NodeHTTP.cpp header assignment) and concurrent GC helper threads. With more than one Zig::GlobalObject in a VM (ShadowRealm, test-isolation swap, bake) distinct parallel marker helpers each visit a different global and all call httpHeaderIdentifiers() on the same JSVMClientData, so two threads can enter emplace() on the same storage. The HTTPHeaderIdentifiers constructor only runs ~90 LazyProperty::initLater() calls (each a single tagged-pointer store), so there is nothing worth deferring.
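
The second problem can be illustrated with a minimal sketch. It assumes simplified stand-in types: `HeaderIdentifiers`, `LazyClientData`, `EagerClientData` and `accessorIsStable` are hypothetical names chosen for illustration, not Bun's actual classes. The lazy variant writes to shared storage on first use, while the eager variant only reads.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for WebCore::HTTPHeaderIdentifiers: cheap to
// default-construct, so there is little to gain from deferring it.
struct HeaderIdentifiers {
    int slots[4] = { 0, 0, 0, 0 };
};

// Lazy shape: the first caller constructs the value. If two threads both
// observe an empty optional, both call emplace() on the same storage,
// which is a data race without a lock.
class LazyClientData {
public:
    HeaderIdentifiers& httpHeaderIdentifiers()
    {
        if (!m_identifiers)
            m_identifiers.emplace(); // unsynchronized write on first use
        return *m_identifiers;
    }
    bool isConstructed() const { return m_identifiers.has_value(); }

private:
    std::optional<HeaderIdentifiers> m_identifiers;
};

// Eager shape: the member is built in the owner's constructor, before the
// object is shared with any other thread, so the accessor is a plain read.
class EagerClientData {
public:
    HeaderIdentifiers& httpHeaderIdentifiers() { return m_identifiers; }

private:
    HeaderIdentifiers m_identifiers;
};

// Returns true when repeated calls hand back the same object.
inline bool accessorIsStable(EagerClientData& data)
{
    return &data.httpHeaderIdentifiers() == &data.httpHeaderIdentifiers();
}
```

In the lazy shape the accessor has a side effect the first time it runs; in the eager shape it has none, which is what makes it safe to call from a marker thread in this sketch.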

Fix

  • ZigGlobalObject.cpp: fetch the VM via visitor.vm() instead of thisObject->vm().
  • BunClientData.{h,cpp}: m_httpHeaderIdentifiers is now a plain eagerly-constructed member; httpHeaderIdentifiers() is an inline accessor.
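
The first bullet can be sketched with mock types. `VM`, `Heap`, `Visitor`, `GlobalCell` and `vmForMarking` below are hypothetical stand-ins, not the JSC classes; the sketch only shows that a lookup routed through the visitor never reads a field of the visited cell, so a cell holding stale bytes cannot affect it.

```cpp
#include <cassert>

struct VM {
    int id;
};

// Stand-in for the heap: holds a reference to the VM that owns it.
struct Heap {
    explicit Heap(VM& vm) : m_vm(vm) {}
    VM& vm() const { return m_vm; }
    VM& m_vm;
};

// Stand-in for a slot visitor: reaches the VM through the heap it marks for.
struct Visitor {
    explicit Visitor(Heap& heap) : m_heap(heap) {}
    VM& vm() const { return m_heap.vm(); }
    Heap& m_heap;
};

// Stand-in for a global object cell: caches a raw VM pointer in cell memory.
struct GlobalCell {
    VM* m_vm;
    VM& vm() const { return *m_vm; } // depends on the cell's bytes being valid
};

// Marker-side lookup that ignores the cell entirely.
inline VM& vmForMarking(const Visitor& visitor, const GlobalCell&)
{
    return visitor.vm();
}
```

A cell whose `m_vm` is null or garbage would fault in `GlobalCell::vm()`, but `vmForMarking` still returns the live VM because it never dereferences the cell.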

Verification

The race window is too narrow to trip deterministically on Linux. An honest probe against the unfixed debug (ASAN) build, with Malloc=1 + BUN_JSC_collectContinuously=1 + BUN_JSC_numberOfGCMarkers=8:

  • 5 iterations of an 8-round × 6-worker BroadcastChannel create/terminate/GC stress: clean.
  • 8 iterations of a 100-round × 8-ShadowRealm (parallel-marker emplace) stress: clean.

So there is no fail-before proof to hand the gate; the crash signature is Windows-specific and timing-dependent. The fix is nonetheless clearly correct on inspection:

  • visitor.vm() is the JSC convention for the marker thread and cannot read through the visited cell.
  • An unlocked std::optional::emplace() reachable from two threads is a data race in any memory model.

A new stress test in test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts hammers the exact path (multiple globals per VM via ShadowRealm, worker churn, forced parallel markers, Malloc=1 on non-Windows) so a future regression on Windows CI will show up where the signature has already been observed.

bun bd test test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts   # 4 pass
bun bd test test/js/node/http/node-http.test.ts -t headers                     # 5 pass (HTTPHeaderIdentifiers path)
bun bd test test/js/node/http/numeric-header.test.ts                           # 1 pass

Related

Checked #31990 / #32071 / #32082 (worker event-loop enqueue after terminate, Strong<> releases before VM teardown): none touch visitChildrenImpl or vm.clientData access from the marker thread. No open PR addresses this crash.

The issue-matcher suggested four candidates; assessment against the actual stack:

None are auto-closed by this PR. #20641 and #20786 suggest there may be other callers of clientData() that can see a bad vm.clientData on Windows; those are separate paths and out of scope here.

…TPHeaderIdentifiers

Zig::GlobalObject::visitChildrenImpl runs on concurrent GC helper threads
and reached JSVMClientData via clientData(thisObject->vm()). That path
dereferences JSGlobalObject::m_vm (a raw VM* const stored on the cell),
then does a RELEASE_ASSERT downcast that calls the virtual
isWebCoreJSClientData() on vm.clientData. If the cell was picked up by
conservative scan while its IsoSubspace slot is mid-recycle (the same
window the neighbouring visitGlobalObjectMember(unique_ptr) null-guard
covers), the m_vm read can resolve to garbage and the vcall faults inside
TypeCastTraits<JSVMClientData>::isType. Production crash reports show this
signature on Windows x64 under worker spawn/terminate churn.

Use visitor.vm() instead: AbstractSlotVisitor::vm() returns m_heap.vm(),
which is guaranteed alive for the duration of marking and does not depend
on the visited cell's state. This matches how JSC's own visitChildren
implementations (FunctionExecutable, JSWeakObjectRef, Structure) obtain the
VM on the marker thread.

Separately, JSVMClientData::httpHeaderIdentifiers() lazily
std::optional::emplace()'d on first call with no synchronization, and was
reachable from both the mutator (NodeHTTP header assignment) and parallel
GC marker threads. With multiple Zig::GlobalObject cells in one VM
(ShadowRealm, test isolation, bake), distinct marker helpers can each visit
a different global and enter the emplace concurrently. HTTPHeaderIdentifiers
is now a plain eagerly-constructed member; its constructor only runs ~90
LazyProperty::initLater() calls (each a single tagged-pointer store), so
there is nothing worth deferring.
@robobun

robobun commented Jun 16, 2026

Collaborator Author
Updated 12:38 PM PT - Jun 16th, 2026

@robobun, your commit 32d5ce4 has 2 failures in Build #62837 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 32407

That installs a local version of the PR into your bun-32407 executable, so you can run:

bun-32407 --bun

@github-actions

Contributor

Found 4 issues this PR may fix:

  1. bun crash #20641 - Stack trace shows crash at exactly BunClientData.h:141 in WTF::TypeCastTraits<JSVMClientData>::isType via bunVMConcurrently: the exact concurrent GC visitor code path fixed by the change from thisObject->vm() to visitor.vm()
  2. Bun crashes when...? #20786 - Stack trace shows the same TypeCastTraits<JSVMClientData>::isType crash at BunClientData.h:141 via JSCell::subspaceFor, triggered by stale m_vm pointer access during GC marking
  3. Flaky SIGILL crash in JSC garbage collector during bun test on Linux x64 (1.3.7 and 1.3.9) #27312 - Flaky SIGILL crash in SlotVisitor::drain on a GC marker thread (ParallelHelperPool::Thread::work), consistent with concurrent GC marker reading stale m_vm bytes via thisObject->vm() in visitChildrenImpl
  4. panic: Segmentation fault at address 0xD — "multiple threads are crashing" under Worker spawn/terminate churn (1.3.14, long-running server) #31880 - Segfault under worker spawn/terminate churn with "multiple threads are crashing" — the exact scenario where recycled IsoSubspace slots cause thisObject->vm() to read stale bytes on the GC marker thread

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #20641
Fixes #20786
Fixes #27312
Fixes #31880

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Contributor


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: eb70a9fe-820a-41f6-8b1f-bd4fd075a8aa

📥 Commits

Reviewing files that changed from the base of the PR and between bd0fbbd and 32d5ce4.

📒 Files selected for processing (1)
  • test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts

Walkthrough

Fixes a concurrent GC crash by changing HTTPHeaderIdentifiers storage in JSVMClientData from lazy std::optional to an eagerly-constructed value, inlining the accessor, removing the out-of-line implementation, and updating GlobalObject::visitChildrenImpl to obtain the VM from visitor.vm() rather than thisObject->vm(). A regression stress test is added.

Changes

Concurrent GC Race Fix

| Layer / File(s) | Summary |
|---|---|
| Eager HTTPHeaderIdentifiers initialization: src/jsc/bindings/BunClientData.h, src/jsc/bindings/BunClientData.cpp | Changes the backing member from std::optional<WebCore::HTTPHeaderIdentifiers> to a direct value, inlines the httpHeaderIdentifiers() accessor with comments explaining why eager construction is required, and removes the now-dead lazy-init implementation from the .cpp file. |
| GC visitor uses visitor.vm() instead of thisObject->vm(): src/jsc/bindings/ZigGlobalObject.cpp | In GlobalObject::visitChildrenImpl, replaces WebCore::clientData(thisObject->vm()) with WebCore::clientData(visitor.vm()) to prevent the concurrent GC marker from dereferencing a potentially stale JSGlobalObject::m_vm pointer during conservative-scan or mid-recycle windows. |
| Regression stress test: test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts | Adds imports for test configuration and a stress test that spawns worker VMs creating a BroadcastChannel, terminates them under parallel marking load (BUN_JSC_numberOfGCMarkers=8, extra ShadowRealm globals, allocation pressure), and asserts clean exit with no stderr output. |

Suggested reviewers

  • Jarred-Sumner
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the two main changes: using visitor.vm() in GlobalObject::visitChildren and making HTTPHeaderIdentifiers eager. |
| Description check | ✅ Passed | The PR description comprehensively covers the crash cause, the fix implementation, verification steps, and related issues. It exceeds the minimal template requirements with detailed technical context. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts`:
- Line 275: Remove the filterStderr function call from the assertion on stderr
in the broadcast-channel-worker-gc.test.ts file. Change
expect(filterStderr(stderr)).toBe("") to expect(stderr).toBe("") to directly
assert that stderr is empty, since the ASAN startup warning that filterStderr
was designed to filter out no longer occurs in subprocess tests after the fix
from PR `#30412`.
- Line 268: The env object in the broadcast-channel-worker-gc.test.ts test is
missing documentation explaining the purpose of the Malloc: "1" setting. Add a
comment above or inline with the env configuration that explains that Malloc:
"1" routes bmalloc through the system allocator to enable AddressSanitizer
(ASAN) detection of memory safety issues in sanitizer-enabled builds. Follow the
same documentation pattern used in similar tests like websocket-server.test.ts
to maintain consistency across the test suite.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fd099991-b765-46bf-8330-c9829e25d86d

📥 Commits

Reviewing files that changed from the base of the PR and between 78f0fff and bd0fbbd.

📒 Files selected for processing (4)
  • src/jsc/bindings/BunClientData.cpp
  • src/jsc/bindings/BunClientData.h
  • src/jsc/bindings/ZigGlobalObject.cpp
  • test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts
💤 Files with no reviewable changes (1)
  • src/jsc/bindings/BunClientData.cpp


@claude claude Bot left a comment

Contributor


I didn't find any issues, but this touches concurrent GC marking internals (visitChildrenImpl thread-safety, JSVMClientData member initialization order) with no deterministic fail-before repro — worth a human eye on the JSC reasoning.

Extended reasoning...

Overview

This PR addresses a Sentry crash (BUN-2V1E) in Zig::GlobalObject::visitChildrenImpl on concurrent GC helper threads. Two changes: (1) switch from thisObject->vm() to visitor.vm() when fetching client data during marking, and (2) make JSVMClientData::m_httpHeaderIdentifiers an eagerly-constructed member instead of a lazy std::optional to eliminate an unsynchronized emplace() reachable from both mutator and marker threads. A new stress test is added.

Security risks

None. This is internal GC/runtime correctness with no user-controlled input surface.

Level of scrutiny

High. Concurrent GC marking is one of the most subtle areas in the runtime. The reasoning is well-argued and matches JSC convention (visitor.vm() is indeed how upstream visitChildren implementations fetch the VM), and the unlocked std::optional::emplace() is a clear data race. However:

  • The author explicitly states there is no fail-before reproduction on Linux/ASAN; correctness rests on inspection of the IsoSubspace recycling / conservative-scan window described in the neighbouring visitGlobalObjectMember(unique_ptr) guard.
  • Eager construction of HTTPHeaderIdentifiers moves ~90 LazyProperty::initLater calls into the JSVMClientData constructor and changes member declaration position. I checked that the ctor takes no VM/heap dependency (just stores tagged function pointers) and that declaration order keeps it after m_heapData / subspaces, so destruction ordering looks unaffected — but member-ordering in this class is already load-bearing per the existing comments, so a second pair of eyes is warranted.
  • The added .template visit<Visitor> qualifier is a correct dependent-name fix but is a syntactic change in a hot template path.
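
For readers unfamiliar with the `.template` qualifier mentioned in the last bullet, here is a minimal sketch of the language rule. The types `CountingVisitor`, `Identifiers`, `ClientData` and the function `visitIdentifiers` are hypothetical and unrelated to Bun's real declarations. When the object expression has a type that depends on a template parameter, the compiler cannot tell during parsing that `visit` names a member template, so the call needs the disambiguator.

```cpp
#include <cassert>

// Hypothetical visitor that just counts how often it was invoked.
struct CountingVisitor {
    int visited = 0;
};

// Hypothetical container with a member function template, like visit<Visitor>.
struct Identifiers {
    template<typename Visitor>
    void visit(Visitor& visitor) { ++visitor.visited; }
};

template<typename T>
struct ClientData {
    T identifiers;
    T& httpHeaderIdentifiers() { return identifiers; }
};

// The type of data.httpHeaderIdentifiers() depends on T. Without `.template`
// the `<` after `visit` would be parsed as a less-than operator, so the
// keyword is required to mark `visit` as a member template.
template<typename T, typename Visitor>
void visitIdentifiers(ClientData<T>& data, Visitor& visitor)
{
    data.httpHeaderIdentifiers().template visit<Visitor>(visitor);
}
```

The qualifier changes only how the expression is parsed; the generated call is the same member function in either spelling.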

Other factors

No bugs flagged by the automated review. CI build is still in progress. No prior human reviews on the thread. Given the criticality of concurrent GC code paths and the absence of a deterministic repro, deferring to a human reviewer familiar with JSC GC internals.

bmalloc's SystemHeap is unimplemented on Windows and setting Malloc=1
there hits a RELEASE_BASSERT. Windows has no ASAN lane anyway, so the
env var only helps on the POSIX ASAN builds.

@claude claude Bot left a comment

Contributor


I didn't find any issues — the visitor.vm() switch and eager HTTPHeaderIdentifiers construction both look correct on inspection — but this touches concurrent GC marking and IsoSubspace-recycle semantics without a deterministic repro, so it's worth a human pass from someone who knows JSC's marker-thread invariants.

Extended reasoning...

Overview

This PR addresses a Sentry-reported segfault (TypeCastTraits<JSVMClientData>::isType) in Zig::GlobalObject::visitChildrenImpl on concurrent GC helper threads. It makes three changes:

  • src/jsc/bindings/ZigGlobalObject.cpp: replaces thisObject->vm() with visitor.vm() when fetching clientData during marking, and adds the .template keyword on the dependent visit<Visitor> call.
  • src/jsc/bindings/BunClientData.{h,cpp}: converts m_httpHeaderIdentifiers from a lazily-emplaced std::optional to an eagerly-constructed plain member, inlining the accessor and removing the out-of-line lazy init.
  • test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts: adds a stress test (ShadowRealm globals + worker churn + numberOfGCMarkers=8 + Malloc=1 on non-Windows) that the author explicitly notes is a guard rather than a deterministic fail-before proof.

Security risks

None. This is a memory-safety/concurrency fix in internal GC bindings; no auth, crypto, input parsing, or external attack surface is touched.

Level of scrutiny

High. visitChildrenImpl runs on parallel GC marker threads and is on the hot path for every collection. The reasoning about conservative-scan-during-IsoSubspace-recycle reading a stale JSGlobalObject::m_vm is plausible and consistent with the existing visitGlobalObjectMember(unique_ptr) guard just above it, and visitor.vm() is indeed the JSC convention for marker-thread VM access. Likewise, the unlocked std::optional::emplace() reachable from both mutator and parallel markers is a textbook data race, and HTTPHeaderIdentifiers() only runs ~90 LazyProperty::initLater pointer stores, so eager construction is cheap. But the PR explicitly states there is no fail-before reproduction — the fix is argued by inspection. Concurrent-GC correctness arguments of this kind benefit from review by someone who owns the JSC/heap integration.

Other factors

  • The bug-hunting system found no issues.
  • CodeRabbit's two nits (document Malloc=1, drop filterStderr) were addressed in 1d66826 (the first fully, the second declined — filterStderr is still used, consistent with the other tests in the same file).
  • The musl build failure in CI is an unrelated LTO/linker infra issue, not caused by this diff.
  • HTTPHeaderIdentifiers has a no-arg default constructor that only calls initLater, so moving it to eager member construction in JSVMClientData doesn't introduce ordering hazards.
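
The ordering concern raised in both reviews rests on a standard C++ rule: non-static members are constructed in declaration order and destroyed in reverse. A small sketch, using hypothetical names (`Tracked`, `Owner`, `eventLog`) rather than the real JSVMClientData layout, makes the order observable:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Shared log of construction ("+name") and destruction ("-name") events.
inline std::vector<std::string>& eventLog()
{
    static std::vector<std::string> log;
    return log;
}

// Records its own construction and destruction in the log.
struct Tracked {
    explicit Tracked(std::string name)
        : m_name(std::move(name))
    {
        eventLog().push_back("+" + m_name);
    }
    ~Tracked() { eventLog().push_back("-" + m_name); }
    std::string m_name;
};

// Members are built top to bottom and torn down bottom to top, so a member
// declared later may rely on earlier members for its whole lifetime.
struct Owner {
    Tracked heapData { "heapData" };
    Tracked identifiers { "identifiers" };
};
```

Creating and destroying one `Owner` logs `+heapData`, `+identifiers`, `-identifiers`, `-heapData`, which is why the position of a newly eager member within the class matters.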

Given the subtlety of concurrent-marker invariants and the absence of a deterministic repro, I'm deferring rather than approving.

…e-max.test.ts timeouts, also seen on unrelated builds 62768/62763)
@robobun

robobun commented Jun 16, 2026

Collaborator Author

CI status: the diff itself is green. Both build 62826 and the re-roll 62837 fail only on two darwin-14-x64 timeouts (terminal.test.ts and node-http-backpressure-max.test.ts), neither of which touches GC marking, visitChildren, JSVMClientData, or HTTPHeaderIdentifiers. The identical pair also failed on unrelated PR build 62768, and node-http-backpressure-max on 62763, so this is a persistent lane flake. broadcast-channel-worker-gc.test.ts (the file this PR touches) passed on every lane in both runs.

Ready for maintainer review; the remaining red is unrelated to this change.

@Jarred-Sumner Jarred-Sumner merged commit b0fef2f into main Jun 16, 2026
75 of 77 checks passed
@Jarred-Sumner Jarred-Sumner deleted the farm/e4e823db/fix-visitchildren-clientdata-race branch June 16, 2026 20:48