Release RuntimeState's JSC handles before tearing down the VM #31990

Open
robobun wants to merge 7 commits into main from farm/70a182aa/fix-worker-shutdown-handleset-uaf

robobun (Collaborator) commented Jun 8, 2026

What does this PR do?

Fixes the Sentry crashes BUN-3DSK and BUN-3END: segfault during WebWorker shutdown in

WTF::SentinelLinkedList<JSC::HandleNode>::remove
JSC::HandleSet::deallocate
Bun__StrongRef__delete
<bun_jsc::strong::Strong>::drop
drop_in_place<bun_sql_jsc::postgres::postgres_sql_context::PostgresSQLContext>
drop_in_place<bun_runtime::jsc_hooks::RuntimeState>
bun_runtime::jsc_hooks::deinit_runtime_state
<bun_jsc::virtual_machine::VirtualMachine>::destroy
<bun_jsc::web_worker::WebWorker>::shutdown

Repro: any worker that loads bun:sql (touching Bun.SQL is enough: the internal module's top-level init() stores Strong refs in the per-VM MySQL/Postgres contexts) and then exits:

const w = new Worker("data:text/javascript," + encodeURIComponent("Bun.SQL; postMessage('loaded');"));

Cause: WebWorker::shutdown step 3 (WebWorker__teardownJSCVM) derefs the JSC VM to zero, destroying the Heap and its HandleSet. Step 5 (VirtualMachine::destroy) then calls deinit_runtime_state, which drops the Box<RuntimeState>. Two of its fields still own JSC GC handles at that point:

  • sql_rare: the MySQL/Postgres contexts' on_query_resolve_fn / on_query_reject_fn Strongs (the observed crash),
  • global_dns_data: dropping it runs ares_destroy, whose EDESTRUCTION callbacks settle promise Strongs.

RareData.s3_default_client (cached by the Bun.s3 getter) has the same problem: RareData is also dropped in destroy(), so a worker that touched Bun.s3 hits the identical UAF. It is released in the same pre-teardown step on both paths.

Strong's release path (Bun__StrongRef__delete -> HandleSet::heapFor(slot)->deallocate(slot)) reads the freed HandleBlock and writes the freed HandleSet's free list. The same ordering exists on the main thread under BUN_DESTRUCT_VM_ON_EXIT=1 (global_exit runs Zig__GlobalObject__destructOnExit before destroy()), which ASAN CI lanes enable.

Fix: new RuntimeHooks::release_runtime_state_js_handles slot, called from both teardown paths after the last JS runs (socket-group close callbacks, so SQL on_close can still dispatch) and before the VM deref. It releases the SQL context Strongs and drops the DNS GlobalData while the JSC heap is alive. This mirrors the existing cancel_all_timers pre-teardown hook, which already handles the same ordering for timer handles.
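The ordering constraint can be illustrated with a toy model. This is a minimal sketch, not Bun's code: `ToyHandleSet`, `ToyStrong`, `teardownUnfixed`, and `teardownFixed` are hypothetical stand-ins for JSC's HandleSet, the Strong wrapper, and the two shutdown orderings described above. Where the real bug touches freed memory, the model throws.

```typescript
// Toy model of the teardown ordering. All names are hypothetical stand-ins;
// none of this is Bun's or JSC's actual API.
class ToyHandleSet {
  alive = true;
  readonly slots = new Set<number>();
  private next = 0;

  allocate(): number {
    const slot = this.next++;
    this.slots.add(slot);
    return slot;
  }

  deallocate(slot: number): void {
    // The real crash reads and writes freed memory here; the model throws instead.
    if (!this.alive) throw new Error("use-after-free: HandleSet already destroyed");
    this.slots.delete(slot);
  }

  destroy(): void {
    this.alive = false;
  }
}

class ToyStrong {
  private readonly set: ToyHandleSet;
  private slot: number | null;

  constructor(set: ToyHandleSet) {
    this.set = set;
    this.slot = set.allocate();
  }

  release(): void {
    if (this.slot === null) return; // releasing twice is a no-op
    this.set.deallocate(this.slot);
    this.slot = null;
  }
}

// Old order: the VM (and its HandleSet) is destroyed first, then state is dropped.
function teardownUnfixed(set: ToyHandleSet, state: ToyStrong[]): void {
  set.destroy();
  for (const strong of state) strong.release(); // throws in the model
}

// Fixed order: a pre-teardown hook releases the handles while the heap is alive.
function teardownFixed(set: ToyHandleSet, state: ToyStrong[]): void {
  for (const strong of state) strong.release(); // the new hook
  set.destroy();
  for (const strong of state) strong.release(); // later drop glue has nothing left to release
}
```

In the model, the later drop is harmless only because the handles were already released, which is the property the pre-teardown hook is meant to provide.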

Dropping GlobalData pre-teardown surfaced two follow-on issues in the same ordering class (caught by ASAN while validating the fix with an in-flight DNS query):

  • reject_later now drops the deferred synchronously when the VM is shutting down. The event loop never ticks again, so the enqueued task could never run; its promise Strong has to drop while the heap is alive, and the stranded box leaked.
  • GlobalData::drop unlinks the resolver's pending-query timeout timer from the per-thread timer heap, which now outlives the box and is still walked by WTFTimer::update during teardown.
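The first of these two changes can be sketched as follows. This is a hedged model in TypeScript rather than the Rust in the PR: `ToyVm`, `Deferred`, `makeDeferred`, and `rejectLater` are hypothetical names, and `released` stands in for "the promise Strong has been dropped".

```typescript
// Hypothetical model of the reject_later change; not Bun's actual types.
interface Deferred {
  settled: boolean;  // the promise was rejected
  released: boolean; // the handle backing the promise was dropped
  reject(): void;    // settles the promise, then releases the handle
  drop(): void;      // releases the handle without settling
}

interface ToyVm {
  shuttingDown: boolean;
  tasks: Array<() => void>; // event-loop queue; never drained once shutting down
}

function makeDeferred(): Deferred {
  const deferred: Deferred = {
    settled: false,
    released: false,
    reject() {
      deferred.settled = true;
      deferred.released = true;
    },
    drop() {
      deferred.released = true;
    },
  };
  return deferred;
}

function rejectLater(vm: ToyVm, deferred: Deferred): void {
  if (vm.shuttingDown) {
    // The loop will not tick again, so an enqueued task would strand the
    // deferred. Drop it now, while the heap backing its handle is still alive.
    deferred.drop();
    return;
  }
  vm.tasks.push(() => deferred.reject());
}
```

The design choice the sketch reflects is that a rejection nobody can observe is not worth delivering during shutdown; releasing the handle in time is what matters.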

Audited the remaining RuntimeState fields (timer heap, ssl_ctx_cache, editor_context, entry_point, transpiler_arena, body_value_pool, isolation_handles): none reach JSC from their drop glue.

Overlap note: #31833 independently adds the same hook for the SQL Strongs and the S3 release, but only on the global_exit path. This PR additionally wires the release into WebWorker::shutdown (the path the Sentry crashes fire on), covers the DNS data and its two follow-on teardown bugs, and adds regression tests. The overlapping hunks are semantically identical, so whichever lands second rebases trivially.

How did you verify your code works?

Three new tests in test/js/web/workers/worker-terminate-lifetime.test.ts (worker + Bun.SQL, worker terminated with in-flight DNS, main thread + BUN_DESTRUCT_VM_ON_EXIT=1). All run with Malloc=1 so that WebKit's fastMalloc uses the system allocator and ASAN builds poison freed JSC heap memory. With libpas the freed pages stay mapped, which is why production only saw rare macOS segfaults at small offsets.

On the unfixed debug+ASAN build the two SQL tests fail 10/10 runs with:

ERROR: AddressSanitizer: heap-use-after-free ... READ of size 8 ... thread T9 (Worker)
  #0 JSC::HandleBlock::handleSet() HandleBlock.h:71
  #2 JSC::HandleSet::heapFor(JSC::JSValue*) HandleSet.h:99
  #3 Bun__StrongRef__delete StrongRef.cpp:12
  ...
  #12 bun_runtime::jsc_hooks::deinit_runtime_state jsc_hooks.rs:592
freed by: JSC::HandleSet::~HandleSet <- JSC::Heap::~Heap <- JSC::VM::~VM <- WebWorker__teardownJSCVM

With the fix the file passes. Also verified end-to-end against a live postgres (query resolve/reject dispatch through the context Strongs, worker with an open connection exiting cleanly).

WebWorker::shutdown (and global_exit under BUN_DESTRUCT_VM_ON_EXIT) tore
down the JSC VM, then VirtualMachine::destroy dropped the per-VM
RuntimeState. The SQL contexts' Strong handles and the per-VM DNS data
dropped in that window, so Bun__StrongRef__delete unlinked HandleNodes
from the already-freed HandleSet (segfault in
WTF::SentinelLinkedList<JSC::HandleNode>::remove).

Add a release_runtime_state_js_handles runtime hook, called from both
teardown paths after the last JS runs (socket-group close callbacks) and
before the VM deref. It releases the MySQL/Postgres context Strongs and
drops the DNS GlobalData while the JSC heap is alive.

Dropping GlobalData pre-teardown surfaced two follow-on ordering issues
in the same class, fixed here too:
- reject_later now drops the deferred synchronously when the VM is
  shutting down instead of enqueueing a task that can never run (the
  promise Strong must drop while the heap is alive, and the stranded
  box leaked).
- GlobalData::drop unlinks the resolver's pending-query timeout timer
  from the per-thread timer heap, which now outlives the box and is
  still walked by WTFTimer::update during teardown.

github-actions Bot added the claude label Jun 8, 2026

robobun (Collaborator, Author) commented Jun 8, 2026
Updated 1:05 AM PT - Jun 10th, 2026

@robobun, your commit d5f126f has some failures in Build #61687 (All Failures)


🧪   To try this PR locally:

bunx bun-pr 31990

That installs a local version of the PR into your bun-31990 executable, so you can run:

bun-31990 --bun

github-actions Bot (Contributor) commented Jun 8, 2026

Found 2 issues this PR may fix:

  1. panic: Segmentation fault at address 0xD — "multiple threads are crashing" under Worker spawn/terminate churn (1.3.14, long-running server) #31880 - Reports segfaults at small addresses (0xD) during worker spawn/terminate churn, matching the PR's fix for use-after-free in HandleSet during worker teardown
  2. Worker create+terminate cycle aborts process after ~100k–900k iterations on macOS arm64 #30421 - Deterministic SIGABRT during tight worker create/terminate cycles on macOS arm64, crashing in the worker termination path this PR fixes

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #31880
Fixes #30421

🤖 Generated with Claude Code

robobun (Collaborator, Author) commented Jun 8, 2026

Checked both against the mechanism this PR fixes. Neither matches, so I'm not adding the auto-close lines:

Both also predate the teardown ordering this PR changes, which affects 1.4.0 canary builds (the Sentry crashes BUN-3DSK/BUN-3END are on 1.4.0-canary).

coderabbitai Bot (Contributor) commented Jun 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e77c6bc9-20c6-437d-81fe-3d074f8ac26c

📥 Commits

Reviewing files that changed from the base of the PR and between 35f3dea and d5f126f.

📒 Files selected for processing (1)
  • test/js/web/workers/worker-terminate-lifetime.test.ts

Walkthrough

Adds a RuntimeHooks entrypoint to release RuntimeState-owned JSC handles while the VM is still alive, implements deinit helpers for SQL contexts and RareData, integrates the hook into VirtualMachine and WebWorker shutdown paths, adds DNS/promise shutdown guards, and includes regression tests for clean shutdown.

Changes

JSC VM Teardown Handle Cleanup

| Layer | File(s) | Summary |
| --- | --- | --- |
| RuntimeHooks contract and SQL context cleanup | src/jsc/VirtualMachine.rs, src/sql_jsc/mysql/MySQLContext.rs, src/sql_jsc/postgres/PostgresSQLContext.rs, src/sql_jsc/jsc.rs | Defines the release_runtime_state_js_handles hook contract in RuntimeHooks with safety docs. Adds deinit() to MySQLContext and PostgresSQLContext, and deinit_js_handles() to RareData to explicitly release JSC Strong callback handles while the VM is alive. |
| Runtime hook implementation and wiring | src/runtime/jsc_hooks.rs | Wires the new hook into __BUN_RUNTIME_HOOKS and implements release_runtime_state_js_handles to call RareData.deinit_js_handles() and drop per-VM DNS data (global_dns_data) when runtime_state() is present. |
| VM exit teardown hook integration | src/jsc/VirtualMachine.rs | Calls release_runtime_state_js_handles during VirtualMachine::global_exit (pre-GlobalObject destruction) and deinitializes RareData S3 default client before JSC teardown. |
| WebWorker shutdown hook integration | src/jsc/web_worker.rs | During WebWorker::shutdown, after socket-group closure and S3 client deinit, conditionally calls the runtime hook to release RuntimeState JSC handles before tearing down the worker VM. |
| DNS and promise shutdown guards | src/runtime/dns_jsc/cares_jsc.rs, src/runtime/dns_jsc/dns.rs | ErrorDeferred::reject_later skips enqueuing deferred rejections when the VM is shutting down. GlobalData::drop skips resolver timer unlink when runtime_state() is null to avoid accessing cleared runtime state. |
| Regression tests | test/js/web/workers/worker-terminate-lifetime.test.ts | Adds test helper debugHeapEnv and three test cases: worker loading Bun.SQL+Bun.s3, worker terminated during in-flight dns.promises.resolve4, and main-thread run with BUN_DESTRUCT_VM_ON_EXIT=1; all run with Malloc=1/ASAN options to assert clean shutdown (no stderr, expected stdout, exit code 0). |

Suggested reviewers

  • Jarred-Sumner

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: releasing RuntimeState's JSC handles before VM teardown, which directly addresses the root cause of the Sentry crashes detailed in the PR description. |
| Description check | ✅ Passed | The description is comprehensive and well-structured. It includes both required sections (What does this PR do and How did you verify your code works), explains the crash root cause with a detailed stack trace, provides a clear reproduction case, describes the fix mechanism, and documents verification with regression tests. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/web/workers/worker-terminate-lifetime.test.ts`:
- Around line 122-125: Reorder the assertions in the affected tests so stdout is
asserted first, then stderr only if exitCode !== 0, and finally assert exitCode
=== 0; specifically update the assertion blocks that currently read
expect(stderr).toBe(""), expect(stdout).toBe("..."), expect(exitCode).toBe(0)
(in the tests inside worker-terminate-lifetime.test.ts) to the pattern: assert
stdout value first, if exitCode !== 0 assert stderr is empty, then assert
exitCode is 0; apply this change consistently to all three new tests referenced
in the file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ae617077-8fd8-4828-af55-1da43f31716a

📥 Commits

Reviewing files that changed from the base of the PR and between a988615 and 214dbc9.

📒 Files selected for processing (9)
  • src/jsc/VirtualMachine.rs
  • src/jsc/web_worker.rs
  • src/runtime/dns_jsc/cares_jsc.rs
  • src/runtime/dns_jsc/dns.rs
  • src/runtime/jsc_hooks.rs
  • src/sql_jsc/jsc.rs
  • src/sql_jsc/mysql/MySQLContext.rs
  • src/sql_jsc/postgres/PostgresSQLContext.rs
  • test/js/web/workers/worker-terminate-lifetime.test.ts

github-actions Bot (Contributor) commented Jun 8, 2026

This PR may be a duplicate of:

  1. test: run the full ported Node suite leak-clean under the ASAN runner and fix the bugs it surfaces — subprocess fd double-close, console/Strong-handle teardown UAFs, SAN/scope-function/valkey leaks, post-exit child_process streams, runner tmpdir isolation (2322 tests green) #31833 - Also introduces a release_runtime_state_js_handles runtime hook to release SQL Strong handles (on_query_resolve/reject StrongOptionals in MySQLContext/PostgresSQLContext) before JSC VM teardown, fixing the same heap-use-after-free in Bun__StrongRef__delete. Both PRs change VirtualMachine.rs and jsc_hooks.rs for the same purpose. PR test: run the full ported Node suite leak-clean under the ASAN runner and fix the bugs it surfaces — subprocess fd double-close, console/Strong-handle teardown UAFs, SAN/scope-function/valkey leaks, post-exit child_process streams, runner tmpdir isolation (2322 tests green) #31833 additionally fixes S3 handle teardown, ConsoleObject ownership, and several other leak-related bugs.

🤖 Generated with Claude Code

RareData.s3_default_client (cached by the Bun.s3 getter) is dropped in
destroy() after the HandleSet is freed, on both the worker shutdown and
the BUN_DESTRUCT_VM_ON_EXIT main-thread paths: same class as the SQL
context Strongs. Release it in the same pre-teardown step.

robobun (Collaborator, Author) commented Jun 8, 2026

Checked the overlap with #31833. It is real but partial, so this PR stays open:

The overlapping hunks are semantically identical, so whichever PR lands second rebases trivially. Noted the relationship in both PRs so a maintainer can sequence them.

robobun added 3 commits June 8, 2026 17:33
Malloc=1 exposes every deliberately-unreclaimed exit-time WebKit
allocation to LeakSanitizer, so the leak sweep enabled by ASAN CI lanes
(detect_leaks=1) took minutes in the spawned children and timed the
tests out. The use-after-free detection the tests exist for is
AddressSanitizer proper and unaffected by detect_leaks=0.

bmalloc has no system-heap fallback on Windows, so the spawned child
aborts at startup before running any JS. No Windows lane runs ASAN, so
the env var only served the Linux/macOS ASAN lanes; on Windows the
tests still cover the plain clean-shutdown contract.

robobun (Collaborator, Author) commented Jun 8, 2026

CI status: the diff is green; the remaining red is unrelated flake.

  • The only failing test in build 61398 is test/cli/install/bunx.test.ts ("should handle package that requires node 24", exit 3 against the live registry), on both the x64-asan and Windows x64-baseline lanes. The identical failure appears on unrelated branches: builds 61388 (506d9d9), 61391 (81c2768), and 61392 (25c35db).
  • The teardown regression tests added here pass on every lane, including x64-asan where they deterministically reproduce the HandleSet use-after-free on an unfixed build.
  • Earlier red on this PR was real and is fixed: the ASAN lane needed detect_leaks=0 in the tests' child processes (f5caab2, the Malloc=1 debug heap made LSAN sweep all WebKit exit allocations), and Windows needed Malloc=1 gated off (e7c7060, bmalloc has no system-heap fallback there). The mass Windows bun install setup failures in build 61394 were agent-side: builds 61393 and 61394 ran src-identical binaries yet went from 1 to 13 failed setup shards.

Ready for review.

robobun (Collaborator, Author) commented Jun 8, 2026

Final CI state for build 61398: every failed shard (16, one per platform lane) is the same single test, test/cli/install/bunx.test.ts "should handle package that requires node 24" (exit 3 against the live registry). It fails on all lanes uniformly, and the same failure appears on unrelated branches (builds 61388, 61391, 61392), so it is an ecosystem-driven breakage independent of this PR, likely affecting main too.

The tests added by this PR pass on every lane in the completed matrix, including x64-asan where they reproduce the HandleSet use-after-free deterministically on an unfixed build.

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
test/js/web/workers/worker-terminate-lifetime.test.ts (2)

123-129: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject these worker-ready promises on close too.

Lines 123-129 and Line 163 only fail on error. If the worker exits before posting "loaded"/"inflight", the child can sit until the outer test timeout instead of failing immediately with a useful reason. Please wire close to reject these waits as well.

As per coding guidelines, "Wire EVERY failure event (error, close, abort, process exit) to reject the awaited promise so failures surface immediately with a message instead of as an opaque 30s hang."

Also applies to: 163-163
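A minimal sketch of the requested pattern, assuming a structural stand-in for the worker so it runs without a real Worker. `WorkerLike`, `Listener`, and `waitForMessage` are hypothetical names and are not taken from the test file.

```typescript
// Structural stand-in for a Worker; hypothetical, for illustration only.
type Listener = (event: { data?: unknown }) => void;

interface WorkerLike {
  addEventListener(type: string, listener: Listener): void;
  removeEventListener(type: string, listener: Listener): void;
}

// Resolves when the worker posts `expected`; rejects on "error" or "close",
// so an early worker death fails immediately with a message instead of hanging.
function waitForMessage(worker: WorkerLike, expected: string): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const cleanup = () => {
      worker.removeEventListener("message", onMessage);
      worker.removeEventListener("error", onError);
      worker.removeEventListener("close", onClose);
    };
    const onMessage: Listener = event => {
      if (event.data !== expected) return; // ignore unrelated messages
      cleanup();
      resolve();
    };
    const onError: Listener = () => {
      cleanup();
      reject(new Error(`worker errored before posting "${expected}"`));
    };
    const onClose: Listener = () => {
      cleanup();
      reject(new Error(`worker closed before posting "${expected}"`));
    };
    worker.addEventListener("message", onMessage);
    worker.addEventListener("error", onError);
    worker.addEventListener("close", onClose);
  });
}
```

Removing all three listeners on the first outcome keeps a later normal close from being reported as a failure, which matches the request to keep the existing "closed" listener separate.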

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/js/web/workers/worker-terminate-lifetime.test.ts` around lines 123 -
129, The promises waiting for worker readiness (e.g., the "loaded" promise that
sets w.onmessage/w.onerror and the similar "inflight" wait around line 163) only
reject on "error" and must also reject when the worker closes; update the
promise constructors so the worker "close" event also calls reject (e.g., add
w.addEventListener("close", err => reject(new Error('worker closed before
ready')), { once: true }) and likewise for any "inflight" wait) ensuring every
failure event (close/error/abort) rejects the awaited promise with a descriptive
error message; keep the existing "closed" listener separate for normal close
handling.

Source: Coding guidelines


155-158: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the DNS repro local; 192.0.2.1 makes this test non-hermetic.

Line 157 sends the query to a real external resolver address. That violates the no-external-network rule and also weakens the precondition here: some environments will fail resolve4() immediately instead of keeping the c-ares request alive long enough to exercise teardown. Use a local UDP blackhole server bound to 127.0.0.1 with port: 0, then pass that port into the child.

As per coding guidelines, "Never contact external network hosts or live registries in tests - use local in-process servers or container harness instead" and "Always use port: 0."
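One way to build such a blackhole is sketched below with Node's `node:dgram` module. The helper name `startUdpBlackhole` and its return shape are assumptions made for illustration; only `createSocket`, `bind`, `address`, and `close` are real dgram calls.

```typescript
import { createSocket } from "node:dgram";

// Hypothetical helper: binds a UDP socket on loopback that receives
// datagrams and never answers, so a resolver pointed at it stays pending.
function startUdpBlackhole(): Promise<{ port: number; close: () => Promise<void> }> {
  return new Promise((resolve, reject) => {
    const socket = createSocket("udp4");
    socket.once("error", reject);
    // Receive and discard every datagram; no reply is ever sent.
    socket.on("message", () => {});
    // Port 0 asks the OS for a free ephemeral port.
    socket.bind(0, "127.0.0.1", () => {
      const { port } = socket.address();
      resolve({
        port,
        close: () => new Promise<void>(done => socket.close(() => done())),
      });
    });
  });
}
```

The parent test would await this helper, hand `port` to the child (for example through an environment variable, which is also an assumption here), and the child would then call dns.setServers with `127.0.0.1:${port}` as described above.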

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/js/web/workers/worker-terminate-lifetime.test.ts` around lines 155 -
158, The test uses dns.setServers(["192.0.2.1"]) which contacts an external
resolver; instead, start a local UDP blackhole server bound to 127.0.0.1 with
port: 0 before spawning the child, read the actual assigned port, and pass that
port into the child so the child can call dns.setServers([`127.0.0.1:${port}`])
(the code path that contains dns.setServers and dns.promises.resolve4 should
read the injected port); ensure the server simply accepts/ignores packets to
keep the c-ares request in-flight and close the server during teardown so the
test remains hermetic and uses an ephemeral port.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/web/workers/worker-terminate-lifetime.test.ts`:
- Around line 26-29: The Windows branch currently skips overriding a parent
Malloc value, leaving any inherited Malloc in bunEnv; update the debugHeapEnv
construction so when isWindows is true you explicitly clear Malloc (e.g., set
Malloc to an empty string) instead of omitting it. Modify the ternary around
...(isWindows ? {} : { Malloc: "1" }) so debugHeapEnv explicitly includes
Malloc: "" for the isWindows case (referencing debugHeapEnv, bunEnv, isWindows,
and the Malloc env var).
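The requested shape can be sketched as a pure function. This is an assumption-laden illustration, not the test file's code: `buildDebugHeapEnv` and `Env` are hypothetical names, and passing detect_leaks=0 through ASAN_OPTIONS is an assumption about how the tests configure the child.

```typescript
// Hypothetical helper modelled on the debugHeapEnv described above.
type Env = Record<string, string | undefined>;

function buildDebugHeapEnv(base: Env, isWindows: boolean): Env {
  return {
    ...base,
    // Assumption: the leak sweep is disabled for the child via ASAN_OPTIONS.
    ASAN_OPTIONS: "detect_leaks=0",
    // Non-Windows: Malloc=1 selects the system allocator.
    // Windows: explicitly clear any inherited value instead of omitting the key.
    Malloc: isWindows ? "" : "1",
  };
}
```

Because the key is always written last, an inherited Malloc value in the base environment cannot leak into the child on either platform.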


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c7fc1956-003a-4bc7-8de3-2a7e95497a84

📥 Commits

Reviewing files that changed from the base of the PR and between f5caab2 and 35f3dea.

📒 Files selected for processing (1)
  • test/js/web/workers/worker-terminate-lifetime.test.ts

… Malloc clear

- The in-flight DNS test now points c-ares at a local UDP socket that
  never replies (port 0), instead of TEST-NET 192.0.2.1; this also makes
  the pending-query precondition deterministic (no ICMP fast-fail).
- The worker readiness waits reject on close so an early worker death
  fails with a message instead of hanging to the test timeout.
- debugHeapEnv explicitly clears an inherited Malloc on Windows.

robobun (Collaborator, Author) commented Jun 10, 2026

CI state for build 61687 (d5f126f, with main merged in): all 280 executed jobs passed, including every Linux, Windows, and x64-asan test lane. The three red GitHub contexts (darwin-14-aarch64, darwin-14-x64, darwin-26-aarch64) report "Expired": their jobs never left Buildkite's scheduled state because no macOS agents picked them up. That is agent capacity, not this diff; retrying those three jobs in Buildkite needs no new push.

robobun (Collaborator, Author) commented Jun 10, 2026

Build 61687 finished: 284 of 284 executed jobs passed (darwin-26-aarch64 and darwin-14-x64 eventually got agents and went green). The build's failed state comes solely from the darwin-14-aarch64 test step, whose two job slots expired in Buildkite's scheduled queue without ever being picked up by an agent. No test failed anywhere in the matrix; a Buildkite-side retry of that one step (no push needed) completes the run.

Jarred-Sumner pushed a commit that referenced this pull request Jun 16, 2026
…ifiers eager (#32407)

## Crash

Sentry BUN-2V1E: segfault inside
`WTF::TypeCastTraits<JSVMClientData>::isType` reached from
`Zig::GlobalObject::visitChildrenImpl` on a concurrent GC helper thread.
695 lifetime events (26 in the last 24h), 100% Windows x64, 1.2.17
through 1.3.14. 31% of events carry both `workers_spawned=True` and
`workers_terminated=True` vs a ~3% baseline, pointing at
worker-termination churn. Also seen intermittently in CI as the
`broadcast-channel-worker-gc` flake (b03f1e6 is a rekick for it).

```
WTF::ParallelHelperPool::Thread::work
JSC::Heap::runBeginPhase lambda
JSC::SlotVisitor::drainFromShared
JSC::SlotVisitor::drain
JSC::SlotVisitor::visitChildren
JSC::MethodTable::visitChildren
Zig::GlobalObject::visitChildren
Zig::GlobalObject::visitChildrenImpl
WebCore::clientData(JSC::VM&)
WTF::downcast<JSVMClientData>
WTF::is<JSVMClientData>
TypeCastTraits<JSVMClientData>::isType   <-- SEGV
```

## Cause

`visitChildrenImpl` ran:

```cpp
WebCore::clientData(thisObject->vm())->httpHeaderIdentifiers().visit<Visitor>(visitor);
```

Two problems on this line:

**1. `thisObject->vm()` dereferences cell state on the marker thread.**
`JSGlobalObject::vm()` returns `*m_vm` (a raw `VM* const` stored on the
cell); `clientData()` then does
`downcast<JSVMClientData>(vm.clientData)` whose `RELEASE_ASSERT(!source
|| is<Target>(*source))` calls the virtual `isWebCoreJSClientData()`.
The neighbouring `visitGlobalObjectMember(unique_ptr)` overload already
guards a window where the concurrent marker visits a `Zig::GlobalObject`
picked up via conservative stack scan while its IsoSubspace slot is
being recycled; in that same window `m_vm` can read stale bytes,
resolving to a garbage `clientData` whose vtable load faults.
`visitor.vm()` (= `m_heap.vm()`) is guaranteed alive for the duration of
marking and does not depend on the visited cell at all; this is how
JSC's own `visitChildren` implementations (`FunctionExecutable`,
`JSWeakObjectRef`, `Structure`) fetch the VM on the marker thread.

**2. `httpHeaderIdentifiers()` was an unlocked lazy
`std::optional::emplace()`** called from both the mutator
(`NodeHTTP.cpp` header assignment) and concurrent GC helper threads.
With more than one `Zig::GlobalObject` in a VM (ShadowRealm,
test-isolation swap, bake) distinct parallel marker helpers each visit a
different global and all call `httpHeaderIdentifiers()` on the same
`JSVMClientData`, so two threads can enter `emplace()` on the same
storage. The `HTTPHeaderIdentifiers` constructor only runs ~90
`LazyProperty::initLater()` calls (each a single tagged-pointer store),
so there is nothing worth deferring.

## Fix

- `ZigGlobalObject.cpp`: fetch the VM via `visitor.vm()` instead of
`thisObject->vm()`.
- `BunClientData.{h,cpp}`: `m_httpHeaderIdentifiers` is now a plain
eagerly-constructed member; `httpHeaderIdentifiers()` is an inline
accessor.

## Verification

The race window is too narrow to trip deterministically on Linux. An
honest probe against the unfixed debug (ASAN) build, with `Malloc=1` +
`BUN_JSC_collectContinuously=1` + `BUN_JSC_numberOfGCMarkers=8`:

- 5 iterations of an 8-round × 6-worker BroadcastChannel
create/terminate/GC stress: clean.
- 8 iterations of a 100-round × 8-ShadowRealm (parallel-marker emplace)
stress: clean.

So there is no fail-before proof to hand the gate; the crash signature
is Windows-specific and timing-dependent. The fix is nonetheless clearly
correct on inspection:

- `visitor.vm()` is the JSC convention for the marker thread and cannot
read through the visited cell.
- An unlocked `std::optional::emplace()` reachable from two threads is a
data race in any memory model.

A new stress test in
`test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts`
hammers the exact path (multiple globals per VM via ShadowRealm, worker
churn, forced parallel markers, `Malloc=1` on non-Windows) so a future
regression on Windows CI will show up where the signature has already
been observed.

```
bun bd test test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts   # 4 pass
bun bd test test/js/node/http/node-http.test.ts -t headers                     # 5 pass (HTTPHeaderIdentifiers path)
bun bd test test/js/node/http/numeric-header.test.ts                           # 1 pass
```

## Related

Checked #31990 / #32071 / #32082 (worker event-loop enqueue after
terminate, Strong<> releases before VM teardown): none touch
`visitChildrenImpl` or `vm.clientData` access from the marker thread. No
open PR addresses this crash.

The issue-matcher suggested four candidates; assessment against the
actual stack:

- #20641 (BUN-N2D): same `TypeCastTraits<JSVMClientData>::isType` frame
but reached from `bunVMConcurrently` on the main event loop during libuv
signal processing, not from a GC marker thread. Different code path;
this PR does not touch it.
- #20786 (BUN-PD8): same `isType` frame reached from `JSC::subspaceFor`
inside `Request__create` on the HTTP server request path (main thread).
Different code path; this PR does not touch it.
- #27312: SIGILL (not SEGV) in `SlotVisitor::drain` on Linux during `bun
test` cleanup. Adjacent area but a different fault signature; not
claimed.
- #31880: generic "multiple threads are crashing" under worker churn, no
decoded stack. #32071 already declines to claim it for the same reason;
not claimed here either.

None are auto-closed by this PR. #20641 and #20786 suggest there may be
other callers of `clientData()` that can see a bad `vm.clientData` on
Windows; those are separate paths and out of scope here.