Release RuntimeState's JSC handles before tearing down the VM by robobun · Pull Request #31990 · oven-sh/bun

robobun · 2026-06-08T16:41:01Z

What does this PR do?

Fixes the Sentry crashes BUN-3DSK and BUN-3END: a segfault during WebWorker shutdown with this stack:

WTF::SentinelLinkedList<JSC::HandleNode>::remove
JSC::HandleSet::deallocate
Bun__StrongRef__delete
<bun_jsc::strong::Strong>::drop
drop_in_place<bun_sql_jsc::postgres::postgres_sql_context::PostgresSQLContext>
drop_in_place<bun_runtime::jsc_hooks::RuntimeState>
bun_runtime::jsc_hooks::deinit_runtime_state
<bun_jsc::virtual_machine::VirtualMachine>::destroy
<bun_jsc::web_worker::WebWorker>::shutdown

Repro: any worker that loads bun:sql (touching Bun.SQL is enough: the internal module's top-level init() stores Strong refs in the per-VM MySQL/Postgres contexts) and then exits:

const w = new Worker("data:text/javascript," + encodeURIComponent("Bun.SQL; postMessage('loaded');"));

Cause: WebWorker::shutdown step 3 (WebWorker__teardownJSCVM) derefs the JSC VM to zero, destroying the Heap and its HandleSet. Step 5 (VirtualMachine::destroy) then calls deinit_runtime_state, which drops the Box<RuntimeState>. Two of its fields still own JSC GC handles at that point:

sql_rare: the MySQL/Postgres contexts' on_query_resolve_fn / on_query_reject_fn Strongs (the observed crash),
global_dns_data: dropping it runs ares_destroy, whose EDESTRUCTION callbacks settle promise Strongs.

RareData.s3_default_client (cached by the Bun.s3 getter) is the same class: RareData is also dropped in destroy(), so a worker that touched Bun.s3 hits the identical UAF. It is released in the same pre-teardown step on both paths.

Strong's release path (Bun__StrongRef__delete -> HandleSet::heapFor(slot)->deallocate(slot)) reads the freed HandleBlock and writes the freed HandleSet's free list. The same ordering exists on the main thread under BUN_DESTRUCT_VM_ON_EXIT=1 (global_exit runs Zig__GlobalObject__destructOnExit before destroy()), which ASAN CI lanes enable.
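The ordering hazard can be modelled without JSC. The sketch below is a toy model, not Bun's code: `ToyHandleSet`, `ToyStrong`, and `ToyRuntimeState` are hypothetical stand-ins, and a thrown error plays the role of the use-after-free.

```typescript
// Toy model of the teardown hazard. All names are hypothetical stand-ins,
// not Bun or JSC APIs; a thrown Error plays the role of the use-after-free.
class ToyHandleSet {
  destroyed = false;
  private live = new Set<ToyStrong>();

  allocate(): ToyStrong {
    const handle = new ToyStrong(this);
    this.live.add(handle);
    return handle;
  }

  deallocate(handle: ToyStrong): void {
    // In the real crash this step touches freed memory; here it fails loudly.
    if (this.destroyed) throw new Error("deallocate on destroyed HandleSet");
    this.live.delete(handle);
  }

  get liveCount(): number {
    return this.live.size;
  }
}

class ToyStrong {
  private released = false;
  private owner: ToyHandleSet;

  constructor(owner: ToyHandleSet) {
    this.owner = owner;
  }

  release(): void {
    if (this.released) return; // releasing twice is a no-op
    this.released = true;
    this.owner.deallocate(this);
  }
}

class ToyRuntimeState {
  strongs: ToyStrong[];

  constructor(strongs: ToyStrong[]) {
    this.strongs = strongs;
  }

  // Stands in for the drop glue that runs when the state box is dropped.
  drop(): void {
    for (const strong of this.strongs) strong.release();
  }
}

// Unfixed order: the handle set dies first, then the state drops.
function teardownUnfixed(set: ToyHandleSet, state: ToyRuntimeState): void {
  set.destroyed = true;
  state.drop();
}

// Fixed order: release the handles while the handle set is still alive.
function teardownFixed(set: ToyHandleSet, state: ToyRuntimeState): void {
  state.drop();
  set.destroyed = true;
}
```

In this model `teardownUnfixed` throws on the first handle, while `teardownFixed` leaves `liveCount` at zero before the set is marked destroyed.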

Fix: new RuntimeHooks::release_runtime_state_js_handles slot, called from both teardown paths after the last JS runs (socket-group close callbacks, so SQL on_close can still dispatch) and before the VM deref. It releases the SQL context Strongs and drops the DNS GlobalData while the JSC heap is alive. This mirrors the existing cancel_all_timers pre-teardown hook, which already handles the same ordering for timer handles.
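As a rough illustration of where the new slot sits in the shutdown sequence, here is a sketch with an optional hook in a hook table. `ToyRuntimeHooks`, `shutdown`, and the step strings are hypothetical names chosen for this example; only the relative order (after socket-group close, before the VM teardown) comes from the description above.

```typescript
// Hypothetical model of a hook table with an optional pre-teardown slot.
// Step names echo the PR description; none of this is Bun's actual code.
interface ToyRuntimeHooks {
  releaseRuntimeStateJsHandles?: () => void;
}

function shutdown(hooks: ToyRuntimeHooks, log: string[]): void {
  log.push("close socket groups"); // last point where JS callbacks may run
  if (hooks.releaseRuntimeStateJsHandles) {
    hooks.releaseRuntimeStateJsHandles(); // heap is still alive here
  }
  log.push("teardown JSC VM"); // heap and handle set are gone after this
  log.push("destroy VirtualMachine"); // drops whatever state is left
}

const steps: string[] = [];
shutdown(
  {
    releaseRuntimeStateJsHandles: () => {
      steps.push("release JS handles");
    },
  },
  steps,
);
console.log(steps.join(" -> "));
```

Keeping the slot optional means a VM without runtime state simply skips the release step, and the rest of the sequence is unchanged.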

Dropping GlobalData pre-teardown surfaced two follow-on issues in the same ordering class (caught by ASAN while validating the fix with an in-flight DNS query):

reject_later now drops the deferred synchronously when the VM is shutting down. The event loop never ticks again, so the enqueued task could never run; its promise Strong has to drop while the heap is alive, and the stranded box leaked.
GlobalData::drop unlinks the resolver's pending-query timeout timer from the per-thread timer heap, which now outlives the box and is still walked by WTFTimer::update during teardown.
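The first of these two items can be sketched as follows. This models only the `reject_later` change; `ToyLoop`, `ToyDeferred`, and `rejectLater` are illustrative names, not the real Rust types, and the `dropped` flag stands in for releasing the promise Strong.

```typescript
// Hypothetical model of a deferred rejection that must not be stranded.
class ToyLoop {
  shuttingDown = false;
  private queue: Array<() => void> = [];

  enqueue(task: () => void): void {
    this.queue.push(task);
  }

  // Runs queued tasks; in this model it is never called after shutdown starts.
  tick(): void {
    const tasks = this.queue;
    this.queue = [];
    for (const task of tasks) task();
  }

  get pending(): number {
    return this.queue.length;
  }
}

class ToyDeferred {
  dropped = false;
  rejectedWith: string | null = null;

  reject(reason: string): void {
    this.rejectedWith = reason;
    this.drop();
  }

  // Stands in for releasing the promise handle held by the deferred.
  drop(): void {
    this.dropped = true;
  }
}

function rejectLater(loop: ToyLoop, deferred: ToyDeferred, reason: string): void {
  if (loop.shuttingDown) {
    // No future tick will run the task, so release the handle right away.
    deferred.drop();
    return;
  }
  loop.enqueue(() => deferred.reject(reason));
}
```

During normal operation the rejection still happens on the next tick; during shutdown nothing is queued, so nothing is left holding a handle when the heap goes away.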

Audited the remaining RuntimeState fields (timer heap, ssl_ctx_cache, editor_context, entry_point, transpiler_arena, body_value_pool, isolation_handles): none reach JSC from their drop glue.

Overlap note: #31833 independently adds the same hook for the SQL Strongs and the S3 release, but only on the global_exit path. This PR additionally wires the release into WebWorker::shutdown (the path the Sentry crashes fire on), covers the DNS data and its two follow-on teardown bugs, and adds regression tests. The overlapping hunks are semantically identical, so whichever lands second rebases trivially.

How did you verify your code works?

Three new tests in test/js/web/workers/worker-terminate-lifetime.test.ts (worker + Bun.SQL, worker terminated with in-flight DNS, main thread + BUN_DESTRUCT_VM_ON_EXIT=1). All run with Malloc=1 so that WebKit's fastMalloc uses the system allocator and ASAN builds poison freed JSC heap memory. With libpas the freed pages stay mapped, which is why production only saw rare macOS segfaults at small offsets.

On the unfixed debug+ASAN build the two SQL tests fail 10/10 runs with:

ERROR: AddressSanitizer: heap-use-after-free ... READ of size 8 ... thread T9 (Worker)
  #0 JSC::HandleBlock::handleSet() HandleBlock.h:71
  #2 JSC::HandleSet::heapFor(JSC::JSValue*) HandleSet.h:99
  #3 Bun__StrongRef__delete StrongRef.cpp:12
  ...
  #12 bun_runtime::jsc_hooks::deinit_runtime_state jsc_hooks.rs:592
freed by: JSC::HandleSet::~HandleSet <- JSC::Heap::~Heap <- JSC::VM::~VM <- WebWorker__teardownJSCVM

With the fix the file passes. Also verified end-to-end against a live postgres (query resolve/reject dispatch through the context Strongs, worker with an open connection exiting cleanly).

WebWorker::shutdown (and global_exit under BUN_DESTRUCT_VM_ON_EXIT) tore down the JSC VM, then VirtualMachine::destroy dropped the per-VM RuntimeState. The SQL contexts' Strong handles and the per-VM DNS data dropped in that window, so Bun__StrongRef__delete unlinked HandleNodes from the already-freed HandleSet (segfault in WTF::SentinelLinkedList<JSC::HandleNode>::remove).

Add a release_runtime_state_js_handles runtime hook, called from both teardown paths after the last JS runs (socket-group close callbacks) and before the VM deref. It releases the MySQL/Postgres context Strongs and drops the DNS GlobalData while the JSC heap is alive.

Dropping GlobalData pre-teardown surfaced two follow-on ordering issues in the same class, fixed here too:

- reject_later now drops the deferred synchronously when the VM is shutting down instead of enqueueing a task that can never run (the promise Strong must drop while the heap is alive, and the stranded box leaked).
- GlobalData::drop unlinks the resolver's pending-query timeout timer from the per-thread timer heap, which now outlives the box and is still walked by WTFTimer::update during teardown.

robobun · 2026-06-08T16:41:12Z

Updated 1:05 AM PT - Jun 10th, 2026

❌ @robobun, your commit d5f126f has some failures in Build #61687 (All Failures)

🧪 To try this PR locally:

bunx bun-pr 31990

That installs a local version of the PR into your bun-31990 executable, so you can run:

bun-31990 --bun

github-actions · 2026-06-08T16:45:47Z

Found 2 issues this PR may fix:

panic: Segmentation fault at address 0xD — "multiple threads are crashing" under Worker spawn/terminate churn (1.3.14, long-running server) #31880 - Reports segfaults at small addresses (0xD) during worker spawn/terminate churn, matching the PR's fix for use-after-free in HandleSet during worker teardown
Worker create+terminate cycle aborts process after ~100k–900k iterations on macOS arm64 #30421 - Deterministic SIGABRT during tight worker create/terminate cycles on macOS arm64, crashing in the worker termination path this PR fixes

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #31880
Fixes #30421

🤖 Generated with Claude Code

robobun · 2026-06-08T16:46:51Z

Checked both against the mechanism this PR fixes. Neither matches, so I'm not adding the auto-close lines:

The bug fixed here requires the per-VM SQL contexts' Strong handles to be populated (worker loads bun:sql, e.g. by touching Bun.SQL) or an in-flight c-ares DNS query at teardown. If neither happened, the handles are empty and the late drop is a no-op.
#31880 is on 1.3.14 stable with generic message-passing workers; there is no bun:sql or dns.resolve* in the workload, and the signature (segfault at 0xD, "multiple threads are crashing") does not match this crash (read of the freed HandleBlock header during HandleSet::deallocate).
#30421 is on 1.3.13 and is a SIGABRT (not a segfault) in a worker create/terminate cycle whose workers only use parentPort messaging. The repro never loads bun:sql or starts a DNS query, so the code path changed here is never reached.

Both also predate the teardown ordering this PR changes, which affects 1.4.0 canary builds (the Sentry crashes BUN-3DSK/BUN-3END are on 1.4.0-canary).

coderabbitai · 2026-06-08T16:49:20Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e77c6bc9-20c6-437d-81fe-3d074f8ac26c

📥 Commits

Reviewing files that changed from the base of the PR and between 35f3dea and d5f126f.

📒 Files selected for processing (1)

test/js/web/workers/worker-terminate-lifetime.test.ts

Walkthrough

Adds a RuntimeHooks entrypoint to release RuntimeState-owned JSC handles while the VM is still alive, implements deinit helpers for SQL contexts and RareData, integrates the hook into VirtualMachine and WebWorker shutdown paths, adds DNS/promise shutdown guards, and includes regression tests for clean shutdown.

Changes

JSC VM Teardown Handle Cleanup

| Layer / File(s) | Summary |
| --- | --- |
| RuntimeHooks contract and SQL context cleanup `src/jsc/VirtualMachine.rs`, `src/sql_jsc/mysql/MySQLContext.rs`, `src/sql_jsc/postgres/PostgresSQLContext.rs`, `src/sql_jsc/jsc.rs` | Defines the `release_runtime_state_js_handles` hook contract in `RuntimeHooks` with safety docs. Adds `deinit()` to `MySQLContext` and `PostgresSQLContext`, and `deinit_js_handles()` to `RareData` to explicitly release JSC Strong callback handles while the VM is alive. |
| Runtime hook implementation and wiring `src/runtime/jsc_hooks.rs` | Wires the new hook into `__BUN_RUNTIME_HOOKS` and implements `release_runtime_state_js_handles` to call `RareData.deinit_js_handles()` and drop per-VM DNS data (`global_dns_data`) when `runtime_state()` is present. |
| VM exit teardown hook integration `src/jsc/VirtualMachine.rs` | Calls `release_runtime_state_js_handles` during `VirtualMachine::global_exit` (pre-GlobalObject destruction) and deinitializes `RareData` S3 default client before JSC teardown. |
| WebWorker shutdown hook integration `src/jsc/web_worker.rs` | During `WebWorker::shutdown`, after socket-group closure and S3 client deinit, conditionally calls the runtime hook to release `RuntimeState` JSC handles before tearing down the worker VM. |
| DNS and promise shutdown guards `src/runtime/dns_jsc/cares_jsc.rs`, `src/runtime/dns_jsc/dns.rs` | `ErrorDeferred::reject_later` skips enqueuing deferred rejections when the VM is shutting down. `GlobalData::drop` skips resolver timer unlink when `runtime_state()` is null to avoid accessing cleared runtime state. |
| Regression tests `test/js/web/workers/worker-terminate-lifetime.test.ts` | Adds test helper `debugHeapEnv` and three test cases: worker loading `Bun.SQL`+`Bun.s3`, worker terminated during in-flight `dns.promises.resolve4`, and main-thread run with `BUN_DESTRUCT_VM_ON_EXIT=1`; all run with `Malloc=1`/ASAN options to assert clean shutdown (no stderr, expected stdout, exit code 0). |

Suggested reviewers

Jarred-Sumner

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: releasing RuntimeState's JSC handles before VM teardown, which directly addresses the root cause of the Sentry crashes detailed in the PR description. |
| Description check | ✅ Passed | The description is comprehensive and well-structured. It includes both required sections (What does this PR do and How did you verify your code works), explains the crash root cause with a detailed stack trace, provides a clear reproduction case, describes the fix mechanism, and documents verification with regression tests. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/web/workers/worker-terminate-lifetime.test.ts`:
- Around line 122-125: Reorder the assertions in the affected tests so stdout is
asserted first, then stderr only if exitCode !== 0, and finally assert exitCode
=== 0; specifically update the assertion blocks that currently read
expect(stderr).toBe(""), expect(stdout).toBe("..."), expect(exitCode).toBe(0)
(in the tests inside worker-terminate-lifetime.test.ts) to the pattern: assert
stdout value first, if exitCode !== 0 assert stderr is empty, then assert
exitCode is 0; apply this change consistently to all three new tests referenced
in the file.
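One way to read the requested ordering is the helper below. It is a sketch only: `ChildResult` and `checkCleanShutdown` are hypothetical names, and it throws plain errors rather than calling a test framework's `expect`, so the check order is the only thing it demonstrates.

```typescript
// Hypothetical helper that checks a child-process result in the order the
// review comment asks for: stdout, then stderr (only on failure), then exit code.
interface ChildResult {
  stdout: string;
  stderr: string;
  exitCode: number;
}

function checkCleanShutdown(result: ChildResult, expectedStdout: string): void {
  // 1. stdout is always compared first.
  if (result.stdout !== expectedStdout) {
    throw new Error(`stdout mismatch: ${JSON.stringify(result.stdout)}`);
  }
  // 2. stderr is only inspected when the child exited with a non-zero code.
  if (result.exitCode !== 0 && result.stderr !== "") {
    throw new Error(`stderr not empty: ${result.stderr}`);
  }
  // 3. The exit code is checked last.
  if (result.exitCode !== 0) {
    throw new Error(`exit code ${result.exitCode}`);
  }
}
```

With this shape, a crashing child that printed the expected stdout reports its stderr text before the bare exit code.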

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ae617077-8fd8-4828-af55-1da43f31716a

📥 Commits

Reviewing files that changed from the base of the PR and between a988615 and 214dbc9.

📒 Files selected for processing (9)

src/jsc/VirtualMachine.rs
src/jsc/web_worker.rs
src/runtime/dns_jsc/cares_jsc.rs
src/runtime/dns_jsc/dns.rs
src/runtime/jsc_hooks.rs
src/sql_jsc/jsc.rs
src/sql_jsc/mysql/MySQLContext.rs
src/sql_jsc/postgres/PostgresSQLContext.rs
test/js/web/workers/worker-terminate-lifetime.test.ts

github-actions · 2026-06-08T16:53:04Z

This PR may be a duplicate of:

test: run the full ported Node suite leak-clean under the ASAN runner and fix the bugs it surfaces — subprocess fd double-close, console/Strong-handle teardown UAFs, SAN/scope-function/valkey leaks, post-exit child_process streams, runner tmpdir isolation (2322 tests green) #31833 - Also introduces a release_runtime_state_js_handles runtime hook to release SQL Strong handles (on_query_resolve/reject StrongOptionals in MySQLContext/PostgresSQLContext) before JSC VM teardown, fixing the same heap-use-after-free in Bun__StrongRef__delete. Both PRs change VirtualMachine.rs and jsc_hooks.rs for the same purpose. PR test: run the full ported Node suite leak-clean under the ASAN runner and fix the bugs it surfaces — subprocess fd double-close, console/Strong-handle teardown UAFs, SAN/scope-function/valkey leaks, post-exit child_process streams, runner tmpdir isolation (2322 tests green) #31833 additionally fixes S3 handle teardown, ConsoleObject ownership, and several other leak-related bugs.

🤖 Generated with Claude Code

RareData.s3_default_client (cached by the Bun.s3 getter) is dropped in destroy() after the HandleSet is freed, on both the worker shutdown and the BUN_DESTRUCT_VM_ON_EXIT main-thread paths: same class as the SQL context Strongs. Release it in the same pre-teardown step.

robobun · 2026-06-08T17:02:02Z

Checked the overlap with #31833. It is real but partial, so this PR stays open:

#31833 calls release_runtime_state_js_handles (same hook, independently arrived at) and releases RareData.s3_default_client only in global_exit, i.e. the main-thread BUN_DESTRUCT_VM_ON_EXIT path that the ASAN/LSAN runner exercises.
The Sentry crashes this PR fixes (BUN-3DSK, BUN-3END) fire in WebWorker::shutdown, which #31833 does not touch. This PR wires the release into both teardown paths, covers the per-VM DNS data (plus the two follow-on teardown bugs that surfaced: stranded reject_later tasks and the resolver's dangling timer-heap node), and adds regression tests that reproduce the crash 10/10 on an unfixed ASAN build.
After seeing #31833's S3 finding, I verified that a worker touching Bun.s3 hits the identical UAF on the worker path and added the same s3_default_client.deinit() to both paths here (ceb930e).

The overlapping hunks are semantically identical, so whichever PR lands second rebases trivially. Noted the relationship in both PRs so a maintainer can sequence them.

Malloc=1 exposes every deliberately-unreclaimed exit-time WebKit allocation to LeakSanitizer, so the leak sweep enabled by ASAN CI lanes (detect_leaks=1) took minutes in the spawned children and timed the tests out. The use-after-free detection the tests exist for is AddressSanitizer proper and unaffected by detect_leaks=0.

bmalloc has no system-heap fallback on Windows, so the spawned child aborts at startup before running any JS. No Windows lane runs ASAN, so the env var only served the Linux/macOS ASAN lanes; on Windows the tests still cover the plain clean-shutdown contract.

robobun · 2026-06-08T18:42:52Z

CI status: the diff is green; the remaining red is unrelated flake.

The only failing test in build 61398 is test/cli/install/bunx.test.ts ("should handle package that requires node 24", exit 3 against the live registry), on both the x64-asan and Windows x64-baseline lanes. The identical failure appears on unrelated branches: builds 61388 (506d9d9), 61391 (81c2768), and 61392 (25c35db).
The teardown regression tests added here pass on every lane, including x64-asan where they deterministically reproduce the HandleSet use-after-free on an unfixed build.
Earlier red on this PR was real and is fixed: the ASAN lane needed detect_leaks=0 in the tests' child processes (f5caab2, the Malloc=1 debug heap made LSAN sweep all WebKit exit allocations), and Windows needed Malloc=1 gated off (e7c7060, bmalloc has no system-heap fallback there). The mass Windows bun install setup failures in build 61394 were agent-side: builds 61393 and 61394 ran src-identical binaries yet went from 1 to 13 failed setup shards.

Ready for review.

robobun · 2026-06-08T19:10:23Z

Final CI state for build 61398: every failed shard (16, one per platform lane) is the same single test, test/cli/install/bunx.test.ts "should handle package that requires node 24" (exit 3 against the live registry). It fails on all lanes uniformly, and the same failure appears on unrelated branches (builds 61388, 61391, 61392), so it is an ecosystem-driven breakage independent of this PR, likely affecting main too.

The tests added by this PR pass on every lane in the completed matrix, including x64-asan where they reproduce the HandleSet use-after-free deterministically on an unfixed build.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

test/js/web/workers/worker-terminate-lifetime.test.ts (2)
123-129: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject these worker-ready promises on close too.

Lines 123-129 and Line 163 only fail on error. If the worker exits before posting "loaded"/"inflight", the child can sit until the outer test timeout instead of failing immediately with a useful reason. Please wire close to reject these waits as well.

As per coding guidelines, "Wire EVERY failure event (error, close, abort, process exit) to reject the awaited promise so failures surface immediately with a message instead of as an opaque 30s hang."

Also applies to: 163-163
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/js/web/workers/worker-terminate-lifetime.test.ts` around lines 123 -
129, The promises waiting for worker readiness (e.g., the "loaded" promise that
sets w.onmessage/w.onerror and the similar "inflight" wait around line 163) only
reject on "error" and must also reject when the worker closes; update the
promise constructors so the worker "close" event also calls reject (e.g., add
w.addEventListener("close", err => reject(new Error('worker closed before
ready')), { once: true }) and likewise for any "inflight" wait) ensuring every
failure event (close/error/abort) rejects the awaited promise with a descriptive
error message; keep the existing "closed" listener separate for normal close
handling.
Source: Coding guidelines
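A readiness wait that settles on every terminal event might look like the sketch below. `WorkerLike`, `waitForMessage`, and `FakeWorker` are hypothetical names for this example; a real Worker exposes a similar `addEventListener`, and the fake exists only so the sketch runs without spawning a thread.

```typescript
// Minimal sketch of a readiness wait that rejects on "error" and on "close".
type ToyEvent = { data?: unknown; message?: string };
type Listener = (event: ToyEvent) => void;

interface WorkerLike {
  addEventListener(type: string, listener: Listener): void;
}

function waitForMessage(worker: WorkerLike, expected: string): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    worker.addEventListener("message", event => {
      if (event.data === expected) resolve();
    });
    worker.addEventListener("error", event => {
      reject(new Error(`worker error before "${expected}": ${event.message ?? "unknown"}`));
    });
    // Without this listener, a worker that exits early leaves the promise pending.
    worker.addEventListener("close", () => {
      reject(new Error(`worker closed before "${expected}"`));
    });
  });
}

// Tiny in-memory stand-in for a worker, used only to exercise the helper.
class FakeWorker implements WorkerLike {
  private listeners = new Map<string, Listener[]>();

  addEventListener(type: string, listener: Listener): void {
    const list = this.listeners.get(type) ?? [];
    list.push(listener);
    this.listeners.set(type, list);
  }

  emit(type: string, event: ToyEvent = {}): void {
    for (const listener of this.listeners.get(type) ?? []) listener(event);
  }
}
```

Because a promise settles only once, a "close" that arrives after the expected message is harmless, so a separate listener for the normal close can stay as it is.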

155-158: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the DNS repro local; 192.0.2.1 makes this test non-hermetic.

Line 157 sends the query to a real external resolver address. That violates the no-external-network rule and also weakens the precondition here: some environments will fail resolve4() immediately instead of keeping the c-ares request alive long enough to exercise teardown. Use a local UDP blackhole server bound to 127.0.0.1 with port: 0, then pass that port into the child.

As per coding guidelines, "Never contact external network hosts or live registries in tests - use local in-process servers or container harness instead" and "Always use port: 0."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/js/web/workers/worker-terminate-lifetime.test.ts` around lines 155 -
158, The test uses dns.setServers(["192.0.2.1"]) which contacts an external
resolver; instead, start a local UDP blackhole server bound to 127.0.0.1 with
port: 0 before spawning the child, read the actual assigned port, and pass that
port into the child so the child can call dns.setServers([`127.0.0.1:${port}`])
(the code path that contains dns.setServers and dns.promises.resolve4 should
read the injected port); ensure the server simply accepts/ignores packets to
keep the c-ares request in-flight and close the server during teardown so the
test remains hermetic and uses an ephemeral port.
Source: Coding guidelines
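A local UDP blackhole of the kind suggested here can be sketched with `node:dgram`. `startUdpBlackhole` is a hypothetical helper name, and how the port reaches the child process is left out.

```typescript
import * as dgram from "node:dgram";

// Sketch of a local UDP "blackhole": it binds an ephemeral loopback port and
// never answers, so a resolver pointed at it keeps its query pending.
function startUdpBlackhole(): Promise<{ port: number; close: () => void }> {
  return new Promise((resolve, reject) => {
    const socket = dgram.createSocket("udp4");
    socket.once("error", reject);
    // Incoming datagrams are intentionally ignored.
    socket.on("message", () => {});
    // Port 0 asks the OS for any free port.
    socket.bind(0, "127.0.0.1", () => {
      resolve({ port: socket.address().port, close: () => socket.close() });
    });
  });
}
```

The child could then call `dns.setServers` with `127.0.0.1:<port>` as the comment describes, and the parent closes the socket once the child has exited.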

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/web/workers/worker-terminate-lifetime.test.ts`:
- Around line 26-29: The Windows branch currently skips overriding a parent
Malloc value, leaving any inherited Malloc in bunEnv; update the debugHeapEnv
construction so when isWindows is true you explicitly clear Malloc (e.g., set
Malloc to an empty string) instead of omitting it. Modify the ternary around
...(isWindows ? {} : { Malloc: "1" }) so debugHeapEnv explicitly includes
Malloc: "" for the isWindows case (referencing debugHeapEnv, bunEnv, isWindows,
and the Malloc env var).
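The explicit-clear variant can be sketched as a small environment builder. `buildDebugHeapEnv` and its parameters are hypothetical; `Malloc` and `detect_leaks=0` come from this thread, `ASAN_OPTIONS` is the standard sanitizer options variable, and appending to an inherited value with `:` is an assumption about how the real helper combines options.

```typescript
// Sketch of an environment builder for the spawned child process.
function buildDebugHeapEnv(
  baseEnv: Record<string, string | undefined>,
  isWindows: boolean,
): Record<string, string | undefined> {
  // Assumption: keep inherited sanitizer options and append the leak setting.
  const asanOptions = [baseEnv.ASAN_OPTIONS, "detect_leaks=0"]
    .filter(part => part !== undefined && part !== "")
    .join(":");
  return {
    ...baseEnv,
    ASAN_OPTIONS: asanOptions,
    // On Windows, overwrite any inherited value instead of leaving it in place.
    Malloc: isWindows ? "" : "1",
  };
}
```

Setting the key on both branches, rather than omitting it on Windows, is what keeps a `Malloc` value from the parent environment out of the child.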

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c7fc1956-003a-4bc7-8de3-2a7e95497a84

📥 Commits

Reviewing files that changed from the base of the PR and between f5caab2 and 35f3dea.

📒 Files selected for processing (1)

test/js/web/workers/worker-terminate-lifetime.test.ts

… Malloc clear

- The in-flight DNS test now points c-ares at a local UDP socket that never replies (port 0), instead of TEST-NET 192.0.2.1; this also makes the pending-query precondition deterministic (no ICMP fast-fail).
- The worker readiness waits reject on close so an early worker death fails with a message instead of hanging to the test timeout.
- debugHeapEnv explicitly clears an inherited Malloc on Windows.

robobun · 2026-06-10T05:11:34Z

CI state for build 61687 (d5f126f, with main merged in): all 280 executed jobs passed, including every Linux, Windows, and x64-asan test lane. The three red GitHub contexts (darwin-14-aarch64, darwin-14-x64, darwin-26-aarch64) report "Expired": their jobs never left Buildkite's scheduled state because no macOS agents picked them up. That is agent capacity, not this diff; retrying those three jobs in Buildkite needs no new push.

robobun · 2026-06-10T08:11:29Z

Build 61687 finished: 284 of 284 executed jobs passed (darwin-26-aarch64 and darwin-14-x64 eventually got agents and went green). The build's failed state comes solely from the darwin-14-aarch64 test step, whose two job slots expired in Buildkite's scheduled queue without ever being picked up by an agent. No test failed anywhere in the matrix; a Buildkite-side retry of that one step (no push needed) completes the run.

…ifiers eager (#32407)

## Crash

Sentry BUN-2V1E: segfault inside `WTF::TypeCastTraits<JSVMClientData>::isType` reached from `Zig::GlobalObject::visitChildrenImpl` on a concurrent GC helper thread. 695 lifetime events (26 in the last 24h), 100% Windows x64, 1.2.17 through 1.3.14. 31% of events carry both `workers_spawned=True` and `workers_terminated=True` vs a ~3% baseline, pointing at worker-termination churn. Also seen intermittently in CI as the `broadcast-channel-worker-gc` flake (b03f1e6 is a rekick for it).

```
WTF::ParallelHelperPool::Thread::work
JSC::Heap::runBeginPhase lambda
JSC::SlotVisitor::drainFromShared
JSC::SlotVisitor::drain
JSC::SlotVisitor::visitChildren
JSC::MethodTable::visitChildren
Zig::GlobalObject::visitChildren
Zig::GlobalObject::visitChildrenImpl
WebCore::clientData(JSC::VM&)
WTF::downcast<JSVMClientData>
WTF::is<JSVMClientData>
TypeCastTraits<JSVMClientData>::isType   <-- SEGV
```

## Cause

`visitChildrenImpl` ran:

```cpp
WebCore::clientData(thisObject->vm())->httpHeaderIdentifiers().visit<Visitor>(visitor);
```

Two problems on this line:

**1. `thisObject->vm()` dereferences cell state on the marker thread.** `JSGlobalObject::vm()` returns `*m_vm` (a raw `VM* const` stored on the cell); `clientData()` then does `downcast<JSVMClientData>(vm.clientData)` whose `RELEASE_ASSERT(!source || is<Target>(*source))` calls the virtual `isWebCoreJSClientData()`. The neighbouring `visitGlobalObjectMember(unique_ptr)` overload already guards a window where the concurrent marker visits a `Zig::GlobalObject` picked up via conservative stack scan while its IsoSubspace slot is being recycled; in that same window `m_vm` can read stale bytes, resolving to a garbage `clientData` whose vtable load faults.

`visitor.vm()` (= `m_heap.vm()`) is guaranteed alive for the duration of marking and does not depend on the visited cell at all; this is how JSC's own `visitChildren` implementations (`FunctionExecutable`, `JSWeakObjectRef`, `Structure`) fetch the VM on the marker thread.

**2. `httpHeaderIdentifiers()` was an unlocked lazy `std::optional::emplace()`** called from both the mutator (`NodeHTTP.cpp` header assignment) and concurrent GC helper threads. With more than one `Zig::GlobalObject` in a VM (ShadowRealm, test-isolation swap, bake) distinct parallel marker helpers each visit a different global and all call `httpHeaderIdentifiers()` on the same `JSVMClientData`, so two threads can enter `emplace()` on the same storage. The `HTTPHeaderIdentifiers` constructor only runs ~90 `LazyProperty::initLater()` calls (each a single tagged-pointer store), so there is nothing worth deferring.

## Fix

- `ZigGlobalObject.cpp`: fetch the VM via `visitor.vm()` instead of `thisObject->vm()`.
- `BunClientData.{h,cpp}`: `m_httpHeaderIdentifiers` is now a plain eagerly-constructed member; `httpHeaderIdentifiers()` is an inline accessor.

## Verification

The race window is too narrow to trip deterministically on Linux. An honest probe against the unfixed debug (ASAN) build, with `Malloc=1` + `BUN_JSC_collectContinuously=1` + `BUN_JSC_numberOfGCMarkers=8`:

- 5 iterations of an 8-round × 6-worker BroadcastChannel create/terminate/GC stress: clean.
- 8 iterations of a 100-round × 8-ShadowRealm (parallel-marker emplace) stress: clean.

So there is no fail-before proof to hand the gate; the crash signature is Windows-specific and timing-dependent. The fix is nonetheless clearly correct on inspection:

- `visitor.vm()` is the JSC convention for the marker thread and cannot read through the visited cell.
- An unlocked `std::optional::emplace()` reachable from two threads is a data race in any memory model.

A new stress test in `test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts` hammers the exact path (multiple globals per VM via ShadowRealm, worker churn, forced parallel markers, `Malloc=1` on non-Windows) so a future regression on Windows CI will show up where the signature has already been observed.

```
bun bd test test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts   # 4 pass
bun bd test test/js/node/http/node-http.test.ts -t headers   # 5 pass (HTTPHeaderIdentifiers path)
bun bd test test/js/node/http/numeric-header.test.ts   # 1 pass
```

## Related

Checked #31990 / #32071 / #32082 (worker event-loop enqueue after terminate, Strong<> releases before VM teardown): none touch `visitChildrenImpl` or `vm.clientData` access from the marker thread. No open PR addresses this crash.

The issue-matcher suggested four candidates; assessment against the actual stack:

- #20641 (BUN-N2D): same `TypeCastTraits<JSVMClientData>::isType` frame but reached from `bunVMConcurrently` on the main event loop during libuv signal processing, not from a GC marker thread. Different code path; this PR does not touch it.
- #20786 (BUN-PD8): same `isType` frame reached from `JSC::subspaceFor` inside `Request__create` on the HTTP server request path (main thread). Different code path; this PR does not touch it.
- #27312: SIGILL (not SEGV) in `SlotVisitor::drain` on Linux during `bun test` cleanup. Adjacent area but a different fault signature; not claimed.
- #31880: generic "multiple threads are crashing" under worker churn, no decoded stack. #32071 already declines to claim it for the same reason; not claimed here either.

None are auto-closed by this PR. #20641 and #20786 suggest there may be other callers of `clientData()` that can see a bad `vm.clientData` on Windows; those are separate paths and out of scope here.

github-actions Bot added the claude label Jun 8, 2026

coderabbitai Bot reviewed Jun 8, 2026

robobun added 3 commits June 8, 2026 17:33

ci: retrigger

497f408

Merge branch 'main' into farm/70a182aa/fix-worker-shutdown-handleset-uaf

35f3dea

coderabbitai Bot reviewed Jun 10, 2026

claude Bot mentioned this pull request Jun 12, 2026

node:fs: snapshot path buffers backed by resizable ArrayBuffers for async ops #32189

Open

robobun mentioned this pull request Jun 16, 2026

Use visitor.vm() in GlobalObject::visitChildren; make HTTPHeaderIdentifiers eager #32407

Merged
