Release RuntimeState's JSC handles before tearing down the VM#31990
Release RuntimeState's JSC handles before tearing down the VM#31990robobun wants to merge 7 commits into
Conversation
WebWorker::shutdown (and global_exit under BUN_DESTRUCT_VM_ON_EXIT) tore down the JSC VM, then VirtualMachine::destroy dropped the per-VM RuntimeState. The SQL contexts' Strong handles and the per-VM DNS data dropped in that window, so Bun__StrongRef__delete unlinked HandleNodes from the already-freed HandleSet (segfault in WTF::SentinelLinkedList<JSC::HandleNode>::remove). Add a release_runtime_state_js_handles runtime hook, called from both teardown paths after the last JS runs (socket-group close callbacks) and before the VM deref. It releases the MySQL/Postgres context Strongs and drops the DNS GlobalData while the JSC heap is alive. Dropping GlobalData pre-teardown surfaced two follow-on ordering issues in the same class, fixed here too: - reject_later now drops the deferred synchronously when the VM is shutting down instead of enqueueing a task that can never run (the promise Strong must drop while the heap is alive, and the stranded box leaked). - GlobalData::drop unlinks the resolver's pending-query timeout timer from the per-thread timer heap, which now outlives the box and is still walked by WTFTimer::update during teardown.
|
Updated 1:05 AM PT - Jun 10th, 2026
❌ @robobun, your commit d5f126f has some failures in 🧪 To try this PR locally: bunx bun-pr 31990That installs a local version of the PR into your bun-31990 --bun |
|
Found 2 issues this PR may fix:
🤖 Generated with Claude Code |
|
Checked both against the mechanism this PR fixes. Neither matches, so I'm not adding the auto-close lines:
Both also predate the teardown ordering this PR changes, which affects 1.4.0 canary builds (the Sentry crashes BUN-3DSK/BUN-3END are on 1.4.0-canary). |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughAdds a RuntimeHooks entrypoint to release RuntimeState-owned JSC handles while the VM is still alive, implements deinit helpers for SQL contexts and RareData, integrates the hook into VirtualMachine and WebWorker shutdown paths, adds DNS/promise shutdown guards, and includes regression tests for clean shutdown. ChangesJSC VM Teardown Handle Cleanup
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/js/web/workers/worker-terminate-lifetime.test.ts`:
- Around line 122-125: Reorder the assertions in the affected tests so stdout is
asserted first, then stderr only if exitCode !== 0, and finally assert exitCode
=== 0; specifically update the assertion blocks that currently read
expect(stderr).toBe(""), expect(stdout).toBe("..."), expect(exitCode).toBe(0)
(in the tests inside worker-terminate-lifetime.test.ts) to the pattern: assert
stdout value first, if exitCode !== 0 assert stderr is empty, then assert
exitCode is 0; apply this change consistently to all three new tests referenced
in the file.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ae617077-8fd8-4828-af55-1da43f31716a
📒 Files selected for processing (9)
src/jsc/VirtualMachine.rssrc/jsc/web_worker.rssrc/runtime/dns_jsc/cares_jsc.rssrc/runtime/dns_jsc/dns.rssrc/runtime/jsc_hooks.rssrc/sql_jsc/jsc.rssrc/sql_jsc/mysql/MySQLContext.rssrc/sql_jsc/postgres/PostgresSQLContext.rstest/js/web/workers/worker-terminate-lifetime.test.ts
RareData.s3_default_client (cached by the Bun.s3 getter) is dropped in destroy() after the HandleSet is freed, on both the worker shutdown and the BUN_DESTRUCT_VM_ON_EXIT main-thread paths: same class as the SQL context Strongs. Release it in the same pre-teardown step.
|
Checked the overlap with #31833. It is real but partial, so this PR stays open:
The overlapping hunks are semantically identical, so whichever PR lands second rebases trivially. Noted the relationship in both PRs so a maintainer can sequence them. |
Malloc=1 exposes every deliberately-unreclaimed exit-time WebKit allocation to LeakSanitizer, so the leak sweep enabled by ASAN CI lanes (detect_leaks=1) took minutes in the spawned children and timed the tests out. The use-after-free detection the tests exist for is AddressSanitizer proper and unaffected by detect_leaks=0.
bmalloc has no system-heap fallback on Windows, so the spawned child aborts at startup before running any JS. No Windows lane runs ASAN, so the env var only served the Linux/macOS ASAN lanes; on Windows the tests still cover the plain clean-shutdown contract.
|
CI status: the diff is green; the remaining red is unrelated flake.
Ready for review. |
|
Final CI state for build 61398: every failed shard (16, one per platform lane) is the same single test, The tests added by this PR pass on every lane in the completed matrix, including x64-asan where they reproduce the HandleSet use-after-free deterministically on an unfixed build. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
test/js/web/workers/worker-terminate-lifetime.test.ts (2)
123-129:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReject these worker-ready promises on
closetoo.Lines 123-129 and Line 163 only fail on
error. If the worker exits before posting"loaded"/"inflight", the child can sit until the outer test timeout instead of failing immediately with a useful reason. Please wirecloseto reject these waits as well.As per coding guidelines, "Wire EVERY failure event (
error,close,abort, process exit) to reject the awaited promise so failures surface immediately with a message instead of as an opaque 30s hang."Also applies to: 163-163
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/js/web/workers/worker-terminate-lifetime.test.ts` around lines 123 - 129, The promises waiting for worker readiness (e.g., the "loaded" promise that sets w.onmessage/w.onerror and the similar "inflight" wait around line 163) only reject on "error" and must also reject when the worker closes; update the promise constructors so the worker "close" event also calls reject (e.g., add w.addEventListener("close", err => reject(new Error('worker closed before ready')), { once: true }) and likewise for any "inflight" wait) ensuring every failure event (close/error/abort) rejects the awaited promise with a descriptive error message; keep the existing "closed" listener separate for normal close handling.Source: Coding guidelines
155-158:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep the DNS repro local;
192.0.2.1makes this test non-hermetic.Line 157 sends the query to a real external resolver address. That violates the no-external-network rule and also weakens the precondition here: some environments will fail
resolve4()immediately instead of keeping the c-ares request alive long enough to exercise teardown. Use a local UDP blackhole server bound to127.0.0.1withport: 0, then pass that port into the child.As per coding guidelines, "Never contact external network hosts or live registries in tests - use local in-process servers or container harness instead" and "Always use
port: 0."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/js/web/workers/worker-terminate-lifetime.test.ts` around lines 155 - 158, The test uses dns.setServers(["192.0.2.1"]) which contacts an external resolver; instead, start a local UDP blackhole server bound to 127.0.0.1 with port: 0 before spawning the child, read the actual assigned port, and pass that port into the child so the child can call dns.setServers([`127.0.0.1:${port}`]) (the code path that contains dns.setServers and dns.promises.resolve4 should read the injected port); ensure the server simply accepts/ignores packets to keep the c-ares request in-flight and close the server during teardown so the test remains hermetic and uses an ephemeral port.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/js/web/workers/worker-terminate-lifetime.test.ts`:
- Around line 26-29: The Windows branch currently skips overriding a parent
Malloc value, leaving any inherited Malloc in bunEnv; update the debugHeapEnv
construction so when isWindows is true you explicitly clear Malloc (e.g., set
Malloc to an empty string) instead of omitting it. Modify the ternary around
...(isWindows ? {} : { Malloc: "1" }) so debugHeapEnv explicitly includes
Malloc: "" for the isWindows case (referencing debugHeapEnv, bunEnv, isWindows,
and the Malloc env var).
---
Outside diff comments:
In `@test/js/web/workers/worker-terminate-lifetime.test.ts`:
- Around line 123-129: The promises waiting for worker readiness (e.g., the
"loaded" promise that sets w.onmessage/w.onerror and the similar "inflight" wait
around line 163) only reject on "error" and must also reject when the worker
closes; update the promise constructors so the worker "close" event also calls
reject (e.g., add w.addEventListener("close", err => reject(new Error('worker
closed before ready')), { once: true }) and likewise for any "inflight" wait)
ensuring every failure event (close/error/abort) rejects the awaited promise
with a descriptive error message; keep the existing "closed" listener separate
for normal close handling.
- Around line 155-158: The test uses dns.setServers(["192.0.2.1"]) which
contacts an external resolver; instead, start a local UDP blackhole server bound
to 127.0.0.1 with port: 0 before spawning the child, read the actual assigned
port, and pass that port into the child so the child can call
dns.setServers([`127.0.0.1:${port}`]) (the code path that contains
dns.setServers and dns.promises.resolve4 should read the injected port); ensure
the server simply accepts/ignores packets to keep the c-ares request in-flight
and close the server during teardown so the test remains hermetic and uses an
ephemeral port.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c7fc1956-003a-4bc7-8de3-2a7e95497a84
📒 Files selected for processing (1)
test/js/web/workers/worker-terminate-lifetime.test.ts
… Malloc clear - The in-flight DNS test now points c-ares at a local UDP socket that never replies (port 0), instead of TEST-NET 192.0.2.1; this also makes the pending-query precondition deterministic (no ICMP fast-fail). - The worker readiness waits reject on close so an early worker death fails with a message instead of hanging to the test timeout. - debugHeapEnv explicitly clears an inherited Malloc on Windows.
|
CI state for build 61687 (d5f126f, with main merged in): all 280 executed jobs passed, including every Linux, Windows, and x64-asan test lane. The three red GitHub contexts (darwin-14-aarch64, darwin-14-x64, darwin-26-aarch64) report "Expired": their jobs never left Buildkite's scheduled state because no macOS agents picked them up. That is agent capacity, not this diff; retrying those three jobs in Buildkite needs no new push. |
|
Build 61687 finished: 284 of 284 executed jobs passed (darwin-26-aarch64 and darwin-14-x64 eventually got agents and went green). The build's failed state comes solely from the darwin-14-aarch64 test step, whose two job slots expired in Buildkite's scheduled queue without ever being picked up by an agent. No test failed anywhere in the matrix; a Buildkite-side retry of that one step (no push needed) completes the run. |
…ifiers eager (#32407) ## Crash Sentry BUN-2V1E: segfault inside `WTF::TypeCastTraits<JSVMClientData>::isType` reached from `Zig::GlobalObject::visitChildrenImpl` on a concurrent GC helper thread. 695 lifetime events (26 in the last 24h), 100% Windows x64, 1.2.17 through 1.3.14. 31% of events carry both `workers_spawned=True` and `workers_terminated=True` vs a ~3% baseline, pointing at worker-termination churn. Also seen intermittently in CI as the `broadcast-channel-worker-gc` flake (b03f1e6 is a rekick for it). ``` WTF::ParallelHelperPool::Thread::work JSC::Heap::runBeginPhase lambda JSC::SlotVisitor::drainFromShared JSC::SlotVisitor::drain JSC::SlotVisitor::visitChildren JSC::MethodTable::visitChildren Zig::GlobalObject::visitChildren Zig::GlobalObject::visitChildrenImpl WebCore::clientData(JSC::VM&) WTF::downcast<JSVMClientData> WTF::is<JSVMClientData> TypeCastTraits<JSVMClientData>::isType <-- SEGV ``` ## Cause `visitChildrenImpl` ran: ```cpp WebCore::clientData(thisObject->vm())->httpHeaderIdentifiers().visit<Visitor>(visitor); ``` Two problems on this line: **1. `thisObject->vm()` dereferences cell state on the marker thread.** `JSGlobalObject::vm()` returns `*m_vm` (a raw `VM* const` stored on the cell); `clientData()` then does `downcast<JSVMClientData>(vm.clientData)` whose `RELEASE_ASSERT(!source || is<Target>(*source))` calls the virtual `isWebCoreJSClientData()`. The neighbouring `visitGlobalObjectMember(unique_ptr)` overload already guards a window where the concurrent marker visits a `Zig::GlobalObject` picked up via conservative stack scan while its IsoSubspace slot is being recycled; in that same window `m_vm` can read stale bytes, resolving to a garbage `clientData` whose vtable load faults. `visitor.vm()` (= `m_heap.vm()`) is guaranteed alive for the duration of marking and does not depend on the visited cell at all; this is how JSC's own `visitChildren` implementations (`FunctionExecutable`, `JSWeakObjectRef`, `Structure`) fetch the VM on the marker thread. **2. `httpHeaderIdentifiers()` was an unlocked lazy `std::optional::emplace()`** called from both the mutator (`NodeHTTP.cpp` header assignment) and concurrent GC helper threads. With more than one `Zig::GlobalObject` in a VM (ShadowRealm, test-isolation swap, bake) distinct parallel marker helpers each visit a different global and all call `httpHeaderIdentifiers()` on the same `JSVMClientData`, so two threads can enter `emplace()` on the same storage. The `HTTPHeaderIdentifiers` constructor only runs ~90 `LazyProperty::initLater()` calls (each a single tagged-pointer store), so there is nothing worth deferring. ## Fix - `ZigGlobalObject.cpp`: fetch the VM via `visitor.vm()` instead of `thisObject->vm()`. - `BunClientData.{h,cpp}`: `m_httpHeaderIdentifiers` is now a plain eagerly-constructed member; `httpHeaderIdentifiers()` is an inline accessor. ## Verification The race window is too narrow to trip deterministically on Linux. An honest probe against the unfixed debug (ASAN) build, with `Malloc=1` + `BUN_JSC_collectContinuously=1` + `BUN_JSC_numberOfGCMarkers=8`: - 5 iterations of an 8-round × 6-worker BroadcastChannel create/terminate/GC stress: clean. - 8 iterations of a 100-round × 8-ShadowRealm (parallel-marker emplace) stress: clean. So there is no fail-before proof to hand the gate; the crash signature is Windows-specific and timing-dependent. The fix is nonetheless clearly correct on inspection: - `visitor.vm()` is the JSC convention for the marker thread and cannot read through the visited cell. - An unlocked `std::optional::emplace()` reachable from two threads is a data race in any memory model. A new stress test in `test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts` hammers the exact path (multiple globals per VM via ShadowRealm, worker churn, forced parallel markers, `Malloc=1` on non-Windows) so a future regression on Windows CI will show up where the signature has already been observed. ``` bun bd test test/js/web/broadcastchannel/broadcast-channel-worker-gc.test.ts # 4 pass bun bd test test/js/node/http/node-http.test.ts -t headers # 5 pass (HTTPHeaderIdentifiers path) bun bd test test/js/node/http/numeric-header.test.ts # 1 pass ``` ## Related Checked #31990 / #32071 / #32082 (worker event-loop enqueue after terminate, Strong<> releases before VM teardown): none touch `visitChildrenImpl` or `vm.clientData` access from the marker thread. No open PR addresses this crash. The issue-matcher suggested four candidates; assessment against the actual stack: - #20641 (BUN-N2D): same `TypeCastTraits<JSVMClientData>::isType` frame but reached from `bunVMConcurrently` on the main event loop during libuv signal processing, not from a GC marker thread. Different code path; this PR does not touch it. - #20786 (BUN-PD8): same `isType` frame reached from `JSC::subspaceFor` inside `Request__create` on the HTTP server request path (main thread). Different code path; this PR does not touch it. - #27312: SIGILL (not SEGV) in `SlotVisitor::drain` on Linux during `bun test` cleanup. Adjacent area but a different fault signature; not claimed. - #31880: generic "multiple threads are crashing" under worker churn, no decoded stack. #32071 already declines to claim it for the same reason; not claimed here either. None are auto-closed by this PR. #20641 and #20786 suggest there may be other callers of `clientData()` that can see a bad `vm.clientData` on Windows; those are separate paths and out of scope here.
What does this PR do?
Fixes the Sentry crashes BUN-3DSK and BUN-3END: segfault during WebWorker shutdown in
Repro: any worker that loads
bun:sql(touchingBun.SQLis enough: the internal module's top-levelinit()stores Strong refs in the per-VM MySQL/Postgres contexts) and then exits:Cause:
WebWorker::shutdownstep 3 (WebWorker__teardownJSCVM) derefs the JSC VM to zero, destroying the Heap and its HandleSet. Step 5 (VirtualMachine::destroy) then callsdeinit_runtime_state, which drops theBox<RuntimeState>. Two of its fields still own JSC GC handles at that point:sql_rare: the MySQL/Postgres contexts'on_query_resolve_fn/on_query_reject_fnStrongs (the observed crash),global_dns_data: dropping it runsares_destroy, whose EDESTRUCTION callbacks settle promise Strongs.RareData.s3_default_client(cached by theBun.s3getter) is the same class:RareDatais also dropped indestroy(), so a worker that touchedBun.s3hits the identical UAF. It is released in the same pre-teardown step on both paths.Strong's release path (Bun__StrongRef__delete->HandleSet::heapFor(slot)->deallocate(slot)) reads the freed HandleBlock and writes the freed HandleSet's free list. The same ordering exists on the main thread underBUN_DESTRUCT_VM_ON_EXIT=1(global_exitrunsZig__GlobalObject__destructOnExitbeforedestroy()), which ASAN CI lanes enable.Fix: new
RuntimeHooks::release_runtime_state_js_handlesslot, called from both teardown paths after the last JS runs (socket-group close callbacks, so SQL on_close can still dispatch) and before the VM deref. It releases the SQL context Strongs and drops the DNSGlobalDatawhile the JSC heap is alive. This mirrors the existingcancel_all_timerspre-teardown hook, which already handles the same ordering for timer handles.Dropping
GlobalDatapre-teardown surfaced two follow-on issues in the same ordering class (caught by ASAN while validating the fix with an in-flight DNS query):reject_laternow drops the deferred synchronously when the VM is shutting down. The event loop never ticks again, so the enqueued task could never run; its promise Strong has to drop while the heap is alive, and the stranded box leaked.GlobalData::dropunlinks the resolver's pending-query timeout timer from the per-thread timer heap, which now outlives the box and is still walked byWTFTimer::updateduring teardown.Audited the remaining
RuntimeStatefields (timer heap, ssl_ctx_cache, editor_context, entry_point, transpiler_arena, body_value_pool, isolation_handles): none reach JSC from their drop glue.Overlap note: #31833 independently adds the same hook for the SQL Strongs and the S3 release, but only on the
global_exitpath. This PR additionally wires the release intoWebWorker::shutdown(the path the Sentry crashes fire on), covers the DNS data and its two follow-on teardown bugs, and adds regression tests. The overlapping hunks are semantically identical, so whichever lands second rebases trivially.How did you verify your code works?
Three new tests in
test/js/web/workers/worker-terminate-lifetime.test.ts(worker +Bun.SQL, worker terminated with in-flight DNS, main thread +BUN_DESTRUCT_VM_ON_EXIT=1), all run withMalloc=1so WebKit's fastMalloc uses the system allocator and ASAN builds poison freed JSC heap memory; with libpas the freed pages stay mapped, which is why production only saw rare macOS segfaults at small offsets.On the unfixed debug+ASAN build the two SQL tests fail 10/10 runs with:
With the fix the file passes. Also verified end-to-end against a live postgres (query resolve/reject dispatch through the context Strongs, worker with an open connection exiting cleanly).