Skip to content

worker: fix cross-thread HandleSet race in getHeapSnapshot#30185

Merged
Jarred-Sumner merged 9 commits into
mainfrom
farm/46285952/worker-getheapsnapshot-handleset-race
May 3, 2026
Merged

worker: fix cross-thread HandleSet race in getHeapSnapshot#30185
Jarred-Sumner merged 9 commits into
mainfrom
farm/46285952/worker-getheapsnapshot-handleset-race

Conversation

@robobun

@robobun robobun commented May 3, 2026

Copy link
Copy Markdown
Collaborator

test/js/node/worker_threads/worker_threads.test.ts occasionally segfaults in CI with

panic: Segmentation fault at address 0x10

on a GC helper thread:

wtfThreadEntryPoint
  AutomaticThread::start
    ParallelHelperPool::Thread::work
      Heap::runBeginPhase(GCConductor)::$_1
        SlotVisitor::drainFromShared
          MarkingConstraintSolver::runExecutionThread
            MarkingConstraint::execute   ← "Sh" Strong Handles
              HandleSet::visitStrongHandles
                *(nullptr + offsetof(HandleNode, m_value))  = *(0x10)

(decoded from the bun.report trace on build 50529 / 🐧 13 x64 and symbolized against that build's bun-profile).

Cause

jsWorkerPrototypeFunction_getHeapSnapshotBody does:

Strong<JSPromise> strong(vm, promise);                      // parent VM's HandleSet
worker.postTaskToWorkerGlobalScope([strong, parentId](auto& workerCtx) {
    ...
    ScriptExecutionContext::postTaskTo(parentId,
        [strong, snapshot = ...](auto& parentCtx) { ... }); // runs on worker thread
});

JSC::Strong<T> has no move constructor. Capturing it by value copy-constructs it, which calls HandleSet::allocate() + m_strongList.push(); destroying it calls HandleSet::deallocate() + NodeList::remove(). Both happen on the worker thread against the parent VM's HandleSet, without the parent VM's lock.

HandleSet::m_strongList is a SentinelLinkedList<HandleNode> — not thread-safe. push/remove transiently null m_next/m_prev. The parent VM's "Sh" (Strong Handles) marking constraint (Heap::addCoreConstraints) iterates that list during GC; when it follows a null m_next it reads *((HandleNode*)nullptr)->slot()*(0x0 + 0x10).

The heapHelperPool() is process-global, so the crashing helper thread belongs to the parent VM's collector even though the worker VM's BunV8HeapSnapshotBuilder full GC is in progress at the same time.

This has been there since getHeapSnapshot was added — the recent worker lifetime rewrites (#29957, #29937) didn't introduce it.

Fix

Heap-allocate the Strong<JSPromise> once on the parent thread and pass only the raw pointer through the cross-thread lambdas. The worker thread never dereferences it, so it never touches the parent VM's HandleSet. The parent-side completion lambda resolves the promise and frees the handle.

Worker::postTaskToWorkerGlobalScope now returns bool so a lost race to Closing/Closed (worker exited between isOnline() and the post) rejects with ERR_WORKER_NOT_RUNNING instead of silently leaking the handle. If postTaskTo(parentId, …) on the return trip fails (parent context gone), the handle intentionally leaks — deleting a parent-VM Strong from the worker thread is exactly the bug we're fixing, and the parent VM is tearing down anyway.

Verification

Stress fixture (heap-snapshot-gc-race-fixture.js, 300 iterations of await worker.getHeapSnapshot(); Bun.gc(true)), 40 runs each on linux-x64 release:

build segfault at 0x10
52bdf47 (CI artifact, no fix) 15 / 40
this branch 0 / 40

The new worker_heap_snapshot_gc.test.ts runs the fixture — 300 iters in release, 5 in debug (a single debug heap snapshot takes ~1.6s so the race window, which is a handful of instructions after each snapshot, is impractical to hit there; the debug pass is a functional check).

Drive-by: non-LTO strip leaves orphan PT_GNU_EH_FRAME

While reproducing I hit a second, unrelated crash in locally-built (non-LTO) release binaries: stripFlags removed .eh_frame_hdr on linux-gnu unconditionally, but the linker only passes --no-eh-frame-hdr when LTO is on. GNU strip doesn't rewrite the program header table, so the PT_GNU_EH_FRAME phdr was left pointing at unmapped memory and any stack unwind (e.g. WTF::Thread teardown after a worker exits) faulted. CI release builds always have LTO on so they weren't affected. Gated the section removal on c.lto to match the linker flag.

…etHeapSnapshot

worker.getHeapSnapshot() captured a Strong<JSPromise> (allocated in the
parent VM's HandleSet) by value in a lambda that runs on the worker thread.
Strong<T> has no move constructor, so on the worker thread the inner
lambda's capture copy-constructs a new HandleSet slot and the outer
lambda's destruction deallocates one — both mutating the parent VM's
HandleSet::m_strongList (a non-thread-safe SentinelLinkedList) without
the parent VM's lock.

When the parent VM's GC runs the "Sh" (Strong Handles) marking
constraint concurrently, visitStrongHandles iterates m_strongList while
the worker thread is mid-push/remove; it follows a transiently-null next
pointer and faults reading HandleNode::m_value at (nullptr + 0x10).

Heap-allocate the Strong once on the parent thread and pass only the raw
pointer across. The worker thread never dereferences it; the parent-side
completion lambda resolves the promise and frees the handle. Make
postTaskToWorkerGlobalScope return whether the task was accepted so a
lost race to Closing rejects cleanly instead of leaking the handle.
@github-actions github-actions Bot added the claude label May 3, 2026
@robobun

robobun commented May 3, 2026

Copy link
Copy Markdown
Collaborator Author
Updated 10:56 AM PT - May 3rd, 2026

@robobun, your commit 5659245 has 2 failures in Build #50633 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 30185

That installs a local version of the PR into your bun-30185 executable, so you can run:

bun-30185 --bun

@github-actions

github-actions Bot commented May 3, 2026

Copy link
Copy Markdown
Contributor

Found 1 issue this PR may fix:

  1. getHeapSnapshot causes lodestar beacon node to crash #23393 - Reports crash when calling v8.getHeapSnapshot() under load, which is exactly the cross-thread HandleSet race condition this PR fixes

If this is helpful, copy the block below into the PR description to auto-close this issue on merge.

Fixes #23393

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented May 3, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Worker heap-snapshot promise lifetime now uses a heap-allocated Strong passed across threads; postTaskToWorkerGlobalScope returns bool consistently and is documented. Added a GC-race fixture and integration test. Linux unwind-section stripping is gated to LTO builds.

Changes

Worker Heap Snapshot GC Safety

Layer / File(s) Summary
API doc & semantics
src/bun.js/bindings/webcore/Worker.h
Adds comment documenting postTaskToWorkerGlobalScope(...) return semantics (true when queued/posted; false when Closing/Closed or context gone).
Control-flow / return consistency
src/bun.js/bindings/webcore/Worker.cpp
postTaskToWorkerGlobalScope(Function<void(ScriptExecutionContext&)>&&) updated to return bool consistently: returns true when queued in State::Pending, false when dropping in State::Closing/Closed, and the result of ScriptExecutionContext::postTaskTo(...) when State::Running.
Core implementation / lifetime fix
src/bun.js/bindings/webcore/JSWorker.cpp
JSWorker::getHeapSnapshot() now heap-allocates a Strong<JSPromise>* and passes only the raw pointer into cross-thread lambdas. If posting to the worker fails, it deletes the handle and rejects the promise with ERR_WORKER_NOT_RUNNING; on parent-thread postback it resolves/rejects via an owned Strong pointer.
Test fixture
test/js/node/worker_threads/heap-snapshot-gc-race-fixture.js
Adds a worker fixture that repeatedly calls worker.getHeapSnapshot() and forces synchronous full GC (Bun.gc(true)) to exercise heap-snapshot vs parent-thread GC concurrency.
Integration test
test/js/node/worker_threads/worker_heap_snapshot_gc.test.ts
Adds a Bun test that spawns the fixture, configures ITERS (debug vs normal), captures stdout/stderr/exit, and asserts stdout === "ok\n", stderr === "", exitCode === 0, and no signal.

Linux Build Unwind Section Stripping

Layer / File(s) Summary
Build flags change
scripts/build/flags.ts
The Linux (glibc/gnu) unwind/exception-section strip rule is now gated by c.lto (only applied when LTO is enabled). Added comments explaining the orphan PT_GNU_EH_FRAME hazard on non-LTO builds and why the conditional was tightened.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Title directly addresses the main issue: fixing a cross-thread HandleSet race in getHeapSnapshot, which is the primary bug fix in this PR.
Description check ✅ Passed Description covers both required sections: 'What does this PR do?' (detailed problem analysis and fix) and 'How did you verify your code works?' (stress testing results).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/node/worker_threads/worker_threads.test.ts`:
- Around line 504-507: Replace the three separate assertions with a single
crash-oriented assertion that preserves all subprocess metadata: after awaiting
Promise.all([proc.stdout.text(), proc.stderr.text(), proc.exited,
proc.signalCode]) (or reading proc.signalCode after exit), build an object
containing stdout, stderr, exitCode and signalCode and assert them together
(e.g. expect({ stdout, stderr, exitCode, signalCode }).toEqual({ stdout: "ok\n",
stderr: "", exitCode: 0, signalCode: null })). Update references to proc,
proc.stdout.text(), proc.stderr.text(), proc.exited and proc.signalCode in the
test to ensure stdout/stderr and exit/signal are asserted in one combined
expectation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7724ae22-e5c7-465d-b66e-fdbd7755aa7a

📥 Commits

Reviewing files that changed from the base of the PR and between d484fd6 and c6820ac.

📒 Files selected for processing (5)
  • src/bun.js/bindings/webcore/JSWorker.cpp
  • src/bun.js/bindings/webcore/Worker.cpp
  • src/bun.js/bindings/webcore/Worker.h
  • test/js/node/worker_threads/heap-snapshot-gc-race-fixture.js
  • test/js/node/worker_threads/worker_threads.test.ts

Comment thread test/js/node/worker_threads/worker_threads.test.ts Outdated

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find any issues, but this is subtle cross-thread GC/HandleSet code with manual raw-pointer ownership transfer across VMs, so it's worth a human pass.

Extended reasoning...

Overview

This PR fixes an intermittent segfault in worker.getHeapSnapshot() by changing how the parent-VM Strong<JSPromise> is kept alive across the parent→worker→parent round-trip. Instead of capturing the Strong by value in cross-thread lambdas (which copy-constructs/destroys it on the worker thread and races the parent VM's HandleSet), it heap-allocates the Strong once on the parent thread and threads only a raw pointer through the lambdas. Worker::postTaskToWorkerGlobalScope is changed from void to bool so the caller can clean up on a lost race to Closing/Closed. A stress fixture and test are added.

Security risks

None identified. No untrusted input parsing, auth, or crypto. The intentional leak when postTaskTo(parentId, …) fails on the return trip is bounded to parent-VM teardown and is well-justified (deleting a parent-VM Strong from the worker thread is the bug being fixed).

Level of scrutiny

High. This is concurrency code at the JSC GC boundary: it relies on non-obvious invariants (Strong<T> has no move ctor; HandleSet::m_strongList is a non-thread-safe SentinelLinkedList; lambda capture lifetimes across ScriptExecutionContext::postTaskTo). The fix uses raw new/delete with ownership handed across threads via raw pointer, which is correct here but warrants a second set of eyes from someone familiar with WebKit/JSC conventions — e.g., whether a different pattern (holding the VM lock, or a thread-safe promise wrapper) would be preferred.

Other factors

  • I verified postTaskToWorkerGlobalScope has no other callers, so the voidbool change is safe.
  • The Pending-state path (task queued in m_pendingTasks, worker terminated before Running) would leak promiseHandle when the deque is destroyed, but that's a pre-existing edge case where the promise hung forever anyway, and destruction happens on the parent thread.
  • The root-cause analysis is thorough and the 40-run stress verification (15/40 → 0/40) is convincing.
  • No prior reviews on the timeline; bug-hunting system found nothing.

robobun and others added 2 commits May 3, 2026 12:45
…rip on LTO

The non-LTO release strip step was removing .eh_frame_hdr while the
linker (which only passes --no-eh-frame-hdr under LTO) had already
emitted a PT_GNU_EH_FRAME phdr pointing at it. GNU strip does not
rewrite the program header table, so the phdr was left pointing at
unmapped memory and any stack unwind during worker teardown faulted.
Gate the section removal on c.lto to match the linker flag.

Move the stress test out of worker_threads.test.ts so it runs in
isolation, and group the subprocess result into one assertion so a
crash shows stdout/stderr/exitCode/signalCode together.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/node/worker_threads/worker_heap_snapshot_gc.test.ts`:
- Around line 25-32: The test unconditionally asserts stdout/stderr alongside
exitCode which makes crash failures noisy; change the assertion after awaiting
Promise.all([proc.stdout.text(), proc.stderr.text(), proc.exited]) to branch on
exitCode: if exitCode === 0 then assert the full success contract (stdout ===
"ok\n" and stderr === "" and exitCode === 0 and signalCode === null), otherwise
assert only exitCode and proc.signalCode (still including stdout/stderr in the
thrown expectation so they appear in the failure diff) — update the existing
expect that compares { stdout, stderr, exitCode, signalCode: proc.signalCode }
to perform this conditional assertion logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b62e2bd9-9de1-447f-b50a-68753a232ef6

📥 Commits

Reviewing files that changed from the base of the PR and between 125dcec and 54625fd.

📒 Files selected for processing (1)
  • test/js/node/worker_threads/worker_heap_snapshot_gc.test.ts

Comment thread test/js/node/worker_threads/worker_heap_snapshot_gc.test.ts Outdated

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find any bugs, but this touches cross-thread JSC Strong handle lifetime with manual new/delete and an intentional-leak path, plus a drive-by build-flags change — worth a human look (and CI build is currently red on 125dcec).

Extended reasoning...

Overview

This PR fixes a cross-thread data race in worker.getHeapSnapshot() where a JSC::Strong<JSPromise> belonging to the parent VM was copy-constructed/destroyed on the worker thread, mutating the parent VM's HandleSet::m_strongList without the heap lock while the parent's GC was iterating it. The fix heap-allocates the Strong once on the parent thread and passes only a raw pointer through the cross-thread lambdas; postTaskToWorkerGlobalScope now returns bool so a lost race to Closing/Closed cleans up on the parent thread. It also adds a stress-test fixture + test, and includes a drive-by gating of .eh_frame strip flags on c.lto in scripts/build/flags.ts.

Security risks

No new attack surface — this is internal lifetime management. The main risk class is memory safety: raw new/delete of a Strong<JSPromise>, an explicitly documented intentional leak when the parent context is gone on the return trip, and correctness of the accepted == false cleanup path. None of these are security-sensitive in the auth/crypto sense, but they are crash-safety-sensitive.

Level of scrutiny

High. This is concurrency + GC-adjacent C++ with manual memory management and subtle ownership transfer across threads. The reasoning in the PR description is thorough and the fix looks correct to me (the worker thread now only carries an opaque pointer; allocation and destruction both happen on the parent thread), but this is exactly the category of change that benefits from a second pair of human eyes — particularly the State::Pending queue path (task is moved into m_pendingTasks and later run on the worker thread, which is fine since the lambda only captures the raw pointer) and the intentional-leak rationale.

Other factors

  • CI build #50612 against 125dcec shows widespread build-zig/build-cpp failures; unclear if related to this change or infra, but it should be green before merge.
  • The scripts/build/flags.ts change is a separate drive-by fix (orphan PT_GNU_EH_FRAME on non-LTO local builds) bundled into this PR — small and well-explained, but adds scope.
  • One outstanding coderabbit nitpick on the test assertion shape is purely stylistic and not blocking.

A single 300-iteration run only trips the race ~30-50% of the time on
this hardware; loop the subprocess 15× in release so the without-fix
case fails reliably (0/20 false negatives observed, max 12 runs before
crash). Debug keeps a single short functional pass.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find any bugs, but this is a cross-thread GC/lifetime fix in JSC bindings with manual new/delete of a Strong<> handle plus an unrelated build-system strip change — worth a human pass on the ownership/leak edge cases (e.g. task queued in Pending then dropped) and the flags.ts drive-by.

Extended reasoning...

Overview

This PR fixes a CI segfault in worker.getHeapSnapshot() by changing how the parent-VM Strong<JSPromise> is carried across the parent→worker→parent round-trip. Instead of capturing the Strong by value (which copy-constructs/destructs on the worker thread and races the parent VM's HandleSet::m_strongList against the GC's Strong Handles constraint), it heap-allocates the Strong once on the parent thread and passes only a raw pointer through the lambdas. Worker::postTaskToWorkerGlobalScope is changed from void to bool so the caller can clean up when the worker has already closed. A new stress fixture + test are added, and there's a separate drive-by change to scripts/build/flags.ts gating .eh_frame_hdr stripping on c.lto.

Security risks

None identified. No user-controlled input flows into the changed code paths beyond what was already there; the change is purely about which thread allocates/frees a GC handle. The flags.ts change narrows when sections are stripped (strictly less aggressive), so no new attack surface.

Level of scrutiny

High. This touches:

  • JSC GC internals (Strong<T>, HandleSet) and cross-VM thread safety
  • Manual memory management (new / delete / std::unique_ptr adoption of a raw pointer) across thread boundaries
  • The Worker lifecycle state machine, with a public-ish API signature change (voidbool)
  • An intentional-leak path on the return trip when the parent context is gone

The diagnosis and fix are very well argued in the PR description and the 15/40 → 0/40 repro is convincing, but the correctness depends on subtle invariants (e.g. ~Worker running on the parent thread, postTaskTo move semantics, what happens if the outer task is queued in State::Pending and then dropped without executing). These are exactly the kinds of edge cases a human familiar with the recent worker-lifetime rewrites (#29957, #29937) should sanity-check.

Other factors

  • The PR bundles an unrelated build-system change (stripFlags gating on LTO). It's well-explained and looks correct, but it changes how release Linux binaries are post-processed and deserves its own glance.
  • The new test runs the fixture 15×300 iterations in release with a 120s timeout — reasonable, but adds non-trivial CI time.
  • CI failures on the latest commit appear unrelated (bake/dev, http keepalive timeouts, a pre-existing ASAN assertion in 26225.test.ts).
  • All coderabbit nitpicks are resolved; no outstanding human review comments.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find any bugs, but this touches cross-thread JSC GC/HandleSet lifetime with manual new/delete and an intentional-leak path, plus a drive-by build-flag change — worth a human pass on the ownership/leak tradeoffs.

Extended reasoning...

Overview

Fixes a cross-thread race in worker.getHeapSnapshot() where capturing a JSC::Strong<JSPromise> by value caused the worker thread to copy-construct/destroy it against the parent VM's HandleSet::m_strongList without the parent VM lock, racing the "Sh" (Strong Handles) GC marking constraint and segfaulting at 0x10. The fix heap-allocates the Strong once on the parent thread and passes only the raw pointer through the cross-thread lambdas; the parent-side completion lambda adopts it into a std::unique_ptr and resolves the promise. Worker::postTaskToWorkerGlobalScope now returns bool so a lost race to Closing/Closed rejects with ERR_WORKER_NOT_RUNNING instead of leaking. There is also an unrelated drive-by in scripts/build/flags.ts gating .eh_frame_hdr stripping on c.lto to avoid an orphan PT_GNU_EH_FRAME phdr in non-LTO local builds. New stress fixture + test added.

Security risks

None identified. No untrusted input handling, auth, or crypto changes — this is internal VM/GC lifetime management. The intentional leak of promiseHandle when postTaskTo(parentId, …) fails on the return trip is bounded (parent VM is tearing down) and is the safe choice vs. cross-thread HandleSet mutation.

Level of scrutiny

High. This is concurrency-sensitive C++ at the JSC GC boundary with manual new/delete ownership across two threads and three lambda hops. The root-cause analysis is thorough and the fix matches it, but the design choices (raw pointer through lambdas, intentional leak when the parent context vanishes mid-round-trip, voidbool API change) are exactly the kind of thing a human reviewer familiar with Bun's worker/VM lifecycle should sign off on. The flags.ts change is logically separate and affects release binary stripping behavior.

Other factors

  • Bug-hunting system found no issues; coderabbit nits were addressed/resolved.
  • Stress verification in the PR description (15/40 → 0/40 segfaults) is convincing, and a regression test is included.
  • The postTaskToWorkerGlobalScope return-type change is safe (only caller is getHeapSnapshot, per the in-code comment), but that invariant is worth a human confirming.
  • No CODEOWNERS for the touched paths.

The x64-asan CI lane is release+ASAN (~3x slower per heap snapshot) but
isDebug is false there, so it was running the full 15x300 workload and
timing out at 120s after 7 clean attempts. Treat ASAN like the debug
build and run a single shorter functional pass; plain release lanes
keep the full 15-attempt stress.
Comment thread src/bun.js/bindings/webcore/JSWorker.cpp
…-trip leak

On x64-baseline/musl/aarch64 CI the worker has been observed to exit on
its own after a few hundred heap snapshots, which surfaces as a clean
ERR_WORKER_NOT_RUNNING rejection — not the process-level 0x10 segfault
this fixture is designed to catch. Recreate the worker and continue so
the round-trip count (and thus the number of HandleSet race
opportunities) is preserved.

Also expand the JSWorker.cpp comment to cover the 'task accepted but
never run' leak window alongside the existing return-trip note.
Comment thread scripts/build/flags.ts Outdated

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of my earlier notes (outbound-trip leak comment, linkerFlags typo) are addressed and I found nothing further — deferring to a human for the cross-thread Strong<JSPromise> ownership change and the stripFlags LTO gating, which are subtle enough to warrant a second pair of eyes.

Extended reasoning...

Overview

This PR fixes a cross-thread race in worker.getHeapSnapshot() where a parent-VM Strong<JSPromise> was being copy-constructed/destroyed on the worker thread, racing the parent VM's GC "Strong Handles" marking constraint and segfaulting at 0x10. The fix heap-allocates the Strong once on the parent thread and threads only a raw pointer through the cross-thread lambdas, with postTaskToWorkerGlobalScope now returning bool so a lost race to Closing/Closed can be cleaned up synchronously. It also includes an unrelated drive-by: gating .eh_frame_hdr removal in stripFlags on c.lto so non-LTO local builds don't end up with an orphan PT_GNU_EH_FRAME phdr. Two new test files exercise the race under GC pressure.

Security risks

None identified. No auth, crypto, permissions, or user-input parsing is touched. The change is internal lifetime management for a profiling/debugging API and a build-script strip-flag predicate.

Level of scrutiny

High. The core change is cross-thread ownership of a JSC GC handle via raw pointer, with two explicitly documented (and now commented) leak windows accepted as a crash→leak tradeoff. This is exactly the kind of subtle concurrency/GC-lifetime reasoning where a human reviewer familiar with the recent Worker lifetime rewrites (#29957/#29937) should confirm the invariants — particularly that the parent-thread completion lambda is the only place the Strong is dereferenced/freed, and that the !accepted cleanup path is sound. The stripFlags change is small but affects release-binary layout on linux-gnu and is bundled as a drive-by; it deserves a glance from someone who owns the build scripts.

Other factors

  • My two prior inline comments (documenting the outbound-trip leak window; linkFlagslinkerFlags typo) were both addressed in 164fe32 and 557a6b2.
  • The bug-hunting system found no issues in this pass.
  • CodeRabbit's nits on test assertion shape were resolved/declined with reasonable justification.
  • robobun reports musl build-bun failures on 557a6b2; the author has since pushed a CI retrigger (5659245). The flags.ts change is gated on c.abi === "gnu" so it shouldn't affect musl, but CI should be green before merge.
  • Stress verification (15/40 segfaults pre-fix → 0/40 post-fix) is compelling.

@Jarred-Sumner Jarred-Sumner merged commit 0150c57 into main May 3, 2026
74 of 77 checks passed
@Jarred-Sumner Jarred-Sumner deleted the farm/46285952/worker-getheapsnapshot-handleset-race branch May 3, 2026 23:43
xhjkl pushed a commit to xhjkl/bun that referenced this pull request May 14, 2026
…0185)

`test/js/node/worker_threads/worker_threads.test.ts` occasionally
segfaults in CI with

```
panic: Segmentation fault at address 0x10
```

on a GC helper thread:

```
wtfThreadEntryPoint
  AutomaticThread::start
    ParallelHelperPool::Thread::work
      Heap::runBeginPhase(GCConductor)::$_1
        SlotVisitor::drainFromShared
          MarkingConstraintSolver::runExecutionThread
            MarkingConstraint::execute   ← "Sh" Strong Handles
              HandleSet::visitStrongHandles
                *(nullptr + offsetof(HandleNode, m_value))  = *(0x10)
```

(decoded from the `bun.report` trace on [build 50529 / 🐧 13
x64](https://buildkite.com/bun/bun/builds/50529#019dec3e-4d11-4651-b908-84e601bc2db3)
and symbolized against that build's `bun-profile`).

## Cause

`jsWorkerPrototypeFunction_getHeapSnapshotBody` does:

```cpp
Strong<JSPromise> strong(vm, promise);                      // parent VM's HandleSet
worker.postTaskToWorkerGlobalScope([strong, parentId](auto& workerCtx) {
    ...
    ScriptExecutionContext::postTaskTo(parentId,
        [strong, snapshot = ...](auto& parentCtx) { ... }); // runs on worker thread
});
```

`JSC::Strong<T>` has **no move constructor**. Capturing it by value
copy-constructs it, which calls `HandleSet::allocate()` +
`m_strongList.push()`; destroying it calls `HandleSet::deallocate()` +
`NodeList::remove()`. Both happen on the **worker thread** against the
**parent VM's** `HandleSet`, without the parent VM's lock.

`HandleSet::m_strongList` is a `SentinelLinkedList<HandleNode>` — not
thread-safe. `push`/`remove` transiently null `m_next`/`m_prev`. The
parent VM's "Sh" (Strong Handles) marking constraint
(`Heap::addCoreConstraints`) iterates that list during GC; when it
follows a null `m_next` it reads `*((HandleNode*)nullptr)->slot()` →
`*(0x0 + 0x10)`.

The `heapHelperPool()` is process-global, so the crashing helper thread
belongs to the parent VM's collector even though the worker VM's
`BunV8HeapSnapshotBuilder` full GC is in progress at the same time.

This has been there since `getHeapSnapshot` was added — the recent
worker lifetime rewrites (oven-sh#29957, oven-sh#29937) didn't introduce it.

## Fix

Heap-allocate the `Strong<JSPromise>` once on the parent thread and pass
only the raw pointer through the cross-thread lambdas. The worker thread
never dereferences it, so it never touches the parent VM's `HandleSet`.
The parent-side completion lambda resolves the promise and frees the
handle.

`Worker::postTaskToWorkerGlobalScope` now returns `bool` so a lost race
to `Closing`/`Closed` (worker exited between `isOnline()` and the post)
rejects with `ERR_WORKER_NOT_RUNNING` instead of silently leaking the
handle. If `postTaskTo(parentId, …)` on the return trip fails (parent
context gone), the handle intentionally leaks — deleting a parent-VM
`Strong` from the worker thread is exactly the bug we're fixing, and the
parent VM is tearing down anyway.

## Verification

Stress fixture (`heap-snapshot-gc-race-fixture.js`, 300 iterations of
`await worker.getHeapSnapshot(); Bun.gc(true)`), 40 runs each on
linux-x64 release:

| build | segfault at `0x10` |
| --- | --- |
| `52bdf47` (CI artifact, no fix) | 15 / 40 |
| this branch | 0 / 40 |

The new `worker_heap_snapshot_gc.test.ts` runs the fixture — 300 iters
in release, 5 in debug (a single debug heap snapshot takes ~1.6s so the
race window, which is a handful of instructions after each snapshot, is
impractical to hit there; the debug pass is a functional check).

## Drive-by: non-LTO strip leaves orphan `PT_GNU_EH_FRAME`

While reproducing I hit a second, unrelated crash in locally-built
(non-LTO) release binaries: `stripFlags` removed `.eh_frame_hdr` on
linux-gnu unconditionally, but the linker only passes
`--no-eh-frame-hdr` when LTO is on. GNU strip doesn't rewrite the
program header table, so the `PT_GNU_EH_FRAME` phdr was left pointing at
unmapped memory and any stack unwind (e.g. WTF::Thread teardown after a
worker exits) faulted. CI release builds always have LTO on so they
weren't affected. Gated the section removal on `c.lto` to match the
linker flag.

---------

Co-authored-by: robobun <robobun@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants