Fix event-loop hang when dlopen'd Go/cgo lib coexists with WASM on Linux#31161
Fix event-loop hang when dlopen'd Go/cgo lib coexists with WASM on Linux#31161robobun wants to merge 15 commits into
Conversation
Fixes #31158. JavaScriptCore suspends JS threads for GC/code-cache flushes by sending a per-thread signal (SIGPWR on Linux in our WebKit fork) and having the handler snapshot the interrupted thread's context. The handler's sanity check — "am I running on the tracked thread's stack?" — reads its own SP, so it only succeeds when the signal arrives on the normal stack. Go's cgo runtime (and others following the same pattern — Mono, some JNI layouts, Rust cdylibs) walks every signal in initsig() and slaps SA_ONSTACK onto any handler it didn't install itself, including ours. In a process that already has an alternate signal stack (ASAN, Bun's crash handler) the next JSC suspend lands on the alt stack, the stack check fails every time, Thread::suspend() spins forever on SIGPWR retries and the event loop stops — the first WASM compile after a dlopen freezes the VM. Clear SA_ONSTACK on SIGPWR after every dlopen from both entry points (bun:ffi via sys::dlopen, process.dlopen via a new C export).
|
Found 1 issue this PR may fix:
🤖 Generated with Claude Code |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a Unix-only post-dlopen repair that clears SA_ONSTACK from SIGPWR when present, exposes it via a C ABI, invokes it from JSC's dlopen binding, and adds a glibc-only regression test reproducing the original hang scenario. ChangesSignal handler repair on dlopen
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/jsc/bindings/BunProcess.cpp`:
- Around line 538-539: Add a brief explanatory comment immediately above the
call to Bun__repairJscGcSignalAfterDlopen() (the code block that checks "if
(handle) ...") stating that this call repairs JSC's SIGPWR handler when foreign
libraries (e.g., Go cgo) have reinstalled SIGPWR with SA_ONSTACK, which can
cause a subtle event-loop hang; mention the signal name (SIGPWR), the SA_ONSTACK
flag, and that Bun__repairJscGcSignalAfterDlopen() restores the expected
handler/flags so the JSC GC signal handling and event loop remain stable after
dlopen of foreign libs.
- Around line 538-539: Wrap the call to Bun__repairJscGcSignalAfterDlopen behind
a Linux-only preprocessor guard so the Linux-specific SIGPWR repair only runs on
Linux; keep the existing if (handle) check and change it to only call
Bun__repairJscGcSignalAfterDlopen when the platform macro indicates Linux (e.g.,
surround the Bun__repairJscGcSignalAfterDlopen() invocation with an OS(LINUX) /
__linux__ preprocessor check) so the platform-specific behavior is explicit.
- Line 389: Add a short explanatory comment at the call site of
Bun__repairJscGcSignalAfterDlopen() clarifying that this extern "C" function
(matches Rust fn Bun__repairJscGcSignalAfterDlopen) is a post-dlopen fix for
JSC’s Linux SIGPWR handler: it clears SA_ONSTACK to avoid Thread::suspend()
retry/deadlock when an alternate signal stack is present; note that the Rust
implementation is a no-op on non-Linux Unixes so no OS(LINUX) guard is required
in the C++ caller.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: bbc2f969-d8d7-4554-a299-2d410570b2c4
📒 Files selected for processing (3)
src/jsc/bindings/BunProcess.cppsrc/sys/lib.rstest/regression/issue/31158.test.ts
|
This PR may be a duplicate of:
🤖 Generated with Claude Code |
|
Cross-referencing #29844 (flagged by the duplicate-finder): it's aimed at a related but distinct Go-dlopen signal problem — full handler replacement (Go calls But the wire Go hits for SIGPWR is a different code path in cgo init: If the maintainer would rather land the broader #29844 (covers SIGPIPE restoration etc. too) that one supersedes this PR and I'll close it. If they prefer a narrow targeted fix while #29844 is still in review (it's been open since last fall with stale CI), this one is smaller and ships immediately. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/regression/issue/31158.test.ts`:
- Around line 77-82: Remove the harness-side watchdog: delete the
setTimeout(...) that assigns watchdog and calls proc.kill("SIGKILL") as well as
the clearTimeout(watchdog) usage so the test relies on the existing "await using
proc = Bun.spawn(...)" scoped cleanup and the outer test timeout instead; ensure
no leftover references to the watchdog variable remain.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7a652d8c-4f42-47eb-9ef6-a54ce21cd981
📒 Files selected for processing (1)
test/regression/issue/31158.test.ts
Raised per-test timeout to 15s so a regression still times out gracefully; await using proc = Bun.spawn(...) kills the child on scope exit without a manual setTimeout+clearTimeout. Matches the no-setTimeout-in-tests rule.
|
Thanks for the fast turnaround on this. The fix works — but stepping back, the layering is getting hard to follow:
This is a workaround for a workaround. The next runtime that does something unexpected to signal flags will need another The real fix is what you mentioned as the follow-up: use the interrupted SP from ucontext in Would it be feasible to ship the WebKit handler fix directly instead of (or alongside) this band-aid? Then the Longer term, cooperative safepoint polling (V8's approach) would eliminate signal-based suspension entirely — but that's a bigger architectural change and the ucontext fix is sufficient. |
|
You're right — shipping both is the cleanest landing. Opened oven-sh/WebKit#235 with the ucontext-SP fix; once the preview autobuild finishes (~1h) I'll bump Agree on cooperative safepoint polling being the cleaner endgame — separate track, much bigger lift. |
There was a problem hiding this comment.
I didn't find any bugs, but this rewrites process-wide signal disposition (SIGPWR) on a hot path that JSC's GC depends on, and there's an open call in the thread about landing this narrow fix vs the broader #29844 — both feel like they want a maintainer's eyes.
Extended reasoning...
Overview
The PR adds a post-dlopen repair step that reads the current sigaction for SIGPWR on Linux and clears SA_ONSTACK if a freshly-loaded shared library (Go cgo's runtime.setsigstack being the canonical case) added it. The repair is implemented in Rust (src/sys/lib.rs) with a C-ABI wrapper, called from both sys::dlopen (the bun:ffi path) and Process_functionDlopen in BunProcess.cpp (the process.dlopen path). A glibc-gated regression test reproduces the hang shape via bun:ffi + WASM without needing Go in CI.
Security risks
No injection, auth, or data-exposure surface. The change does mutate process-wide signal disposition for SIGPWR, which is the signal JSC uses to suspend threads for GC and code-cache flushes. The mutation is narrowly scoped (only clears one flag bit, only when it's set, only on Linux) and preserves the existing handler/mask/restorer, so the attack surface is minimal. The hardcoded SIGPWR = 30 is correct for x86_64/aarch64 (Bun's supported Linux targets).
Level of scrutiny
This warrants maintainer review rather than auto-approval. Signal handling that interacts with JSC's stop-the-world GC mechanism is exactly the kind of code where a subtle mistake produces hard-to-reproduce hangs or crashes, and the fix is explicitly framed as a stopgap pending a proper WTF-side fix. The regression test also embeds architecture-specific assumptions about glibc's userspace struct sigaction layout (sa_flags at offset 136) that hold on x86_64/aarch64 glibc but are worth a human glance.
Other factors
The author (robobun) explicitly surfaced an open decision in the thread: PR #29844 takes a broader snapshot-and-restore approach across all signals and would subsume this fix. A maintainer should decide whether to land this targeted patch now or wait for/prefer #29844. All CodeRabbit nitpicks have been addressed or reasonably declined, and no bugs were found by the bug-hunting system. CI build #56544 is in progress.
Point WEBKIT_VERSION at the preview autobuild of oven-sh/WebKit#235, which teaches signalHandlerSuspendResume to read the interrupted thread's SP from the ucontext instead of the handler's own currentStackPointer(). That's the root-cause fix — regardless of what any dlopen'd library (Go cgo, Mono, JNI, Rust cdylib, …) does to SA_ONSTACK, the stack-range check always sees the interrupted stack rather than wherever the handler landed. Delete the per-dlopen sigaction scan on the Bun side. With the WTF change in place it's pure overhead, and the reporter rightly pointed out that chaining workarounds for workarounds is the wrong direction. Regression test (test/regression/issue/31158.test.ts) is unchanged and now exercises the upstream fix directly. Swap the preview-pr-235 tag for the merged autobuild hash once oven-sh/WebKit#235 lands.
…58-sigpwr-sa-onstack # Conflicts: # scripts/build/deps/webkit.ts
…58-sigpwr-sa-onstack # Conflicts: # scripts/build/deps/webkit.ts
|
Ready for maintainer review (rebased on current The fix is in WebKit: oven-sh/WebKit#235 makes JSC's This PR is Bun-side only:
WebKit#235 addresses all review feedback (the To land this: merge oven-sh/WebKit#235, then pin |
…58-sigpwr-sa-onstack # Conflicts: # scripts/build/deps/webkit.ts
…58-sigpwr-sa-onstack # Conflicts: # scripts/build/deps/webkit.ts
There was a problem hiding this comment.
No new issues at this revision, but the actual fix lives in oven-sh/WebKit#235 and this pins WEBKIT_VERSION to a preview-tag autobuild of that unmerged PR — a maintainer should sign off on landing main against a preview WebKit build (and confirm 0eedb192's parent is 963f8758c29e so nothing on WebKit main is dropped).
Extended reasoning...
Overview
The Bun-side diff is two files: a one-line WEBKIT_VERSION change in scripts/build/deps/webkit.ts (from the 40-char SHA 963f8758c29e… to the preview tag autobuild-preview-pr-235-0eedb192) and a new glibc-only regression test at test/regression/issue/31158.test.ts. The substantive fix — reading the interrupted thread's SP from ucontext in WTF's signalHandlerSuspendResume — is entirely in oven-sh/WebKit#235, which is not part of this repo and is not yet merged. The earlier Bun-side repair_jsc_gc_signal_after_dlopen band-aid was removed during review, so this PR now ships only the version pin and the test.
Security risks
None introduced by the Bun-side diff itself. The regression test does poke sigaction(SIGPWR, …) via bun:ffi inside a spawned child, but that's contained to the test subprocess and gated to glibc Linux. The real risk surface is the WebKit binary swap: changing which prebuilt JSC/WTF is linked affects every code path in the runtime, and I cannot inspect oven-sh/WebKit#235's diff from here.
Level of scrutiny
High. Although the textual diff is tiny, a WEBKIT_VERSION bump replaces the JS engine for every build. This particular bump points at a preview autobuild of an unmerged PR rather than a merged commit, which (a) the inline comment itself flags as temporary ("Swap back to the merged hash once #235 lands"), (b) breaks documented invariants in sync-webkit-source.ts and depVersionsHeader.ts (author acknowledged, accepted as temporary), and (c) requires trusting that 0eedb192 is genuinely rebased on 963f8758c29e so no intervening WebKit-main changes are silently dropped — the same class of concern I raised on an earlier tag. That's a maintainer judgment call, not a mechanical change.
Other factors
All prior inline comments (mine and CodeRabbit's) are resolved. The bug-hunting pass found nothing new at this revision. The regression test looks correct and well-scoped. The author has explicitly marked the PR "Ready for maintainer review" / "Needs a maintainer to merge," and there is an open question of whether to land this against the preview tag now or wait for oven-sh/WebKit#235 to merge and use the real SHA. Deferring to a human for that decision.
…58-sigpwr-sa-onstack # Conflicts: # scripts/build/deps/webkit.ts
…58-sigpwr-sa-onstack # Conflicts: # scripts/build/deps/webkit.ts
…e7173c) Rebase of the signalHandlerSuspendResume ucontext-SP fix onto the current WebKit main pin 09f04cd5, plus a cross-reference comment tying interruptedStackPointer() to JSC::MachineContext::stackPointerImpl so a future arch addition updates both copies.
…58-sigpwr-sa-onstack # Conflicts: # scripts/build/deps/webkit.ts
Fixes #31158.
Fixes #29843.
Repro
Hangs after the WASM call; "alive" never prints. Linux x64 only. Reporter's strace shows a continuous
tgkill(…, SIGPWR)storm (≈8500 signals/s) while the main thread pins a core.Cause
JavaScriptCore's GC/code-cache flush suspends other JS threads by sending a per-thread signal (SIGPWR on Linux in our WebKit fork) and having the handler snapshot the interrupted context.
signalHandlerSuspendResumeinWTF/wtf/posix/ThreadingPOSIX.cppsanity-checks that the snapshot is meaningful with:Go's cgo runtime walks every signal in
runtime.initsigwhen loaded as a c-shared library and, for any signal whose handler is non-default that it doesn't want to own, force-addsSA_ONSTACKso its own threads' synchronous faults stay on a managed alt stack. That includes WTF's SIGPWR handler. Because Bun's crash handler (or ASAN's runtime in debug builds) already has an alternate signal stack installed on the main thread, the very next JSC suspend delivers SIGPWR onto the alt stack,currentStackPointer()lands outsidethread->m_stack, the sanity check fails, the handler posts withm_platformRegisters = nullptr,Thread::suspend()retries — forever. The event loop never gets another timer tick.Any runtime that follows the same recipe (Mono, some JNI layouts, Rust async runtimes compiled as cdylibs) trips the same wire; this just happens to surface most loudly with Go because cshared Go is what most people reach for when they want a native helper.
Confirmed with a plain C test:
Handler SP is in the alt stack when
SA_ONSTACKis in effect; the ucontext-saved SP always tells the truth.Fix
Root-cause fix is in WTF: read the interrupted thread's SP from the ucontext (kernel-saved register state) instead of the handler's own
currentStackPointer(). That SP is stable regardless of whether the handler itself runs on the normal or the alternate signal stack, and the existing retry-on-alt-stack semantics still work — if the thread genuinely was on an alt stack when the signal arrived (e.g. a nested handler), the ucontext SP reflects that and the check backs off as before.The WTF change ships in oven-sh/WebKit#235; this PR bumps
WEBKIT_VERSIONto that PR's preview autobuild. No per-dlopen sigaction scan on the Bun side — with the WTF fix in place it would be pure overhead, and chaining workarounds for workarounds is the wrong direction (as the reporter rightly pointed out).Swap the preview-pr-235 tag for the merged autobuild hash once oven-sh/WebKit#235 lands.
Test
test/regression/issue/31158.test.ts— setsSA_ONSTACKon SIGPWR viabun:ffi(same shape as Go'sinitsig), triggers a fresh dlopen, compiles+invokes a WASM function, thenawaits asetTimeout. Passes with the fix in ~800 ms; hangs to a 15 s test-runner timeout on system Bun. Skipped on non-glibc (musl Bun is static and doesn't have libc.so.6 to dlopen).The exact repro from the issue (with Go) also passes end-to-end against this build.