Skip to content

test --isolate: retarget NapiEnv at new global; --parallel: abort on worker panic, never retry#30216

Merged
Jarred-Sumner merged 8 commits into
mainfrom
farm/29e6f039/isolate-napi-uaf-no-retry
May 4, 2026
Merged

test --isolate: retarget NapiEnv at new global; --parallel: abort on worker panic, never retry#30216
Jarred-Sumner merged 8 commits into
mainfrom
farm/29e6f039/isolate-napi-uaf-no-retry

Conversation

@robobun

@robobun robobun commented May 4, 2026

Copy link
Copy Markdown
Collaborator

What

bun test --isolate / --parallel crashes when a test file loads a native addon whose deferred napi finalizers outlive the file. The --parallel coordinator then silently retries the file once, which masks the panic and lets the run exit 0.

Fixes #30205, #30191. Supersedes #30214 (same NapiEnv fix, but without the coordinator change, the cleanup_hooks retarget, or a test that actually reproduces on unpatched main).

Reproduction

git clone https://github.com/workglow-dev/libs && cd libs
bun i && bun run build:packages
bun test --timeout=30000 --parallel=4 packages/test/src/test/{util,task}/*.test.ts

On main (d484fd6), 3–4 workers crash per run with either

ASSERTION FAILED: isMarked(cell)
  JavaScriptCore/heap/Heap.cpp:1232 : void JSC::Heap::addToRememberedSet(const JSCell *)

or (when the slot is already being reallocated)

ASSERTION FAILED: m_cellState == CellState::DefinitelyWhite
  JavaScriptCore/JSCellInlines.h:69 : JSC::JSCell::JSCell(VM &, Structure *)

and in release builds the segfaults at 0x68 / 0xD0 reported in #30205.

Root cause

Frame-pointer walk from the assertion:

#3  Bun::NapiHandleScope::open(Zig::GlobalObject*, bool)
#4  NapiHandleScope__open
#6  napi.Finalizer.run
#7  napi.NapiFinalizerTask.runOnJSThread
#10 event_loop.tick
#11 event_loop.waitForPromise
#13 VirtualMachine.loadEntryPointForTestRunner   ← next test file

NapiEnv::m_globalObject is a raw Zig::GlobalObject*. For non-experimental addons (nm_version != NAPI_VERSION_EXPERIMENTAL, which is ~every real-world addon — sharp, better-sqlite3, etc.), napi_wrap/napi_create_external finalizers are deferred to the event loop as NapiFinalizerTask rather than run inside GC sweep.

Objects rooted on the old global (module graph, globalThis.*) only become collectable when Zig__GlobalObject__createForTestIsolation runs gcUnprotect(oldGlobal). The DeferGC from #29573 ends at that function's }, so the next GC runs there, collects those objects, and enqueues their finalizers. Those tasks then run on the very next eventLoop().tick() — inside loadEntryPointForTestRunner's waitForPromise for file N+1. Finalizer.run opens a NapiHandleScope via env->globalObject(), which reads NapiHandleScopeImplStructure() off the dead cell and writes m_currentNapiHandleScopeImpl on it → write barrier on an unmarked cell.

The --parallel coordinator's reapWorker then re-queued the file once (retries[idx] < 1) into a fresh worker with no stale NapiEnv, which passed — so the run reported 0 fail despite multiple Bun panics in the log.

Fix

NapiEnv retarget (ZigGlobalObject.cpp, napi.h): Zig__GlobalObject__createForTestIsolation now calls newGlobal->adoptNapiEnvsForTestIsolation(oldGlobal) before gcUnprotect. Each NapiEnv::m_globalObject is repointed at the new global and the Ref<NapiEnv>s are moved over, so late finalizers open handle scopes on a live global and the envs stay owned after the old global is swept. VirtualMachine.swapGlobalForTestIsolation also repoints rare_data.cleanup_hooks[*].globalThis so CleanupHook.eql() stays accurate.

No retry, abort on panic (Coordinator.zig): removed the per-file retry. A worker that dies mid-file is counted as one failure. If it died by a fatal signal (SIGILL/SIGTRAP/SIGABRT/SIGBUS/SIGFPE/SIGSEGV/SIGSYS — Bun's own @trap(), a JSC/WTF assertion, or native-addon crash), the whole run aborts with error: a test worker process crashed with <SIG> while running <file>. process.exit() / SIGKILL are still just a per-file failure and the run continues.

Verification

  • test/regression/issue/30205.test.ts — 4 tests. Adds a tiny non-experimental addon (isolate_finalizer_addon.c) and a fixture pattern (Bun.gc(true) + module-scope await 0 + objects rooted on globalThis) that crashes 8/8 on unpatched main and passes 8/8 with this change.
  • workglow-dev/libs full 201-file unit suite: 3× clean --parallel=4 runs (was 3–4 crashes/run).
  • Gate: git stash -- src/ && bun bd test test/regression/issue/30205.test.ts → 3/4 fail; with fix → 4/4 pass.
  • test/cli/test/isolation.test.ts, test/regression/issue/29519.test.ts → pass (one pre-existing unrelated timeout in isolation.test.ts, same as isolate: defer GC across Zig::GlobalObject swap; guard unique_ptr visit #29573).
  • test/cli/test/parallel.test.ts → all tests I touched pass; the 3 timing-sensitive scale-up/work-steal tests that fail in this container fail identically on unmodified main.

…panic, never retry

Two intertwined problems surfaced by running a suite that loads native
addons under --parallel/--isolate (issue #30205, workglow-dev/libs):

1. NapiEnv holds a raw Zig::GlobalObject*. Under --isolate the old global
   is gcUnprotect()'d while NapiFinalizerTasks for its objects are still
   queued on the event loop (non-experimental addons defer finalizers).
   When those finalizers run during the next file's
   loadEntryPointForTestRunner → waitForPromise → tick(),
   Finalizer.run → NapiHandleScope::open(env->globalObject()) reads
   NapiHandleScopeImplStructure() off the dead cell and writes
   m_currentNapiHandleScopeImpl on it. Debug:
     ASSERTION FAILED: isMarked(cell)        (Heap::addToRememberedSet)
     ASSERTION FAILED: m_cellState == DefinitelyWhite (JSCell ctor)
   Release: segfault at 0x68/0xD0 when the concurrent marker later
   visits the half-dead global.

   Fix: during the swap, retarget every NapiEnv's m_globalObject to the
   new global and move the Ref<NapiEnv>s over so they stay owned after
   the old global is swept. Also repoint rare_data.cleanup_hooks'
   captured globalThis so eql() stays accurate.

2. The --parallel coordinator silently retried any file whose worker
   crashed, once. A real Bun panic in a worker was therefore invisible —
   the retry ran in a fresh process with no stale NapiEnv and passed.
   Removed the retry entirely. A worker killed by a fatal signal
   (SIGILL/SIGTRAP/SIGABRT/SIGBUS/SIGFPE/SIGSEGV/SIGSYS — i.e. Bun's
   crash handler or a JSC assertion) now aborts the whole run with a
   clear error; a plain process.exit() marks the file failed and
   continues.

The regression test (test/regression/issue/30205.test.ts) adds a tiny
non-experimental addon (isolate_finalizer_addon.c) that napi_wrap()s
objects the test roots on globalThis; Bun.gc(true) + top-level await at
the start of each isolated file forces the previous global's wrapped
objects to be collected *before* the event loop has drained their
deferred finalizers, making the crash reproduce 8/8 on Linux x64 ASAN.

Fixes #30205
Fixes #30191
@coderabbitai

coderabbitai Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Retargets pending N-API environments and cleanup hooks when swapping a GlobalObject for test isolation, transferring their ownership to the new global. Separately, removes per-file retry logic from the parallel test coordinator, introduces panic classification/abort-on-panic, and updates tests and an N-API test addon.

Changes

N-API Global Object Isolation

Layer / File(s) Summary
Method Declarations
src/bun.js/bindings/napi.h, src/bun.js/bindings/ZigGlobalObject.h
Adds NapiEnv::retargetGlobalObject(Zig::GlobalObject* newGlobal) and GlobalObject::adoptNapiEnvsForTestIsolation(GlobalObject* oldGlobal).
Core Implementation
src/bun.js/bindings/ZigGlobalObject.cpp, src/bun.js/VirtualMachine.zig
swapGlobalForTestIsolation saves old_global, creates new_global, updates VM/event-loop/global references, and rewrites rare_data.cleanup_hooks.items[].globalThis from old_global to new_global. Zig__GlobalObject__createForTestIsolation calls adoptNapiEnvsForTestIsolation(oldGlobal). adoptNapiEnvsForTestIsolation retargets each NapiEnv to the new global and transfers m_napiEnvs ownership; NapiEnv::retargetGlobalObject asserts VM match and updates its global pointer.
Test Module / Build
test/napi/napi-app/binding.gyp, test/napi/napi-app/isolate_finalizer_addon.c
Adds isolate_finalizer_addon build target and C source exporting wrap(obj) which allocates per-object data and attaches a deferred N-API finalizer that frees it.
Regression Tests / Validation
test/regression/issue/30205.test.ts, test/no-validate-exceptions.txt
Adds tests that build the addon, create fixtures forcing GC/finalizer runs, and assert correct behavior under --isolate and --parallel; records the test in the ASAN no-validate list.

Test Coordinator Crash Handling

Layer / File(s) Summary
Data / Struct Layout
src/cli/test/parallel/Coordinator.zig, src/cli/test/parallel/runner.zig
Removes Coordinator per-file retry state fields retries and pending_retry and drops their allocation/wiring in runner initialization.
Coordinator Core Logic
src/cli/test/parallel/Coordinator.zig
On worker exit with an inflight file, reapWorker flushes output, clears w.inflight, immediately accounts the file as crashed via accountCrash(status), and records crash metadata with describeStatus. Removes retry/orphan logic, adds isPanicStatus and abortOnWorkerPanic (prints panic banner, kills other workers, sets bailed, aborts queued files), and simplifies .ready handling to always call assignWork.
Tests / Expectations
test/cli/test/parallel.test.ts, test/regression/issue/30205.test.ts
Updates parallel-crash tests to assert no retrying and expect worker-crash exit-code/status messaging; tightens hostile-fd test to require deterministic non-zero exit. Adds regression tests verifying abort-on-fatal-signal behavior and non-panic file-failure behavior (process.exit(7)) under --parallel.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main changes: NapiEnv retargeting for test isolation and abort-on-panic behavior removal of retries in parallel mode.
Description check ✅ Passed The description comprehensively covers what the PR does, the root cause, the fix, and verification steps, fulfilling both required template sections.
Linked Issues check ✅ Passed The PR fulfills all coding requirements from #30205: implements NapiEnv retargeting fix, removes retries, aborts on panic, adds regression test, and verifies against the workglow-dev/libs suite.
Out of Scope Changes check ✅ Passed All changes are directly in-scope: NapiEnv/GlobalObject retargeting, Coordinator crash handling, test fixture and addon, and test list update for exception validation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@robobun

robobun commented May 4, 2026

Copy link
Copy Markdown
Collaborator Author
Updated 10:35 PM PT - May 3rd, 2026

@robobun, your commit c62ff55 has 1 failures in Build #50783 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 30216

That installs a local version of the PR into your bun-30216 executable, so you can run:

bun-30216 --bun

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/cli/test/parallel.test.ts (1)

101-127: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Make the crashing file sort first so this covers the mid-run path deterministically.

As written, boom.test.js sorts after a.test.js and b.test.js, so the worker can crash on its last assigned file and this still passes without exercising the "remaining work continues with no retry" branch. Renaming the crashing file to sort first would make the regression actually hit the code path this PR changed.

One simple way to make it deterministic
 test("--parallel marks a file whose worker exits mid-run as failed (no retry)", async () => {
   using dir = tempDir("parallel-crash", {
+    "0-boom.test.js": `import {test} from "bun:test"; test("boom",()=>process.exit(7));`,
     "a.test.js": `import {test,expect} from "bun:test"; test("a",()=>expect(1).toBe(1));`,
     "b.test.js": `import {test,expect} from "bun:test"; test("b",()=>expect(1).toBe(1));`,
-    "boom.test.js": `import {test} from "bun:test"; test("boom",()=>process.exit(7));`,
   });
@@
-  expect(stderr).toContain("boom.test.js");
+  expect(stderr).toContain("0-boom.test.js");
   expect(stderr).toContain("(worker crashed: exit code 7)");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/cli/test/parallel.test.ts` around lines 101 - 127, The test is
nondeterministic because "boom.test.js" can be scheduled last; rename the
crashing file so it sorts first (e.g., prefix it like "0_boom.test.js" or
"000_boom.test.js") in the tempDir fixture in the test body and update any
assertions that reference the filename (expect(stderr).toContain(...)) to the
new filename; leave the rest of the test logic (spawn, flags, and expectations
about crash message, no "retrying", summary and exitCode) unchanged so the
mid-run crash path is exercised deterministically.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/bun.js/bindings/napi.h`:
- Around line 383-393: The helper retargetGlobalObject(Zig::GlobalObject*
newGlobal) can only be used when newGlobal belongs to the same VM as the NapiEnv
(which stores m_vm at construction); add an assertion at the start of
retargetGlobalObject that verifies newGlobal's VM equals m_vm (e.g.,
ASSERT(newGlobal->vm() == m_vm)) before assigning m_globalObject = newGlobal so
misuse is caught early; reference NapiEnv, m_vm, retargetGlobalObject,
m_globalObject and Zig::GlobalObject to locate and update the function.

In `@src/bun.js/VirtualMachine.zig`:
- Around line 2572-2574: The isolate-swap code updates this.global,
VMHolder.cached_global_object, and this.regular_event_loop.global but neglects
the macro event loop; ensure you also retarget the macro event loop by assigning
the new global to its global field (e.g., set this.macro_event_loop.global =
new_global or VMHolder.macro_event_loop.global = new_global) in the same swap
code path so macro_event_loop doesn’t remain pinned to the old global after
enableMacroMode()/isolate swaps.

In `@src/cli/test/parallel/Coordinator.zig`:
- Around line 383-396: In abortOnWorkerPanic, remove or bypass the early guard
that returns when this.bailed is already true and ensure you force-stop every
alive worker (not only those with inflight == null): in the loop over
this.workers[0..this.spawned_count] call other.shutdown() for any other.alive
regardless of other.inflight, so inflight workers are stopped immediately; still
set this.bailed = true, call breakDots(), Output.prettyError/flush, and then
call this.abortQueuedFiles("aborted: worker panicked") so the panic path always
force-stops workers and lets reapWorker() account for their inflight files.

---

Outside diff comments:
In `@test/cli/test/parallel.test.ts`:
- Around line 101-127: The test is nondeterministic because "boom.test.js" can
be scheduled last; rename the crashing file so it sorts first (e.g., prefix it
like "0_boom.test.js" or "000_boom.test.js") in the tempDir fixture in the test
body and update any assertions that reference the filename
(expect(stderr).toContain(...)) to the new filename; leave the rest of the test
logic (spawn, flags, and expectations about crash message, no "retrying",
summary and exitCode) unchanged so the mid-run crash path is exercised
deterministically.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a0d8c28c-4690-4c20-9373-44eb172b8db8

📥 Commits

Reviewing files that changed from the base of the PR and between 31c4946 and 8ffa838.

📒 Files selected for processing (10)
  • src/bun.js/VirtualMachine.zig
  • src/bun.js/bindings/ZigGlobalObject.cpp
  • src/bun.js/bindings/ZigGlobalObject.h
  • src/bun.js/bindings/napi.h
  • src/cli/test/parallel/Coordinator.zig
  • src/cli/test/parallel/runner.zig
  • test/cli/test/parallel.test.ts
  • test/napi/napi-app/binding.gyp
  • test/napi/napi-app/isolate_finalizer_addon.c
  • test/regression/issue/30205.test.ts
💤 Files with no reviewable changes (1)
  • src/cli/test/parallel/runner.zig

Comment thread src/bun.js/bindings/napi.h Outdated
Comment thread src/bun.js/VirtualMachine.zig
Comment thread src/cli/test/parallel/Coordinator.zig
Comment thread src/cli/test/parallel/Coordinator.zig

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/cli/test/parallel/Coordinator.zig (1)

394-411: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Panic hard-abort is still skipped when bailed was already set.

At Line 394, if (this.bailed) return; bypasses worker termination and queued-file abort on a fatal-signal crash if --bail tripped earlier. That can still allow additional inflight output after the panic banner instead of an immediate crash abort path.

Suggested fix
 fn abortOnWorkerPanic(this: *Coordinator, file_idx: u32, status: bun.spawn.Status) void {
     this.breakDots();
     var buf: [32]u8 = undefined;
@@
     );
     Output.flush();
-    if (this.bailed) return;
-    this.bailed = true;
+    const already_bailed = this.bailed;
+    this.bailed = true;
@@
     for (this.workers[0..this.spawned_count]) |*other| {
         if (!other.alive) continue;
         if (other.process) |p| {
             if (Environment.isPosix) {
                 _ = std.c.kill(-p.pid, std.posix.SIG.TERM);
             } else {
                 _ = p.kill(1);
             }
         }
     }
-    this.abortQueuedFiles("aborted: worker panicked");
+    if (!already_bailed) {
+        this.abortQueuedFiles("aborted: worker panicked");
+    }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/test/parallel/Coordinator.zig` around lines 394 - 411, The early
return "if (this.bailed) return;" prevents the hard-abort path from running when
bailed was already true; change this to not return so the termination loop and
this.abortQueuedFiles("aborted: worker panicked") still run. Concretely, remove
the return and make setting this.bailed idempotent (e.g., only set this.bailed =
true if it was false) so the for-loop over this.workers[0..this.spawned_count],
the Environment.isPosix / p.kill path and the call to this.abortQueuedFiles
still execute even if this.bailed was previously set; this preserves idempotency
while ensuring the hard-abort logic always runs on panic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/cli/test/parallel/Coordinator.zig`:
- Around line 394-411: The early return "if (this.bailed) return;" prevents the
hard-abort path from running when bailed was already true; change this to not
return so the termination loop and this.abortQueuedFiles("aborted: worker
panicked") still run. Concretely, remove the return and make setting this.bailed
idempotent (e.g., only set this.bailed = true if it was false) so the for-loop
over this.workers[0..this.spawned_count], the Environment.isPosix / p.kill path
and the call to this.abortQueuedFiles still execute even if this.bailed was
previously set; this preserves idempotency while ensuring the hard-abort logic
always runs on panic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: cec60892-0dc8-42d8-8223-773ba5fd71c8

📥 Commits

Reviewing files that changed from the base of the PR and between 19503d4 and 1160318.

📒 Files selected for processing (1)
  • src/cli/test/parallel/Coordinator.zig

Comment thread src/cli/test/parallel/Coordinator.zig
Comment thread src/cli/test/parallel/Coordinator.zig
Comment thread src/bun.js/VirtualMachine.zig
Comment thread src/cli/test/parallel/Coordinator.zig
robobun added 2 commits May 4, 2026 02:32
…ocesses

CI's ASAN lane sets BUN_JSC_validateExceptionChecks=1, which bunEnv passes
through to the fixture subprocesses. Under collectContinuously the
simulated-throw counter lands on the addon's NAPI_MODULE_INIT path
(napi_create_function → napi_set_named_property) and hits a pre-existing
unchecked ThrowScope in napi.cpp — the same reason every other napi test
is listed in test/no-validate-exceptions.txt. This test is about GC
behaviour across the isolate swap, not exception-scope validation, so
strip the validator from the child env and add the test to the skip list.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/regression/issue/30205.test.ts`:
- Around line 148-162: Replace the fragmented assertions that destructure proc
results and assert stderr and exitCode separately by collecting stdout, stderr,
exitCode, and signalCode into a single result object and asserting with
toMatchObject(...) so failures show all fields at once; specifically, in the
test that creates proc via Bun.spawn({ cmd: [bunExe(), "test", "--parallel=2",
"."], env: { ...bunEnv, BUN_TEST_PARALLEL_SCALE_MS: "0" }, cwd: String(dir),
stdout: "pipe", stderr: "pipe" }) (the "worker crashed: SIGABRT" case) build an
object { stdout, stderr, exitCode, signalCode } and replace the multiple
expect(...) checks with one expect(result).toMatchObject({ stderr:
expect.stringContaining("worker crashed: SIGABRT"), stderr:
expect.stringMatching(/a test worker process crashed with SIGABRT while running
.*boom\.test\.js/), stderr: expect.not.stringContaining("retrying"), stderr:
expect.stringContaining("Aborting"), exitCode: expect.not.toBe(0) }) (adjust
matchers to equivalent toMatchObject-friendly forms), and apply the same
consolidation to the other crash-path test (the process.exit test around the
176–190 region).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 87c68cc9-6d1c-440d-8391-b7c56575e548

📥 Commits

Reviewing files that changed from the base of the PR and between 5a6dcdf and a8d09c8.

📒 Files selected for processing (2)
  • test/no-validate-exceptions.txt
  • test/regression/issue/30205.test.ts

Comment thread test/regression/issue/30205.test.ts
Comment thread test/regression/issue/30205.test.ts Outdated
Comment thread src/cli/test/parallel/Coordinator.zig
…T core on musl; guard @TagName on non-exhaustive SignalCode

- Windows: the --isolate/--parallel napi-UAF fixtures time out under
  BUN_JSC_collectContinuously (same as 29519.test.ts). Skip on Windows;
  the swap/finalizer path is platform-agnostic and covered by POSIX.
- Alpine/musl: the panic-abort test's worker intentionally SIGABRTs
  itself, which writes a core file that CI's coredump-upload collector
  flags as a failure even though all 4 assertions pass. Wrap the
  coordinator in sh -c 'ulimit -c 0 && exec ...' so RLIMIT_CORE=0 is
  inherited by the worker; same reasoning as the setrlimit in
  BunProcess.cpp's execve-abort path.
- Coordinator.describeStatus: SignalCode is a non-exhaustive enum;
  @TagName on an unnamed value (RT signals 32-64) is safety-checked
  illegal behavior. Use sig.name() orelse bufPrint('signal N').

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No further issues from my side — all earlier inline comments were addressed. Deferring to a human reviewer since this touches GC-lifetime / NapiEnv ownership across the isolate global swap and changes --parallel crash semantics (retry removed, panic-abort added); also worth waiting on CI for c62ff55 since the last reported build (0e0573f) had new failures that this commit targets.

Extended reasoning...

Overview

This PR fixes a use-after-free in bun test --isolate / --parallel where deferred N-API finalizers run against a dead Zig::GlobalObject after the per-file global swap, and changes the --parallel coordinator to stop silently retrying crashed files and instead abort the whole run on a fatal-signal worker death. It spans:

  • ZigGlobalObject.{cpp,h} / napi.h: new adoptNapiEnvsForTestIsolation + NapiEnv::retargetGlobalObject to move Ref<NapiEnv> ownership and repoint m_globalObject before gcUnprotect.
  • VirtualMachine.zig: repoints macro_event_loop.global and rare_data.cleanup_hooks[*].globalThis during the swap.
  • Coordinator.zig / runner.zig: removes retries[]/pending_retry[], adds isPanicStatus, describeStatus, abortOnWorkerPanic (process-group SIGTERM of all workers), and reworks reapWorker.
  • New native test addon (isolate_finalizer_addon.c), regression test, and parallel.test.ts updates.

Security risks

None identified. No auth/crypto/permissions surface. The new C addon is a minimal napi_wrap + free finalizer used only in tests. std.c.kill(-pid, SIGTERM) targets only PIDs the coordinator itself spawned. No untrusted-input parsing was added.

Level of scrutiny

High. The core fix manipulates JSC GC lifetime: it transfers Ref<NapiEnv> ownership between GlobalObject instances and repoints a raw Zig::GlobalObject* so that a write barrier on a dead cell doesn't fire. The reasoning in the PR description is detailed and plausible, and the regression test reportedly reproduces 8/8 on unpatched main, but this is exactly the class of change (GC/UAF, Ref ownership move, ordering relative to gcUnprotect/DeferGC) where a subtle mistake causes rare production crashes in every native-addon user. The --parallel behavioral change (no retry → abort on fatal signal) is a deliberate UX/policy decision that a maintainer should sign off on.

Other factors

  • All seven of my prior inline comments and the CodeRabbit findings were addressed in follow-up commits (19503d4, 1160318, 5a6dcdf, a8d09c8, 0e0573f, c62ff55) and the threads are resolved.
  • The most recent robobun CI report is for 0e0573f (two commits behind HEAD) and shows new failures of 30205.test.ts on Windows and musl; HEAD c62ff55's commit message says it fixes exactly those (skip on Windows, suppress intentional SIGABRT core on musl, guard @tagName), but there's no green build posted yet.
  • Test coverage for the change is good (4-test regression file + updated parallel.test.ts), and the PR description includes a fail-on-unpatched / pass-on-patched gate.

Given the GC-sensitive core change, the user-facing behavioral change to --parallel, and the not-yet-confirmed CI on HEAD, this should go to a human reviewer rather than be auto-approved.

@Jarred-Sumner Jarred-Sumner merged commit f8fee8d into main May 4, 2026
75 of 77 checks passed
@Jarred-Sumner Jarred-Sumner deleted the farm/29e6f039/isolate-napi-uaf-no-retry branch May 4, 2026 08:02
robobun added a commit that referenced this pull request May 4, 2026
reapWorker's mid-file handling is keyed on inflight, which is only
set after .ready -> dispatch. A worker that spawns OK but dies
during init (before the IPC handshake) has inflight == null, skips
that block, and is respawned unconditionally at
hasUndispatchedFiles() — the coordinator spins forever.

Track reached_ready + startup_failures per slot. A reap with
inflight == null and !reached_ready increments startup_failures;
can_respawn requires startup_failures < max_startup_failures (2),
so after two consecutive pre-ready exits the slot stops respawning
and falls through to the existing !respawned -> abortQueuedFiles
path. startup_failures resets on .ready so only consecutive init
failures count.

A pre-ready fatal signal (SIGSEGV/SIGABRT/etc.) additionally aborts
the whole run via abortOnWorkerStartupPanic, mirroring #30216's
mid-file abortOnWorkerPanic.

Add BUN_TEST_WORKER_EXIT_BEFORE_READY so
parallel-startup-failure.test.ts can deterministically exercise the
path (real triggers are init segfaults / failed fd-3 adopt, not
reproducible from outside the worker).
Jarred-Sumner pushed a commit that referenced this pull request May 4, 2026
reapWorker's mid-file handling is keyed on inflight, which is only
set after .ready -> dispatch. A worker that spawns OK but dies
during init (before the IPC handshake) has inflight == null, skips
that block, and is respawned unconditionally at
hasUndispatchedFiles() — the coordinator spins forever.

Track reached_ready + startup_failures per slot. A reap with
inflight == null and !reached_ready increments startup_failures;
can_respawn requires startup_failures < max_startup_failures (2),
so after two consecutive pre-ready exits the slot stops respawning
and falls through to the existing !respawned -> abortQueuedFiles
path. startup_failures resets on .ready so only consecutive init
failures count.

A pre-ready fatal signal (SIGSEGV/SIGABRT/etc.) additionally aborts
the whole run via abortOnWorkerStartupPanic, mirroring #30216's
mid-file abortOnWorkerPanic.

Add BUN_TEST_WORKER_EXIT_BEFORE_READY so
parallel-startup-failure.test.ts can deterministically exercise the
path (real triggers are init segfaults / failed fd-3 adopt, not
reproducible from outside the worker).
xhjkl pushed a commit to xhjkl/bun that referenced this pull request May 14, 2026
…worker panic, never retry (oven-sh#30216)

## What

`bun test --isolate` / `--parallel` crashes when a test file loads a
native addon whose deferred napi finalizers outlive the file. The
`--parallel` coordinator then silently retries the file once, which
masks the panic and lets the run exit 0.

Fixes oven-sh#30205, oven-sh#30191. Supersedes oven-sh#30214 (same NapiEnv fix, but without
the coordinator change, the `cleanup_hooks` retarget, or a test that
actually reproduces on unpatched `main`).

## Reproduction

```sh
git clone https://github.com/workglow-dev/libs && cd libs
bun i && bun run build:packages
bun test --timeout=30000 --parallel=4 packages/test/src/test/{util,task}/*.test.ts
```

On `main` (d484fd6), 3–4 workers crash per run with either
```
ASSERTION FAILED: isMarked(cell)
  JavaScriptCore/heap/Heap.cpp:1232 : void JSC::Heap::addToRememberedSet(const JSCell *)
```
or (when the slot is already being reallocated)
```
ASSERTION FAILED: m_cellState == CellState::DefinitelyWhite
  JavaScriptCore/JSCellInlines.h:69 : JSC::JSCell::JSCell(VM &, Structure *)
```
and in release builds the segfaults at `0x68` / `0xD0` reported in
oven-sh#30205.

## Root cause

Frame-pointer walk from the assertion:

```
oven-sh#3  Bun::NapiHandleScope::open(Zig::GlobalObject*, bool)
oven-sh#4  NapiHandleScope__open
oven-sh#6  napi.Finalizer.run
oven-sh#7  napi.NapiFinalizerTask.runOnJSThread
oven-sh#10 event_loop.tick
oven-sh#11 event_loop.waitForPromise
oven-sh#13 VirtualMachine.loadEntryPointForTestRunner   ← next test file
```

`NapiEnv::m_globalObject` is a raw `Zig::GlobalObject*`. For
non-experimental addons (`nm_version != NAPI_VERSION_EXPERIMENTAL`,
which is ~every real-world addon — sharp, better-sqlite3, etc.),
`napi_wrap`/`napi_create_external` finalizers are **deferred** to the
event loop as `NapiFinalizerTask` rather than run inside GC sweep.

Objects rooted on the old global (module graph, `globalThis.*`) only
become collectable when `Zig__GlobalObject__createForTestIsolation` runs
`gcUnprotect(oldGlobal)`. The `DeferGC` from oven-sh#29573 ends at that
function's `}`, so the next GC runs there, collects those objects, and
enqueues their finalizers. Those tasks then run on the very next
`eventLoop().tick()` — inside `loadEntryPointForTestRunner`'s
`waitForPromise` for file N+1. `Finalizer.run` opens a `NapiHandleScope`
via `env->globalObject()`, which reads `NapiHandleScopeImplStructure()`
off the dead cell and writes `m_currentNapiHandleScopeImpl` on it →
write barrier on an unmarked cell.

The `--parallel` coordinator's `reapWorker` then re-queued the file once
(`retries[idx] < 1`) into a fresh worker with no stale `NapiEnv`, which
passed — so the run reported 0 fail despite multiple Bun panics in the
log.

## Fix

**NapiEnv retarget** (`ZigGlobalObject.cpp`, `napi.h`):
`Zig__GlobalObject__createForTestIsolation` now calls
`newGlobal->adoptNapiEnvsForTestIsolation(oldGlobal)` before
`gcUnprotect`. Each `NapiEnv::m_globalObject` is repointed at the new
global and the `Ref<NapiEnv>`s are moved over, so late finalizers open
handle scopes on a live global and the envs stay owned after the old
global is swept. `VirtualMachine.swapGlobalForTestIsolation` also
repoints `rare_data.cleanup_hooks[*].globalThis` so `CleanupHook.eql()`
stays accurate.

**No retry, abort on panic** (`Coordinator.zig`): removed the per-file
retry. A worker that dies mid-file is counted as one failure. If it died
by a fatal signal (SIGILL/SIGTRAP/SIGABRT/SIGBUS/SIGFPE/SIGSEGV/SIGSYS —
Bun's own `@trap()`, a JSC/WTF assertion, or native-addon crash), the
whole run aborts with `error: a test worker process crashed with <SIG>
while running <file>`. `process.exit()` / SIGKILL are still just a
per-file failure and the run continues.

## Verification

- `test/regression/issue/30205.test.ts` — 4 tests. Adds a tiny
non-experimental addon (`isolate_finalizer_addon.c`) and a fixture
pattern (`Bun.gc(true)` + module-scope `await 0` + objects rooted on
`globalThis`) that crashes **8/8** on unpatched `main` and passes 8/8
with this change.
- `workglow-dev/libs` full 201-file unit suite: 3× clean `--parallel=4`
runs (was 3–4 crashes/run).
- Gate: `git stash -- src/ && bun bd test
test/regression/issue/30205.test.ts` → 3/4 fail; with fix → 4/4 pass.
- `test/cli/test/isolation.test.ts`,
`test/regression/issue/29519.test.ts` → pass (one pre-existing unrelated
timeout in isolation.test.ts, same as oven-sh#29573).
- `test/cli/test/parallel.test.ts` → all tests I touched pass; the 3
timing-sensitive scale-up/work-steal tests that fail in this container
fail identically on unmodified `main`.

---------

Co-authored-by: robobun <robobun@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
robobun added a commit that referenced this pull request May 16, 2026
reap_worker's mid-file handling is keyed on inflight, which is only
set after .ready -> dispatch. A worker that spawns OK but dies
during init (before the IPC handshake) has inflight == None, skips
that block, and is respawned unconditionally at
has_undispatched_files() — the coordinator spins forever.

Track reached_ready + startup_failures per slot. A reap with
inflight == None and !reached_ready increments startup_failures;
can_respawn requires startup_failures < MAX_STARTUP_FAILURES (2),
so after two consecutive pre-ready exits the slot stops respawning
and falls through to the existing !respawned -> abort_queued_files
path. startup_failures resets on .ready so only consecutive init
failures count.

A pre-ready fatal signal (SIGSEGV/SIGABRT/etc.) additionally aborts
the whole run via abort_on_worker_startup_panic, mirroring #30216's
mid-file abort_on_worker_panic.

Add BUN_TEST_WORKER_EXIT_BEFORE_READY so
parallel-startup-failure.test.ts can deterministically exercise the
path (real triggers are init segfaults / failed fd-3 adopt, not
reproducible from outside the worker).

Applied to both the live .rs sources under src/runtime/cli/test/
parallel/ and the co-located .zig reference files so the two stay
in sync.
robobun added a commit that referenced this pull request May 19, 2026
reap_worker's mid-file handling is keyed on inflight, which is only
set after .ready -> dispatch. A worker that spawns OK but dies
during init (before the IPC handshake) has inflight == None, skips
that block, and is respawned unconditionally at
has_undispatched_files() — the coordinator spins forever.

Track reached_ready + startup_failures per slot. A reap with
inflight == None and !reached_ready increments startup_failures;
can_respawn requires startup_failures < MAX_STARTUP_FAILURES (2),
so after two consecutive pre-ready exits the slot stops respawning
and falls through to the existing !respawned -> abort_queued_files
path. startup_failures resets on .ready so only consecutive init
failures count.

A pre-ready fatal signal (SIGSEGV/SIGABRT/etc.) additionally aborts
the whole run via abort_on_worker_startup_panic, mirroring #30216's
mid-file abort_on_worker_panic.

Add BUN_TEST_WORKER_EXIT_BEFORE_READY so
parallel-startup-failure.test.ts can deterministically exercise the
path (real triggers are init segfaults / failed fd-3 adopt, not
reproducible from outside the worker).

Applied to both the live .rs sources under src/runtime/cli/test/
parallel/ and the co-located .zig reference files so the two stay
in sync.
HaleTom added a commit to HaleTom/bun that referenced this pull request May 23, 2026
Remove napi-related artifacts from this branch that came from other PRs
(oven-sh#29785, oven-sh#30047, oven-sh#30216). The napi files now match main exactly, making
this PR clean and self-contained with only the nested-resolutions feature.

Restored files:
- src/jsc/bindings/napi.cpp
- test/napi/napi-app/standalone_tests.cpp
- test/napi/napi.test.ts
- test/napi/napi-app/binding.gyp

Removed files (not on main):
- test/napi/napi-app/isolate_finalizer_addon.c
- test/napi/napi-app/reentrant_register_addon.cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

panic during tests with parallel

2 participants