test --isolate: retarget NapiEnv at new global; --parallel: abort on worker panic, never retry#30216
Conversation
…panic, never retry Two intertwined problems surfaced by running a suite that loads native addons under --parallel/--isolate (issue #30205, workglow-dev/libs): 1. NapiEnv holds a raw Zig::GlobalObject*. Under --isolate the old global is gcUnprotect()'d while NapiFinalizerTasks for its objects are still queued on the event loop (non-experimental addons defer finalizers). When those finalizers run during the next file's loadEntryPointForTestRunner → waitForPromise → tick(), Finalizer.run → NapiHandleScope::open(env->globalObject()) reads NapiHandleScopeImplStructure() off the dead cell and writes m_currentNapiHandleScopeImpl on it. Debug: ASSERTION FAILED: isMarked(cell) (Heap::addToRememberedSet) ASSERTION FAILED: m_cellState == DefinitelyWhite (JSCell ctor) Release: segfault at 0x68/0xD0 when the concurrent marker later visits the half-dead global. Fix: during the swap, retarget every NapiEnv's m_globalObject to the new global and move the Ref<NapiEnv>s over so they stay owned after the old global is swept. Also repoint rare_data.cleanup_hooks' captured globalThis so eql() stays accurate. 2. The --parallel coordinator silently retried any file whose worker crashed, once. A real Bun panic in a worker was therefore invisible — the retry ran in a fresh process with no stale NapiEnv and passed. Removed the retry entirely. A worker killed by a fatal signal (SIGILL/SIGTRAP/SIGABRT/SIGBUS/SIGFPE/SIGSEGV/SIGSYS — i.e. Bun's crash handler or a JSC assertion) now aborts the whole run with a clear error; a plain process.exit() marks the file failed and continues. The regression test (test/regression/issue/30205.test.ts) adds a tiny non-experimental addon (isolate_finalizer_addon.c) that napi_wrap()s objects the test roots on globalThis; Bun.gc(true) + top-level await at the start of each isolated file forces the previous global's wrapped objects to be collected *before* the event loop has drained their deferred finalizers, making the crash reproduce 8/8 on Linux x64 ASAN. Fixes #30205 Fixes #30191
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughRetargets pending N-API environments and cleanup hooks when swapping a GlobalObject for test isolation, transferring their ownership to the new global. Separately, removes per-file retry logic from the parallel test coordinator, introduces panic classification/abort-on-panic, and updates tests and an N-API test addon. ChangesN-API Global Object Isolation
Test Coordinator Crash Handling
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
Updated 10:35 PM PT - May 3rd, 2026
❌ @robobun, your commit c62ff55 has 1 failures in
🧪 To try this PR locally: bunx bun-pr 30216That installs a local version of the PR into your bun-30216 --bun |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/cli/test/parallel.test.ts (1)
101-127: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winMake the crashing file sort first so this covers the mid-run path deterministically.
As written,
boom.test.jssorts aftera.test.jsandb.test.js, so the worker can crash on its last assigned file and this still passes without exercising the "remaining work continues with no retry" branch. Renaming the crashing file to sort first would make the regression actually hit the code path this PR changed.One simple way to make it deterministic
test("--parallel marks a file whose worker exits mid-run as failed (no retry)", async () => { using dir = tempDir("parallel-crash", { + "0-boom.test.js": `import {test} from "bun:test"; test("boom",()=>process.exit(7));`, "a.test.js": `import {test,expect} from "bun:test"; test("a",()=>expect(1).toBe(1));`, "b.test.js": `import {test,expect} from "bun:test"; test("b",()=>expect(1).toBe(1));`, - "boom.test.js": `import {test} from "bun:test"; test("boom",()=>process.exit(7));`, }); @@ - expect(stderr).toContain("boom.test.js"); + expect(stderr).toContain("0-boom.test.js"); expect(stderr).toContain("(worker crashed: exit code 7)");🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/cli/test/parallel.test.ts` around lines 101 - 127, The test is nondeterministic because "boom.test.js" can be scheduled last; rename the crashing file so it sorts first (e.g., prefix it like "0_boom.test.js" or "000_boom.test.js") in the tempDir fixture in the test body and update any assertions that reference the filename (expect(stderr).toContain(...)) to the new filename; leave the rest of the test logic (spawn, flags, and expectations about crash message, no "retrying", summary and exitCode) unchanged so the mid-run crash path is exercised deterministically.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/bun.js/bindings/napi.h`:
- Around line 383-393: The helper retargetGlobalObject(Zig::GlobalObject*
newGlobal) can only be used when newGlobal belongs to the same VM as the NapiEnv
(which stores m_vm at construction); add an assertion at the start of
retargetGlobalObject that verifies newGlobal's VM equals m_vm (e.g.,
ASSERT(newGlobal->vm() == m_vm)) before assigning m_globalObject = newGlobal so
misuse is caught early; reference NapiEnv, m_vm, retargetGlobalObject,
m_globalObject and Zig::GlobalObject to locate and update the function.
In `@src/bun.js/VirtualMachine.zig`:
- Around line 2572-2574: The isolate-swap code updates this.global,
VMHolder.cached_global_object, and this.regular_event_loop.global but neglects
the macro event loop; ensure you also retarget the macro event loop by assigning
the new global to its global field (e.g., set this.macro_event_loop.global =
new_global or VMHolder.macro_event_loop.global = new_global) in the same swap
code path so macro_event_loop doesn’t remain pinned to the old global after
enableMacroMode()/isolate swaps.
In `@src/cli/test/parallel/Coordinator.zig`:
- Around line 383-396: In abortOnWorkerPanic, remove or bypass the early guard
that returns when this.bailed is already true and ensure you force-stop every
alive worker (not only those with inflight == null): in the loop over
this.workers[0..this.spawned_count] call other.shutdown() for any other.alive
regardless of other.inflight, so inflight workers are stopped immediately; still
set this.bailed = true, call breakDots(), Output.prettyError/flush, and then
call this.abortQueuedFiles("aborted: worker panicked") so the panic path always
force-stops workers and lets reapWorker() account for their inflight files.
---
Outside diff comments:
In `@test/cli/test/parallel.test.ts`:
- Around line 101-127: The test is nondeterministic because "boom.test.js" can
be scheduled last; rename the crashing file so it sorts first (e.g., prefix it
like "0_boom.test.js" or "000_boom.test.js") in the tempDir fixture in the test
body and update any assertions that reference the filename
(expect(stderr).toContain(...)) to the new filename; leave the rest of the test
logic (spawn, flags, and expectations about crash message, no "retrying",
summary and exitCode) unchanged so the mid-run crash path is exercised
deterministically.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a0d8c28c-4690-4c20-9373-44eb172b8db8
📒 Files selected for processing (10)
src/bun.js/VirtualMachine.zigsrc/bun.js/bindings/ZigGlobalObject.cppsrc/bun.js/bindings/ZigGlobalObject.hsrc/bun.js/bindings/napi.hsrc/cli/test/parallel/Coordinator.zigsrc/cli/test/parallel/runner.zigtest/cli/test/parallel.test.tstest/napi/napi-app/binding.gyptest/napi/napi-app/isolate_finalizer_addon.ctest/regression/issue/30205.test.ts
💤 Files with no reviewable changes (1)
- src/cli/test/parallel/runner.zig
…lobal; panic-abort terminates inflight workers
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/cli/test/parallel/Coordinator.zig (1)
394-411:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPanic hard-abort is still skipped when
bailedwas already set.At Line 394,
if (this.bailed) return;bypasses worker termination and queued-file abort on a fatal-signal crash if--bailtripped earlier. That can still allow additional inflight output after the panic banner instead of an immediate crash abort path.Suggested fix
fn abortOnWorkerPanic(this: *Coordinator, file_idx: u32, status: bun.spawn.Status) void { this.breakDots(); var buf: [32]u8 = undefined; @@ ); Output.flush(); - if (this.bailed) return; - this.bailed = true; + const already_bailed = this.bailed; + this.bailed = true; @@ for (this.workers[0..this.spawned_count]) |*other| { if (!other.alive) continue; if (other.process) |p| { if (Environment.isPosix) { _ = std.c.kill(-p.pid, std.posix.SIG.TERM); } else { _ = p.kill(1); } } } - this.abortQueuedFiles("aborted: worker panicked"); + if (!already_bailed) { + this.abortQueuedFiles("aborted: worker panicked"); + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/test/parallel/Coordinator.zig` around lines 394 - 411, The early return "if (this.bailed) return;" prevents the hard-abort path from running when bailed was already true; change this to not return so the termination loop and this.abortQueuedFiles("aborted: worker panicked") still run. Concretely, remove the return and make setting this.bailed idempotent (e.g., only set this.bailed = true if it was false) so the for-loop over this.workers[0..this.spawned_count], the Environment.isPosix / p.kill path and the call to this.abortQueuedFiles still execute even if this.bailed was previously set; this preserves idempotency while ensuring the hard-abort logic always runs on panic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/cli/test/parallel/Coordinator.zig`:
- Around line 394-411: The early return "if (this.bailed) return;" prevents the
hard-abort path from running when bailed was already true; change this to not
return so the termination loop and this.abortQueuedFiles("aborted: worker
panicked") still run. Concretely, remove the return and make setting this.bailed
idempotent (e.g., only set this.bailed = true if it was false) so the for-loop
over this.workers[0..this.spawned_count], the Environment.isPosix / p.kill path
and the call to this.abortQueuedFiles still execute even if this.bailed was
previously set; this preserves idempotency while ensuring the hard-abort logic
always runs on panic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: cec60892-0dc8-42d8-8223-773ba5fd71c8
📒 Files selected for processing (1)
src/cli/test/parallel/Coordinator.zig
…bail already fired
…ocesses CI's ASAN lane sets BUN_JSC_validateExceptionChecks=1, which bunEnv passes through to the fixture subprocesses. Under collectContinuously the simulated-throw counter lands on the addon's NAPI_MODULE_INIT path (napi_create_function → napi_set_named_property) and hits a pre-existing unchecked ThrowScope in napi.cpp — the same reason every other napi test is listed in test/no-validate-exceptions.txt. This test is about GC behaviour across the isolate swap, not exception-scope validation, so strip the validator from the child env and add the test to the skip list.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/regression/issue/30205.test.ts`:
- Around line 148-162: Replace the fragmented assertions that destructure proc
results and assert stderr and exitCode separately by collecting stdout, stderr,
exitCode, and signalCode into a single result object and asserting with
toMatchObject(...) so failures show all fields at once; specifically, in the
test that creates proc via Bun.spawn({ cmd: [bunExe(), "test", "--parallel=2",
"."], env: { ...bunEnv, BUN_TEST_PARALLEL_SCALE_MS: "0" }, cwd: String(dir),
stdout: "pipe", stderr: "pipe" }) (the "worker crashed: SIGABRT" case) build an
object { stdout, stderr, exitCode, signalCode } and replace the multiple
expect(...) checks with one expect(result).toMatchObject({ stderr:
expect.stringContaining("worker crashed: SIGABRT"), stderr:
expect.stringMatching(/a test worker process crashed with SIGABRT while running
.*boom\.test\.js/), stderr: expect.not.stringContaining("retrying"), stderr:
expect.stringContaining("Aborting"), exitCode: expect.not.toBe(0) }) (adjust
matchers to equivalent toMatchObject-friendly forms), and apply the same
consolidation to the other crash-path test (the process.exit test around the
176–190 region).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 87c68cc9-6d1c-440d-8391-b7c56575e548
📒 Files selected for processing (2)
test/no-validate-exceptions.txttest/regression/issue/30205.test.ts
…T core on musl; guard @TagName on non-exhaustive SignalCode - Windows: the --isolate/--parallel napi-UAF fixtures time out under BUN_JSC_collectContinuously (same as 29519.test.ts). Skip on Windows; the swap/finalizer path is platform-agnostic and covered by POSIX. - Alpine/musl: the panic-abort test's worker intentionally SIGABRTs itself, which writes a core file that CI's coredump-upload collector flags as a failure even though all 4 assertions pass. Wrap the coordinator in sh -c 'ulimit -c 0 && exec ...' so RLIMIT_CORE=0 is inherited by the worker; same reasoning as the setrlimit in BunProcess.cpp's execve-abort path. - Coordinator.describeStatus: SignalCode is a non-exhaustive enum; @TagName on an unnamed value (RT signals 32-64) is safety-checked illegal behavior. Use sig.name() orelse bufPrint('signal N').
There was a problem hiding this comment.
No further issues from my side — all earlier inline comments were addressed. Deferring to a human reviewer since this touches GC-lifetime / NapiEnv ownership across the isolate global swap and changes --parallel crash semantics (retry removed, panic-abort added); also worth waiting on CI for c62ff55 since the last reported build (0e0573f) had new failures that this commit targets.
Extended reasoning...
Overview
This PR fixes a use-after-free in bun test --isolate / --parallel where deferred N-API finalizers run against a dead Zig::GlobalObject after the per-file global swap, and changes the --parallel coordinator to stop silently retrying crashed files and instead abort the whole run on a fatal-signal worker death. It spans:
ZigGlobalObject.{cpp,h}/napi.h: newadoptNapiEnvsForTestIsolation+NapiEnv::retargetGlobalObjectto moveRef<NapiEnv>ownership and repointm_globalObjectbeforegcUnprotect.VirtualMachine.zig: repointsmacro_event_loop.globalandrare_data.cleanup_hooks[*].globalThisduring the swap.Coordinator.zig/runner.zig: removesretries[]/pending_retry[], addsisPanicStatus,describeStatus,abortOnWorkerPanic(process-group SIGTERM of all workers), and reworksreapWorker.- New native test addon (
isolate_finalizer_addon.c), regression test, andparallel.test.tsupdates.
Security risks
None identified. No auth/crypto/permissions surface. The new C addon is a minimal napi_wrap + free finalizer used only in tests. std.c.kill(-pid, SIGTERM) targets only PIDs the coordinator itself spawned. No untrusted-input parsing was added.
Level of scrutiny
High. The core fix manipulates JSC GC lifetime: it transfers Ref<NapiEnv> ownership between GlobalObject instances and repoints a raw Zig::GlobalObject* so that a write barrier on a dead cell doesn't fire. The reasoning in the PR description is detailed and plausible, and the regression test reportedly reproduces 8/8 on unpatched main, but this is exactly the class of change (GC/UAF, Ref ownership move, ordering relative to gcUnprotect/DeferGC) where a subtle mistake causes rare production crashes in every native-addon user. The --parallel behavioral change (no retry → abort on fatal signal) is a deliberate UX/policy decision that a maintainer should sign off on.
Other factors
- All seven of my prior inline comments and the CodeRabbit findings were addressed in follow-up commits (19503d4, 1160318, 5a6dcdf, a8d09c8, 0e0573f, c62ff55) and the threads are resolved.
- The most recent robobun CI report is for 0e0573f (two commits behind HEAD) and shows new failures of
30205.test.tson Windows and musl; HEAD c62ff55's commit message says it fixes exactly those (skip on Windows, suppress intentional SIGABRT core on musl, guard@tagName), but there's no green build posted yet. - Test coverage for the change is good (4-test regression file + updated
parallel.test.ts), and the PR description includes a fail-on-unpatched / pass-on-patched gate.
Given the GC-sensitive core change, the user-facing behavioral change to --parallel, and the not-yet-confirmed CI on HEAD, this should go to a human reviewer rather than be auto-approved.
reapWorker's mid-file handling is keyed on inflight, which is only set after .ready -> dispatch. A worker that spawns OK but dies during init (before the IPC handshake) has inflight == null, skips that block, and is respawned unconditionally at hasUndispatchedFiles() — the coordinator spins forever. Track reached_ready + startup_failures per slot. A reap with inflight == null and !reached_ready increments startup_failures; can_respawn requires startup_failures < max_startup_failures (2), so after two consecutive pre-ready exits the slot stops respawning and falls through to the existing !respawned -> abortQueuedFiles path. startup_failures resets on .ready so only consecutive init failures count. A pre-ready fatal signal (SIGSEGV/SIGABRT/etc.) additionally aborts the whole run via abortOnWorkerStartupPanic, mirroring #30216's mid-file abortOnWorkerPanic. Add BUN_TEST_WORKER_EXIT_BEFORE_READY so parallel-startup-failure.test.ts can deterministically exercise the path (real triggers are init segfaults / failed fd-3 adopt, not reproducible from outside the worker).
reapWorker's mid-file handling is keyed on inflight, which is only set after .ready -> dispatch. A worker that spawns OK but dies during init (before the IPC handshake) has inflight == null, skips that block, and is respawned unconditionally at hasUndispatchedFiles() — the coordinator spins forever. Track reached_ready + startup_failures per slot. A reap with inflight == null and !reached_ready increments startup_failures; can_respawn requires startup_failures < max_startup_failures (2), so after two consecutive pre-ready exits the slot stops respawning and falls through to the existing !respawned -> abortQueuedFiles path. startup_failures resets on .ready so only consecutive init failures count. A pre-ready fatal signal (SIGSEGV/SIGABRT/etc.) additionally aborts the whole run via abortOnWorkerStartupPanic, mirroring #30216's mid-file abortOnWorkerPanic. Add BUN_TEST_WORKER_EXIT_BEFORE_READY so parallel-startup-failure.test.ts can deterministically exercise the path (real triggers are init segfaults / failed fd-3 adopt, not reproducible from outside the worker).
…worker panic, never retry (oven-sh#30216) ## What `bun test --isolate` / `--parallel` crashes when a test file loads a native addon whose deferred napi finalizers outlive the file. The `--parallel` coordinator then silently retries the file once, which masks the panic and lets the run exit 0. Fixes oven-sh#30205, oven-sh#30191. Supersedes oven-sh#30214 (same NapiEnv fix, but without the coordinator change, the `cleanup_hooks` retarget, or a test that actually reproduces on unpatched `main`). ## Reproduction ```sh git clone https://github.com/workglow-dev/libs && cd libs bun i && bun run build:packages bun test --timeout=30000 --parallel=4 packages/test/src/test/{util,task}/*.test.ts ``` On `main` (d484fd6), 3–4 workers crash per run with either ``` ASSERTION FAILED: isMarked(cell) JavaScriptCore/heap/Heap.cpp:1232 : void JSC::Heap::addToRememberedSet(const JSCell *) ``` or (when the slot is already being reallocated) ``` ASSERTION FAILED: m_cellState == CellState::DefinitelyWhite JavaScriptCore/JSCellInlines.h:69 : JSC::JSCell::JSCell(VM &, Structure *) ``` and in release builds the segfaults at `0x68` / `0xD0` reported in oven-sh#30205. ## Root cause Frame-pointer walk from the assertion: ``` oven-sh#3 Bun::NapiHandleScope::open(Zig::GlobalObject*, bool) oven-sh#4 NapiHandleScope__open oven-sh#6 napi.Finalizer.run oven-sh#7 napi.NapiFinalizerTask.runOnJSThread oven-sh#10 event_loop.tick oven-sh#11 event_loop.waitForPromise oven-sh#13 VirtualMachine.loadEntryPointForTestRunner ← next test file ``` `NapiEnv::m_globalObject` is a raw `Zig::GlobalObject*`. For non-experimental addons (`nm_version != NAPI_VERSION_EXPERIMENTAL`, which is ~every real-world addon — sharp, better-sqlite3, etc.), `napi_wrap`/`napi_create_external` finalizers are **deferred** to the event loop as `NapiFinalizerTask` rather than run inside GC sweep. Objects rooted on the old global (module graph, `globalThis.*`) only become collectable when `Zig__GlobalObject__createForTestIsolation` runs `gcUnprotect(oldGlobal)`. The `DeferGC` from oven-sh#29573 ends at that function's `}`, so the next GC runs there, collects those objects, and enqueues their finalizers. Those tasks then run on the very next `eventLoop().tick()` — inside `loadEntryPointForTestRunner`'s `waitForPromise` for file N+1. `Finalizer.run` opens a `NapiHandleScope` via `env->globalObject()`, which reads `NapiHandleScopeImplStructure()` off the dead cell and writes `m_currentNapiHandleScopeImpl` on it → write barrier on an unmarked cell. The `--parallel` coordinator's `reapWorker` then re-queued the file once (`retries[idx] < 1`) into a fresh worker with no stale `NapiEnv`, which passed — so the run reported 0 fail despite multiple Bun panics in the log. ## Fix **NapiEnv retarget** (`ZigGlobalObject.cpp`, `napi.h`): `Zig__GlobalObject__createForTestIsolation` now calls `newGlobal->adoptNapiEnvsForTestIsolation(oldGlobal)` before `gcUnprotect`. Each `NapiEnv::m_globalObject` is repointed at the new global and the `Ref<NapiEnv>`s are moved over, so late finalizers open handle scopes on a live global and the envs stay owned after the old global is swept. `VirtualMachine.swapGlobalForTestIsolation` also repoints `rare_data.cleanup_hooks[*].globalThis` so `CleanupHook.eql()` stays accurate. **No retry, abort on panic** (`Coordinator.zig`): removed the per-file retry. A worker that dies mid-file is counted as one failure. If it died by a fatal signal (SIGILL/SIGTRAP/SIGABRT/SIGBUS/SIGFPE/SIGSEGV/SIGSYS — Bun's own `@trap()`, a JSC/WTF assertion, or native-addon crash), the whole run aborts with `error: a test worker process crashed with <SIG> while running <file>`. `process.exit()` / SIGKILL are still just a per-file failure and the run continues. ## Verification - `test/regression/issue/30205.test.ts` — 4 tests. Adds a tiny non-experimental addon (`isolate_finalizer_addon.c`) and a fixture pattern (`Bun.gc(true)` + module-scope `await 0` + objects rooted on `globalThis`) that crashes **8/8** on unpatched `main` and passes 8/8 with this change. - `workglow-dev/libs` full 201-file unit suite: 3× clean `--parallel=4` runs (was 3–4 crashes/run). - Gate: `git stash -- src/ && bun bd test test/regression/issue/30205.test.ts` → 3/4 fail; with fix → 4/4 pass. - `test/cli/test/isolation.test.ts`, `test/regression/issue/29519.test.ts` → pass (one pre-existing unrelated timeout in isolation.test.ts, same as oven-sh#29573). - `test/cli/test/parallel.test.ts` → all tests I touched pass; the 3 timing-sensitive scale-up/work-steal tests that fail in this container fail identically on unmodified `main`. --------- Co-authored-by: robobun <robobun@users.noreply.github.com> Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
reap_worker's mid-file handling is keyed on inflight, which is only set after .ready -> dispatch. A worker that spawns OK but dies during init (before the IPC handshake) has inflight == None, skips that block, and is respawned unconditionally at has_undispatched_files() — the coordinator spins forever. Track reached_ready + startup_failures per slot. A reap with inflight == None and !reached_ready increments startup_failures; can_respawn requires startup_failures < MAX_STARTUP_FAILURES (2), so after two consecutive pre-ready exits the slot stops respawning and falls through to the existing !respawned -> abort_queued_files path. startup_failures resets on .ready so only consecutive init failures count. A pre-ready fatal signal (SIGSEGV/SIGABRT/etc.) additionally aborts the whole run via abort_on_worker_startup_panic, mirroring #30216's mid-file abort_on_worker_panic. Add BUN_TEST_WORKER_EXIT_BEFORE_READY so parallel-startup-failure.test.ts can deterministically exercise the path (real triggers are init segfaults / failed fd-3 adopt, not reproducible from outside the worker). Applied to both the live .rs sources under src/runtime/cli/test/ parallel/ and the co-located .zig reference files so the two stay in sync.
reap_worker's mid-file handling is keyed on inflight, which is only set after .ready -> dispatch. A worker that spawns OK but dies during init (before the IPC handshake) has inflight == None, skips that block, and is respawned unconditionally at has_undispatched_files() — the coordinator spins forever. Track reached_ready + startup_failures per slot. A reap with inflight == None and !reached_ready increments startup_failures; can_respawn requires startup_failures < MAX_STARTUP_FAILURES (2), so after two consecutive pre-ready exits the slot stops respawning and falls through to the existing !respawned -> abort_queued_files path. startup_failures resets on .ready so only consecutive init failures count. A pre-ready fatal signal (SIGSEGV/SIGABRT/etc.) additionally aborts the whole run via abort_on_worker_startup_panic, mirroring #30216's mid-file abort_on_worker_panic. Add BUN_TEST_WORKER_EXIT_BEFORE_READY so parallel-startup-failure.test.ts can deterministically exercise the path (real triggers are init segfaults / failed fd-3 adopt, not reproducible from outside the worker). Applied to both the live .rs sources under src/runtime/cli/test/ parallel/ and the co-located .zig reference files so the two stay in sync.
Remove napi-related artifacts from this branch that came from other PRs (oven-sh#29785, oven-sh#30047, oven-sh#30216). The napi files now match main exactly, making this PR clean and self-contained with only the nested-resolutions feature. Restored files: - src/jsc/bindings/napi.cpp - test/napi/napi-app/standalone_tests.cpp - test/napi/napi.test.ts - test/napi/napi-app/binding.gyp Removed files (not on main): - test/napi/napi-app/isolate_finalizer_addon.c - test/napi/napi-app/reentrant_register_addon.cpp
What
bun test --isolate/--parallelcrashes when a test file loads a native addon whose deferred napi finalizers outlive the file. The--parallelcoordinator then silently retries the file once, which masks the panic and lets the run exit 0.Fixes #30205, #30191. Supersedes #30214 (same NapiEnv fix, but without the coordinator change, the
cleanup_hooksretarget, or a test that actually reproduces on unpatchedmain).Reproduction
On
main(d484fd6), 3–4 workers crash per run with eitheror (when the slot is already being reallocated)
and in release builds the segfaults at
0x68/0xD0reported in #30205.Root cause
Frame-pointer walk from the assertion:
NapiEnv::m_globalObjectis a rawZig::GlobalObject*. For non-experimental addons (nm_version != NAPI_VERSION_EXPERIMENTAL, which is ~every real-world addon — sharp, better-sqlite3, etc.),napi_wrap/napi_create_externalfinalizers are deferred to the event loop asNapiFinalizerTaskrather than run inside GC sweep.Objects rooted on the old global (module graph,
globalThis.*) only become collectable whenZig__GlobalObject__createForTestIsolationrunsgcUnprotect(oldGlobal). TheDeferGCfrom #29573 ends at that function's}, so the next GC runs there, collects those objects, and enqueues their finalizers. Those tasks then run on the very nexteventLoop().tick()— insideloadEntryPointForTestRunner'swaitForPromisefor file N+1.Finalizer.runopens aNapiHandleScopeviaenv->globalObject(), which readsNapiHandleScopeImplStructure()off the dead cell and writesm_currentNapiHandleScopeImplon it → write barrier on an unmarked cell.The
--parallelcoordinator'sreapWorkerthen re-queued the file once (retries[idx] < 1) into a fresh worker with no staleNapiEnv, which passed — so the run reported 0 fail despite multiple Bun panics in the log.Fix
NapiEnv retarget (
ZigGlobalObject.cpp,napi.h):Zig__GlobalObject__createForTestIsolationnow callsnewGlobal->adoptNapiEnvsForTestIsolation(oldGlobal)beforegcUnprotect. EachNapiEnv::m_globalObjectis repointed at the new global and theRef<NapiEnv>s are moved over, so late finalizers open handle scopes on a live global and the envs stay owned after the old global is swept.VirtualMachine.swapGlobalForTestIsolationalso repointsrare_data.cleanup_hooks[*].globalThissoCleanupHook.eql()stays accurate.No retry, abort on panic (
Coordinator.zig): removed the per-file retry. A worker that dies mid-file is counted as one failure. If it died by a fatal signal (SIGILL/SIGTRAP/SIGABRT/SIGBUS/SIGFPE/SIGSEGV/SIGSYS — Bun's own@trap(), a JSC/WTF assertion, or native-addon crash), the whole run aborts witherror: a test worker process crashed with <SIG> while running <file>.process.exit()/ SIGKILL are still just a per-file failure and the run continues.Verification
test/regression/issue/30205.test.ts— 4 tests. Adds a tiny non-experimental addon (isolate_finalizer_addon.c) and a fixture pattern (Bun.gc(true)+ module-scopeawait 0+ objects rooted onglobalThis) that crashes 8/8 on unpatchedmainand passes 8/8 with this change.workglow-dev/libsfull 201-file unit suite: 3× clean--parallel=4runs (was 3–4 crashes/run).git stash -- src/ && bun bd test test/regression/issue/30205.test.ts→ 3/4 fail; with fix → 4/4 pass.test/cli/test/isolation.test.ts,test/regression/issue/29519.test.ts→ pass (one pre-existing unrelated timeout in isolation.test.ts, same as isolate: defer GC across Zig::GlobalObject swap; guard unique_ptr visit #29573).test/cli/test/parallel.test.ts→ all tests I touched pass; the 3 timing-sensitive scale-up/work-steal tests that fail in this container fail identically on unmodifiedmain.