Skip to content

fix(test --isolate): repoint NapiEnv at new global during swap#30214

Closed
robobun wants to merge 5 commits into
mainfrom
farm/f4c89feb/napi-env-isolate-swap
Closed

fix(test --isolate): repoint NapiEnv at new global during swap#30214
robobun wants to merge 5 commits into
mainfrom
farm/f4c89feb/napi-env-isolate-swap

Conversation

@robobun

@robobun robobun commented May 4, 2026

Copy link
Copy Markdown
Collaborator

Fixes #30205 (also the remaining half of #30191).

Repro

Any test suite that loads a NAPI module under bun test --isolate (or --parallel, which implies --isolate per worker). Minimal:

# 20 files, each: require(<.node addon with deferred finalizers>), napi_wrap a batch
bun test --isolate .

Release builds segfault at 0x68/0xD0/0xE00000020; debug builds assert isMarked(cell) in JSC::Heap::addToRememberedSet. The reporter's workglow-dev/libs suite hits it on every bun run test:bun:unit (the panics are masked by retry = 1 so the exit code stays 0).

Root cause

NapiEnv::m_globalObject is a raw Zig::GlobalObject*. Zig__GlobalObject__createForTestIsolation creates a new global and gcUnprotect()s the old one, but never touches the NapiEnvs the old global created. Those envs are kept alive by NapiEnv::Ref held in queued NapiFinalizerTasks (and by weak-handle owners that fire napi_internal_enqueue_finalizer when the old graph is swept on a later GC). When the deferred finalizer runs on the event loop:

NapiFinalizerTask.runOnJSThread
  → Finalizer.run
    → NapiHandleScope.open(env, false)
      → NapiHandleScope__open(env, false)
        → NapiHandleScope::open(env->globalObject(), …)   // stale raw pointer
          → globalObject->m_currentNapiHandleScopeImpl.set(vm, globalObject, impl)
            → vm.writeBarrier(globalObject, …)
              → Heap::addToRememberedSet(globalObject)     // cell is swept → ASSERT / segfault

Every process_dlopen in the Features: line of the reporter's panics is the fingerprint.

This is distinct from #29519 / PR #29573: that DeferGC protects the new global during finishCreation; this is the old global being written to long after the swap, via a raw pointer the swap never updated.

Fix

During the swap, move oldGlobal->m_napiEnvs into the new global and set each env's m_globalObject to the new global. This is done inside the existing DeferGC scope, before gcUnprotect(oldGlobal), so the envs never observe an unprotected pointer. The new global keeps them alive and reports their pending finalizers via hasNapiFinalizers(); every env->globalObject() caller (NapiHandleScope, Finalizer.run, napi_async_work::runFromJS, ThreadSafeFunction, toJS(napi_env)) now sees a live, gcProtect()'d global.

Verification

New regression test in test/regression/issue/30205.test.ts spawns bun test --isolate over 20 files that each require the existing test/napi/napi-app/async_finalize_addon (NAPI v8, deferred finalizers) and wrap 1000 objects with Buffer ballast rooted on globalThis, under BUN_JSC_collectContinuously=1.

build before fix after fix
debug (ASAN) ASSERTION FAILED: isMarked(cell) — 10/10 20 pass, 0 fail — 3/3
release (system bun) panic: Segmentation fault at address 0x68 — 8/8

The reporter's sharp repro (20 files, no collectContinuously) also passes 3/3 after the fix and 3/3 under --parallel=4 --retry=1 --max-concurrency=20.

NapiEnv holds a raw Zig::GlobalObject*. When bun test --isolate swaps
the global between files, the old global is gcUnprotect()'d and later
swept, but NapiFinalizerTasks queued by the previous file (and weak
handle callbacks that fire on subsequent sweeps) still hold NapiEnvs
pointing at the dead global. Opening a handle scope through that env
writes a barrier on a swept cell — debug builds assert isMarked(cell)
in Heap::addToRememberedSet, release builds segfault at 0x68/0xD0.

Move the old global's NapiEnvs to the new global and repoint
m_globalObject during Zig__GlobalObject__createForTestIsolation so
every env->globalObject() caller (NapiHandleScope, Finalizer.run,
napi_async_work, ThreadSafeFunction) sees a live, gcProtect()'d global.

Fixes #30205
@coderabbitai

coderabbitai Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4d436f80-2a7d-44da-affb-82197dcb65ba

📥 Commits

Reviewing files that changed from the base of the PR and between 810f6a8 and 4494b6b.

📒 Files selected for processing (1)
  • test/regression/issue/30205.test.ts

Walkthrough

When running tests with bun test --isolate, newly created GlobalObjects now adopt existing N-API environments from the old global: each NapiEnv's m_globalObject is repointed and the container of envs is moved to the new global before the old global becomes collectable, preventing use-after-free during deferred finalizers.

Changes

N-API Environment Repointing for Test Isolation

Layer / File(s) Summary
Public API Additions
src/bun.js/bindings/napi.h, src/bun.js/bindings/ZigGlobalObject.h
Adds NapiEnv::setGlobalObject(Zig::GlobalObject*) and declares GlobalObject::adoptNapiEnvsForTestIsolation(GlobalObject* fromGlobal).
Data / State Repointing
src/bun.js/bindings/ZigGlobalObject.cpp, src/bun.js/bindings/napi.h
New GlobalObject::adoptNapiEnvsForTestIsolation iterates existing NapiEnvs, asserts same VM, calls env->setGlobalObject(this), and prepares to transfer ownership of the env container.
Ownership Transfer
src/bun.js/bindings/ZigGlobalObject.cpp
Moves m_napiEnvs from the source global into the new global using std::exchange, ensuring the new global becomes the owner of the N-API env collection.
Integration Point
src/bun.js/bindings/ZigGlobalObject.cpp
In Zig__GlobalObject__createForTestIsolation, after gcProtect() on the newly created global, call adoptNapiEnvsForTestIsolation(oldGlobal) before the old global can be unprotected/collected.
Regression Test
test/regression/issue/30205.test.ts
Adds a (Windows-skipped) regression test that builds an N-API addon and runs bun test --isolate across 20 generated test files with continuous collection enabled to assert no crashes and full pass summary.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the primary change: repointing NapiEnv to the new global during test isolation swap.
Description check ✅ Passed The PR description provides comprehensive context including root cause analysis, fix details, and verification results, exceeding the template's minimal requirements.
Linked Issues check ✅ Passed The code changes fully address issue #30205 by preventing use-after-free crashes in NAPI environments during test isolation via repointing and moving env ownership.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the NAPI env repointing issue: bindings implementation, header declarations, and a targeted regression test.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Review rate limit: 0/5 reviews remaining, refill in 50 minutes and 14 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added the claude label May 4, 2026
@robobun

robobun commented May 4, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/bun.js/bindings/napi.h`:
- Around line 383-388: Add a guard and restrict visibility for the env
retargeting: make inline void setGlobalObject(Zig::GlobalObject* globalObject) a
non-public mutator (move it out of the public section) and inside it assert that
globalObject->vm() (or equivalent accessor) equals m_vm to enforce the invariant
that m_vm is fixed at construction; document that
GlobalObject::adoptNapiEnvsForTestIsolation() is the only intended caller so
callers must also update ownership bookkeeping when retargeting NapiEnvs.

In `@test/regression/issue/30205.test.ts`:
- Around line 80-90: Replace the current separate assertions that read
stdout/stderr then check exitCode with a single, combined assertion that awaits
stdout, stderr and proc.exited together and asserts the child result object
(including exitCode and signalCode) at once; specifically, change the
Promise.all call to collect proc.stdout.text(), proc.stderr.text(), and
proc.exited into variables (e.g. stdout, stderr, exited) and then assert an
object containing stdout, stderr, exitCode (from exited) and signalCode (from
exited) instead of asserting streams first and exitCode separately so
crashes/segfaults produce actionable diffs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9cf4346a-739c-43a2-9089-42c5523dfba5

📥 Commits

Reviewing files that changed from the base of the PR and between 797dee4 and 8fb346d.

📒 Files selected for processing (4)
  • src/bun.js/bindings/ZigGlobalObject.cpp
  • src/bun.js/bindings/ZigGlobalObject.h
  • src/bun.js/bindings/napi.h
  • test/regression/issue/30205.test.ts

Comment thread src/bun.js/bindings/napi.h Outdated
Comment thread test/regression/issue/30205.test.ts Outdated
@robobun

robobun commented May 4, 2026

Copy link
Copy Markdown
Collaborator Author

Superseded by #30216, which has the same adoptNapiEnvsForTestIsolation fix plus:

  • the --parallel coordinator change (no retry; fatal-signal worker crash aborts the run) that was also requested for panic during tests with parallel #30205 — without it, the retry keeps masking the panic even after the NapiEnv fix goes in, so CI stays green on a real crash
  • rare_data.cleanup_hooks[*].globalThis retarget so CleanupHook.eql() doesn't compare against a dead global
  • a regression test that reproduces 8/8 on unpatched main (the async_finalize_addon.create_ref() + in-test ballast pattern here returns undefined so the wrapped objects are collected during the runner's own post-test tick(), before the swap — I ran it 10× on main and couldn't get it to assert)

Happy to fold anything from here into #30216 instead if preferred.

robobun and others added 2 commits May 4, 2026 01:51
… setGlobalObject

The addon's printf() output is fully-buffered on a pipe and may or may
not be flushed before the isolate runner exits depending on build
flavour; release-asan CI saw only the banner on stdout. The crash is
fully detected by stderr lacking the pass summary + non-zero exit,
so assert those together with signalCode for an actionable diff and
drop the stdout capture entirely.

Also assert &getVM(global) == &m_vm in NapiEnv::setGlobalObject so
future callers can't retarget an env across VMs.
Comment thread test/regression/issue/30205.test.ts
Comment thread test/regression/issue/30205.test.ts
CI ASAN sets BUN_JSC_validateExceptionChecks=1 on the outer test;
bunEnv spreads process.env so the inner bun test --isolate inherited
it and aborted on a pre-existing unchecked-exception scope in NAPI
module init (napi_create_function -> napi_set_named_property),
unrelated to the global-swap UAF under test. Strip it from the child
env like the other NAPI tests in no-validate-exceptions.txt.
Comment thread src/bun.js/bindings/ZigGlobalObject.cpp
Jarred-Sumner pushed a commit that referenced this pull request May 4, 2026
…worker panic, never retry (#30216)

## What

`bun test --isolate` / `--parallel` crashes when a test file loads a
native addon whose deferred napi finalizers outlive the file. The
`--parallel` coordinator then silently retries the file once, which
masks the panic and lets the run exit 0.

Fixes #30205, #30191. Supersedes #30214 (same NapiEnv fix, but without
the coordinator change, the `cleanup_hooks` retarget, or a test that
actually reproduces on unpatched `main`).

## Reproduction

```sh
git clone https://github.com/workglow-dev/libs && cd libs
bun i && bun run build:packages
bun test --timeout=30000 --parallel=4 packages/test/src/test/{util,task}/*.test.ts
```

On `main` (d484fd6), 3–4 workers crash per run with either
```
ASSERTION FAILED: isMarked(cell)
  JavaScriptCore/heap/Heap.cpp:1232 : void JSC::Heap::addToRememberedSet(const JSCell *)
```
or (when the slot is already being reallocated)
```
ASSERTION FAILED: m_cellState == CellState::DefinitelyWhite
  JavaScriptCore/JSCellInlines.h:69 : JSC::JSCell::JSCell(VM &, Structure *)
```
and in release builds the segfaults at `0x68` / `0xD0` reported in
#30205.

## Root cause

Frame-pointer walk from the assertion:

```
#3  Bun::NapiHandleScope::open(Zig::GlobalObject*, bool)
#4  NapiHandleScope__open
#6  napi.Finalizer.run
#7  napi.NapiFinalizerTask.runOnJSThread
#10 event_loop.tick
#11 event_loop.waitForPromise
#13 VirtualMachine.loadEntryPointForTestRunner   ← next test file
```

`NapiEnv::m_globalObject` is a raw `Zig::GlobalObject*`. For
non-experimental addons (`nm_version != NAPI_VERSION_EXPERIMENTAL`,
which is ~every real-world addon — sharp, better-sqlite3, etc.),
`napi_wrap`/`napi_create_external` finalizers are **deferred** to the
event loop as `NapiFinalizerTask` rather than run inside GC sweep.

Objects rooted on the old global (module graph, `globalThis.*`) only
become collectable when `Zig__GlobalObject__createForTestIsolation` runs
`gcUnprotect(oldGlobal)`. The `DeferGC` from #29573 ends at that
function's `}`, so the next GC runs there, collects those objects, and
enqueues their finalizers. Those tasks then run on the very next
`eventLoop().tick()` — inside `loadEntryPointForTestRunner`'s
`waitForPromise` for file N+1. `Finalizer.run` opens a `NapiHandleScope`
via `env->globalObject()`, which reads `NapiHandleScopeImplStructure()`
off the dead cell and writes `m_currentNapiHandleScopeImpl` on it →
write barrier on an unmarked cell.

The `--parallel` coordinator's `reapWorker` then re-queued the file once
(`retries[idx] < 1`) into a fresh worker with no stale `NapiEnv`, which
passed — so the run reported 0 fail despite multiple Bun panics in the
log.

## Fix

**NapiEnv retarget** (`ZigGlobalObject.cpp`, `napi.h`):
`Zig__GlobalObject__createForTestIsolation` now calls
`newGlobal->adoptNapiEnvsForTestIsolation(oldGlobal)` before
`gcUnprotect`. Each `NapiEnv::m_globalObject` is repointed at the new
global and the `Ref<NapiEnv>`s are moved over, so late finalizers open
handle scopes on a live global and the envs stay owned after the old
global is swept. `VirtualMachine.swapGlobalForTestIsolation` also
repoints `rare_data.cleanup_hooks[*].globalThis` so `CleanupHook.eql()`
stays accurate.

**No retry, abort on panic** (`Coordinator.zig`): removed the per-file
retry. A worker that dies mid-file is counted as one failure. If it died
by a fatal signal (SIGILL/SIGTRAP/SIGABRT/SIGBUS/SIGFPE/SIGSEGV/SIGSYS —
Bun's own `@trap()`, a JSC/WTF assertion, or native-addon crash), the
whole run aborts with `error: a test worker process crashed with <SIG>
while running <file>`. `process.exit()` / SIGKILL are still just a
per-file failure and the run continues.

## Verification

- `test/regression/issue/30205.test.ts` — 4 tests. Adds a tiny
non-experimental addon (`isolate_finalizer_addon.c`) and a fixture
pattern (`Bun.gc(true)` + module-scope `await 0` + objects rooted on
`globalThis`) that crashes **8/8** on unpatched `main` and passes 8/8
with this change.
- `workglow-dev/libs` full 201-file unit suite: 3× clean `--parallel=4`
runs (was 3–4 crashes/run).
- Gate: `git stash -- src/ && bun bd test
test/regression/issue/30205.test.ts` → 3/4 fail; with fix → 4/4 pass.
- `test/cli/test/isolation.test.ts`,
`test/regression/issue/29519.test.ts` → pass (one pre-existing unrelated
timeout in isolation.test.ts, same as #29573).
- `test/cli/test/parallel.test.ts` → all tests I touched pass; the 3
timing-sensitive scale-up/work-steal tests that fail in this container
fail identically on unmodified `main`.

---------

Co-authored-by: robobun <robobun@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
xhjkl pushed a commit to xhjkl/bun that referenced this pull request May 14, 2026
…worker panic, never retry (oven-sh#30216)

## What

`bun test --isolate` / `--parallel` crashes when a test file loads a
native addon whose deferred napi finalizers outlive the file. The
`--parallel` coordinator then silently retries the file once, which
masks the panic and lets the run exit 0.

Fixes oven-sh#30205, oven-sh#30191. Supersedes oven-sh#30214 (same NapiEnv fix, but without
the coordinator change, the `cleanup_hooks` retarget, or a test that
actually reproduces on unpatched `main`).

## Reproduction

```sh
git clone https://github.com/workglow-dev/libs && cd libs
bun i && bun run build:packages
bun test --timeout=30000 --parallel=4 packages/test/src/test/{util,task}/*.test.ts
```

On `main` (d484fd6), 3–4 workers crash per run with either
```
ASSERTION FAILED: isMarked(cell)
  JavaScriptCore/heap/Heap.cpp:1232 : void JSC::Heap::addToRememberedSet(const JSCell *)
```
or (when the slot is already being reallocated)
```
ASSERTION FAILED: m_cellState == CellState::DefinitelyWhite
  JavaScriptCore/JSCellInlines.h:69 : JSC::JSCell::JSCell(VM &, Structure *)
```
and in release builds the segfaults at `0x68` / `0xD0` reported in
oven-sh#30205.

## Root cause

Frame-pointer walk from the assertion:

```
oven-sh#3  Bun::NapiHandleScope::open(Zig::GlobalObject*, bool)
oven-sh#4  NapiHandleScope__open
oven-sh#6  napi.Finalizer.run
oven-sh#7  napi.NapiFinalizerTask.runOnJSThread
oven-sh#10 event_loop.tick
oven-sh#11 event_loop.waitForPromise
oven-sh#13 VirtualMachine.loadEntryPointForTestRunner   ← next test file
```

`NapiEnv::m_globalObject` is a raw `Zig::GlobalObject*`. For
non-experimental addons (`nm_version != NAPI_VERSION_EXPERIMENTAL`,
which is ~every real-world addon — sharp, better-sqlite3, etc.),
`napi_wrap`/`napi_create_external` finalizers are **deferred** to the
event loop as `NapiFinalizerTask` rather than run inside GC sweep.

Objects rooted on the old global (module graph, `globalThis.*`) only
become collectable when `Zig__GlobalObject__createForTestIsolation` runs
`gcUnprotect(oldGlobal)`. The `DeferGC` from oven-sh#29573 ends at that
function's `}`, so the next GC runs there, collects those objects, and
enqueues their finalizers. Those tasks then run on the very next
`eventLoop().tick()` — inside `loadEntryPointForTestRunner`'s
`waitForPromise` for file N+1. `Finalizer.run` opens a `NapiHandleScope`
via `env->globalObject()`, which reads `NapiHandleScopeImplStructure()`
off the dead cell and writes `m_currentNapiHandleScopeImpl` on it →
write barrier on an unmarked cell.

The `--parallel` coordinator's `reapWorker` then re-queued the file once
(`retries[idx] < 1`) into a fresh worker with no stale `NapiEnv`, which
passed — so the run reported 0 fail despite multiple Bun panics in the
log.

## Fix

**NapiEnv retarget** (`ZigGlobalObject.cpp`, `napi.h`):
`Zig__GlobalObject__createForTestIsolation` now calls
`newGlobal->adoptNapiEnvsForTestIsolation(oldGlobal)` before
`gcUnprotect`. Each `NapiEnv::m_globalObject` is repointed at the new
global and the `Ref<NapiEnv>`s are moved over, so late finalizers open
handle scopes on a live global and the envs stay owned after the old
global is swept. `VirtualMachine.swapGlobalForTestIsolation` also
repoints `rare_data.cleanup_hooks[*].globalThis` so `CleanupHook.eql()`
stays accurate.

**No retry, abort on panic** (`Coordinator.zig`): removed the per-file
retry. A worker that dies mid-file is counted as one failure. If it died
by a fatal signal (SIGILL/SIGTRAP/SIGABRT/SIGBUS/SIGFPE/SIGSEGV/SIGSYS —
Bun's own `@trap()`, a JSC/WTF assertion, or native-addon crash), the
whole run aborts with `error: a test worker process crashed with <SIG>
while running <file>`. `process.exit()` / SIGKILL are still just a
per-file failure and the run continues.

## Verification

- `test/regression/issue/30205.test.ts` — 4 tests. Adds a tiny
non-experimental addon (`isolate_finalizer_addon.c`) and a fixture
pattern (`Bun.gc(true)` + module-scope `await 0` + objects rooted on
`globalThis`) that crashes **8/8** on unpatched `main` and passes 8/8
with this change.
- `workglow-dev/libs` full 201-file unit suite: 3× clean `--parallel=4`
runs (was 3–4 crashes/run).
- Gate: `git stash -- src/ && bun bd test
test/regression/issue/30205.test.ts` → 3/4 fail; with fix → 4/4 pass.
- `test/cli/test/isolation.test.ts`,
`test/regression/issue/29519.test.ts` → pass (one pre-existing unrelated
timeout in isolation.test.ts, same as oven-sh#29573).
- `test/cli/test/parallel.test.ts` → all tests I touched pass; the 3
timing-sensitive scale-up/work-steal tests that fail in this container
fail identically on unmodified `main`.

---------

Co-authored-by: robobun <robobun@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

panic during tests with parallel

2 participants