napi: clear pending exceptions between finalizers during env cleanup#30291
napi: clear pending exceptions between finalizers during env cleanup#30291robobun wants to merge 15 commits into
Conversation
WalkthroughAdds VM-level exception clearing during N-API environment teardown and removes throw-scope reliance when creating errors. Introduces a clearExceptionsBetweenFinalizers helper, invokes it between finalizers in NapiEnv::cleanup, includes TopExceptionScope, and adds a regression N-API test and build target validating napi_create_error during cleanup. ChangesNAPI Finalizer Exception Handling
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/napi/napi.test.ts`:
- Around line 720-721: Remove the negative assertions that check stderr for
"panic" and "NAPI FATAL" and rely on the existing positive assertions (the
create_error_status=0 and exitCode=0 checks) to validate the test; specifically
delete or comment out the two lines using expect(stderr).not.toContain("panic")
and expect(stderr).not.toContain("NAPI FATAL") (or convert them to non-failing
diagnostic logs) so the test follows the guideline of asserting success rather
than absence of panic strings.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 689b52ca-94bc-4d1e-940d-78eb94b9b3cc
📒 Files selected for processing (5)
src/jsc/bindings/napi.cppsrc/jsc/bindings/napi.htest/napi/napi-app/binding.gyptest/napi/napi-app/test_finalizer_create_error.ctest/napi/napi.test.ts
|
Found 5 issues this PR may fix:
🤖 Generated with Claude Code |
|
This PR may be a duplicate of:
🤖 Generated with Claude Code |
Per CLAUDE.md: 'Never write tests that check for no panic or uncaught exception in the test output. These tests will never fail in CI.' The positive assertions (stdout == 'create_error_status=0' and exitCode == 0) already fail decisively if the finalizer panics -- stdout is empty on a crash. Also added stderr == '' as a positive absence-of-diagnostic check. Addresses CodeRabbit review on #30291.
There was a problem hiding this comment.
Thanks for addressing the test-assertion nit. No bugs found, but this changes JSC exception-scope handling in the NAPI cleanup path (dropping the throw scope in createErrorWithNapiValues and clearing VM exceptions between finalizers), so it'd be good to have someone with JSC internals context confirm the scope semantics before merge.
Extended reasoning...
Overview
This PR touches two C++ files in the NAPI bindings (src/jsc/bindings/napi.cpp, src/jsc/bindings/napi.h) plus a new C test addon, a binding.gyp target, and a regression test in napi.test.ts. The functional changes are: (1) createErrorWithNapiValues drops its DECLARE_THROW_SCOPE and three RETURN_IF_EXCEPTION guards so napi_create_error succeeds even with a pending VM exception, matching Node.js; and (2) NapiEnv::cleanup() now calls a new clearExceptionsBetweenFinalizers() helper (which uses DECLARE_TOP_EXCEPTION_SCOPE(vm).clearException() and clears m_pendingException) after each finalizer and after the instance-data finalizer.
Security risks
None apparent — this is internal exception-state hygiene during env teardown, not an auth/input-validation surface.
Level of scrutiny
High. This is production-critical native-bindings code on the worker/process teardown path, and the change alters JSC exception-scope semantics. Two specific points deserve a JSC-aware reviewer's eye: (a) the claim that JSValue::getString(globalObject) on a value already verified isString() can never throw (rope resolution / OOM edge cases) and that createErrorWithCode is safe without a surrounding throw scope under EXCEPTION_SCOPE_VERIFICATION; and (b) that DECLARE_TOP_EXCEPTION_SCOPE(m_vm).clearException() on a temporary is the right idiom here vs. a longer-lived scope or a catch scope. The reasoning in the PR description is thorough and plausible, but these are exactly the kind of invariants that are easy to get subtly wrong.
Other factors
The PR includes a targeted regression test that exercises both layers of the fix, and the author already addressed my earlier style nit (commit eddff53 swapped the negative panic-string assertions for expect(stderr).toBe("")). The PR also claims to close several long-standing crash reports (#19917, #26081, #28557), which raises the impact of getting the fix right. No CODEOWNERS cover these files. Given the subtlety of JSC exception-scope bookkeeping and the breadth of the claimed fixes, I'm deferring rather than approving.
|
Also fixes #30431 — the macOS shutdown crash there ( |
|
@dylan-conway @Jarred-Sumner this PR needs a maintainer to unblock CI and merge. The only failing check is "Lint JavaScript" — 3 oxlint Could someone please re-run ? |
Tree-sitter inside node:worker_threads deterministically panicked at worker exit once ~15 finalizers had run: panic: NAPI FATAL ERROR: Error::New napi_create_error at NapiEnv::cleanup -> NapiRef::callFinalizer -> NapiFinalizer::call Two things were broken. First, NapiEnv::cleanup ran each finalizer without clearing any pending exception after it returned, so a throw from one finalizer leaked into the next. Node.js doesn't do that -- finalizers run with no JS frame above them to catch, so propagating the exception is pointless and only confuses later finalizers. Second, createErrorWithNapiValues refused to produce an Error value while a VM-level exception was pending (DECLARE_THROW_SCOPE + RETURN_IF_EXCEPTION at the top). Node.js's napi_create_error has no such check -- it is a pure value-producing function. Because napi_is_exception_pending deliberately skips the VM-scope check during cleanup for safety, node-addon-api's Error::New(env) saw 'no pending exception', fell through to napi_create_error, got back napi_pending_exception, and tripped NAPI_FATAL_IF_FAILED -> napi_fatal_error. Fix both layers: clear env/VM exceptions after each finalizer in NapiEnv::cleanup, and drop the pending-exception check from createErrorWithNapiValues. Matches Node.js on both counts. Regression test: a standalone addon wraps two JS objects; the first finalizer calls a throwing JS function via napi_call_function (leaving the JSC VM exception pending without clearing it), the second calls napi_create_error and prints the returned status. Before the fix the status is 10 (napi_pending_exception) and the debug ASAN build aborts on the leaked exception first; after the fix the status is 0 (napi_ok) and the process exits cleanly. Closes #30286. Related: #22259, #28248, #29785.
Per CLAUDE.md: 'Never write tests that check for no panic or uncaught exception in the test output. These tests will never fail in CI.' The positive assertions (stdout == 'create_error_status=0' and exitCode == 0) already fail decisively if the finalizer panics -- stdout is empty on a crash. Also added stderr == '' as a positive absence-of-diagnostic check. Addresses CodeRabbit review on #30291.
Declared inline in napi.h it was getting instantiated in every TU that pulled in napi_external.h. On Windows aarch64 (build 51887's only real failure), the TopExceptionScope ctor/dtor are JS_EXPORT_PRIVATE under ENABLE_EXCEPTION_SCOPE_VERIFICATION; that broke build-cpp. Confining the body to napi.cpp means TopExceptionScope is touched from exactly one TU, matching how every other caller (ErrorStackTrace.cpp, NodeVMModule.cpp, MessagePort.cpp, ...) uses it today.
9cacea6 to
720f084
Compare
|
CI note: build #59847 failed only on `debian-13-x64-test-bun` (1 shard of 17) — the failing test is `napi_get_value_string_utf8 with buffer > copies auto len`, which passes `424242` as the buffer size and trips a debug-mode Rust UB check (`slice::from_raw_parts_mut requires the pointer to be aligned and non-null`). This is pre-existing on main and unrelated to this PR: I reproduced it on a clean checkout of The PR diff itself is green. This needs a maintainer to merge (the unrelated pre-existing failure will clear on its own or needs a separate fix). |
Wire the OpenCode 'server.instance.disposed' SDK event to stop this instance's process-resident resources (RPC server, dream-schedule timer registration, auto-update abort controller). Desktop runs multiple plugin instances in one process, each disposed independently, so cleanup is gated on the disposed directory resolving to our own project identity — never tears down a sibling instance's still-live resources. Also fixes a pre-existing leak: the dream-timer registration cleanup was never captured. Deliberately does NOT dispose the native ONNX embedding session here — forcing onnxruntime-node's destructor on teardown makes the Bun N-API exit crash (#95) worse, not better (tracked upstream at oven-sh/bun#30291); the OS reclaims that memory on exit. Pi keeps its own session_shutdown cleanup (no server.instance. disposed event in Pi) — documented divergence. Co-authored-by: Alfonso [Magic Context] <288211368+alfonso-magic-context@users.noreply.github.com>
|
Additional confirmation that this still affects Bun 1.3.14 with native N-API modules in workers, using Minimal repro: // worker-canvas.mjs
import { createRequire } from "node:module";
const require = createRequire(import.meta.url);
const { createCanvas } = require("canvas");
const canvas = createCanvas(16, 16);
canvas.getContext("2d").fillRect(0, 0, 16, 16);
postMessage({ ok: true, bytes: canvas.toBuffer("image/png").length });// main-worker-canvas.mjs
const worker = new Worker(new URL("./worker-canvas.mjs", import.meta.url).href, { type: "module" });
worker.onmessage = event => {
console.log(JSON.stringify(event.data));
worker.terminate();
};
worker.onerror = event => {
console.error("worker error", event.message || event);
worker.terminate();
process.exit(1);
};npm i canvas@3.2.1
bun main-worker-canvas.mjsObserved:
If the worker instead receives a message and calls |
A native teardown callback can leave a pending VM exception before the first wrap finalizer runs: a terminated Worker's teardown promotes a scheduled exception onto the VM (napi_call_function's prologue calls throwPendingException before validating arguments), and nothing cleared it between the cleanup-hook phase and the finalizer phase. The first finalizer's first napi call then failed with napi_pending_exception and node-addon-api escalated to napi_fatal_error -- the node-canvas shape of #30286: panic: NAPI FATAL ERROR: Error::Error napi_create_object reported with canvas@3.2.1 + worker.terminate(). Clear pending exceptions at the start of cleanup and again after the cleanup-hook drain so every native teardown callback starts from a clean exception state, completing the invariant the between-finalizer clearing already established. Deterministic regression test: the addon registers a cleanup hook that schedules an exception via napi_throw_error and promotes it with napi_call_function, then a wrap finalizer probes napi_create_error. Fails without this commit (status -110 / abort), passes with it. Verified the canvas@3.2.1 worker repro from the issue thread exits 0.
drain() ran hooks back-to-back, so a hook that leaked a VM exception poisoned the next hook's napi calls. Apply the same per-callback clear the finalizer loop uses, making cleanup()'s documented invariant (every native teardown callback starts from a clean exception state) hold for hooks as well.
|
Thanks @basedcorp99 for drawing my attention back to this PR. @dylan-conway @Jarred-Sumner robobun says the PR itself is green -- would you please take a look? Cheers! |
|
Independent confirmation of this PR's analysis, from triaging the same Sentry crash (BUN-2ZQ9 / BUN-38NS, While verifying the fix against Node I found the same class extends past
Branch with those fixes plus tests comparing against Node for all of the above: https://github.com/oven-sh/bun/branches/all?query=farm/0a365ff8/napi-error-new-finalizers ( |
|
@robobun I'm a bit maxxed out ATM and my agent isn't as good as you are. I'll take the W on the headline fix and give you the pleasure of committing your own work :) |
|
@dylan-conway @Jarred-Sumner could you please unleash robobun to merge this? @robobun said:
|
|
Appreciate the vote of confidence, @HaleTom! I don't have merge rights on this repo (no maintainer/admin bit), and the PR currently shows Everything on my side is ready: diff is green, main is merged in, all review threads resolved, regression tests + the |
Closes #30286.
Fixes #19917.
Fixes #26081.
Fixes #28557.
Repro
gitnexus analyzeon a synthetic repo with 15 trivial Java files reliably panicked at worker exit:Cause
Two interacting bugs:
NapiEnv::cleanupdid not clear pending exceptions between finalizers. If finalizer N leaked a JSC VM exception (a common thing when a finalizer calls into JS vianapi_call_function, or internally hits an allocation failure), finalizer N+1 inherited it. Node.js does the opposite — finalizers run with no JS frame above them to catch, so each one starts from a clean exception state.createErrorWithNapiValuesrefused to run with a pending VM exception. Node.js'snapi_create_erroris a pure value-producing function with no such check. Because Bun'snapi_is_exception_pendingdeliberately skips the VM-scope check during cleanup, node-addon-api'sError::New(env)helper saw "no pending exception", took the "create a new error" branch, got backnapi_pending_exceptionfromcreateErrorWithNapiValues, andNAPI_FATAL_IF_FAILEDfirednapi_fatal_error.Layer 1 also surfaces as
panic: A C++ exception occurredduring process teardown: a finalizer's leaked exception unwinds out ofNapiEnv::cleanup→napi_internal_cleanup_env_cpp→VirtualMachine.onExit/globalExitand hitsstd::set_terminate. That's the stack on #19917, #26081, and #28557 — all NAPI modules (@huggingface/transformers/ onnxruntime-node, bun:sqlite).Fix
NapiEnv::cleanup: clear VM and env pending exceptions at the start of cleanup, after the cleanup-hook drain, after each finalizer, and after the instance-data finalizer via a newclearExceptionsBetweenFinalizershelper. UsesDECLARE_TOP_EXCEPTION_SCOPE(vm).clearException()so the EXCEPTION_SCOPE_VERIFICATION bookkeeping gets reset too — leaving it uncleared would trip debug asserts in the next finalizer's napi calls. The start/after-drain clears cover the case where the exception predates the finalizer phase (a cleanup hook or a terminated Worker's teardown promoted a scheduled exception onto the VM), which is the node-canvas shape reported in the issue thread (panic: NAPI FATAL ERROR: Error::Error napi_create_objectwithcanvas@3.2.1+worker.terminate()).createErrorWithNapiValues: drop theDECLARE_THROW_SCOPE+RETURN_IF_EXCEPTIONat entry. The string inputs are already validated as strings beforegetString()is called, so nothing in the body can throw.Verification
New
test/napi/napi-app/test_finalizer_create_error.caddon, driven bytest/napi/napi.test.ts:napi_call_function, leaves the VM exception pending.napi_create_errorand prints the returned status.Before the patch: status is 10 (
napi_pending_exception) and debug ASAN aborts on the leaked exception first. After the patch: status is 0 (napi_ok) and the process exits 0.A second test covers the pre-finalizer-phase leak: the addon registers a cleanup hook that schedules an exception via
napi_throw_errorand promotes it onto the VM throughnapi_call_function's prologue, then the wrap finalizer probesnapi_create_error. Fails without the start/after-drain clears, passes with them.Gate:
bun bd test test/napi/napi.test.ts -t "30286"passes with the fix, fails (either an ASAN assertion abort on debug, or a non-zero status on release) without it.Also verified the
canvas@3.2.1+worker.terminate()repro from the issue thread: panics on the unfixed build, exits 0 with this PR.Related
napi_create_errorcrash with@napi-rs/canvas.Error::New napi_get_last_error_info) — similar signature but the failing function has no exception check in Bun; more likely the env-UAF class that napi: run wrap finalizers in LIFO order during env teardown #29324 fixed.npx bun testcrash #21068 — sharespanic: A C++ exception occurredtext but the crash happens during module load (CPython embedding), not during teardown; not a NAPI-cleanup path.