QVAC-16526 feat[api]: pre-terminate cleanup hook + stabilise mobile smoke #1797
Lets a client request a clean addon teardown before tearing the bare runtime down, so addon static state (e.g. js_ref_t handles into the worker V8 isolate) is released while that env is still alive. Without this, tearing down a runtime whose addons retain isolate-bound refs trips a V8 GlobalHandles assertion (brk 0 / SIGTRAP) inside the next runtime that re-imports the same .bare files in the same OS process. The JsLogger.setLogger path in qvac-lib-inference-addon-cpp is the reproducer (every addon that links it has the same retention).
- worker-core.ts: extract the existing shutdown body into a reusable cleanupForTerminate() that runs the same registry / model / resource cleanup but skips releaseWorkerLock() and process.exit(). The full shutdownBareDirectWorker still runs both for desktop signal and exit paths.
- handler-utils.ts + handle-request.ts: new internal __shutdown__ message dispatched alongside __init_config. Bypasses the schema, awaits cleanupForTerminate(), and replies success. Lazy-imports the worker-core function to break the handler-utils -> worker-core -> create-server -> handle-request import cycle.
- bare-client.ts: mirror the message in the in-process mock RPC for desktop direct-mode (Pear-style) consumers.
- expo-rpc-client.ts: close() is now async; sends __shutdown__ over RPC and awaits the success reply (with a 10s timeout safety) before calling worklet.terminate(). Best-effort: timeouts log a warning and proceed with terminate. The auto-close path in unload-model.ts already awaits close(), so this is non-breaking for that caller.
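The close() sequence described above (send __shutdown__, wait for the reply with a safety timeout, then terminate regardless) can be sketched roughly as follows. This is an illustration only, not the SDK's code: RpcLike, WorkletLike, withTimeout and closeWithCleanup are hypothetical names, and the message shape { type: "__shutdown__" } is an assumption.

```typescript
// Hypothetical shapes; the real RPC and Worklet types live in the SDK.
interface RpcLike {
  request(message: { type: string }): Promise<{ success: boolean }>;
}
interface WorkletLike {
  terminate(): void;
}

const SHUTDOWN_TIMEOUT_MS = 10000; // the 10s safety timeout from the description

// Resolves to "timeout" if the promise has not settled within `ms`.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T | "timeout"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), ms);
  });
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Best-effort pre-terminate cleanup: a timeout or error only logs a warning,
// and terminate() always runs afterwards.
async function closeWithCleanup(
  rpc: RpcLike,
  worklet: WorkletLike,
  timeoutMs: number = SHUTDOWN_TIMEOUT_MS,
  warn: (message: string) => void = console.warn,
): Promise<void> {
  try {
    const reply = await withTimeout(rpc.request({ type: "__shutdown__" }), timeoutMs);
    if (reply === "timeout") {
      warn("__shutdown__ timed out; proceeding with terminate");
    }
  } catch (err) {
    warn("__shutdown__ failed: " + String(err));
  } finally {
    worklet.terminate();
  }
}
```

Putting terminate() in the finally block is one way to express the "best-effort" wording: the cleanup request can fail or hang without ever blocking the teardown itself.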
… settle
Two related fixes that together let the mobile smoke run progress
past the "previous heavy model still resident" memory ceiling:
- resource-lifecycle: tests with dependency:none used to skip
evictExcept and leave whatever was loaded by the previous test
resident. Now treated as evictExcept([]), so a heavy model from
the prior test gets unloaded before the next one starts allocating.
Empirically this is what kept tripping sharded-model-load right
after translation-afriquegemma-sw-en (afriquegemma 4B leaves ~550 MB
resident; sharded then asks for multi-GB on top and hits the iOS
memory limit).
- resource-manager: new ResourceManager({ unloadSettleMs }) option
that sleeps for the configured duration after a successful
unloadModel (only on success — failure path returns immediately).
Lets the kernel release pages before the next load starts allocating.
Defaults to 0 (off, desktop is fine without it). Mobile consumer
opts in to 100ms.
Mobile consumer also picks up SkipExecutor entries for the
lifecycle-suspend tests; suspend hangs the runner indefinitely on
mobile because the lifecycle coordinator pauses MQTT and never
resumes within the test timeout.
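As a rough sketch of the two changes above (dependency:none mapped to evictExcept([]), and a settle delay after a successful unload), assuming a much simpler resource manager than the real one. SettlingResourceManager, markLoaded, keepListFor and the UnloadFn callback are hypothetical names for illustration; only unloadSettleMs, unloadModel and evictExcept come from the commit message.

```typescript
// Hypothetical unload callback; the real manager talks to the SDK.
type UnloadFn = (modelId: string) => Promise<void>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

class SettlingResourceManager {
  private readonly loaded = new Set<string>();
  private readonly unload: UnloadFn;
  private readonly unloadSettleMs: number;

  constructor(unload: UnloadFn, options: { unloadSettleMs?: number } = {}) {
    this.unload = unload;
    this.unloadSettleMs = options.unloadSettleMs ?? 0; // 0 means no settle delay
  }

  markLoaded(modelId: string): void {
    this.loaded.add(modelId);
  }

  // Returns true on success. The settle sleep happens only on success;
  // the failure path returns immediately.
  async unloadModel(modelId: string): Promise<boolean> {
    try {
      await this.unload(modelId);
    } catch {
      return false;
    }
    this.loaded.delete(modelId);
    if (this.unloadSettleMs > 0) {
      await sleep(this.unloadSettleMs);
    }
    return true;
  }

  // Unloads every loaded model that is not in `keep`.
  async evictExcept(keep: string[]): Promise<void> {
    for (const modelId of Array.from(this.loaded)) {
      if (!keep.includes(modelId)) {
        await this.unloadModel(modelId);
      }
    }
  }
}

// dependency "none" now means "keep nothing" rather than "skip eviction".
function keepListFor(dependency: "none" | string[]): string[] {
  return dependency === "none" ? [] : dependency;
}
```

A mobile consumer would construct the manager with { unloadSettleMs: 100 }, while desktop would leave the option unset.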
Picks up:
- in-app memory poller in mobile-consumer template
- desktop in-app memory poller (process-tree RSS)
- Memory tab + per-test memory metrics in HTML/JSON reports
- bucket results by metadata.category instead of testId-prefix split
Required by the eviction / settle work in this PR; both depend on the new MemorySummary fields and the corrected category bucketing.
cleanupForTerminate previously set the same isShuttingDown flag that shutdownBareDirectWorker uses as its early-return guard. After a __shutdown__ message ran the pre-terminate cleanup, a subsequent SIGTERM / SIGINT / uncaught-exception in desktop direct mode would early-return at the guard and skip releaseWorkerLock() + process.exit(). Result: lock file leak and no graceful exit.
Mobile is unaffected because each Worklet has its own module instance (fresh isShuttingDown per worklet). The bug only bites the bare-client mock-RPC path (Pear-style consumers where the worker shares the host process for its lifetime).
Two flags now:
- cleanupRan: idempotent guard around runCleanup body
- isShuttingDown: only set by shutdownBareDirectWorker; cleanupForTerminate must NOT set it
shutdownBareDirectWorker still calls runCleanup, which is now a no-op when cleanupRan is already true.
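The two-flag split can be illustrated with a small sketch. The function and flag names follow the commit message, but the WorkerHooks interface and the createShutdown factory are hypothetical stand-ins (the real worker-core.ts uses module-level state and calls process.exit() directly).

```typescript
// Hypothetical injection points standing in for the real cleanup work.
interface WorkerHooks {
  runClosers(): Promise<void>; // registry / model / resource cleanup
  releaseWorkerLock(): void;
  exit(code: number): void; // stands in for process.exit
}

function createShutdown(hooks: WorkerHooks) {
  let cleanupRan = false; // idempotent guard around the cleanup body
  let isShuttingDown = false; // set only by the full shutdown path

  async function runCleanup(): Promise<void> {
    if (cleanupRan) return;
    cleanupRan = true;
    await hooks.runClosers();
  }

  // Pre-terminate path: cleanup only. It must not touch isShuttingDown,
  // otherwise a later signal would early-return and leak the lock.
  async function cleanupForTerminate(): Promise<void> {
    await runCleanup();
  }

  // Signal / exit path: cleanup (a no-op if it already ran), then
  // release the lock and exit.
  async function shutdownBareDirectWorker(): Promise<void> {
    if (isShuttingDown) return;
    isShuttingDown = true;
    await runCleanup();
    hooks.releaseWorkerLock();
    hooks.exit(0);
  }

  return { cleanupForTerminate, shutdownBareDirectWorker };
}
```

With a single shared flag, the second call in the sequence cleanupForTerminate() then shutdownBareDirectWorker() would return at the guard; with two flags it still reaches releaseWorkerLock() and exit().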
…_ races
If two callers race close() (or one calls close() while another getRPC() is mid-flight), the second sees rpcInstance still set, fires a redundant __shutdown__, then re-enters the terminate block on already-null state. Wrap the body in a singleton closingPromise; concurrent callers share the same in-flight close. Reset to null in finally so a fresh worker brought up later can be cleanly closed again. The auto-close path in unload-model.ts is naturally serialised today so this is robustness rather than fixing an active bug, but the cost is minimal and the failure mode (double __shutdown__ after terminate) is annoying to diagnose.
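A minimal sketch of the singleton closingPromise pattern described above. createSerialisedClose and doClose are hypothetical names; in the real client the close body is inline rather than passed in.

```typescript
// Wraps an async close body so that concurrent callers share one
// in-flight close instead of each running the body.
function createSerialisedClose(doClose: () => Promise<void>): () => Promise<void> {
  let closingPromise: Promise<void> | null = null;

  return function close(): Promise<void> {
    if (closingPromise !== null) {
      return closingPromise; // a close is already in flight, share it
    }
    const inFlight = (async () => {
      try {
        await doClose();
      } finally {
        // Reset so a worker brought up later can be closed again.
        closingPromise = null;
      }
    })();
    closingPromise = inFlight;
    return inFlight;
  };
}
```

Because the reset happens in finally, a failed close does not wedge the client: the next close() call starts a fresh attempt.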
Worklet.terminate() crashes on Android: addon dlclose unmaps the lib but pthread_key_t destructors registered by some addons (likely rocksdb-native, libbare-tls, libbare-crypto) are never pthread_key_delete'd before unload, so libc's per-thread cleanup table points at unmapped memory and the next pthread_exit SIGSEGVs in pthread_key_clean_all().
iOS dyld no-ops dlclose for already-loaded third-party libs, so the dangling-destructor problem cannot manifest there. The terminate path stays enabled on iOS.
On non-iOS, fall back to the legacy refs-only close: drop rpcInstance and rpcPromise, leave workletInstance + workletInitialized intact so the next getRPC() reuses the live worklet. Skip the __shutdown__ roundtrip too -- it would clear the worker plugin registry without a follow-up terminate, leaving the worker unusable for subsequent loadModel.
Trade-off: Android tests no longer recover memory between heavy tests the way iOS now does, so memory accumulates across the smoke run. On Pixel-class devices (8+ GB RAM) this is fine; smaller-RAM Android devices may regress vs the pre-PR baseline. Acceptable until the upstream holepunchto/bare exposes a per-addon unload hook.
Platform is resolved via the existing getRuntimeContext() path (getDeviceInfo handles a missing expo-device safely via dynamic import + try/catch), so no new react-native imports are added.
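The platform split can be sketched as below. The field names mirror the commit message, but ClientState, closeForPlatform, the injected getPlatform / sendShutdown callbacks and the "ios" platform string are assumptions for illustration; the safety timeout around __shutdown__ is omitted for brevity.

```typescript
// Hypothetical view of the client's module-level state.
interface ClientState {
  rpcInstance: object | null;
  rpcPromise: Promise<object> | null;
  workletInstance: { terminate(): void } | null;
  workletInitialized: boolean;
}

async function closeForPlatform(
  state: ClientState,
  getPlatform: () => Promise<string | undefined>,
  sendShutdown: () => Promise<void>,
): Promise<"refs-only" | "terminated"> {
  let platform: string | undefined;
  try {
    platform = await getPlatform();
  } catch {
    platform = undefined; // unknown platform is treated like non-iOS
  }

  if (platform !== "ios") {
    // Legacy refs-only close: drop the RPC refs, keep the live worklet so
    // the next getRPC() reuses it. No __shutdown__, since that would clear
    // the worker's plugin registry without a follow-up terminate.
    state.rpcInstance = null;
    state.rpcPromise = null;
    return "refs-only";
  }

  // iOS: pre-terminate cleanup first, then tear the worklet down.
  await sendShutdown();
  if (state.workletInstance !== null) {
    state.workletInstance.terminate();
  }
  state.workletInstance = null;
  state.workletInitialized = false;
  state.rpcInstance = null;
  state.rpcPromise = null;
  return "terminated";
}
```

In this sketch a failed platform lookup falls through to the refs-only path even on iOS, which is the conservative default: it can cost memory but cannot trigger the terminate crash.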
The test reliably times out on mobile (Android Pixel 10 Pro hit the 600s timeout in the latest smoke run). Test framework drops the await on timeout but the underlying streaming inference keeps running on the Bare worker side, leaving the diffusion model "in use" from the runtime's perspective. Knock-on effect: any later test whose modelSetup needs to evict diffusion (e.g. wrong-model-transcribe-on-llm via ResourceManager.evictExcept) blocks indefinitely waiting for the stream to finish. Observed in local-android-smoke: 86/88 tests completed, then the runner stuck for 50+ minutes inside the eviction of diffusion at test 86's setup. Skipping unblocks the smoke run end-to-end. The proper fixes (framework-side cancel-on-timeout, resource-manager bounded waits) are tracked separately.
simon-iribarren
left a comment
LGTM on the SDK changes. Deep-dove the requested key commit e9954bf and its three follow-ups — diagnosis is correct (addon js_ref_t static state surviving across worklets in the same iOS process trips the V8 GlobalHandles::Destroy assertion / EXC_BREAKPOINT in JsLogger::releaseJsRefs), and the fix is well-structured.
What I checked:
- runCleanup() body matches the original shutdownBareDirectWorker cleanup verbatim (same clearRegistries() + Promise.allSettled of the same five closers) → no regression for desktop signal/exit.
- __shutdown__ schema bypass + lazy-import in handleShutdown to break handler-utils → worker-core → create-server → handle-request cycle is correct; double-underscore prefix matches existing __init_config convention.
- expo-rpc-client.close() widening from () => void to () => Promise<void> is non-breaking: verified all callers in packages/sdk (auto-close at unload-model.ts:50 already awaits; desktop void close() callers route to bare-client.ts / node-rpc-client.ts which were already async).
- Follow-ups are real bug fixes for issues introduced in e9954bf, not polish: 4cfc8ec2 splits cleanupRan from isShuttingDown so a later SIGTERM still releases the worker lock + exits; 4ba4137b serialises close() against double-__shutdown__ races. Both worth merging.
- b795d74d's non-iOS skip is honestly disclosed: the Android pthread_key_clean_all SIGSEGV root cause (addons register pthread_key_t destructors but never pthread_key_delete them, so dlclose unmaps the lib while libc's per-thread cleanup table still points at it) matches what the tombstone shows.
One small thing worth a sanity check (non-blocking): in b795d74d's iOS branch, (await getRuntimeContext()).platform is read inside the closingPromise body, and the try/catch around it swallows transient failures into platform = undefined, which falls through to the refs-only path even on iOS. Probably fine because runtime context is cached after first call, but if you'd rather default iOS to "yes terminate" on a transient failure, the catch could special-case that. Up to you.
CI failure on check (sdk) is a transient bun install tarball error for react-devtools-core — not code-related, clears on rebase. Approving once the test-suite 0.6.2 release lands and you push a rebase.
…moke (#1797)
* feat: add pre-terminate cleanup signal for SDK clients
* test: stabilise mobile smoke run via eviction-on-none and post-unload settle
* chore: bump qvac-test-suite to ^0.6.2
* fix: split cleanupRan and isShuttingDown so shutdown still releases lock
* fix: serialise expo-rpc-client.close() to avoid duplicate __shutdown__ races
* fix: skip Worklet.terminate() on non-iOS platforms
* test: skip diffusion-streaming-progress on mobile
🎯 What problem does this PR solve?
- … EXC_BREAKPOINT/SIGTRAP) the moment the SDK auto-closed the bare runtime between heavy tests. Native addons retain js_ref_t handles into the worker V8 isolate; tearing the isolate down without releasing those refs trips a V8 GlobalHandles::Destroy assertion the next time a fresh worker re-imports the same .bare files in the same iOS process.
- … sharded-model-load because tests with dependency: "none" skipped eviction and let the previous test's heavy model linger.
📝 How does it solve it?
- … __shutdown__ RPC message). Lets the client tell the worker to release addon-bound state (releaseLogger on every plugin via clearPlugins, unloadAllModels, registry close) while the JS env is still alive. expo-rpc-client.close() now sends __shutdown__ and awaits ack (10s safety timeout) before calling Worklet.terminate(). Reusable cleanup body extracted from shutdownBareDirectWorker so desktop signal/exit paths stay unchanged. Mock RPC in bare-client.ts mirrors the message for direct-mode (Pear) consumers.
- dependency: "none" now triggers evictExcept([]) instead of skipping eviction entirely; previous test deps are dropped before the new test runs.
- ResourceManager({ unloadSettleMs }) option sleeps after a successful unloadModel so the kernel can release pages before the next load starts allocating. Default 0 (off, desktop is fine), mobile opts in to 100 ms.
- … lifecycle-suspend-* on mobile. suspend() hangs the runner indefinitely on mobile (lifecycle coordinator pauses MQTT and never resumes within the test timeout).
- … Worklet.terminate() on non-iOS. Android dlclose actually unmaps addon shared libs; some addons (likely rocksdb-native, libbare-tls, libbare-crypto) register pthread_key_t destructors but never pthread_key_delete them before unload, so the next pthread_exit SIGSEGVs in pthread_key_clean_all. iOS dyld no-ops dlclose so the problem can't manifest there. On non-iOS we fall back to the legacy refs-only close() and reuse the live worklet on the next getRPC(). Trade-off: Android no longer recovers memory between heavy tests, so it accumulates across the smoke run; lower-RAM Android devices may hit issues. Acceptable until holepunchto/bare exposes a per-addon unload hook.
🧪 How was it tested?
- run:local:ios on iPhone 17 (iOS 26.3.1): … model-load-concurrent) with EXC_BREAKPOINT in JsLogger::releaseJsRefs after Worklet.terminate().
- run:local:android on Pixel 10 Pro (Android 16): … model-load-embedding) with SIGSEGV in pthread_key_clean_all right after the first Worklet.terminate() (tombstone backtrace pulled and analysed; PC sat in a dlclose'd region with no surviving named mapping).
- run:local:desktop: unchanged behaviour confirmed (no unloadSettleMs, no __shutdown__ RPC roundtrip cost; desktop relies on kill SIGTERM of the spawned worker which tears state down via the kernel).
- … GlobalHandles crash signature in the iOS .ips crash report before the fix; absent after.
🔌 API Changes
- expo-rpc-client.ts::close() is now async (was () => void). The only caller in the SDK (unload-model.ts:50 auto-close path) already awaits it, so this is non-breaking for the SDK itself. Third-party callers that previously fired-and-forgot close() continue to work; the returned Promise<void> can be ignored.
- bare-client.ts::close() is already async.