QVAC-16526 feat[api]: pre-terminate cleanup hook + stabilise mobile smoke by lauripiisang · Pull Request #1797 · tetherto/qvac

lauripiisang · 2026-04-29T04:35:02Z

Note: be concise and prefer bullet points.

⛔ Blocked on tetherto/qvac-test-suite#68 merging + releasing @tetherto/qvac-test-suite@0.6.2. The tarball pointer is already bumped in tests-qvac/package.json so CI will fail until the upstream release publishes.

🎯 What problem does this PR solve?

Mobile (iOS) test runs crashed within ~5 tests (EXC_BREAKPOINT / SIGTRAP) the moment the SDK auto-closed the bare runtime between heavy tests. Native addons retain js_ref_t handles into the worker V8 isolate; tearing the isolate down without releasing those refs trips a V8 GlobalHandles::Destroy assertion the next time a fresh worker re-imports the same .bare files in the same iOS process.
Even when the runtime survived, the smoke run hit a memory ceiling on sharded-model-load because tests with dependency: "none" skipped eviction and let the previous test heavy model linger.

📝 How does it solve it?

SDK pre-terminate cleanup signal (__shutdown__ RPC message). Lets the client tell the worker to release addon-bound state (releaseLogger on every plugin via clearPlugins, unloadAllModels, registry close) while the JS env is still alive. expo-rpc-client.close() now sends __shutdown__ and awaits ack (10s safety timeout) before calling Worklet.terminate(). Reusable cleanup body extracted from shutdownBareDirectWorker so desktop signal/exit paths stay unchanged. Mock RPC in bare-client.ts mirrors the message for direct-mode (Pear) consumers.
tests-qvac lifecycle fix. dependency: "none" now triggers evictExcept([]) instead of skipping eviction entirely; previous test deps are dropped before the new test runs.
tests-qvac post-unload settle. New ResourceManager({ unloadSettleMs }) option sleeps after a successful unloadModel so the kernel can release pages before the next load starts allocating. Default 0 (off, desktop is fine), mobile opts in to 100 ms.
Skip lifecycle-suspend-* on mobile. suspend() hangs the runner indefinitely on mobile (lifecycle coordinator pauses MQTT and never resumes within the test timeout).
Skip Worklet.terminate() on non-iOS. Android dlclose actually unmaps addon shared libs; some addons (likely rocksdb-native, libbare-tls, libbare-crypto) register pthread_key_t destructors but never pthread_key_delete them before unload, so the next pthread_exit SIGSEGVs in pthread_key_clean_all. iOS dyld no-ops dlclose so the problem can't manifest there. On non-iOS we fall back to the legacy refs-only close() and reuse the live worklet on the next getRPC(). Trade-off: Android no longer recovers memory between heavy tests, so it accumulates across the smoke run — lower-RAM Android devices may hit issues. Acceptable until holepunchto/bare exposes a per-addon unload hook.

🧪 How was it tested?

run:local:ios on iPhone 17 (iOS 26.3.1):
- Before: crashed at test 5 (model-load-concurrent) with EXC_BREAKPOINT in JsLogger::releaseJsRefs after Worklet.terminate().
- After: full 88-test smoke runs to completion. Hundreds of clean worklet termination cycles, no V8 GlobalHandle crashes.
- Memory tab in the report shows expected per-test peaks; baseline returns to ~250 MB band between heavy categories.
run:local:android on Pixel 10 Pro (Android 16):
- Before the skip: crashed at test 2 (model-load-embedding) with SIGSEGV in pthread_key_clean_all right after the first Worklet.terminate() (tombstone backtrace pulled and analysed; PC sat in a dlclose'd region with no surviving named mapping).
- After the skip: smoke run progresses past the previous failure point; runner reuses the live worklet between tests as before this PR.
run:local:desktop: unchanged behaviour confirmed (no unloadSettleMs, no __shutdown__ RPC roundtrip cost — desktop relies on kill SIGTERM of the spawned worker which tears state down via the kernel).
Verified the JS GlobalHandles crash signature in the iOS .ips crash report before the fix; absent after.

🔌 API Changes

expo-rpc-client.ts::close() is now async (was () => void). The only caller in the SDK (unload-model.ts:50 auto-close path) already awaits it, so this is non-breaking for the SDK itself. Third-party callers that previously fired-and-forgot close() continue to work; the returned Promise<void> can be ignored. bare-client.ts::close() is already async.

// Mobile auto-close path (unchanged from caller perspective):
await close(); // now awaits worker cleanup ack before terminating worklet

Lets a client request a clean addon teardown before tearing the bare runtime down, so addon static state (e.g. js_ref_t handles into the worker V8 isolate) is released while that env is still alive. Without this, tearing down a runtime whose addons retain isolate-bound refs trips a V8 GlobalHandles assertion (brk 0 / SIGTRAP) inside the next runtime that re-imports the same .bare files in the same OS process — the JsLogger.setLogger path in qvac-lib-inference-addon-cpp is the reproducer (every addon that links it has the same retention). - worker-core.ts: extract the existing shutdown body into a reusable cleanupForTerminate() that runs the same registry / model / resource cleanup but skips releaseWorkerLock() and process.exit(). The full shutdownBareDirectWorker still runs both for desktop signal and exit paths. - handler-utils.ts + handle-request.ts: new internal __shutdown__ message dispatched alongside __init_config. Bypasses the schema, awaits cleanupForTerminate(), and replies success. Lazy-imports the worker-core function to break the handler-utils -> worker-core -> create-server -> handle-request import cycle. - bare-client.ts: mirror the message in the in-process mock RPC for desktop direct-mode (Pear-style) consumers. - expo-rpc-client.ts: close() is now async; sends __shutdown__ over RPC and awaits the success reply (with a 10s timeout safety) before calling worklet.terminate(). Best-effort: timeouts log a warning and proceed with terminate. The auto-close path in unload-model.ts already awaits close(), so this is non-breaking for that caller.

… settle Two related fixes that together let the mobile smoke run progress past the "previous heavy model still resident" memory ceiling: - resource-lifecycle: tests with dependency:none used to skip evictExcept and leave whatever was loaded by the previous test resident. Now treated as evictExcept([]), so a heavy model from the prior test gets unloaded before the next one starts allocating. Empirically this is what kept tripping sharded-model-load right after translation-afriquegemma-sw-en (afriquegemma 4B leaves ~550 MB resident; sharded then asks for multi-GB on top and hits the iOS memory limit). - resource-manager: new ResourceManager({ unloadSettleMs }) option that sleeps for the configured duration after a successful unloadModel (only on success — failure path returns immediately). Lets the kernel release pages before the next load starts allocating. Defaults to 0 (off, desktop is fine without it). Mobile consumer opts in to 100ms. Mobile consumer also picks up SkipExecutor entries for the lifecycle-suspend tests; suspend hangs the runner indefinitely on mobile because the lifecycle coordinator pauses MQTT and never resumes within the test timeout.

Picks up: - in-app memory poller in mobile-consumer template - desktop in-app memory poller (process-tree RSS) - Memory tab + per-test memory metrics in HTML/JSON reports - bucket results by metadata.category instead of testId-prefix split Required by the eviction / settle work in this PR; both depend on the new MemorySummary fields and the corrected category bucketing.

github-actions · 2026-04-29T08:23:46Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (2/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

cleanupForTerminate previously set the same isShuttingDown flag that shutdownBareDirectWorker uses as its early-return guard. After a __shutdown__ message ran the pre-terminate cleanup, a subsequent SIGTERM / SIGINT / uncaught-exception in desktop direct mode would early-return at the guard and skip releaseWorkerLock() + process.exit(). Result: lock file leak and no graceful exit. Mobile is unaffected because each Worklet has its own module instance (fresh isShuttingDown per worklet). The bug only bites the bare-client mock-RPC path (Pear-style consumers where the worker shares the host process for its lifetime). Two flags now: - cleanupRan: idempotent guard around runCleanup body - isShuttingDown: only set by shutdownBareDirectWorker; cleanupForTerminate must NOT set it shutdownBareDirectWorker still calls runCleanup which is now a no-op when cleanupRan is already true.

…_ races If two callers race close() (or one calls close() while another getRPC() is mid-flight), the second sees rpcInstance still set, fires a redundant __shutdown__, then re-enters the terminate block on already-null state. Wrap the body in a singleton closingPromise; concurrent callers share the same in-flight close. Reset to null in finally so a fresh worker brought up later can be cleanly closed again. The auto-close path in unload-model.ts is naturally serialised today so this is robustness rather than fixing an active bug, but the cost is minimal and the failure mode (double __shutdown__ after terminate) is annoying to diagnose.

Worklet.terminate() crashes on Android: addon dlclose unmaps the lib but pthread_key_t destructors registered by some addons (likely rocksdb-native, libbare-tls, libbare-crypto) are never pthread_key_delete'd before unload, so libc's per-thread cleanup table points at unmapped memory and the next pthread_exit SIGSEGVs in pthread_key_clean_all(). iOS dyld no-ops dlclose for already-loaded third-party libs, so the dangling-destructor problem cannot manifest there. The terminate path stays enabled on iOS. On non-iOS, fall back to the legacy refs-only close: drop rpcInstance and rpcPromise, leave workletInstance + workletInitialized intact so the next getRPC() reuses the live worklet. Skip the __shutdown__ roundtrip too -- it would clear the worker plugin registry without a follow-up terminate, leaving the worker unusable for subsequent loadModel. Trade-off: Android tests no longer recover memory between heavy tests the way iOS now does, so memory accumulates across the smoke run. On Pixel-class devices (8+ GB RAM) this is fine; smaller-RAM Android devices may regress vs the pre-PR baseline. Acceptable until the upstream holepunchto/bare exposes a per-addon unload hook. Platform is resolved via the existing getRuntimeContext() path (getDeviceInfo handles a missing expo-device safely via dynamic import + try/catch), so no new react-native imports are added.

The test reliably times out on mobile (Android Pixel 10 Pro hit the 600s timeout in the latest smoke run). Test framework drops the await on timeout but the underlying streaming inference keeps running on the Bare worker side, leaving the diffusion model "in use" from the runtime's perspective. Knock-on effect: any later test whose modelSetup needs to evict diffusion (e.g. wrong-model-transcribe-on-llm via ResourceManager.evictExcept) blocks indefinitely waiting for the stream to finish. Observed in local-android-smoke: 86/88 tests completed, then the runner stuck for 50+ minutes inside the eviction of diffusion at test 86's setup. Skipping unblocks the smoke run end-to-end. The proper fixes (framework-side cancel-on-timeout, resource-manager bounded waits) are tracked separately.

simon-iribarren

LGTM on the SDK changes. Deep-dove the requested key commit e9954bf and its three follow-ups — diagnosis is correct (addon js_ref_t static state surviving across worklets in the same iOS process trips the V8 GlobalHandles::Destroy assertion / EXC_BREAKPOINT in JsLogger::releaseJsRefs), and the fix is well-structured.

What I checked:

runCleanup() body matches the original shutdownBareDirectWorker cleanup verbatim (same clearRegistries() + Promise.allSettled of the same five closers) → no regression for desktop signal/exit.
__shutdown__ schema bypass + lazy-import in handleShutdown to break handler-utils → worker-core → create-server → handle-request cycle is correct; double-underscore prefix matches existing __init_config convention.
expo-rpc-client.close() widening from () => void to () => Promise<void> is non-breaking — verified all callers in packages/sdk (auto-close at unload-model.ts:50 already awaits; desktop void close() callers route to bare-client.ts/node-rpc-client.ts which were already async).
Follow-ups are real bug fixes for issues introduced in e9954bf, not polish: 4cfc8ec2 splits cleanupRan from isShuttingDown so a later SIGTERM still releases the worker lock + exits; 4ba4137b serialises close() against double-__shutdown__ races. Both worth merging.
b795d74d's non-iOS skip is honestly disclosed — the Android pthread_key_clean_all SIGSEGV root cause (addons register pthread_key_t destructors but never pthread_key_delete them, so dlclose unmaps the lib while libc's per-thread cleanup table still points at it) matches what the tombstone shows.

One small thing worth a sanity check (non-blocking): in b795d74d's iOS branch, (await getRuntimeContext()).platform is read inside the closingPromise body, and the try/catch around it swallows transient failures into platform = undefined, which falls through to the refs-only path even on iOS. Probably fine because runtime context is cached after first call, but if you'd rather default iOS to "yes terminate" on a transient failure, the catch could special-case that. Up to you.

CI failure on check (sdk) is a transient bun install tarball error for react-devtools-core — not code-related, clears on rebase. Approving once the test-suite 0.6.2 release lands and you push a rebase.

lauripiisang · 2026-04-30T07:46:13Z

/review

…moke (#1797) * feat: add pre-terminate cleanup signal for SDK clients Lets a client request a clean addon teardown before tearing the bare runtime down, so addon static state (e.g. js_ref_t handles into the worker V8 isolate) is released while that env is still alive. Without this, tearing down a runtime whose addons retain isolate-bound refs trips a V8 GlobalHandles assertion (brk 0 / SIGTRAP) inside the next runtime that re-imports the same .bare files in the same OS process — the JsLogger.setLogger path in qvac-lib-inference-addon-cpp is the reproducer (every addon that links it has the same retention). - worker-core.ts: extract the existing shutdown body into a reusable cleanupForTerminate() that runs the same registry / model / resource cleanup but skips releaseWorkerLock() and process.exit(). The full shutdownBareDirectWorker still runs both for desktop signal and exit paths. - handler-utils.ts + handle-request.ts: new internal __shutdown__ message dispatched alongside __init_config. Bypasses the schema, awaits cleanupForTerminate(), and replies success. Lazy-imports the worker-core function to break the handler-utils -> worker-core -> create-server -> handle-request import cycle. - bare-client.ts: mirror the message in the in-process mock RPC for desktop direct-mode (Pear-style) consumers. - expo-rpc-client.ts: close() is now async; sends __shutdown__ over RPC and awaits the success reply (with a 10s timeout safety) before calling worklet.terminate(). Best-effort: timeouts log a warning and proceed with terminate. The auto-close path in unload-model.ts already awaits close(), so this is non-breaking for that caller. * test: stabilise mobile smoke run via eviction-on-none and post-unload settle Two related fixes that together let the mobile smoke run progress past the "previous heavy model still resident" memory ceiling: - resource-lifecycle: tests with dependency:none used to skip evictExcept and leave whatever was loaded by the previous test resident. Now treated as evictExcept([]), so a heavy model from the prior test gets unloaded before the next one starts allocating. Empirically this is what kept tripping sharded-model-load right after translation-afriquegemma-sw-en (afriquegemma 4B leaves ~550 MB resident; sharded then asks for multi-GB on top and hits the iOS memory limit). - resource-manager: new ResourceManager({ unloadSettleMs }) option that sleeps for the configured duration after a successful unloadModel (only on success — failure path returns immediately). Lets the kernel release pages before the next load starts allocating. Defaults to 0 (off, desktop is fine without it). Mobile consumer opts in to 100ms. Mobile consumer also picks up SkipExecutor entries for the lifecycle-suspend tests; suspend hangs the runner indefinitely on mobile because the lifecycle coordinator pauses MQTT and never resumes within the test timeout. * chore: bump qvac-test-suite to ^0.6.2 Picks up: - in-app memory poller in mobile-consumer template - desktop in-app memory poller (process-tree RSS) - Memory tab + per-test memory metrics in HTML/JSON reports - bucket results by metadata.category instead of testId-prefix split Required by the eviction / settle work in this PR; both depend on the new MemorySummary fields and the corrected category bucketing. * fix: split cleanupRan and isShuttingDown so shutdown still releases lock cleanupForTerminate previously set the same isShuttingDown flag that shutdownBareDirectWorker uses as its early-return guard. After a __shutdown__ message ran the pre-terminate cleanup, a subsequent SIGTERM / SIGINT / uncaught-exception in desktop direct mode would early-return at the guard and skip releaseWorkerLock() + process.exit(). Result: lock file leak and no graceful exit. Mobile is unaffected because each Worklet has its own module instance (fresh isShuttingDown per worklet). The bug only bites the bare-client mock-RPC path (Pear-style consumers where the worker shares the host process for its lifetime). Two flags now: - cleanupRan: idempotent guard around runCleanup body - isShuttingDown: only set by shutdownBareDirectWorker; cleanupForTerminate must NOT set it shutdownBareDirectWorker still calls runCleanup which is now a no-op when cleanupRan is already true. * fix: serialise expo-rpc-client.close() to avoid duplicate __shutdown__ races If two callers race close() (or one calls close() while another getRPC() is mid-flight), the second sees rpcInstance still set, fires a redundant __shutdown__, then re-enters the terminate block on already-null state. Wrap the body in a singleton closingPromise; concurrent callers share the same in-flight close. Reset to null in finally so a fresh worker brought up later can be cleanly closed again. The auto-close path in unload-model.ts is naturally serialised today so this is robustness rather than fixing an active bug, but the cost is minimal and the failure mode (double __shutdown__ after terminate) is annoying to diagnose. * fix: skip Worklet.terminate() on non-iOS platforms Worklet.terminate() crashes on Android: addon dlclose unmaps the lib but pthread_key_t destructors registered by some addons (likely rocksdb-native, libbare-tls, libbare-crypto) are never pthread_key_delete'd before unload, so libc's per-thread cleanup table points at unmapped memory and the next pthread_exit SIGSEGVs in pthread_key_clean_all(). iOS dyld no-ops dlclose for already-loaded third-party libs, so the dangling-destructor problem cannot manifest there. The terminate path stays enabled on iOS. On non-iOS, fall back to the legacy refs-only close: drop rpcInstance and rpcPromise, leave workletInstance + workletInitialized intact so the next getRPC() reuses the live worklet. Skip the __shutdown__ roundtrip too -- it would clear the worker plugin registry without a follow-up terminate, leaving the worker unusable for subsequent loadModel. Trade-off: Android tests no longer recover memory between heavy tests the way iOS now does, so memory accumulates across the smoke run. On Pixel-class devices (8+ GB RAM) this is fine; smaller-RAM Android devices may regress vs the pre-PR baseline. Acceptable until the upstream holepunchto/bare exposes a per-addon unload hook. Platform is resolved via the existing getRuntimeContext() path (getDeviceInfo handles a missing expo-device safely via dynamic import + try/catch), so no new react-native imports are added. * test: skip diffusion-streaming-progress on mobile The test reliably times out on mobile (Android Pixel 10 Pro hit the 600s timeout in the latest smoke run). Test framework drops the await on timeout but the underlying streaming inference keeps running on the Bare worker side, leaving the diffusion model "in use" from the runtime's perspective. Knock-on effect: any later test whose modelSetup needs to evict diffusion (e.g. wrong-model-transcribe-on-llm via ResourceManager.evictExcept) blocks indefinitely waiting for the stream to finish. Observed in local-android-smoke: 86/88 tests completed, then the runner stuck for 50+ minutes inside the eviction of diffusion at test 86's setup. Skipping unblocks the smoke run end-to-end. The proper fixes (framework-side cancel-on-timeout, resource-manager bounded waits) are tracked separately.

lauripiisang added 3 commits April 29, 2026 08:33

lauripiisang had a problem deploying to release April 29, 2026 04:35 — with GitHub Actions Failure

lauripiisang marked this pull request as ready for review April 29, 2026 05:00

lauripiisang requested review from a team as code owners April 29, 2026 05:00

opaninakuffo previously approved these changes Apr 29, 2026

View reviewed changes

lauripiisang added 2 commits April 29, 2026 12:48

lauripiisang dismissed opaninakuffo’s stale review via 4ba4137 April 29, 2026 08:48

lauripiisang had a problem deploying to release April 29, 2026 08:49 — with GitHub Actions Failure

NamelsKing previously approved these changes Apr 29, 2026

View reviewed changes

lauripiisang dismissed NamelsKing’s stale review via b795d74 April 29, 2026 14:27

lauripiisang had a problem deploying to release April 29, 2026 14:27 — with GitHub Actions Failure

lauripiisang had a problem deploying to release April 29, 2026 14:28 — with GitHub Actions Failure

lauripiisang had a problem deploying to release April 29, 2026 14:29 — with GitHub Actions Failure

lauripiisang had a problem deploying to release April 29, 2026 16:16 — with GitHub Actions Failure

simon-iribarren approved these changes Apr 29, 2026

View reviewed changes

opaninakuffo approved these changes Apr 29, 2026

View reviewed changes

NamelsKing approved these changes Apr 30, 2026

View reviewed changes

Merge branch 'main' into feat/QVAC-16526-mobile-worker-cleanup

b1a9764

lauripiisang had a problem deploying to release April 30, 2026 07:32 — with GitHub Actions Failure

lauripiisang merged commit b09caa1 into main Apr 30, 2026
19 of 20 checks passed

lauripiisang deleted the feat/QVAC-16526-mobile-worker-cleanup branch April 30, 2026 07:47

lauripiisang had a problem deploying to release April 30, 2026 07:47 — with GitHub Actions Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

QVAC-16526 feat[api]: pre-terminate cleanup hook + stabilise mobile smoke#1797

QVAC-16526 feat[api]: pre-terminate cleanup hook + stabilise mobile smoke#1797
lauripiisang merged 8 commits into
mainfrom
feat/QVAC-16526-mobile-worker-cleanup

lauripiisang commented Apr 29, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 29, 2026 •

edited

Loading

Uh oh!

simon-iribarren left a comment

Uh oh!

lauripiisang commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

lauripiisang commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🎯 What problem does this PR solve?

📝 How does it solve it?

🧪 How was it tested?

🔌 API Changes

Uh oh!

github-actions Bot commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Tier-based Approval Status

Uh oh!

simon-iribarren left a comment

Choose a reason for hiding this comment

Uh oh!

lauripiisang commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lauripiisang commented Apr 29, 2026 •

edited

Loading

github-actions Bot commented Apr 29, 2026 •

edited

Loading