Skip to content

QVAC-16526 feat[api]: pre-terminate cleanup hook + stabilise mobile smoke#1797

Merged
lauripiisang merged 8 commits into
mainfrom
feat/QVAC-16526-mobile-worker-cleanup
Apr 30, 2026
Merged

QVAC-16526 feat[api]: pre-terminate cleanup hook + stabilise mobile smoke#1797
lauripiisang merged 8 commits into
mainfrom
feat/QVAC-16526-mobile-worker-cleanup

Conversation

@lauripiisang

@lauripiisang lauripiisang commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Note: be concise and prefer bullet points.

Blocked on tetherto/qvac-test-suite#68 merging + releasing @tetherto/qvac-test-suite@0.6.2. The tarball pointer is already bumped in tests-qvac/package.json so CI will fail until the upstream release publishes.

🎯 What problem does this PR solve?

  • Mobile (iOS) test runs crashed within ~5 tests (EXC_BREAKPOINT / SIGTRAP) the moment the SDK auto-closed the bare runtime between heavy tests. Native addons retain js_ref_t handles into the worker V8 isolate; tearing the isolate down without releasing those refs trips a V8 GlobalHandles::Destroy assertion the next time a fresh worker re-imports the same .bare files in the same iOS process.
  • Even when the runtime survived, the smoke run hit a memory ceiling on sharded-model-load because tests with dependency: "none" skipped eviction and let the previous test heavy model linger.

📝 How does it solve it?

  • SDK pre-terminate cleanup signal (__shutdown__ RPC message). Lets the client tell the worker to release addon-bound state (releaseLogger on every plugin via clearPlugins, unloadAllModels, registry close) while the JS env is still alive. expo-rpc-client.close() now sends __shutdown__ and awaits ack (10s safety timeout) before calling Worklet.terminate(). Reusable cleanup body extracted from shutdownBareDirectWorker so desktop signal/exit paths stay unchanged. Mock RPC in bare-client.ts mirrors the message for direct-mode (Pear) consumers.
  • tests-qvac lifecycle fix. dependency: "none" now triggers evictExcept([]) instead of skipping eviction entirely; previous test deps are dropped before the new test runs.
  • tests-qvac post-unload settle. New ResourceManager({ unloadSettleMs }) option sleeps after a successful unloadModel so the kernel can release pages before the next load starts allocating. Default 0 (off, desktop is fine), mobile opts in to 100 ms.
  • Skip lifecycle-suspend-* on mobile. suspend() hangs the runner indefinitely on mobile (lifecycle coordinator pauses MQTT and never resumes within the test timeout).
  • Skip Worklet.terminate() on non-iOS. Android dlclose actually unmaps addon shared libs; some addons (likely rocksdb-native, libbare-tls, libbare-crypto) register pthread_key_t destructors but never pthread_key_delete them before unload, so the next pthread_exit SIGSEGVs in pthread_key_clean_all. iOS dyld no-ops dlclose so the problem can't manifest there. On non-iOS we fall back to the legacy refs-only close() and reuse the live worklet on the next getRPC(). Trade-off: Android no longer recovers memory between heavy tests, so it accumulates across the smoke run — lower-RAM Android devices may hit issues. Acceptable until holepunchto/bare exposes a per-addon unload hook.

🧪 How was it tested?

  • run:local:ios on iPhone 17 (iOS 26.3.1):
    • Before: crashed at test 5 (model-load-concurrent) with EXC_BREAKPOINT in JsLogger::releaseJsRefs after Worklet.terminate().
    • After: full 88-test smoke runs to completion. Hundreds of clean worklet termination cycles, no V8 GlobalHandle crashes.
    • Memory tab in the report shows expected per-test peaks; baseline returns to ~250 MB band between heavy categories.
  • run:local:android on Pixel 10 Pro (Android 16):
    • Before the skip: crashed at test 2 (model-load-embedding) with SIGSEGV in pthread_key_clean_all right after the first Worklet.terminate() (tombstone backtrace pulled and analysed; PC sat in a dlclose'd region with no surviving named mapping).
    • After the skip: smoke run progresses past the previous failure point; runner reuses the live worklet between tests as before this PR.
  • run:local:desktop: unchanged behaviour confirmed (no unloadSettleMs, no __shutdown__ RPC roundtrip cost — desktop relies on kill SIGTERM of the spawned worker which tears state down via the kernel).
  • Verified the JS GlobalHandles crash signature in the iOS .ips crash report before the fix; absent after.

🔌 API Changes

expo-rpc-client.ts::close() is now async (was () => void). The only caller in the SDK (unload-model.ts:50 auto-close path) already awaits it, so this is non-breaking for the SDK itself. Third-party callers that previously fired-and-forgot close() continue to work; the returned Promise<void> can be ignored. bare-client.ts::close() is already async.

// Mobile auto-close path (unchanged from caller perspective):
await close(); // now awaits worker cleanup ack before terminating worklet

Lets a client request a clean addon teardown before tearing the bare
runtime down, so addon static state (e.g. js_ref_t handles into the
worker V8 isolate) is released while that env is still alive.

Without this, tearing down a runtime whose addons retain
isolate-bound refs trips a V8 GlobalHandles assertion (brk 0 / SIGTRAP)
inside the next runtime that re-imports the same .bare files in the
same OS process — the JsLogger.setLogger path in
qvac-lib-inference-addon-cpp is the reproducer (every addon that
links it has the same retention).

- worker-core.ts: extract the existing shutdown body into a reusable
  cleanupForTerminate() that runs the same registry / model / resource
  cleanup but skips releaseWorkerLock() and process.exit(). The full
  shutdownBareDirectWorker still runs both for desktop signal and
  exit paths.
- handler-utils.ts + handle-request.ts: new internal __shutdown__
  message dispatched alongside __init_config. Bypasses the schema,
  awaits cleanupForTerminate(), and replies success. Lazy-imports the
  worker-core function to break the handler-utils -> worker-core ->
  create-server -> handle-request import cycle.
- bare-client.ts: mirror the message in the in-process mock RPC for
  desktop direct-mode (Pear-style) consumers.
- expo-rpc-client.ts: close() is now async; sends __shutdown__ over
  RPC and awaits the success reply (with a 10s timeout safety) before
  calling worklet.terminate(). Best-effort: timeouts log a warning
  and proceed with terminate. The auto-close path in unload-model.ts
  already awaits close(), so this is non-breaking for that caller.
… settle

Two related fixes that together let the mobile smoke run progress
past the "previous heavy model still resident" memory ceiling:

- resource-lifecycle: tests with dependency:none used to skip
  evictExcept and leave whatever was loaded by the previous test
  resident. Now treated as evictExcept([]), so a heavy model from
  the prior test gets unloaded before the next one starts allocating.
  Empirically this is what kept tripping sharded-model-load right
  after translation-afriquegemma-sw-en (afriquegemma 4B leaves ~550 MB
  resident; sharded then asks for multi-GB on top and hits the iOS
  memory limit).

- resource-manager: new ResourceManager({ unloadSettleMs }) option
  that sleeps for the configured duration after a successful
  unloadModel (only on success — failure path returns immediately).
  Lets the kernel release pages before the next load starts allocating.
  Defaults to 0 (off, desktop is fine without it). Mobile consumer
  opts in to 100ms.

Mobile consumer also picks up SkipExecutor entries for the
lifecycle-suspend tests; suspend hangs the runner indefinitely on
mobile because the lifecycle coordinator pauses MQTT and never
resumes within the test timeout.
Picks up:
- in-app memory poller in mobile-consumer template
- desktop in-app memory poller (process-tree RSS)
- Memory tab + per-test memory metrics in HTML/JSON reports
- bucket results by metadata.category instead of testId-prefix split

Required by the eviction / settle work in this PR; both depend on
the new MemorySummary fields and the corrected category bucketing.
@lauripiisang lauripiisang marked this pull request as ready for review April 29, 2026 05:00
@lauripiisang lauripiisang requested review from a team as code owners April 29, 2026 05:00
opaninakuffo
opaninakuffo previously approved these changes Apr 29, 2026
@github-actions

github-actions Bot commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (2/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

cleanupForTerminate previously set the same isShuttingDown flag that
shutdownBareDirectWorker uses as its early-return guard. After a
__shutdown__ message ran the pre-terminate cleanup, a subsequent
SIGTERM / SIGINT / uncaught-exception in desktop direct mode would
early-return at the guard and skip releaseWorkerLock() + process.exit().
Result: lock file leak and no graceful exit.

Mobile is unaffected because each Worklet has its own module instance
(fresh isShuttingDown per worklet). The bug only bites the bare-client
mock-RPC path (Pear-style consumers where the worker shares the host
process for its lifetime).

Two flags now:
- cleanupRan: idempotent guard around runCleanup body
- isShuttingDown: only set by shutdownBareDirectWorker; cleanupForTerminate
  must NOT set it
shutdownBareDirectWorker still calls runCleanup which is now a no-op
when cleanupRan is already true.
…_ races

If two callers race close() (or one calls close() while another getRPC()
is mid-flight), the second sees rpcInstance still set, fires a redundant
__shutdown__, then re-enters the terminate block on already-null state.

Wrap the body in a singleton closingPromise; concurrent callers share
the same in-flight close. Reset to null in finally so a fresh worker
brought up later can be cleanly closed again.

The auto-close path in unload-model.ts is naturally serialised today
so this is robustness rather than fixing an active bug, but the cost
is minimal and the failure mode (double __shutdown__ after terminate)
is annoying to diagnose.
NamelsKing
NamelsKing previously approved these changes Apr 29, 2026
Worklet.terminate() crashes on Android: addon dlclose unmaps the lib
but pthread_key_t destructors registered by some addons (likely
rocksdb-native, libbare-tls, libbare-crypto) are never
pthread_key_delete'd before unload, so libc's per-thread cleanup table
points at unmapped memory and the next pthread_exit SIGSEGVs in
pthread_key_clean_all().

iOS dyld no-ops dlclose for already-loaded third-party libs, so the
dangling-destructor problem cannot manifest there. The terminate path
stays enabled on iOS.

On non-iOS, fall back to the legacy refs-only close: drop rpcInstance
and rpcPromise, leave workletInstance + workletInitialized intact so
the next getRPC() reuses the live worklet. Skip the __shutdown__
roundtrip too -- it would clear the worker plugin registry without a
follow-up terminate, leaving the worker unusable for subsequent
loadModel.

Trade-off: Android tests no longer recover memory between heavy tests
the way iOS now does, so memory accumulates across the smoke run. On
Pixel-class devices (8+ GB RAM) this is fine; smaller-RAM Android
devices may regress vs the pre-PR baseline. Acceptable until the
upstream holepunchto/bare exposes a per-addon unload hook.

Platform is resolved via the existing getRuntimeContext() path
(getDeviceInfo handles a missing expo-device safely via dynamic
import + try/catch), so no new react-native imports are added.
The test reliably times out on mobile (Android Pixel 10 Pro hit the
600s timeout in the latest smoke run). Test framework drops the await
on timeout but the underlying streaming inference keeps running on the
Bare worker side, leaving the diffusion model "in use" from the
runtime's perspective.

Knock-on effect: any later test whose modelSetup needs to evict
diffusion (e.g. wrong-model-transcribe-on-llm via
ResourceManager.evictExcept) blocks indefinitely waiting for the
stream to finish. Observed in local-android-smoke: 86/88 tests
completed, then the runner stuck for 50+ minutes inside the eviction
of diffusion at test 86's setup.

Skipping unblocks the smoke run end-to-end. The proper fixes
(framework-side cancel-on-timeout, resource-manager bounded waits)
are tracked separately.

@simon-iribarren simon-iribarren left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on the SDK changes. Deep-dove the requested key commit e9954bf and its three follow-ups — diagnosis is correct (addon js_ref_t static state surviving across worklets in the same iOS process trips the V8 GlobalHandles::Destroy assertion / EXC_BREAKPOINT in JsLogger::releaseJsRefs), and the fix is well-structured.

What I checked:

  • runCleanup() body matches the original shutdownBareDirectWorker cleanup verbatim (same clearRegistries() + Promise.allSettled of the same five closers) → no regression for desktop signal/exit.
  • __shutdown__ schema bypass + lazy-import in handleShutdown to break handler-utils → worker-core → create-server → handle-request cycle is correct; double-underscore prefix matches existing __init_config convention.
  • expo-rpc-client.close() widening from () => void to () => Promise<void> is non-breaking — verified all callers in packages/sdk (auto-close at unload-model.ts:50 already awaits; desktop void close() callers route to bare-client.ts/node-rpc-client.ts which were already async).
  • Follow-ups are real bug fixes for issues introduced in e9954bf, not polish: 4cfc8ec2 splits cleanupRan from isShuttingDown so a later SIGTERM still releases the worker lock + exits; 4ba4137b serialises close() against double-__shutdown__ races. Both worth merging.
  • b795d74d's non-iOS skip is honestly disclosed — the Android pthread_key_clean_all SIGSEGV root cause (addons register pthread_key_t destructors but never pthread_key_delete them, so dlclose unmaps the lib while libc's per-thread cleanup table still points at it) matches what the tombstone shows.

One small thing worth a sanity check (non-blocking): in b795d74d's iOS branch, (await getRuntimeContext()).platform is read inside the closingPromise body, and the try/catch around it swallows transient failures into platform = undefined, which falls through to the refs-only path even on iOS. Probably fine because runtime context is cached after first call, but if you'd rather default iOS to "yes terminate" on a transient failure, the catch could special-case that. Up to you.

CI failure on check (sdk) is a transient bun install tarball error for react-devtools-core — not code-related, clears on rebase. Approving once the test-suite 0.6.2 release lands and you push a rebase.

@lauripiisang

Copy link
Copy Markdown
Contributor Author

/review

@lauripiisang lauripiisang merged commit b09caa1 into main Apr 30, 2026
19 of 20 checks passed
@lauripiisang lauripiisang deleted the feat/QVAC-16526-mobile-worker-cleanup branch April 30, 2026 07:47
Proletter pushed a commit that referenced this pull request May 24, 2026
…moke (#1797)

* feat: add pre-terminate cleanup signal for SDK clients

Lets a client request a clean addon teardown before tearing the bare
runtime down, so addon static state (e.g. js_ref_t handles into the
worker V8 isolate) is released while that env is still alive.

Without this, tearing down a runtime whose addons retain
isolate-bound refs trips a V8 GlobalHandles assertion (brk 0 / SIGTRAP)
inside the next runtime that re-imports the same .bare files in the
same OS process — the JsLogger.setLogger path in
qvac-lib-inference-addon-cpp is the reproducer (every addon that
links it has the same retention).

- worker-core.ts: extract the existing shutdown body into a reusable
  cleanupForTerminate() that runs the same registry / model / resource
  cleanup but skips releaseWorkerLock() and process.exit(). The full
  shutdownBareDirectWorker still runs both for desktop signal and
  exit paths.
- handler-utils.ts + handle-request.ts: new internal __shutdown__
  message dispatched alongside __init_config. Bypasses the schema,
  awaits cleanupForTerminate(), and replies success. Lazy-imports the
  worker-core function to break the handler-utils -> worker-core ->
  create-server -> handle-request import cycle.
- bare-client.ts: mirror the message in the in-process mock RPC for
  desktop direct-mode (Pear-style) consumers.
- expo-rpc-client.ts: close() is now async; sends __shutdown__ over
  RPC and awaits the success reply (with a 10s timeout safety) before
  calling worklet.terminate(). Best-effort: timeouts log a warning
  and proceed with terminate. The auto-close path in unload-model.ts
  already awaits close(), so this is non-breaking for that caller.

* test: stabilise mobile smoke run via eviction-on-none and post-unload settle

Two related fixes that together let the mobile smoke run progress
past the "previous heavy model still resident" memory ceiling:

- resource-lifecycle: tests with dependency:none used to skip
  evictExcept and leave whatever was loaded by the previous test
  resident. Now treated as evictExcept([]), so a heavy model from
  the prior test gets unloaded before the next one starts allocating.
  Empirically this is what kept tripping sharded-model-load right
  after translation-afriquegemma-sw-en (afriquegemma 4B leaves ~550 MB
  resident; sharded then asks for multi-GB on top and hits the iOS
  memory limit).

- resource-manager: new ResourceManager({ unloadSettleMs }) option
  that sleeps for the configured duration after a successful
  unloadModel (only on success — failure path returns immediately).
  Lets the kernel release pages before the next load starts allocating.
  Defaults to 0 (off, desktop is fine without it). Mobile consumer
  opts in to 100ms.

Mobile consumer also picks up SkipExecutor entries for the
lifecycle-suspend tests; suspend hangs the runner indefinitely on
mobile because the lifecycle coordinator pauses MQTT and never
resumes within the test timeout.

* chore: bump qvac-test-suite to ^0.6.2

Picks up:
- in-app memory poller in mobile-consumer template
- desktop in-app memory poller (process-tree RSS)
- Memory tab + per-test memory metrics in HTML/JSON reports
- bucket results by metadata.category instead of testId-prefix split

Required by the eviction / settle work in this PR; both depend on
the new MemorySummary fields and the corrected category bucketing.

* fix: split cleanupRan and isShuttingDown so shutdown still releases lock

cleanupForTerminate previously set the same isShuttingDown flag that
shutdownBareDirectWorker uses as its early-return guard. After a
__shutdown__ message ran the pre-terminate cleanup, a subsequent
SIGTERM / SIGINT / uncaught-exception in desktop direct mode would
early-return at the guard and skip releaseWorkerLock() + process.exit().
Result: lock file leak and no graceful exit.

Mobile is unaffected because each Worklet has its own module instance
(fresh isShuttingDown per worklet). The bug only bites the bare-client
mock-RPC path (Pear-style consumers where the worker shares the host
process for its lifetime).

Two flags now:
- cleanupRan: idempotent guard around runCleanup body
- isShuttingDown: only set by shutdownBareDirectWorker; cleanupForTerminate
  must NOT set it
shutdownBareDirectWorker still calls runCleanup which is now a no-op
when cleanupRan is already true.

* fix: serialise expo-rpc-client.close() to avoid duplicate __shutdown__ races

If two callers race close() (or one calls close() while another getRPC()
is mid-flight), the second sees rpcInstance still set, fires a redundant
__shutdown__, then re-enters the terminate block on already-null state.

Wrap the body in a singleton closingPromise; concurrent callers share
the same in-flight close. Reset to null in finally so a fresh worker
brought up later can be cleanly closed again.

The auto-close path in unload-model.ts is naturally serialised today
so this is robustness rather than fixing an active bug, but the cost
is minimal and the failure mode (double __shutdown__ after terminate)
is annoying to diagnose.

* fix: skip Worklet.terminate() on non-iOS platforms

Worklet.terminate() crashes on Android: addon dlclose unmaps the lib
but pthread_key_t destructors registered by some addons (likely
rocksdb-native, libbare-tls, libbare-crypto) are never
pthread_key_delete'd before unload, so libc's per-thread cleanup table
points at unmapped memory and the next pthread_exit SIGSEGVs in
pthread_key_clean_all().

iOS dyld no-ops dlclose for already-loaded third-party libs, so the
dangling-destructor problem cannot manifest there. The terminate path
stays enabled on iOS.

On non-iOS, fall back to the legacy refs-only close: drop rpcInstance
and rpcPromise, leave workletInstance + workletInitialized intact so
the next getRPC() reuses the live worklet. Skip the __shutdown__
roundtrip too -- it would clear the worker plugin registry without a
follow-up terminate, leaving the worker unusable for subsequent
loadModel.

Trade-off: Android tests no longer recover memory between heavy tests
the way iOS now does, so memory accumulates across the smoke run. On
Pixel-class devices (8+ GB RAM) this is fine; smaller-RAM Android
devices may regress vs the pre-PR baseline. Acceptable until the
upstream holepunchto/bare exposes a per-addon unload hook.

Platform is resolved via the existing getRuntimeContext() path
(getDeviceInfo handles a missing expo-device safely via dynamic
import + try/catch), so no new react-native imports are added.

* test: skip diffusion-streaming-progress on mobile

The test reliably times out on mobile (Android Pixel 10 Pro hit the
600s timeout in the latest smoke run). Test framework drops the await
on timeout but the underlying streaming inference keeps running on the
Bare worker side, leaving the diffusion model "in use" from the
runtime's perspective.

Knock-on effect: any later test whose modelSetup needs to evict
diffusion (e.g. wrong-model-transcribe-on-llm via
ResourceManager.evictExcept) blocks indefinitely waiting for the
stream to finish. Observed in local-android-smoke: 86/88 tests
completed, then the runner stuck for 50+ minutes inside the eviction
of diffusion at test 86's setup.

Skipping unblocks the smoke run end-to-end. The proper fixes
(framework-side cancel-on-timeout, resource-manager bounded waits)
are tracked separately.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants