test[notask]: fix android sharded-model-resume scudo oom #1831

Merged
opaninakuffo merged 5 commits into main from test/sdk-fix-android-sharded-resume
Apr 30, 2026

@lauripiisang lauripiisang commented Apr 30, 2026

Note: be concise and prefer bullet points.

🎯 What problem does this PR solve?

  • sharded-model-resume reproducibly crashes the Android consumer worklet on Pixel 10 Pro local runs (reports/local-android-full/, 2026-04-30) with Scudo ERROR: internal map failure (Out of memory) → SIGABRT in mqt_v_js.
  • Cascade failure: after the worklet dies, the rest of the sharded-model category (cancellation, inference, batch-inference, long-text-inference) fails with "Consumer died before test could be executed" — 5/10 sharded tests red.
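
The root-cause versus cascade split described above can be sketched as a small triage helper. This is an illustration only: the `TestResult` shape, the `splitFailures` function, the test names other than `sharded-model-resume`, and the sample error text are all hypothetical. Only the "Consumer died before test could be executed" marker is taken from the report.

```typescript
// Hypothetical result shape; not the real report schema.
interface TestResult {
  name: string;
  passed: boolean;
  error?: string;
}

// Marker string quoted in the PR description for cascaded failures.
const CASCADE_MARKER = "Consumer died before test could be executed";

// Separate failures that merely follow a dead consumer from the ones
// that need investigation in their own right.
function splitFailures(results: TestResult[]): { rootCauses: string[]; cascaded: string[] } {
  const rootCauses: string[] = [];
  const cascaded: string[] = [];
  for (const r of results) {
    if (r.passed) continue;
    if (r.error !== undefined && r.error.indexOf(CASCADE_MARKER) !== -1) {
      cascaded.push(r.name);
    } else {
      rootCauses.push(r.name);
    }
  }
  return { rootCauses, cascaded };
}

// Made-up sample mirroring the failure pattern: one crash, four cascaded failures.
const sample: TestResult[] = [
  { name: "sharded-model-load", passed: true },
  { name: "sharded-model-resume", passed: false, error: "heartbeat timeout" },
  { name: "sharded-model-cancellation", passed: false, error: CASCADE_MARKER },
  { name: "sharded-model-inference", passed: false, error: CASCADE_MARKER },
  { name: "sharded-model-batch-inference", passed: false, error: CASCADE_MARKER },
  { name: "sharded-model-long-text-inference", passed: false, error: CASCADE_MARKER },
];

const triage = splitFailures(sample);
```

With this split, only `sharded-model-resume` is left as a failure to debug; the other four are consequences of the dead worklet.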

📝 How does it solve it?

  • Five sharded-model tests (detection, hash-validation, progress, resume, cancellation) declared dependency: "none" while their default handler (loadSharded) calls ensureLoaded("sharded-embeddings"). After commit 7594d703 (eviction-on-none), dep:none means evictExcept([]), so each test unloaded the model and then immediately re-mmapped the same 5 shards. On Android, Scudo's mmap() fails before the kernel reclaims the prior maps (RSS was ~2.3 GB / 16 GB at crash, so this is mmap-region/page-reclaim contention, not real OOM).
  • Tag those five tests with dependency: "sharded-embeddings" so modelSetup keeps the model hot across the category. (load keeps dep:none because it is the cold-load test; backward-compatibility keeps dep:none because it loads GTE_LARGE_FP16, a different model.)
  • Bump mobile unloadSettleMs 100 → 200 ms as added slack for any remaining same-model unload/reload paths (matches the value the existing comment claimed was empirically sufficient on iOS).
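
The effect of the dependency tag can be sketched with a toy model of the harness. This is a minimal sketch, not the real SDK code: the `ModelCache` class, the bodies of `modelSetup` and `loadSharded`, and the `runCategory` driver are assumptions. Only the names `ensureLoaded`, `evictExcept`, `loadSharded`, `modelSetup`, and the two dependency values come from the description above.

```typescript
// Toy stand-in for the model cache; the real implementation is not shown in this PR.
class ModelCache {
  private loaded: string[] = [];
  loads = 0; // how many times a model had to be (re)mapped

  ensureLoaded(id: string): void {
    if (this.loaded.indexOf(id) === -1) {
      this.loaded.push(id);
      this.loads += 1;
    }
  }

  // Unload every model that is not listed in `keep`.
  evictExcept(keep: string[]): void {
    this.loaded = this.loaded.filter((id) => keep.indexOf(id) !== -1);
  }
}

// Assumed setup behaviour after eviction-on-none: "none" keeps nothing.
function modelSetup(cache: ModelCache, dependency: string): void {
  if (dependency === "none") {
    cache.evictExcept([]);
  } else {
    cache.evictExcept([dependency]);
    cache.ensureLoaded(dependency);
  }
}

// The default handler of the sharded tests always needs the sharded model.
function loadSharded(cache: ModelCache): void {
  cache.ensureLoaded("sharded-embeddings");
}

// Run `testCount` tests that share one dependency declaration and
// report how many times the model was mapped.
function runCategory(dependency: string, testCount: number): number {
  const cache = new ModelCache();
  for (let i = 0; i < testCount; i++) {
    modelSetup(cache, dependency);
    loadSharded(cache);
  }
  return cache.loads;
}

const loadsWithNone = runCategory("none", 5);
const loadsWithDependency = runCategory("sharded-embeddings", 5);
```

In this model the five tests tagged "none" map the model five times, each right after an unload, while the same tests tagged "sharded-embeddings" map it once and reuse it.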

🧪 How was it tested?

  • Reproduced the crash on Pixel 10 Pro (reports/local-android-full/, 2026-04-30 10:28): sharded-model-resume failed with a 125 s heartbeat timeout; device.log shows "malloc(65536) failed" followed by "Scudo ERROR: internal map failure" → SIGABRT during the re-mmap of the same 5 shards.
  • Re-ran the sharded tests on their own and a full Android run locally; both pass.

Tag sharded-model {detection, hash-validation, progress, resume,
cancellation} with dependency "sharded-embeddings" instead of "none".
With dep:none and the default loadSharded handler, modelSetup evicts
sharded-embeddings then immediately re-mmaps the same 5 shards; on
Android (Pixel 10 Pro) Scudo's mmap fails with "internal map failure"
before the kernel reclaims the prior maps, killing the worklet and
cascading the rest of the sharded category.

Also bump mobile unloadSettleMs 100 -> 200 ms to keep some slack for
remaining same-model unload / reload paths.
Victor-Rodzko previously approved these changes Apr 30, 2026

github-actions Bot commented Apr 30, 2026

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (2/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

Without --report-dir, BatchOrchestrator skips writing app-mem.ndjson
and test-timeline.ndjson, so mobile in-app memory samples published
on qvac/app-memory get dropped silently and the per-test memory rows
/ chart / suite peak never make it into the report. run:local already
passes --report-dir; the three CI workflows did not.

Pin --report-dir=./reports in test-android-sdk.yml, test-ios-sdk.yml
and test-desktop-sdk.yml. The existing "Upload results" step already
uploads ${working-directory}/reports/ so the new files ride along.
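
The behaviour described in this commit message can be sketched as follows. This is a hypothetical illustration, not BatchOrchestrator's actual code: the `reportArtifacts` function and its signature are invented here, and only the two file names and the `./reports` directory come from the text above.

```typescript
// Hypothetical helper: which report files a run would write for a given
// --report-dir value. Without a directory, nothing is written.
function reportArtifacts(reportDir?: string): string[] {
  if (!reportDir) return [];
  const base = reportDir.replace(/\/+$/, ""); // drop trailing slashes
  return [`${base}/app-mem.ndjson`, `${base}/test-timeline.ndjson`];
}

const withoutFlag = reportArtifacts();
const withFlag = reportArtifacts("./reports");
```

Under this sketch, a run without the flag produces no files, which matches the silent loss of memory samples seen in CI, while `./reports` yields both NDJSON files inside the directory the upload step already collects.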
@opaninakuffo

review

@opaninakuffo opaninakuffo merged commit b162a69 into main Apr 30, 2026
19 of 20 checks passed
@opaninakuffo opaninakuffo deleted the test/sdk-fix-android-sharded-resume branch April 30, 2026 19:09
gabrielgrigoras-serv pushed a commit that referenced this pull request May 1, 2026
* test[notask]: fix android sharded-model-resume scudo oom

Tag sharded-model {detection, hash-validation, progress, resume,
cancellation} with dependency "sharded-embeddings" instead of "none".
With dep:none and the default loadSharded handler, modelSetup evicts
sharded-embeddings then immediately re-mmaps the same 5 shards; on
Android (Pixel 10 Pro) Scudo's mmap fails with "internal map failure"
before the kernel reclaims the prior maps, killing the worklet and
cascading the rest of the sharded category.

Also bump mobile unloadSettleMs 100 -> 200 ms to keep some slack for
remaining same-model unload / reload paths.

* infra[notask]: pass --report-dir to CI sdk producer runs

Without --report-dir, BatchOrchestrator skips writing app-mem.ndjson
and test-timeline.ndjson, so mobile in-app memory samples published
on qvac/app-memory get dropped silently and the per-test memory rows
/ chart / suite peak never make it into the report. run:local already
passes --report-dir; the three CI workflows did not.

Pin --report-dir=./reports in test-android-sdk.yml, test-ios-sdk.yml
and test-desktop-sdk.yml. The existing "Upload results" step already
uploads ${working-directory}/reports/ so the new files ride along.

---------

Co-authored-by: Victor-Rodzko <victor.rodzko@itrexgroup.com>
Co-authored-by: Opanin Akuffo <46673050+opaninakuffo@users.noreply.github.com>
Proletter pushed a commit that referenced this pull request May 24, 2026