test[notask]: fix android sharded-model-resume scudo oom #1831
Merged
Tag sharded-model {detection, hash-validation, progress, resume,
cancellation} with dependency "sharded-embeddings" instead of "none".
With dep:none and the default loadSharded handler, modelSetup evicts
sharded-embeddings then immediately re-mmaps the same 5 shards; on
Android (Pixel 10 Pro) Scudo's mmap fails with "internal map failure"
before the kernel reclaims the prior maps, killing the worklet and
cascading the rest of the sharded category.
Also bump mobile unloadSettleMs 100 -> 200 ms to keep some slack for
remaining same-model unload / reload paths.
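The eviction cycle described above can be pictured with a small model. This is only a sketch under stated assumptions: `ModelCache` and `runCategory` are hypothetical names, and the bodies of `modelSetup`, `evictExcept`, `ensureLoaded` and `loadSharded` are guesses at the behaviour the commit message describes, not the SDK's actual code.

```typescript
// Hypothetical in-memory stand-in for the runner's model cache.
class ModelCache {
  private loaded: string[] = [];
  loads = 0;
  unloads = 0;

  // Unload every resident model that is not listed in `keep`.
  evictExcept(keep: string[]): void {
    const kept = this.loaded.filter((name) => keep.indexOf(name) !== -1);
    this.unloads += this.loaded.length - kept.length;
    this.loaded = kept;
  }

  // Load the model only if it is not already resident.
  ensureLoaded(name: string): void {
    if (this.loaded.indexOf(name) === -1) {
      this.loaded.push(name);
      this.loads++;
    }
  }
}

// Assumed behaviour: dependency "none" evicts everything,
// a named dependency is kept resident (and loaded if missing).
function modelSetup(cache: ModelCache, dependency: string): void {
  cache.evictExcept(dependency === "none" ? [] : [dependency]);
  if (dependency !== "none") cache.ensureLoaded(dependency);
}

// Stand-in for the default handler, which needs sharded-embeddings.
function loadSharded(cache: ModelCache): void {
  cache.ensureLoaded("sharded-embeddings");
}

// Run `testCount` sharded tests that all declare the same dependency.
function runCategory(dependency: string, testCount: number): ModelCache {
  const cache = new ModelCache();
  cache.ensureLoaded("sharded-embeddings"); // left resident by an earlier test
  for (let i = 0; i < testCount; i++) {
    modelSetup(cache, dependency);
    loadSharded(cache);
  }
  return cache;
}
```

In this model, five tests tagged "none" cause five unloads, each followed at once by a reload of the same model, while five tests tagged "sharded-embeddings" cause no unloads at all. That back-to-back unload and reload is the pattern the commit message says triggers the Scudo map failure on Android.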
Victor-Rodzko
previously approved these changes
Apr 30, 2026
Without --report-dir, BatchOrchestrator skips writing app-mem.ndjson
and test-timeline.ndjson, so mobile in-app memory samples published
on qvac/app-memory get dropped silently and the per-test memory rows
/ chart / suite peak never make it into the report. run:local already
passes --report-dir; the three CI workflows did not.
Pin --report-dir=./reports in test-android-sdk.yml, test-ios-sdk.yml
and test-desktop-sdk.yml. The existing "Upload results" step already
uploads ${working-directory}/reports/ so the new files ride along.
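A rough sketch of why a missing `--report-dir` drops the memory data silently. Everything except the two file names is an assumption: `planReportFiles`, `toNdjson` and the `MemSample` fields are hypothetical and do not reflect how BatchOrchestrator is actually written.

```typescript
// Hypothetical shape of one in-app memory sample; field names are assumptions.
interface MemSample {
  test: string;
  rssMb: number;
  atMs: number;
}

// NDJSON: one JSON document per line, each line newline-terminated.
function toNdjson(rows: object[]): string {
  return rows.map((row) => JSON.stringify(row) + "\n").join("");
}

// Returns the files that would be written, keyed by path.
// With no report directory nothing is produced and no error is raised,
// which is the "dropped silently" behaviour described above.
function planReportFiles(
  reportDir: string | undefined,
  memSamples: MemSample[],
  timeline: object[]
): { [path: string]: string } {
  if (!reportDir) return {};
  const files: { [path: string]: string } = {};
  files[reportDir + "/app-mem.ndjson"] = toNdjson(memSamples);
  files[reportDir + "/test-timeline.ndjson"] = toNdjson(timeline);
  return files;
}
```

Calling `planReportFiles(undefined, samples, timeline)` yields an empty object, whereas passing "./reports" yields both NDJSON files under the directory that the existing upload step already collects.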
opaninakuffo
approved these changes
Apr 30, 2026
Victor-Rodzko
approved these changes
Apr 30, 2026
NamelsKing
approved these changes
Apr 30, 2026
gabrielgrigoras-serv
pushed a commit
that referenced
this pull request
May 1, 2026
* test[notask]: fix android sharded-model-resume scudo oom
Tag sharded-model {detection, hash-validation, progress, resume,
cancellation} with dependency "sharded-embeddings" instead of "none".
With dep:none and the default loadSharded handler, modelSetup evicts
sharded-embeddings then immediately re-mmaps the same 5 shards; on
Android (Pixel 10 Pro) Scudo's mmap fails with "internal map failure"
before the kernel reclaims the prior maps, killing the worklet and
cascading the rest of the sharded category.
Also bump mobile unloadSettleMs 100 -> 200 ms to keep some slack for
remaining same-model unload / reload paths.
* infra[notask]: pass --report-dir to CI sdk producer runs
Without --report-dir, BatchOrchestrator skips writing app-mem.ndjson
and test-timeline.ndjson, so mobile in-app memory samples published
on qvac/app-memory get dropped silently and the per-test memory rows
/ chart / suite peak never make it into the report. run:local already
passes --report-dir; the three CI workflows did not.
Pin --report-dir=./reports in test-android-sdk.yml, test-ios-sdk.yml
and test-desktop-sdk.yml. The existing "Upload results" step already
uploads ${working-directory}/reports/ so the new files ride along.
---------
Co-authored-by: Victor-Rodzko <victor.rodzko@itrexgroup.com>
Co-authored-by: Opanin Akuffo <46673050+opaninakuffo@users.noreply.github.com>
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
🎯 What problem does this PR solve?
* sharded-model-resume reproducibly crashes the Android consumer worklet on Pixel 10 Pro local runs (reports/local-android-full/, 2026-04-30) with Scudo ERROR: internal map failure (Out of memory) → SIGABRT in mqt_v_js.
* Every later sharded test (cancellation, inference, batch-inference, long-text-inference) then fails with "Consumer died before test could be executed", leaving 5/10 sharded tests red.
📝 How does it solve it?
* Five sharded-model tests (detection, hash-validation, progress, resume, cancellation) declared dependency: "none" while their default handler (loadSharded) calls ensureLoaded("sharded-embeddings"). After commit 7594d703 (eviction-on-none), dep:none means evictExcept([]), so each test ran an unload-then-immediately-mmap-the-same-5-shards cycle. On Android, Scudo's mmap() fails before the kernel reclaims the prior maps (RSS ~2.3 GB / 16 GB at crash; this is mmap-region/page-reclaim contention, not real OOM).
* Those five tests are now tagged dependency: "sharded-embeddings" so modelSetup keeps the model hot across the category. (load keeps dep:none because it is the cold-load test; backward-compatibility keeps dep:none because it loads GTE_LARGE_FP16, a different model.)
* Mobile unloadSettleMs goes from 100 → 200 ms as added slack for any remaining same-model unload/reload paths (matches the value the existing comment claimed was empirically sufficient on iOS).
🧪 How was it tested?
* Local Android run (reports/local-android-full/, 2026-04-30 10:28): sharded-model-resume failed with 125 s heartbeat timeout; device.log shows malloc(65536) failed → Scudo ERROR: internal map failure → SIGABRT during the re-mmap of the same 5 shards.
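To make the role of unloadSettleMs concrete, here is a minimal sketch of a settle window between an unload and a reload of the same model. The helper `remainingSettleMs` and both constant names are hypothetical; only the 100 ms and 200 ms values come from the PR text, and the real runner may apply the delay differently.

```typescript
// Settle windows quoted in the PR: 100 ms before the change, 200 ms after.
const OLD_UNLOAD_SETTLE_MS = 100;
const NEW_UNLOAD_SETTLE_MS = 200;

// How much longer a reload of the same model should wait, given when the
// unload finished. Returns 0 once the settle window has fully elapsed.
function remainingSettleMs(
  unloadedAtMs: number,
  nowMs: number,
  settleMs: number
): number {
  const elapsedMs = nowMs - unloadedAtMs;
  return Math.max(0, settleMs - elapsedMs);
}
```

A reload attempted 150 ms after the unload would proceed immediately under the old window but wait a further 50 ms under the new one, which is the extra slack the bump is meant to provide for the kernel to reclaim the prior maps.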