infra: fix mobile int tests on linux due to now missing global nodejs #2123
Merged
gianni-cor
approved these changes
May 19, 2026
Contributor
Tier-based Approval Status |
NamelsKing
approved these changes
May 19, 2026
tobi-legan
added a commit
that referenced
this pull request
May 20, 2026
…te (QVAC-18168 follow-up)

Rebased clean on main after PR #1913 merged. Each monolithic mobile workflow (~1400-1800 lines) replaced with a thin composite-based shim (~170-230 lines).

Addons migrated: embed-llamacpp, bci-whispercpp, transcription-whispercpp, transcription-parakeet, decoder-audio, diffusion-cpp, classification-ggml, tts-onnx (q4/q4f16 variant matrix), tts-ggml

Composite extensions (backwards-compatible, no change for LLM/OCR/NMT):
- setup: skip-prebuilds input (decoder-audio has no own prebuilds)
- monitor: max-wait-time-seconds input (tts-onnx needs 3h)

Addon-side provision scripts (matching NMT's pattern):
- packages/tts-ggml/scripts/provision-mobile-models.sh
- packages/transcription-parakeet/scripts/provision-mobile-models.sh

Runner alignment: all shims use qvac-ubuntu2404-x64 for Android (matching main's latest self-hosted strategy from PR #2021/#2123).

Co-authored-by: Cursor <cursoragent@cursor.com>
4 tasks
tobi-legan
added a commit
that referenced
this pull request
May 21, 2026
…omposite (#2153)

* refactor(mobile-test): migrate remaining 9 addons onto shared composite (QVAC-18168 follow-up)

Rebased clean on main after PR #1913 merged. Each monolithic mobile workflow (~1400-1800 lines) replaced with a thin composite-based shim (~170-230 lines).

Addons migrated: embed-llamacpp, bci-whispercpp, transcription-whispercpp, transcription-parakeet, decoder-audio, diffusion-cpp, classification-ggml, tts-onnx (q4/q4f16 variant matrix), tts-ggml

Composite extensions (backwards-compatible, no change for LLM/OCR/NMT):
- setup: skip-prebuilds input (decoder-audio has no own prebuilds)
- monitor: max-wait-time-seconds input (tts-onnx needs 3h)

Addon-side provision scripts (matching NMT's pattern):
- packages/tts-ggml/scripts/provision-mobile-models.sh
- packages/transcription-parakeet/scripts/provision-mobile-models.sh

Runner alignment: all shims use qvac-ubuntu2404-x64 for Android (matching main's latest self-hosted strategy from PR #2021/#2123).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-test): parakeet — match monolith's mocha/WDIO timeouts (45min / 10min)

Main monolith uses timeout: 2700000 (45min) and waitforTimeout: 600000 (10min). Our composite defaults to 1800000 (30min) and 120000 (2min). The slower parakeet tests (sortformer inference on Pixel 9a) exceed 30min and time out.

Pass mocha-timeout-ms: 2700000 and wdio-waitfor-timeout-ms: 600000 to upload-to-devicefarm to match the monolith.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-test): match monolith mocha/WDIO timeouts for 4 addons + add tts-onnx perf filter

Deep audit of all 9 monoliths revealed custom timeout values that our shims were missing (using composite defaults instead):
- bci-whispercpp: mocha 900000 (15min), was 1800000
- transcription-whispercpp: mocha 900000 (15min), was 1800000
- decoder-audio: mocha 600000 (10min), was 1800000
- tts-ggml: mocha 2700000 (45min) + waitfor 600000 (10min)

Also: tts-onnx monolith used --filter supertonic on perf extraction to exclude Chatterbox rows from reports. Added filter: 'supertonic' to the extract-addon-perf call.

embed-llamacpp, diffusion-cpp, classification-ggml, tts-onnx all matched the composite defaults (1800000 / 120000) — no change needed. transcription-parakeet was already fixed in the previous commit.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-setup): always try artifact download, fall back to npm only when empty

The artifact-download steps were gated behind github.event_name != 'workflow_dispatch', which skipped them on workflow_dispatch even when on-pr-* had just produced fresh prebuild artifacts in sibling jobs. This caused workflow_dispatch runs to always fall back to npm, getting outdated/smaller prebuilds (e.g. parakeet 20 MB from npm vs 68 MB from fresh artifacts).

Fix: remove the event_name gate from artifact download (with continue-on-error: true it's safe to run when no artifacts exist). The npm-fallback step now checks if prebuilds/ already has content from artifacts before attempting npm pack.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-schedule): bump default Device Farm jobTimeoutMinutes from 60 to 90

Pixel 9 Pro runs LLM VLM inference ~1.7x slower than Samsung S25/S26 Ultra. The groupImagesPerf shard takes ~56 min on Pixel, and Device Farm's 60-min job timeout STOPS the run during teardown even though all 3 tests passed.

Bumping to 90 min gives enough headroom. NMT already overrides to 120 via the consumer shim.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(mobile-monitor): distinguish STOPPED-but-passed from real failures

Device Farm result=STOPPED means the jobTimeoutMinutes cap expired, not that tests failed. When a device is STOPPED but its counters show 0 failed / 0 errored / N passed, the tests all completed successfully — DF just killed the teardown phase.

Before: STOPPED counted as USER_FAILED, triggering exit 1 even though every test passed. This burned investigation time.

Now:
- STOPPED with clean counters → ⚠️ warning + USER_PASSED.
- STOPPED with actual failures → ❌ with counter breakdown.
- WARNED → treated as success (same as PASSED).
- FAILED / ERRORED → ❌ with counter breakdown.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-upload): implement perf-only group filtering via perf-test-regex input

The monolith had an inline filter_perf() + maybe_make_and_upload() that intersected each test group's grep with a perf-test regex when qvac_perf_only=true, skipping groups with no matching perf tests. This was lost when the composite was created — qvac-perf-only was threaded through to on-device config but the scheduling-side filter was missing. Result: benchmark runs scheduled ALL test groups on ALL devices instead of the perf-emitting subset.

New perf-test-regex input on upload-to-devicefarm: when qvac-perf-only=true and perf-test-regex is set, each group's grep is filtered to only keep matching tests. Empty groups are skipped with a clear log message.

LLM consumer now passes the same PERF_REGEX the monolith used:
^(runImageElephantTest|runImageFruitPlateTest|runImageHighResAuroraTest|runBitnetTest|runToolCallingTest)$

Other addons don't use qvac_perf_only so they're unaffected.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: move perf-test filter regex to consumer input instead of test-groups.json

The perf_tests key in test-groups.json broke the LLM addon's generate-mobile-integration-tests.js validator, which treats every top-level key as a platform and expects all tests to be covered.

Match the original monolith approach: the perf-emitting test regex is supplied by the consumer workflow via a new `perf-test-regex` composite input, keeping test-groups.json identical to main.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: convention-based perf-test filtering via perf-tests.json

Replace the hardcoded perf-test-regex consumer input with a convention-based auto-discovery file (perf-tests.json) that sits alongside test-groups.json in the addon's test/mobile/ directory. The composite reads the file when qvac_perf_only=true, builds the filter regex from the array, and skips groups with no perf-emitting tests.

No consumer workflow changes needed — addons opt in by dropping a perf-tests.json file.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
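The perf-only filtering described in the last commits above can be illustrated with a minimal Python sketch. It is not the composite's actual implementation (the messages suggest that lives in workflow shell steps). It assumes a test group is a mapping from group name to a list of test names and that perf-tests.json holds a flat array of test names. The function names `build_perf_regex` and `filter_groups`, the group `groupMisc`, and the test `runChatTest` are hypothetical; only the `^(a|b|c)$` regex shape, `groupImagesPerf`, and the two image test names come from the commit messages.

```python
import re


def build_perf_regex(perf_tests):
    # Anchored alternation, mirroring the ^(a|b|c)$ shape quoted in the commit message.
    return re.compile("^(" + "|".join(re.escape(t) for t in perf_tests) + ")$")


def filter_groups(groups, perf_tests, perf_only):
    # Assumption: with perf-only off, or with no perf test list, nothing is filtered.
    if not perf_only or not perf_tests:
        return dict(groups)
    pattern = build_perf_regex(perf_tests)
    kept = {}
    for name, tests in groups.items():
        matching = [t for t in tests if pattern.match(t)]
        if matching:
            kept[name] = matching
        else:
            # Groups without any perf-emitting test are skipped, with a log line.
            print(f"skipping group {name}: no perf-emitting tests")
    return kept


# Hypothetical input data for illustration only.
groups = {
    "groupImagesPerf": ["runImageElephantTest", "runImageFruitPlateTest", "runChatTest"],
    "groupMisc": ["runChatTest"],
}
perf_tests = ["runImageElephantTest", "runImageFruitPlateTest"]
scheduled = filter_groups(groups, perf_tests, perf_only=True)
```

With this input, only `groupImagesPerf` would be scheduled, reduced to its two perf-emitting tests, and `groupMisc` would be skipped.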
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
…omposite (#2153)
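The STOPPED handling described in the feat(mobile-monitor) commit message can be sketched roughly as follows. This is an illustration of the stated rules, not the monitor's real code. The function name `classify_device` and the counters dictionary shape (keys `passed`, `failed`, `errored`) are assumptions, as is the requirement that at least one test passed for counters to count as clean. `USER_PASSED` and `USER_FAILED` are the outcome names used in the commit message.

```python
def classify_device(result, counters):
    # counters is assumed to be a dict such as {"passed": 3, "failed": 0, "errored": 0}.
    passed = counters.get("passed", 0)
    failed = counters.get("failed", 0)
    errored = counters.get("errored", 0)

    # WARNED is treated the same as PASSED.
    if result in ("PASSED", "WARNED"):
        return "USER_PASSED", ""

    breakdown = f"passed={passed} failed={failed} errored={errored}"

    # STOPPED with clean counters: the job timeout hit after the tests finished,
    # so report a warning instead of a failure.
    if result == "STOPPED" and failed == 0 and errored == 0 and passed > 0:
        return "USER_PASSED", f"warning: stopped after tests completed ({breakdown})"

    # STOPPED with failures, FAILED, ERRORED, and anything else: fail with the breakdown.
    return "USER_FAILED", breakdown
```

A caller would exit non-zero only when at least one device comes back as `USER_FAILED`, which matches the "before/now" behaviour change the commit describes.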
What problem does this PR solve?
Fixes the mobile integration tests on Linux runners (Android), which broke because Node.js is no longer installed globally.
How does it solve it?
Allows the Node.js setup step to run on Linux, and allows the global Expo install to go into the runner agent user's home directory.
Breaking changes
None