infra: fix mobile int tests on linux due to now missing global nodejs by tamer-hassan-tether · Pull Request #2123 · tetherto/qvac

tamer-hassan-tether · 2026-05-19T13:36:27Z

What problem does this PR solve?

Fixes the mobile integration tests on Linux (Android), which broke because Node.js is no longer installed globally on the runners.

How does it solve it?

Allows the Node.js setup step to run on Linux, and performs the global Expo install in the runner agent user's home directory.
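The workflow diff itself is not shown on this page. The sketch below assumes one common way to do a "global" npm install into the runner user's home: point npm's `prefix` at a directory under `$HOME` and put its `bin` directory on `PATH`. The helper name and the `.npm-global` directory are hypothetical, not taken from the PR.

```python
from pathlib import PurePosixPath


def user_global_npm_env(home: str, base_env: dict[str, str]) -> dict[str, str]:
    """Return a copy of base_env in which `npm install -g` targets the user's home.

    Hypothetical helper; the `.npm-global` directory name is an assumption.
    """
    prefix = PurePosixPath(home) / ".npm-global"
    env = dict(base_env)  # never mutate the caller's environment
    # npm treats `npm_config_<key>` environment variables as config values;
    # `prefix` decides where global packages and their bin links are placed.
    env["npm_config_prefix"] = str(prefix)
    # On Linux, global executables are linked into <prefix>/bin, so prepend it
    # to PATH so that the globally installed CLI resolves in later steps.
    bin_dir = str(prefix / "bin")
    current = env.get("PATH", "")
    env["PATH"] = f"{bin_dir}:{current}" if current else bin_dir
    return env


env = user_global_npm_env("/home/runner", {"PATH": "/usr/local/bin:/usr/bin"})
print(env["npm_config_prefix"])  # /home/runner/.npm-global
print(env["PATH"])  # /home/runner/.npm-global/bin:/usr/local/bin:/usr/bin
```

In a real step, `base_env` would typically be `dict(os.environ)`, and the result would be passed to something like `subprocess.run(["npm", "install", "-g", ...], env=env, check=True)`. The exact Expo package the PR installs is not stated here. Installing under `$HOME` avoids needing root access to the system-wide Node.js prefix on a self-hosted runner.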

Breaking changes

None

github-actions · 2026-05-19T13:37:48Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (2/1)

**Bypass rule:** Triggered (2+ Team Lead approvals (Tier 1 exception)). This PR is approved regardless of tier.

---
*This comment is automatically updated when reviews change.*
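The Tier 1 rule reported by the bot can be read as a small decision function. This is a sketch that assumes the bot only compares approval counts per role. The function and field names are hypothetical, and this is not the bot's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier1Status:
    approved: bool
    bypass_triggered: bool


def evaluate_tier1(
    member_approvals: int, lead_or_mgmt_approvals: int, team_lead_approvals: int
) -> Tier1Status:
    # Bypass (Tier 1 exception): 2+ Team Lead approvals approve the PR outright.
    bypass = team_lead_approvals >= 2
    # Normal path: at least 1 Team Member AND at least 1 Team Lead or Management approval.
    meets_requirements = member_approvals >= 1 and lead_or_mgmt_approvals >= 1
    return Tier1Status(approved=bypass or meets_requirements, bypass_triggered=bypass)


# Counts reported above: member 1/1, lead/management 2/1, bypass triggered by 2 leads.
status = evaluate_tier1(member_approvals=1, lead_or_mgmt_approvals=2, team_lead_approvals=2)
print(status)  # Tier1Status(approved=True, bypass_triggered=True)
```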


…omposite (#2153)

* refactor(mobile-test): migrate remaining 9 addons onto shared composite (QVAC-18168 follow-up)

  Rebased clean on main after PR #1913 merged. Each monolithic mobile workflow (~1400-1800 lines) replaced with a thin composite-based shim (~170-230 lines).

  Addons migrated: embed-llamacpp, bci-whispercpp, transcription-whispercpp, transcription-parakeet, decoder-audio, diffusion-cpp, classification-ggml, tts-onnx (q4/q4f16 variant matrix), tts-ggml

  Composite extensions (backwards-compatible, no change for LLM/OCR/NMT):
  - setup: skip-prebuilds input (decoder-audio has no own prebuilds)
  - monitor: max-wait-time-seconds input (tts-onnx needs 3h)

  Addon-side provision scripts (matching NMT's pattern):
  - packages/tts-ggml/scripts/provision-mobile-models.sh
  - packages/transcription-parakeet/scripts/provision-mobile-models.sh

  Runner alignment: all shims use qvac-ubuntu2404-x64 for Android (matching main's latest self-hosted strategy from PR #2021/#2123).

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-test): parakeet — match monolith's mocha/WDIO timeouts (45min / 10min)

  Main monolith uses timeout: 2700000 (45min) and waitforTimeout: 600000 (10min). Our composite defaults to 1800000 (30min) and 120000 (2min). The slower parakeet tests (sortformer inference on Pixel 9a) exceed 30min and time out. Pass mocha-timeout-ms: 2700000 and wdio-waitfor-timeout-ms: 600000 to upload-to-devicefarm to match the monolith.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-test): match monolith mocha/WDIO timeouts for 4 addons + add tts-onnx perf filter

  Deep audit of all 9 monoliths revealed custom timeout values that our shims were missing (using composite defaults instead):
  - bci-whispercpp: mocha 900000 (15min), was 1800000
  - transcription-whispercpp: mocha 900000 (15min), was 1800000
  - decoder-audio: mocha 600000 (10min), was 1800000
  - tts-ggml: mocha 2700000 (45min) + waitfor 600000 (10min)

  Also: tts-onnx monolith used --filter supertonic on perf extraction to exclude Chatterbox rows from reports. Added filter: 'supertonic' to the extract-addon-perf call.

  embed-llamacpp, diffusion-cpp, classification-ggml, tts-onnx all matched the composite defaults (1800000 / 120000); no change needed. transcription-parakeet was already fixed in the previous commit.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-setup): always try artifact download, fall back to npm only when empty

  The artifact-download steps were gated behind github.event_name != 'workflow_dispatch', which skipped them on workflow_dispatch even when on-pr-* had just produced fresh prebuild artifacts in sibling jobs. This caused workflow_dispatch runs to always fall back to npm, getting outdated/smaller prebuilds (e.g. parakeet 20 MB from npm vs 68 MB from fresh artifacts).

  Fix: remove the event_name gate from artifact download (with continue-on-error: true it's safe to run when no artifacts exist). The npm-fallback step now checks if prebuilds/ already has content from artifacts before attempting npm pack.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-schedule): bump default Device Farm jobTimeoutMinutes from 60 to 90

  Pixel 9 Pro runs LLM VLM inference ~1.7x slower than Samsung S25/S26 Ultra. The groupImagesPerf shard takes ~56 min on Pixel, and Device Farm's 60-min job timeout STOPS the run during teardown even though all 3 tests passed. Bumping to 90 min gives enough headroom. NMT already overrides to 120 via the consumer shim.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(mobile-monitor): distinguish STOPPED-but-passed from real failures

  Device Farm result=STOPPED means the jobTimeoutMinutes cap expired, not that tests failed. When a device is STOPPED but its counters show 0 failed / 0 errored / N passed, the tests all completed successfully; DF just killed the teardown phase.

  Before: STOPPED counted as USER_FAILED, triggering exit 1 even though every test passed. This burned investigation time.

  Now:
  - STOPPED with clean counters → ⚠️ warning + USER_PASSED
  - STOPPED with actual failures → ❌ with counter breakdown
  - WARNED → treated as success (same as PASSED)
  - FAILED / ERRORED → ❌ with counter breakdown

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-upload): implement perf-only group filtering via perf-test-regex input

  The monolith had an inline filter_perf() + maybe_make_and_upload() that intersected each test group's grep with a perf-test regex when qvac_perf_only=true, skipping groups with no matching perf tests. This was lost when the composite was created: qvac-perf-only was threaded through to on-device config but the scheduling-side filter was missing. Result: benchmark runs scheduled ALL test groups on ALL devices instead of the perf-emitting subset.

  New perf-test-regex input on upload-to-devicefarm: when qvac-perf-only=true and perf-test-regex is set, each group's grep is filtered to only keep matching tests. Empty groups are skipped with a clear log message.

  LLM consumer now passes the same PERF_REGEX the monolith used: ^(runImageElephantTest|runImageFruitPlateTest|runImageHighResAuroraTest|runBitnetTest|runToolCallingTest)$

  Other addons don't use qvac_perf_only so they're unaffected.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: move perf-test filter regex to consumer input instead of test-groups.json

  The perf_tests key in test-groups.json broke the LLM addon's generate-mobile-integration-tests.js validator, which treats every top-level key as a platform and expects all tests to be covered. Match the original monolith approach: the perf-emitting test regex is supplied by the consumer workflow via a new `perf-test-regex` composite input, keeping test-groups.json identical to main.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: convention-based perf-test filtering via perf-tests.json

  Replace the hardcoded perf-test-regex consumer input with a convention-based auto-discovery file (perf-tests.json) that sits alongside test-groups.json in the addon's test/mobile/ directory. The composite reads the file when qvac_perf_only=true, builds the filter regex from the array, and skips groups with no perf-emitting tests. No consumer workflow changes needed: addons opt in by dropping a perf-tests.json file.

  Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
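The perf-only scheduling filter described in the last commits can be sketched as follows. The sketch makes two assumptions. First, perf-tests.json is a flat JSON array of test names. Second, each group is modeled as a plain list of test names rather than the real grep pattern in test-groups.json. The group names other than groupImagesPerf, and the test names runImageCaptionTest and runChatTest, are hypothetical.

```python
import json
import re


def build_perf_regex(perf_tests: list[str]) -> str:
    # Anchored alternation, the same shape as the monolith's PERF_REGEX:
    # ^(testA|testB|...)$ ; names are escaped in case they contain metacharacters.
    return "^(" + "|".join(re.escape(name) for name in perf_tests) + ")$"


def filter_groups(
    groups: dict[str, list[str]], perf_tests: list[str], perf_only: bool
) -> dict[str, list[str]]:
    # Without qvac_perf_only, every group is scheduled unchanged.
    if not perf_only:
        return groups
    pattern = re.compile(build_perf_regex(perf_tests))
    kept: dict[str, list[str]] = {}
    for group, tests in groups.items():
        matching = [t for t in tests if pattern.match(t)]
        if not matching:
            # Groups with no perf-emitting tests are skipped with a log line.
            print(f"Skipping group {group}: no perf-emitting tests")
            continue
        kept[group] = matching
    return kept


# Contents a perf-tests.json might hold (assumed to be a flat JSON array).
perf_tests = json.loads('["runImageElephantTest", "runBitnetTest"]')
groups = {
    "groupImagesPerf": ["runImageElephantTest", "runImageCaptionTest"],
    "groupChat": ["runChatTest"],
    "groupBitnet": ["runBitnetTest"],
}
print(filter_groups(groups, perf_tests, perf_only=True))
# → {'groupImagesPerf': ['runImageElephantTest'], 'groupBitnet': ['runBitnetTest']}
```

Anchoring the regex with `^(...)$` keeps a listed name such as `runBitnetTest` from also matching a longer test name that merely starts with it.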


infra: fix mobile int tests on linux due to now missing global nodejs

7e848cf

tamer-hassan-tether requested review from a team as code owners May 19, 2026 13:36

gianni-cor approved these changes May 19, 2026


NamelsKing approved these changes May 19, 2026


tamer-hassan-tether had a problem deploying to release May 19, 2026 13:42 with GitHub Actions (Failure)

gianni-cor merged commit 4fbf09f into main May 19, 2026
238 of 253 checks passed

gianni-cor deleted the tmp-fix-linux-nodejs-mob-int-tests branch May 19, 2026 13:47

tamer-hassan-tether temporarily deployed to release May 19, 2026 13:56 with GitHub Actions (Inactive)

tobi-legan mentioned this pull request May 20, 2026

QVAC-18168 follow-up: migrate remaining 9 mobile addons onto shared composite #2153 (Merged, 4 tasks)

Proletter pushed a commit that referenced this pull request May 24, 2026

infra: fix mobile int tests on linux due to now missing global nodejs (#2123) 2ddd17a

gianni-cor merged 1 commit into main from tmp-fix-linux-nodejs-mob-int-tests