QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile#1840
Merged
Merged
Conversation
…ow_dispatch GitHub requires a `workflow_dispatch` workflow to exist on the default branch before it shows up in the Actions tab and becomes triggerable with `--ref <feature-branch>`. This lands the LLM benchmark workflow on `main` so the QVAC-17830 perf-metrics feature branch can be dispatched against it for end-to-end validation. Changes: - `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` (new): manual `workflow_dispatch` only — mirrors the structure of the existing Parakeet / Whispercpp benchmark workflows. Calls `prebuilds-...yml` then `integration-test-...yml` with bench-mode iteration counts (`QVAC_PERF_RUNS=3`, `QVAC_PERF_WARMUP_RUNS=1` by default), then aggregates desktop artifacts into a combined HTML / step-summary. Phase-1 scope is desktop only — mobile (Device Farm) needs a build-time hook in the test app to thread env vars through to bare and is tracked as a QVAC-18111 follow-up. - `integration-test-qvac-lib-infer-llamacpp-llm.yml`: thread `qvac_perf_runs` / `qvac_perf_warmup_runs` through `workflow_call` + `workflow_dispatch` and surface them as `QVAC_PERF_RUNS` / `QVAC_PERF_WARMUP_RUNS` on the Linux/macOS and Windows test run steps. Empty string => unset, so the umbrella PR workflow continues to honour the test-side default and PR runs are unaffected by this change. Per the perf policy agreed on Slack (2026-04-30): the umbrella on-pr workflow runs perf tests at the cheap default so we don't pay full perf cost on every PR; this dedicated workflow is the only place we crank up the iteration counts to produce mean ± std numbers. Made-with: Cursor
…rkflow Made-with: Cursor
…w_dispatch Made-with: Cursor
5 tasks
…a pushFile Extends the mobile integration workflow with the same iteration-count inputs as the desktop reusable workflow, and adds a `mobile-benchmarks` job to the LLM benchmark dispatch so it covers Device Farm too. The bare runtime on Device Farm doesn't see GitHub Actions env vars, so we mirror the existing `testFilter.txt` pattern: when the workflow inputs are non-empty, the WDIO before-hook pushes a `qvacPerfConfig.txt` to the device (Android: `/data/local/tmp/`, iOS: `@bundleId:documents/`) with the iteration overrides as KEY=VALUE lines. The file-reading side on bare lives on the QVAC-17830 perf branch — without that branch this PR is a no-op (orphan file), so it is safe to land independently. Changes: - `integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml`: add `qvac_perf_runs` / `qvac_perf_warmup_runs` to `workflow_call` and `workflow_dispatch`; add `__QVAC_PERF_RUNS__` / `__QVAC_PERF_WARMUP_RUNS__` placeholders to the Android + iOS WDIO config blobs and the corresponding pushFile block in the `before` hook; substitute the placeholders in `make_split`. - `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml`: add a `mobile-benchmarks` job calling the mobile workflow with the bench-mode iteration counts; have `summarize` `needs:` it; drop the "desktop only" caveat in the step-summary blurb. PR runs are unchanged: empty input ⇒ empty placeholder ⇒ before-hook skips the perf-config push. Made-with: Cursor
…_dispatch Made-with: Cursor
b961099 to
9cdf07b
Compare
Resolves add/add conflict in benchmark-performance-qvac-lib-infer-llamacpp-llm.yml by keeping the mobile-aware version from this branch (run_mobile input, mobile-benchmarks job, summarize wired through both desktop+mobile). Made-with: Cursor
Contributor
Author
|
/review |
Contributor
Tier-based Approval Status |
Proletter
approved these changes
May 4, 2026
Collaborator
|
review |
tamer-hassan-tether
approved these changes
May 4, 2026
Contributor
Author
|
/review |
GustavoA1604
approved these changes
May 4, 2026
Contributor
|
/review |
tobi-legan
added a commit
that referenced
this pull request
May 4, 2026
Picks up the QVAC-18111 scaffold PRs that landed on main: - #1839: Benchmark Performance (LLM) workflow_dispatch + qvac_perf_runs/ qvac_perf_warmup_runs plumbing on integration-test-...yml - #1840: bridge of those inputs into integration-mobile-test-...yml via pushFile of qvacPerfConfig.txt onto the device Conflict resolution: - benchmark-performance-...yml: take main entirely (canonical, has run_desktop+run_mobile toggles + mobile-benchmarks job) - integration-test-...yml: keep this branch's HTML perf report generation step + per-runner perf-report-llamacpp-llm-* upload, drop the chatty inline comments around qvac_perf_runs (config is now self-documenting in the workflow surface) - integration-mobile-test-...yml: take main's WDIO_CONFIG for Android (has the __QVAC_PERF_RUNS__/qvacPerfConfig.txt pushFile bridge) Made-with: Cursor
tobi-legan
added a commit
that referenced
this pull request
May 4, 2026
Closes the missing leg in the QVAC-18111 mobile bench-mode bridge. The Benchmark Performance (LLM) workflow_dispatch already pushes a KEY=VALUE qvacPerfConfig.txt onto the device via Appium pushFile (PR #1840) at /data/local/tmp/qvacPerfConfig.txt on Android and @bundleId:documents/qvacPerfConfig.txt on iOS. Without a consumer the file just sat there and tests silently kept the PR-default 1+1 count. This adds a tryLoadPerfConfig helper to integration-runtime.cjs that mirrors the existing testFilter.txt loader: read the file from the same two paths, parse KEY=VALUE per line, inject each via bare-os setEnv, then unlink. The load is piggy-backed onto the first __shouldRunTest call, which fires before runIntegrationModule dynamically imports any test module — so the existing os.getEnv() lookups in _image-common.js / bitnet.test.js / tool-calling.test.js see the dispatched value at their own module init time. Empty file / missing file is a no-op so PR runs are unaffected. Made-with: Cursor
tobi-legan
added a commit
that referenced
this pull request
May 4, 2026
…time Mobile side of QVAC-18111: now that the umbrella benchmark workflow pushes qvacPerfConfig.txt onto the device (PR #1840), the bare runtime needs to actually consume it. Without this, dispatched perf overrides were silently ignored on Android / iOS and tests defaulted to 1+1. Changes: - packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration-runtime.cjs: read qvacPerfConfig.txt from the same paths the testFilter loader uses (/data/local/tmp on Android, global.testDir on iOS), parse KEY=VALUE per line, inject each into bare-os via os.setEnv. Reads on first __shouldRunTest call, which fires before runIntegrationModule imports any test module — so os.getEnv() inside _image-common.js / bitnet.test.js / tool-calling.test.js sees the dispatched value at module init. Normalises both literal '\\n' (the bash-escaped form the WDIO before-hook produces today) and real '\\r?\\n' so the parser doesn't depend on which encoding the workflow happens to use. - .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml: extract the two giant Android / iOS WDIO_CONFIG JS strings out of the 'Create and Upload Test Spec' run: block into env: vars. GitHub Actions imposes a hard 21 000-char limit on any run: block that contains ${{ }} expressions, and the new Android image-perf group + iOS heavy7/heavy8 splits + collapsed-format echo line for the test split summary pushed us over. Moving the WDIO blobs to env: drops the run block from ~21 KB to ~11 KB so we have headroom for further additions. Made-with: Cursor
tamer-hassan-tether
pushed a commit
that referenced
this pull request
May 5, 2026
…a pushFile (#1840) * QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch GitHub requires a `workflow_dispatch` workflow to exist on the default branch before it shows up in the Actions tab and becomes triggerable with `--ref <feature-branch>`. This lands the LLM benchmark workflow on `main` so the QVAC-17830 perf-metrics feature branch can be dispatched against it for end-to-end validation. Changes: - `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` (new): manual `workflow_dispatch` only — mirrors the structure of the existing Parakeet / Whispercpp benchmark workflows. Calls `prebuilds-...yml` then `integration-test-...yml` with bench-mode iteration counts (`QVAC_PERF_RUNS=3`, `QVAC_PERF_WARMUP_RUNS=1` by default), then aggregates desktop artifacts into a combined HTML / step-summary. Phase-1 scope is desktop only — mobile (Device Farm) needs a build-time hook in the test app to thread env vars through to bare and is tracked as a QVAC-18111 follow-up. - `integration-test-qvac-lib-infer-llamacpp-llm.yml`: thread `qvac_perf_runs` / `qvac_perf_warmup_runs` through `workflow_call` + `workflow_dispatch` and surface them as `QVAC_PERF_RUNS` / `QVAC_PERF_WARMUP_RUNS` on the Linux/macOS and Windows test run steps. Empty string => unset, so the umbrella PR workflow continues to honour the test-side default and PR runs are unaffected by this change. Per the perf policy agreed on Slack (2026-04-30): the umbrella on-pr workflow runs perf tests at the cheap default so we don't pay full perf cost on every PR; this dedicated workflow is the only place we crank up the iteration counts to produce mean ± std numbers. Made-with: Cursor * QVAC-18111 chore[notask]: trim chatty inline comments in benchmark workflow Made-with: Cursor * QVAC-18111 chore[notask]: add run_desktop toggle to benchmark workflow_dispatch Made-with: Cursor * QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile Extends the mobile integration workflow with the same iteration-count inputs as the desktop reusable workflow, and adds a `mobile-benchmarks` job to the LLM benchmark dispatch so it covers Device Farm too. The bare runtime on Device Farm doesn't see GitHub Actions env vars, so we mirror the existing `testFilter.txt` pattern: when the workflow inputs are non-empty, the WDIO before-hook pushes a `qvacPerfConfig.txt` to the device (Android: `/data/local/tmp/`, iOS: `@bundleId:documents/`) with the iteration overrides as KEY=VALUE lines. The file-reading side on bare lives on the QVAC-17830 perf branch — without that branch this PR is a no-op (orphan file), so it is safe to land independently. Changes: - `integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml`: add `qvac_perf_runs` / `qvac_perf_warmup_runs` to `workflow_call` and `workflow_dispatch`; add `__QVAC_PERF_RUNS__` / `__QVAC_PERF_WARMUP_RUNS__` placeholders to the Android + iOS WDIO config blobs and the corresponding pushFile block in the `before` hook; substitute the placeholders in `make_split`. - `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml`: add a `mobile-benchmarks` job calling the mobile workflow with the bench-mode iteration counts; have `summarize` `needs:` it; drop the "desktop only" caveat in the step-summary blurb. PR runs are unchanged: empty input ⇒ empty placeholder ⇒ before-hook skips the perf-config push. Made-with: Cursor * QVAC-18111 chore[notask]: add run_mobile toggle to benchmark workflow_dispatch Made-with: Cursor --------- Co-authored-by: Proletter <40578159+Proletter@users.noreply.github.com>
tamer-hassan-tether
pushed a commit
that referenced
this pull request
May 6, 2026
…a pushFile (#1840) * QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch GitHub requires a `workflow_dispatch` workflow to exist on the default branch before it shows up in the Actions tab and becomes triggerable with `--ref <feature-branch>`. This lands the LLM benchmark workflow on `main` so the QVAC-17830 perf-metrics feature branch can be dispatched against it for end-to-end validation. Changes: - `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` (new): manual `workflow_dispatch` only — mirrors the structure of the existing Parakeet / Whispercpp benchmark workflows. Calls `prebuilds-...yml` then `integration-test-...yml` with bench-mode iteration counts (`QVAC_PERF_RUNS=3`, `QVAC_PERF_WARMUP_RUNS=1` by default), then aggregates desktop artifacts into a combined HTML / step-summary. Phase-1 scope is desktop only — mobile (Device Farm) needs a build-time hook in the test app to thread env vars through to bare and is tracked as a QVAC-18111 follow-up. - `integration-test-qvac-lib-infer-llamacpp-llm.yml`: thread `qvac_perf_runs` / `qvac_perf_warmup_runs` through `workflow_call` + `workflow_dispatch` and surface them as `QVAC_PERF_RUNS` / `QVAC_PERF_WARMUP_RUNS` on the Linux/macOS and Windows test run steps. Empty string => unset, so the umbrella PR workflow continues to honour the test-side default and PR runs are unaffected by this change. Per the perf policy agreed on Slack (2026-04-30): the umbrella on-pr workflow runs perf tests at the cheap default so we don't pay full perf cost on every PR; this dedicated workflow is the only place we crank up the iteration counts to produce mean ± std numbers. Made-with: Cursor * QVAC-18111 chore[notask]: trim chatty inline comments in benchmark workflow Made-with: Cursor * QVAC-18111 chore[notask]: add run_desktop toggle to benchmark workflow_dispatch Made-with: Cursor * QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile Extends the mobile integration workflow with the same iteration-count inputs as the desktop reusable workflow, and adds a `mobile-benchmarks` job to the LLM benchmark dispatch so it covers Device Farm too. The bare runtime on Device Farm doesn't see GitHub Actions env vars, so we mirror the existing `testFilter.txt` pattern: when the workflow inputs are non-empty, the WDIO before-hook pushes a `qvacPerfConfig.txt` to the device (Android: `/data/local/tmp/`, iOS: `@bundleId:documents/`) with the iteration overrides as KEY=VALUE lines. The file-reading side on bare lives on the QVAC-17830 perf branch — without that branch this PR is a no-op (orphan file), so it is safe to land independently. Changes: - `integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml`: add `qvac_perf_runs` / `qvac_perf_warmup_runs` to `workflow_call` and `workflow_dispatch`; add `__QVAC_PERF_RUNS__` / `__QVAC_PERF_WARMUP_RUNS__` placeholders to the Android + iOS WDIO config blobs and the corresponding pushFile block in the `before` hook; substitute the placeholders in `make_split`. - `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml`: add a `mobile-benchmarks` job calling the mobile workflow with the bench-mode iteration counts; have `summarize` `needs:` it; drop the "desktop only" caveat in the step-summary blurb. PR runs are unchanged: empty input ⇒ empty placeholder ⇒ before-hook skips the perf-config push. Made-with: Cursor * QVAC-18111 chore[notask]: add run_mobile toggle to benchmark workflow_dispatch Made-with: Cursor --------- Co-authored-by: Proletter <40578159+Proletter@users.noreply.github.com>
gianni-cor
added a commit
that referenced
this pull request
May 12, 2026
… & GPU) (#1843) * QVAC-17830 feat: add VLM perf metrics with multi-run averaging Adds CPU + GPU VLM performance reporting to the llamacpp-llm addon on desktop + mobile (iOS / Android Device Farm), mirroring the OCR perf pipeline. Perf reporter / aggregator - scripts/test-utils/performance-reporter.js: extend `vision` addon type with backend, platform, prefill, decode, vision_encode, image_prefill and status columns. - scripts/perf-report/utils.js: teach the aggregator the new VLM metric labels so markdown/HTML reports render them. CI wiring - integration-test-qvac-lib-infer-llamacpp-llm.yml (desktop): add aggregate + upload-artifact steps so CPU/GPU perf artifacts are produced alongside the existing test run. - integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml (mobile): scan Device Farm logs for [PERF_REPORT_START]...[PERF_REPORT_END] markers, run extract-from-log.js + aggregate.js, upload per-platform perf artifacts. Mirrors the OCR mobile workflow pattern. Test refactor (image.test.js) - Include image name in the perf label ("[elephant] [CPU]" etc.) so aggregate.js (which groups by `result.test`) no longer pools different images into a single CPU/GPU cell. - Run 1 warmup + 3 counted inferences per (image x backend) so the aggregator produces count=3, mean, std within a single CI run -- matches OCR's PERF_RUNS=3 convention. - Reuse the loaded LlmLlamacpp instance across iterations inside a single test() block: deliberate divergence from OCR's one-test()- per-iteration shape because VLM model load is multi-GB and already stresses iOS Device Farm memory. - Bump per-test timeout to 25 min for the multi-iter perf tests; leave TEST_CONSTANTS.timeout unchanged for other tests in the file. - Keyword assertions run once against the last counted iteration's output. vision_encode_time_ms / image_prefill_time_ms are stubbed as null pending native RuntimeStats instrumentation (separate follow-up task). * QVAC-17830 feat: wire Mobile LLM into perf-report.yml weekly aggregator Mirrors the OCR desktop/mobile split. Adds a "Vision/LLM (Mobile)" block to the scheduled cron job so Monday 9am UTC runs aggregate perf artifacts from Mobile Integration Tests (LLM) into a separate llamacpp-llm-mobile-performance.{md,json,html} report alongside the existing desktop one. Also adds "Mobile Integration Tests (LLM)" to the workflow_dispatch workflow_name choice list so it can be triggered manually. Without this, the per-run Android/iOS perf artifacts the mobile workflow now uploads would sit unused by the weekly cross-run aggregator. * QVAC-17830 feat: add per-run joint perf reporter to mobile LLM workflow Mirrors the `combine-reports` job in the OCR mobile workflow. Runs after the per-platform matrix completes and downloads every artifact matching `perf-report-llamacpp-llm-*-<run_number>` -- which, because the desktop (integration-test-*.yml) and mobile (this file) workflows are both invoked via workflow_call from the umbrella On PR Trigger (LLM), share the same github.run_number and therefore live together in the single run's artifact scope. The job then: - aggregates all desktop + mobile rows into a single combined HTML, Markdown and JSON report via scripts/perf-report/aggregate.js; - patches the `device.name` for desktop artifacts using the artifact suffix (e.g. "linux-x64", "macos-arm64") so they render with recognisable column headers next to the mobile device names; - emits a per-device HTML report for each unique device encountered (desktop OS/arch variants + each mobile device); - writes the combined markdown report into $GITHUB_STEP_SUMMARY so the PR run page shows one unified perf table covering every platform the addon ran on in that run. This is what the user actually meant by "joint performance reporter" -- distinct from the weekly/cross-run aggregation in perf-report.yml, which aggregates historical runs for trend tracking. * QVAC-17830 fix: preserve mobile perf data on OOM, split image tests per device Android perf data was never surfacing because the inline fallback reporter used on mobile was a no-op stub (no record/writeToConsole), so logcat never saw PERF_REPORT markers. iOS image tests were also crashing mid-run (Jetsam/memorystatus) before process.on('exit') could flush, losing whichever runs had already completed. - Replace the mobile fallback reporter in _image-common.js with a functional inline implementation that mirrors OCR's: record(), writeReport(), and writeToConsole() with lightweight snapshots and logcat-safe chunking. - Flush incrementally after every record() on mobile so earlier iterations survive a SIGKILL later in the same group. - Split image.test.js into per-image files (elephant, fruit plate, high-res aurora) behind a shared _image-common.js helper, and wire each into its own Device Farm group (Android: 3 new groups; iOS: heavy3=elephant, heavy7=fruit, heavy8=aurora). - Teach extract-from-log.js to --merge multiple reports per device (needed because each physical device now runs three groups) and add --device-from-filename as a fallback when logs land in a flat layout. - Update the mobile workflow to schedule 5 Android / 10 iOS runs, bucket downloaded artifacts by device, unzip Customer_Artifacts inline so bare_console.log is scannable, and invoke extract-from-log.js with --merge. Verified locally with a happy-path sim (3 samples per image×backend per device) and a Jetsam crash sim (aurora SIGKILLed mid-GPU on iPhone 16 Pro Max) — 16 of 18 rows still reach the aggregated report on the crashed device, and the other three groups report fully. Made-with: Cursor * QVAC-17830 fix: preserve mobile perf data under iOS V8 Zone OOM Fruit plate iOS test crashed inside Builtin_JsonStringify because the per-test flush stringified the cumulative results array on every record() call (3+ runs x 2 backends x verbose VLM output over a 10MB image + Metal compiler memory). Even with logcat chunking, the JSON build itself exhausted V8's Zone allocator, producing SIGTRAP from FatalProcessOutOfMemory before any data reached logcat. Changes in the mobile fallback reporter (_image-common.js): - Cap stored output text to 400 chars per record so the accumulated results array stays small regardless of model verbosity. - Add delta mode to writeToConsole: emit only the latest row per flush, so stringify stays O(1) in iteration count on mobile. - Stop calling writeReport() on disk per-record; rewriting the full JSON every iteration is both expensive and an OOM hazard. One final writeReport runs in the exit hook for the on-device artifact. - Remove the final cumulative writeToConsole in the exit hook: the per-test delta emits already carry every row, and emitting a large cumulative payload at exit is exactly when we most need the earlier deltas to survive in logcat/syslog. Changes in extract-from-log.js: - Split extractFromText into extractAllFromText (returns every report parsed from the text) + the legacy extractFromText wrapper that returns the last one. - Add opts.all to extractFromFile so --merge gathers every delta emit from a log file, not just the largest or last marker pair. - main() with --merge now pushes all reports per file into the device bucket before handing them to mergeDeviceReports, so partial data from a pre-crash run still lands in the aggregate. Verified locally with a delta-emit simulation that mid-test crashes leave runs 1..N-1 intact in the reconstructed report. Made-with: Cursor * QVAC-17830 fix: plug combined perf report gaps (desktop race + artifact collision) Two pre-existing bugs caused the combined performance table to drop data on a PR run: 1. `darwin-x64` (and any other slow desktop job) was missing because `combine-reports` lived inside the mobile workflow and only waited on the mobile matrix. Desktop jobs that finished AFTER mobile were silently excluded. Move the job to the umbrella (`on-pr-qvac-lib-infer-llamacpp-llm.yml`) as `combine-perf-reports` with `needs: [run-integration-tests, run-mobile-integration-tests]`, so it blocks on the full fan-out before scanning artifacts. 2. `linux-x64` GPU data (and the ubuntu-22 vs ubuntu-24 arm64 pair) was overwritten because every matrix entry uploaded its artifact under `perf-report-llamacpp-llm-<platform>-<arch>-<run>`; any two entries sharing platform+arch clobbered each other on download. Add a unique `label` field to each matrix entry (`linux-x64-cpu`, `linux-x64-gpu`, `linux-arm64-u22`, `linux-arm64-u24`, ...) and key the upload + device-name patch on `matrix.label` instead. Made-with: Cursor * QVAC-17830 fix: iOS fruit plate retry + consolidate Android images + flush logs on crash Three changes scoped to the mobile LLM integration workflow: * Android: merge the three per-image Device Farm groups back into a single groupImagesPerf run. Android does not have the iOS jetsam pressure, so running all three image tests in one invocation is both cheaper and simpler while still giving the aggregator 9 rows (3 images x 3 iters) per device. * iOS: schedule Heavy7 (fruit plate) a second time as Heavy7-ImgFruitPlate-Retry. Both attempts land in the same per-device folder on download, and extract-from-log.js --merge dedupes on the metric fingerprint, so if either attempt gets past iOS jetsam we have perf data. If both succeed we just get more iterations to average over, which is strictly better. run_count goes 10 -> 11 and the Monitor / Download loops were extended accordingly. * Crash path (both platforms): route process.exit(1) in checkAppCrash and crashMonitor through a new global.flushLogsAndExit helper that (a) sleeps 5s so bare stdout + iOS os_log / Android logcat can drain and (b) on iOS performs an inline Appium pull_file of bare_console.log to $DEVICEFARM_LOG_DIR before exiting. This means delta perf emits that hit disk before a jetsam kill now reach the aggregator instead of being stuck in the app sandbox. Made-with: Cursor * QVAC-17830 fix: inline crash flush-delay, drop duplicated pull helper Previous commit pushed the `Create and Upload Test Spec` run: block over GitHub Actions' ~30K template-expression budget, so the umbrella workflow refused to dispatch with: Line 927, Col 14: Exceeded max expression length 21000 Root cause: `global.flushLogsAndExit` was inlined into BOTH Android and iOS wdio configs. On iOS it duplicated the exact Appium pull_file logic the `after` hook already has, and on Android it carried dead iOS-only code — roughly 1.8 kB of repeated JS across the two configs. Instead: * Drop `global.flushLogsAndExit` entirely. * In `checkAppCrash` and `crashMonitor` (both Android + iOS), replace `process.exit(1)` with `setTimeout(function(){process.exit(1);},5000)`. Same 5s drain window for logcat / os_log / bare stdout, one line of JS, zero helper definition. * iOS `after` hook still pulls `bare_console.log` via Appium on the normal path, so delta perf data from successful runs is unchanged. Net `run:` block size goes from 31,524 → 28,992 chars (below the previously-working 28,995). Both wdio configs revalidated with `node --check` and the full workflow YAML round-trips through PyYAML. Made-with: Cursor * QVAC-17830 fix: iOS fruit plate warmup + merge linux legs + mobile detail tables - Add iOS-only tiny-image (elephant.jpg) warmup to fruit plate test so Metal shaders / KV cache / image-prefill buffer are allocated with a cheap 23 KB image before the 10 MB PNG hits, keeping us under Jetsam's per-process memory cap. Opt-in via testCase.iosWarmupImage so desktop and Android paths remain byte-for-byte identical. - Fold linux-x64-cpu / linux-x64-gpu (and linux-arm64-u22 / u24) matrix legs into single "linux-x64" / "linux-arm64" columns in the combined report's Fix desktop device names step. Previously the summary showed four separate columns with "-" GPU rows that looked like missing data even though the sibling leg had populated them. - Preserve categorical metrics (backend, platform, status, execution_provider) through aggregateReports and render per-device detail tables for every device — mobile included — in the combined markdown summary via a new --device-details flag on aggregate.js. Mobile devices now get the same twelve-column breakdown (Test, EP, Backend, Platform, Total / Prefill / Decode / Vision Enc / Img Prefill / TTFT ms, Gen Tokens, Prompt Tokens, TPS, Status) that desktop matrix legs emit via writeStepSummary. Verified end-to-end with the local simulation at /tmp/vlm-combined-sim/ (27/27 assertions pass). Made-with: Cursor * QVAC-17830 fix: iOS fruit plate 1-iter override + HTML detail tables Two fixes after run 24909705757 landed iOS perf data for elephant + aurora but still lost fruit plate, and the combined HTML report was missing the row-per-test detail tables that the step-summary markdown has. 1) iOS-only `iosPerfRuns` override, set to 1 for fruit plate alone. With the elephant pre-warmup in place, the fruit-plate test still crashed at t+85s (V8 bad_alloc during JSON.stringify) on the 2nd/3rd counted iteration, leaving zero rows in the perf artifact for both iPhone 16 Pro and iPhone 17. Dropping iOS fruit plate to 1 counted iteration cuts the cold-path peak footprint from 4 inferences to 2 (warmup + 1 counted), which stays under the 3376 MB Jetsam cap in local simulation. Desktop + Android still run the full PERF_RUNS=3. 2) Wire `generateDeviceDetailTables` into `generateHtmlReport` so the combined HTML report mirrors the markdown step-summary layout — one section per device with Backend / Platform / Total Time / Prefill / Decode / Vision Enc / Img Prefill / TTFT / Gen Tokens / Prompt Tokens / TPS / Status columns. Previously only the markdown had those columns, so mobile devices (which can't write to GITHUB_STEP_SUMMARY from Device Farm) appeared blank in the HTML artifact. The per-device HTML report step also gets the flags so each individual HTML has parity. Verified locally against the run-24909705757 artifact set: all 9 devices render the full row- per-test summary, including Apple iPhone 16 Pro / iPhone 17 with backend=metal and platform=ios-arm64. Made-with: Cursor * QVAC-17830 fix: warm process iOS heavy7 + dedupe perf legs + drop retry Five user-requested follow-ups after run 24913460964: 1) iOS heavy7 is now a "warm process" group. test-groups.json puts runApiBehaviorTest before runImageFruitPlateTest so V8 / Metal / native allocators are pre-warmed by a cheap test before the 10 MB fruit-plate PNG hits the multimodal pipeline. 2) The iOS Heavy7-ImgFruitPlate-Retry one-shot retry run (RUN_ARN_11) is removed from the mobile workflow. Run count goes from 11 → 10; monitoring + log-download loop bounds updated; the inline comment that justified the retry is replaced with one explaining the new warm-process rationale. 3) JS-only crash-path bare_console.log pull on iOS. The existing pull_file logic that used to live inline in `after` is now a global.flushBareLog helper installed in `before`, called from global.checkAppCrash and the setInterval crash monitor (after a 1.5 s browser.pause to give stdout a chance to flush) BEFORE the 5 s setTimeout(process.exit). Net effect: crashed runs that previously left a 0-byte LOG artifact now have a real bare_console.log on disk for triage. No C++ touched. 4) `aggregateReports` now dedupes sibling matrix legs that get folded into one device name. Previously linux-x64 / [elephant] [CPU] accumulated 6 iterations (3 from linux-x64-cpu + 3 from the GPU leg's CPU rows) which surfaced as "Run 4 / 5 / 6" headers in the per-test breakdown tables. The new rule: per (device, test, run_number), only the FIRST report's rows are kept — subsequent sibling-leg reports for the same pair are skipped. Within a single report all PERF_RUNS iterations still flow through (so counted runs per device stays at 3). The weekly aggregator is unaffected because its inputs span different run_numbers. Verified in /tmp/vlm-combined-sim/sim.js with synthetic 3-iter reports per matrix leg; linux-x64 / [elephant] [CPU] now reports `count: 3` and the markdown header reads "Iterations: 3". 5) Per-device detail-table headings use the device name as the anchor instead of a repeated "### Performance: <addon>". GitHub collapses duplicate heading slugs in step summaries which made it impossible to deep-link or scroll to the mobile detail tables — they were rendered, just hidden under a single anchor. Each device now gets `### <Device> — <addon> (<platform/arch>)` so Android / iOS sections appear in the step-summary TOC. Made-with: Cursor * QVAC-17830 fix: warm iOS heavy7 with elephant instead of api-behavior The previous warm-process seed (runApiBehaviorTest) used a text-only Qwen3-0.6B model, so it never exercised the mmproj load or the Metal vision shader JIT that fruit-plate then hit cold. iPhone 17 kept SIGABRT'ing between the LLM context init and the first inference during exactly those steps. Swap heavy7 to [runImageElephantTest, runImageFruitPlateTest] so the small (23 KB) elephant test pre-warms the exact VLM code path — SmolVLM2 + mmproj + Metal vision graph + KV cache — before the 10 MB fruit-plate PNG lands. * QVAC-17830 fix: shrink iOS fruit-plate to 2 inferences cold The previous attempt put runImageElephantTest as a "warm process" prefix to heavy7. That actually executes the full elephant perf cycle — 1 warmup + 3 counted on CPU, same on GPU, plus a media- path test = 9 elephant inferences and 2 SmolVLM2 model loads — before runImageFruitPlateTest gets to do anything. Net effect on heavy7 was 12 inferences pre-fruit-plate-row, which regressed iPhone 16 Pro (had been passing) and iPhone 17 still died after the inline elephant warmup completed. New shape: - test-groups.json: iOS heavy7 = ["runImageFruitPlateTest"] (cold process again — runImageElephantTest stays in heavy3 as its own perf slot, untouched). - _image-common.js: when iOS pre-warmup (testCase.iosWarmupImage) succeeds, set an iosPreWarmupRan flag and skip the standard PERF_WARMUP_RUNS loop. The pre-warmup IS the warmup pass. - image-fruit-plate.test.js: same iosWarmupImage='elephant.jpg' + iosPerfRuns=1 config; comments rewritten to describe the new flow. - mobile workflow: heavy7 comment block updated to match. End-to-end iOS shape for fruit-plate is now exactly what was asked for: 1. setupMultimodalInference loads SmolVLM2 cold 2. 1 elephant inference (inline iosWarmupImage, no perf record) 3. standard warmup loop SKIPPED 4. 1 fruit-plate inference (counted, perf recorded) = 2 inferences total Desktop and Android paths are completely unchanged — they still run 1 warmup + PERF_RUNS=3 counted. iOS tests without an iosWarmupImage (elephant, high-res aurora) keep their existing 1 warmup + iosPerfRuns/PERF_RUNS counted shape. Made-with: Cursor * QVAC-17830 feat: scenario grouping, GPU probe, squashed PR summary, perf for bitnet/tool-calling - extract _perf-helper.js out of _image-common.js so bitnet, tool-calling and image tests share one perf-recording path - add `scenario` field on perf records ('image' / 'bitnet' / 'tool-calling' / 'default'), plumb through aggregator + reports. Per-device detail tables (md + html) now group rows by scenario so a single combined report can surface multiple test families side by side instead of mixing them into one giant table - drop the always-'ok' Status column from the vision metric layout (just noise; the JS layer no longer hardcodes status: 'ok') - squash the combined GH step summary to three Mean ± std mini-tables (Total Time / TTFT / TPS), grouped by scenario. The HTML artifact still carries the full per-iteration breakdown - split aggregate.js --device-details into --md-device-details and --html-device-details (legacy --device-details still toggles both) - wire bitnet.test.js (1 warmup + 1 counted) and tool-calling.test.js (one perf row per model_variant x prompt cell) onto _perf-helper.js - add desktop GPU probe in detectDevice(): nvidia-smi -L on linux/win32, lspci as linux fallback, system_profiler on darwin, wmic on win32. New device.gpu field is surfaced in the per-device detail-table headings (md + html). Mobile leaves gpu=null — device.name (Device Farm label) already implies the chipset - add scripts/perf-report/sim-llamacpp-llm.js for local report-shape verification mirroring the umbrella workflow flags Made-with: Cursor * feat: surface per-device detail tables in PR summary with mean ±std cells QVAC-17830. The squashed Mean ± std mini-tables stay at the top of the combined PR summary, but the per-device detail tables (Test / EP / Backend / Platform / Total / Prefill / Decode / Vision Enc / Img Prefill / TTFT / Gen Tokens / Prompt Tokens / TPS) are now appended underneath instead of being HTML-only. Numeric cells in the detail tables render as `mean ±std` so the iteration spread is visible at every metric, not just the rolled-up ones; token-count columns stay as bare integers since deterministic generation makes `±0` noise. The umbrella workflow flips from `--html-device-details` back to `--device-details` to apply this in both the markdown and HTML outputs. Made-with: Cursor * refactor: drop image_prefill_time_ms from perf report QVAC-17830. Per-prompt image-prefill time turned out to overlap with the prefill_time_ms / TTFT signal we already report, so the column was essentially redundant. Vision Enc (ms) — the mmproj / vision- encoder time — is the only VLM-specific timer worth keeping for now; its native wiring is tracked separately under https://app.asana.com/1/45238840754660/project/1212638335655990/task/1214371583877702. Removed everywhere it surfaced: - scripts/test-utils/performance-reporter.js (vision METRIC_COLUMNS + record() default metrics) - scripts/perf-report/utils.js (label map + vision detail columns) - packages/qvac-lib-infer-llamacpp-llm/test/integration/_perf-helper.js (mobile inline-fallback default metrics + recordPerformance payload) - scripts/perf-report/sim-llamacpp-llm.js (synthetic row generators) Made-with: Cursor * QVAC-17830 fix: tighten combined perf report layout (column filtering + legend) The cross-platform mean tables previously emitted ALL device columns for every scenario block, so e.g. the bitnet block (Android-only by design — `skip: !isAndroid` in bitnet.test.js) showed 7 empty desktop/iOS columns of dashes, which made the report look broken when it was actually intentional. - Per-scenario × per-metric column filtering: drop columns that have zero data in this scenario block. Bitnet collapses from 9 → 2 columns (Pixel 9 Pro + S25 Ultra). Tool-calling drops `darwin-x64` (test is `skip: isDarwinX64`). - Short legend right under the report title explaining the dashes that remain (gated tests like medgemma-4b-it = desktop-only, mobile = qwen3-1.7b only; `darwin-x64` / `linux-arm64` have no GPU runner). Per-device detail tables are unchanged — they were already scenario- partitioned so the layout fix doesn't apply there. Made-with: Cursor * QVAC-18111 feat: env-driven perf iterations + Benchmark Performance (LLM) workflow_dispatch New perf policy agreed on Slack (2026-04-30, @Olya / @gianfranco): PR runs default to 1 warmup + 1 counted iteration (cheap, single- shot signal). The dedicated `Benchmark Performance (LLM)` workflow is the only place we crank iterations up to produce mean ± std. Changes: - `_image-common.js` and `bitnet.test.js`: replace hardcoded `PERF_RUNS=3 / PERF_WARMUP_RUNS=1` with `_envInt('QVAC_PERF_RUNS', 1)` / `_envInt('QVAC_PERF_WARMUP_RUNS', 1)`. Bad / empty / negative env values fall back to the safe default of 1+1. - iOS fruit-plate's `iosPerfRuns: 1` override is preserved — it caps iPhone iterations regardless of `QVAC_PERF_RUNS` so the benchmark workflow doesn't OOM the Jetsam ceiling on heavy images. Comments updated to reflect the new default. - `integration-test-qvac-lib-infer-llamacpp-llm.yml` accepts new `qvac_perf_runs` / `qvac_perf_warmup_runs` workflow_call inputs (default empty ⇒ unset env ⇒ test honours PR default of 1+1). The umbrella PR workflow doesn't pass these, so the PR critical path is unaffected. - `benchmark-llamacpp-llm.yml` (new): workflow_dispatch only. Calls the desktop integration-test reusable with `qvac_perf_runs=3`, aggregates a benchmark report, uploads HTML artifact + GitHub step summary. Phase-1 scope: desktop matrix only. Mobile (Android / iOS Device Farm) needs a build-time env-var injection hook in the test app — tracked as a QVAC-18111 follow-up. Made-with: Cursor * QVAC-17830 fix: tool-calling EP label honours NO_GPU on linux-x64-cpu runner The `linux-x64-cpu` matrix leg (ubuntu-22.04, no_gpu='true') runs on a CPU-only runner — llama.cpp falls back to CPU since no Vulkan device is present — but `tool-calling.test.js` previously hardcoded `useCpu = isLinuxArm64`, so its perf rows came out tagged `[GPU]`. That made the combined report show GPU bars on a CPU runner for tool-calling rows, which read as inconsistent next to the [CPU] rows from `image` and `bitnet`. Match the `_image-common.js` pattern: read `NO_GPU` from both `process.env` and `os.getEnv()`, treat any case-insensitive "true" as CPU. `useCpu` is now `isLinuxArm64 || noGpu`, so the `linux-x64-cpu` leg labels its rows `[CPU]` correctly while the `linux-x64-gpu` leg keeps `[GPU]`. Made-with: Cursor * QVAC-17830 fix: terser perf-report legend + full metric breakdown in cross-platform tables Two follow-ups based on review feedback on the combined report: - Trim the verbose "intentional gaps" legend down to a single short italic line (`> _` `-` ` = not run on this device._`). The per-scenario column filter already drops fully-empty columns, so the long explanation was no longer pulling its weight. - Surface the same metrics the per-device detail tables already show (Total Time, Prefill, Decode, Vision Encode, TTFT, TPS) in the cross-platform mini-tables, instead of just Total/TTFT/TPS. Reviewer needs the full breakdown to compare backends, not just the headline. `vision_encode_time_ms` is currently null everywhere (native timer pending — Asana 1214371583877702) so its mini-table is auto-suppressed by the existing `hasAnyData` guard until the timer lands; the entry is in the list ahead of time so the table appears automatically once data flows. Made-with: Cursor * QVAC-17830 fix: use bare-os getEnv() for QVAC_PERF_RUNS / NO_GPU lookups Bare runtime doesn't define `process` as a global at module-init time, so referencing `process.env[key]` from the top-level `_envInt(...)` call in `bitnet.test.js` blew up the entire integration suite with `ReferenceError: process is not defined` on first eval. Switch the env reads to `bare-os`'s `os.getEnv()` (already imported) as the primary source, and keep a `typeof process !== 'undefined'`- guarded `process.env` fallback for any Node code paths that happen to import these files. Apply the same pattern in: - `_image-common.js` — for both `_envInt` and the existing `noGpuEnv` lookup which had the same latent bug. - `bitnet.test.js` — for `_envInt`. - `tool-calling.test.js` — for the new `noGpuEnv` lookup added in the linux-x64-cpu label fix. Verified locally: behaves correctly when `process` is undefined, when `process.env[key]` is unset, and when only one of the two sources has the value. Made-with: Cursor * QVAC-18111 chore: align Benchmark Performance (LLM) workflow with the addon convention Rename `.github/workflows/benchmark-llamacpp-llm.yml` → `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` to match the existing `benchmark-performance-qvac-lib-infer-<addon>.yml` pattern on main (Parakeet, Whispercpp, Supertonic TTS, etc.). Also align the structure: introduce a `context` job that derives `repository` / `ref` from optional inputs (with fallback to `github.repository` / `github.ref_name`), wire prebuild + desktop integration tests through that, and add a `summarize` job whose artifact follows the convention `<addon>-performance-findings.{md, html,json}`. The new `qvac_perf_runs` / `qvac_perf_warmup_runs` inputs (defaulting to 3 / 1) are preserved — this is still the single place we crank perf iterations up to produce mean ± std numbers, and the umbrella PR workflow continues to leave them at the cheap default of 1 + 1. Workflow_dispatch only — explicitly NOT wired into the PR umbrella, matching the policy agreed on Slack (2026-04-30, @Olya / @gianfranco). Made-with: Cursor * QVAC-17830 fix: address CodeQL security findings on combined-perf-report job Resolves the three CodeQL alerts surfaced on PR #1843: - Alert 731 (`scripts/perf-report/extract-from-log.js`, regex injection): The `--device-from-filename` flag fed user-supplied input straight into `new RegExp(...)`. No workflow on main or in this PR ever passes the flag — it was dead code from an earlier mobile artifact-layout experiment. Drop the flag, the helper, and the field from `parseArgs`, which removes the ReDoS surface entirely. - Alerts 733 / 734 (`.github/workflows/on-pr-qvac-lib-infer-llamacpp-llm.yml`, cache poisoning via execution of untrusted code): The new `combine-perf-reports` job inherits the umbrella's `pull_request_target` trigger and consumes PR-author-controlled perf artifacts. The blast radius is already bounded (`permissions: contents: read`, no `actions/cache`, no secrets touched by the aggregator), but the per-device HTML step interpolated a PR-controlled `device.name` into a filesystem path with only a space → dash sanitiser. Harden it: * Pass the JSON file path via `process.argv` instead of shell- interpolating it into `node -e`, so the inner `JSON.parse` never sees shell-meta from the surrounding quoting. * Validate `device.name` is actually a string before using it. * Replace the `tr ' ' '-'` fallback with a strict `[A-Za-z0-9-]` allowlist (anything else collapses to `_`) plus a 64-char cap, which kills path traversal and shell-meta exposure even on a maliciously crafted JSON. * Skip the iteration entirely when the slug ends up empty. Add a security block comment on the job documenting the trust model (privileged trigger, scoped permissions, no shared cache, sanitised filename input) so future readers don't re-derive it. Drive-by: SHA-pin three `actions/upload-artifact@v4` references this PR introduced (one in the mobile workflow, two in the umbrella) to `@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # 7.0.0`, matching the repo's existing pinned-action convention. Pre-existing `@v4` references in `integration-mobile-test-...yml` lines 1684 / 1693 are out of scope for this PR. Made-with: Cursor * QVAC-17830 fix: add shell + security note on Generate combined report step CodeQL alert 733 (cache-poisoning/poisonable-step) is the same trust-model concern the new block comment on combine-perf-reports already documents — privileged checkout of PR-head code is unavoidable because the introducing PR needs to exercise its own aggregate.js changes. Inline the security note on the offending step so the CodeQL fingerprint reflects the explicit acknowledgement, mirroring what already silenced alert 734 on the per-device HTML step. Drive-by: add explicit `shell: bash` on the step (was inheriting from the runner default; explicit is safer + matches the surrounding steps). Made-with: Cursor * QVAC-17830 feat: bridge QVAC_PERF_RUNS overrides into mobile bare runtime Mobile side of QVAC-18111: now that the umbrella benchmark workflow pushes qvacPerfConfig.txt onto the device (PR #1840), the bare runtime needs to actually consume it. Without this, dispatched perf overrides were silently ignored on Android / iOS and tests defaulted to 1+1. Changes: - packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration-runtime.cjs: read qvacPerfConfig.txt from the same paths the testFilter loader uses (/data/local/tmp on Android, global.testDir on iOS), parse KEY=VALUE per line, inject each into bare-os via os.setEnv. Reads on first __shouldRunTest call, which fires before runIntegrationModule imports any test module — so os.getEnv() inside _image-common.js / bitnet.test.js / tool-calling.test.js sees the dispatched value at module init. Normalises both literal '\\n' (the bash-escaped form the WDIO before-hook produces today) and real '\\r?\\n' so the parser doesn't depend on which encoding the workflow happens to use. - .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml: extract the two giant Android / iOS WDIO_CONFIG JS strings out of the 'Create and Upload Test Spec' run: block into env: vars. GitHub Actions imposes a hard 21 000-char limit on any run: block that contains ${{ }} expressions, and the new Android image-perf group + iOS heavy7/heavy8 splits + collapsed-format echo line for the test split summary pushed us over. Moving the WDIO blobs to env: drops the run block from ~21 KB to ~11 KB so we have headroom for further additions. Made-with: Cursor * QVAC-17830 feat: gate Benchmark Performance (LLM) to perf-emitting tests only Adds qvac_perf_only input to the desktop and mobile reusable workflows and turns it on from Benchmark Performance (LLM) so dispatched runs only execute the perf-emitting tests instead of the full integration suite. Desktop: regenerates test/integration/all.js from bitnet, tool-calling and the three image-* tests, then runs bare against it. Mobile: filters each group's run*Test list down to perf-emitting wrappers, skips groups whose intersection is empty (no spec uploaded, no Device Farm run booked), and tolerates gaps in the RUN_ARN_* slot sequence when monitoring / downloading logs. Co-authored-by: Cursor <cursoragent@cursor.com> * fix[ci]: drop PR-head checkout in combine-perf-reports to clear CodeQL 735/736 The `combine-perf-reports` job runs in `pull_request_target` context and was checking out the PR head's `scripts/perf-report` so it could aggregate the matrix artifacts. CodeQL flagged this as a cache-poisoning / poisonable-step issue (alerts 735/736) because we were executing PR-author code in a privileged context. Removing the `repository:` / `ref:` overrides on the checkout makes `actions/checkout` fall back to the base-branch SHA in `pull_request_target` runs, so `aggregate.js` is now trusted code from the default branch. `workflow_dispatch` runs are unaffected because `github.ref` resolves to the dispatched branch. Existing mitigations stay in place: `contents: read` only, no `actions/cache`, SHA-pinned actions, and the strict `[A-Za-z0-9-]` allowlist + 64-char cap on `device.name` before it touches any filesystem path. SECURITY comments updated to reflect the new posture. Co-authored-by: Cursor <cursoragent@cursor.com> * fix[ci]: add runGemma4Test + runOcrPaddleTest to iOS lightB The merge-resolution earlier kept our 3 image perf tests in iOS heavy3/7/8, displacing main's runGemma4Test/runOcrPaddleTest. The validator in scripts/generate-mobile-integration-tests.js requires every test to be assigned on both platforms — these two were missing on iOS. Placed them in lightB alongside runQwen35Test and runModelLoadingTest since they're model-loading tests with similar runtime characteristics. * fix[ci]: bound mobile monitor when AWS API permanently fails Two complementary fixes for the Device Farm monitor loop in integration-mobile-test-llm-llamacpp.yml. 1. Match the initial credentials step's role-duration-seconds: 7200 on the 'Refresh AWS credentials for monitoring' step. Without it, the refresh defaults to the action's 1h fallback while MAX_WAIT_TIME is 2h, so every devicefarm API call past the 1h mark fails until timeout — producing the spinning 'Run N: () | AWS API attempt X/3 failed, retrying in 5s...' output for ~1h with no escape. 2. Make the poll loop self-bounding under persistent API failure: - Initialize per-slot STATUS/RESULT to PENDING/- so the dashboard never prints the empty 'Run N: ()' line. - On aws_retry returning empty (or 'None'), preserve last-known-good STATUS/RESULT for that slot and increment a per-slot FAIL counter. - After 10 consecutive failures (~5 min at the 30s cadence, well past any normal STS / Device Farm transient) mark that slot ERRORED / NO_API_RESPONSE / DONE so the loop exits cleanly instead of burning the full 2h ceiling. Final-results phase reports a real failure. * mod[notask]: isolate iOS Gemma4 and OcrPaddle into their own Device Farm slots Per @gianni-cor review on PR #1843, restore Gemma4 and OcrPaddle to isolated slots instead of folding them into lightB. The image perf tests keep their existing isolation (heavy3/heavy7/heavy8), so the iOS heavy bank grows from 8 → 10 (plus lightA/lightB at slots 11/12). - test-groups.json: add heavy9=runGemma4Test, heavy10=runOcrPaddleTest; drop both from lightB - integration-mobile-test-llm-llamacpp.yml: - filter_perf: extract heavy9 + heavy10 wrappers - upload spec: add testspec-heavy-9.yml + testspec-heavy-10.yml, renumber light specs to test_spec_arn_11 / test_spec_arn_12 - schedule: 12 maybe_schedule calls for iOS - monitor + log download: RUN_ARN_1..12 explicit assignments (the seq-based loops are already dynamic via $RUN_COUNT) Co-authored-by: Cursor <cursoragent@cursor.com> * QVAC-17830 fix: apply fruit-plate iOS OOM mitigation to high-res aurora iOS Device Farm runs (iPhone 16 Pro / 17) were hanging the on-PR job on `runImageHighResAuroraTest`. The 3000x4000 (~12 MP) JPEG pushed peak memory past the ~3.3 GB Jetsam ceiling once the device was warm from earlier jobs in the queue, the app was killed via memorystatus, and Appium never received a clean failure signal — leaving the Device Farm run in RUNNING (PENDING) until the 60-min job timeout. When `image.test.js` was split into per-image files in 37afd3d, the fruit-plate variant correctly received `iosWarmupImage` + `iosPerfRuns: 1`, but high-res aurora — the heaviest image in the suite — was left on the standard path. That meant CPU + GPU each ran 1 warmup + N counted with the full 12 MP JPEG, so on a warm iPhone the cold path crossed the Jetsam line. Mirror the fruit-plate config: - `iosWarmupImage: 'elephant.jpg'` pre-warms the multimodal pipeline with a 23 KB image so Metal shaders, KV cache, and image-prefill buffers are allocated before the 12 MP JPEG arrives. `_image-common.js` skips the standard PERF_WARMUP_RUNS loop once the pre-warmup runs. - `iosPerfRuns: 1` caps iOS counted iterations at 1 regardless of QVAC_PERF_RUNS, so the benchmark workflow's n=3 plan does not OOM the iPhone. iOS cold-path footprint is now exactly 2 inferences per backend (1 small pre-warmup + 1 counted real image), matching fruit-plate's known-good shape. Desktop and Android paths are unchanged. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: gianni-cor <gianfranco.cordella@tether.io>
Proletter
added a commit
that referenced
this pull request
May 24, 2026
…a pushFile (#1840) * QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch GitHub requires a `workflow_dispatch` workflow to exist on the default branch before it shows up in the Actions tab and becomes triggerable with `--ref <feature-branch>`. This lands the LLM benchmark workflow on `main` so the QVAC-17830 perf-metrics feature branch can be dispatched against it for end-to-end validation. Changes: - `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` (new): manual `workflow_dispatch` only — mirrors the structure of the existing Parakeet / Whispercpp benchmark workflows. Calls `prebuilds-...yml` then `integration-test-...yml` with bench-mode iteration counts (`QVAC_PERF_RUNS=3`, `QVAC_PERF_WARMUP_RUNS=1` by default), then aggregates desktop artifacts into a combined HTML / step-summary. Phase-1 scope is desktop only — mobile (Device Farm) needs a build-time hook in the test app to thread env vars through to bare and is tracked as a QVAC-18111 follow-up. - `integration-test-qvac-lib-infer-llamacpp-llm.yml`: thread `qvac_perf_runs` / `qvac_perf_warmup_runs` through `workflow_call` + `workflow_dispatch` and surface them as `QVAC_PERF_RUNS` / `QVAC_PERF_WARMUP_RUNS` on the Linux/macOS and Windows test run steps. Empty string => unset, so the umbrella PR workflow continues to honour the test-side default and PR runs are unaffected by this change. Per the perf policy agreed on Slack (2026-04-30): the umbrella on-pr workflow runs perf tests at the cheap default so we don't pay full perf cost on every PR; this dedicated workflow is the only place we crank up the iteration counts to produce mean ± std numbers. Made-with: Cursor * QVAC-18111 chore[notask]: trim chatty inline comments in benchmark workflow Made-with: Cursor * QVAC-18111 chore[notask]: add run_desktop toggle to benchmark workflow_dispatch Made-with: Cursor * QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile Extends the mobile integration workflow with the same iteration-count inputs as the desktop reusable workflow, and adds a `mobile-benchmarks` job to the LLM benchmark dispatch so it covers Device Farm too. The bare runtime on Device Farm doesn't see GitHub Actions env vars, so we mirror the existing `testFilter.txt` pattern: when the workflow inputs are non-empty, the WDIO before-hook pushes a `qvacPerfConfig.txt` to the device (Android: `/data/local/tmp/`, iOS: `@bundleId:documents/`) with the iteration overrides as KEY=VALUE lines. The file-reading side on bare lives on the QVAC-17830 perf branch — without that branch this PR is a no-op (orphan file), so it is safe to land independently. Changes: - `integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml`: add `qvac_perf_runs` / `qvac_perf_warmup_runs` to `workflow_call` and `workflow_dispatch`; add `__QVAC_PERF_RUNS__` / `__QVAC_PERF_WARMUP_RUNS__` placeholders to the Android + iOS WDIO config blobs and the corresponding pushFile block in the `before` hook; substitute the placeholders in `make_split`. - `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml`: add a `mobile-benchmarks` job calling the mobile workflow with the bench-mode iteration counts; have `summarize` `needs:` it; drop the "desktop only" caveat in the step-summary blurb. PR runs are unchanged: empty input ⇒ empty placeholder ⇒ before-hook skips the perf-config push. Made-with: Cursor * QVAC-18111 chore[notask]: add run_mobile toggle to benchmark workflow_dispatch Made-with: Cursor --------- Co-authored-by: Proletter <40578159+Proletter@users.noreply.github.com>
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
… & GPU) (#1843) * QVAC-17830 feat: add VLM perf metrics with multi-run averaging Adds CPU + GPU VLM performance reporting to the llamacpp-llm addon on desktop + mobile (iOS / Android Device Farm), mirroring the OCR perf pipeline. Perf reporter / aggregator - scripts/test-utils/performance-reporter.js: extend `vision` addon type with backend, platform, prefill, decode, vision_encode, image_prefill and status columns. - scripts/perf-report/utils.js: teach the aggregator the new VLM metric labels so markdown/HTML reports render them. CI wiring - integration-test-qvac-lib-infer-llamacpp-llm.yml (desktop): add aggregate + upload-artifact steps so CPU/GPU perf artifacts are produced alongside the existing test run. - integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml (mobile): scan Device Farm logs for [PERF_REPORT_START]...[PERF_REPORT_END] markers, run extract-from-log.js + aggregate.js, upload per-platform perf artifacts. Mirrors the OCR mobile workflow pattern. Test refactor (image.test.js) - Include image name in the perf label ("[elephant] [CPU]" etc.) so aggregate.js (which groups by `result.test`) no longer pools different images into a single CPU/GPU cell. - Run 1 warmup + 3 counted inferences per (image x backend) so the aggregator produces count=3, mean, std within a single CI run -- matches OCR's PERF_RUNS=3 convention. - Reuse the loaded LlmLlamacpp instance across iterations inside a single test() block: deliberate divergence from OCR's one-test()- per-iteration shape because VLM model load is multi-GB and already stresses iOS Device Farm memory. - Bump per-test timeout to 25 min for the multi-iter perf tests; leave TEST_CONSTANTS.timeout unchanged for other tests in the file. - Keyword assertions run once against the last counted iteration's output. vision_encode_time_ms / image_prefill_time_ms are stubbed as null pending native RuntimeStats instrumentation (separate follow-up task). * QVAC-17830 feat: wire Mobile LLM into perf-report.yml weekly aggregator Mirrors the OCR desktop/mobile split. Adds a "Vision/LLM (Mobile)" block to the scheduled cron job so Monday 9am UTC runs aggregate perf artifacts from Mobile Integration Tests (LLM) into a separate llamacpp-llm-mobile-performance.{md,json,html} report alongside the existing desktop one. Also adds "Mobile Integration Tests (LLM)" to the workflow_dispatch workflow_name choice list so it can be triggered manually. Without this, the per-run Android/iOS perf artifacts the mobile workflow now uploads would sit unused by the weekly cross-run aggregator. * QVAC-17830 feat: add per-run joint perf reporter to mobile LLM workflow Mirrors the `combine-reports` job in the OCR mobile workflow. Runs after the per-platform matrix completes and downloads every artifact matching `perf-report-llamacpp-llm-*-<run_number>` -- which, because the desktop (integration-test-*.yml) and mobile (this file) workflows are both invoked via workflow_call from the umbrella On PR Trigger (LLM), share the same github.run_number and therefore live together in the single run's artifact scope. The job then: - aggregates all desktop + mobile rows into a single combined HTML, Markdown and JSON report via scripts/perf-report/aggregate.js; - patches the `device.name` for desktop artifacts using the artifact suffix (e.g. "linux-x64", "macos-arm64") so they render with recognisable column headers next to the mobile device names; - emits a per-device HTML report for each unique device encountered (desktop OS/arch variants + each mobile device); - writes the combined markdown report into $GITHUB_STEP_SUMMARY so the PR run page shows one unified perf table covering every platform the addon ran on in that run. This is what the user actually meant by "joint performance reporter" -- distinct from the weekly/cross-run aggregation in perf-report.yml, which aggregates historical runs for trend tracking. * QVAC-17830 fix: preserve mobile perf data on OOM, split image tests per device Android perf data was never surfacing because the inline fallback reporter used on mobile was a no-op stub (no record/writeToConsole), so logcat never saw PERF_REPORT markers. iOS image tests were also crashing mid-run (Jetsam/memorystatus) before process.on('exit') could flush, losing whichever runs had already completed. - Replace the mobile fallback reporter in _image-common.js with a functional inline implementation that mirrors OCR's: record(), writeReport(), and writeToConsole() with lightweight snapshots and logcat-safe chunking. - Flush incrementally after every record() on mobile so earlier iterations survive a SIGKILL later in the same group. - Split image.test.js into per-image files (elephant, fruit plate, high-res aurora) behind a shared _image-common.js helper, and wire each into its own Device Farm group (Android: 3 new groups; iOS: heavy3=elephant, heavy7=fruit, heavy8=aurora). - Teach extract-from-log.js to --merge multiple reports per device (needed because each physical device now runs three groups) and add --device-from-filename as a fallback when logs land in a flat layout. - Update the mobile workflow to schedule 5 Android / 10 iOS runs, bucket downloaded artifacts by device, unzip Customer_Artifacts inline so bare_console.log is scannable, and invoke extract-from-log.js with --merge. Verified locally with a happy-path sim (3 samples per image×backend per device) and a Jetsam crash sim (aurora SIGKILLed mid-GPU on iPhone 16 Pro Max) — 16 of 18 rows still reach the aggregated report on the crashed device, and the other three groups report fully. Made-with: Cursor * QVAC-17830 fix: preserve mobile perf data under iOS V8 Zone OOM Fruit plate iOS test crashed inside Builtin_JsonStringify because the per-test flush stringified the cumulative results array on every record() call (3+ runs x 2 backends x verbose VLM output over a 10MB image + Metal compiler memory). Even with logcat chunking, the JSON build itself exhausted V8's Zone allocator, producing SIGTRAP from FatalProcessOutOfMemory before any data reached logcat. Changes in the mobile fallback reporter (_image-common.js): - Cap stored output text to 400 chars per record so the accumulated results array stays small regardless of model verbosity. - Add delta mode to writeToConsole: emit only the latest row per flush, so stringify stays O(1) in iteration count on mobile. - Stop calling writeReport() on disk per-record; rewriting the full JSON every iteration is both expensive and an OOM hazard. One final writeReport runs in the exit hook for the on-device artifact. - Remove the final cumulative writeToConsole in the exit hook: the per-test delta emits already carry every row, and emitting a large cumulative payload at exit is exactly when we most need the earlier deltas to survive in logcat/syslog. Changes in extract-from-log.js: - Split extractFromText into extractAllFromText (returns every report parsed from the text) + the legacy extractFromText wrapper that returns the last one. - Add opts.all to extractFromFile so --merge gathers every delta emit from a log file, not just the largest or last marker pair. - main() with --merge now pushes all reports per file into the device bucket before handing them to mergeDeviceReports, so partial data from a pre-crash run still lands in the aggregate. Verified locally with a delta-emit simulation that mid-test crashes leave runs 1..N-1 intact in the reconstructed report. Made-with: Cursor * QVAC-17830 fix: plug combined perf report gaps (desktop race + artifact collision) Two pre-existing bugs caused the combined performance table to drop data on a PR run: 1. `darwin-x64` (and any other slow desktop job) was missing because `combine-reports` lived inside the mobile workflow and only waited on the mobile matrix. Desktop jobs that finished AFTER mobile were silently excluded. Move the job to the umbrella (`on-pr-qvac-lib-infer-llamacpp-llm.yml`) as `combine-perf-reports` with `needs: [run-integration-tests, run-mobile-integration-tests]`, so it blocks on the full fan-out before scanning artifacts. 2. `linux-x64` GPU data (and the ubuntu-22 vs ubuntu-24 arm64 pair) was overwritten because every matrix entry uploaded its artifact under `perf-report-llamacpp-llm-<platform>-<arch>-<run>`; any two entries sharing platform+arch clobbered each other on download. Add a unique `label` field to each matrix entry (`linux-x64-cpu`, `linux-x64-gpu`, `linux-arm64-u22`, `linux-arm64-u24`, ...) and key the upload + device-name patch on `matrix.label` instead. Made-with: Cursor * QVAC-17830 fix: iOS fruit plate retry + consolidate Android images + flush logs on crash Three changes scoped to the mobile LLM integration workflow: * Android: merge the three per-image Device Farm groups back into a single groupImagesPerf run. Android does not have the iOS jetsam pressure, so running all three image tests in one invocation is both cheaper and simpler while still giving the aggregator 9 rows (3 images x 3 iters) per device. * iOS: schedule Heavy7 (fruit plate) a second time as Heavy7-ImgFruitPlate-Retry. Both attempts land in the same per-device folder on download, and extract-from-log.js --merge dedupes on the metric fingerprint, so if either attempt gets past iOS jetsam we have perf data. If both succeed we just get more iterations to average over, which is strictly better. run_count goes 10 -> 11 and the Monitor / Download loops were extended accordingly. * Crash path (both platforms): route process.exit(1) in checkAppCrash and crashMonitor through a new global.flushLogsAndExit helper that (a) sleeps 5s so bare stdout + iOS os_log / Android logcat can drain and (b) on iOS performs an inline Appium pull_file of bare_console.log to $DEVICEFARM_LOG_DIR before exiting. This means delta perf emits that hit disk before a jetsam kill now reach the aggregator instead of being stuck in the app sandbox. Made-with: Cursor * QVAC-17830 fix: inline crash flush-delay, drop duplicated pull helper Previous commit pushed the `Create and Upload Test Spec` run: block over GitHub Actions' ~30K template-expression budget, so the umbrella workflow refused to dispatch with: Line 927, Col 14: Exceeded max expression length 21000 Root cause: `global.flushLogsAndExit` was inlined into BOTH Android and iOS wdio configs. On iOS it duplicated the exact Appium pull_file logic the `after` hook already has, and on Android it carried dead iOS-only code — roughly 1.8 kB of repeated JS across the two configs. Instead: * Drop `global.flushLogsAndExit` entirely. * In `checkAppCrash` and `crashMonitor` (both Android + iOS), replace `process.exit(1)` with `setTimeout(function(){process.exit(1);},5000)`. Same 5s drain window for logcat / os_log / bare stdout, one line of JS, zero helper definition. * iOS `after` hook still pulls `bare_console.log` via Appium on the normal path, so delta perf data from successful runs is unchanged. Net `run:` block size goes from 31,524 → 28,992 chars (below the previously-working 28,995). Both wdio configs revalidated with `node --check` and the full workflow YAML round-trips through PyYAML. Made-with: Cursor * QVAC-17830 fix: iOS fruit plate warmup + merge linux legs + mobile detail tables - Add iOS-only tiny-image (elephant.jpg) warmup to fruit plate test so Metal shaders / KV cache / image-prefill buffer are allocated with a cheap 23 KB image before the 10 MB PNG hits, keeping us under Jetsam's per-process memory cap. Opt-in via testCase.iosWarmupImage so desktop and Android paths remain byte-for-byte identical. - Fold linux-x64-cpu / linux-x64-gpu (and linux-arm64-u22 / u24) matrix legs into single "linux-x64" / "linux-arm64" columns in the combined report's Fix desktop device names step. Previously the summary showed four separate columns with "-" GPU rows that looked like missing data even though the sibling leg had populated them. - Preserve categorical metrics (backend, platform, status, execution_provider) through aggregateReports and render per-device detail tables for every device — mobile included — in the combined markdown summary via a new --device-details flag on aggregate.js. Mobile devices now get the same twelve-column breakdown (Test, EP, Backend, Platform, Total / Prefill / Decode / Vision Enc / Img Prefill / TTFT ms, Gen Tokens, Prompt Tokens, TPS, Status) that desktop matrix legs emit via writeStepSummary. Verified end-to-end with the local simulation at /tmp/vlm-combined-sim/ (27/27 assertions pass). Made-with: Cursor * QVAC-17830 fix: iOS fruit plate 1-iter override + HTML detail tables Two fixes after run 24909705757 landed iOS perf data for elephant + aurora but still lost fruit plate, and the combined HTML report was missing the row-per-test detail tables that the step-summary markdown has. 1) iOS-only `iosPerfRuns` override, set to 1 for fruit plate alone. With the elephant pre-warmup in place, the fruit-plate test still crashed at t+85s (V8 bad_alloc during JSON.stringify) on the 2nd/3rd counted iteration, leaving zero rows in the perf artifact for both iPhone 16 Pro and iPhone 17. Dropping iOS fruit plate to 1 counted iteration cuts the cold-path peak footprint from 4 inferences to 2 (warmup + 1 counted), which stays under the 3376 MB Jetsam cap in local simulation. Desktop + Android still run the full PERF_RUNS=3. 2) Wire `generateDeviceDetailTables` into `generateHtmlReport` so the combined HTML report mirrors the markdown step-summary layout — one section per device with Backend / Platform / Total Time / Prefill / Decode / Vision Enc / Img Prefill / TTFT / Gen Tokens / Prompt Tokens / TPS / Status columns. Previously only the markdown had those columns, so mobile devices (which can't write to GITHUB_STEP_SUMMARY from Device Farm) appeared blank in the HTML artifact. The per-device HTML report step also gets the flags so each individual HTML has parity. Verified locally against the run-24909705757 artifact set: all 9 devices render the full row- per-test summary, including Apple iPhone 16 Pro / iPhone 17 with backend=metal and platform=ios-arm64. Made-with: Cursor * QVAC-17830 fix: warm process iOS heavy7 + dedupe perf legs + drop retry Five user-requested follow-ups after run 24913460964: 1) iOS heavy7 is now a "warm process" group. test-groups.json puts runApiBehaviorTest before runImageFruitPlateTest so V8 / Metal / native allocators are pre-warmed by a cheap test before the 10 MB fruit-plate PNG hits the multimodal pipeline. 2) The iOS Heavy7-ImgFruitPlate-Retry one-shot retry run (RUN_ARN_11) is removed from the mobile workflow. Run count goes from 11 → 10; monitoring + log-download loop bounds updated; the inline comment that justified the retry is replaced with one explaining the new warm-process rationale. 3) JS-only crash-path bare_console.log pull on iOS. The existing pull_file logic that used to live inline in `after` is now a global.flushBareLog helper installed in `before`, called from global.checkAppCrash and the setInterval crash monitor (after a 1.5 s browser.pause to give stdout a chance to flush) BEFORE the 5 s setTimeout(process.exit). Net effect: crashed runs that previously left a 0-byte LOG artifact now have a real bare_console.log on disk for triage. No C++ touched. 4) `aggregateReports` now dedupes sibling matrix legs that get folded into one device name. Previously linux-x64 / [elephant] [CPU] accumulated 6 iterations (3 from linux-x64-cpu + 3 from the GPU leg's CPU rows) which surfaced as "Run 4 / 5 / 6" headers in the per-test breakdown tables. The new rule: per (device, test, run_number), only the FIRST report's rows are kept — subsequent sibling-leg reports for the same pair are skipped. Within a single report all PERF_RUNS iterations still flow through (so counted runs per device stays at 3). The weekly aggregator is unaffected because its inputs span different run_numbers. Verified in /tmp/vlm-combined-sim/sim.js with synthetic 3-iter reports per matrix leg; linux-x64 / [elephant] [CPU] now reports `count: 3` and the markdown header reads "Iterations: 3". 5) Per-device detail-table headings use the device name as the anchor instead of a repeated "### Performance: <addon>". GitHub collapses duplicate heading slugs in step summaries which made it impossible to deep-link or scroll to the mobile detail tables — they were rendered, just hidden under a single anchor. Each device now gets `### <Device> — <addon> (<platform/arch>)` so Android / iOS sections appear in the step-summary TOC. Made-with: Cursor * QVAC-17830 fix: warm iOS heavy7 with elephant instead of api-behavior The previous warm-process seed (runApiBehaviorTest) used a text-only Qwen3-0.6B model, so it never exercised the mmproj load or the Metal vision shader JIT that fruit-plate then hit cold. iPhone 17 kept SIGABRT'ing between the LLM context init and the first inference during exactly those steps. Swap heavy7 to [runImageElephantTest, runImageFruitPlateTest] so the small (23 KB) elephant test pre-warms the exact VLM code path — SmolVLM2 + mmproj + Metal vision graph + KV cache — before the 10 MB fruit-plate PNG lands. * QVAC-17830 fix: shrink iOS fruit-plate to 2 inferences cold The previous attempt put runImageElephantTest as a "warm process" prefix to heavy7. That actually executes the full elephant perf cycle — 1 warmup + 3 counted on CPU, same on GPU, plus a media- path test = 9 elephant inferences and 2 SmolVLM2 model loads — before runImageFruitPlateTest gets to do anything. Net effect on heavy7 was 12 inferences pre-fruit-plate-row, which regressed iPhone 16 Pro (had been passing) and iPhone 17 still died after the inline elephant warmup completed. New shape: - test-groups.json: iOS heavy7 = ["runImageFruitPlateTest"] (cold process again — runImageElephantTest stays in heavy3 as its own perf slot, untouched). - _image-common.js: when iOS pre-warmup (testCase.iosWarmupImage) succeeds, set an iosPreWarmupRan flag and skip the standard PERF_WARMUP_RUNS loop. The pre-warmup IS the warmup pass. - image-fruit-plate.test.js: same iosWarmupImage='elephant.jpg' + iosPerfRuns=1 config; comments rewritten to describe the new flow. - mobile workflow: heavy7 comment block updated to match. End-to-end iOS shape for fruit-plate is now exactly what was asked for: 1. setupMultimodalInference loads SmolVLM2 cold 2. 1 elephant inference (inline iosWarmupImage, no perf record) 3. standard warmup loop SKIPPED 4. 1 fruit-plate inference (counted, perf recorded) = 2 inferences total Desktop and Android paths are completely unchanged — they still run 1 warmup + PERF_RUNS=3 counted. iOS tests without an iosWarmupImage (elephant, high-res aurora) keep their existing 1 warmup + iosPerfRuns/PERF_RUNS counted shape. Made-with: Cursor * QVAC-17830 feat: scenario grouping, GPU probe, squashed PR summary, perf for bitnet/tool-calling - extract _perf-helper.js out of _image-common.js so bitnet, tool-calling and image tests share one perf-recording path - add `scenario` field on perf records ('image' / 'bitnet' / 'tool-calling' / 'default'), plumb through aggregator + reports. Per-device detail tables (md + html) now group rows by scenario so a single combined report can surface multiple test families side by side instead of mixing them into one giant table - drop the always-'ok' Status column from the vision metric layout (just noise; the JS layer no longer hardcodes status: 'ok') - squash the combined GH step summary to three Mean ± std mini-tables (Total Time / TTFT / TPS), grouped by scenario. The HTML artifact still carries the full per-iteration breakdown - split aggregate.js --device-details into --md-device-details and --html-device-details (legacy --device-details still toggles both) - wire bitnet.test.js (1 warmup + 1 counted) and tool-calling.test.js (one perf row per model_variant x prompt cell) onto _perf-helper.js - add desktop GPU probe in detectDevice(): nvidia-smi -L on linux/win32, lspci as linux fallback, system_profiler on darwin, wmic on win32. New device.gpu field is surfaced in the per-device detail-table headings (md + html). Mobile leaves gpu=null — device.name (Device Farm label) already implies the chipset - add scripts/perf-report/sim-llamacpp-llm.js for local report-shape verification mirroring the umbrella workflow flags Made-with: Cursor * feat: surface per-device detail tables in PR summary with mean ±std cells QVAC-17830. The squashed Mean ± std mini-tables stay at the top of the combined PR summary, but the per-device detail tables (Test / EP / Backend / Platform / Total / Prefill / Decode / Vision Enc / Img Prefill / TTFT / Gen Tokens / Prompt Tokens / TPS) are now appended underneath instead of being HTML-only. Numeric cells in the detail tables render as `mean ±std` so the iteration spread is visible at every metric, not just the rolled-up ones; token-count columns stay as bare integers since deterministic generation makes `±0` noise. The umbrella workflow flips from `--html-device-details` back to `--device-details` to apply this in both the markdown and HTML outputs. Made-with: Cursor * refactor: drop image_prefill_time_ms from perf report QVAC-17830. Per-prompt image-prefill time turned out to overlap with the prefill_time_ms / TTFT signal we already report, so the column was essentially redundant. Vision Enc (ms) — the mmproj / vision- encoder time — is the only VLM-specific timer worth keeping for now; its native wiring is tracked separately under https://app.asana.com/1/45238840754660/project/1212638335655990/task/1214371583877702. Removed everywhere it surfaced: - scripts/test-utils/performance-reporter.js (vision METRIC_COLUMNS + record() default metrics) - scripts/perf-report/utils.js (label map + vision detail columns) - packages/qvac-lib-infer-llamacpp-llm/test/integration/_perf-helper.js (mobile inline-fallback default metrics + recordPerformance payload) - scripts/perf-report/sim-llamacpp-llm.js (synthetic row generators) Made-with: Cursor * QVAC-17830 fix: tighten combined perf report layout (column filtering + legend) The cross-platform mean tables previously emitted ALL device columns for every scenario block, so e.g. the bitnet block (Android-only by design — `skip: !isAndroid` in bitnet.test.js) showed 7 empty desktop/iOS columns of dashes, which made the report look broken when it was actually intentional. - Per-scenario × per-metric column filtering: drop columns that have zero data in this scenario block. Bitnet collapses from 9 → 2 columns (Pixel 9 Pro + S25 Ultra). Tool-calling drops `darwin-x64` (test is `skip: isDarwinX64`). - Short legend right under the report title explaining the dashes that remain (gated tests like medgemma-4b-it = desktop-only, mobile = qwen3-1.7b only; `darwin-x64` / `linux-arm64` have no GPU runner). Per-device detail tables are unchanged — they were already scenario- partitioned so the layout fix doesn't apply there. Made-with: Cursor * QVAC-18111 feat: env-driven perf iterations + Benchmark Performance (LLM) workflow_dispatch New perf policy agreed on Slack (2026-04-30, @Olya / @gianfranco): PR runs default to 1 warmup + 1 counted iteration (cheap, single- shot signal). The dedicated `Benchmark Performance (LLM)` workflow is the only place we crank iterations up to produce mean ± std. Changes: - `_image-common.js` and `bitnet.test.js`: replace hardcoded `PERF_RUNS=3 / PERF_WARMUP_RUNS=1` with `_envInt('QVAC_PERF_RUNS', 1)` / `_envInt('QVAC_PERF_WARMUP_RUNS', 1)`. Bad / empty / negative env values fall back to the safe default of 1+1. - iOS fruit-plate's `iosPerfRuns: 1` override is preserved — it caps iPhone iterations regardless of `QVAC_PERF_RUNS` so the benchmark workflow doesn't OOM the Jetsam ceiling on heavy images. Comments updated to reflect the new default. - `integration-test-qvac-lib-infer-llamacpp-llm.yml` accepts new `qvac_perf_runs` / `qvac_perf_warmup_runs` workflow_call inputs (default empty ⇒ unset env ⇒ test honours PR default of 1+1). The umbrella PR workflow doesn't pass these, so the PR critical path is unaffected. - `benchmark-llamacpp-llm.yml` (new): workflow_dispatch only. Calls the desktop integration-test reusable with `qvac_perf_runs=3`, aggregates a benchmark report, uploads HTML artifact + GitHub step summary. Phase-1 scope: desktop matrix only. Mobile (Android / iOS Device Farm) needs a build-time env-var injection hook in the test app — tracked as a QVAC-18111 follow-up. Made-with: Cursor * QVAC-17830 fix: tool-calling EP label honours NO_GPU on linux-x64-cpu runner The `linux-x64-cpu` matrix leg (ubuntu-22.04, no_gpu='true') runs on a CPU-only runner — llama.cpp falls back to CPU since no Vulkan device is present — but `tool-calling.test.js` previously hardcoded `useCpu = isLinuxArm64`, so its perf rows came out tagged `[GPU]`. That made the combined report show GPU bars on a CPU runner for tool-calling rows, which read as inconsistent next to the [CPU] rows from `image` and `bitnet`. Match the `_image-common.js` pattern: read `NO_GPU` from both `process.env` and `os.getEnv()`, treat any case-insensitive "true" as CPU. `useCpu` is now `isLinuxArm64 || noGpu`, so the `linux-x64-cpu` leg labels its rows `[CPU]` correctly while the `linux-x64-gpu` leg keeps `[GPU]`. Made-with: Cursor * QVAC-17830 fix: terser perf-report legend + full metric breakdown in cross-platform tables Two follow-ups based on review feedback on the combined report: - Trim the verbose "intentional gaps" legend down to a single short italic line (`> _` `-` ` = not run on this device._`). The per-scenario column filter already drops fully-empty columns, so the long explanation was no longer pulling its weight. - Surface the same metrics the per-device detail tables already show (Total Time, Prefill, Decode, Vision Encode, TTFT, TPS) in the cross-platform mini-tables, instead of just Total/TTFT/TPS. Reviewer needs the full breakdown to compare backends, not just the headline. `vision_encode_time_ms` is currently null everywhere (native timer pending — Asana 1214371583877702) so its mini-table is auto-suppressed by the existing `hasAnyData` guard until the timer lands; the entry is in the list ahead of time so the table appears automatically once data flows. Made-with: Cursor * QVAC-17830 fix: use bare-os getEnv() for QVAC_PERF_RUNS / NO_GPU lookups Bare runtime doesn't define `process` as a global at module-init time, so referencing `process.env[key]` from the top-level `_envInt(...)` call in `bitnet.test.js` blew up the entire integration suite with `ReferenceError: process is not defined` on first eval. Switch the env reads to `bare-os`'s `os.getEnv()` (already imported) as the primary source, and keep a `typeof process !== 'undefined'`- guarded `process.env` fallback for any Node code paths that happen to import these files. Apply the same pattern in: - `_image-common.js` — for both `_envInt` and the existing `noGpuEnv` lookup which had the same latent bug. - `bitnet.test.js` — for `_envInt`. - `tool-calling.test.js` — for the new `noGpuEnv` lookup added in the linux-x64-cpu label fix. Verified locally: behaves correctly when `process` is undefined, when `process.env[key]` is unset, and when only one of the two sources has the value. Made-with: Cursor * QVAC-18111 chore: align Benchmark Performance (LLM) workflow with the addon convention Rename `.github/workflows/benchmark-llamacpp-llm.yml` → `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` to match the existing `benchmark-performance-qvac-lib-infer-<addon>.yml` pattern on main (Parakeet, Whispercpp, Supertonic TTS, etc.). Also align the structure: introduce a `context` job that derives `repository` / `ref` from optional inputs (with fallback to `github.repository` / `github.ref_name`), wire prebuild + desktop integration tests through that, and add a `summarize` job whose artifact follows the convention `<addon>-performance-findings.{md, html,json}`. The new `qvac_perf_runs` / `qvac_perf_warmup_runs` inputs (defaulting to 3 / 1) are preserved — this is still the single place we crank perf iterations up to produce mean ± std numbers, and the umbrella PR workflow continues to leave them at the cheap default of 1 + 1. Workflow_dispatch only — explicitly NOT wired into the PR umbrella, matching the policy agreed on Slack (2026-04-30, @Olya / @gianfranco). Made-with: Cursor * QVAC-17830 fix: address CodeQL security findings on combined-perf-report job Resolves the three CodeQL alerts surfaced on PR #1843: - Alert 731 (`scripts/perf-report/extract-from-log.js`, regex injection): The `--device-from-filename` flag fed user-supplied input straight into `new RegExp(...)`. No workflow on main or in this PR ever passes the flag — it was dead code from an earlier mobile artifact-layout experiment. Drop the flag, the helper, and the field from `parseArgs`, which removes the ReDoS surface entirely. - Alerts 733 / 734 (`.github/workflows/on-pr-qvac-lib-infer-llamacpp-llm.yml`, cache poisoning via execution of untrusted code): The new `combine-perf-reports` job inherits the umbrella's `pull_request_target` trigger and consumes PR-author-controlled perf artifacts. The blast radius is already bounded (`permissions: contents: read`, no `actions/cache`, no secrets touched by the aggregator), but the per-device HTML step interpolated a PR-controlled `device.name` into a filesystem path with only a space → dash sanitiser. Harden it: * Pass the JSON file path via `process.argv` instead of shell- interpolating it into `node -e`, so the inner `JSON.parse` never sees shell-meta from the surrounding quoting. * Validate `device.name` is actually a string before using it. * Replace the `tr ' ' '-'` fallback with a strict `[A-Za-z0-9-]` allowlist (anything else collapses to `_`) plus a 64-char cap, which kills path traversal and shell-meta exposure even on a maliciously crafted JSON. * Skip the iteration entirely when the slug ends up empty. Add a security block comment on the job documenting the trust model (privileged trigger, scoped permissions, no shared cache, sanitised filename input) so future readers don't re-derive it. Drive-by: SHA-pin three `actions/upload-artifact@v4` references this PR introduced (one in the mobile workflow, two in the umbrella) to `@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # 7.0.0`, matching the repo's existing pinned-action convention. Pre-existing `@v4` references in `integration-mobile-test-...yml` lines 1684 / 1693 are out of scope for this PR. Made-with: Cursor * QVAC-17830 fix: add shell + security note on Generate combined report step CodeQL alert 733 (cache-poisoning/poisonable-step) is the same trust-model concern the new block comment on combine-perf-reports already documents — privileged checkout of PR-head code is unavoidable because the introducing PR needs to exercise its own aggregate.js changes. Inline the security note on the offending step so the CodeQL fingerprint reflects the explicit acknowledgement, mirroring what already silenced alert 734 on the per-device HTML step. Drive-by: add explicit `shell: bash` on the step (was inheriting from the runner default; explicit is safer + matches the surrounding steps). Made-with: Cursor * QVAC-17830 feat: bridge QVAC_PERF_RUNS overrides into mobile bare runtime Mobile side of QVAC-18111: now that the umbrella benchmark workflow pushes qvacPerfConfig.txt onto the device (PR #1840), the bare runtime needs to actually consume it. Without this, dispatched perf overrides were silently ignored on Android / iOS and tests defaulted to 1+1. Changes: - packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration-runtime.cjs: read qvacPerfConfig.txt from the same paths the testFilter loader uses (/data/local/tmp on Android, global.testDir on iOS), parse KEY=VALUE per line, inject each into bare-os via os.setEnv. Reads on first __shouldRunTest call, which fires before runIntegrationModule imports any test module — so os.getEnv() inside _image-common.js / bitnet.test.js / tool-calling.test.js sees the dispatched value at module init. Normalises both literal '\\n' (the bash-escaped form the WDIO before-hook produces today) and real '\\r?\\n' so the parser doesn't depend on which encoding the workflow happens to use. - .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml: extract the two giant Android / iOS WDIO_CONFIG JS strings out of the 'Create and Upload Test Spec' run: block into env: vars. GitHub Actions imposes a hard 21 000-char limit on any run: block that contains ${{ }} expressions, and the new Android image-perf group + iOS heavy7/heavy8 splits + collapsed-format echo line for the test split summary pushed us over. Moving the WDIO blobs to env: drops the run block from ~21 KB to ~11 KB so we have headroom for further additions. Made-with: Cursor * QVAC-17830 feat: gate Benchmark Performance (LLM) to perf-emitting tests only Adds qvac_perf_only input to the desktop and mobile reusable workflows and turns it on from Benchmark Performance (LLM) so dispatched runs only execute the perf-emitting tests instead of the full integration suite. Desktop: regenerates test/integration/all.js from bitnet, tool-calling and the three image-* tests, then runs bare against it. Mobile: filters each group's run*Test list down to perf-emitting wrappers, skips groups whose intersection is empty (no spec uploaded, no Device Farm run booked), and tolerates gaps in the RUN_ARN_* slot sequence when monitoring / downloading logs. Co-authored-by: Cursor <cursoragent@cursor.com> * fix[ci]: drop PR-head checkout in combine-perf-reports to clear CodeQL 735/736 The `combine-perf-reports` job runs in `pull_request_target` context and was checking out the PR head's `scripts/perf-report` so it could aggregate the matrix artifacts. CodeQL flagged this as a cache-poisoning / poisonable-step issue (alerts 735/736) because we were executing PR-author code in a privileged context. Removing the `repository:` / `ref:` overrides on the checkout makes `actions/checkout` fall back to the base-branch SHA in `pull_request_target` runs, so `aggregate.js` is now trusted code from the default branch. `workflow_dispatch` runs are unaffected because `github.ref` resolves to the dispatched branch. Existing mitigations stay in place: `contents: read` only, no `actions/cache`, SHA-pinned actions, and the strict `[A-Za-z0-9-]` allowlist + 64-char cap on `device.name` before it touches any filesystem path. SECURITY comments updated to reflect the new posture. Co-authored-by: Cursor <cursoragent@cursor.com> * fix[ci]: add runGemma4Test + runOcrPaddleTest to iOS lightB The merge-resolution earlier kept our 3 image perf tests in iOS heavy3/7/8, displacing main's runGemma4Test/runOcrPaddleTest. The validator in scripts/generate-mobile-integration-tests.js requires every test to be assigned on both platforms — these two were missing on iOS. Placed them in lightB alongside runQwen35Test and runModelLoadingTest since they're model-loading tests with similar runtime characteristics. * fix[ci]: bound mobile monitor when AWS API permanently fails Two complementary fixes for the Device Farm monitor loop in integration-mobile-test-llm-llamacpp.yml. 1. Match the initial credentials step's role-duration-seconds: 7200 on the 'Refresh AWS credentials for monitoring' step. Without it, the refresh defaults to the action's 1h fallback while MAX_WAIT_TIME is 2h, so every devicefarm API call past the 1h mark fails until timeout — producing the spinning 'Run N: () | AWS API attempt X/3 failed, retrying in 5s...' output for ~1h with no escape. 2. Make the poll loop self-bounding under persistent API failure: - Initialize per-slot STATUS/RESULT to PENDING/- so the dashboard never prints the empty 'Run N: ()' line. - On aws_retry returning empty (or 'None'), preserve last-known-good STATUS/RESULT for that slot and increment a per-slot FAIL counter. - After 10 consecutive failures (~5 min at the 30s cadence, well past any normal STS / Device Farm transient) mark that slot ERRORED / NO_API_RESPONSE / DONE so the loop exits cleanly instead of burning the full 2h ceiling. Final-results phase reports a real failure. * mod[notask]: isolate iOS Gemma4 and OcrPaddle into their own Device Farm slots Per @gianni-cor review on PR #1843, restore Gemma4 and OcrPaddle to isolated slots instead of folding them into lightB. The image perf tests keep their existing isolation (heavy3/heavy7/heavy8), so the iOS heavy bank grows from 8 → 10 (plus lightA/lightB at slots 11/12). - test-groups.json: add heavy9=runGemma4Test, heavy10=runOcrPaddleTest; drop both from lightB - integration-mobile-test-llm-llamacpp.yml: - filter_perf: extract heavy9 + heavy10 wrappers - upload spec: add testspec-heavy-9.yml + testspec-heavy-10.yml, renumber light specs to test_spec_arn_11 / test_spec_arn_12 - schedule: 12 maybe_schedule calls for iOS - monitor + log download: RUN_ARN_1..12 explicit assignments (the seq-based loops are already dynamic via $RUN_COUNT) Co-authored-by: Cursor <cursoragent@cursor.com> * QVAC-17830 fix: apply fruit-plate iOS OOM mitigation to high-res aurora iOS Device Farm runs (iPhone 16 Pro / 17) were hanging the on-PR job on `runImageHighResAuroraTest`. The 3000x4000 (~12 MP) JPEG pushed peak memory past the ~3.3 GB Jetsam ceiling once the device was warm from earlier jobs in the queue, the app was killed via memorystatus, and Appium never received a clean failure signal — leaving the Device Farm run in RUNNING (PENDING) until the 60-min job timeout. When `image.test.js` was split into per-image files in 37afd3d, the fruit-plate variant correctly received `iosWarmupImage` + `iosPerfRuns: 1`, but high-res aurora — the heaviest image in the suite — was left on the standard path. That meant CPU + GPU each ran 1 warmup + N counted with the full 12 MP JPEG, so on a warm iPhone the cold path crossed the Jetsam line. Mirror the fruit-plate config: - `iosWarmupImage: 'elephant.jpg'` pre-warms the multimodal pipeline with a 23 KB image so Metal shaders, KV cache, and image-prefill buffers are allocated before the 12 MP JPEG arrives. `_image-common.js` skips the standard PERF_WARMUP_RUNS loop once the pre-warmup runs. - `iosPerfRuns: 1` caps iOS counted iterations at 1 regardless of QVAC_PERF_RUNS, so the benchmark workflow's n=3 plan does not OOM the iPhone. iOS cold-path footprint is now exactly 2 inferences per backend (1 small pre-warmup + 1 counted real image), matching fruit-plate's known-good shape. Desktop and Android paths are unchanged. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: gianni-cor <gianfranco.cordella@tether.io>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 2 of the LLM benchmark workflow: extends iteration-count overrides (
QVAC_PERF_RUNS/QVAC_PERF_WARMUP_RUNS) to the Device Farm side, and wires Android + iOS into the dispatch so the benchmark workflow covers desktop and mobile.Note
Stacked on #1839 (the desktop-only scaffold, which already adds the
run_desktoptoggle). Once #1839 lands onmain, GitHub will retarget this PR's base tomainautomatically.The bare runtime inside the Device Farm test app does not see GitHub Actions env vars, so this mirrors the existing
testFilter.txtpattern — when the workflow inputs are non-empty, the WDIObeforehook pushes aqvacPerfConfig.txtto the device with the iteration overrides as KEY=VALUE lines. The file-reading side on bare lives on the QVAC-17830 perf branch; without it this PR is a harmless no-op (orphan file), so it is safe to land independently.Changes
integration-mobile-test-qvac-lib-infer-llamacpp-llm.ymlqvac_perf_runs/qvac_perf_warmup_runstoworkflow_callandworkflow_dispatch(empty default ⇒ no behaviour change for PR runs)__QVAC_PERF_RUNS__/__QVAC_PERF_WARMUP_RUNS__placeholders to the Android + iOS WDIO config blobs and a siblingpushFileblock in thebeforehook (writesKEY=VALUE\npairs)/data/local/tmp/qvacPerfConfig.txt@io.tether.test.qvac:documents/qvacPerfConfig.txtmake_splitsubstitutes the new placeholders alongside the existing__TEST_FILTER__benchmark-performance-qvac-lib-infer-llamacpp-llm.ymlmobile-benchmarksjob callingintegration-mobile-test-…ymlwith the bench-mode iteration countsrun_mobiletoggle (workflow_dispatchboolean input, defaults totrue) and gatemobile-benchmarkswithif: ${{ inputs.run_mobile }}. Together with therun_desktoptoggle from QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch #1839 this lets dispatchers run desktop and mobile benchmarks independently (full / desktop-only / mobile-only)summarize.needsincludesmobile-benchmarksand is hardened withif: ${{ always() && needs.context.result == 'success' }}so it still produces a partial report when one of the two matrices is toggled off or failsToggle matrix (combined with #1839)
run_desktoprun_mobileprebuildis not gated and runs unconditionally; splitting it per matrix is a separate optimisation.What's NOT in this PR
_envInt()file-fallback in_image-common.js/bitnet.test.js/tool-calling.test.js— that lives on the QVAC-17830 perf branch (where_envIntis defined). Without it, the file pushed by this PR is unread (no behaviour change). The two PRs together (perf branch + this PR + QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch #1839) deliver bench-mode for mobile.Test plan
qvacPerfConfig.txtpushed when inputs are empty)Benchmark Performance (LLM)and confirm:[QvacPerfConfig] loaded ...(or equivalent — naming TBD on perf branch)mean ± stdinstead of single-shot datarun_desktop=false,run_mobile=true→ only Device Farm runs; summarize still produces a (mobile-only) reportrun_desktop=true,run_mobile=false→ desktop-only run, no Device Farm cost