QVAC-17830 feat: VLM performance metrics across desktop + mobile (CPU & GPU) by tobi-legan · Pull Request #1843 · tetherto/qvac

tobi-legan · 2026-04-30T19:58:13Z

🎯 What problem does this PR solve?

QVAC-17830: VLM inference has no per-platform performance signal in CI. We can't tell when a code change tanks vision-encode / prefill / decode / TTFT / TPS for image, tool-calling, or bitnet workloads on Linux / macOS / Windows / Android / iOS, let alone CPU vs GPU on the same device.
Heavy-iteration benchmarking (mean ± std) on every PR would 2–3× our CI cost, but a single counted run per PR is enough to flag regressions, provided the same harness can be re-dispatched with bigger iteration counts on demand.

📝 How does it solve it?

Per-test perf records (image, bitnet, tool-calling) recording total_time_ms, prefill_time_ms, decode_time_ms, vision_encode_time_ms (stub, native wiring tracked under QVAC-18103), ttft_ms, prompt_tokens, generated_tokens, tps, plus scenario, device.gpu (probed per-platform), backend, platform. Shared singleton reporter extracted into _perf-helper.js; _image-common.js consumes it.
Configurable iterations via QVAC_PERF_RUNS / QVAC_PERF_WARMUP_RUNS read through bare-os.getEnv() (with a process.env fallback for Node code paths). Defaults to 1 warmup + 1 counted so PR runs stay cheap. The dedicated Benchmark Performance (LLM) workflow_dispatch (scaffolded by QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch #1839 + mobile-bridged by QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile #1840, both merged) cranks it up.
Mobile bridge: the umbrella benchmark workflow pushes qvacPerfConfig.txt to the device; the bare runtime in integration-runtime.cjs reads it on first __shouldRunTest call and injects QVAC_PERF_RUNS / QVAC_PERF_WARMUP_RUNS via os.setEnv() before any test module imports.
Perf-only filter on the benchmark workflow (qvac_perf_only input). When dispatched, desktop regenerates test/integration/all.js from only the 5 perf-emitting tests (bitnet, tool-calling, image-elephant, image-fruit-plate, image-high-res-aurora). Mobile filters each Device Farm group's wrapper list down to perf-emitting tests, skips groups whose intersection is empty, and tolerates gaps in the slot sequence. Cuts dispatched runs from ~full integration suite down to just the perf path.
Aggregated reporting in scripts/perf-report/aggregate.js: combined HTML artifact (HTML-Report-All-Platforms-*) + per-device HTML (HTML-Reports-Per-Device-*) + GitHub step-summary markdown with squashed mean ± std mini-tables for Total Time / Prefill / Decode / Vision Enc / TTFT / TPS, grouped by scenario (image / tool-calling / bitnet), with per-scenario column filtering (e.g. bitnet only shows the 2 Android columns) and a short "- = not run on this device" legend.
Mobile stability: iOS fruit-plate (heavy7) is a cold heavy group with 1 inline elephant warmup + 1 counted run (was OOM-crashing under Jetsam). Android image tests consolidated into one Device Farm group (groupImagesPerf) so all three images run in a single invocation.
GPU probe in performance-reporter.js populates device.gpu (nvidia-smi / lspci on Linux, system_profiler on macOS, wmic on Windows) so detail tables show Apple M3 / GeForce RTX 4070 / Adreno 750 instead of unknown.
CodeQL: 3 alerts addressed on the combine-perf-reports job (regex-injection in extract-from-log.js, two cache-poisoning paths in the umbrella workflow). Strict [A-Za-z0-9-] allowlist on PR-controlled device.name before it touches the filesystem; trust-model documented inline. actions/upload-artifact references SHA-pinned to match the repo convention.

🧪 How was it tested?

CI run 25181286617 on this branch — full umbrella green except two known external flakes (cpp-lint formatting noise + linux-x64 exit 139 after tests pass). Verified in artifacts:
- perf-report-llamacpp-llm-Android-2427 — 1×bitnet + 2×tool-calling + 6×image rows on Pixel 9 Pro / S25 Ultra.
- perf-report-llamacpp-llm-iOS-2427 — image + tool-calling rows on iPhone 16 Pro / 17 (bitnet correctly skipped).
- 7 desktop perf JSONs (linux-x64-cpu/gpu, linux-arm64-u22/u24, darwin-arm64/x64, win32-x64).
- Combined markdown summary contains every scenario block under Mean Total / Prefill / Decode / TTFT / TPS + per-device detail tables.
Perf-only filter end-to-end validated on a Benchmark Performance (LLM) dispatch: Android ran only the 3 perf groups (vs full integration suite); iOS skipped Heavy1/4/5/6/LightB groups whose perf-intersection was empty.
Bare runtime fix (env via os.getEnv()) verified locally — previous process.env lookup threw ReferenceError: process is not defined on Bare module-init; the guarded os.getEnv() path passes.
Local aggregate.js regen against the run artifacts reproduces the same combined markdown / HTML output the workflow uploads.
iOS fruit-plate stability: 5 consecutive Device Farm runs without OOM after switching to the cold-group + inline-warmup pattern.

Test-only PR. No addon runtime / native code changed; package.json version intentionally unchanged.

Adds CPU + GPU VLM performance reporting to the llamacpp-llm addon on desktop + mobile (iOS / Android Device Farm), mirroring the OCR perf pipeline. Perf reporter / aggregator - scripts/test-utils/performance-reporter.js: extend `vision` addon type with backend, platform, prefill, decode, vision_encode, image_prefill and status columns. - scripts/perf-report/utils.js: teach the aggregator the new VLM metric labels so markdown/HTML reports render them. CI wiring - integration-test-qvac-lib-infer-llamacpp-llm.yml (desktop): add aggregate + upload-artifact steps so CPU/GPU perf artifacts are produced alongside the existing test run. - integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml (mobile): scan Device Farm logs for [PERF_REPORT_START]...[PERF_REPORT_END] markers, run extract-from-log.js + aggregate.js, upload per-platform perf artifacts. Mirrors the OCR mobile workflow pattern. Test refactor (image.test.js) - Include image name in the perf label ("[elephant] [CPU]" etc.) so aggregate.js (which groups by `result.test`) no longer pools different images into a single CPU/GPU cell. - Run 1 warmup + 3 counted inferences per (image x backend) so the aggregator produces count=3, mean, std within a single CI run -- matches OCR's PERF_RUNS=3 convention. - Reuse the loaded LlmLlamacpp instance across iterations inside a single test() block: deliberate divergence from OCR's one-test()- per-iteration shape because VLM model load is multi-GB and already stresses iOS Device Farm memory. - Bump per-test timeout to 25 min for the multi-iter perf tests; leave TEST_CONSTANTS.timeout unchanged for other tests in the file. - Keyword assertions run once against the last counted iteration's output. vision_encode_time_ms / image_prefill_time_ms are stubbed as null pending native RuntimeStats instrumentation (separate follow-up task).

Mirrors the OCR desktop/mobile split. Adds a "Vision/LLM (Mobile)" block to the scheduled cron job so Monday 9am UTC runs aggregate perf artifacts from Mobile Integration Tests (LLM) into a separate llamacpp-llm-mobile-performance.{md,json,html} report alongside the existing desktop one. Also adds "Mobile Integration Tests (LLM)" to the workflow_dispatch workflow_name choice list so it can be triggered manually. Without this, the per-run Android/iOS perf artifacts the mobile workflow now uploads would sit unused by the weekly cross-run aggregator.

Mirrors the `combine-reports` job in the OCR mobile workflow. Runs after the per-platform matrix completes and downloads every artifact matching `perf-report-llamacpp-llm-*-<run_number>` -- which, because the desktop (integration-test-*.yml) and mobile (this file) workflows are both invoked via workflow_call from the umbrella On PR Trigger (LLM), share the same github.run_number and therefore live together in the single run's artifact scope. The job then: - aggregates all desktop + mobile rows into a single combined HTML, Markdown and JSON report via scripts/perf-report/aggregate.js; - patches the `device.name` for desktop artifacts using the artifact suffix (e.g. "linux-x64", "macos-arm64") so they render with recognisable column headers next to the mobile device names; - emits a per-device HTML report for each unique device encountered (desktop OS/arch variants + each mobile device); - writes the combined markdown report into $GITHUB_STEP_SUMMARY so the PR run page shows one unified perf table covering every platform the addon ran on in that run. This is what the user actually meant by "joint performance reporter" -- distinct from the weekly/cross-run aggregation in perf-report.yml, which aggregates historical runs for trend tracking.

…er device Android perf data was never surfacing because the inline fallback reporter used on mobile was a no-op stub (no record/writeToConsole), so logcat never saw PERF_REPORT markers. iOS image tests were also crashing mid-run (Jetsam/memorystatus) before process.on('exit') could flush, losing whichever runs had already completed. - Replace the mobile fallback reporter in _image-common.js with a functional inline implementation that mirrors OCR's: record(), writeReport(), and writeToConsole() with lightweight snapshots and logcat-safe chunking. - Flush incrementally after every record() on mobile so earlier iterations survive a SIGKILL later in the same group. - Split image.test.js into per-image files (elephant, fruit plate, high-res aurora) behind a shared _image-common.js helper, and wire each into its own Device Farm group (Android: 3 new groups; iOS: heavy3=elephant, heavy7=fruit, heavy8=aurora). - Teach extract-from-log.js to --merge multiple reports per device (needed because each physical device now runs three groups) and add --device-from-filename as a fallback when logs land in a flat layout. - Update the mobile workflow to schedule 5 Android / 10 iOS runs, bucket downloaded artifacts by device, unzip Customer_Artifacts inline so bare_console.log is scannable, and invoke extract-from-log.js with --merge. Verified locally with a happy-path sim (3 samples per image×backend per device) and a Jetsam crash sim (aurora SIGKILLed mid-GPU on iPhone 16 Pro Max) — 16 of 18 rows still reach the aggregated report on the crashed device, and the other three groups report fully. Made-with: Cursor

Fruit plate iOS test crashed inside Builtin_JsonStringify because the per-test flush stringified the cumulative results array on every record() call (3+ runs x 2 backends x verbose VLM output over a 10MB image + Metal compiler memory). Even with logcat chunking, the JSON build itself exhausted V8's Zone allocator, producing SIGTRAP from FatalProcessOutOfMemory before any data reached logcat. Changes in the mobile fallback reporter (_image-common.js): - Cap stored output text to 400 chars per record so the accumulated results array stays small regardless of model verbosity. - Add delta mode to writeToConsole: emit only the latest row per flush, so stringify stays O(1) in iteration count on mobile. - Stop calling writeReport() on disk per-record; rewriting the full JSON every iteration is both expensive and an OOM hazard. One final writeReport runs in the exit hook for the on-device artifact. - Remove the final cumulative writeToConsole in the exit hook: the per-test delta emits already carry every row, and emitting a large cumulative payload at exit is exactly when we most need the earlier deltas to survive in logcat/syslog. Changes in extract-from-log.js: - Split extractFromText into extractAllFromText (returns every report parsed from the text) + the legacy extractFromText wrapper that returns the last one. - Add opts.all to extractFromFile so --merge gathers every delta emit from a log file, not just the largest or last marker pair. - main() with --merge now pushes all reports per file into the device bucket before handing them to mergeDeviceReports, so partial data from a pre-crash run still lands in the aggregate. Verified locally with a delta-emit simulation that mid-test crashes leave runs 1..N-1 intact in the reconstructed report. Made-with: Cursor

…ct collision) Two pre-existing bugs caused the combined performance table to drop data on a PR run: 1. `darwin-x64` (and any other slow desktop job) was missing because `combine-reports` lived inside the mobile workflow and only waited on the mobile matrix. Desktop jobs that finished AFTER mobile were silently excluded. Move the job to the umbrella (`on-pr-qvac-lib-infer-llamacpp-llm.yml`) as `combine-perf-reports` with `needs: [run-integration-tests, run-mobile-integration-tests]`, so it blocks on the full fan-out before scanning artifacts. 2. `linux-x64` GPU data (and the ubuntu-22 vs ubuntu-24 arm64 pair) was overwritten because every matrix entry uploaded its artifact under `perf-report-llamacpp-llm-<platform>-<arch>-<run>`; any two entries sharing platform+arch clobbered each other on download. Add a unique `label` field to each matrix entry (`linux-x64-cpu`, `linux-x64-gpu`, `linux-arm64-u22`, `linux-arm64-u24`, ...) and key the upload + device-name patch on `matrix.label` instead. Made-with: Cursor

…flush logs on crash Three changes scoped to the mobile LLM integration workflow: * Android: merge the three per-image Device Farm groups back into a single groupImagesPerf run. Android does not have the iOS jetsam pressure, so running all three image tests in one invocation is both cheaper and simpler while still giving the aggregator 9 rows (3 images x 3 iters) per device. * iOS: schedule Heavy7 (fruit plate) a second time as Heavy7-ImgFruitPlate-Retry. Both attempts land in the same per-device folder on download, and extract-from-log.js --merge dedupes on the metric fingerprint, so if either attempt gets past iOS jetsam we have perf data. If both succeed we just get more iterations to average over, which is strictly better. run_count goes 10 -> 11 and the Monitor / Download loops were extended accordingly. * Crash path (both platforms): route process.exit(1) in checkAppCrash and crashMonitor through a new global.flushLogsAndExit helper that (a) sleeps 5s so bare stdout + iOS os_log / Android logcat can drain and (b) on iOS performs an inline Appium pull_file of bare_console.log to $DEVICEFARM_LOG_DIR before exiting. This means delta perf emits that hit disk before a jetsam kill now reach the aggregator instead of being stuck in the app sandbox. Made-with: Cursor

Previous commit pushed the `Create and Upload Test Spec` run: block over GitHub Actions' ~30K template-expression budget, so the umbrella workflow refused to dispatch with: Line 927, Col 14: Exceeded max expression length 21000 Root cause: `global.flushLogsAndExit` was inlined into BOTH Android and iOS wdio configs. On iOS it duplicated the exact Appium pull_file logic the `after` hook already has, and on Android it carried dead iOS-only code — roughly 1.8 kB of repeated JS across the two configs. Instead: * Drop `global.flushLogsAndExit` entirely. * In `checkAppCrash` and `crashMonitor` (both Android + iOS), replace `process.exit(1)` with `setTimeout(function(){process.exit(1);},5000)`. Same 5s drain window for logcat / os_log / bare stdout, one line of JS, zero helper definition. * iOS `after` hook still pulls `bare_console.log` via Appium on the normal path, so delta perf data from successful runs is unchanged. Net `run:` block size goes from 31,524 → 28,992 chars (below the previously-working 28,995). Both wdio configs revalidated with `node --check` and the full workflow YAML round-trips through PyYAML. Made-with: Cursor

…tail tables - Add iOS-only tiny-image (elephant.jpg) warmup to fruit plate test so Metal shaders / KV cache / image-prefill buffer are allocated with a cheap 23 KB image before the 10 MB PNG hits, keeping us under Jetsam's per-process memory cap. Opt-in via testCase.iosWarmupImage so desktop and Android paths remain byte-for-byte identical. - Fold linux-x64-cpu / linux-x64-gpu (and linux-arm64-u22 / u24) matrix legs into single "linux-x64" / "linux-arm64" columns in the combined report's Fix desktop device names step. Previously the summary showed four separate columns with "-" GPU rows that looked like missing data even though the sibling leg had populated them. - Preserve categorical metrics (backend, platform, status, execution_provider) through aggregateReports and render per-device detail tables for every device — mobile included — in the combined markdown summary via a new --device-details flag on aggregate.js. Mobile devices now get the same twelve-column breakdown (Test, EP, Backend, Platform, Total / Prefill / Decode / Vision Enc / Img Prefill / TTFT ms, Gen Tokens, Prompt Tokens, TPS, Status) that desktop matrix legs emit via writeStepSummary. Verified end-to-end with the local simulation at /tmp/vlm-combined-sim/ (27/27 assertions pass). Made-with: Cursor

Two fixes after run 24909705757 landed iOS perf data for elephant + aurora but still lost fruit plate, and the combined HTML report was missing the row-per-test detail tables that the step-summary markdown has. 1) iOS-only `iosPerfRuns` override, set to 1 for fruit plate alone. With the elephant pre-warmup in place, the fruit-plate test still crashed at t+85s (V8 bad_alloc during JSON.stringify) on the 2nd/3rd counted iteration, leaving zero rows in the perf artifact for both iPhone 16 Pro and iPhone 17. Dropping iOS fruit plate to 1 counted iteration cuts the cold-path peak footprint from 4 inferences to 2 (warmup + 1 counted), which stays under the 3376 MB Jetsam cap in local simulation. Desktop + Android still run the full PERF_RUNS=3. 2) Wire `generateDeviceDetailTables` into `generateHtmlReport` so the combined HTML report mirrors the markdown step-summary layout — one section per device with Backend / Platform / Total Time / Prefill / Decode / Vision Enc / Img Prefill / TTFT / Gen Tokens / Prompt Tokens / TPS / Status columns. Previously only the markdown had those columns, so mobile devices (which can't write to GITHUB_STEP_SUMMARY from Device Farm) appeared blank in the HTML artifact. The per-device HTML report step also gets the flags so each individual HTML has parity. Verified locally against the run-24909705757 artifact set: all 9 devices render the full row- per-test summary, including Apple iPhone 16 Pro / iPhone 17 with backend=metal and platform=ios-arm64. Made-with: Cursor

Five user-requested follow-ups after run 24913460964: 1) iOS heavy7 is now a "warm process" group. test-groups.json puts runApiBehaviorTest before runImageFruitPlateTest so V8 / Metal / native allocators are pre-warmed by a cheap test before the 10 MB fruit-plate PNG hits the multimodal pipeline. 2) The iOS Heavy7-ImgFruitPlate-Retry one-shot retry run (RUN_ARN_11) is removed from the mobile workflow. Run count goes from 11 → 10; monitoring + log-download loop bounds updated; the inline comment that justified the retry is replaced with one explaining the new warm-process rationale. 3) JS-only crash-path bare_console.log pull on iOS. The existing pull_file logic that used to live inline in `after` is now a global.flushBareLog helper installed in `before`, called from global.checkAppCrash and the setInterval crash monitor (after a 1.5 s browser.pause to give stdout a chance to flush) BEFORE the 5 s setTimeout(process.exit). Net effect: crashed runs that previously left a 0-byte LOG artifact now have a real bare_console.log on disk for triage. No C++ touched. 4) `aggregateReports` now dedupes sibling matrix legs that get folded into one device name. Previously linux-x64 / [elephant] [CPU] accumulated 6 iterations (3 from linux-x64-cpu + 3 from the GPU leg's CPU rows) which surfaced as "Run 4 / 5 / 6" headers in the per-test breakdown tables. The new rule: per (device, test, run_number), only the FIRST report's rows are kept — subsequent sibling-leg reports for the same pair are skipped. Within a single report all PERF_RUNS iterations still flow through (so counted runs per device stays at 3). The weekly aggregator is unaffected because its inputs span different run_numbers. Verified in /tmp/vlm-combined-sim/sim.js with synthetic 3-iter reports per matrix leg; linux-x64 / [elephant] [CPU] now reports `count: 3` and the markdown header reads "Iterations: 3". 5) Per-device detail-table headings use the device name as the anchor instead of a repeated "### Performance: <addon>". GitHub collapses duplicate heading slugs in step summaries which made it impossible to deep-link or scroll to the mobile detail tables — they were rendered, just hidden under a single anchor. Each device now gets `### <Device> — <addon> (<platform/arch>)` so Android / iOS sections appear in the step-summary TOC. Made-with: Cursor

The previous warm-process seed (runApiBehaviorTest) used a text-only Qwen3-0.6B model, so it never exercised the mmproj load or the Metal vision shader JIT that fruit-plate then hit cold. iPhone 17 kept SIGABRT'ing between the LLM context init and the first inference during exactly those steps. Swap heavy7 to [runImageElephantTest, runImageFruitPlateTest] so the small (23 KB) elephant test pre-warms the exact VLM code path — SmolVLM2 + mmproj + Metal vision graph + KV cache — before the 10 MB fruit-plate PNG lands.

The previous attempt put runImageElephantTest as a "warm process" prefix to heavy7. That actually executes the full elephant perf cycle — 1 warmup + 3 counted on CPU, same on GPU, plus a media- path test = 9 elephant inferences and 2 SmolVLM2 model loads — before runImageFruitPlateTest gets to do anything. Net effect on heavy7 was 12 inferences pre-fruit-plate-row, which regressed iPhone 16 Pro (had been passing) and iPhone 17 still died after the inline elephant warmup completed. New shape: - test-groups.json: iOS heavy7 = ["runImageFruitPlateTest"] (cold process again — runImageElephantTest stays in heavy3 as its own perf slot, untouched). - _image-common.js: when iOS pre-warmup (testCase.iosWarmupImage) succeeds, set an iosPreWarmupRan flag and skip the standard PERF_WARMUP_RUNS loop. The pre-warmup IS the warmup pass. - image-fruit-plate.test.js: same iosWarmupImage='elephant.jpg' + iosPerfRuns=1 config; comments rewritten to describe the new flow. - mobile workflow: heavy7 comment block updated to match. End-to-end iOS shape for fruit-plate is now exactly what was asked for: 1. setupMultimodalInference loads SmolVLM2 cold 2. 1 elephant inference (inline iosWarmupImage, no perf record) 3. standard warmup loop SKIPPED 4. 1 fruit-plate inference (counted, perf recorded) = 2 inferences total Desktop and Android paths are completely unchanged — they still run 1 warmup + PERF_RUNS=3 counted. iOS tests without an iosWarmupImage (elephant, high-res aurora) keep their existing 1 warmup + iosPerfRuns/PERF_RUNS counted shape. Made-with: Cursor

…erf for bitnet/tool-calling - extract _perf-helper.js out of _image-common.js so bitnet, tool-calling and image tests share one perf-recording path - add `scenario` field on perf records ('image' / 'bitnet' / 'tool-calling' / 'default'), plumb through aggregator + reports. Per-device detail tables (md + html) now group rows by scenario so a single combined report can surface multiple test families side by side instead of mixing them into one giant table - drop the always-'ok' Status column from the vision metric layout (just noise; the JS layer no longer hardcodes status: 'ok') - squash the combined GH step summary to three Mean ± std mini-tables (Total Time / TTFT / TPS), grouped by scenario. The HTML artifact still carries the full per-iteration breakdown - split aggregate.js --device-details into --md-device-details and --html-device-details (legacy --device-details still toggles both) - wire bitnet.test.js (1 warmup + 1 counted) and tool-calling.test.js (one perf row per model_variant x prompt cell) onto _perf-helper.js - add desktop GPU probe in detectDevice(): nvidia-smi -L on linux/win32, lspci as linux fallback, system_profiler on darwin, wmic on win32. New device.gpu field is surfaced in the per-device detail-table headings (md + html). Mobile leaves gpu=null — device.name (Device Farm label) already implies the chipset - add scripts/perf-report/sim-llamacpp-llm.js for local report-shape verification mirroring the umbrella workflow flags Made-with: Cursor

…ells QVAC-17830. The squashed Mean ± std mini-tables stay at the top of the combined PR summary, but the per-device detail tables (Test / EP / Backend / Platform / Total / Prefill / Decode / Vision Enc / Img Prefill / TTFT / Gen Tokens / Prompt Tokens / TPS) are now appended underneath instead of being HTML-only. Numeric cells in the detail tables render as `mean ±std` so the iteration spread is visible at every metric, not just the rolled-up ones; token-count columns stay as bare integers since deterministic generation makes `±0` noise. The umbrella workflow flips from `--html-device-details` back to `--device-details` to apply this in both the markdown and HTML outputs. Made-with: Cursor

QVAC-17830. Per-prompt image-prefill time turned out to overlap with the prefill_time_ms / TTFT signal we already report, so the column was essentially redundant. Vision Enc (ms) — the mmproj / vision- encoder time — is the only VLM-specific timer worth keeping for now; its native wiring is tracked separately under https://app.asana.com/1/45238840754660/project/1212638335655990/task/1214371583877702. Removed everywhere it surfaced: - scripts/test-utils/performance-reporter.js (vision METRIC_COLUMNS + record() default metrics) - scripts/perf-report/utils.js (label map + vision detail columns) - packages/qvac-lib-infer-llamacpp-llm/test/integration/_perf-helper.js (mobile inline-fallback default metrics + recordPerformance payload) - scripts/perf-report/sim-llamacpp-llm.js (synthetic row generators) Made-with: Cursor

…m-perf-metrics

…m-perf-metrics Made-with: Cursor # Conflicts: # packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration.auto.cjs # packages/qvac-lib-infer-llamacpp-llm/test/mobile/test-groups.json

…m-perf-metrics Made-with: Cursor # Conflicts: # .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml # scripts/perf-report/utils.js

… + legend) The cross-platform mean tables previously emitted ALL device columns for every scenario block, so e.g. the bitnet block (Android-only by design — `skip: !isAndroid` in bitnet.test.js) showed 7 empty desktop/iOS columns of dashes, which made the report look broken when it was actually intentional. - Per-scenario × per-metric column filtering: drop columns that have zero data in this scenario block. Bitnet collapses from 9 → 2 columns (Pixel 9 Pro + S25 Ultra). Tool-calling drops `darwin-x64` (test is `skip: isDarwinX64`). - Short legend right under the report title explaining the dashes that remain (gated tests like medgemma-4b-it = desktop-only, mobile = qwen3-1.7b only; `darwin-x64` / `linux-arm64` have no GPU runner). Per-device detail tables are unchanged — they were already scenario- partitioned so the layout fix doesn't apply there. Made-with: Cursor

@Olya

…LLM) workflow_dispatch New perf policy agreed on Slack (2026-04-30, @Olya / @gianfranco): PR runs default to 1 warmup + 1 counted iteration (cheap, single- shot signal). The dedicated `Benchmark Performance (LLM)` workflow is the only place we crank iterations up to produce mean ± std. Changes: - `_image-common.js` and `bitnet.test.js`: replace hardcoded `PERF_RUNS=3 / PERF_WARMUP_RUNS=1` with `_envInt('QVAC_PERF_RUNS', 1)` / `_envInt('QVAC_PERF_WARMUP_RUNS', 1)`. Bad / empty / negative env values fall back to the safe default of 1+1. - iOS fruit-plate's `iosPerfRuns: 1` override is preserved — it caps iPhone iterations regardless of `QVAC_PERF_RUNS` so the benchmark workflow doesn't OOM the Jetsam ceiling on heavy images. Comments updated to reflect the new default. - `integration-test-qvac-lib-infer-llamacpp-llm.yml` accepts new `qvac_perf_runs` / `qvac_perf_warmup_runs` workflow_call inputs (default empty ⇒ unset env ⇒ test honours PR default of 1+1). The umbrella PR workflow doesn't pass these, so the PR critical path is unaffected. - `benchmark-llamacpp-llm.yml` (new): workflow_dispatch only. Calls the desktop integration-test reusable with `qvac_perf_runs=3`, aggregates a benchmark report, uploads HTML artifact + GitHub step summary. Phase-1 scope: desktop matrix only. Mobile (Android / iOS Device Farm) needs a build-time env-var injection hook in the test app — tracked as a QVAC-18111 follow-up. Made-with: Cursor

… runner The `linux-x64-cpu` matrix leg (ubuntu-22.04, no_gpu='true') runs on a CPU-only runner — llama.cpp falls back to CPU since no Vulkan device is present — but `tool-calling.test.js` previously hardcoded `useCpu = isLinuxArm64`, so its perf rows came out tagged `[GPU]`. That made the combined report show GPU bars on a CPU runner for tool-calling rows, which read as inconsistent next to the [CPU] rows from `image` and `bitnet`. Match the `_image-common.js` pattern: read `NO_GPU` from both `process.env` and `os.getEnv()`, treat any case-insensitive "true" as CPU. `useCpu` is now `isLinuxArm64 || noGpu`, so the `linux-x64-cpu` leg labels its rows `[CPU]` correctly while the `linux-x64-gpu` leg keeps `[GPU]`. Made-with: Cursor

…cross-platform tables Two follow-ups based on review feedback on the combined report: - Trim the verbose "intentional gaps" legend down to a single short italic line (`> _` `-` ` = not run on this device._`). The per-scenario column filter already drops fully-empty columns, so the long explanation was no longer pulling its weight. - Surface the same metrics the per-device detail tables already show (Total Time, Prefill, Decode, Vision Encode, TTFT, TPS) in the cross-platform mini-tables, instead of just Total/TTFT/TPS. Reviewer needs the full breakdown to compare backends, not just the headline. `vision_encode_time_ms` is currently null everywhere (native timer pending — Asana 1214371583877702) so its mini-table is auto-suppressed by the existing `hasAnyData` guard until the timer lands; the entry is in the list ahead of time so the table appears automatically once data flows. Made-with: Cursor

…m-perf-metrics

Bare runtime doesn't define `process` as a global at module-init time, so referencing `process.env[key]` from the top-level `_envInt(...)` call in `bitnet.test.js` blew up the entire integration suite with `ReferenceError: process is not defined` on first eval. Switch the env reads to `bare-os`'s `os.getEnv()` (already imported) as the primary source, and keep a `typeof process !== 'undefined'`- guarded `process.env` fallback for any Node code paths that happen to import these files. Apply the same pattern in: - `_image-common.js` — for both `_envInt` and the existing `noGpuEnv` lookup which had the same latent bug. - `bitnet.test.js` — for `_envInt`. - `tool-calling.test.js` — for the new `noGpuEnv` lookup added in the linux-x64-cpu label fix. Verified locally: behaves correctly when `process` is undefined, when `process.env[key]` is unset, and when only one of the two sources has the value. Made-with: Cursor

@Olya

… addon convention Rename `.github/workflows/benchmark-llamacpp-llm.yml` → `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` to match the existing `benchmark-performance-qvac-lib-infer-<addon>.yml` pattern on main (Parakeet, Whispercpp, Supertonic TTS, etc.). Also align the structure: introduce a `context` job that derives `repository` / `ref` from optional inputs (with fallback to `github.repository` / `github.ref_name`), wire prebuild + desktop integration tests through that, and add a `summarize` job whose artifact follows the convention `<addon>-performance-findings.{md, html,json}`. The new `qvac_perf_runs` / `qvac_perf_warmup_runs` inputs (defaulting to 3 / 1) are preserved — this is still the single place we crank perf iterations up to produce mean ± std numbers, and the umbrella PR workflow continues to leave them at the cheap default of 1 + 1. Workflow_dispatch only — explicitly NOT wired into the PR umbrella, matching the policy agreed on Slack (2026-04-30, @Olya / @gianfranco). Made-with: Cursor

Picks up the QVAC-18111 scaffold PRs that landed on main: - #1839: Benchmark Performance (LLM) workflow_dispatch + qvac_perf_runs/ qvac_perf_warmup_runs plumbing on integration-test-...yml - #1840: bridge of those inputs into integration-mobile-test-...yml via pushFile of qvacPerfConfig.txt onto the device Conflict resolution: - benchmark-performance-...yml: take main entirely (canonical, has run_desktop+run_mobile toggles + mobile-benchmarks job) - integration-test-...yml: keep this branch's HTML perf report generation step + per-runner perf-report-llamacpp-llm-* upload, drop the chatty inline comments around qvac_perf_runs (config is now self-documenting in the workflow surface) - integration-mobile-test-...yml: take main's WDIO_CONFIG for Android (has the __QVAC_PERF_RUNS__/qvacPerfConfig.txt pushFile bridge) Made-with: Cursor

…time Mobile side of QVAC-18111: now that the umbrella benchmark workflow pushes qvacPerfConfig.txt onto the device (PR #1840), the bare runtime needs to actually consume it. Without this, dispatched perf overrides were silently ignored on Android / iOS and tests defaulted to 1+1. Changes: - packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration-runtime.cjs: read qvacPerfConfig.txt from the same paths the testFilter loader uses (/data/local/tmp on Android, global.testDir on iOS), parse KEY=VALUE per line, inject each into bare-os via os.setEnv. Reads on first __shouldRunTest call, which fires before runIntegrationModule imports any test module — so os.getEnv() inside _image-common.js / bitnet.test.js / tool-calling.test.js sees the dispatched value at module init. Normalises both literal '\\n' (the bash-escaped form the WDIO before-hook produces today) and real '\\r?\\n' so the parser doesn't depend on which encoding the workflow happens to use. - .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml: extract the two giant Android / iOS WDIO_CONFIG JS strings out of the 'Create and Upload Test Spec' run: block into env: vars. GitHub Actions imposes a hard 21 000-char limit on any run: block that contains ${{ }} expressions, and the new Android image-perf group + iOS heavy7/heavy8 splits + collapsed-format echo line for the test split summary pushed us over. Moving the WDIO blobs to env: drops the run block from ~21 KB to ~11 KB so we have headroom for further additions. Made-with: Cursor

…sts only Adds qvac_perf_only input to the desktop and mobile reusable workflows and turns it on from Benchmark Performance (LLM) so dispatched runs only execute the perf-emitting tests instead of the full integration suite. Desktop: regenerates test/integration/all.js from bitnet, tool-calling and the three image-* tests, then runs bare against it. Mobile: filters each group's run*Test list down to perf-emitting wrappers, skips groups whose intersection is empty (no spec uploaded, no Device Farm run booked), and tolerates gaps in the RUN_ARN_* slot sequence when monitoring / downloading logs. Co-authored-by: Cursor <cursoragent@cursor.com>

…m-perf-metrics Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml

…m-perf-metrics

…L 735/736 The `combine-perf-reports` job runs in `pull_request_target` context and was checking out the PR head's `scripts/perf-report` so it could aggregate the matrix artifacts. CodeQL flagged this as a cache-poisoning / poisonable-step issue (alerts 735/736) because we were executing PR-author code in a privileged context. Removing the `repository:` / `ref:` overrides on the checkout makes `actions/checkout` fall back to the base-branch SHA in `pull_request_target` runs, so `aggregate.js` is now trusted code from the default branch. `workflow_dispatch` runs are unaffected because `github.ref` resolves to the dispatched branch. Existing mitigations stay in place: `contents: read` only, no `actions/cache`, SHA-pinned actions, and the strict `[A-Za-z0-9-]` allowlist + 64-char cap on `device.name` before it touches any filesystem path. SECURITY comments updated to reflect the new posture. Co-authored-by: Cursor <cursoragent@cursor.com>

Resolved against the package-rename PR (#1860) on main: - packages/qvac-lib-infer-llamacpp-llm -> packages/llm-llamacpp - workflows renamed (e.g. on-pr-qvac-lib-infer-llamacpp-llm.yml -> on-pr-llm-llamacpp.yml) Conflicts resolved: - Moved our 5 new image perf test files (image-elephant, image-fruit-plate, image-high-res-aurora, _image-common, _perf-helper) under the new packages/llm-llamacpp/test/integration/ path. - Honored our deletion of image.test.js (split into 3 perf tests). - test-groups.json: kept our iOS slot mapping for the 3 image perf tests (heavy3/heavy7/heavy8) and our Android groupImagesPerf; took main's lightA/ lightB/groupA/groupB additions for runQwen35Test and the afriquegemma/moe drops. main's runGemma4Test/runOcrPaddleTest remain covered on Android via groupB; iOS coverage for those is left for a follow-up PR.

kinsta · 2026-05-11T11:31:01Z

Preview deployments for qvac-docs-staging ⚡️

Status	Branch preview	Commit preview
✅ Ready	Visit preview	Visit preview

Commit: 03e8346e9bf0296c6857a87150d15e8b2f72e982

Deployment ID: 3ea8dfd1-c5de-46e2-bce8-f262e05c30f3

Static site name: qvac-docs-staging-fazwv

The merge-resolution earlier kept our 3 image perf tests in iOS heavy3/7/8, displacing main's runGemma4Test/runOcrPaddleTest. The validator in scripts/generate-mobile-integration-tests.js requires every test to be assigned on both platforms — these two were missing on iOS. Placed them in lightB alongside runQwen35Test and runModelLoadingTest since they're model-loading tests with similar runtime characteristics.

Two complementary fixes for the Device Farm monitor loop in integration-mobile-test-llm-llamacpp.yml. 1. Match the initial credentials step's role-duration-seconds: 7200 on the 'Refresh AWS credentials for monitoring' step. Without it, the refresh defaults to the action's 1h fallback while MAX_WAIT_TIME is 2h, so every devicefarm API call past the 1h mark fails until timeout — producing the spinning 'Run N: () | AWS API attempt X/3 failed, retrying in 5s...' output for ~1h with no escape. 2. Make the poll loop self-bounding under persistent API failure: - Initialize per-slot STATUS/RESULT to PENDING/- so the dashboard never prints the empty 'Run N: ()' line. - On aws_retry returning empty (or 'None'), preserve last-known-good STATUS/RESULT for that slot and increment a per-slot FAIL counter. - After 10 consecutive failures (~5 min at the 30s cadence, well past any normal STS / Device Farm transient) mark that slot ERRORED / NO_API_RESPONSE / DONE so the loop exits cleanly instead of burning the full 2h ceiling. Final-results phase reports a real failure.

@gianni-cor

…arm slots Per @gianni-cor review on PR #1843, restore Gemma4 and OcrPaddle to isolated slots instead of folding them into lightB. The image perf tests keep their existing isolation (heavy3/heavy7/heavy8), so the iOS heavy bank grows from 8 → 10 (plus lightA/lightB at slots 11/12). - test-groups.json: add heavy9=runGemma4Test, heavy10=runOcrPaddleTest; drop both from lightB - integration-mobile-test-llm-llamacpp.yml: - filter_perf: extract heavy9 + heavy10 wrappers - upload spec: add testspec-heavy-9.yml + testspec-heavy-10.yml, renumber light specs to test_spec_arn_11 / test_spec_arn_12 - schedule: 12 maybe_schedule calls for iOS - monitor + log download: RUN_ARN_1..12 explicit assignments (the seq-based loops are already dynamic via $RUN_COUNT) Co-authored-by: Cursor <cursoragent@cursor.com>

iOS Device Farm runs (iPhone 16 Pro / 17) were hanging the on-PR job on `runImageHighResAuroraTest`. The 3000x4000 (~12 MP) JPEG pushed peak memory past the ~3.3 GB Jetsam ceiling once the device was warm from earlier jobs in the queue, the app was killed via memorystatus, and Appium never received a clean failure signal — leaving the Device Farm run in RUNNING (PENDING) until the 60-min job timeout. When `image.test.js` was split into per-image files in 37afd3d, the fruit-plate variant correctly received `iosWarmupImage` + `iosPerfRuns: 1`, but high-res aurora — the heaviest image in the suite — was left on the standard path. That meant CPU + GPU each ran 1 warmup + N counted with the full 12 MP JPEG, so on a warm iPhone the cold path crossed the Jetsam line. Mirror the fruit-plate config: - `iosWarmupImage: 'elephant.jpg'` pre-warms the multimodal pipeline with a 23 KB image so Metal shaders, KV cache, and image-prefill buffers are allocated before the 12 MP JPEG arrives. `_image-common.js` skips the standard PERF_WARMUP_RUNS loop once the pre-warmup runs. - `iosPerfRuns: 1` caps iOS counted iterations at 1 regardless of QVAC_PERF_RUNS, so the benchmark workflow's n=3 plan does not OOM the iPhone. iOS cold-path footprint is now exactly 2 inferences per backend (1 small pre-warmup + 1 counted real image), matching fruit-plate's known-good shape. Desktop and Android paths are unchanged. Co-authored-by: Cursor <cursoragent@cursor.com>

gianni-cor

Requesting changes for the two perf-report regressions noted inline.

gianni-cor

Approved after re-checking the latest changes.

gianni-cor · 2026-05-12T14:08:34Z

/review

tobi-legan added 26 commits April 24, 2026 13:23

Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…

7f17c52

…m-perf-metrics

Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…

4f23a05

…m-perf-metrics Made-with: Cursor # Conflicts: # packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration.auto.cjs # packages/qvac-lib-infer-llamacpp-llm/test/mobile/test-groups.json

Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…

fa2e76d

…m-perf-metrics Made-with: Cursor # Conflicts: # .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml # scripts/perf-report/utils.js

Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…

7cf5f34

…m-perf-metrics

tobi-legan requested review from a team as code owners April 30, 2026 19:58

tobi-legan had a problem deploying to release April 30, 2026 19:58 — with GitHub Actions Failure

tobi-legan had a problem deploying to release April 30, 2026 20:00 — with GitHub Actions Failure

tobi-legan had a problem deploying to release April 30, 2026 20:51 — with GitHub Actions Failure

tobi-legan temporarily deployed to release April 30, 2026 20:53 — with GitHub Actions Inactive

tobi-legan and others added 6 commits May 4, 2026 14:21

Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…

e0ac869

…m-perf-metrics Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml

Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…

c3db5cb

…m-perf-metrics Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml

Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…

e67fe3a

…m-perf-metrics

DmitryMalishev previously approved these changes May 5, 2026

View reviewed changes

tobi-legan and others added 2 commits May 6, 2026 19:24

Merge branch 'main' into feature-qvac-17830-vlm-perf-metrics

3102ddf

tamer-hassan-tether previously approved these changes May 8, 2026

View reviewed changes

github-code-quality Bot found potential problems May 11, 2026

View reviewed changes

Comment thread packages/llm-llamacpp/test/mobile/integration.auto.cjs

Comment thread packages/llm-llamacpp/test/mobile/integration.auto.cjs

Comment thread packages/llm-llamacpp/test/mobile/integration.auto.cjs

tobi-legan added 2 commits May 11, 2026 12:58

gianni-cor requested changes May 11, 2026

View reviewed changes

Comment thread packages/llm-llamacpp/test/mobile/test-groups.json

tobi-legan and others added 2 commits May 11, 2026 18:01

tamer-hassan-tether approved these changes May 11, 2026

View reviewed changes

gianni-cor previously requested changes May 12, 2026

View reviewed changes

gianni-cor approved these changes May 12, 2026

View reviewed changes

gianni-cor added 2 commits May 12, 2026 15:00

Merge branch 'main' into feature-qvac-17830-vlm-perf-metrics

03e8346

Merge branch 'main' into feature-qvac-17830-vlm-perf-metrics

66c24fa

tobi-legan mentioned this pull request May 14, 2026

QVAC-17830 infra: surface model name + GPU in LLM perf report + Device Farm queue/run visibility (QVAC-18943) #2072

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

QVAC-17830 feat: VLM performance metrics across desktop + mobile (CPU & GPU)#1843

QVAC-17830 feat: VLM performance metrics across desktop + mobile (CPU & GPU)#1843
gianni-cor merged 43 commits into
mainfrom
feature-qvac-17830-vlm-perf-metrics

tobi-legan commented Apr 30, 2026 •

edited

Loading

Uh oh!

kinsta Bot commented May 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gianni-cor left a comment

Uh oh!

gianni-cor left a comment

Uh oh!

gianni-cor commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

tobi-legan commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🎯 What problem does this PR solve?

📝 How does it solve it?

🧪 How was it tested?

Uh oh!

kinsta Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Preview deployments for qvac-docs-staging ⚡️

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gianni-cor left a comment

Choose a reason for hiding this comment

Uh oh!

gianni-cor left a comment

Choose a reason for hiding this comment

Uh oh!

gianni-cor commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

tobi-legan commented Apr 30, 2026 •

edited

Loading

kinsta Bot commented May 11, 2026 •

edited

Loading