QVAC-17830 feat: VLM performance metrics across desktop + mobile (CPU & GPU)#1843
Merged
Conversation
Adds CPU + GPU VLM performance reporting to the llamacpp-llm addon on
desktop + mobile (iOS / Android Device Farm), mirroring the OCR perf
pipeline.
Perf reporter / aggregator
- scripts/test-utils/performance-reporter.js: extend `vision` addon
type with backend, platform, prefill, decode, vision_encode,
image_prefill and status columns.
- scripts/perf-report/utils.js: teach the aggregator the new VLM
metric labels so markdown/HTML reports render them.
CI wiring
- integration-test-qvac-lib-infer-llamacpp-llm.yml (desktop): add
aggregate + upload-artifact steps so CPU/GPU perf artifacts are
produced alongside the existing test run.
- integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml (mobile):
scan Device Farm logs for [PERF_REPORT_START]...[PERF_REPORT_END]
markers, run extract-from-log.js + aggregate.js, upload per-platform
perf artifacts. Mirrors the OCR mobile workflow pattern.
Test refactor (image.test.js)
- Include image name in the perf label ("[elephant] [CPU]" etc.) so
aggregate.js (which groups by `result.test`) no longer pools
different images into a single CPU/GPU cell.
- Run 1 warmup + 3 counted inferences per (image x backend) so the
aggregator produces count=3, mean, std within a single CI run --
matches OCR's PERF_RUNS=3 convention.
- Reuse the loaded LlmLlamacpp instance across iterations inside a
single test() block: deliberate divergence from OCR's one-test()-
per-iteration shape because VLM model load is multi-GB and already
stresses iOS Device Farm memory.
- Bump per-test timeout to 25 min for the multi-iter perf tests;
leave TEST_CONSTANTS.timeout unchanged for other tests in the file.
- Keyword assertions run once against the last counted iteration's
output.
vision_encode_time_ms / image_prefill_time_ms are stubbed as null
pending native RuntimeStats instrumentation (separate follow-up task).
Mirrors the OCR desktop/mobile split. Adds a "Vision/LLM (Mobile)"
block to the scheduled cron job so Monday 9am UTC runs aggregate
perf artifacts from Mobile Integration Tests (LLM) into a separate
llamacpp-llm-mobile-performance.{md,json,html} report alongside the
existing desktop one. Also adds "Mobile Integration Tests (LLM)" to
the workflow_dispatch workflow_name choice list so it can be
triggered manually.
Without this, the per-run Android/iOS perf artifacts the mobile
workflow now uploads would sit unused by the weekly cross-run
aggregator.
Mirrors the `combine-reports` job in the OCR mobile workflow. Runs after the per-platform matrix completes and downloads every artifact matching `perf-report-llamacpp-llm-*-<run_number>` -- which, because the desktop (integration-test-*.yml) and mobile (this file) workflows are both invoked via workflow_call from the umbrella On PR Trigger (LLM), share the same github.run_number and therefore live together in the single run's artifact scope. The job then: - aggregates all desktop + mobile rows into a single combined HTML, Markdown and JSON report via scripts/perf-report/aggregate.js; - patches the `device.name` for desktop artifacts using the artifact suffix (e.g. "linux-x64", "macos-arm64") so they render with recognisable column headers next to the mobile device names; - emits a per-device HTML report for each unique device encountered (desktop OS/arch variants + each mobile device); - writes the combined markdown report into $GITHUB_STEP_SUMMARY so the PR run page shows one unified perf table covering every platform the addon ran on in that run. This is what the user actually meant by "joint performance reporter" -- distinct from the weekly/cross-run aggregation in perf-report.yml, which aggregates historical runs for trend tracking.
…er device
Android perf data was never surfacing because the inline fallback
reporter used on mobile was a no-op stub (no record/writeToConsole),
so logcat never saw PERF_REPORT markers. iOS image tests were also
crashing mid-run (Jetsam/memorystatus) before process.on('exit')
could flush, losing whichever runs had already completed.
- Replace the mobile fallback reporter in _image-common.js with a
functional inline implementation that mirrors OCR's: record(),
writeReport(), and writeToConsole() with lightweight snapshots
and logcat-safe chunking.
- Flush incrementally after every record() on mobile so earlier
iterations survive a SIGKILL later in the same group.
- Split image.test.js into per-image files (elephant, fruit plate,
high-res aurora) behind a shared _image-common.js helper, and
wire each into its own Device Farm group (Android: 3 new groups;
iOS: heavy3=elephant, heavy7=fruit, heavy8=aurora).
- Teach extract-from-log.js to --merge multiple reports per device
(needed because each physical device now runs three groups) and
add --device-from-filename as a fallback when logs land in a
flat layout.
- Update the mobile workflow to schedule 5 Android / 10 iOS runs,
bucket downloaded artifacts by device, unzip Customer_Artifacts
inline so bare_console.log is scannable, and invoke
extract-from-log.js with --merge.
Verified locally with a happy-path sim (3 samples per image×backend
per device) and a Jetsam crash sim (aurora SIGKILLed mid-GPU on
iPhone 16 Pro Max) — 16 of 18 rows still reach the aggregated
report on the crashed device, and the other three groups report
fully.
Made-with: Cursor
Fruit plate iOS test crashed inside Builtin_JsonStringify because the per-test flush stringified the cumulative results array on every record() call (3+ runs x 2 backends x verbose VLM output over a 10MB image + Metal compiler memory). Even with logcat chunking, the JSON build itself exhausted V8's Zone allocator, producing SIGTRAP from FatalProcessOutOfMemory before any data reached logcat. Changes in the mobile fallback reporter (_image-common.js): - Cap stored output text to 400 chars per record so the accumulated results array stays small regardless of model verbosity. - Add delta mode to writeToConsole: emit only the latest row per flush, so stringify stays O(1) in iteration count on mobile. - Stop calling writeReport() on disk per-record; rewriting the full JSON every iteration is both expensive and an OOM hazard. One final writeReport runs in the exit hook for the on-device artifact. - Remove the final cumulative writeToConsole in the exit hook: the per-test delta emits already carry every row, and emitting a large cumulative payload at exit is exactly when we most need the earlier deltas to survive in logcat/syslog. Changes in extract-from-log.js: - Split extractFromText into extractAllFromText (returns every report parsed from the text) + the legacy extractFromText wrapper that returns the last one. - Add opts.all to extractFromFile so --merge gathers every delta emit from a log file, not just the largest or last marker pair. - main() with --merge now pushes all reports per file into the device bucket before handing them to mergeDeviceReports, so partial data from a pre-crash run still lands in the aggregate. Verified locally with a delta-emit simulation that mid-test crashes leave runs 1..N-1 intact in the reconstructed report. Made-with: Cursor
…ct collision) Two pre-existing bugs caused the combined performance table to drop data on a PR run: 1. `darwin-x64` (and any other slow desktop job) was missing because `combine-reports` lived inside the mobile workflow and only waited on the mobile matrix. Desktop jobs that finished AFTER mobile were silently excluded. Move the job to the umbrella (`on-pr-qvac-lib-infer-llamacpp-llm.yml`) as `combine-perf-reports` with `needs: [run-integration-tests, run-mobile-integration-tests]`, so it blocks on the full fan-out before scanning artifacts. 2. `linux-x64` GPU data (and the ubuntu-22 vs ubuntu-24 arm64 pair) was overwritten because every matrix entry uploaded its artifact under `perf-report-llamacpp-llm-<platform>-<arch>-<run>`; any two entries sharing platform+arch clobbered each other on download. Add a unique `label` field to each matrix entry (`linux-x64-cpu`, `linux-x64-gpu`, `linux-arm64-u22`, `linux-arm64-u24`, ...) and key the upload + device-name patch on `matrix.label` instead. Made-with: Cursor
…flush logs on crash Three changes scoped to the mobile LLM integration workflow: * Android: merge the three per-image Device Farm groups back into a single groupImagesPerf run. Android does not have the iOS jetsam pressure, so running all three image tests in one invocation is both cheaper and simpler while still giving the aggregator 9 rows (3 images x 3 iters) per device. * iOS: schedule Heavy7 (fruit plate) a second time as Heavy7-ImgFruitPlate-Retry. Both attempts land in the same per-device folder on download, and extract-from-log.js --merge dedupes on the metric fingerprint, so if either attempt gets past iOS jetsam we have perf data. If both succeed we just get more iterations to average over, which is strictly better. run_count goes 10 -> 11 and the Monitor / Download loops were extended accordingly. * Crash path (both platforms): route process.exit(1) in checkAppCrash and crashMonitor through a new global.flushLogsAndExit helper that (a) sleeps 5s so bare stdout + iOS os_log / Android logcat can drain and (b) on iOS performs an inline Appium pull_file of bare_console.log to $DEVICEFARM_LOG_DIR before exiting. This means delta perf emits that hit disk before a jetsam kill now reach the aggregator instead of being stuck in the app sandbox. Made-with: Cursor
Previous commit pushed the `Create and Upload Test Spec` run: block
over GitHub Actions' ~30K template-expression budget, so the umbrella
workflow refused to dispatch with:
Line 927, Col 14: Exceeded max expression length 21000
Root cause: `global.flushLogsAndExit` was inlined into BOTH Android
and iOS wdio configs. On iOS it duplicated the exact Appium pull_file
logic the `after` hook already has, and on Android it carried dead
iOS-only code — roughly 1.8 kB of repeated JS across the two configs.
Instead:
* Drop `global.flushLogsAndExit` entirely.
* In `checkAppCrash` and `crashMonitor` (both Android + iOS), replace
`process.exit(1)` with `setTimeout(function(){process.exit(1);},5000)`.
Same 5s drain window for logcat / os_log / bare stdout, one line of
JS, zero helper definition.
* iOS `after` hook still pulls `bare_console.log` via Appium on the
normal path, so delta perf data from successful runs is unchanged.
Net `run:` block size goes from 31,524 → 28,992 chars (below the
previously-working 28,995). Both wdio configs revalidated with
`node --check` and the full workflow YAML round-trips through PyYAML.
Made-with: Cursor
…tail tables - Add iOS-only tiny-image (elephant.jpg) warmup to fruit plate test so Metal shaders / KV cache / image-prefill buffer are allocated with a cheap 23 KB image before the 10 MB PNG hits, keeping us under Jetsam's per-process memory cap. Opt-in via testCase.iosWarmupImage so desktop and Android paths remain byte-for-byte identical. - Fold linux-x64-cpu / linux-x64-gpu (and linux-arm64-u22 / u24) matrix legs into single "linux-x64" / "linux-arm64" columns in the combined report's Fix desktop device names step. Previously the summary showed four separate columns with "-" GPU rows that looked like missing data even though the sibling leg had populated them. - Preserve categorical metrics (backend, platform, status, execution_provider) through aggregateReports and render per-device detail tables for every device — mobile included — in the combined markdown summary via a new --device-details flag on aggregate.js. Mobile devices now get the same twelve-column breakdown (Test, EP, Backend, Platform, Total / Prefill / Decode / Vision Enc / Img Prefill / TTFT ms, Gen Tokens, Prompt Tokens, TPS, Status) that desktop matrix legs emit via writeStepSummary. Verified end-to-end with the local simulation at /tmp/vlm-combined-sim/ (27/27 assertions pass). Made-with: Cursor
Two fixes after run 24909705757 landed iOS perf data for elephant + aurora but still lost fruit plate, and the combined HTML report was missing the row-per-test detail tables that the step-summary markdown has. 1) iOS-only `iosPerfRuns` override, set to 1 for fruit plate alone. With the elephant pre-warmup in place, the fruit-plate test still crashed at t+85s (V8 bad_alloc during JSON.stringify) on the 2nd/3rd counted iteration, leaving zero rows in the perf artifact for both iPhone 16 Pro and iPhone 17. Dropping iOS fruit plate to 1 counted iteration cuts the cold-path peak footprint from 4 inferences to 2 (warmup + 1 counted), which stays under the 3376 MB Jetsam cap in local simulation. Desktop + Android still run the full PERF_RUNS=3. 2) Wire `generateDeviceDetailTables` into `generateHtmlReport` so the combined HTML report mirrors the markdown step-summary layout — one section per device with Backend / Platform / Total Time / Prefill / Decode / Vision Enc / Img Prefill / TTFT / Gen Tokens / Prompt Tokens / TPS / Status columns. Previously only the markdown had those columns, so mobile devices (which can't write to GITHUB_STEP_SUMMARY from Device Farm) appeared blank in the HTML artifact. The per-device HTML report step also gets the flags so each individual HTML has parity. Verified locally against the run-24909705757 artifact set: all 9 devices render the full row- per-test summary, including Apple iPhone 16 Pro / iPhone 17 with backend=metal and platform=ios-arm64. Made-with: Cursor
Five user-requested follow-ups after run 24913460964: 1) iOS heavy7 is now a "warm process" group. test-groups.json puts runApiBehaviorTest before runImageFruitPlateTest so V8 / Metal / native allocators are pre-warmed by a cheap test before the 10 MB fruit-plate PNG hits the multimodal pipeline. 2) The iOS Heavy7-ImgFruitPlate-Retry one-shot retry run (RUN_ARN_11) is removed from the mobile workflow. Run count goes from 11 → 10; monitoring + log-download loop bounds updated; the inline comment that justified the retry is replaced with one explaining the new warm-process rationale. 3) JS-only crash-path bare_console.log pull on iOS. The existing pull_file logic that used to live inline in `after` is now a global.flushBareLog helper installed in `before`, called from global.checkAppCrash and the setInterval crash monitor (after a 1.5 s browser.pause to give stdout a chance to flush) BEFORE the 5 s setTimeout(process.exit). Net effect: crashed runs that previously left a 0-byte LOG artifact now have a real bare_console.log on disk for triage. No C++ touched. 4) `aggregateReports` now dedupes sibling matrix legs that get folded into one device name. Previously linux-x64 / [elephant] [CPU] accumulated 6 iterations (3 from linux-x64-cpu + 3 from the GPU leg's CPU rows) which surfaced as "Run 4 / 5 / 6" headers in the per-test breakdown tables. The new rule: per (device, test, run_number), only the FIRST report's rows are kept — subsequent sibling-leg reports for the same pair are skipped. Within a single report all PERF_RUNS iterations still flow through (so counted runs per device stays at 3). The weekly aggregator is unaffected because its inputs span different run_numbers. Verified in /tmp/vlm-combined-sim/sim.js with synthetic 3-iter reports per matrix leg; linux-x64 / [elephant] [CPU] now reports `count: 3` and the markdown header reads "Iterations: 3". 5) Per-device detail-table headings use the device name as the anchor instead of a repeated "### Performance: <addon>". GitHub collapses duplicate heading slugs in step summaries which made it impossible to deep-link or scroll to the mobile detail tables — they were rendered, just hidden under a single anchor. Each device now gets `### <Device> — <addon> (<platform/arch>)` so Android / iOS sections appear in the step-summary TOC. Made-with: Cursor
The previous warm-process seed (runApiBehaviorTest) used a text-only Qwen3-0.6B model, so it never exercised the mmproj load or the Metal vision shader JIT that fruit-plate then hit cold. iPhone 17 kept SIGABRT'ing between the LLM context init and the first inference during exactly those steps. Swap heavy7 to [runImageElephantTest, runImageFruitPlateTest] so the small (23 KB) elephant test pre-warms the exact VLM code path — SmolVLM2 + mmproj + Metal vision graph + KV cache — before the 10 MB fruit-plate PNG lands.
The previous attempt put runImageElephantTest as a "warm process"
prefix to heavy7. That actually executes the full elephant perf
cycle — 1 warmup + 3 counted on CPU, same on GPU, plus a media-
path test = 9 elephant inferences and 2 SmolVLM2 model loads —
before runImageFruitPlateTest gets to do anything. Net effect on
heavy7 was 12 inferences pre-fruit-plate-row, which regressed
iPhone 16 Pro (had been passing) and iPhone 17 still died after
the inline elephant warmup completed.
New shape:
- test-groups.json: iOS heavy7 = ["runImageFruitPlateTest"]
(cold process again — runImageElephantTest stays in heavy3 as
its own perf slot, untouched).
- _image-common.js: when iOS pre-warmup (testCase.iosWarmupImage)
succeeds, set an iosPreWarmupRan flag and skip the standard
PERF_WARMUP_RUNS loop. The pre-warmup IS the warmup pass.
- image-fruit-plate.test.js: same iosWarmupImage='elephant.jpg'
+ iosPerfRuns=1 config; comments rewritten to describe the new
flow.
- mobile workflow: heavy7 comment block updated to match.
End-to-end iOS shape for fruit-plate is now exactly what was
asked for:
1. setupMultimodalInference loads SmolVLM2 cold
2. 1 elephant inference (inline iosWarmupImage, no perf record)
3. standard warmup loop SKIPPED
4. 1 fruit-plate inference (counted, perf recorded)
= 2 inferences total
Desktop and Android paths are completely unchanged — they still
run 1 warmup + PERF_RUNS=3 counted. iOS tests without an
iosWarmupImage (elephant, high-res aurora) keep their existing
1 warmup + iosPerfRuns/PERF_RUNS counted shape.
Made-with: Cursor
…erf for bitnet/tool-calling
- extract _perf-helper.js out of _image-common.js so bitnet,
tool-calling and image tests share one perf-recording path
- add `scenario` field on perf records ('image' / 'bitnet' /
'tool-calling' / 'default'), plumb through aggregator + reports.
Per-device detail tables (md + html) now group rows by scenario
so a single combined report can surface multiple test families
side by side instead of mixing them into one giant table
- drop the always-'ok' Status column from the vision metric layout
(just noise; the JS layer no longer hardcodes status: 'ok')
- squash the combined GH step summary to three Mean ± std
mini-tables (Total Time / TTFT / TPS), grouped by scenario. The
HTML artifact still carries the full per-iteration breakdown
- split aggregate.js --device-details into --md-device-details and
--html-device-details (legacy --device-details still toggles both)
- wire bitnet.test.js (1 warmup + 1 counted) and tool-calling.test.js
(one perf row per model_variant x prompt cell) onto _perf-helper.js
- add desktop GPU probe in detectDevice(): nvidia-smi -L on
linux/win32, lspci as linux fallback, system_profiler on darwin,
wmic on win32. New device.gpu field is surfaced in the per-device
detail-table headings (md + html). Mobile leaves gpu=null —
device.name (Device Farm label) already implies the chipset
- add scripts/perf-report/sim-llamacpp-llm.js for local report-shape
verification mirroring the umbrella workflow flags
Made-with: Cursor
…ells QVAC-17830. The squashed Mean ± std mini-tables stay at the top of the combined PR summary, but the per-device detail tables (Test / EP / Backend / Platform / Total / Prefill / Decode / Vision Enc / Img Prefill / TTFT / Gen Tokens / Prompt Tokens / TPS) are now appended underneath instead of being HTML-only. Numeric cells in the detail tables render as `mean ±std` so the iteration spread is visible at every metric, not just the rolled-up ones; token-count columns stay as bare integers since deterministic generation makes `±0` noise. The umbrella workflow flips from `--html-device-details` back to `--device-details` to apply this in both the markdown and HTML outputs. Made-with: Cursor
QVAC-17830. Per-prompt image-prefill time turned out to overlap with the prefill_time_ms / TTFT signal we already report, so the column was essentially redundant. Vision Enc (ms) — the mmproj / vision- encoder time — is the only VLM-specific timer worth keeping for now; its native wiring is tracked separately under https://app.asana.com/1/45238840754660/project/1212638335655990/task/1214371583877702. Removed everywhere it surfaced: - scripts/test-utils/performance-reporter.js (vision METRIC_COLUMNS + record() default metrics) - scripts/perf-report/utils.js (label map + vision detail columns) - packages/qvac-lib-infer-llamacpp-llm/test/integration/_perf-helper.js (mobile inline-fallback default metrics + recordPerformance payload) - scripts/perf-report/sim-llamacpp-llm.js (synthetic row generators) Made-with: Cursor
…m-perf-metrics Made-with: Cursor # Conflicts: # packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration.auto.cjs # packages/qvac-lib-infer-llamacpp-llm/test/mobile/test-groups.json
…m-perf-metrics Made-with: Cursor # Conflicts: # .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml # scripts/perf-report/utils.js
… + legend) The cross-platform mean tables previously emitted ALL device columns for every scenario block, so e.g. the bitnet block (Android-only by design — `skip: !isAndroid` in bitnet.test.js) showed 7 empty desktop/iOS columns of dashes, which made the report look broken when it was actually intentional. - Per-scenario × per-metric column filtering: drop columns that have zero data in this scenario block. Bitnet collapses from 9 → 2 columns (Pixel 9 Pro + S25 Ultra). Tool-calling drops `darwin-x64` (test is `skip: isDarwinX64`). - Short legend right under the report title explaining the dashes that remain (gated tests like medgemma-4b-it = desktop-only, mobile = qwen3-1.7b only; `darwin-x64` / `linux-arm64` have no GPU runner). Per-device detail tables are unchanged — they were already scenario- partitioned so the layout fix doesn't apply there. Made-with: Cursor
…LLM) workflow_dispatch New perf policy agreed on Slack (2026-04-30, @Olya / @gianfranco): PR runs default to 1 warmup + 1 counted iteration (cheap, single- shot signal). The dedicated `Benchmark Performance (LLM)` workflow is the only place we crank iterations up to produce mean ± std. Changes: - `_image-common.js` and `bitnet.test.js`: replace hardcoded `PERF_RUNS=3 / PERF_WARMUP_RUNS=1` with `_envInt('QVAC_PERF_RUNS', 1)` / `_envInt('QVAC_PERF_WARMUP_RUNS', 1)`. Bad / empty / negative env values fall back to the safe default of 1+1. - iOS fruit-plate's `iosPerfRuns: 1` override is preserved — it caps iPhone iterations regardless of `QVAC_PERF_RUNS` so the benchmark workflow doesn't OOM the Jetsam ceiling on heavy images. Comments updated to reflect the new default. - `integration-test-qvac-lib-infer-llamacpp-llm.yml` accepts new `qvac_perf_runs` / `qvac_perf_warmup_runs` workflow_call inputs (default empty ⇒ unset env ⇒ test honours PR default of 1+1). The umbrella PR workflow doesn't pass these, so the PR critical path is unaffected. - `benchmark-llamacpp-llm.yml` (new): workflow_dispatch only. Calls the desktop integration-test reusable with `qvac_perf_runs=3`, aggregates a benchmark report, uploads HTML artifact + GitHub step summary. Phase-1 scope: desktop matrix only. Mobile (Android / iOS Device Farm) needs a build-time env-var injection hook in the test app — tracked as a QVAC-18111 follow-up. Made-with: Cursor
… runner The `linux-x64-cpu` matrix leg (ubuntu-22.04, no_gpu='true') runs on a CPU-only runner — llama.cpp falls back to CPU since no Vulkan device is present — but `tool-calling.test.js` previously hardcoded `useCpu = isLinuxArm64`, so its perf rows came out tagged `[GPU]`. That made the combined report show GPU bars on a CPU runner for tool-calling rows, which read as inconsistent next to the [CPU] rows from `image` and `bitnet`. Match the `_image-common.js` pattern: read `NO_GPU` from both `process.env` and `os.getEnv()`, treat any case-insensitive "true" as CPU. `useCpu` is now `isLinuxArm64 || noGpu`, so the `linux-x64-cpu` leg labels its rows `[CPU]` correctly while the `linux-x64-gpu` leg keeps `[GPU]`. Made-with: Cursor
…cross-platform tables Two follow-ups based on review feedback on the combined report: - Trim the verbose "intentional gaps" legend down to a single short italic line (`> _` `-` ` = not run on this device._`). The per-scenario column filter already drops fully-empty columns, so the long explanation was no longer pulling its weight. - Surface the same metrics the per-device detail tables already show (Total Time, Prefill, Decode, Vision Encode, TTFT, TPS) in the cross-platform mini-tables, instead of just Total/TTFT/TPS. Reviewer needs the full breakdown to compare backends, not just the headline. `vision_encode_time_ms` is currently null everywhere (native timer pending — Asana 1214371583877702) so its mini-table is auto-suppressed by the existing `hasAnyData` guard until the timer lands; the entry is in the list ahead of time so the table appears automatically once data flows. Made-with: Cursor
Bare runtime doesn't define `process` as a global at module-init time, so referencing `process.env[key]` from the top-level `_envInt(...)` call in `bitnet.test.js` blew up the entire integration suite with `ReferenceError: process is not defined` on first eval. Switch the env reads to `bare-os`'s `os.getEnv()` (already imported) as the primary source, and keep a `typeof process !== 'undefined'`- guarded `process.env` fallback for any Node code paths that happen to import these files. Apply the same pattern in: - `_image-common.js` — for both `_envInt` and the existing `noGpuEnv` lookup which had the same latent bug. - `bitnet.test.js` — for `_envInt`. - `tool-calling.test.js` — for the new `noGpuEnv` lookup added in the linux-x64-cpu label fix. Verified locally: behaves correctly when `process` is undefined, when `process.env[key]` is unset, and when only one of the two sources has the value. Made-with: Cursor
… addon convention
Rename `.github/workflows/benchmark-llamacpp-llm.yml` →
`benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` to match the
existing `benchmark-performance-qvac-lib-infer-<addon>.yml` pattern
on main (Parakeet, Whispercpp, Supertonic TTS, etc.).
Also align the structure: introduce a `context` job that derives
`repository` / `ref` from optional inputs (with fallback to
`github.repository` / `github.ref_name`), wire prebuild + desktop
integration tests through that, and add a `summarize` job whose
artifact follows the convention `<addon>-performance-findings.{md,
html,json}`. The new `qvac_perf_runs` / `qvac_perf_warmup_runs`
inputs (defaulting to 3 / 1) are preserved — this is still the
single place we crank perf iterations up to produce mean ± std
numbers, and the umbrella PR workflow continues to leave them at
the cheap default of 1 + 1.
Workflow_dispatch only — explicitly NOT wired into the PR umbrella,
matching the policy agreed on Slack (2026-04-30, @Olya / @gianfranco).
Made-with: Cursor
Picks up the QVAC-18111 scaffold PRs that landed on main: - #1839: Benchmark Performance (LLM) workflow_dispatch + qvac_perf_runs/ qvac_perf_warmup_runs plumbing on integration-test-...yml - #1840: bridge of those inputs into integration-mobile-test-...yml via pushFile of qvacPerfConfig.txt onto the device Conflict resolution: - benchmark-performance-...yml: take main entirely (canonical, has run_desktop+run_mobile toggles + mobile-benchmarks job) - integration-test-...yml: keep this branch's HTML perf report generation step + per-runner perf-report-llamacpp-llm-* upload, drop the chatty inline comments around qvac_perf_runs (config is now self-documenting in the workflow surface) - integration-mobile-test-...yml: take main's WDIO_CONFIG for Android (has the __QVAC_PERF_RUNS__/qvacPerfConfig.txt pushFile bridge) Made-with: Cursor
…time Mobile side of QVAC-18111: now that the umbrella benchmark workflow pushes qvacPerfConfig.txt onto the device (PR #1840), the bare runtime needs to actually consume it. Without this, dispatched perf overrides were silently ignored on Android / iOS and tests defaulted to 1+1. Changes: - packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration-runtime.cjs: read qvacPerfConfig.txt from the same paths the testFilter loader uses (/data/local/tmp on Android, global.testDir on iOS), parse KEY=VALUE per line, inject each into bare-os via os.setEnv. Reads on first __shouldRunTest call, which fires before runIntegrationModule imports any test module — so os.getEnv() inside _image-common.js / bitnet.test.js / tool-calling.test.js sees the dispatched value at module init. Normalises both literal '\\n' (the bash-escaped form the WDIO before-hook produces today) and real '\\r?\\n' so the parser doesn't depend on which encoding the workflow happens to use. - .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml: extract the two giant Android / iOS WDIO_CONFIG JS strings out of the 'Create and Upload Test Spec' run: block into env: vars. GitHub Actions imposes a hard 21 000-char limit on any run: block that contains ${{ }} expressions, and the new Android image-perf group + iOS heavy7/heavy8 splits + collapsed-format echo line for the test split summary pushed us over. Moving the WDIO blobs to env: drops the run block from ~21 KB to ~11 KB so we have headroom for further additions. Made-with: Cursor
…sts only Adds qvac_perf_only input to the desktop and mobile reusable workflows and turns it on from Benchmark Performance (LLM) so dispatched runs only execute the perf-emitting tests instead of the full integration suite. Desktop: regenerates test/integration/all.js from bitnet, tool-calling and the three image-* tests, then runs bare against it. Mobile: filters each group's run*Test list down to perf-emitting wrappers, skips groups whose intersection is empty (no spec uploaded, no Device Farm run booked), and tolerates gaps in the RUN_ARN_* slot sequence when monitoring / downloading logs. Co-authored-by: Cursor <cursoragent@cursor.com>
…m-perf-metrics Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml
…m-perf-metrics Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml
DmitryMalishev
previously approved these changes
May 5, 2026
…L 735/736 The `combine-perf-reports` job runs in `pull_request_target` context and was checking out the PR head's `scripts/perf-report` so it could aggregate the matrix artifacts. CodeQL flagged this as a cache-poisoning / poisonable-step issue (alerts 735/736) because we were executing PR-author code in a privileged context. Removing the `repository:` / `ref:` overrides on the checkout makes `actions/checkout` fall back to the base-branch SHA in `pull_request_target` runs, so `aggregate.js` is now trusted code from the default branch. `workflow_dispatch` runs are unaffected because `github.ref` resolves to the dispatched branch. Existing mitigations stay in place: `contents: read` only, no `actions/cache`, SHA-pinned actions, and the strict `[A-Za-z0-9-]` allowlist + 64-char cap on `device.name` before it touches any filesystem path. SECURITY comments updated to reflect the new posture. Co-authored-by: Cursor <cursoragent@cursor.com>
tamer-hassan-tether
previously approved these changes
May 8, 2026
Resolved against the package-rename PR (#1860) on main: - packages/qvac-lib-infer-llamacpp-llm -> packages/llm-llamacpp - workflows renamed (e.g. on-pr-qvac-lib-infer-llamacpp-llm.yml -> on-pr-llm-llamacpp.yml) Conflicts resolved: - Moved our 5 new image perf test files (image-elephant, image-fruit-plate, image-high-res-aurora, _image-common, _perf-helper) under the new packages/llm-llamacpp/test/integration/ path. - Honored our deletion of image.test.js (split into 3 perf tests). - test-groups.json: kept our iOS slot mapping for the 3 image perf tests (heavy3/heavy7/heavy8) and our Android groupImagesPerf; took main's lightA/ lightB/groupA/groupB additions for runQwen35Test and the afriquegemma/moe drops. main's runGemma4Test/runOcrPaddleTest remain covered on Android via groupB; iOS coverage for those is left for a follow-up PR.
Preview deployments for qvac-docs-staging ⚡️
Commit: Deployment ID: Static site name: |
The merge-resolution earlier kept our 3 image perf tests in iOS heavy3/7/8, displacing main's runGemma4Test/runOcrPaddleTest. The validator in scripts/generate-mobile-integration-tests.js requires every test to be assigned on both platforms — these two were missing on iOS. Placed them in lightB alongside runQwen35Test and runModelLoadingTest since they're model-loading tests with similar runtime characteristics.
Two complementary fixes for the Device Farm monitor loop in
integration-mobile-test-llm-llamacpp.yml.
1. Match the initial credentials step's role-duration-seconds: 7200 on
the 'Refresh AWS credentials for monitoring' step. Without it, the
refresh defaults to the action's 1h fallback while MAX_WAIT_TIME is
2h, so every devicefarm API call past the 1h mark fails until
timeout — producing the spinning
'Run N: () | AWS API attempt X/3 failed, retrying in 5s...'
output for ~1h with no escape.
2. Make the poll loop self-bounding under persistent API failure:
- Initialize per-slot STATUS/RESULT to PENDING/- so the dashboard
never prints the empty 'Run N: ()' line.
- On aws_retry returning empty (or 'None'), preserve last-known-good
STATUS/RESULT for that slot and increment a per-slot FAIL counter.
- After 10 consecutive failures (~5 min at the 30s cadence, well past
any normal STS / Device Farm transient) mark that slot ERRORED /
NO_API_RESPONSE / DONE so the loop exits cleanly instead of burning
the full 2h ceiling. Final-results phase reports a real failure.
gianni-cor
requested changes
May 11, 2026
…arm slots Per @gianni-cor review on PR #1843, restore Gemma4 and OcrPaddle to isolated slots instead of folding them into lightB. The image perf tests keep their existing isolation (heavy3/heavy7/heavy8), so the iOS heavy bank grows from 8 → 10 (plus lightA/lightB at slots 11/12). - test-groups.json: add heavy9=runGemma4Test, heavy10=runOcrPaddleTest; drop both from lightB - integration-mobile-test-llm-llamacpp.yml: - filter_perf: extract heavy9 + heavy10 wrappers - upload spec: add testspec-heavy-9.yml + testspec-heavy-10.yml, renumber light specs to test_spec_arn_11 / test_spec_arn_12 - schedule: 12 maybe_schedule calls for iOS - monitor + log download: RUN_ARN_1..12 explicit assignments (the seq-based loops are already dynamic via $RUN_COUNT) Co-authored-by: Cursor <cursoragent@cursor.com>
iOS Device Farm runs (iPhone 16 Pro / 17) were hanging the on-PR job on `runImageHighResAuroraTest`. The 3000x4000 (~12 MP) JPEG pushed peak memory past the ~3.3 GB Jetsam ceiling once the device was warm from earlier jobs in the queue, the app was killed via memorystatus, and Appium never received a clean failure signal — leaving the Device Farm run in RUNNING (PENDING) until the 60-min job timeout. When `image.test.js` was split into per-image files in 37afd3d, the fruit-plate variant correctly received `iosWarmupImage` + `iosPerfRuns: 1`, but high-res aurora — the heaviest image in the suite — was left on the standard path. That meant CPU + GPU each ran 1 warmup + N counted with the full 12 MP JPEG, so on a warm iPhone the cold path crossed the Jetsam line. Mirror the fruit-plate config: - `iosWarmupImage: 'elephant.jpg'` pre-warms the multimodal pipeline with a 23 KB image so Metal shaders, KV cache, and image-prefill buffers are allocated before the 12 MP JPEG arrives. `_image-common.js` skips the standard PERF_WARMUP_RUNS loop once the pre-warmup runs. - `iosPerfRuns: 1` caps iOS counted iterations at 1 regardless of QVAC_PERF_RUNS, so the benchmark workflow's n=3 plan does not OOM the iPhone. iOS cold-path footprint is now exactly 2 inferences per backend (1 small pre-warmup + 1 counted real image), matching fruit-plate's known-good shape. Desktop and Android paths are unchanged. Co-authored-by: Cursor <cursoragent@cursor.com>
tamer-hassan-tether
approved these changes
May 11, 2026
gianni-cor
previously requested changes
May 12, 2026
gianni-cor
left a comment
Contributor
There was a problem hiding this comment.
Requesting changes for the two perf-report regressions noted inline.
gianni-cor
approved these changes
May 12, 2026
gianni-cor
left a comment
Contributor
There was a problem hiding this comment.
Approved after re-checking the latest changes.
Contributor
|
/review |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🎯 What problem does this PR solve?
📝 How does it solve it?
total_time_ms,prefill_time_ms,decode_time_ms,vision_encode_time_ms(stub, native wiring tracked under QVAC-18103),ttft_ms,prompt_tokens,generated_tokens,tps, plusscenario,device.gpu(probed per-platform),backend,platform. Shared singleton reporter extracted into_perf-helper.js;_image-common.jsconsumes it.QVAC_PERF_RUNS/QVAC_PERF_WARMUP_RUNSread throughbare-os.getEnv()(with aprocess.envfallback for Node code paths). Defaults to 1 warmup + 1 counted so PR runs stay cheap. The dedicatedBenchmark Performance (LLM)workflow_dispatch(scaffolded by QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch #1839 + mobile-bridged by QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile #1840, both merged) cranks it up.qvacPerfConfig.txtto the device; the bare runtime inintegration-runtime.cjsreads it on first__shouldRunTestcall and injectsQVAC_PERF_RUNS/QVAC_PERF_WARMUP_RUNSviaos.setEnv()before any test module imports.qvac_perf_onlyinput). When dispatched, desktop regeneratestest/integration/all.jsfrom only the 5 perf-emitting tests (bitnet,tool-calling,image-elephant,image-fruit-plate,image-high-res-aurora). Mobile filters each Device Farm group's wrapper list down to perf-emitting tests, skips groups whose intersection is empty, and tolerates gaps in the slot sequence. Cuts dispatched runs from ~full integration suite down to just the perf path.scripts/perf-report/aggregate.js: combined HTML artifact (HTML-Report-All-Platforms-*) + per-device HTML (HTML-Reports-Per-Device-*) + GitHub step-summary markdown with squashed mean ± std mini-tables forTotal Time / Prefill / Decode / Vision Enc / TTFT / TPS, grouped byscenario(image / tool-calling / bitnet), with per-scenario column filtering (e.g. bitnet only shows the 2 Android columns) and a short "-= not run on this device" legend.groupImagesPerf) so all three images run in a single invocation.performance-reporter.jspopulatesdevice.gpu(nvidia-smi / lspci on Linux,system_profileron macOS,wmicon Windows) so detail tables showApple M3 / GeForce RTX 4070 / Adreno 750instead ofunknown.combine-perf-reportsjob (regex-injection inextract-from-log.js, two cache-poisoning paths in the umbrella workflow). Strict[A-Za-z0-9-]allowlist on PR-controlleddevice.namebefore it touches the filesystem; trust-model documented inline.actions/upload-artifactreferences SHA-pinned to match the repo convention.🧪 How was it tested?
cpp-lintformatting noise +linux-x64exit 139 after tests pass). Verified in artifacts:perf-report-llamacpp-llm-Android-2427— 1×bitnet + 2×tool-calling + 6×image rows on Pixel 9 Pro / S25 Ultra.perf-report-llamacpp-llm-iOS-2427— image + tool-calling rows on iPhone 16 Pro / 17 (bitnet correctly skipped).linux-x64-cpu/gpu,linux-arm64-u22/u24,darwin-arm64/x64,win32-x64).Mean Total / Prefill / Decode / TTFT / TPS+ per-device detail tables.Benchmark Performance (LLM)dispatch: Android ran only the 3 perf groups (vs full integration suite); iOS skipped Heavy1/4/5/6/LightB groups whose perf-intersection was empty.os.getEnv()) verified locally — previousprocess.envlookup threwReferenceError: process is not definedon Bare module-init; the guardedos.getEnv()path passes.aggregate.jsregen against the run artifacts reproduces the same combined markdown / HTML output the workflow uploads.