Skip to content

QVAC-17830 feat: VLM performance metrics across desktop + mobile (CPU & GPU)#1843

Merged
gianni-cor merged 43 commits into
mainfrom
feature-qvac-17830-vlm-perf-metrics
May 12, 2026
Merged

QVAC-17830 feat: VLM performance metrics across desktop + mobile (CPU & GPU)#1843
gianni-cor merged 43 commits into
mainfrom
feature-qvac-17830-vlm-perf-metrics

Conversation

@tobi-legan

@tobi-legan tobi-legan commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

🎯 What problem does this PR solve?

  • QVAC-17830: VLM inference has no per-platform performance signal in CI. We can't tell when a code change tanks vision-encode / prefill / decode / TTFT / TPS for image, tool-calling, or bitnet workloads on Linux / macOS / Windows / Android / iOS, let alone CPU vs GPU on the same device.
  • Heavy-iteration benchmarking (mean ± std) on every PR would 2–3× our CI cost, but a single counted run per PR is enough to flag regressions, provided the same harness can be re-dispatched with bigger iteration counts on demand.

📝 How does it solve it?

  • Per-test perf records (image, bitnet, tool-calling) recording total_time_ms, prefill_time_ms, decode_time_ms, vision_encode_time_ms (stub, native wiring tracked under QVAC-18103), ttft_ms, prompt_tokens, generated_tokens, tps, plus scenario, device.gpu (probed per-platform), backend, platform. Shared singleton reporter extracted into _perf-helper.js; _image-common.js consumes it.
  • Configurable iterations via QVAC_PERF_RUNS / QVAC_PERF_WARMUP_RUNS read through bare-os.getEnv() (with a process.env fallback for Node code paths). Defaults to 1 warmup + 1 counted so PR runs stay cheap. The dedicated Benchmark Performance (LLM) workflow_dispatch (scaffolded by QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch #1839 + mobile-bridged by QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile #1840, both merged) cranks it up.
  • Mobile bridge: the umbrella benchmark workflow pushes qvacPerfConfig.txt to the device; the bare runtime in integration-runtime.cjs reads it on first __shouldRunTest call and injects QVAC_PERF_RUNS / QVAC_PERF_WARMUP_RUNS via os.setEnv() before any test module imports.
  • Perf-only filter on the benchmark workflow (qvac_perf_only input). When dispatched, desktop regenerates test/integration/all.js from only the 5 perf-emitting tests (bitnet, tool-calling, image-elephant, image-fruit-plate, image-high-res-aurora). Mobile filters each Device Farm group's wrapper list down to perf-emitting tests, skips groups whose intersection is empty, and tolerates gaps in the slot sequence. Cuts dispatched runs from ~full integration suite down to just the perf path.
  • Aggregated reporting in scripts/perf-report/aggregate.js: combined HTML artifact (HTML-Report-All-Platforms-*) + per-device HTML (HTML-Reports-Per-Device-*) + GitHub step-summary markdown with squashed mean ± std mini-tables for Total Time / Prefill / Decode / Vision Enc / TTFT / TPS, grouped by scenario (image / tool-calling / bitnet), with per-scenario column filtering (e.g. bitnet only shows the 2 Android columns) and a short "- = not run on this device" legend.
  • Mobile stability: iOS fruit-plate (heavy7) is a cold heavy group with 1 inline elephant warmup + 1 counted run (was OOM-crashing under Jetsam). Android image tests consolidated into one Device Farm group (groupImagesPerf) so all three images run in a single invocation.
  • GPU probe in performance-reporter.js populates device.gpu (nvidia-smi / lspci on Linux, system_profiler on macOS, wmic on Windows) so detail tables show Apple M3 / GeForce RTX 4070 / Adreno 750 instead of unknown.
  • CodeQL: 3 alerts addressed on the combine-perf-reports job (regex-injection in extract-from-log.js, two cache-poisoning paths in the umbrella workflow). Strict [A-Za-z0-9-] allowlist on PR-controlled device.name before it touches the filesystem; trust-model documented inline. actions/upload-artifact references SHA-pinned to match the repo convention.

🧪 How was it tested?

  • CI run 25181286617 on this branch — full umbrella green except two known external flakes (cpp-lint formatting noise + linux-x64 exit 139 after tests pass). Verified in artifacts:
    • perf-report-llamacpp-llm-Android-2427 — 1×bitnet + 2×tool-calling + 6×image rows on Pixel 9 Pro / S25 Ultra.
    • perf-report-llamacpp-llm-iOS-2427 — image + tool-calling rows on iPhone 16 Pro / 17 (bitnet correctly skipped).
    • 7 desktop perf JSONs (linux-x64-cpu/gpu, linux-arm64-u22/u24, darwin-arm64/x64, win32-x64).
    • Combined markdown summary contains every scenario block under Mean Total / Prefill / Decode / TTFT / TPS + per-device detail tables.
  • Perf-only filter end-to-end validated on a Benchmark Performance (LLM) dispatch: Android ran only the 3 perf groups (vs full integration suite); iOS skipped Heavy1/4/5/6/LightB groups whose perf-intersection was empty.
  • Bare runtime fix (env via os.getEnv()) verified locally — previous process.env lookup threw ReferenceError: process is not defined on Bare module-init; the guarded os.getEnv() path passes.
  • Local aggregate.js regen against the run artifacts reproduces the same combined markdown / HTML output the workflow uploads.
  • iOS fruit-plate stability: 5 consecutive Device Farm runs without OOM after switching to the cold-group + inline-warmup pattern.

Test-only PR. No addon runtime / native code changed; package.json version intentionally unchanged.

Adds CPU + GPU VLM performance reporting to the llamacpp-llm addon on
desktop + mobile (iOS / Android Device Farm), mirroring the OCR perf
pipeline.

Perf reporter / aggregator
- scripts/test-utils/performance-reporter.js: extend `vision` addon
  type with backend, platform, prefill, decode, vision_encode,
  image_prefill and status columns.
- scripts/perf-report/utils.js: teach the aggregator the new VLM
  metric labels so markdown/HTML reports render them.

CI wiring
- integration-test-qvac-lib-infer-llamacpp-llm.yml (desktop): add
  aggregate + upload-artifact steps so CPU/GPU perf artifacts are
  produced alongside the existing test run.
- integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml (mobile):
  scan Device Farm logs for [PERF_REPORT_START]...[PERF_REPORT_END]
  markers, run extract-from-log.js + aggregate.js, upload per-platform
  perf artifacts. Mirrors the OCR mobile workflow pattern.

Test refactor (image.test.js)
- Include image name in the perf label ("[elephant] [CPU]" etc.) so
  aggregate.js (which groups by `result.test`) no longer pools
  different images into a single CPU/GPU cell.
- Run 1 warmup + 3 counted inferences per (image x backend) so the
  aggregator produces count=3, mean, std within a single CI run --
  matches OCR's PERF_RUNS=3 convention.
- Reuse the loaded LlmLlamacpp instance across iterations inside a
  single test() block: deliberate divergence from OCR's one-test()-
  per-iteration shape because VLM model load is multi-GB and already
  stresses iOS Device Farm memory.
- Bump per-test timeout to 25 min for the multi-iter perf tests;
  leave TEST_CONSTANTS.timeout unchanged for other tests in the file.
- Keyword assertions run once against the last counted iteration's
  output.

vision_encode_time_ms / image_prefill_time_ms are stubbed as null
pending native RuntimeStats instrumentation (separate follow-up task).
Mirrors the OCR desktop/mobile split. Adds a "Vision/LLM (Mobile)"
block to the scheduled cron job so Monday 9am UTC runs aggregate
perf artifacts from Mobile Integration Tests (LLM) into a separate
llamacpp-llm-mobile-performance.{md,json,html} report alongside the
existing desktop one. Also adds "Mobile Integration Tests (LLM)" to
the workflow_dispatch workflow_name choice list so it can be
triggered manually.

Without this, the per-run Android/iOS perf artifacts the mobile
workflow now uploads would sit unused by the weekly cross-run
aggregator.
Mirrors the `combine-reports` job in the OCR mobile workflow. Runs
after the per-platform matrix completes and downloads every artifact
matching `perf-report-llamacpp-llm-*-<run_number>` -- which, because
the desktop (integration-test-*.yml) and mobile (this file) workflows
are both invoked via workflow_call from the umbrella On PR Trigger
(LLM), share the same github.run_number and therefore live together
in the single run's artifact scope. The job then:

- aggregates all desktop + mobile rows into a single combined HTML,
  Markdown and JSON report via scripts/perf-report/aggregate.js;
- patches the `device.name` for desktop artifacts using the artifact
  suffix (e.g. "linux-x64", "macos-arm64") so they render with
  recognisable column headers next to the mobile device names;
- emits a per-device HTML report for each unique device encountered
  (desktop OS/arch variants + each mobile device);
- writes the combined markdown report into $GITHUB_STEP_SUMMARY so
  the PR run page shows one unified perf table covering every
  platform the addon ran on in that run.

This is what the user actually meant by "joint performance reporter"
-- distinct from the weekly/cross-run aggregation in perf-report.yml,
which aggregates historical runs for trend tracking.
…er device

Android perf data was never surfacing because the inline fallback
reporter used on mobile was a no-op stub (no record/writeToConsole),
so logcat never saw PERF_REPORT markers. iOS image tests were also
crashing mid-run (Jetsam/memorystatus) before process.on('exit')
could flush, losing whichever runs had already completed.

- Replace the mobile fallback reporter in _image-common.js with a
  functional inline implementation that mirrors OCR's: record(),
  writeReport(), and writeToConsole() with lightweight snapshots
  and logcat-safe chunking.
- Flush incrementally after every record() on mobile so earlier
  iterations survive a SIGKILL later in the same group.
- Split image.test.js into per-image files (elephant, fruit plate,
  high-res aurora) behind a shared _image-common.js helper, and
  wire each into its own Device Farm group (Android: 3 new groups;
  iOS: heavy3=elephant, heavy7=fruit, heavy8=aurora).
- Teach extract-from-log.js to --merge multiple reports per device
  (needed because each physical device now runs three groups) and
  add --device-from-filename as a fallback when logs land in a
  flat layout.
- Update the mobile workflow to schedule 5 Android / 10 iOS runs,
  bucket downloaded artifacts by device, unzip Customer_Artifacts
  inline so bare_console.log is scannable, and invoke
  extract-from-log.js with --merge.

Verified locally with a happy-path sim (3 samples per image×backend
per device) and a Jetsam crash sim (aurora SIGKILLed mid-GPU on
iPhone 16 Pro Max) — 16 of 18 rows still reach the aggregated
report on the crashed device, and the other three groups report
fully.

Made-with: Cursor
Fruit plate iOS test crashed inside Builtin_JsonStringify because the
per-test flush stringified the cumulative results array on every
record() call (3+ runs x 2 backends x verbose VLM output over a 10MB
image + Metal compiler memory). Even with logcat chunking, the JSON
build itself exhausted V8's Zone allocator, producing SIGTRAP from
FatalProcessOutOfMemory before any data reached logcat.

Changes in the mobile fallback reporter (_image-common.js):
- Cap stored output text to 400 chars per record so the accumulated
  results array stays small regardless of model verbosity.
- Add delta mode to writeToConsole: emit only the latest row per
  flush, so stringify stays O(1) in iteration count on mobile.
- Stop calling writeReport() on disk per-record; rewriting the full
  JSON every iteration is both expensive and an OOM hazard. One
  final writeReport runs in the exit hook for the on-device artifact.
- Remove the final cumulative writeToConsole in the exit hook: the
  per-test delta emits already carry every row, and emitting a large
  cumulative payload at exit is exactly when we most need the earlier
  deltas to survive in logcat/syslog.

Changes in extract-from-log.js:
- Split extractFromText into extractAllFromText (returns every report
  parsed from the text) + the legacy extractFromText wrapper that
  returns the last one.
- Add opts.all to extractFromFile so --merge gathers every delta
  emit from a log file, not just the largest or last marker pair.
- main() with --merge now pushes all reports per file into the
  device bucket before handing them to mergeDeviceReports, so
  partial data from a pre-crash run still lands in the aggregate.

Verified locally with a delta-emit simulation that mid-test crashes
leave runs 1..N-1 intact in the reconstructed report.

Made-with: Cursor
…ct collision)

Two pre-existing bugs caused the combined performance table to drop
data on a PR run:

1. `darwin-x64` (and any other slow desktop job) was missing because
   `combine-reports` lived inside the mobile workflow and only waited
   on the mobile matrix. Desktop jobs that finished AFTER mobile were
   silently excluded. Move the job to the umbrella
   (`on-pr-qvac-lib-infer-llamacpp-llm.yml`) as `combine-perf-reports`
   with `needs: [run-integration-tests, run-mobile-integration-tests]`,
   so it blocks on the full fan-out before scanning artifacts.

2. `linux-x64` GPU data (and the ubuntu-22 vs ubuntu-24 arm64 pair)
   was overwritten because every matrix entry uploaded its artifact
   under `perf-report-llamacpp-llm-<platform>-<arch>-<run>`; any two
   entries sharing platform+arch clobbered each other on download.
   Add a unique `label` field to each matrix entry
   (`linux-x64-cpu`, `linux-x64-gpu`, `linux-arm64-u22`,
   `linux-arm64-u24`, ...) and key the upload + device-name patch on
   `matrix.label` instead.

Made-with: Cursor
…flush logs on crash

Three changes scoped to the mobile LLM integration workflow:

* Android: merge the three per-image Device Farm groups back into a single
  groupImagesPerf run. Android does not have the iOS jetsam pressure, so
  running all three image tests in one invocation is both cheaper and
  simpler while still giving the aggregator 9 rows (3 images x 3 iters)
  per device.

* iOS: schedule Heavy7 (fruit plate) a second time as
  Heavy7-ImgFruitPlate-Retry. Both attempts land in the same per-device
  folder on download, and extract-from-log.js --merge dedupes on the
  metric fingerprint, so if either attempt gets past iOS jetsam we have
  perf data. If both succeed we just get more iterations to average
  over, which is strictly better. run_count goes 10 -> 11 and the
  Monitor / Download loops were extended accordingly.

* Crash path (both platforms): route process.exit(1) in checkAppCrash
  and crashMonitor through a new global.flushLogsAndExit helper that
  (a) sleeps 5s so bare stdout + iOS os_log / Android logcat can drain
  and (b) on iOS performs an inline Appium pull_file of
  bare_console.log to $DEVICEFARM_LOG_DIR before exiting. This means
  delta perf emits that hit disk before a jetsam kill now reach the
  aggregator instead of being stuck in the app sandbox.

Made-with: Cursor
Previous commit pushed the `Create and Upload Test Spec` run: block
over GitHub Actions' ~30K template-expression budget, so the umbrella
workflow refused to dispatch with:
    Line 927, Col 14: Exceeded max expression length 21000

Root cause: `global.flushLogsAndExit` was inlined into BOTH Android
and iOS wdio configs. On iOS it duplicated the exact Appium pull_file
logic the `after` hook already has, and on Android it carried dead
iOS-only code — roughly 1.8 kB of repeated JS across the two configs.

Instead:

* Drop `global.flushLogsAndExit` entirely.
* In `checkAppCrash` and `crashMonitor` (both Android + iOS), replace
  `process.exit(1)` with `setTimeout(function(){process.exit(1);},5000)`.
  Same 5s drain window for logcat / os_log / bare stdout, one line of
  JS, zero helper definition.
* iOS `after` hook still pulls `bare_console.log` via Appium on the
  normal path, so delta perf data from successful runs is unchanged.

Net `run:` block size goes from 31,524 → 28,992 chars (below the
previously-working 28,995). Both wdio configs revalidated with
`node --check` and the full workflow YAML round-trips through PyYAML.

Made-with: Cursor
…tail tables

- Add iOS-only tiny-image (elephant.jpg) warmup to fruit plate test so
  Metal shaders / KV cache / image-prefill buffer are allocated with a
  cheap 23 KB image before the 10 MB PNG hits, keeping us under Jetsam's
  per-process memory cap. Opt-in via testCase.iosWarmupImage so desktop
  and Android paths remain byte-for-byte identical.
- Fold linux-x64-cpu / linux-x64-gpu (and linux-arm64-u22 / u24) matrix
  legs into single "linux-x64" / "linux-arm64" columns in the combined
  report's Fix desktop device names step. Previously the summary showed
  four separate columns with "-" GPU rows that looked like missing data
  even though the sibling leg had populated them.
- Preserve categorical metrics (backend, platform, status,
  execution_provider) through aggregateReports and render per-device
  detail tables for every device — mobile included — in the combined
  markdown summary via a new --device-details flag on aggregate.js.
  Mobile devices now get the same twelve-column breakdown (Test, EP,
  Backend, Platform, Total / Prefill / Decode / Vision Enc / Img
  Prefill / TTFT ms, Gen Tokens, Prompt Tokens, TPS, Status) that
  desktop matrix legs emit via writeStepSummary.

Verified end-to-end with the local simulation at /tmp/vlm-combined-sim/
(27/27 assertions pass).

Made-with: Cursor
Two fixes after run 24909705757 landed iOS perf data for elephant +
aurora but still lost fruit plate, and the combined HTML report was
missing the row-per-test detail tables that the step-summary markdown
has.

1) iOS-only `iosPerfRuns` override, set to 1 for fruit plate alone.
   With the elephant pre-warmup in place, the fruit-plate test still
   crashed at t+85s (V8 bad_alloc during JSON.stringify) on the 2nd/3rd
   counted iteration, leaving zero rows in the perf artifact for both
   iPhone 16 Pro and iPhone 17. Dropping iOS fruit plate to 1 counted
   iteration cuts the cold-path peak footprint from 4 inferences to 2
   (warmup + 1 counted), which stays under the 3376 MB Jetsam cap in
   local simulation. Desktop + Android still run the full PERF_RUNS=3.

2) Wire `generateDeviceDetailTables` into `generateHtmlReport` so the
   combined HTML report mirrors the markdown step-summary layout —
   one section per device with Backend / Platform / Total Time /
   Prefill / Decode / Vision Enc / Img Prefill / TTFT / Gen Tokens /
   Prompt Tokens / TPS / Status columns. Previously only the markdown
   had those columns, so mobile devices (which can't write to
   GITHUB_STEP_SUMMARY from Device Farm) appeared blank in the HTML
   artifact. The per-device HTML report step also gets the flags so
   each individual HTML has parity. Verified locally against the
   run-24909705757 artifact set: all 9 devices render the full row-
   per-test summary, including Apple iPhone 16 Pro / iPhone 17 with
   backend=metal and platform=ios-arm64.

Made-with: Cursor
Five user-requested follow-ups after run 24913460964:

1) iOS heavy7 is now a "warm process" group. test-groups.json puts
   runApiBehaviorTest before runImageFruitPlateTest so V8 / Metal /
   native allocators are pre-warmed by a cheap test before the
   10 MB fruit-plate PNG hits the multimodal pipeline.

2) The iOS Heavy7-ImgFruitPlate-Retry one-shot retry run (RUN_ARN_11)
   is removed from the mobile workflow. Run count goes from 11 → 10;
   monitoring + log-download loop bounds updated; the inline comment
   that justified the retry is replaced with one explaining the new
   warm-process rationale.

3) JS-only crash-path bare_console.log pull on iOS. The existing
   pull_file logic that used to live inline in `after` is now a
   global.flushBareLog helper installed in `before`, called from
   global.checkAppCrash and the setInterval crash monitor (after a
   1.5 s browser.pause to give stdout a chance to flush) BEFORE the
   5 s setTimeout(process.exit). Net effect: crashed runs that
   previously left a 0-byte LOG artifact now have a real
   bare_console.log on disk for triage. No C++ touched.

4) `aggregateReports` now dedupes sibling matrix legs that get folded
   into one device name. Previously linux-x64 / [elephant] [CPU]
   accumulated 6 iterations (3 from linux-x64-cpu + 3 from the GPU
   leg's CPU rows) which surfaced as "Run 4 / 5 / 6" headers in the
   per-test breakdown tables. The new rule: per (device, test,
   run_number), only the FIRST report's rows are kept — subsequent
   sibling-leg reports for the same pair are skipped. Within a
   single report all PERF_RUNS iterations still flow through (so
   counted runs per device stays at 3). The weekly aggregator is
   unaffected because its inputs span different run_numbers.

   Verified in /tmp/vlm-combined-sim/sim.js with synthetic 3-iter
   reports per matrix leg; linux-x64 / [elephant] [CPU] now reports
   `count: 3` and the markdown header reads "Iterations: 3".

5) Per-device detail-table headings use the device name as the
   anchor instead of a repeated "### Performance: <addon>". GitHub
   collapses duplicate heading slugs in step summaries which made
   it impossible to deep-link or scroll to the mobile detail
   tables — they were rendered, just hidden under a single anchor.
   Each device now gets `### <Device> — <addon> (<platform/arch>)`
   so Android / iOS sections appear in the step-summary TOC.

Made-with: Cursor
The previous warm-process seed (runApiBehaviorTest) used a text-only
Qwen3-0.6B model, so it never exercised the mmproj load or the Metal
vision shader JIT that fruit-plate then hit cold. iPhone 17 kept
SIGABRT'ing between the LLM context init and the first inference
during exactly those steps.

Swap heavy7 to [runImageElephantTest, runImageFruitPlateTest] so the
small (23 KB) elephant test pre-warms the exact VLM code path —
SmolVLM2 + mmproj + Metal vision graph + KV cache — before the
10 MB fruit-plate PNG lands.
The previous attempt put runImageElephantTest as a "warm process"
prefix to heavy7. That actually executes the full elephant perf
cycle — 1 warmup + 3 counted on CPU, same on GPU, plus a media-
path test = 9 elephant inferences and 2 SmolVLM2 model loads —
before runImageFruitPlateTest gets to do anything. Net effect on
heavy7 was 12 inferences pre-fruit-plate-row, which regressed
iPhone 16 Pro (had been passing) and iPhone 17 still died after
the inline elephant warmup completed.

New shape:
  - test-groups.json: iOS heavy7 = ["runImageFruitPlateTest"]
    (cold process again — runImageElephantTest stays in heavy3 as
    its own perf slot, untouched).
  - _image-common.js: when iOS pre-warmup (testCase.iosWarmupImage)
    succeeds, set an iosPreWarmupRan flag and skip the standard
    PERF_WARMUP_RUNS loop. The pre-warmup IS the warmup pass.
  - image-fruit-plate.test.js: same iosWarmupImage='elephant.jpg'
    + iosPerfRuns=1 config; comments rewritten to describe the new
    flow.
  - mobile workflow: heavy7 comment block updated to match.

End-to-end iOS shape for fruit-plate is now exactly what was
asked for:
  1. setupMultimodalInference loads SmolVLM2 cold
  2. 1 elephant inference (inline iosWarmupImage, no perf record)
  3. standard warmup loop SKIPPED
  4. 1 fruit-plate inference (counted, perf recorded)
  = 2 inferences total

Desktop and Android paths are completely unchanged — they still
run 1 warmup + PERF_RUNS=3 counted. iOS tests without an
iosWarmupImage (elephant, high-res aurora) keep their existing
1 warmup + iosPerfRuns/PERF_RUNS counted shape.

Made-with: Cursor
…erf for bitnet/tool-calling

- extract _perf-helper.js out of _image-common.js so bitnet,
  tool-calling and image tests share one perf-recording path
- add `scenario` field on perf records ('image' / 'bitnet' /
  'tool-calling' / 'default'), plumb through aggregator + reports.
  Per-device detail tables (md + html) now group rows by scenario
  so a single combined report can surface multiple test families
  side by side instead of mixing them into one giant table
- drop the always-'ok' Status column from the vision metric layout
  (just noise; the JS layer no longer hardcodes status: 'ok')
- squash the combined GH step summary to three Mean ± std
  mini-tables (Total Time / TTFT / TPS), grouped by scenario. The
  HTML artifact still carries the full per-iteration breakdown
- split aggregate.js --device-details into --md-device-details and
  --html-device-details (legacy --device-details still toggles both)
- wire bitnet.test.js (1 warmup + 1 counted) and tool-calling.test.js
  (one perf row per model_variant x prompt cell) onto _perf-helper.js
- add desktop GPU probe in detectDevice(): nvidia-smi -L on
  linux/win32, lspci as linux fallback, system_profiler on darwin,
  wmic on win32. New device.gpu field is surfaced in the per-device
  detail-table headings (md + html). Mobile leaves gpu=null —
  device.name (Device Farm label) already implies the chipset
- add scripts/perf-report/sim-llamacpp-llm.js for local report-shape
  verification mirroring the umbrella workflow flags

Made-with: Cursor
…ells

QVAC-17830. The squashed Mean ± std mini-tables stay at the top of the
combined PR summary, but the per-device detail tables (Test / EP /
Backend / Platform / Total / Prefill / Decode / Vision Enc / Img
Prefill / TTFT / Gen Tokens / Prompt Tokens / TPS) are now appended
underneath instead of being HTML-only. Numeric cells in the detail
tables render as `mean ±std` so the iteration spread is visible at
every metric, not just the rolled-up ones; token-count columns stay
as bare integers since deterministic generation makes `±0` noise.
The umbrella workflow flips from `--html-device-details` back to
`--device-details` to apply this in both the markdown and HTML outputs.

Made-with: Cursor
QVAC-17830. Per-prompt image-prefill time turned out to overlap with
the prefill_time_ms / TTFT signal we already report, so the column
was essentially redundant. Vision Enc (ms) — the mmproj / vision-
encoder time — is the only VLM-specific timer worth keeping for now;
its native wiring is tracked separately under
https://app.asana.com/1/45238840754660/project/1212638335655990/task/1214371583877702.

Removed everywhere it surfaced:
- scripts/test-utils/performance-reporter.js (vision METRIC_COLUMNS
  + record() default metrics)
- scripts/perf-report/utils.js (label map + vision detail columns)
- packages/qvac-lib-infer-llamacpp-llm/test/integration/_perf-helper.js
  (mobile inline-fallback default metrics + recordPerformance payload)
- scripts/perf-report/sim-llamacpp-llm.js (synthetic row generators)

Made-with: Cursor
…m-perf-metrics

Made-with: Cursor

# Conflicts:
#	packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration.auto.cjs
#	packages/qvac-lib-infer-llamacpp-llm/test/mobile/test-groups.json
…m-perf-metrics

Made-with: Cursor

# Conflicts:
#	.github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml
#	scripts/perf-report/utils.js
… + legend)

The cross-platform mean tables previously emitted ALL device columns
for every scenario block, so e.g. the bitnet block (Android-only by
design — `skip: !isAndroid` in bitnet.test.js) showed 7 empty
desktop/iOS columns of dashes, which made the report look broken
when it was actually intentional.

- Per-scenario × per-metric column filtering: drop columns that have
  zero data in this scenario block. Bitnet collapses from 9 → 2
  columns (Pixel 9 Pro + S25 Ultra). Tool-calling drops `darwin-x64`
  (test is `skip: isDarwinX64`).
- Short legend right under the report title explaining the dashes
  that remain (gated tests like medgemma-4b-it = desktop-only, mobile
  = qwen3-1.7b only; `darwin-x64` / `linux-arm64` have no GPU runner).

Per-device detail tables are unchanged — they were already scenario-
partitioned so the layout fix doesn't apply there.

Made-with: Cursor
…LLM) workflow_dispatch

New perf policy agreed on Slack (2026-04-30, @Olya / @gianfranco):
PR runs default to 1 warmup + 1 counted iteration (cheap, single-
shot signal). The dedicated `Benchmark Performance (LLM)` workflow
is the only place we crank iterations up to produce mean ± std.

Changes:

- `_image-common.js` and `bitnet.test.js`: replace hardcoded
  `PERF_RUNS=3 / PERF_WARMUP_RUNS=1` with `_envInt('QVAC_PERF_RUNS',
  1)` / `_envInt('QVAC_PERF_WARMUP_RUNS', 1)`. Bad / empty / negative
  env values fall back to the safe default of 1+1.
- iOS fruit-plate's `iosPerfRuns: 1` override is preserved — it caps
  iPhone iterations regardless of `QVAC_PERF_RUNS` so the benchmark
  workflow doesn't OOM the Jetsam ceiling on heavy images. Comments
  updated to reflect the new default.
- `integration-test-qvac-lib-infer-llamacpp-llm.yml` accepts new
  `qvac_perf_runs` / `qvac_perf_warmup_runs` workflow_call inputs
  (default empty ⇒ unset env ⇒ test honours PR default of 1+1). The
  umbrella PR workflow doesn't pass these, so the PR critical path
  is unaffected.
- `benchmark-llamacpp-llm.yml` (new): workflow_dispatch only. Calls
  the desktop integration-test reusable with `qvac_perf_runs=3`,
  aggregates a benchmark report, uploads HTML artifact + GitHub
  step summary.

Phase-1 scope: desktop matrix only. Mobile (Android / iOS Device
Farm) needs a build-time env-var injection hook in the test app —
tracked as a QVAC-18111 follow-up.

Made-with: Cursor
… runner

The `linux-x64-cpu` matrix leg (ubuntu-22.04, no_gpu='true') runs
on a CPU-only runner — llama.cpp falls back to CPU since no Vulkan
device is present — but `tool-calling.test.js` previously hardcoded
`useCpu = isLinuxArm64`, so its perf rows came out tagged `[GPU]`.
That made the combined report show GPU bars on a CPU runner for
tool-calling rows, which read as inconsistent next to the [CPU]
rows from `image` and `bitnet`.

Match the `_image-common.js` pattern: read `NO_GPU` from both
`process.env` and `os.getEnv()`, treat any case-insensitive
"true" as CPU. `useCpu` is now `isLinuxArm64 || noGpu`, so the
`linux-x64-cpu` leg labels its rows `[CPU]` correctly while the
`linux-x64-gpu` leg keeps `[GPU]`.

Made-with: Cursor
…cross-platform tables

Two follow-ups based on review feedback on the combined report:

- Trim the verbose "intentional gaps" legend down to a single short
  italic line (`> _` `-` ` = not run on this device._`). The
  per-scenario column filter already drops fully-empty columns, so
  the long explanation was no longer pulling its weight.

- Surface the same metrics the per-device detail tables already
  show (Total Time, Prefill, Decode, Vision Encode, TTFT, TPS)
  in the cross-platform mini-tables, instead of just Total/TTFT/TPS.
  Reviewer needs the full breakdown to compare backends, not just
  the headline. `vision_encode_time_ms` is currently null everywhere
  (native timer pending — Asana 1214371583877702) so its mini-table
  is auto-suppressed by the existing `hasAnyData` guard until the
  timer lands; the entry is in the list ahead of time so the table
  appears automatically once data flows.

Made-with: Cursor
Bare runtime doesn't define `process` as a global at module-init
time, so referencing `process.env[key]` from the top-level
`_envInt(...)` call in `bitnet.test.js` blew up the entire
integration suite with `ReferenceError: process is not defined`
on first eval.

Switch the env reads to `bare-os`'s `os.getEnv()` (already imported)
as the primary source, and keep a `typeof process !== 'undefined'`-
guarded `process.env` fallback for any Node code paths that happen
to import these files. Apply the same pattern in:

- `_image-common.js` — for both `_envInt` and the existing
  `noGpuEnv` lookup which had the same latent bug.
- `bitnet.test.js` — for `_envInt`.
- `tool-calling.test.js` — for the new `noGpuEnv` lookup added in
  the linux-x64-cpu label fix.

Verified locally: behaves correctly when `process` is undefined,
when `process.env[key]` is unset, and when only one of the two
sources has the value.

Made-with: Cursor
… addon convention

Rename `.github/workflows/benchmark-llamacpp-llm.yml` →
`benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` to match the
existing `benchmark-performance-qvac-lib-infer-<addon>.yml` pattern
on main (Parakeet, Whispercpp, Supertonic TTS, etc.).

Also align the structure: introduce a `context` job that derives
`repository` / `ref` from optional inputs (with fallback to
`github.repository` / `github.ref_name`), wire prebuild + desktop
integration tests through that, and add a `summarize` job whose
artifact follows the convention `<addon>-performance-findings.{md,
html,json}`. The new `qvac_perf_runs` / `qvac_perf_warmup_runs`
inputs (defaulting to 3 / 1) are preserved — this is still the
single place we crank perf iterations up to produce mean ± std
numbers, and the umbrella PR workflow continues to leave them at
the cheap default of 1 + 1.

Workflow_dispatch only — explicitly NOT wired into the PR umbrella,
matching the policy agreed on Slack (2026-04-30, @Olya / @gianfranco).

Made-with: Cursor
tobi-legan and others added 6 commits May 4, 2026 14:21
Picks up the QVAC-18111 scaffold PRs that landed on main:
- #1839: Benchmark Performance (LLM) workflow_dispatch + qvac_perf_runs/
  qvac_perf_warmup_runs plumbing on integration-test-...yml
- #1840: bridge of those inputs into integration-mobile-test-...yml via
  pushFile of qvacPerfConfig.txt onto the device

Conflict resolution:
- benchmark-performance-...yml: take main entirely (canonical, has
  run_desktop+run_mobile toggles + mobile-benchmarks job)
- integration-test-...yml: keep this branch's HTML perf report
  generation step + per-runner perf-report-llamacpp-llm-* upload, drop
  the chatty inline comments around qvac_perf_runs (config is now
  self-documenting in the workflow surface)
- integration-mobile-test-...yml: take main's WDIO_CONFIG for Android
  (has the __QVAC_PERF_RUNS__/qvacPerfConfig.txt pushFile bridge)

Made-with: Cursor
…time

Mobile side of QVAC-18111: now that the umbrella benchmark workflow
pushes qvacPerfConfig.txt onto the device (PR #1840), the bare runtime
needs to actually consume it. Without this, dispatched perf overrides
were silently ignored on Android / iOS and tests defaulted to 1+1.

Changes:
- packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration-runtime.cjs:
  read qvacPerfConfig.txt from the same paths the testFilter loader
  uses (/data/local/tmp on Android, global.testDir on iOS), parse
  KEY=VALUE per line, inject each into bare-os via os.setEnv. Reads
  on first __shouldRunTest call, which fires before runIntegrationModule
  imports any test module — so os.getEnv() inside _image-common.js /
  bitnet.test.js / tool-calling.test.js sees the dispatched value at
  module init. Normalises both literal '\\n' (the bash-escaped form
  the WDIO before-hook produces today) and real '\\r?\\n' so the
  parser doesn't depend on which encoding the workflow happens to use.

- .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml:
  extract the two giant Android / iOS WDIO_CONFIG JS strings out of
  the 'Create and Upload Test Spec' run: block into env: vars. GitHub
  Actions imposes a hard 21 000-char limit on any run: block that
  contains ${{ }} expressions, and the new Android image-perf group
  + iOS heavy7/heavy8 splits + collapsed-format echo line for the test
  split summary pushed us over. Moving the WDIO blobs to env: drops
  the run block from ~21 KB to ~11 KB so we have headroom for further
  additions.

Made-with: Cursor
…sts only

Adds qvac_perf_only input to the desktop and mobile reusable workflows
and turns it on from Benchmark Performance (LLM) so dispatched runs
only execute the perf-emitting tests instead of the full integration
suite.

Desktop: regenerates test/integration/all.js from bitnet, tool-calling
and the three image-* tests, then runs bare against it.

Mobile: filters each group's run*Test list down to perf-emitting
wrappers, skips groups whose intersection is empty (no spec uploaded,
no Device Farm run booked), and tolerates gaps in the RUN_ARN_*
slot sequence when monitoring / downloading logs.

Co-authored-by: Cursor <cursoragent@cursor.com>
…m-perf-metrics

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	.github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml
…m-perf-metrics

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	.github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml
DmitryMalishev
DmitryMalishev previously approved these changes May 5, 2026
tobi-legan and others added 2 commits May 6, 2026 19:24
…L 735/736

The `combine-perf-reports` job runs in `pull_request_target` context and
was checking out the PR head's `scripts/perf-report` so it could
aggregate the matrix artifacts. CodeQL flagged this as a
cache-poisoning / poisonable-step issue (alerts 735/736) because we
were executing PR-author code in a privileged context.

Removing the `repository:` / `ref:` overrides on the checkout makes
`actions/checkout` fall back to the base-branch SHA in
`pull_request_target` runs, so `aggregate.js` is now trusted code from
the default branch. `workflow_dispatch` runs are unaffected because
`github.ref` resolves to the dispatched branch.

Existing mitigations stay in place: `contents: read` only, no
`actions/cache`, SHA-pinned actions, and the strict `[A-Za-z0-9-]`
allowlist + 64-char cap on `device.name` before it touches any
filesystem path. SECURITY comments updated to reflect the new posture.

Co-authored-by: Cursor <cursoragent@cursor.com>
Resolved against the package-rename PR (#1860) on main:
- packages/qvac-lib-infer-llamacpp-llm -> packages/llm-llamacpp
- workflows renamed (e.g. on-pr-qvac-lib-infer-llamacpp-llm.yml -> on-pr-llm-llamacpp.yml)

Conflicts resolved:
- Moved our 5 new image perf test files (image-elephant, image-fruit-plate,
  image-high-res-aurora, _image-common, _perf-helper) under the new
  packages/llm-llamacpp/test/integration/ path.
- Honored our deletion of image.test.js (split into 3 perf tests).
- test-groups.json: kept our iOS slot mapping for the 3 image perf tests
  (heavy3/heavy7/heavy8) and our Android groupImagesPerf; took main's lightA/
  lightB/groupA/groupB additions for runQwen35Test and the afriquegemma/moe
  drops. main's runGemma4Test/runOcrPaddleTest remain covered on Android
  via groupB; iOS coverage for those is left for a follow-up PR.
@kinsta

kinsta Bot commented May 11, 2026

Copy link
Copy Markdown

Preview deployments for qvac-docs-staging ⚡️

Status Branch preview Commit preview
✅ Ready Visit preview Visit preview

Commit: 03e8346e9bf0296c6857a87150d15e8b2f72e982

Deployment ID: 3ea8dfd1-c5de-46e2-bce8-f262e05c30f3

Static site name: qvac-docs-staging-fazwv

Comment thread packages/llm-llamacpp/test/mobile/integration.auto.cjs
Comment thread packages/llm-llamacpp/test/mobile/integration.auto.cjs
Comment thread packages/llm-llamacpp/test/mobile/integration.auto.cjs
The merge-resolution earlier kept our 3 image perf tests in iOS heavy3/7/8,
displacing main's runGemma4Test/runOcrPaddleTest. The validator in
scripts/generate-mobile-integration-tests.js requires every test to be
assigned on both platforms — these two were missing on iOS.

Placed them in lightB alongside runQwen35Test and runModelLoadingTest
since they're model-loading tests with similar runtime characteristics.
Two complementary fixes for the Device Farm monitor loop in
integration-mobile-test-llm-llamacpp.yml.

1. Match the initial credentials step's role-duration-seconds: 7200 on
   the 'Refresh AWS credentials for monitoring' step. Without it, the
   refresh defaults to the action's 1h fallback while MAX_WAIT_TIME is
   2h, so every devicefarm API call past the 1h mark fails until
   timeout — producing the spinning
     'Run N:  () | AWS API attempt X/3 failed, retrying in 5s...'
   output for ~1h with no escape.

2. Make the poll loop self-bounding under persistent API failure:
   - Initialize per-slot STATUS/RESULT to PENDING/- so the dashboard
     never prints the empty 'Run N:  ()' line.
   - On aws_retry returning empty (or 'None'), preserve last-known-good
     STATUS/RESULT for that slot and increment a per-slot FAIL counter.
   - After 10 consecutive failures (~5 min at the 30s cadence, well past
     any normal STS / Device Farm transient) mark that slot ERRORED /
     NO_API_RESPONSE / DONE so the loop exits cleanly instead of burning
     the full 2h ceiling. Final-results phase reports a real failure.
Comment thread packages/llm-llamacpp/test/mobile/test-groups.json
tobi-legan and others added 2 commits May 11, 2026 18:01
…arm slots

Per @gianni-cor review on PR #1843, restore Gemma4 and OcrPaddle to
isolated slots instead of folding them into lightB. The image perf
tests keep their existing isolation (heavy3/heavy7/heavy8), so the
iOS heavy bank grows from 8 → 10 (plus lightA/lightB at slots 11/12).

- test-groups.json: add heavy9=runGemma4Test, heavy10=runOcrPaddleTest;
  drop both from lightB
- integration-mobile-test-llm-llamacpp.yml:
  - filter_perf: extract heavy9 + heavy10 wrappers
  - upload spec: add testspec-heavy-9.yml + testspec-heavy-10.yml,
    renumber light specs to test_spec_arn_11 / test_spec_arn_12
  - schedule: 12 maybe_schedule calls for iOS
  - monitor + log download: RUN_ARN_1..12 explicit assignments
  (the seq-based loops are already dynamic via $RUN_COUNT)

Co-authored-by: Cursor <cursoragent@cursor.com>
iOS Device Farm runs (iPhone 16 Pro / 17) were hanging the on-PR job
on `runImageHighResAuroraTest`. The 3000x4000 (~12 MP) JPEG pushed
peak memory past the ~3.3 GB Jetsam ceiling once the device was warm
from earlier jobs in the queue, the app was killed via memorystatus,
and Appium never received a clean failure signal — leaving the Device
Farm run in RUNNING (PENDING) until the 60-min job timeout.

When `image.test.js` was split into per-image files in 37afd3d, the
fruit-plate variant correctly received `iosWarmupImage` +
`iosPerfRuns: 1`, but high-res aurora — the heaviest image in the
suite — was left on the standard path. That meant CPU + GPU each ran
1 warmup + N counted with the full 12 MP JPEG, so on a warm iPhone
the cold path crossed the Jetsam line.

Mirror the fruit-plate config:
- `iosWarmupImage: 'elephant.jpg'` pre-warms the multimodal pipeline
  with a 23 KB image so Metal shaders, KV cache, and image-prefill
  buffers are allocated before the 12 MP JPEG arrives. `_image-common.js`
  skips the standard PERF_WARMUP_RUNS loop once the pre-warmup runs.
- `iosPerfRuns: 1` caps iOS counted iterations at 1 regardless of
  QVAC_PERF_RUNS, so the benchmark workflow's n=3 plan does not OOM
  the iPhone.

iOS cold-path footprint is now exactly 2 inferences per backend
(1 small pre-warmup + 1 counted real image), matching fruit-plate's
known-good shape. Desktop and Android paths are unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>

@gianni-cor gianni-cor left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes for the two perf-report regressions noted inline.

@gianni-cor gianni-cor left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved after re-checking the latest changes.

@gianni-cor

Copy link
Copy Markdown
Contributor

/review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants