Skip to content

QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile#1840

Merged
tobi-legan merged 8 commits into
mainfrom
infra/qvac-18111-benchmark-llm-mobile
May 4, 2026
Merged

QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile#1840
tobi-legan merged 8 commits into
mainfrom
infra/qvac-18111-benchmark-llm-mobile

Conversation

@tobi-legan

@tobi-legan tobi-legan commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Summary

Phase 2 of the LLM benchmark workflow: extends iteration-count overrides (QVAC_PERF_RUNS / QVAC_PERF_WARMUP_RUNS) to the Device Farm side, and wires Android + iOS into the dispatch so the benchmark workflow covers desktop and mobile.

Note

Stacked on #1839 (the desktop-only scaffold, which already adds the run_desktop toggle). Once #1839 lands on main, GitHub will retarget this PR's base to main automatically.

The bare runtime inside the Device Farm test app does not see GitHub Actions env vars, so this mirrors the existing testFilter.txt pattern — when the workflow inputs are non-empty, the WDIO before hook pushes a qvacPerfConfig.txt to the device with the iteration overrides as KEY=VALUE lines. The file-reading side on bare lives on the QVAC-17830 perf branch; without it this PR is a harmless no-op (orphan file), so it is safe to land independently.

Changes

  • integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml
    • Add qvac_perf_runs / qvac_perf_warmup_runs to workflow_call and workflow_dispatch (empty default ⇒ no behaviour change for PR runs)
    • Add __QVAC_PERF_RUNS__ / __QVAC_PERF_WARMUP_RUNS__ placeholders to the Android + iOS WDIO config blobs and a sibling pushFile block in the before hook (writes KEY=VALUE\n pairs)
      • Android target: /data/local/tmp/qvacPerfConfig.txt
      • iOS target: @io.tether.test.qvac:documents/qvacPerfConfig.txt
    • make_split substitutes the new placeholders alongside the existing __TEST_FILTER__
  • benchmark-performance-qvac-lib-infer-llamacpp-llm.yml
    • Add a mobile-benchmarks job calling integration-mobile-test-…yml with the bench-mode iteration counts
    • Add run_mobile toggle (workflow_dispatch boolean input, defaults to true) and gate mobile-benchmarks with if: ${{ inputs.run_mobile }}. Together with the run_desktop toggle from QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch #1839 this lets dispatchers run desktop and mobile benchmarks independently (full / desktop-only / mobile-only)
    • summarize.needs includes mobile-benchmarks and is hardened with if: ${{ always() && needs.context.result == 'success' }} so it still produces a partial report when one of the two matrices is toggled off or fails
    • Drop the "Desktop only" caveat in the step-summary blurb

Toggle matrix (combined with #1839)

run_desktop run_mobile What runs
full benchmark (desktop + Android + iOS) — default
desktop-only
mobile-only
nothing useful — summarize emits an empty report header

prebuild is not gated and runs unconditionally; splitting it per matrix is a separate optimisation.

What's NOT in this PR

Test plan

  • Land QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch #1839 first
  • Land this PR — confirm umbrella PR runs are unchanged (no qvacPerfConfig.txt pushed when inputs are empty)
  • Once the perf branch (QVAC-17830) lands the bare-side file fallback, dispatch Benchmark Performance (LLM) and confirm:
    • Mobile job pushes the file before clicking "Run Automated Tests"
    • Bare logs [QvacPerfConfig] loaded ... (or equivalent — naming TBD on perf branch)
    • Mobile rows in the combined report show mean ± std instead of single-shot data
  • Dispatch with run_desktop=false, run_mobile=true → only Device Farm runs; summarize still produces a (mobile-only) report
  • Dispatch with run_desktop=true, run_mobile=false → desktop-only run, no Device Farm cost

…ow_dispatch

GitHub requires a `workflow_dispatch` workflow to exist on the
default branch before it shows up in the Actions tab and becomes
triggerable with `--ref <feature-branch>`. This lands the LLM
benchmark workflow on `main` so the QVAC-17830 perf-metrics feature
branch can be dispatched against it for end-to-end validation.

Changes:
- `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` (new):
  manual `workflow_dispatch` only — mirrors the structure of the
  existing Parakeet / Whispercpp benchmark workflows. Calls
  `prebuilds-...yml` then `integration-test-...yml` with
  bench-mode iteration counts (`QVAC_PERF_RUNS=3`,
  `QVAC_PERF_WARMUP_RUNS=1` by default), then aggregates desktop
  artifacts into a combined HTML / step-summary. Phase-1 scope is
  desktop only — mobile (Device Farm) needs a build-time hook in
  the test app to thread env vars through to bare and is tracked
  as a QVAC-18111 follow-up.
- `integration-test-qvac-lib-infer-llamacpp-llm.yml`: thread
  `qvac_perf_runs` / `qvac_perf_warmup_runs` through `workflow_call`
  + `workflow_dispatch` and surface them as `QVAC_PERF_RUNS` /
  `QVAC_PERF_WARMUP_RUNS` on the Linux/macOS and Windows test run
  steps. Empty string => unset, so the umbrella PR workflow
  continues to honour the test-side default and PR runs are
  unaffected by this change.

Per the perf policy agreed on Slack (2026-04-30): the umbrella
on-pr workflow runs perf tests at the cheap default so we don't pay
full perf cost on every PR; this dedicated workflow is the only
place we crank up the iteration counts to produce mean ± std
numbers.

Made-with: Cursor
…a pushFile

Extends the mobile integration workflow with the same iteration-count
inputs as the desktop reusable workflow, and adds a `mobile-benchmarks`
job to the LLM benchmark dispatch so it covers Device Farm too.

The bare runtime on Device Farm doesn't see GitHub Actions env vars,
so we mirror the existing `testFilter.txt` pattern: when the workflow
inputs are non-empty, the WDIO before-hook pushes a `qvacPerfConfig.txt`
to the device (Android: `/data/local/tmp/`, iOS:
`@bundleId:documents/`) with the iteration overrides as KEY=VALUE
lines. The file-reading side on bare lives on the QVAC-17830 perf
branch — without that branch this PR is a no-op (orphan file), so it
is safe to land independently.

Changes:
- `integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml`: add
  `qvac_perf_runs` / `qvac_perf_warmup_runs` to `workflow_call` and
  `workflow_dispatch`; add `__QVAC_PERF_RUNS__` /
  `__QVAC_PERF_WARMUP_RUNS__` placeholders to the Android + iOS
  WDIO config blobs and the corresponding pushFile block in the
  `before` hook; substitute the placeholders in `make_split`.
- `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml`: add a
  `mobile-benchmarks` job calling the mobile workflow with the
  bench-mode iteration counts; have `summarize` `needs:` it; drop
  the "desktop only" caveat in the step-summary blurb.

PR runs are unchanged: empty input ⇒ empty placeholder ⇒ before-hook
skips the perf-config push.

Made-with: Cursor
@tobi-legan tobi-legan force-pushed the infra/qvac-18111-benchmark-llm-mobile branch from b961099 to 9cdf07b Compare April 30, 2026 18:00
Base automatically changed from infra/qvac-18111-benchmark-llm-workflow to main May 4, 2026 11:41
Resolves add/add conflict in benchmark-performance-qvac-lib-infer-llamacpp-llm.yml
by keeping the mobile-aware version from this branch (run_mobile input,
mobile-benchmarks job, summarize wired through both desktop+mobile).

Made-with: Cursor
@tobi-legan tobi-legan marked this pull request as ready for review May 4, 2026 11:47
@tobi-legan tobi-legan requested a review from a team as a code owner May 4, 2026 11:47
@tobi-legan

Copy link
Copy Markdown
Contributor Author

/review

@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (2/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@Proletter

Copy link
Copy Markdown
Collaborator

review

@tobi-legan

Copy link
Copy Markdown
Contributor Author

/review

@GustavoA1604

Copy link
Copy Markdown
Contributor

/review

@tobi-legan tobi-legan merged commit 0905567 into main May 4, 2026
15 of 29 checks passed
@tobi-legan tobi-legan deleted the infra/qvac-18111-benchmark-llm-mobile branch May 4, 2026 13:09
tobi-legan added a commit that referenced this pull request May 4, 2026
Picks up the QVAC-18111 scaffold PRs that landed on main:
- #1839: Benchmark Performance (LLM) workflow_dispatch + qvac_perf_runs/
  qvac_perf_warmup_runs plumbing on integration-test-...yml
- #1840: bridge of those inputs into integration-mobile-test-...yml via
  pushFile of qvacPerfConfig.txt onto the device

Conflict resolution:
- benchmark-performance-...yml: take main entirely (canonical, has
  run_desktop+run_mobile toggles + mobile-benchmarks job)
- integration-test-...yml: keep this branch's HTML perf report
  generation step + per-runner perf-report-llamacpp-llm-* upload, drop
  the chatty inline comments around qvac_perf_runs (config is now
  self-documenting in the workflow surface)
- integration-mobile-test-...yml: take main's WDIO_CONFIG for Android
  (has the __QVAC_PERF_RUNS__/qvacPerfConfig.txt pushFile bridge)

Made-with: Cursor
tobi-legan added a commit that referenced this pull request May 4, 2026
Closes the missing leg in the QVAC-18111 mobile bench-mode bridge.

The Benchmark Performance (LLM) workflow_dispatch already pushes a
KEY=VALUE qvacPerfConfig.txt onto the device via Appium pushFile (PR
#1840) at /data/local/tmp/qvacPerfConfig.txt on Android and
@bundleId:documents/qvacPerfConfig.txt on iOS. Without a consumer the
file just sat there and tests silently kept the PR-default 1+1 count.

This adds a tryLoadPerfConfig helper to integration-runtime.cjs that
mirrors the existing testFilter.txt loader: read the file from the
same two paths, parse KEY=VALUE per line, inject each via
bare-os setEnv, then unlink. The load is piggy-backed onto the first
__shouldRunTest call, which fires before runIntegrationModule
dynamically imports any test module — so the existing os.getEnv()
lookups in _image-common.js / bitnet.test.js / tool-calling.test.js
see the dispatched value at their own module init time.

Empty file / missing file is a no-op so PR runs are unaffected.

Made-with: Cursor
tobi-legan added a commit that referenced this pull request May 4, 2026
…time

Mobile side of QVAC-18111: now that the umbrella benchmark workflow
pushes qvacPerfConfig.txt onto the device (PR #1840), the bare runtime
needs to actually consume it. Without this, dispatched perf overrides
were silently ignored on Android / iOS and tests defaulted to 1+1.

Changes:
- packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration-runtime.cjs:
  read qvacPerfConfig.txt from the same paths the testFilter loader
  uses (/data/local/tmp on Android, global.testDir on iOS), parse
  KEY=VALUE per line, inject each into bare-os via os.setEnv. Reads
  on first __shouldRunTest call, which fires before runIntegrationModule
  imports any test module — so os.getEnv() inside _image-common.js /
  bitnet.test.js / tool-calling.test.js sees the dispatched value at
  module init. Normalises both literal '\\n' (the bash-escaped form
  the WDIO before-hook produces today) and real '\\r?\\n' so the
  parser doesn't depend on which encoding the workflow happens to use.

- .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml:
  extract the two giant Android / iOS WDIO_CONFIG JS strings out of
  the 'Create and Upload Test Spec' run: block into env: vars. GitHub
  Actions imposes a hard 21 000-char limit on any run: block that
  contains ${{ }} expressions, and the new Android image-perf group
  + iOS heavy7/heavy8 splits + collapsed-format echo line for the test
  split summary pushed us over. Moving the WDIO blobs to env: drops
  the run block from ~21 KB to ~11 KB so we have headroom for further
  additions.

Made-with: Cursor
tamer-hassan-tether pushed a commit that referenced this pull request May 5, 2026
…a pushFile (#1840)

* QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch

GitHub requires a `workflow_dispatch` workflow to exist on the
default branch before it shows up in the Actions tab and becomes
triggerable with `--ref <feature-branch>`. This lands the LLM
benchmark workflow on `main` so the QVAC-17830 perf-metrics feature
branch can be dispatched against it for end-to-end validation.

Changes:
- `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` (new):
  manual `workflow_dispatch` only — mirrors the structure of the
  existing Parakeet / Whispercpp benchmark workflows. Calls
  `prebuilds-...yml` then `integration-test-...yml` with
  bench-mode iteration counts (`QVAC_PERF_RUNS=3`,
  `QVAC_PERF_WARMUP_RUNS=1` by default), then aggregates desktop
  artifacts into a combined HTML / step-summary. Phase-1 scope is
  desktop only — mobile (Device Farm) needs a build-time hook in
  the test app to thread env vars through to bare and is tracked
  as a QVAC-18111 follow-up.
- `integration-test-qvac-lib-infer-llamacpp-llm.yml`: thread
  `qvac_perf_runs` / `qvac_perf_warmup_runs` through `workflow_call`
  + `workflow_dispatch` and surface them as `QVAC_PERF_RUNS` /
  `QVAC_PERF_WARMUP_RUNS` on the Linux/macOS and Windows test run
  steps. Empty string => unset, so the umbrella PR workflow
  continues to honour the test-side default and PR runs are
  unaffected by this change.

Per the perf policy agreed on Slack (2026-04-30): the umbrella
on-pr workflow runs perf tests at the cheap default so we don't pay
full perf cost on every PR; this dedicated workflow is the only
place we crank up the iteration counts to produce mean ± std
numbers.

Made-with: Cursor

* QVAC-18111 chore[notask]: trim chatty inline comments in benchmark workflow

Made-with: Cursor

* QVAC-18111 chore[notask]: add run_desktop toggle to benchmark workflow_dispatch

Made-with: Cursor

* QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile

Extends the mobile integration workflow with the same iteration-count
inputs as the desktop reusable workflow, and adds a `mobile-benchmarks`
job to the LLM benchmark dispatch so it covers Device Farm too.

The bare runtime on Device Farm doesn't see GitHub Actions env vars,
so we mirror the existing `testFilter.txt` pattern: when the workflow
inputs are non-empty, the WDIO before-hook pushes a `qvacPerfConfig.txt`
to the device (Android: `/data/local/tmp/`, iOS:
`@bundleId:documents/`) with the iteration overrides as KEY=VALUE
lines. The file-reading side on bare lives on the QVAC-17830 perf
branch — without that branch this PR is a no-op (orphan file), so it
is safe to land independently.

Changes:
- `integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml`: add
  `qvac_perf_runs` / `qvac_perf_warmup_runs` to `workflow_call` and
  `workflow_dispatch`; add `__QVAC_PERF_RUNS__` /
  `__QVAC_PERF_WARMUP_RUNS__` placeholders to the Android + iOS
  WDIO config blobs and the corresponding pushFile block in the
  `before` hook; substitute the placeholders in `make_split`.
- `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml`: add a
  `mobile-benchmarks` job calling the mobile workflow with the
  bench-mode iteration counts; have `summarize` `needs:` it; drop
  the "desktop only" caveat in the step-summary blurb.

PR runs are unchanged: empty input ⇒ empty placeholder ⇒ before-hook
skips the perf-config push.

Made-with: Cursor

* QVAC-18111 chore[notask]: add run_mobile toggle to benchmark workflow_dispatch

Made-with: Cursor

---------

Co-authored-by: Proletter <40578159+Proletter@users.noreply.github.com>
tamer-hassan-tether pushed a commit that referenced this pull request May 6, 2026
…a pushFile (#1840)

* QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch

GitHub requires a `workflow_dispatch` workflow to exist on the
default branch before it shows up in the Actions tab and becomes
triggerable with `--ref <feature-branch>`. This lands the LLM
benchmark workflow on `main` so the QVAC-17830 perf-metrics feature
branch can be dispatched against it for end-to-end validation.

Changes:
- `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` (new):
  manual `workflow_dispatch` only — mirrors the structure of the
  existing Parakeet / Whispercpp benchmark workflows. Calls
  `prebuilds-...yml` then `integration-test-...yml` with
  bench-mode iteration counts (`QVAC_PERF_RUNS=3`,
  `QVAC_PERF_WARMUP_RUNS=1` by default), then aggregates desktop
  artifacts into a combined HTML / step-summary. Phase-1 scope is
  desktop only — mobile (Device Farm) needs a build-time hook in
  the test app to thread env vars through to bare and is tracked
  as a QVAC-18111 follow-up.
- `integration-test-qvac-lib-infer-llamacpp-llm.yml`: thread
  `qvac_perf_runs` / `qvac_perf_warmup_runs` through `workflow_call`
  + `workflow_dispatch` and surface them as `QVAC_PERF_RUNS` /
  `QVAC_PERF_WARMUP_RUNS` on the Linux/macOS and Windows test run
  steps. Empty string => unset, so the umbrella PR workflow
  continues to honour the test-side default and PR runs are
  unaffected by this change.

Per the perf policy agreed on Slack (2026-04-30): the umbrella
on-pr workflow runs perf tests at the cheap default so we don't pay
full perf cost on every PR; this dedicated workflow is the only
place we crank up the iteration counts to produce mean ± std
numbers.

Made-with: Cursor

* QVAC-18111 chore[notask]: trim chatty inline comments in benchmark workflow

Made-with: Cursor

* QVAC-18111 chore[notask]: add run_desktop toggle to benchmark workflow_dispatch

Made-with: Cursor

* QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile

Extends the mobile integration workflow with the same iteration-count
inputs as the desktop reusable workflow, and adds a `mobile-benchmarks`
job to the LLM benchmark dispatch so it covers Device Farm too.

The bare runtime on Device Farm doesn't see GitHub Actions env vars,
so we mirror the existing `testFilter.txt` pattern: when the workflow
inputs are non-empty, the WDIO before-hook pushes a `qvacPerfConfig.txt`
to the device (Android: `/data/local/tmp/`, iOS:
`@bundleId:documents/`) with the iteration overrides as KEY=VALUE
lines. The file-reading side on bare lives on the QVAC-17830 perf
branch — without that branch this PR is a no-op (orphan file), so it
is safe to land independently.

Changes:
- `integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml`: add
  `qvac_perf_runs` / `qvac_perf_warmup_runs` to `workflow_call` and
  `workflow_dispatch`; add `__QVAC_PERF_RUNS__` /
  `__QVAC_PERF_WARMUP_RUNS__` placeholders to the Android + iOS
  WDIO config blobs and the corresponding pushFile block in the
  `before` hook; substitute the placeholders in `make_split`.
- `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml`: add a
  `mobile-benchmarks` job calling the mobile workflow with the
  bench-mode iteration counts; have `summarize` `needs:` it; drop
  the "desktop only" caveat in the step-summary blurb.

PR runs are unchanged: empty input ⇒ empty placeholder ⇒ before-hook
skips the perf-config push.

Made-with: Cursor

* QVAC-18111 chore[notask]: add run_mobile toggle to benchmark workflow_dispatch

Made-with: Cursor

---------

Co-authored-by: Proletter <40578159+Proletter@users.noreply.github.com>
gianni-cor added a commit that referenced this pull request May 12, 2026
… & GPU) (#1843)

* QVAC-17830 feat: add VLM perf metrics with multi-run averaging

Adds CPU + GPU VLM performance reporting to the llamacpp-llm addon on
desktop + mobile (iOS / Android Device Farm), mirroring the OCR perf
pipeline.

Perf reporter / aggregator
- scripts/test-utils/performance-reporter.js: extend `vision` addon
  type with backend, platform, prefill, decode, vision_encode,
  image_prefill and status columns.
- scripts/perf-report/utils.js: teach the aggregator the new VLM
  metric labels so markdown/HTML reports render them.

CI wiring
- integration-test-qvac-lib-infer-llamacpp-llm.yml (desktop): add
  aggregate + upload-artifact steps so CPU/GPU perf artifacts are
  produced alongside the existing test run.
- integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml (mobile):
  scan Device Farm logs for [PERF_REPORT_START]...[PERF_REPORT_END]
  markers, run extract-from-log.js + aggregate.js, upload per-platform
  perf artifacts. Mirrors the OCR mobile workflow pattern.

Test refactor (image.test.js)
- Include image name in the perf label ("[elephant] [CPU]" etc.) so
  aggregate.js (which groups by `result.test`) no longer pools
  different images into a single CPU/GPU cell.
- Run 1 warmup + 3 counted inferences per (image x backend) so the
  aggregator produces count=3, mean, std within a single CI run --
  matches OCR's PERF_RUNS=3 convention.
- Reuse the loaded LlmLlamacpp instance across iterations inside a
  single test() block: deliberate divergence from OCR's one-test()-
  per-iteration shape because VLM model load is multi-GB and already
  stresses iOS Device Farm memory.
- Bump per-test timeout to 25 min for the multi-iter perf tests;
  leave TEST_CONSTANTS.timeout unchanged for other tests in the file.
- Keyword assertions run once against the last counted iteration's
  output.

vision_encode_time_ms / image_prefill_time_ms are stubbed as null
pending native RuntimeStats instrumentation (separate follow-up task).

* QVAC-17830 feat: wire Mobile LLM into perf-report.yml weekly aggregator

Mirrors the OCR desktop/mobile split. Adds a "Vision/LLM (Mobile)"
block to the scheduled cron job so Monday 9am UTC runs aggregate
perf artifacts from Mobile Integration Tests (LLM) into a separate
llamacpp-llm-mobile-performance.{md,json,html} report alongside the
existing desktop one. Also adds "Mobile Integration Tests (LLM)" to
the workflow_dispatch workflow_name choice list so it can be
triggered manually.

Without this, the per-run Android/iOS perf artifacts the mobile
workflow now uploads would sit unused by the weekly cross-run
aggregator.

* QVAC-17830 feat: add per-run joint perf reporter to mobile LLM workflow

Mirrors the `combine-reports` job in the OCR mobile workflow. Runs
after the per-platform matrix completes and downloads every artifact
matching `perf-report-llamacpp-llm-*-<run_number>` -- which, because
the desktop (integration-test-*.yml) and mobile (this file) workflows
are both invoked via workflow_call from the umbrella On PR Trigger
(LLM), share the same github.run_number and therefore live together
in the single run's artifact scope. The job then:

- aggregates all desktop + mobile rows into a single combined HTML,
  Markdown and JSON report via scripts/perf-report/aggregate.js;
- patches the `device.name` for desktop artifacts using the artifact
  suffix (e.g. "linux-x64", "macos-arm64") so they render with
  recognisable column headers next to the mobile device names;
- emits a per-device HTML report for each unique device encountered
  (desktop OS/arch variants + each mobile device);
- writes the combined markdown report into $GITHUB_STEP_SUMMARY so
  the PR run page shows one unified perf table covering every
  platform the addon ran on in that run.

This is what the user actually meant by "joint performance reporter"
-- distinct from the weekly/cross-run aggregation in perf-report.yml,
which aggregates historical runs for trend tracking.

* QVAC-17830 fix: preserve mobile perf data on OOM, split image tests per device

Android perf data was never surfacing because the inline fallback
reporter used on mobile was a no-op stub (no record/writeToConsole),
so logcat never saw PERF_REPORT markers. iOS image tests were also
crashing mid-run (Jetsam/memorystatus) before process.on('exit')
could flush, losing whichever runs had already completed.

- Replace the mobile fallback reporter in _image-common.js with a
  functional inline implementation that mirrors OCR's: record(),
  writeReport(), and writeToConsole() with lightweight snapshots
  and logcat-safe chunking.
- Flush incrementally after every record() on mobile so earlier
  iterations survive a SIGKILL later in the same group.
- Split image.test.js into per-image files (elephant, fruit plate,
  high-res aurora) behind a shared _image-common.js helper, and
  wire each into its own Device Farm group (Android: 3 new groups;
  iOS: heavy3=elephant, heavy7=fruit, heavy8=aurora).
- Teach extract-from-log.js to --merge multiple reports per device
  (needed because each physical device now runs three groups) and
  add --device-from-filename as a fallback when logs land in a
  flat layout.
- Update the mobile workflow to schedule 5 Android / 10 iOS runs,
  bucket downloaded artifacts by device, unzip Customer_Artifacts
  inline so bare_console.log is scannable, and invoke
  extract-from-log.js with --merge.

Verified locally with a happy-path sim (3 samples per image×backend
per device) and a Jetsam crash sim (aurora SIGKILLed mid-GPU on
iPhone 16 Pro Max) — 16 of 18 rows still reach the aggregated
report on the crashed device, and the other three groups report
fully.

Made-with: Cursor

* QVAC-17830 fix: preserve mobile perf data under iOS V8 Zone OOM

Fruit plate iOS test crashed inside Builtin_JsonStringify because the
per-test flush stringified the cumulative results array on every
record() call (3+ runs x 2 backends x verbose VLM output over a 10MB
image + Metal compiler memory). Even with logcat chunking, the JSON
build itself exhausted V8's Zone allocator, producing SIGTRAP from
FatalProcessOutOfMemory before any data reached logcat.

Changes in the mobile fallback reporter (_image-common.js):
- Cap stored output text to 400 chars per record so the accumulated
  results array stays small regardless of model verbosity.
- Add delta mode to writeToConsole: emit only the latest row per
  flush, so stringify stays O(1) in iteration count on mobile.
- Stop calling writeReport() on disk per-record; rewriting the full
  JSON every iteration is both expensive and an OOM hazard. One
  final writeReport runs in the exit hook for the on-device artifact.
- Remove the final cumulative writeToConsole in the exit hook: the
  per-test delta emits already carry every row, and emitting a large
  cumulative payload at exit is exactly when we most need the earlier
  deltas to survive in logcat/syslog.

Changes in extract-from-log.js:
- Split extractFromText into extractAllFromText (returns every report
  parsed from the text) + the legacy extractFromText wrapper that
  returns the last one.
- Add opts.all to extractFromFile so --merge gathers every delta
  emit from a log file, not just the largest or last marker pair.
- main() with --merge now pushes all reports per file into the
  device bucket before handing them to mergeDeviceReports, so
  partial data from a pre-crash run still lands in the aggregate.

Verified locally with a delta-emit simulation that mid-test crashes
leave runs 1..N-1 intact in the reconstructed report.

Made-with: Cursor

* QVAC-17830 fix: plug combined perf report gaps (desktop race + artifact collision)

Two pre-existing bugs caused the combined performance table to drop
data on a PR run:

1. `darwin-x64` (and any other slow desktop job) was missing because
   `combine-reports` lived inside the mobile workflow and only waited
   on the mobile matrix. Desktop jobs that finished AFTER mobile were
   silently excluded. Move the job to the umbrella
   (`on-pr-qvac-lib-infer-llamacpp-llm.yml`) as `combine-perf-reports`
   with `needs: [run-integration-tests, run-mobile-integration-tests]`,
   so it blocks on the full fan-out before scanning artifacts.

2. `linux-x64` GPU data (and the ubuntu-22 vs ubuntu-24 arm64 pair)
   was overwritten because every matrix entry uploaded its artifact
   under `perf-report-llamacpp-llm-<platform>-<arch>-<run>`; any two
   entries sharing platform+arch clobbered each other on download.
   Add a unique `label` field to each matrix entry
   (`linux-x64-cpu`, `linux-x64-gpu`, `linux-arm64-u22`,
   `linux-arm64-u24`, ...) and key the upload + device-name patch on
   `matrix.label` instead.

Made-with: Cursor

* QVAC-17830 fix: iOS fruit plate retry + consolidate Android images + flush logs on crash

Three changes scoped to the mobile LLM integration workflow:

* Android: merge the three per-image Device Farm groups back into a single
  groupImagesPerf run. Android does not have the iOS jetsam pressure, so
  running all three image tests in one invocation is both cheaper and
  simpler while still giving the aggregator 9 rows (3 images x 3 iters)
  per device.

* iOS: schedule Heavy7 (fruit plate) a second time as
  Heavy7-ImgFruitPlate-Retry. Both attempts land in the same per-device
  folder on download, and extract-from-log.js --merge dedupes on the
  metric fingerprint, so if either attempt gets past iOS jetsam we have
  perf data. If both succeed we just get more iterations to average
  over, which is strictly better. run_count goes 10 -> 11 and the
  Monitor / Download loops were extended accordingly.

* Crash path (both platforms): route process.exit(1) in checkAppCrash
  and crashMonitor through a new global.flushLogsAndExit helper that
  (a) sleeps 5s so bare stdout + iOS os_log / Android logcat can drain
  and (b) on iOS performs an inline Appium pull_file of
  bare_console.log to $DEVICEFARM_LOG_DIR before exiting. This means
  delta perf emits that hit disk before a jetsam kill now reach the
  aggregator instead of being stuck in the app sandbox.

Made-with: Cursor

* QVAC-17830 fix: inline crash flush-delay, drop duplicated pull helper

Previous commit pushed the `Create and Upload Test Spec` run: block
over GitHub Actions' ~30K template-expression budget, so the umbrella
workflow refused to dispatch with:
    Line 927, Col 14: Exceeded max expression length 21000

Root cause: `global.flushLogsAndExit` was inlined into BOTH Android
and iOS wdio configs. On iOS it duplicated the exact Appium pull_file
logic the `after` hook already has, and on Android it carried dead
iOS-only code — roughly 1.8 kB of repeated JS across the two configs.

Instead:

* Drop `global.flushLogsAndExit` entirely.
* In `checkAppCrash` and `crashMonitor` (both Android + iOS), replace
  `process.exit(1)` with `setTimeout(function(){process.exit(1);},5000)`.
  Same 5s drain window for logcat / os_log / bare stdout, one line of
  JS, zero helper definition.
* iOS `after` hook still pulls `bare_console.log` via Appium on the
  normal path, so delta perf data from successful runs is unchanged.

Net `run:` block size goes from 31,524 → 28,992 chars (below the
previously-working 28,995). Both wdio configs revalidated with
`node --check` and the full workflow YAML round-trips through PyYAML.

Made-with: Cursor

* QVAC-17830 fix: iOS fruit plate warmup + merge linux legs + mobile detail tables

- Add iOS-only tiny-image (elephant.jpg) warmup to fruit plate test so
  Metal shaders / KV cache / image-prefill buffer are allocated with a
  cheap 23 KB image before the 10 MB PNG hits, keeping us under Jetsam's
  per-process memory cap. Opt-in via testCase.iosWarmupImage so desktop
  and Android paths remain byte-for-byte identical.
- Fold linux-x64-cpu / linux-x64-gpu (and linux-arm64-u22 / u24) matrix
  legs into single "linux-x64" / "linux-arm64" columns in the combined
  report's Fix desktop device names step. Previously the summary showed
  four separate columns with "-" GPU rows that looked like missing data
  even though the sibling leg had populated them.
- Preserve categorical metrics (backend, platform, status,
  execution_provider) through aggregateReports and render per-device
  detail tables for every device — mobile included — in the combined
  markdown summary via a new --device-details flag on aggregate.js.
  Mobile devices now get the same twelve-column breakdown (Test, EP,
  Backend, Platform, Total / Prefill / Decode / Vision Enc / Img
  Prefill / TTFT ms, Gen Tokens, Prompt Tokens, TPS, Status) that
  desktop matrix legs emit via writeStepSummary.

Verified end-to-end with the local simulation at /tmp/vlm-combined-sim/
(27/27 assertions pass).

Made-with: Cursor

* QVAC-17830 fix: iOS fruit plate 1-iter override + HTML detail tables

Two fixes after run 24909705757 landed iOS perf data for elephant +
aurora but still lost fruit plate, and the combined HTML report was
missing the row-per-test detail tables that the step-summary markdown
has.

1) iOS-only `iosPerfRuns` override, set to 1 for fruit plate alone.
   With the elephant pre-warmup in place, the fruit-plate test still
   crashed at t+85s (V8 bad_alloc during JSON.stringify) on the 2nd/3rd
   counted iteration, leaving zero rows in the perf artifact for both
   iPhone 16 Pro and iPhone 17. Dropping iOS fruit plate to 1 counted
   iteration cuts the cold-path peak footprint from 4 inferences to 2
   (warmup + 1 counted), which stays under the 3376 MB Jetsam cap in
   local simulation. Desktop + Android still run the full PERF_RUNS=3.

2) Wire `generateDeviceDetailTables` into `generateHtmlReport` so the
   combined HTML report mirrors the markdown step-summary layout —
   one section per device with Backend / Platform / Total Time /
   Prefill / Decode / Vision Enc / Img Prefill / TTFT / Gen Tokens /
   Prompt Tokens / TPS / Status columns. Previously only the markdown
   had those columns, so mobile devices (which can't write to
   GITHUB_STEP_SUMMARY from Device Farm) appeared blank in the HTML
   artifact. The per-device HTML report step also gets the flags so
   each individual HTML has parity. Verified locally against the
   run-24909705757 artifact set: all 9 devices render the full row-
   per-test summary, including Apple iPhone 16 Pro / iPhone 17 with
   backend=metal and platform=ios-arm64.

Made-with: Cursor

* QVAC-17830 fix: warm process iOS heavy7 + dedupe perf legs + drop retry

Five user-requested follow-ups after run 24913460964:

1) iOS heavy7 is now a "warm process" group. test-groups.json puts
   runApiBehaviorTest before runImageFruitPlateTest so V8 / Metal /
   native allocators are pre-warmed by a cheap test before the
   10 MB fruit-plate PNG hits the multimodal pipeline.

2) The iOS Heavy7-ImgFruitPlate-Retry one-shot retry run (RUN_ARN_11)
   is removed from the mobile workflow. Run count goes from 11 → 10;
   monitoring + log-download loop bounds updated; the inline comment
   that justified the retry is replaced with one explaining the new
   warm-process rationale.

3) JS-only crash-path bare_console.log pull on iOS. The existing
   pull_file logic that used to live inline in `after` is now a
   global.flushBareLog helper installed in `before`, called from
   global.checkAppCrash and the setInterval crash monitor (after a
   1.5 s browser.pause to give stdout a chance to flush) BEFORE the
   5 s setTimeout(process.exit). Net effect: crashed runs that
   previously left a 0-byte LOG artifact now have a real
   bare_console.log on disk for triage. No C++ touched.

4) `aggregateReports` now dedupes sibling matrix legs that get folded
   into one device name. Previously linux-x64 / [elephant] [CPU]
   accumulated 6 iterations (3 from linux-x64-cpu + 3 from the GPU
   leg's CPU rows) which surfaced as "Run 4 / 5 / 6" headers in the
   per-test breakdown tables. The new rule: per (device, test,
   run_number), only the FIRST report's rows are kept — subsequent
   sibling-leg reports for the same pair are skipped. Within a
   single report all PERF_RUNS iterations still flow through (so
   counted runs per device stays at 3). The weekly aggregator is
   unaffected because its inputs span different run_numbers.

   Verified in /tmp/vlm-combined-sim/sim.js with synthetic 3-iter
   reports per matrix leg; linux-x64 / [elephant] [CPU] now reports
   `count: 3` and the markdown header reads "Iterations: 3".

5) Per-device detail-table headings use the device name as the
   anchor instead of a repeated "### Performance: <addon>". GitHub
   collapses duplicate heading slugs in step summaries which made
   it impossible to deep-link or scroll to the mobile detail
   tables — they were rendered, just hidden under a single anchor.
   Each device now gets `### <Device> — <addon> (<platform/arch>)`
   so Android / iOS sections appear in the step-summary TOC.

Made-with: Cursor

* QVAC-17830 fix: warm iOS heavy7 with elephant instead of api-behavior

The previous warm-process seed (runApiBehaviorTest) used a text-only
Qwen3-0.6B model, so it never exercised the mmproj load or the Metal
vision shader JIT that fruit-plate then hit cold. iPhone 17 kept
SIGABRT'ing between the LLM context init and the first inference
during exactly those steps.

Swap heavy7 to [runImageElephantTest, runImageFruitPlateTest] so the
small (23 KB) elephant test pre-warms the exact VLM code path —
SmolVLM2 + mmproj + Metal vision graph + KV cache — before the
10 MB fruit-plate PNG lands.

* QVAC-17830 fix: shrink iOS fruit-plate to 2 inferences cold

The previous attempt put runImageElephantTest as a "warm process"
prefix to heavy7. That actually executes the full elephant perf
cycle — 1 warmup + 3 counted on CPU, same on GPU, plus a media-
path test = 9 elephant inferences and 2 SmolVLM2 model loads —
before runImageFruitPlateTest gets to do anything. Net effect on
heavy7 was 12 inferences pre-fruit-plate-row, which regressed
iPhone 16 Pro (had been passing) and iPhone 17 still died after
the inline elephant warmup completed.

New shape:
  - test-groups.json: iOS heavy7 = ["runImageFruitPlateTest"]
    (cold process again — runImageElephantTest stays in heavy3 as
    its own perf slot, untouched).
  - _image-common.js: when iOS pre-warmup (testCase.iosWarmupImage)
    succeeds, set an iosPreWarmupRan flag and skip the standard
    PERF_WARMUP_RUNS loop. The pre-warmup IS the warmup pass.
  - image-fruit-plate.test.js: same iosWarmupImage='elephant.jpg'
    + iosPerfRuns=1 config; comments rewritten to describe the new
    flow.
  - mobile workflow: heavy7 comment block updated to match.

End-to-end iOS shape for fruit-plate is now exactly what was
asked for:
  1. setupMultimodalInference loads SmolVLM2 cold
  2. 1 elephant inference (inline iosWarmupImage, no perf record)
  3. standard warmup loop SKIPPED
  4. 1 fruit-plate inference (counted, perf recorded)
  = 2 inferences total

Desktop and Android paths are completely unchanged — they still
run 1 warmup + PERF_RUNS=3 counted. iOS tests without an
iosWarmupImage (elephant, high-res aurora) keep their existing
1 warmup + iosPerfRuns/PERF_RUNS counted shape.

Made-with: Cursor

* QVAC-17830 feat: scenario grouping, GPU probe, squashed PR summary, perf for bitnet/tool-calling

- extract _perf-helper.js out of _image-common.js so bitnet,
  tool-calling and image tests share one perf-recording path
- add `scenario` field on perf records ('image' / 'bitnet' /
  'tool-calling' / 'default'), plumb through aggregator + reports.
  Per-device detail tables (md + html) now group rows by scenario
  so a single combined report can surface multiple test families
  side by side instead of mixing them into one giant table
- drop the always-'ok' Status column from the vision metric layout
  (just noise; the JS layer no longer hardcodes status: 'ok')
- squash the combined GH step summary to three Mean ± std
  mini-tables (Total Time / TTFT / TPS), grouped by scenario. The
  HTML artifact still carries the full per-iteration breakdown
- split aggregate.js --device-details into --md-device-details and
  --html-device-details (legacy --device-details still toggles both)
- wire bitnet.test.js (1 warmup + 1 counted) and tool-calling.test.js
  (one perf row per model_variant x prompt cell) onto _perf-helper.js
- add desktop GPU probe in detectDevice(): nvidia-smi -L on
  linux/win32, lspci as linux fallback, system_profiler on darwin,
  wmic on win32. New device.gpu field is surfaced in the per-device
  detail-table headings (md + html). Mobile leaves gpu=null —
  device.name (Device Farm label) already implies the chipset
- add scripts/perf-report/sim-llamacpp-llm.js for local report-shape
  verification mirroring the umbrella workflow flags

Made-with: Cursor

* feat: surface per-device detail tables in PR summary with mean ±std cells

QVAC-17830. The squashed Mean ± std mini-tables stay at the top of the
combined PR summary, but the per-device detail tables (Test / EP /
Backend / Platform / Total / Prefill / Decode / Vision Enc / Img
Prefill / TTFT / Gen Tokens / Prompt Tokens / TPS) are now appended
underneath instead of being HTML-only. Numeric cells in the detail
tables render as `mean ±std` so the iteration spread is visible at
every metric, not just the rolled-up ones; token-count columns stay
as bare integers since deterministic generation makes `±0` noise.
The umbrella workflow flips from `--html-device-details` back to
`--device-details` to apply this in both the markdown and HTML outputs.

Made-with: Cursor

* refactor: drop image_prefill_time_ms from perf report

QVAC-17830. Per-prompt image-prefill time turned out to overlap with
the prefill_time_ms / TTFT signal we already report, so the column
was essentially redundant. Vision Enc (ms) — the mmproj / vision-
encoder time — is the only VLM-specific timer worth keeping for now;
its native wiring is tracked separately under
https://app.asana.com/1/45238840754660/project/1212638335655990/task/1214371583877702.

Removed everywhere it surfaced:
- scripts/test-utils/performance-reporter.js (vision METRIC_COLUMNS
  + record() default metrics)
- scripts/perf-report/utils.js (label map + vision detail columns)
- packages/qvac-lib-infer-llamacpp-llm/test/integration/_perf-helper.js
  (mobile inline-fallback default metrics + recordPerformance payload)
- scripts/perf-report/sim-llamacpp-llm.js (synthetic row generators)

Made-with: Cursor

* QVAC-17830 fix: tighten combined perf report layout (column filtering + legend)

The cross-platform mean tables previously emitted ALL device columns
for every scenario block, so e.g. the bitnet block (Android-only by
design — `skip: !isAndroid` in bitnet.test.js) showed 7 empty
desktop/iOS columns of dashes, which made the report look broken
when it was actually intentional.

- Per-scenario × per-metric column filtering: drop columns that have
  zero data in this scenario block. Bitnet collapses from 9 → 2
  columns (Pixel 9 Pro + S25 Ultra). Tool-calling drops `darwin-x64`
  (test is `skip: isDarwinX64`).
- Short legend right under the report title explaining the dashes
  that remain (gated tests like medgemma-4b-it = desktop-only, mobile
  = qwen3-1.7b only; `darwin-x64` / `linux-arm64` have no GPU runner).

Per-device detail tables are unchanged — they were already scenario-
partitioned so the layout fix doesn't apply there.

Made-with: Cursor

* QVAC-18111 feat: env-driven perf iterations + Benchmark Performance (LLM) workflow_dispatch

New perf policy agreed on Slack (2026-04-30, @Olya / @gianfranco):
PR runs default to 1 warmup + 1 counted iteration (cheap, single-
shot signal). The dedicated `Benchmark Performance (LLM)` workflow
is the only place we crank iterations up to produce mean ± std.

Changes:

- `_image-common.js` and `bitnet.test.js`: replace hardcoded
  `PERF_RUNS=3 / PERF_WARMUP_RUNS=1` with `_envInt('QVAC_PERF_RUNS',
  1)` / `_envInt('QVAC_PERF_WARMUP_RUNS', 1)`. Bad / empty / negative
  env values fall back to the safe default of 1+1.
- iOS fruit-plate's `iosPerfRuns: 1` override is preserved — it caps
  iPhone iterations regardless of `QVAC_PERF_RUNS` so the benchmark
  workflow doesn't OOM the Jetsam ceiling on heavy images. Comments
  updated to reflect the new default.
- `integration-test-qvac-lib-infer-llamacpp-llm.yml` accepts new
  `qvac_perf_runs` / `qvac_perf_warmup_runs` workflow_call inputs
  (default empty ⇒ unset env ⇒ test honours PR default of 1+1). The
  umbrella PR workflow doesn't pass these, so the PR critical path
  is unaffected.
- `benchmark-llamacpp-llm.yml` (new): workflow_dispatch only. Calls
  the desktop integration-test reusable with `qvac_perf_runs=3`,
  aggregates a benchmark report, uploads HTML artifact + GitHub
  step summary.

Phase-1 scope: desktop matrix only. Mobile (Android / iOS Device
Farm) needs a build-time env-var injection hook in the test app —
tracked as a QVAC-18111 follow-up.

Made-with: Cursor

* QVAC-17830 fix: tool-calling EP label honours NO_GPU on linux-x64-cpu runner

The `linux-x64-cpu` matrix leg (ubuntu-22.04, no_gpu='true') runs
on a CPU-only runner — llama.cpp falls back to CPU since no Vulkan
device is present — but `tool-calling.test.js` previously hardcoded
`useCpu = isLinuxArm64`, so its perf rows came out tagged `[GPU]`.
That made the combined report show GPU bars on a CPU runner for
tool-calling rows, which read as inconsistent next to the [CPU]
rows from `image` and `bitnet`.

Match the `_image-common.js` pattern: read `NO_GPU` from both
`process.env` and `os.getEnv()`, treat any case-insensitive
"true" as CPU. `useCpu` is now `isLinuxArm64 || noGpu`, so the
`linux-x64-cpu` leg labels its rows `[CPU]` correctly while the
`linux-x64-gpu` leg keeps `[GPU]`.

Made-with: Cursor

* QVAC-17830 fix: terser perf-report legend + full metric breakdown in cross-platform tables

Two follow-ups based on review feedback on the combined report:

- Trim the verbose "intentional gaps" legend down to a single short
  italic line (`> _` `-` ` = not run on this device._`). The
  per-scenario column filter already drops fully-empty columns, so
  the long explanation was no longer pulling its weight.

- Surface the same metrics the per-device detail tables already
  show (Total Time, Prefill, Decode, Vision Encode, TTFT, TPS)
  in the cross-platform mini-tables, instead of just Total/TTFT/TPS.
  Reviewer needs the full breakdown to compare backends, not just
  the headline. `vision_encode_time_ms` is currently null everywhere
  (native timer pending — Asana 1214371583877702) so its mini-table
  is auto-suppressed by the existing `hasAnyData` guard until the
  timer lands; the entry is in the list ahead of time so the table
  appears automatically once data flows.

Made-with: Cursor

* QVAC-17830 fix: use bare-os getEnv() for QVAC_PERF_RUNS / NO_GPU lookups

Bare runtime doesn't define `process` as a global at module-init
time, so referencing `process.env[key]` from the top-level
`_envInt(...)` call in `bitnet.test.js` blew up the entire
integration suite with `ReferenceError: process is not defined`
on first eval.

Switch the env reads to `bare-os`'s `os.getEnv()` (already imported)
as the primary source, and keep a `typeof process !== 'undefined'`-
guarded `process.env` fallback for any Node code paths that happen
to import these files. Apply the same pattern in:

- `_image-common.js` — for both `_envInt` and the existing
  `noGpuEnv` lookup which had the same latent bug.
- `bitnet.test.js` — for `_envInt`.
- `tool-calling.test.js` — for the new `noGpuEnv` lookup added in
  the linux-x64-cpu label fix.

Verified locally: behaves correctly when `process` is undefined,
when `process.env[key]` is unset, and when only one of the two
sources has the value.

Made-with: Cursor

* QVAC-18111 chore: align Benchmark Performance (LLM) workflow with the addon convention

Rename `.github/workflows/benchmark-llamacpp-llm.yml` →
`benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` to match the
existing `benchmark-performance-qvac-lib-infer-<addon>.yml` pattern
on main (Parakeet, Whispercpp, Supertonic TTS, etc.).

Also align the structure: introduce a `context` job that derives
`repository` / `ref` from optional inputs (with fallback to
`github.repository` / `github.ref_name`), wire prebuild + desktop
integration tests through that, and add a `summarize` job whose
artifact follows the convention `<addon>-performance-findings.{md,
html,json}`. The new `qvac_perf_runs` / `qvac_perf_warmup_runs`
inputs (defaulting to 3 / 1) are preserved — this is still the
single place we crank perf iterations up to produce mean ± std
numbers, and the umbrella PR workflow continues to leave them at
the cheap default of 1 + 1.

Workflow_dispatch only — explicitly NOT wired into the PR umbrella,
matching the policy agreed on Slack (2026-04-30, @Olya / @gianfranco).

Made-with: Cursor

* QVAC-17830 fix: address CodeQL security findings on combined-perf-report job

Resolves the three CodeQL alerts surfaced on PR #1843:

- Alert 731 (`scripts/perf-report/extract-from-log.js`, regex injection):
  The `--device-from-filename` flag fed user-supplied input straight into
  `new RegExp(...)`. No workflow on main or in this PR ever passes the
  flag — it was dead code from an earlier mobile artifact-layout
  experiment. Drop the flag, the helper, and the field from `parseArgs`,
  which removes the ReDoS surface entirely.

- Alerts 733 / 734 (`.github/workflows/on-pr-qvac-lib-infer-llamacpp-llm.yml`,
  cache poisoning via execution of untrusted code):
  The new `combine-perf-reports` job inherits the umbrella's
  `pull_request_target` trigger and consumes PR-author-controlled perf
  artifacts. The blast radius is already bounded (`permissions:
  contents: read`, no `actions/cache`, no secrets touched by the
  aggregator), but the per-device HTML step interpolated a
  PR-controlled `device.name` into a filesystem path with only a
  space → dash sanitiser. Harden it:

    * Pass the JSON file path via `process.argv` instead of shell-
      interpolating it into `node -e`, so the inner `JSON.parse` never
      sees shell-meta from the surrounding quoting.
    * Validate `device.name` is actually a string before using it.
    * Replace the `tr ' ' '-'` fallback with a strict `[A-Za-z0-9-]`
      allowlist (anything else collapses to `_`) plus a 64-char cap,
      which kills path traversal and shell-meta exposure even on a
      maliciously crafted JSON.
    * Skip the iteration entirely when the slug ends up empty.

  Add a security block comment on the job documenting the trust model
  (privileged trigger, scoped permissions, no shared cache, sanitised
  filename input) so future readers don't re-derive it.

Drive-by: SHA-pin three `actions/upload-artifact@v4` references this PR
introduced (one in the mobile workflow, two in the umbrella) to
`@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # 7.0.0`, matching the
repo's existing pinned-action convention. Pre-existing `@v4`
references in `integration-mobile-test-...yml` lines 1684 / 1693 are
out of scope for this PR.

Made-with: Cursor

* QVAC-17830 fix: add shell + security note on Generate combined report step

CodeQL alert 733 (cache-poisoning/poisonable-step) is the same
trust-model concern the new block comment on combine-perf-reports
already documents — privileged checkout of PR-head code is unavoidable
because the introducing PR needs to exercise its own aggregate.js
changes. Inline the security note on the offending step so the CodeQL
fingerprint reflects the explicit acknowledgement, mirroring what
already silenced alert 734 on the per-device HTML step.

Drive-by: add explicit `shell: bash` on the step (was inheriting
from the runner default; explicit is safer + matches the surrounding
steps).

Made-with: Cursor

* QVAC-17830 feat: bridge QVAC_PERF_RUNS overrides into mobile bare runtime

Mobile side of QVAC-18111: now that the umbrella benchmark workflow
pushes qvacPerfConfig.txt onto the device (PR #1840), the bare runtime
needs to actually consume it. Without this, dispatched perf overrides
were silently ignored on Android / iOS and tests defaulted to 1+1.

Changes:
- packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration-runtime.cjs:
  read qvacPerfConfig.txt from the same paths the testFilter loader
  uses (/data/local/tmp on Android, global.testDir on iOS), parse
  KEY=VALUE per line, inject each into bare-os via os.setEnv. Reads
  on first __shouldRunTest call, which fires before runIntegrationModule
  imports any test module — so os.getEnv() inside _image-common.js /
  bitnet.test.js / tool-calling.test.js sees the dispatched value at
  module init. Normalises both literal '\\n' (the bash-escaped form
  the WDIO before-hook produces today) and real '\\r?\\n' so the
  parser doesn't depend on which encoding the workflow happens to use.

- .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml:
  extract the two giant Android / iOS WDIO_CONFIG JS strings out of
  the 'Create and Upload Test Spec' run: block into env: vars. GitHub
  Actions imposes a hard 21 000-char limit on any run: block that
  contains ${{ }} expressions, and the new Android image-perf group
  + iOS heavy7/heavy8 splits + collapsed-format echo line for the test
  split summary pushed us over. Moving the WDIO blobs to env: drops
  the run block from ~21 KB to ~11 KB so we have headroom for further
  additions.

Made-with: Cursor

* QVAC-17830 feat: gate Benchmark Performance (LLM) to perf-emitting tests only

Adds qvac_perf_only input to the desktop and mobile reusable workflows
and turns it on from Benchmark Performance (LLM) so dispatched runs
only execute the perf-emitting tests instead of the full integration
suite.

Desktop: regenerates test/integration/all.js from bitnet, tool-calling
and the three image-* tests, then runs bare against it.

Mobile: filters each group's run*Test list down to perf-emitting
wrappers, skips groups whose intersection is empty (no spec uploaded,
no Device Farm run booked), and tolerates gaps in the RUN_ARN_*
slot sequence when monitoring / downloading logs.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix[ci]: drop PR-head checkout in combine-perf-reports to clear CodeQL 735/736

The `combine-perf-reports` job runs in `pull_request_target` context and
was checking out the PR head's `scripts/perf-report` so it could
aggregate the matrix artifacts. CodeQL flagged this as a
cache-poisoning / poisonable-step issue (alerts 735/736) because we
were executing PR-author code in a privileged context.

Removing the `repository:` / `ref:` overrides on the checkout makes
`actions/checkout` fall back to the base-branch SHA in
`pull_request_target` runs, so `aggregate.js` is now trusted code from
the default branch. `workflow_dispatch` runs are unaffected because
`github.ref` resolves to the dispatched branch.

Existing mitigations stay in place: `contents: read` only, no
`actions/cache`, SHA-pinned actions, and the strict `[A-Za-z0-9-]`
allowlist + 64-char cap on `device.name` before it touches any
filesystem path. SECURITY comments updated to reflect the new posture.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix[ci]: add runGemma4Test + runOcrPaddleTest to iOS lightB

The merge-resolution earlier kept our 3 image perf tests in iOS heavy3/7/8,
displacing main's runGemma4Test/runOcrPaddleTest. The validator in
scripts/generate-mobile-integration-tests.js requires every test to be
assigned on both platforms — these two were missing on iOS.

Placed them in lightB alongside runQwen35Test and runModelLoadingTest
since they're model-loading tests with similar runtime characteristics.

* fix[ci]: bound mobile monitor when AWS API permanently fails

Two complementary fixes for the Device Farm monitor loop in
integration-mobile-test-llm-llamacpp.yml.

1. Match the initial credentials step's role-duration-seconds: 7200 on
   the 'Refresh AWS credentials for monitoring' step. Without it, the
   refresh defaults to the action's 1h fallback while MAX_WAIT_TIME is
   2h, so every devicefarm API call past the 1h mark fails until
   timeout — producing the spinning
     'Run N:  () | AWS API attempt X/3 failed, retrying in 5s...'
   output for ~1h with no escape.

2. Make the poll loop self-bounding under persistent API failure:
   - Initialize per-slot STATUS/RESULT to PENDING/- so the dashboard
     never prints the empty 'Run N:  ()' line.
   - On aws_retry returning empty (or 'None'), preserve last-known-good
     STATUS/RESULT for that slot and increment a per-slot FAIL counter.
   - After 10 consecutive failures (~5 min at the 30s cadence, well past
     any normal STS / Device Farm transient) mark that slot ERRORED /
     NO_API_RESPONSE / DONE so the loop exits cleanly instead of burning
     the full 2h ceiling. Final-results phase reports a real failure.

* mod[notask]: isolate iOS Gemma4 and OcrPaddle into their own Device Farm slots

Per @gianni-cor review on PR #1843, restore Gemma4 and OcrPaddle to
isolated slots instead of folding them into lightB. The image perf
tests keep their existing isolation (heavy3/heavy7/heavy8), so the
iOS heavy bank grows from 8 → 10 (plus lightA/lightB at slots 11/12).

- test-groups.json: add heavy9=runGemma4Test, heavy10=runOcrPaddleTest;
  drop both from lightB
- integration-mobile-test-llm-llamacpp.yml:
  - filter_perf: extract heavy9 + heavy10 wrappers
  - upload spec: add testspec-heavy-9.yml + testspec-heavy-10.yml,
    renumber light specs to test_spec_arn_11 / test_spec_arn_12
  - schedule: 12 maybe_schedule calls for iOS
  - monitor + log download: RUN_ARN_1..12 explicit assignments
  (the seq-based loops are already dynamic via $RUN_COUNT)

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-17830 fix: apply fruit-plate iOS OOM mitigation to high-res aurora

iOS Device Farm runs (iPhone 16 Pro / 17) were hanging the on-PR job
on `runImageHighResAuroraTest`. The 3000x4000 (~12 MP) JPEG pushed
peak memory past the ~3.3 GB Jetsam ceiling once the device was warm
from earlier jobs in the queue, the app was killed via memorystatus,
and Appium never received a clean failure signal — leaving the Device
Farm run in RUNNING (PENDING) until the 60-min job timeout.

When `image.test.js` was split into per-image files in 37afd3d, the
fruit-plate variant correctly received `iosWarmupImage` +
`iosPerfRuns: 1`, but high-res aurora — the heaviest image in the
suite — was left on the standard path. That meant CPU + GPU each ran
1 warmup + N counted with the full 12 MP JPEG, so on a warm iPhone
the cold path crossed the Jetsam line.

Mirror the fruit-plate config:
- `iosWarmupImage: 'elephant.jpg'` pre-warms the multimodal pipeline
  with a 23 KB image so Metal shaders, KV cache, and image-prefill
  buffers are allocated before the 12 MP JPEG arrives. `_image-common.js`
  skips the standard PERF_WARMUP_RUNS loop once the pre-warmup runs.
- `iosPerfRuns: 1` caps iOS counted iterations at 1 regardless of
  QVAC_PERF_RUNS, so the benchmark workflow's n=3 plan does not OOM
  the iPhone.

iOS cold-path footprint is now exactly 2 inferences per backend
(1 small pre-warmup + 1 counted real image), matching fruit-plate's
known-good shape. Desktop and Android paths are unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: gianni-cor <gianfranco.cordella@tether.io>
Proletter added a commit that referenced this pull request May 24, 2026
…a pushFile (#1840)

* QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch

GitHub requires a `workflow_dispatch` workflow to exist on the
default branch before it shows up in the Actions tab and becomes
triggerable with `--ref <feature-branch>`. This lands the LLM
benchmark workflow on `main` so the QVAC-17830 perf-metrics feature
branch can be dispatched against it for end-to-end validation.

Changes:
- `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` (new):
  manual `workflow_dispatch` only — mirrors the structure of the
  existing Parakeet / Whispercpp benchmark workflows. Calls
  `prebuilds-...yml` then `integration-test-...yml` with
  bench-mode iteration counts (`QVAC_PERF_RUNS=3`,
  `QVAC_PERF_WARMUP_RUNS=1` by default), then aggregates desktop
  artifacts into a combined HTML / step-summary. Phase-1 scope is
  desktop only — mobile (Device Farm) needs a build-time hook in
  the test app to thread env vars through to bare and is tracked
  as a QVAC-18111 follow-up.
- `integration-test-qvac-lib-infer-llamacpp-llm.yml`: thread
  `qvac_perf_runs` / `qvac_perf_warmup_runs` through `workflow_call`
  + `workflow_dispatch` and surface them as `QVAC_PERF_RUNS` /
  `QVAC_PERF_WARMUP_RUNS` on the Linux/macOS and Windows test run
  steps. Empty string => unset, so the umbrella PR workflow
  continues to honour the test-side default and PR runs are
  unaffected by this change.

Per the perf policy agreed on Slack (2026-04-30): the umbrella
on-pr workflow runs perf tests at the cheap default so we don't pay
full perf cost on every PR; this dedicated workflow is the only
place we crank up the iteration counts to produce mean ± std
numbers.

Made-with: Cursor

* QVAC-18111 chore[notask]: trim chatty inline comments in benchmark workflow

Made-with: Cursor

* QVAC-18111 chore[notask]: add run_desktop toggle to benchmark workflow_dispatch

Made-with: Cursor

* QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile

Extends the mobile integration workflow with the same iteration-count
inputs as the desktop reusable workflow, and adds a `mobile-benchmarks`
job to the LLM benchmark dispatch so it covers Device Farm too.

The bare runtime on Device Farm doesn't see GitHub Actions env vars,
so we mirror the existing `testFilter.txt` pattern: when the workflow
inputs are non-empty, the WDIO before-hook pushes a `qvacPerfConfig.txt`
to the device (Android: `/data/local/tmp/`, iOS:
`@bundleId:documents/`) with the iteration overrides as KEY=VALUE
lines. The file-reading side on bare lives on the QVAC-17830 perf
branch — without that branch this PR is a no-op (orphan file), so it
is safe to land independently.

Changes:
- `integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml`: add
  `qvac_perf_runs` / `qvac_perf_warmup_runs` to `workflow_call` and
  `workflow_dispatch`; add `__QVAC_PERF_RUNS__` /
  `__QVAC_PERF_WARMUP_RUNS__` placeholders to the Android + iOS
  WDIO config blobs and the corresponding pushFile block in the
  `before` hook; substitute the placeholders in `make_split`.
- `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml`: add a
  `mobile-benchmarks` job calling the mobile workflow with the
  bench-mode iteration counts; have `summarize` `needs:` it; drop
  the "desktop only" caveat in the step-summary blurb.

PR runs are unchanged: empty input ⇒ empty placeholder ⇒ before-hook
skips the perf-config push.

Made-with: Cursor

* QVAC-18111 chore[notask]: add run_mobile toggle to benchmark workflow_dispatch

Made-with: Cursor

---------

Co-authored-by: Proletter <40578159+Proletter@users.noreply.github.com>
Proletter pushed a commit that referenced this pull request May 24, 2026
… & GPU) (#1843)

* QVAC-17830 feat: add VLM perf metrics with multi-run averaging

Adds CPU + GPU VLM performance reporting to the llamacpp-llm addon on
desktop + mobile (iOS / Android Device Farm), mirroring the OCR perf
pipeline.

Perf reporter / aggregator
- scripts/test-utils/performance-reporter.js: extend `vision` addon
  type with backend, platform, prefill, decode, vision_encode,
  image_prefill and status columns.
- scripts/perf-report/utils.js: teach the aggregator the new VLM
  metric labels so markdown/HTML reports render them.

CI wiring
- integration-test-qvac-lib-infer-llamacpp-llm.yml (desktop): add
  aggregate + upload-artifact steps so CPU/GPU perf artifacts are
  produced alongside the existing test run.
- integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml (mobile):
  scan Device Farm logs for [PERF_REPORT_START]...[PERF_REPORT_END]
  markers, run extract-from-log.js + aggregate.js, upload per-platform
  perf artifacts. Mirrors the OCR mobile workflow pattern.

Test refactor (image.test.js)
- Include image name in the perf label ("[elephant] [CPU]" etc.) so
  aggregate.js (which groups by `result.test`) no longer pools
  different images into a single CPU/GPU cell.
- Run 1 warmup + 3 counted inferences per (image x backend) so the
  aggregator produces count=3, mean, std within a single CI run --
  matches OCR's PERF_RUNS=3 convention.
- Reuse the loaded LlmLlamacpp instance across iterations inside a
  single test() block: deliberate divergence from OCR's one-test()-
  per-iteration shape because VLM model load is multi-GB and already
  stresses iOS Device Farm memory.
- Bump per-test timeout to 25 min for the multi-iter perf tests;
  leave TEST_CONSTANTS.timeout unchanged for other tests in the file.
- Keyword assertions run once against the last counted iteration's
  output.

vision_encode_time_ms / image_prefill_time_ms are stubbed as null
pending native RuntimeStats instrumentation (separate follow-up task).

* QVAC-17830 feat: wire Mobile LLM into perf-report.yml weekly aggregator

Mirrors the OCR desktop/mobile split. Adds a "Vision/LLM (Mobile)"
block to the scheduled cron job so Monday 9am UTC runs aggregate
perf artifacts from Mobile Integration Tests (LLM) into a separate
llamacpp-llm-mobile-performance.{md,json,html} report alongside the
existing desktop one. Also adds "Mobile Integration Tests (LLM)" to
the workflow_dispatch workflow_name choice list so it can be
triggered manually.

Without this, the per-run Android/iOS perf artifacts the mobile
workflow now uploads would sit unused by the weekly cross-run
aggregator.

* QVAC-17830 feat: add per-run joint perf reporter to mobile LLM workflow

Mirrors the `combine-reports` job in the OCR mobile workflow. Runs
after the per-platform matrix completes and downloads every artifact
matching `perf-report-llamacpp-llm-*-<run_number>` -- which, because
the desktop (integration-test-*.yml) and mobile (this file) workflows
are both invoked via workflow_call from the umbrella On PR Trigger
(LLM), share the same github.run_number and therefore live together
in the single run's artifact scope. The job then:

- aggregates all desktop + mobile rows into a single combined HTML,
  Markdown and JSON report via scripts/perf-report/aggregate.js;
- patches the `device.name` for desktop artifacts using the artifact
  suffix (e.g. "linux-x64", "macos-arm64") so they render with
  recognisable column headers next to the mobile device names;
- emits a per-device HTML report for each unique device encountered
  (desktop OS/arch variants + each mobile device);
- writes the combined markdown report into $GITHUB_STEP_SUMMARY so
  the PR run page shows one unified perf table covering every
  platform the addon ran on in that run.

This is what the user actually meant by "joint performance reporter"
-- distinct from the weekly/cross-run aggregation in perf-report.yml,
which aggregates historical runs for trend tracking.

* QVAC-17830 fix: preserve mobile perf data on OOM, split image tests per device

Android perf data was never surfacing because the inline fallback
reporter used on mobile was a no-op stub (no record/writeToConsole),
so logcat never saw PERF_REPORT markers. iOS image tests were also
crashing mid-run (Jetsam/memorystatus) before process.on('exit')
could flush, losing whichever runs had already completed.

- Replace the mobile fallback reporter in _image-common.js with a
  functional inline implementation that mirrors OCR's: record(),
  writeReport(), and writeToConsole() with lightweight snapshots
  and logcat-safe chunking.
- Flush incrementally after every record() on mobile so earlier
  iterations survive a SIGKILL later in the same group.
- Split image.test.js into per-image files (elephant, fruit plate,
  high-res aurora) behind a shared _image-common.js helper, and
  wire each into its own Device Farm group (Android: 3 new groups;
  iOS: heavy3=elephant, heavy7=fruit, heavy8=aurora).
- Teach extract-from-log.js to --merge multiple reports per device
  (needed because each physical device now runs three groups) and
  add --device-from-filename as a fallback when logs land in a
  flat layout.
- Update the mobile workflow to schedule 5 Android / 10 iOS runs,
  bucket downloaded artifacts by device, unzip Customer_Artifacts
  inline so bare_console.log is scannable, and invoke
  extract-from-log.js with --merge.

Verified locally with a happy-path sim (3 samples per image×backend
per device) and a Jetsam crash sim (aurora SIGKILLed mid-GPU on
iPhone 16 Pro Max) — 16 of 18 rows still reach the aggregated
report on the crashed device, and the other three groups report
fully.

Made-with: Cursor

* QVAC-17830 fix: preserve mobile perf data under iOS V8 Zone OOM

Fruit plate iOS test crashed inside Builtin_JsonStringify because the
per-test flush stringified the cumulative results array on every
record() call (3+ runs x 2 backends x verbose VLM output over a 10MB
image + Metal compiler memory). Even with logcat chunking, the JSON
build itself exhausted V8's Zone allocator, producing SIGTRAP from
FatalProcessOutOfMemory before any data reached logcat.

Changes in the mobile fallback reporter (_image-common.js):
- Cap stored output text to 400 chars per record so the accumulated
  results array stays small regardless of model verbosity.
- Add delta mode to writeToConsole: emit only the latest row per
  flush, so stringify stays O(1) in iteration count on mobile.
- Stop calling writeReport() on disk per-record; rewriting the full
  JSON every iteration is both expensive and an OOM hazard. One
  final writeReport runs in the exit hook for the on-device artifact.
- Remove the final cumulative writeToConsole in the exit hook: the
  per-test delta emits already carry every row, and emitting a large
  cumulative payload at exit is exactly when we most need the earlier
  deltas to survive in logcat/syslog.

Changes in extract-from-log.js:
- Split extractFromText into extractAllFromText (returns every report
  parsed from the text) + the legacy extractFromText wrapper that
  returns the last one.
- Add opts.all to extractFromFile so --merge gathers every delta
  emit from a log file, not just the largest or last marker pair.
- main() with --merge now pushes all reports per file into the
  device bucket before handing them to mergeDeviceReports, so
  partial data from a pre-crash run still lands in the aggregate.

Verified locally with a delta-emit simulation that mid-test crashes
leave runs 1..N-1 intact in the reconstructed report.

Made-with: Cursor

* QVAC-17830 fix: plug combined perf report gaps (desktop race + artifact collision)

Two pre-existing bugs caused the combined performance table to drop
data on a PR run:

1. `darwin-x64` (and any other slow desktop job) was missing because
   `combine-reports` lived inside the mobile workflow and only waited
   on the mobile matrix. Desktop jobs that finished AFTER mobile were
   silently excluded. Move the job to the umbrella
   (`on-pr-qvac-lib-infer-llamacpp-llm.yml`) as `combine-perf-reports`
   with `needs: [run-integration-tests, run-mobile-integration-tests]`,
   so it blocks on the full fan-out before scanning artifacts.

2. `linux-x64` GPU data (and the ubuntu-22 vs ubuntu-24 arm64 pair)
   was overwritten because every matrix entry uploaded its artifact
   under `perf-report-llamacpp-llm-<platform>-<arch>-<run>`; any two
   entries sharing platform+arch clobbered each other on download.
   Add a unique `label` field to each matrix entry
   (`linux-x64-cpu`, `linux-x64-gpu`, `linux-arm64-u22`,
   `linux-arm64-u24`, ...) and key the upload + device-name patch on
   `matrix.label` instead.

Made-with: Cursor

* QVAC-17830 fix: iOS fruit plate retry + consolidate Android images + flush logs on crash

Three changes scoped to the mobile LLM integration workflow:

* Android: merge the three per-image Device Farm groups back into a single
  groupImagesPerf run. Android does not have the iOS jetsam pressure, so
  running all three image tests in one invocation is both cheaper and
  simpler while still giving the aggregator 9 rows (3 images x 3 iters)
  per device.

* iOS: schedule Heavy7 (fruit plate) a second time as
  Heavy7-ImgFruitPlate-Retry. Both attempts land in the same per-device
  folder on download, and extract-from-log.js --merge dedupes on the
  metric fingerprint, so if either attempt gets past iOS jetsam we have
  perf data. If both succeed we just get more iterations to average
  over, which is strictly better. run_count goes 10 -> 11 and the
  Monitor / Download loops were extended accordingly.

* Crash path (both platforms): route process.exit(1) in checkAppCrash
  and crashMonitor through a new global.flushLogsAndExit helper that
  (a) sleeps 5s so bare stdout + iOS os_log / Android logcat can drain
  and (b) on iOS performs an inline Appium pull_file of
  bare_console.log to $DEVICEFARM_LOG_DIR before exiting. This means
  delta perf emits that hit disk before a jetsam kill now reach the
  aggregator instead of being stuck in the app sandbox.

Made-with: Cursor

* QVAC-17830 fix: inline crash flush-delay, drop duplicated pull helper

Previous commit pushed the `Create and Upload Test Spec` run: block
over GitHub Actions' ~30K template-expression budget, so the umbrella
workflow refused to dispatch with:
    Line 927, Col 14: Exceeded max expression length 21000

Root cause: `global.flushLogsAndExit` was inlined into BOTH Android
and iOS wdio configs. On iOS it duplicated the exact Appium pull_file
logic the `after` hook already has, and on Android it carried dead
iOS-only code — roughly 1.8 kB of repeated JS across the two configs.

Instead:

* Drop `global.flushLogsAndExit` entirely.
* In `checkAppCrash` and `crashMonitor` (both Android + iOS), replace
  `process.exit(1)` with `setTimeout(function(){process.exit(1);},5000)`.
  Same 5s drain window for logcat / os_log / bare stdout, one line of
  JS, zero helper definition.
* iOS `after` hook still pulls `bare_console.log` via Appium on the
  normal path, so delta perf data from successful runs is unchanged.

Net `run:` block size goes from 31,524 → 28,992 chars (below the
previously-working 28,995). Both wdio configs revalidated with
`node --check` and the full workflow YAML round-trips through PyYAML.

Made-with: Cursor

* QVAC-17830 fix: iOS fruit plate warmup + merge linux legs + mobile detail tables

- Add iOS-only tiny-image (elephant.jpg) warmup to fruit plate test so
  Metal shaders / KV cache / image-prefill buffer are allocated with a
  cheap 23 KB image before the 10 MB PNG hits, keeping us under Jetsam's
  per-process memory cap. Opt-in via testCase.iosWarmupImage so desktop
  and Android paths remain byte-for-byte identical.
- Fold linux-x64-cpu / linux-x64-gpu (and linux-arm64-u22 / u24) matrix
  legs into single "linux-x64" / "linux-arm64" columns in the combined
  report's Fix desktop device names step. Previously the summary showed
  four separate columns with "-" GPU rows that looked like missing data
  even though the sibling leg had populated them.
- Preserve categorical metrics (backend, platform, status,
  execution_provider) through aggregateReports and render per-device
  detail tables for every device — mobile included — in the combined
  markdown summary via a new --device-details flag on aggregate.js.
  Mobile devices now get the same twelve-column breakdown (Test, EP,
  Backend, Platform, Total / Prefill / Decode / Vision Enc / Img
  Prefill / TTFT ms, Gen Tokens, Prompt Tokens, TPS, Status) that
  desktop matrix legs emit via writeStepSummary.

Verified end-to-end with the local simulation at /tmp/vlm-combined-sim/
(27/27 assertions pass).

Made-with: Cursor

* QVAC-17830 fix: iOS fruit plate 1-iter override + HTML detail tables

Two fixes after run 24909705757 landed iOS perf data for elephant +
aurora but still lost fruit plate, and the combined HTML report was
missing the row-per-test detail tables that the step-summary markdown
has.

1) iOS-only `iosPerfRuns` override, set to 1 for fruit plate alone.
   With the elephant pre-warmup in place, the fruit-plate test still
   crashed at t+85s (V8 bad_alloc during JSON.stringify) on the 2nd/3rd
   counted iteration, leaving zero rows in the perf artifact for both
   iPhone 16 Pro and iPhone 17. Dropping iOS fruit plate to 1 counted
   iteration cuts the cold-path peak footprint from 4 inferences to 2
   (warmup + 1 counted), which stays under the 3376 MB Jetsam cap in
   local simulation. Desktop + Android still run the full PERF_RUNS=3.

2) Wire `generateDeviceDetailTables` into `generateHtmlReport` so the
   combined HTML report mirrors the markdown step-summary layout —
   one section per device with Backend / Platform / Total Time /
   Prefill / Decode / Vision Enc / Img Prefill / TTFT / Gen Tokens /
   Prompt Tokens / TPS / Status columns. Previously only the markdown
   had those columns, so mobile devices (which can't write to
   GITHUB_STEP_SUMMARY from Device Farm) appeared blank in the HTML
   artifact. The per-device HTML report step also gets the flags so
   each individual HTML has parity. Verified locally against the
   run-24909705757 artifact set: all 9 devices render the full row-
   per-test summary, including Apple iPhone 16 Pro / iPhone 17 with
   backend=metal and platform=ios-arm64.

Made-with: Cursor

* QVAC-17830 fix: warm process iOS heavy7 + dedupe perf legs + drop retry

Five user-requested follow-ups after run 24913460964:

1) iOS heavy7 is now a "warm process" group. test-groups.json puts
   runApiBehaviorTest before runImageFruitPlateTest so V8 / Metal /
   native allocators are pre-warmed by a cheap test before the
   10 MB fruit-plate PNG hits the multimodal pipeline.

2) The iOS Heavy7-ImgFruitPlate-Retry one-shot retry run (RUN_ARN_11)
   is removed from the mobile workflow. Run count goes from 11 → 10;
   monitoring + log-download loop bounds updated; the inline comment
   that justified the retry is replaced with one explaining the new
   warm-process rationale.

3) JS-only crash-path bare_console.log pull on iOS. The existing
   pull_file logic that used to live inline in `after` is now a
   global.flushBareLog helper installed in `before`, called from
   global.checkAppCrash and the setInterval crash monitor (after a
   1.5 s browser.pause to give stdout a chance to flush) BEFORE the
   5 s setTimeout(process.exit). Net effect: crashed runs that
   previously left a 0-byte LOG artifact now have a real
   bare_console.log on disk for triage. No C++ touched.

4) `aggregateReports` now dedupes sibling matrix legs that get folded
   into one device name. Previously linux-x64 / [elephant] [CPU]
   accumulated 6 iterations (3 from linux-x64-cpu + 3 from the GPU
   leg's CPU rows) which surfaced as "Run 4 / 5 / 6" headers in the
   per-test breakdown tables. The new rule: per (device, test,
   run_number), only the FIRST report's rows are kept — subsequent
   sibling-leg reports for the same pair are skipped. Within a
   single report all PERF_RUNS iterations still flow through (so
   counted runs per device stays at 3). The weekly aggregator is
   unaffected because its inputs span different run_numbers.

   Verified in /tmp/vlm-combined-sim/sim.js with synthetic 3-iter
   reports per matrix leg; linux-x64 / [elephant] [CPU] now reports
   `count: 3` and the markdown header reads "Iterations: 3".

5) Per-device detail-table headings use the device name as the
   anchor instead of a repeated "### Performance: <addon>". GitHub
   collapses duplicate heading slugs in step summaries which made
   it impossible to deep-link or scroll to the mobile detail
   tables — they were rendered, just hidden under a single anchor.
   Each device now gets `### <Device> — <addon> (<platform/arch>)`
   so Android / iOS sections appear in the step-summary TOC.

Made-with: Cursor

* QVAC-17830 fix: warm iOS heavy7 with elephant instead of api-behavior

The previous warm-process seed (runApiBehaviorTest) used a text-only
Qwen3-0.6B model, so it never exercised the mmproj load or the Metal
vision shader JIT that fruit-plate then hit cold. iPhone 17 kept
SIGABRT'ing between the LLM context init and the first inference
during exactly those steps.

Swap heavy7 to [runImageElephantTest, runImageFruitPlateTest] so the
small (23 KB) elephant test pre-warms the exact VLM code path —
SmolVLM2 + mmproj + Metal vision graph + KV cache — before the
10 MB fruit-plate PNG lands.

* QVAC-17830 fix: shrink iOS fruit-plate to 2 inferences cold

The previous attempt put runImageElephantTest as a "warm process"
prefix to heavy7. That actually executes the full elephant perf
cycle — 1 warmup + 3 counted on CPU, same on GPU, plus a media-
path test = 9 elephant inferences and 2 SmolVLM2 model loads —
before runImageFruitPlateTest gets to do anything. Net effect on
heavy7 was 12 inferences pre-fruit-plate-row, which regressed
iPhone 16 Pro (had been passing) and iPhone 17 still died after
the inline elephant warmup completed.

New shape:
  - test-groups.json: iOS heavy7 = ["runImageFruitPlateTest"]
    (cold process again — runImageElephantTest stays in heavy3 as
    its own perf slot, untouched).
  - _image-common.js: when iOS pre-warmup (testCase.iosWarmupImage)
    succeeds, set an iosPreWarmupRan flag and skip the standard
    PERF_WARMUP_RUNS loop. The pre-warmup IS the warmup pass.
  - image-fruit-plate.test.js: same iosWarmupImage='elephant.jpg'
    + iosPerfRuns=1 config; comments rewritten to describe the new
    flow.
  - mobile workflow: heavy7 comment block updated to match.

End-to-end iOS shape for fruit-plate is now exactly what was
asked for:
  1. setupMultimodalInference loads SmolVLM2 cold
  2. 1 elephant inference (inline iosWarmupImage, no perf record)
  3. standard warmup loop SKIPPED
  4. 1 fruit-plate inference (counted, perf recorded)
  = 2 inferences total

Desktop and Android paths are completely unchanged — they still
run 1 warmup + PERF_RUNS=3 counted. iOS tests without an
iosWarmupImage (elephant, high-res aurora) keep their existing
1 warmup + iosPerfRuns/PERF_RUNS counted shape.

Made-with: Cursor

* QVAC-17830 feat: scenario grouping, GPU probe, squashed PR summary, perf for bitnet/tool-calling

- extract _perf-helper.js out of _image-common.js so bitnet,
  tool-calling and image tests share one perf-recording path
- add `scenario` field on perf records ('image' / 'bitnet' /
  'tool-calling' / 'default'), plumb through aggregator + reports.
  Per-device detail tables (md + html) now group rows by scenario
  so a single combined report can surface multiple test families
  side by side instead of mixing them into one giant table
- drop the always-'ok' Status column from the vision metric layout
  (just noise; the JS layer no longer hardcodes status: 'ok')
- squash the combined GH step summary to three Mean ± std
  mini-tables (Total Time / TTFT / TPS), grouped by scenario. The
  HTML artifact still carries the full per-iteration breakdown
- split aggregate.js --device-details into --md-device-details and
  --html-device-details (legacy --device-details still toggles both)
- wire bitnet.test.js (1 warmup + 1 counted) and tool-calling.test.js
  (one perf row per model_variant x prompt cell) onto _perf-helper.js
- add desktop GPU probe in detectDevice(): nvidia-smi -L on
  linux/win32, lspci as linux fallback, system_profiler on darwin,
  wmic on win32. New device.gpu field is surfaced in the per-device
  detail-table headings (md + html). Mobile leaves gpu=null —
  device.name (Device Farm label) already implies the chipset
- add scripts/perf-report/sim-llamacpp-llm.js for local report-shape
  verification mirroring the umbrella workflow flags

Made-with: Cursor

* feat: surface per-device detail tables in PR summary with mean ±std cells

QVAC-17830. The squashed Mean ± std mini-tables stay at the top of the
combined PR summary, but the per-device detail tables (Test / EP /
Backend / Platform / Total / Prefill / Decode / Vision Enc / Img
Prefill / TTFT / Gen Tokens / Prompt Tokens / TPS) are now appended
underneath instead of being HTML-only. Numeric cells in the detail
tables render as `mean ±std` so the iteration spread is visible at
every metric, not just the rolled-up ones; token-count columns stay
as bare integers since deterministic generation makes `±0` noise.
The umbrella workflow flips from `--html-device-details` back to
`--device-details` to apply this in both the markdown and HTML outputs.

Made-with: Cursor

* refactor: drop image_prefill_time_ms from perf report

QVAC-17830. Per-prompt image-prefill time turned out to overlap with
the prefill_time_ms / TTFT signal we already report, so the column
was essentially redundant. Vision Enc (ms) — the mmproj / vision-
encoder time — is the only VLM-specific timer worth keeping for now;
its native wiring is tracked separately under
https://app.asana.com/1/45238840754660/project/1212638335655990/task/1214371583877702.

Removed everywhere it surfaced:
- scripts/test-utils/performance-reporter.js (vision METRIC_COLUMNS
  + record() default metrics)
- scripts/perf-report/utils.js (label map + vision detail columns)
- packages/qvac-lib-infer-llamacpp-llm/test/integration/_perf-helper.js
  (mobile inline-fallback default metrics + recordPerformance payload)
- scripts/perf-report/sim-llamacpp-llm.js (synthetic row generators)

Made-with: Cursor

* QVAC-17830 fix: tighten combined perf report layout (column filtering + legend)

The cross-platform mean tables previously emitted ALL device columns
for every scenario block, so e.g. the bitnet block (Android-only by
design — `skip: !isAndroid` in bitnet.test.js) showed 7 empty
desktop/iOS columns of dashes, which made the report look broken
when it was actually intentional.

- Per-scenario × per-metric column filtering: drop columns that have
  zero data in this scenario block. Bitnet collapses from 9 → 2
  columns (Pixel 9 Pro + S25 Ultra). Tool-calling drops `darwin-x64`
  (test is `skip: isDarwinX64`).
- Short legend right under the report title explaining the dashes
  that remain (gated tests like medgemma-4b-it = desktop-only, mobile
  = qwen3-1.7b only; `darwin-x64` / `linux-arm64` have no GPU runner).

Per-device detail tables are unchanged — they were already scenario-
partitioned so the layout fix doesn't apply there.

Made-with: Cursor

* QVAC-18111 feat: env-driven perf iterations + Benchmark Performance (LLM) workflow_dispatch

New perf policy agreed on Slack (2026-04-30, @Olya / @gianfranco):
PR runs default to 1 warmup + 1 counted iteration (cheap, single-
shot signal). The dedicated `Benchmark Performance (LLM)` workflow
is the only place we crank iterations up to produce mean ± std.

Changes:

- `_image-common.js` and `bitnet.test.js`: replace hardcoded
  `PERF_RUNS=3 / PERF_WARMUP_RUNS=1` with `_envInt('QVAC_PERF_RUNS',
  1)` / `_envInt('QVAC_PERF_WARMUP_RUNS', 1)`. Bad / empty / negative
  env values fall back to the safe default of 1+1.
- iOS fruit-plate's `iosPerfRuns: 1` override is preserved — it caps
  iPhone iterations regardless of `QVAC_PERF_RUNS` so the benchmark
  workflow doesn't OOM the Jetsam ceiling on heavy images. Comments
  updated to reflect the new default.
- `integration-test-qvac-lib-infer-llamacpp-llm.yml` accepts new
  `qvac_perf_runs` / `qvac_perf_warmup_runs` workflow_call inputs
  (default empty ⇒ unset env ⇒ test honours PR default of 1+1). The
  umbrella PR workflow doesn't pass these, so the PR critical path
  is unaffected.
- `benchmark-llamacpp-llm.yml` (new): workflow_dispatch only. Calls
  the desktop integration-test reusable with `qvac_perf_runs=3`,
  aggregates a benchmark report, uploads HTML artifact + GitHub
  step summary.

Phase-1 scope: desktop matrix only. Mobile (Android / iOS Device
Farm) needs a build-time env-var injection hook in the test app —
tracked as a QVAC-18111 follow-up.

Made-with: Cursor

* QVAC-17830 fix: tool-calling EP label honours NO_GPU on linux-x64-cpu runner

The `linux-x64-cpu` matrix leg (ubuntu-22.04, no_gpu='true') runs
on a CPU-only runner — llama.cpp falls back to CPU since no Vulkan
device is present — but `tool-calling.test.js` previously hardcoded
`useCpu = isLinuxArm64`, so its perf rows came out tagged `[GPU]`.
That made the combined report show GPU bars on a CPU runner for
tool-calling rows, which read as inconsistent next to the [CPU]
rows from `image` and `bitnet`.

Match the `_image-common.js` pattern: read `NO_GPU` from both
`process.env` and `os.getEnv()`, treat any case-insensitive
"true" as CPU. `useCpu` is now `isLinuxArm64 || noGpu`, so the
`linux-x64-cpu` leg labels its rows `[CPU]` correctly while the
`linux-x64-gpu` leg keeps `[GPU]`.

Made-with: Cursor

* QVAC-17830 fix: terser perf-report legend + full metric breakdown in cross-platform tables

Two follow-ups based on review feedback on the combined report:

- Trim the verbose "intentional gaps" legend down to a single short
  italic line (`> _` `-` ` = not run on this device._`). The
  per-scenario column filter already drops fully-empty columns, so
  the long explanation was no longer pulling its weight.

- Surface the same metrics the per-device detail tables already
  show (Total Time, Prefill, Decode, Vision Encode, TTFT, TPS)
  in the cross-platform mini-tables, instead of just Total/TTFT/TPS.
  Reviewer needs the full breakdown to compare backends, not just
  the headline. `vision_encode_time_ms` is currently null everywhere
  (native timer pending — Asana 1214371583877702) so its mini-table
  is auto-suppressed by the existing `hasAnyData` guard until the
  timer lands; the entry is in the list ahead of time so the table
  appears automatically once data flows.

Made-with: Cursor

* QVAC-17830 fix: use bare-os getEnv() for QVAC_PERF_RUNS / NO_GPU lookups

Bare runtime doesn't define `process` as a global at module-init
time, so referencing `process.env[key]` from the top-level
`_envInt(...)` call in `bitnet.test.js` blew up the entire
integration suite with `ReferenceError: process is not defined`
on first eval.

Switch the env reads to `bare-os`'s `os.getEnv()` (already imported)
as the primary source, and keep a `typeof process !== 'undefined'`-
guarded `process.env` fallback for any Node code paths that happen
to import these files. Apply the same pattern in:

- `_image-common.js` — for both `_envInt` and the existing
  `noGpuEnv` lookup which had the same latent bug.
- `bitnet.test.js` — for `_envInt`.
- `tool-calling.test.js` — for the new `noGpuEnv` lookup added in
  the linux-x64-cpu label fix.

Verified locally: behaves correctly when `process` is undefined,
when `process.env[key]` is unset, and when only one of the two
sources has the value.

Made-with: Cursor

* QVAC-18111 chore: align Benchmark Performance (LLM) workflow with the addon convention

Rename `.github/workflows/benchmark-llamacpp-llm.yml` →
`benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` to match the
existing `benchmark-performance-qvac-lib-infer-<addon>.yml` pattern
on main (Parakeet, Whispercpp, Supertonic TTS, etc.).

Also align the structure: introduce a `context` job that derives
`repository` / `ref` from optional inputs (with fallback to
`github.repository` / `github.ref_name`), wire prebuild + desktop
integration tests through that, and add a `summarize` job whose
artifact follows the convention `<addon>-performance-findings.{md,
html,json}`. The new `qvac_perf_runs` / `qvac_perf_warmup_runs`
inputs (defaulting to 3 / 1) are preserved — this is still the
single place we crank perf iterations up to produce mean ± std
numbers, and the umbrella PR workflow continues to leave them at
the cheap default of 1 + 1.

Workflow_dispatch only — explicitly NOT wired into the PR umbrella,
matching the policy agreed on Slack (2026-04-30, @Olya / @gianfranco).

Made-with: Cursor

* QVAC-17830 fix: address CodeQL security findings on combined-perf-report job

Resolves the three CodeQL alerts surfaced on PR #1843:

- Alert 731 (`scripts/perf-report/extract-from-log.js`, regex injection):
  The `--device-from-filename` flag fed user-supplied input straight into
  `new RegExp(...)`. No workflow on main or in this PR ever passes the
  flag — it was dead code from an earlier mobile artifact-layout
  experiment. Drop the flag, the helper, and the field from `parseArgs`,
  which removes the ReDoS surface entirely.

- Alerts 733 / 734 (`.github/workflows/on-pr-qvac-lib-infer-llamacpp-llm.yml`,
  cache poisoning via execution of untrusted code):
  The new `combine-perf-reports` job inherits the umbrella's
  `pull_request_target` trigger and consumes PR-author-controlled perf
  artifacts. The blast radius is already bounded (`permissions:
  contents: read`, no `actions/cache`, no secrets touched by the
  aggregator), but the per-device HTML step interpolated a
  PR-controlled `device.name` into a filesystem path with only a
  space → dash sanitiser. Harden it:

    * Pass the JSON file path via `process.argv` instead of shell-
      interpolating it into `node -e`, so the inner `JSON.parse` never
      sees shell-meta from the surrounding quoting.
    * Validate `device.name` is actually a string before using it.
    * Replace the `tr ' ' '-'` fallback with a strict `[A-Za-z0-9-]`
      allowlist (anything else collapses to `_`) plus a 64-char cap,
      which kills path traversal and shell-meta exposure even on a
      maliciously crafted JSON.
    * Skip the iteration entirely when the slug ends up empty.

  Add a security block comment on the job documenting the trust model
  (privileged trigger, scoped permissions, no shared cache, sanitised
  filename input) so future readers don't re-derive it.

Drive-by: SHA-pin three `actions/upload-artifact@v4` references this PR
introduced (one in the mobile workflow, two in the umbrella) to
`@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # 7.0.0`, matching the
repo's existing pinned-action convention. Pre-existing `@v4`
references in `integration-mobile-test-...yml` lines 1684 / 1693 are
out of scope for this PR.

Made-with: Cursor

* QVAC-17830 fix: add shell + security note on Generate combined report step

CodeQL alert 733 (cache-poisoning/poisonable-step) is the same
trust-model concern the new block comment on combine-perf-reports
already documents — privileged checkout of PR-head code is unavoidable
because the introducing PR needs to exercise its own aggregate.js
changes. Inline the security note on the offending step so the CodeQL
fingerprint reflects the explicit acknowledgement, mirroring what
already silenced alert 734 on the per-device HTML step.

Drive-by: add explicit `shell: bash` on the step (was inheriting
from the runner default; explicit is safer + matches the surrounding
steps).

Made-with: Cursor

* QVAC-17830 feat: bridge QVAC_PERF_RUNS overrides into mobile bare runtime

Mobile side of QVAC-18111: now that the umbrella benchmark workflow
pushes qvacPerfConfig.txt onto the device (PR #1840), the bare runtime
needs to actually consume it. Without this, dispatched perf overrides
were silently ignored on Android / iOS and tests defaulted to 1+1.

Changes:
- packages/qvac-lib-infer-llamacpp-llm/test/mobile/integration-runtime.cjs:
  read qvacPerfConfig.txt from the same paths the testFilter loader
  uses (/data/local/tmp on Android, global.testDir on iOS), parse
  KEY=VALUE per line, inject each into bare-os via os.setEnv. Reads
  on first __shouldRunTest call, which fires before runIntegrationModule
  imports any test module — so os.getEnv() inside _image-common.js /
  bitnet.test.js / tool-calling.test.js sees the dispatched value at
  module init. Normalises both literal '\\n' (the bash-escaped form
  the WDIO before-hook produces today) and real '\\r?\\n' so the
  parser doesn't depend on which encoding the workflow happens to use.

- .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml:
  extract the two giant Android / iOS WDIO_CONFIG JS strings out of
  the 'Create and Upload Test Spec' run: block into env: vars. GitHub
  Actions imposes a hard 21 000-char limit on any run: block that
  contains ${{ }} expressions, and the new Android image-perf group
  + iOS heavy7/heavy8 splits + collapsed-format echo line for the test
  split summary pushed us over. Moving the WDIO blobs to env: drops
  the run block from ~21 KB to ~11 KB so we have headroom for further
  additions.

Made-with: Cursor

* QVAC-17830 feat: gate Benchmark Performance (LLM) to perf-emitting tests only

Adds qvac_perf_only input to the desktop and mobile reusable workflows
and turns it on from Benchmark Performance (LLM) so dispatched runs
only execute the perf-emitting tests instead of the full integration
suite.

Desktop: regenerates test/integration/all.js from bitnet, tool-calling
and the three image-* tests, then runs bare against it.

Mobile: filters each group's run*Test list down to perf-emitting
wrappers, skips groups whose intersection is empty (no spec uploaded,
no Device Farm run booked), and tolerates gaps in the RUN_ARN_*
slot sequence when monitoring / downloading logs.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix[ci]: drop PR-head checkout in combine-perf-reports to clear CodeQL 735/736

The `combine-perf-reports` job runs in `pull_request_target` context and
was checking out the PR head's `scripts/perf-report` so it could
aggregate the matrix artifacts. CodeQL flagged this as a
cache-poisoning / poisonable-step issue (alerts 735/736) because we
were executing PR-author code in a privileged context.

Removing the `repository:` / `ref:` overrides on the checkout makes
`actions/checkout` fall back to the base-branch SHA in
`pull_request_target` runs, so `aggregate.js` is now trusted code from
the default branch. `workflow_dispatch` runs are unaffected because
`github.ref` resolves to the dispatched branch.

Existing mitigations stay in place: `contents: read` only, no
`actions/cache`, SHA-pinned actions, and the strict `[A-Za-z0-9-]`
allowlist + 64-char cap on `device.name` before it touches any
filesystem path. SECURITY comments updated to reflect the new posture.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix[ci]: add runGemma4Test + runOcrPaddleTest to iOS lightB

The merge-resolution earlier kept our 3 image perf tests in iOS heavy3/7/8,
displacing main's runGemma4Test/runOcrPaddleTest. The validator in
scripts/generate-mobile-integration-tests.js requires every test to be
assigned on both platforms — these two were missing on iOS.

Placed them in lightB alongside runQwen35Test and runModelLoadingTest
since they're model-loading tests with similar runtime characteristics.

* fix[ci]: bound mobile monitor when AWS API permanently fails

Two complementary fixes for the Device Farm monitor loop in
integration-mobile-test-llm-llamacpp.yml.

1. Match the initial credentials step's role-duration-seconds: 7200 on
   the 'Refresh AWS credentials for monitoring' step. Without it, the
   refresh defaults to the action's 1h fallback while MAX_WAIT_TIME is
   2h, so every devicefarm API call past the 1h mark fails until
   timeout — producing the spinning
     'Run N:  () | AWS API attempt X/3 failed, retrying in 5s...'
   output for ~1h with no escape.

2. Make the poll loop self-bounding under persistent API failure:
   - Initialize per-slot STATUS/RESULT to PENDING/- so the dashboard
     never prints the empty 'Run N:  ()' line.
   - On aws_retry returning empty (or 'None'), preserve last-known-good
     STATUS/RESULT for that slot and increment a per-slot FAIL counter.
   - After 10 consecutive failures (~5 min at the 30s cadence, well past
     any normal STS / Device Farm transient) mark that slot ERRORED /
     NO_API_RESPONSE / DONE so the loop exits cleanly instead of burning
     the full 2h ceiling. Final-results phase reports a real failure.

* mod[notask]: isolate iOS Gemma4 and OcrPaddle into their own Device Farm slots

Per @gianni-cor review on PR #1843, restore Gemma4 and OcrPaddle to
isolated slots instead of folding them into lightB. The image perf
tests keep their existing isolation (heavy3/heavy7/heavy8), so the
iOS heavy bank grows from 8 → 10 (plus lightA/lightB at slots 11/12).

- test-groups.json: add heavy9=runGemma4Test, heavy10=runOcrPaddleTest;
  drop both from lightB
- integration-mobile-test-llm-llamacpp.yml:
  - filter_perf: extract heavy9 + heavy10 wrappers
  - upload spec: add testspec-heavy-9.yml + testspec-heavy-10.yml,
    renumber light specs to test_spec_arn_11 / test_spec_arn_12
  - schedule: 12 maybe_schedule calls for iOS
  - monitor + log download: RUN_ARN_1..12 explicit assignments
  (the seq-based loops are already dynamic via $RUN_COUNT)

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-17830 fix: apply fruit-plate iOS OOM mitigation to high-res aurora

iOS Device Farm runs (iPhone 16 Pro / 17) were hanging the on-PR job
on `runImageHighResAuroraTest`. The 3000x4000 (~12 MP) JPEG pushed
peak memory past the ~3.3 GB Jetsam ceiling once the device was warm
from earlier jobs in the queue, the app was killed via memorystatus,
and Appium never received a clean failure signal — leaving the Device
Farm run in RUNNING (PENDING) until the 60-min job timeout.

When `image.test.js` was split into per-image files in 37afd3d, the
fruit-plate variant correctly received `iosWarmupImage` +
`iosPerfRuns: 1`, but high-res aurora — the heaviest image in the
suite — was left on the standard path. That meant CPU + GPU each ran
1 warmup + N counted with the full 12 MP JPEG, so on a warm iPhone
the cold path crossed the Jetsam line.

Mirror the fruit-plate config:
- `iosWarmupImage: 'elephant.jpg'` pre-warms the multimodal pipeline
  with a 23 KB image so Metal shaders, KV cache, and image-prefill
  buffers are allocated before the 12 MP JPEG arrives. `_image-common.js`
  skips the standard PERF_WARMUP_RUNS loop once the pre-warmup runs.
- `iosPerfRuns: 1` caps iOS counted iterations at 1 regardless of
  QVAC_PERF_RUNS, so the benchmark workflow's n=3 plan does not OOM
  the iPhone.

iOS cold-path footprint is now exactly 2 inferences per backend
(1 small pre-warmup + 1 counted real image), matching fruit-plate's
known-good shape. Desktop and Android paths are unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: gianni-cor <gianfranco.cordella@tether.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants