chore[skiplog]: Qwen3.5 perf benchmark suite (reasoning-budget, ppTPS, desktop + mobile) by donriddo · Pull Request #2400 · tetherto/qvac

donriddo · 2026-06-02T16:13:53Z

🎯 What problem does this PR solve?

The WB team needs throughput numbers (TTFT, TPS, ppTPS) for Qwen3.5-0.8B and 2B across the quantizations Q4_0, Q4_1, Q4_K_M, Q6_K and Q8_0 and reasoning budgets -1 and 0, on both desktop and mobile (with the KV-cache type also varied on mobile), plus the ability to catch regressions between addon versions.

📝 How does it solve it?

Coverage

Models: Qwen3.5-0.8B + 2B (5 quants each), keep Qwen3-1.7B as a desktop comparison baseline, drop Qwen3-4B. No PyTorch.
Reasoning budget -1 and 0; single ~512-token prompt (verified at ~518 templated tokens against the Qwen3.5 tokenizer).
Mobile KV-cache types f16, q8_0, q4_0, plus TurboQuant/PolarQuant (tbq3_0/pq3_0, tbq4_0/pq4_0, pq3_0, pq4_0); desktop runs GPU, mobile runs both gpu and cpu.

Report — unified renderer (render-report.js), one identical table per device (desktop + 5 mobile):

Columns: TTFT (ms) · TPS · ppTPS · Tokens, each as mean ± stddev across repeats (desktop 5, mobile 3).
Header records addon version, prompt size, repeats, and the detected desktop GPU — version + GPU are stamped into the run's artifacts so they're accurate and survive a later re-render.
Crashed rows for unsupported combos (e.g. quantized KV cache on Adreno GPUs, or TurboQuant/PolarQuant on iOS Metal and Samsung GPU — run anyway, detected, reported).
Best configuration per device (highest TPS, highest ppTPS).
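
The best-configuration pick could look roughly like the sketch below. The row shape (`config`, `crashed`, `tpsMean`, `ppTpsMean`), the function names and the sample values are assumptions for illustration, not render-report.js's actual code or measured results.

```javascript
// Hypothetical row shape: { config, crashed, tpsMean, ppTpsMean }.
// Field and function names are illustrative, not render-report.js's own.
function bestBy (rows, metric) {
  let best = null
  for (const row of rows) {
    // Crashed rows carry no measurements, so they can never be "best".
    if (row.crashed || !Number.isFinite(row[metric])) continue
    if (best === null || row[metric] > best[metric]) best = row
  }
  return best
}

function bestConfigs (rows) {
  return {
    highestTps: bestBy(rows, 'tpsMean'),
    highestPpTps: bestBy(rows, 'ppTpsMean')
  }
}

// Made-up values, only to exercise the helpers.
const rows = [
  { config: 'Qwen3.5-0.8B-Q4_0 gpu rb=0 kv=f16', crashed: false, tpsMean: 40, ppTpsMean: 900 },
  { config: 'Qwen3.5-0.8B-Q4_0 cpu rb=0 kv=f16', crashed: false, tpsMean: 55, ppTpsMean: 300 },
  { config: 'Qwen3.5-0.8B-Q4_0 gpu rb=0 kv=q8_0', crashed: true }
]
const best = bestConfigs(rows)
console.log(best.highestTps.config, '|', best.highestPpTps.config)
```

Skipping Crashed rows explicitly keeps a placeholder row from ever being selected, and returning `null` when nothing qualifies lets the caller decide how to render a device with no usable data.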

Cross-run comparison (regression detection)

summarize_only re-renders a previous run's report in ~1 min, skipping the ~6h benchmarks.
compare_run_id adds Δ TTFT / TPS / ppTPS columns vs a baseline run (downloads both runs' artifacts; no re-run needed). The baseline's version is read from its stamp, so the comparison is never mislabelled.
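
A minimal sketch of how the Δ columns might be derived, assuming each row carries per-metric means and is matched to the baseline by device and config label. The row shape and all function names here are hypothetical.

```javascript
// Hypothetical row shape: { device, config, ttft, tps, ppTps } (means only).
const METRICS = ['ttft', 'tps', 'ppTps']

function keyOf (row) {
  return `${row.device}|${row.config}`
}

// Returns the current rows with a `delta` object per row. A metric's delta is
// null when the baseline has no matching row or either value is missing.
function addDeltas (currentRows, baselineRows) {
  const baseline = new Map(baselineRows.map(row => [keyOf(row), row]))
  return currentRows.map(row => {
    const base = baseline.get(keyOf(row))
    const delta = {}
    for (const metric of METRICS) {
      const hasBoth = base !== undefined &&
        Number.isFinite(row[metric]) && Number.isFinite(base[metric])
      delta[metric] = hasBoth ? row[metric] - base[metric] : null
    }
    return { ...row, delta }
  })
}

// Devices in the current run with no row at all in the baseline.
function devicesWithoutBaseline (currentRows, baselineRows) {
  const seen = new Set(baselineRows.map(row => row.device))
  const current = [...new Set(currentRows.map(row => row.device))]
  return current.filter(device => !seen.has(device))
}
```

Keeping "no baseline" as `null` rather than `0` avoids presenting a missing comparison as "no change".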

Mobile execution

Sharded one group per (model × KV-cache type) = 70 shards (2 sizes × 5 quants × 7 KV-cache types), run as 7 sequential KV-cache batches to fit the Device Farm per-test ceiling and avoid pool/disk exhaustion. 3 measured repetitions per config.
The 70 shard files and the workflow's test_groups are generated from one source of truth (test/integration/_benchmark-matrix.js) and are not committed. CI regenerates them before the Device Farm bundle and hard-fails if any are missing or have drifted from the matrix, so the benchmark can never run against a stale or partial shard set.
Deliberately absent from test-groups.json; scheduled only via the workflow's test_groups override.
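
For orientation, the 2 × 5 × 7 enumeration and the drift check can be sketched as follows. This is an assumed stand-in, not the contents of test/integration/_benchmark-matrix.js: the shard naming scheme, the helper names, and which side of "tbq3_0/pq3_0" is the key type are all guesses made for illustration.

```javascript
// Assumed stand-in for the matrix module; names and shard naming are hypothetical.
const SIZES = ['0.8B', '2B']
const QUANTS = ['Q4_0', 'Q4_1', 'Q4_K_M', 'Q6_K', 'Q8_0']
// KV-cache types as (k, v) pairs; the k/v order of the asymmetric entries
// is an assumption.
const KV_TYPES = [
  { k: 'f16', v: 'f16' },
  { k: 'q8_0', v: 'q8_0' },
  { k: 'q4_0', v: 'q4_0' },
  { k: 'tbq3_0', v: 'pq3_0' },
  { k: 'tbq4_0', v: 'pq4_0' },
  { k: 'pq3_0', v: 'pq3_0' },
  { k: 'pq4_0', v: 'pq4_0' }
]

function kvName ({ k, v }) {
  return k === v ? k : `${k}-${v}`
}

// One shard per (size, quant, KV-cache type), grouped by KV type so each
// KV-cache batch is a contiguous slice.
function expectedShards () {
  const shards = []
  for (const kv of KV_TYPES) {
    for (const size of SIZES) {
      for (const quant of QUANTS) {
        shards.push(`qwen3.5-${size}-${quant}-kv-${kvName(kv)}`)
      }
    }
  }
  return shards
}

// Drift check: anything expected but absent, or present but not expected.
function findDrift (expected, present) {
  const e = new Set(expected)
  const p = new Set(present)
  return {
    missing: expected.filter(name => !p.has(name)),
    stale: present.filter(name => !e.has(name))
  }
}

console.log(expectedShards().length) // 70
```

A CI step built on this idea would fail when either `missing` or `stale` is non-empty, which is the "missing or drifted" condition described above.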

Workflow inputs (no per-run configurability of the matrix — it's fixed in the scripts):
ref, run_desktop, run_mobile, summarize_only, artifact_run_id, compare_run_id. The shared integration-mobile-test-llm-llamacpp.yml gains two additive optional inputs (job_timeout_minutes default 120, artifact_suffix default empty) — backward-compatible for other addon callers.

🧪 How was it tested?

npx standard clean; validate-mobile-tests.js in sync; verify:benchmark-shards confirms the matrix, the generated integration.auto.cjs (shard-file refs and run-function names), and the workflow test_groups are all in lockstep, so a generator change can't silently desync the Device Farm grep.
Generation pipeline verified locally: from a fresh checkout (shards absent) test:integration:generate regenerates everything with zero drift in the committed integration.auto.cjs; the mobile-only benchmark shards skip cleanly on desktop.
Validated end-to-end across every input combination with real runs (full, desktop-only, mobile-only, re-render, comparison).
- Mobile run on the 3-KV-cache matrix (pre-TurboQuant, 30 shards) — all 6 batches green on 5 real devices: https://github.com/tetherto/qvac/actions/runs/27231785449 — 594 data rows at mobile=3, mean ± stddev, tokens, best-config per device; 0 rows with anomalous variance; 83 Crashed rows all on the Adreno quantized-KV path (78/83 are gpu + quantized).
- Full 70-shard run including the TurboQuant/PolarQuant KV-cache types and the Pixel 10 device is in progress: https://github.com/tetherto/qvac/actions/runs/27292978920 — the TurboQuant batches are expected to show Crashed on iOS (Metal) and Samsung GPU, since those kernels are Vulkan + CPU only.
- Desktop sweep (Desktop (Tesla T4), desktop=5) validated in the earlier full run: https://github.com/tetherto/qvac/actions/runs/27183153829.
- Per-batch wall-clock ran 54–114 min under the 180-min cap; the 7 KV-cache batches run sequentially.

💥 Known findings from the runs (data, not code issues)

Adreno GPU (Samsung S25/S26) crashes on all gpu + kv=q4_0 and gpu + kv=q8_0 — confirmed and reported as Crashed. CPU path handles quantized KV fine.
Mobile thermal throttling: on some mobile configs successive repeats get slower (e.g. ppTPS 850 -> 492 -> 428 across 3 reps), which widens the ± stddev on those rows. This is genuine sustained-load throttling on real devices, not measurement error — the stddev reflects it honestly.
Pixel 9 Pro GPU TTFT/ppTPS are notably weaker than the other devices across quants, consistent with a Vulkan/driver characteristic; CPU results are plausible.
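
To make the throttling example concrete, here is the mean ± stddev arithmetic for the three ppTPS repeats quoted above (850, 492, 428). The sketch assumes the sample standard deviation (dividing by n - 1); whether the renderer divides by n - 1 or n is not stated in this PR.

```javascript
function mean (xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length
}

// Sample standard deviation (divides by n - 1). Assumed; the renderer may
// use the population form instead.
function sampleStddev (xs) {
  if (xs.length < 2) return 0
  const m = mean(xs)
  const squares = xs.reduce((sum, x) => sum + (x - m) ** 2, 0)
  return Math.sqrt(squares / (xs.length - 1))
}

const reps = [850, 492, 428] // ppTPS of three successive repeats
const m = mean(reps)
const sd = sampleStddev(reps)
console.log(`${m.toFixed(0)} ± ${sd.toFixed(0)}`) // prints "590 ± 227"
```

Under that assumption the spread is roughly 39% of the mean, which is why throttled rows stand out in the table.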

📦 Notes

Benchmark/test infrastructure only — no addon index.js/native or public-API change, so no version bump or CHANGELOG entry ([skiplog]).
Pairs with #2382 (workflow infra, already merged).

gianni-cor · 2026-06-02T17:33:11Z

just for mobile, can you run the bench on both CPU and GPU?

github-actions · 2026-06-02T17:33:39Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

donriddo · 2026-06-02T18:26:06Z

> just for mobile, can you run the bench on both CPU and GPU?

Already does. mobile.config.json sets "devices": ["gpu", "cpu"] and benchmark-perf.test.js loops over both, so each model and quant runs on CPU and GPU.

donriddo · 2026-06-10T15:45:57Z

> Would it be possible, in addition to the raw table results in the .md file for the consolidated report, to produce a rendered .html file with graphs comparing different categories (e.g. an avg/stddev bar graph comparing KV-cache types while aggregating the other matrix parameters/devices, then another graph comparing weight quantization while aggregating the other parameters, etc.)? That would be easier to inspect quickly.
>
> Maybe, to choose a meaningful configuration, show the graphs per device, or with the other parameters kept at their default values. E.g. a graph showing how tokens per second vary with the KV-cache type while all other parameters stay at their defaults for the S25, etc.
>
> The point I want to make is that a compact graphical representation of the results would be easy for us to review quickly, rather than having to compare hundreds of raw entries in the .md table.

You have a point. Let me see how easy this would be to do. If not, it can come as a follow-up.

Three report-output improvements from review:

- A legend explaining the [model] [gpu|cpu] [rb=N] [kv=type] config labels, including that rb is the reasoning budget (-1 leaves the reasoning channel on, 0 disables it).
- A coverage warning when zero mobile devices reported, so a run whose mobile data was lost is not mistaken for a complete desktop-only report.
- A per-device note when a baseline comparison has no matching device, instead of silently rendering every delta as a bare dash.

Placeholders were emitted per device at the top of each device iteration, so a hard native crash during the first device's pass (the Adreno failure mode this mechanism guards against) left the second device's combos with no rows at all, and the shard-level coverage check could not see the loss. Emit every combo's placeholder for both devices up front so a crash on one device still leaves rows for the other.
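
The up-front placeholder idea can be sketched like this. The `sink` (a Map standing in for whatever persists rows), the `measure` callback and the sample shard are hypothetical, and a thrown error stands in for a hard native crash, which in reality would take down the whole process.

```javascript
// Label layout follows the report legend: [model] [gpu|cpu] [rb=N] [kv=type].
function label (model, device, rb, kv) {
  return `${model} ${device} rb=${rb} kv=${kv}`
}

function runShard ({ model, kv, devices, budgets }, measure, sink) {
  // 1. Emit a Crashed placeholder for every combo on every device up front.
  for (const device of devices) {
    for (const rb of budgets) {
      sink.set(label(model, device, rb, kv), { status: 'Crashed' })
    }
  }
  // 2. Overwrite placeholders with real results as each combo completes.
  for (const device of devices) {
    for (const rb of budgets) {
      const result = measure(device, rb)
      sink.set(label(model, device, rb, kv), { status: 'ok', ...result })
    }
  }
}

const sink = new Map()
const shard = { model: 'Qwen3.5-2B-Q4_0', kv: 'q8_0', devices: ['gpu', 'cpu'], budgets: [-1, 0] }
try {
  runShard(shard, (device) => {
    if (device === 'gpu') throw new Error('simulated native crash')
    return { tps: 1 }
  }, sink)
} catch (err) {
  // The real process would be gone here; the rows in `sink` are what survives.
}
console.log(sink.size) // 4: the cpu combos still have rows although gpu crashed first
```

With per-device placeholder emission, the same simulated crash would leave only the two gpu rows, which is the loss the commit describes.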

- Fail the summary when compare_run_id is set but no baseline artifacts were found (wrong id, or artifacts expired past the 90-day retention), so a requested regression comparison cannot silently render a delta-less report.
- Pass artifact_run_id, compare_run_id, the addon version and the repo through env instead of interpolating dispatch inputs directly into run blocks, removing the shell-injection vector from user-settable inputs.
- Add timeout-minutes to the verify-shards and stamp-version jobs so they fail fast instead of inheriting the 360-minute default.

Extends the mobile matrix from 3 to 7 KV-cache types: the TurboQuant schemes tbq3_0/pq3_0 and tbq4_0/pq4_0 and pure PolarQuant pq3_0 and pq4_0 (2 sizes x 5 quants x 7 KV-cache types = 70 shards). KV-cache types are now (k, v) pairs so the asymmetric TurboQuant configs, where the key type differs from the value type, can be expressed; the existing f16/q8_0/q4_0 entries and their shard names are unchanged. TurboQuant/PolarQuant ship Vulkan + CPU kernels only, so they are rejected on Metal (iOS) and unsupported on some GPUs (e.g. Samsung). Those combos are reported as Crashed, which the up-front placeholders render cleanly.


A comparison requested via compare_run_id renders delta columns against a baseline run. When the baseline produced no benchmark rows (e.g. only its run-meta/desktop-meta metadata artifacts were downloaded), the comparison was silently empty: the report rendered with no deltas and the job went green even though the requested comparison was never produced. render-report.js now exits non-zero when compareDir is set but the baseline has zero rows. This is distinct from a baseline that has rows but none matching the current devices, which still renders a per-device note.
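
The two cases described here (empty baseline versus a baseline with no matching devices) might be separated along these lines. Only `compareDir` is named in the commit message; the function, its argument shape and the message wording are assumptions.

```javascript
// Hypothetical validation step. Returns the exit code the renderer should use
// plus any notes to print; the caller would pass a non-zero code to process.exit.
function checkBaseline ({ compareDir, baselineRows, currentDevices }) {
  if (!compareDir) return { exitCode: 0, notes: [] } // no comparison requested
  if (baselineRows.length === 0) {
    return {
      exitCode: 1,
      notes: [`comparison requested (${compareDir}) but the baseline has zero benchmark rows`]
    }
  }
  // Baseline has rows: missing devices only produce a per-device note.
  const baselineDevices = new Set(baselineRows.map(row => row.device))
  const notes = currentDevices
    .filter(device => !baselineDevices.has(device))
    .map(device => `${device}: no matching device in the baseline run`)
  return { exitCode: 0, notes }
}

const empty = checkBaseline({ compareDir: 'baseline/', baselineRows: [], currentDevices: ['S25'] })
console.log(empty.exitCode, empty.notes)
```

Returning the exit code instead of exiting inside the check keeps the rule testable without terminating the process.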

The grid is 2 x 5 x 7 after the TurboQuant/PolarQuant expansion, not 2 x 5 x 3.

jesusmb1995

Only the possible .html report is missing; maybe it can be done in a follow-up.

The consolidated report is now over a thousand rows, which is hard to scan. render-report.js gains two visual outputs:

- A Charts section embedded in the markdown as Mermaid xychart bars, so a device throughput ranking and the KV-cache / quantization comparison for the fastest device render inline in the GitHub step summary.
- A --html output that writes a self-contained file (inline SVG, no deps or CDN) with the full per-device grouped charts and stddev error bars.

The summarize job emits both; the markdown points viewers to the HTML file (uploaded with the report artifacts) for the full per-device view.

Each mobile shard loads the model once per backend (gpu, cpu) and sweeps both reasoning-budget values on it. The warm-up was inside that reasoning-budget loop, so every backend warmed up twice. But the warm-up only primes the GPU kernels/caches for the loaded model, which the reasoning budget (a per-call generation param) does not change, so the second warm-up was pure overhead (~47s gpu / ~23s cpu per shard, discarded). Warm up once per backend; the three measured repetitions and their mean/stddev for TTFT, TPS and ppTPS are unchanged.
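
The resulting loop structure, as a sketch: the three callbacks (`loadModel`, `warmUp`, `measure`) are hypothetical stand-ins for the addon calls and are synchronous here for brevity, whereas the real test is presumably asynchronous.

```javascript
// All three callbacks are hypothetical stand-ins for the addon calls.
function runBackendSweep ({ backends, budgets, reps }, { loadModel, warmUp, measure }) {
  const results = []
  for (const backend of backends) {
    const model = loadModel(backend) // one load per backend
    warmUp(model) // one warm-up per backend, outside the budget loop
    for (const rb of budgets) {
      for (let rep = 0; rep < reps; rep++) {
        results.push({ backend, rb, rep, ...measure(model, rb) })
      }
    }
  }
  return results
}

const calls = { loadModel: 0, warmUp: 0, measure: 0 }
const results = runBackendSweep(
  { backends: ['gpu', 'cpu'], budgets: [-1, 0], reps: 3 },
  {
    loadModel: backend => { calls.loadModel++; return { backend } },
    warmUp: () => { calls.warmUp++ },
    measure: () => { calls.measure++; return {} }
  }
)
console.log(calls) // { loadModel: 2, warmUp: 2, measure: 12 }
```

The measured call count stays at 2 backends × 2 budgets × 3 reps = 12; only the warm-up count drops from 4 to 2.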


The Stamp desktop device step interpolated the nvidia-smi GPU name directly into the printf inside its run block. Route it through a GPU_NAME env var so the value reaches the shell as data rather than as expanded workflow syntax, matching the env-mapping already used for the dispatch inputs elsewhere in this workflow. Keeps the no-interpolation-into-run-blocks invariant uniform across every step.


donriddo · 2026-06-12T10:28:29Z

/review

The mobile chart helpers averaged a metric over every row sharing a (device, category) key, so a single bar blended both backends (gpu and cpu), both model sizes and both reasoning budgets — a value no real configuration produced — and its stddev whisker was the spread across those blended configs, not the measured 3-rep noise. Charts now hold every axis but the one on the x-axis at a fixed value (size 2B, reasoning budget -1, and the non-varied categorical at its default: weights Q4_K_M for KV-cache charts, KV f16 for the quantization chart), so each bar is one measured configuration and its whisker is that config's own 3-rep stddev. gpu and cpu are charted separately and never blended, with a shared y-scale per metric. The inline mermaid is reduced to one device-ranking chart at a single stated config. Crashed configs remain missing bars rather than zeros, and the download note now names the real artifact (qwen35-benchmark-findings) and the file inside it.
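
A rough sketch of the "hold every other axis fixed" selection. The fixed values come from the commit message above; the row shape (`size`, `rb`, `quant`, `kv`, and a `{ mean, stddev }` object per metric) and the function name are assumptions.

```javascript
// Fixed values taken from the commit message; row shape is hypothetical.
const FIXED = { size: '2B', rb: -1, quant: 'Q4_K_M', kv: 'f16' }

// `axis` is the one field allowed to vary on the x-axis: 'kv' or 'quant'.
function barsFor (rows, { device, backend, axis, metric }) {
  return rows
    .filter(row => row.device === device && row.backend === backend)
    .filter(row => !row.crashed) // crashed configs stay missing, not zero
    .filter(row => Object.keys(FIXED).every(
      field => field === axis || row[field] === FIXED[field]
    ))
    .map(row => ({
      x: row[axis],
      y: row[metric].mean,
      whisker: row[metric].stddev // this config's own per-rep stddev
    }))
}
```

Because each bar maps to exactly one row, nothing is averaged across backends, sizes or reasoning budgets, and gpu and cpu charts come from separate calls.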

Coverage compared the reported shards against the renderer's CURRENT matrix, so re-rendering an older run after the matrix grew showed it as falsely incomplete: a complete 30-shard run read 30/70 against today's 70-shard matrix. The stamp-version job now records the run's expected shard list into run-meta.json alongside the addon version, and coverage scores against that stamped list when present, falling back to the live matrix only for runs that predate the stamp. A re-render of a stamped run is therefore always scored against the matrix it actually targeted, while genuinely missing shards are still surfaced.
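
The stamped-list fallback could be expressed as below. `expectedShards` as the field name inside run-meta.json is an assumption, as are the function name and return shape.

```javascript
// `runMeta` is the parsed run-meta.json (or null for runs without one).
// The `expectedShards` field name is hypothetical.
function scoreCoverage (reportedShards, runMeta, liveMatrix) {
  const stamped = Boolean(runMeta) && Array.isArray(runMeta.expectedShards)
  const expected = stamped ? runMeta.expectedShards : liveMatrix
  const reported = new Set(reportedShards)
  const missing = expected.filter(name => !reported.has(name))
  return {
    source: stamped ? 'stamp' : 'live-matrix',
    covered: expected.length - missing.length,
    total: expected.length,
    missing
  }
}
```

For a complete 30-shard run, a stamped list scores 30/30, while the same run without a stamp is still scored against the live 70-shard matrix and reads 30/70.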

The report's chart note told readers to open qwen35-benchmark-charts.html but gave no link, so they had to scroll to the run's artifacts section and download it by hand. The renderer now takes an optional --charts-url and, when given, renders the artifact mention as a markdown link. The summarize job uploads the report first so the artifact's download URL is known, then substitutes that URL into the note before writing the run summary (falling back to the run page URL if the upload yields none). Local renders pass no URL and keep the plain text, so there is never a dangling link.
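
A small sketch of the link-or-plain-text behaviour. Only the --charts-url flag and the file name come from the commit message; the helper names, the note wording and the `--charts-url <url>` (space-separated) argument form are assumptions.

```javascript
// Hypothetical helper: renders the artifact mention as a markdown link only
// when a URL is available, otherwise as plain text.
function chartsNote (chartsUrl) {
  const file = 'qwen35-benchmark-charts.html'
  const target = chartsUrl ? `[${file}](${chartsUrl})` : file
  return `Full per-device charts: ${target}`
}

// Minimal argv lookup for `--charts-url <url>`; returns undefined if absent.
function readChartsUrl (argv) {
  const i = argv.indexOf('--charts-url')
  return i !== -1 && i + 1 < argv.length ? argv[i + 1] : undefined
}

console.log(chartsNote(readChartsUrl(process.argv.slice(2))))
```

A local render that passes no flag gets the plain file name, so the note never contains a dangling link.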


…ebuilds The desktop sweep ran on the GitHub-hosted GPU runner and built the addon from source, using disk-cleanup hacks (docker prune, rm -rf /opt/...) meant for ephemeral runners — destructive on a shared persistent runner. Move it to the self-hosted qvac-ubuntu2204-x64-gpu runner the integration tests use, and download the linux-x64 binary the prebuild job already produces instead of compiling on the runner. This adds the Manual Workspace Cleanup self-hosted runners need and drops the source build, the destructive disk cleanup, and the LLVM/Vulkan/vcpkg setup. The prebuild job now also runs for desktop-only dispatches so the binary is available to download.


donriddo requested review from a team as code owners June 2, 2026 16:13

donriddo mentioned this pull request Jun 2, 2026

infra: add benchmark-perf-llm-llamacpp workflow #2382

Merged

donriddo temporarily deployed to release June 2, 2026 19:51 — with GitHub Actions Inactive

donriddo added 4 commits June 10, 2026 16:54


donriddo added 3 commits June 10, 2026 18:20

Merge remote-tracking branch 'upstream/main' into feat/benchmark-perf…

e6e6911

…-llm-suite # Conflicts: # .github/workflows/integration-mobile-test-llm-llamacpp.yml

docs: correct the matrix dimensions in the generator comment

7463be1

The grid is 2 x 5 x 7 after the TurboQuant/PolarQuant expansion, not 2 x 5 x 3.

maxim-smotrov previously approved these changes Jun 10, 2026


jesusmb1995 previously approved these changes Jun 11, 2026


donriddo added 3 commits June 11, 2026 11:12

Merge remote-tracking branch 'upstream/main' into feat/benchmark-perf…

8b28a7a

…-llm-suite

jesusmb1995 previously approved these changes Jun 11, 2026


jpgaribotti previously approved these changes Jun 11, 2026


maxim-smotrov previously approved these changes Jun 11, 2026


Merge remote-tracking branch 'upstream/main' into feat/benchmark-perf…

50a0bd2

…-llm-suite

donriddo added 6 commits June 12, 2026 12:47

Merge remote-tracking branch 'upstream/main' into feat/benchmark-perf…

5fd6400

…-llm-suite

Merge remote-tracking branch 'upstream/main' into feat/benchmark-perf…

72a6091

…-llm-suite

donriddo wants to merge 34 commits into tetherto:main from donriddo:feat/benchmark-perf-llm-suite


5 participants
