
chore[skiplog]: Qwen3.5 perf benchmark suite (reasoning-budget, ppTPS, desktop + mobile)#2400

Open
donriddo wants to merge 34 commits into tetherto:main from donriddo:feat/benchmark-perf-llm-suite

@donriddo

@donriddo donriddo commented Jun 2, 2026

Contributor

🎯 What problem does this PR solve?

The WB team needs throughput numbers (TTFT, TPS, ppTPS) for Qwen3.5-0.8B and 2B across the quantizations Q4_0, Q4_1, Q4_K_M, Q6_K, Q8_0 and reasoning budgets -1 and 0, on both desktop and mobile (with a KV-cache-type sweep on mobile), plus the ability to catch regressions between addon versions.

📝 How does it solve it?

Coverage

  • Models: Qwen3.5-0.8B + 2B (5 quants each), keep Qwen3-1.7B as a desktop comparison baseline, drop Qwen3-4B. No PyTorch.
  • Reasoning budget -1 and 0; single ~512-token prompt (verified at ~518 templated tokens against the Qwen3.5 tokenizer).
  • Mobile KV-cache types f16, q8_0, q4_0, plus TurboQuant/PolarQuant (tbq3_0/pq3_0, tbq4_0/pq4_0, pq3_0, pq4_0); desktop runs GPU, mobile runs both gpu and cpu.

Report — unified renderer (render-report.js), one identical table per device (desktop + 5 mobile):

  • Columns: TTFT (ms) · TPS · ppTPS · Tokens, each as mean ± stddev across repeats (desktop 5, mobile 3).
  • Header records addon version, prompt size, repeats, and the detected desktop GPU — version + GPU are stamped into the run's artifacts so they're accurate and survive a later re-render.
  • Crashed rows for unsupported combos (e.g. quantized KV cache on Adreno GPUs, or TurboQuant/PolarQuant on iOS Metal and Samsung GPU — run anyway, detected, reported).
  • Best configuration per device (highest TPS, highest ppTPS).
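The "best configuration per device" pick can be sketched as follows. This is a minimal illustration, not the code in render-report.js: the row shape (`config`, `tps`, `ppTps`, `crashed`) is hypothetical and the numbers are invented.

```javascript
// Hypothetical row shape: { config, tps, ppTps, crashed }.
// Crashed rows carry no measurements, so they are never candidates.
function bestBy (rows, metric) {
  let best = null
  for (const row of rows) {
    if (row.crashed || !Number.isFinite(row[metric])) continue
    if (best === null || row[metric] > best[metric]) best = row
  }
  return best
}

// Invented values, labelled in the report's [model] [gpu|cpu] [rb=N] [kv=type] style.
const rows = [
  { config: '2B-Q4_K_M gpu rb=-1 kv=f16', tps: 41, ppTps: 850 },
  { config: '2B-Q4_K_M gpu rb=-1 kv=q8_0', crashed: true },
  { config: '2B-Q4_K_M cpu rb=-1 kv=q8_0', tps: 18, ppTps: 310 }
]

const bestTps = bestBy(rows, 'tps')
const bestPpTps = bestBy(rows, 'ppTps')
console.log(bestTps.config, '|', bestPpTps.config)
```

Returning `null` when every row crashed lets the caller print a note instead of a misleading winner.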

Cross-run comparison (regression detection)

  • summarize_only re-renders a previous run's report in ~1 min, skipping the ~6h benchmarks.
  • compare_run_id adds Δ TTFT / TPS / ppTPS columns vs a baseline run (downloads both runs' artifacts; no re-run needed). The baseline's version is read from its stamp, so the comparison is never mislabelled.
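A rough sketch of how the Δ columns could be derived, assuming rows from both runs are matched on their config label. The function name, row shape, and values are hypothetical, not the renderer's actual code.

```javascript
// Hypothetical row shape: { config, ttft, tps, ppTps }.
// A delta is null when the baseline has no matching row or the metric is missing.
function withDeltas (currentRows, baselineRows) {
  const baseline = new Map(baselineRows.map(r => [r.config, r]))
  return currentRows.map(row => {
    const base = baseline.get(row.config)
    const delta = {}
    for (const metric of ['ttft', 'tps', 'ppTps']) {
      const hasBoth = Boolean(base) &&
        Number.isFinite(base[metric]) && Number.isFinite(row[metric])
      delta[metric] = hasBoth ? row[metric] - base[metric] : null
    }
    return { ...row, delta }
  })
}

// Invented values for illustration only.
const current = [
  { config: '2B-Q4_K_M gpu rb=-1 kv=f16', ttft: 120, tps: 40, ppTps: 800 },
  { config: '0.8B-Q8_0 cpu rb=0 kv=f16', ttft: 300, tps: 20, ppTps: 250 }
]
const baselineRun = [
  { config: '2B-Q4_K_M gpu rb=-1 kv=f16', ttft: 130, tps: 38, ppTps: 790 }
]

const compared = withDeltas(current, baselineRun)
console.log(compared.map(r => r.delta))
```

For TTFT a negative delta is an improvement, while for TPS and ppTPS a positive delta is.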

Mobile execution

  • Sharded one group per (model × KV-cache type) = 70 shards (2 sizes × 5 quants × 7 KV-cache types), run as 7 sequential KV-cache batches to fit the Device Farm per-test ceiling and avoid pool/disk exhaustion. 3 measured repetitions per config.
  • The 70 shard files and the workflow's test_groups are generated from one source of truth (test/integration/_benchmark-matrix.js) and are not committed. CI regenerates them before the Device Farm bundle and hard-fails if any are missing or have drifted from the matrix, so the benchmark can never run against a stale or partial shard set.
  • Deliberately absent from test-groups.json; scheduled only via the workflow's test_groups override.
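The shard arithmetic above (2 sizes × 5 quants × 7 KV-cache types = 70, run as 7 batches) can be illustrated with a small sketch. The shard naming, the helper names, and the reading of `tbq3_0/pq3_0` as key/value order are assumptions, not the contents of test/integration/_benchmark-matrix.js.

```javascript
const SIZES = ['0.8B', '2B']
const QUANTS = ['Q4_0', 'Q4_1', 'Q4_K_M', 'Q6_K', 'Q8_0']
// KV-cache types as (k, v) pairs; treating "a/b" as key/value is an assumption.
const KV_TYPES = [
  { k: 'f16', v: 'f16' },
  { k: 'q8_0', v: 'q8_0' },
  { k: 'q4_0', v: 'q4_0' },
  { k: 'tbq3_0', v: 'pq3_0' },
  { k: 'tbq4_0', v: 'pq4_0' },
  { k: 'pq3_0', v: 'pq3_0' },
  { k: 'pq4_0', v: 'pq4_0' }
]

function kvLabel (kv) {
  return kv.k === kv.v ? kv.k : `${kv.k}/${kv.v}`
}

// One shard per (model x KV-cache type); names are hypothetical.
function buildShards () {
  const shards = []
  for (const kv of KV_TYPES) {
    for (const size of SIZES) {
      for (const quant of QUANTS) {
        shards.push({ name: `${size}-${quant}-kv-${kvLabel(kv)}`, size, quant, kv })
      }
    }
  }
  return shards
}

// Group shards into one sequential batch per KV-cache type.
function batchesByKv (shards) {
  const batches = new Map()
  for (const shard of shards) {
    const label = kvLabel(shard.kv)
    if (!batches.has(label)) batches.set(label, [])
    batches.get(label).push(shard)
  }
  return batches
}

const shards = buildShards()
const batches = batchesByKv(shards)
console.log(shards.length, batches.size) // 70 7
```

Deriving both the shard files and the workflow's test_groups from one list like this is what makes a drift check possible: anything generated can be compared against the same source.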

Workflow inputs (no per-run configurability of the matrix — it's fixed in the scripts):
ref, run_desktop, run_mobile, summarize_only, artifact_run_id, compare_run_id. The shared integration-mobile-test-llm-llamacpp.yml gains two additive optional inputs (job_timeout_minutes default 120, artifact_suffix default empty) — backward-compatible for other addon callers.

🧪 How was it tested?

  • npx standard clean; validate-mobile-tests.js in sync; verify:benchmark-shards confirms the matrix, the generated integration.auto.cjs (shard-file refs and run-function names), and the workflow test_groups are all in lockstep, so a generator change can't silently desync the Device Farm grep.
  • Generation pipeline verified locally: from a fresh checkout (shards absent) test:integration:generate regenerates everything with zero drift in the committed integration.auto.cjs; the mobile-only benchmark shards skip cleanly on desktop.
  • Validated end-to-end across every input combination with real runs (full, desktop-only, mobile-only, re-render, comparison).

💥 Known findings from the runs (data, not code issues)

  • Adreno GPU (Samsung S25/S26) crashes on all gpu + kv=q4_0 and gpu + kv=q8_0 — confirmed and reported as Crashed. CPU path handles quantized KV fine.
  • Mobile thermal throttling: on some mobile configs successive repeats get slower (e.g. ppTPS 850 -> 492 -> 428 across 3 reps), which widens the ± stddev on those rows. This is genuine sustained-load throttling on real devices, not measurement error — the stddev reflects it honestly.
  • Pixel 9 Pro GPU TTFT/ppTPS are notably weaker than the other devices across quants, consistent with a Vulkan/driver characteristic; CPU results are plausible.
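To make the throttling example concrete, here is the mean ± stddev for the three ppTPS repetitions quoted above. This is a sketch that assumes the sample standard deviation (n - 1 divisor); which form the renderer actually uses is not stated in this PR.

```javascript
// Sample standard deviation (n - 1 divisor); this choice is an assumption.
function meanStddev (values) {
  const n = values.length
  const mean = values.reduce((sum, v) => sum + v, 0) / n
  if (n < 2) return { mean, stddev: 0 }
  const sumSq = values.reduce((sum, v) => sum + (v - mean) ** 2, 0)
  return { mean, stddev: Math.sqrt(sumSq / (n - 1)) }
}

// The three ppTPS repetitions from the throttling finding.
const { mean, stddev } = meanStddev([850, 492, 428])
console.log(`${mean.toFixed(0)} ± ${stddev.toFixed(0)}`) // 590 ± 227
```

Under that assumption the spread is roughly 39% of the mean, which is why throttled rows show a wide ± value even though each individual measurement is valid.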

📦 Notes

  • Benchmark/test infrastructure only — no addon index.js/native or public-API change, so no version bump or CHANGELOG entry ([skiplog]).
  • Pairs with #2382 (workflow infra, already merged).

@gianni-cor

gianni-cor commented Jun 2, 2026

Contributor

just for mobile, can you run the bench on both CPU and GPU?

@github-actions

github-actions Bot commented Jun 2, 2026

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

@donriddo

donriddo commented Jun 2, 2026

Contributor Author

> just for mobile, can you run the bench on both CPU and GPU?

Already does. mobile.config.json sets "devices": ["gpu", "cpu"] and benchmark-perf.test.js loops over both, so each model and quant runs on CPU and GPU.

@donriddo

Contributor Author

> Would it be possible, in addition to the raw table results in the .md file for the consolidated report, to produce a rendered .html file with graphs that compare different categories (e.g. an avg/stddev bar graph comparing KV cache types while aggregating the other matrix parameters/devices, then another graph comparing weight quantization while aggregating the other parameters, etc.)? That would be easier to inspect quickly.
>
> Maybe, to choose a meaningful configuration, show the graphs per device, or with the other parameters kept at their default values. E.g. a graph showing how tokens per second vary with KV cache type while all other parameters stay at their defaults for the S25, etc.
>
> The point I want to make is that a compact graphical representation of the results would be easy for us to review quickly, rather than having to compare hundreds of raw entries in the .md table.

You have a point. Let me see how easy this would be to do. If not, it can come as a follow-up.

donriddo added 4 commits June 10, 2026 16:54
Three report-output improvements from review:
- A legend explaining the [model] [gpu|cpu] [rb=N] [kv=type] config labels,
  including that rb is the reasoning budget (-1 leaves the reasoning channel on,
  0 disables it).
- A coverage warning when zero mobile devices reported, so a run whose mobile
  data was lost is not mistaken for a complete desktop-only report.
- A per-device note when a baseline comparison has no matching device, instead
  of silently rendering every delta as a bare dash.

Placeholders were emitted per device at the top of each device iteration, so a
hard native crash during the first device's pass (the Adreno failure mode this
mechanism guards against) left the second device's combos with no rows at all,
and the shard-level coverage check could not see the loss. Emit every combo's
placeholder for both devices up front so a crash on one device still leaves rows
for the other.

- Fail the summary when compare_run_id is set but no baseline artifacts were
  found (wrong id, or artifacts expired past the 90-day retention), so a
  requested regression comparison cannot silently render a delta-less report.
- Pass artifact_run_id, compare_run_id, the addon version and the repo through
  env instead of interpolating dispatch inputs directly into run blocks,
  removing the shell-injection vector from user-settable inputs.
- Add timeout-minutes to the verify-shards and stamp-version jobs so they fail
  fast instead of inheriting the 360-minute default.

Extends the mobile matrix from 3 to 7 KV-cache types: the TurboQuant schemes
tbq3_0/pq3_0 and tbq4_0/pq4_0 and pure PolarQuant pq3_0 and pq4_0
(2 sizes x 5 quants x 7 KV-cache types = 70 shards). KV-cache types are now
(k, v) pairs so the asymmetric TurboQuant configs, where the key type differs
from the value type, can be expressed; the existing f16/q8_0/q4_0 entries and
their shard names are unchanged.

TurboQuant/PolarQuant ship Vulkan + CPU kernels only, so they are rejected on
Metal (iOS) and unsupported on some GPUs (e.g. Samsung). Those combos are
reported as Crashed, which the up-front placeholders render cleanly.
@donriddo

This comment was marked as resolved.


donriddo added 3 commits June 10, 2026 18:20
…-llm-suite

# Conflicts:
#	.github/workflows/integration-mobile-test-llm-llamacpp.yml

A comparison requested via compare_run_id renders delta columns against a
baseline run. When the baseline produced no benchmark rows (e.g. only its
run-meta/desktop-meta metadata artifacts were downloaded), the comparison was
silently empty: the report rendered with no deltas and the job went green even
though the requested comparison was never produced. render-report.js now exits
non-zero when compareDir is set but the baseline has zero rows. This is distinct
from a baseline that has rows but none matching the current devices, which still
renders a per-device note.
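The zero-row baseline check described above might look roughly like this. It is a sketch under assumptions: `baselineError` and `run` are hypothetical names, and only `compareDir` comes from the commit message.

```javascript
// Returns an error message when a requested comparison cannot be produced,
// or null when rendering may proceed.
function baselineError (compareDir, baselineRows) {
  if (!compareDir) return null // no comparison requested
  if (baselineRows.length === 0) {
    return `comparison requested (${compareDir}) but the baseline has zero benchmark rows`
  }
  // Rows exist; a device mismatch is handled later with a per-device note.
  return null
}

// Hypothetical entry point: mark the process as failed instead of rendering
// a report without deltas.
function run (compareDir, baselineRows) {
  const error = baselineError(compareDir, baselineRows)
  if (error) {
    console.error(error)
    process.exitCode = 1
    return false
  }
  return true
}

console.log(run(null, [])) // true: nothing was requested, so nothing can be missing
```

Setting `process.exitCode` rather than calling `process.exit()` lets pending output flush before the job step fails.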
The grid is 2 x 5 x 7 after the TurboQuant/PolarQuant expansion, not 2 x 5 x 3.
maxim-smotrov
maxim-smotrov previously approved these changes Jun 10, 2026
jesusmb1995
jesusmb1995 previously approved these changes Jun 11, 2026

@jesusmb1995 jesusmb1995 left a comment

Contributor


Only the possible .html report is missing; maybe it can be done in a follow-up.

donriddo added 3 commits June 11, 2026 11:12
The consolidated report is now over a thousand rows, which is hard to scan.
render-report.js gains two visual outputs:
- A Charts section embedded in the markdown as Mermaid xychart bars, so a device
  throughput ranking and the KV-cache / quantization comparison for the fastest
  device render inline in the GitHub step summary.
- A --html output that writes a self-contained file (inline SVG, no deps or CDN)
  with the full per-device grouped charts and stddev error bars.

The summarize job emits both; the markdown points viewers to the HTML file
(uploaded with the report artifacts) for the full per-device view.

Each mobile shard loads the model once per backend (gpu, cpu) and sweeps both
reasoning-budget values on it. The warm-up was inside that reasoning-budget
loop, so every backend warmed up twice. But the warm-up only primes the GPU
kernels/caches for the loaded model, which the reasoning budget (a per-call
generation param) does not change, so the second warm-up was pure overhead
(~47s gpu / ~23s cpu per shard, discarded). Warm up once per backend; the three
measured repetitions and their mean/stddev for TTFT, TPS and ppTPS are unchanged.
jesusmb1995
jesusmb1995 previously approved these changes Jun 11, 2026
jpgaribotti
jpgaribotti previously approved these changes Jun 11, 2026

The Stamp desktop device step interpolated the nvidia-smi GPU name directly
into the printf inside its run block. Route it through a GPU_NAME env var so
the value reaches the shell as data rather than as expanded workflow syntax,
matching the env-mapping already used for the dispatch inputs elsewhere in
this workflow. Keeps the no-interpolation-into-run-blocks invariant uniform
across every step.
maxim-smotrov
maxim-smotrov previously approved these changes Jun 11, 2026
@donriddo

Contributor Author

/review

donriddo added 6 commits June 12, 2026 12:47
The mobile chart helpers averaged a metric over every row sharing a
(device, category) key, so a single bar blended both backends (gpu and
cpu), both model sizes and both reasoning budgets — a value no real
configuration produced — and its stddev whisker was the spread across
those blended configs, not the measured 3-rep noise.

Charts now hold every axis but the one on the x-axis at a fixed value
(size 2B, reasoning budget -1, and the non-varied categorical at its
default: weights Q4_K_M for KV-cache charts, KV f16 for the quantization
chart), so each bar is one measured configuration and its whisker is that
config's own 3-rep stddev. gpu and cpu are charted separately and never
blended, with a shared y-scale per metric. The inline mermaid is reduced
to one device-ranking chart at a single stated config. Crashed configs
remain missing bars rather than zeros, and the download note now names the
real artifact (qwen35-benchmark-findings) and the file inside it.
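The "one bar is one measured configuration" rule for the KV-cache chart can be sketched like this. The row shape, function name, device label, and numbers are hypothetical; only the fixed values (size 2B, reasoning budget -1, weights Q4_K_M) come from the commit message.

```javascript
// Every axis that is not on the x-axis is pinned to one value.
const FIXED = { size: '2B', rb: -1, quant: 'Q4_K_M' }

// Hypothetical row shape:
// { device, backend, size, rb, quant, kv, tps: { mean, stddev }, crashed }
function kvChartBars (rows, device, backend) {
  return rows
    .filter(r => r.device === device && r.backend === backend)
    .filter(r => r.size === FIXED.size && r.rb === FIXED.rb && r.quant === FIXED.quant)
    .filter(r => !r.crashed) // crashed configs stay missing rather than zero
    .map(r => ({ x: r.kv, y: r.tps.mean, whisker: r.tps.stddev }))
}

// Invented rows for illustration only.
const chartRows = [
  { device: 'S25', backend: 'gpu', size: '2B', rb: -1, quant: 'Q4_K_M', kv: 'f16', tps: { mean: 40, stddev: 2 } },
  { device: 'S25', backend: 'gpu', size: '2B', rb: -1, quant: 'Q4_K_M', kv: 'q8_0', crashed: true },
  { device: 'S25', backend: 'cpu', size: '2B', rb: -1, quant: 'Q4_K_M', kv: 'f16', tps: { mean: 18, stddev: 1 } },
  { device: 'S25', backend: 'gpu', size: '0.8B', rb: 0, quant: 'Q8_0', kv: 'f16', tps: { mean: 70, stddev: 3 } }
]

const gpuBars = kvChartBars(chartRows, 'S25', 'gpu')
console.log(gpuBars)
```

Because no averaging happens across rows, each whisker is the stddev already measured for that single configuration.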
Coverage compared the reported shards against the renderer's CURRENT
matrix, so re-rendering an older run after the matrix grew showed it as
falsely incomplete: a complete 30-shard run read 30/70 against today's
70-shard matrix.

The stamp-version job now records the run's expected shard list into
run-meta.json alongside the addon version, and coverage scores against
that stamped list when present, falling back to the live matrix only for
runs that predate the stamp. A re-render of a stamped run is therefore
always scored against the matrix it actually targeted, while genuinely
missing shards are still surfaced.
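A sketch of scoring coverage against the stamped shard list, with the live matrix as fallback. The `expectedShards` field name and the shard names are assumptions; the commit only says the list is recorded in run-meta.json.

```javascript
// Prefer the shard list stamped into the run's metadata; fall back to the
// live matrix for runs that predate the stamp.
function coverage (reportedShards, runMeta, liveMatrix) {
  const expected = (runMeta && Array.isArray(runMeta.expectedShards))
    ? runMeta.expectedShards
    : liveMatrix
  const reported = new Set(reportedShards)
  const missing = expected.filter(name => !reported.has(name))
  return { found: expected.length - missing.length, expected: expected.length, missing }
}

// Hypothetical shard names: an older 30-shard run against today's 70-shard matrix.
const oldShards = Array.from({ length: 30 }, (_, i) => `shard-${i}`)
const liveMatrix = Array.from({ length: 70 }, (_, i) => `shard-${i}`)

const stamped = coverage(oldShards, { expectedShards: oldShards }, liveMatrix)
const unstamped = coverage(oldShards, {}, liveMatrix)
console.log(`${stamped.found}/${stamped.expected}`, `${unstamped.found}/${unstamped.expected}`)
// 30/30 30/70
```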
The report's chart note told readers to open qwen35-benchmark-charts.html
but gave no link, so they had to scroll to the run's artifacts section and
download it by hand.

The renderer now takes an optional --charts-url and, when given, renders the
artifact mention as a markdown link. The summarize job uploads the report
first so the artifact's download URL is known, then substitutes that URL into
the note before writing the run summary (falling back to the run page URL if
the upload yields none). Local renders pass no URL and keep the plain text,
so there is never a dangling link.
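The optional link can be sketched as below. The note wording and the function name are hypothetical; the artifact and file names are the ones mentioned in this PR.

```javascript
const ARTIFACT = 'qwen35-benchmark-findings'
const CHARTS_FILE = 'qwen35-benchmark-charts.html'

// With a URL the artifact mention becomes a markdown link; without one it
// stays plain text, so a local render never emits a dangling link.
function chartsNote (chartsUrl) {
  const artifact = chartsUrl ? `[${ARTIFACT}](${chartsUrl})` : ARTIFACT
  return `Full per-device charts: open ${CHARTS_FILE} in the ${artifact} artifact.`
}

console.log(chartsNote())
console.log(chartsNote('https://example.com/artifact'))
```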
…ebuilds

The desktop sweep ran on the GitHub-hosted GPU runner and built the addon
from source, using disk-cleanup hacks (docker prune, rm -rf /opt/...) meant
for ephemeral runners — destructive on a shared persistent runner.

Move it to the self-hosted qvac-ubuntu2204-x64-gpu runner the integration
tests use, and download the linux-x64 binary the prebuild job already produces
instead of compiling on the runner. This adds the Manual Workspace Cleanup
self-hosted runners need and drops the source build, the destructive disk
cleanup, and the LLVM/Vulkan/vcpkg setup. The prebuild job now also runs for
desktop-only dispatches so the binary is available to download.

Labels

NLP llm and embed verified Authorize secrets / label-gate in PR workflows


5 participants