chore[skiplog]: Qwen3.5 perf benchmark suite (reasoning-budget, ppTPS, desktop + mobile)#2400
chore[skiplog]: Qwen3.5 perf benchmark suite (reasoning-budget, ppTPS, desktop + mobile)#2400donriddo wants to merge 34 commits into
Conversation
|
just for mobile, can you run the bench on both CPU and GPU? |
Tier-based Approval Status |
Already does. |
You have a point. Let me see how easy this would be to do. If not, it can come as a follow up. |
Three report-output improvements from review: - A legend explaining the [model] [gpu|cpu] [rb=N] [kv=type] config labels, including that rb is the reasoning budget (-1 leaves the reasoning channel on, 0 disables it). - A coverage warning when zero mobile devices reported, so a run whose mobile data was lost is not mistaken for a complete desktop-only report. - A per-device note when a baseline comparison has no matching device, instead of silently rendering every delta as a bare dash.
Placeholders were emitted per device at the top of each device iteration, so a hard native crash during the first device's pass (the Adreno failure mode this mechanism guards against) left the second device's combos with no rows at all, and the shard-level coverage check could not see the loss. Emit every combo's placeholder for both devices up front so a crash on one device still leaves rows for the other.
- Fail the summary when compare_run_id is set but no baseline artifacts were found (wrong id, or artifacts expired past the 90-day retention), so a requested regression comparison cannot silently render a delta-less report. - Pass artifact_run_id, compare_run_id, the addon version and the repo through env instead of interpolating dispatch inputs directly into run blocks, removing the shell-injection vector from user-settable inputs. - Add timeout-minutes to the verify-shards and stamp-version jobs so they fail fast instead of inheriting the 360-minute default.
Extends the mobile matrix from 3 to 7 KV-cache types: the TurboQuant schemes tbq3_0/pq3_0 and tbq4_0/pq4_0 and pure PolarQuant pq3_0 and pq4_0 (2 sizes x 5 quants x 7 KV-cache types = 70 shards). KV-cache types are now (k, v) pairs so the asymmetric TurboQuant configs, where the key type differs from the value type, can be expressed; the existing f16/q8_0/q4_0 entries and their shard names are unchanged. TurboQuant/PolarQuant ship Vulkan + CPU kernels only, so they are rejected on Metal (iOS) and unsupported on some GPUs (e.g. Samsung). Those combos are reported as Crashed, which the up-front placeholders render cleanly.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
…-llm-suite # Conflicts: # .github/workflows/integration-mobile-test-llm-llamacpp.yml
A comparison requested via compare_run_id renders delta columns against a baseline run. When the baseline produced no benchmark rows (e.g. only its run-meta/desktop-meta metadata artifacts were downloaded), the comparison was silently empty: the report rendered with no deltas and the job went green even though the requested comparison was never produced. render-report.js now exits non-zero when compareDir is set but the baseline has zero rows. This is distinct from a baseline that has rows but none matching the current devices, which still renders a per-device note.
The grid is 2 x 5 x 7 after the TurboQuant/PolarQuant expansion, not 2 x 5 x 3.
jesusmb1995
left a comment
There was a problem hiding this comment.
Only possible .html report missing, maybe can be done in follow up.
The consolidated report is now over a thousand rows, which is hard to scan. render-report.js gains two visual outputs: - A Charts section embedded in the markdown as Mermaid xychart bars, so a device throughput ranking and the KV-cache / quantization comparison for the fastest device render inline in the GitHub step summary. - A --html output that writes a self-contained file (inline SVG, no deps or CDN) with the full per-device grouped charts and stddev error bars. The summarize job emits both; the markdown points viewers to the HTML file (uploaded with the report artifacts) for the full per-device view.
Each mobile shard loads the model once per backend (gpu, cpu) and sweeps both reasoning-budget values on it. The warm-up was inside that reasoning-budget loop, so every backend warmed up twice. But the warm-up only primes the GPU kernels/caches for the loaded model, which the reasoning budget (a per-call generation param) does not change, so the second warm-up was pure overhead (~47s gpu / ~23s cpu per shard, discarded). Warm up once per backend; the three measured repetitions and their mean/stddev for TTFT, TPS and ppTPS are unchanged.
The Stamp desktop device step interpolated the nvidia-smi GPU name directly into the printf inside its run block. Route it through a GPU_NAME env var so the value reaches the shell as data rather than as expanded workflow syntax, matching the env-mapping already used for the dispatch inputs elsewhere in this workflow. Keeps the no-interpolation-into-run-blocks invariant uniform across every step.
|
/review |
The mobile chart helpers averaged a metric over every row sharing a (device, category) key, so a single bar blended both backends (gpu and cpu), both model sizes and both reasoning budgets — a value no real configuration produced — and its stddev whisker was the spread across those blended configs, not the measured 3-rep noise. Charts now hold every axis but the one on the x-axis at a fixed value (size 2B, reasoning budget -1, and the non-varied categorical at its default: weights Q4_K_M for KV-cache charts, KV f16 for the quantization chart), so each bar is one measured configuration and its whisker is that config's own 3-rep stddev. gpu and cpu are charted separately and never blended, with a shared y-scale per metric. The inline mermaid is reduced to one device-ranking chart at a single stated config. Crashed configs remain missing bars rather than zeros, and the download note now names the real artifact (qwen35-benchmark-findings) and the file inside it.
Coverage compared the reported shards against the renderer's CURRENT matrix, so re-rendering an older run after the matrix grew showed it as falsely incomplete: a complete 30-shard run read 30/70 against today's 70-shard matrix. The stamp-version job now records the run's expected shard list into run-meta.json alongside the addon version, and coverage scores against that stamped list when present, falling back to the live matrix only for runs that predate the stamp. A re-render of a stamped run is therefore always scored against the matrix it actually targeted, while genuinely missing shards are still surfaced.
The report's chart note told readers to open qwen35-benchmark-charts.html but gave no link, so they had to scroll to the run's artifacts section and download it by hand. The renderer now takes an optional --charts-url and, when given, renders the artifact mention as a markdown link. The summarize job uploads the report first so the artifact's download URL is known, then substitutes that URL into the note before writing the run summary (falling back to the run page URL if the upload yields none). Local renders pass no URL and keep the plain text, so there is never a dangling link.
…ebuilds The desktop sweep ran on the GitHub-hosted GPU runner and built the addon from source, using disk-cleanup hacks (docker prune, rm -rf /opt/...) meant for ephemeral runners — destructive on a shared persistent runner. Move it to the self-hosted qvac-ubuntu2204-x64-gpu runner the integration tests use, and download the linux-x64 binary the prebuild job already produces instead of compiling on the runner. This adds the Manual Workspace Cleanup self-hosted runners need and drops the source build, the destructive disk cleanup, and the LLVM/Vulkan/vcpkg setup. The prebuild job now also runs for desktop-only dispatches so the binary is available to download.
🎯 What problem does this PR solve?
The WB team needs throughput numbers (TTFT, TPS, ppTPS) for Qwen3.5-0.8B and 2B across quantizations Q4_0, Q4_1, Q4_K_M, Q6_K, Q8_0 and reasoning-budget -1/0, on both desktop and mobile including KV-cache types on mobile — plus the ability to catch regressions between addon versions.
📝 How does it solve it?
Coverage
Report — unified renderer (
render-report.js), one identical table per device (desktop + 5 mobile):Crashedrows for unsupported combos (e.g. quantized KV cache on Adreno GPUs, or TurboQuant/PolarQuant on iOS Metal and Samsung GPU — run anyway, detected, reported).Cross-run comparison (regression detection)
summarize_onlyre-renders a previous run's report in ~1 min, skipping the ~6h benchmarks.compare_run_idadds Δ TTFT / TPS / ppTPS columns vs a baseline run (downloads both runs' artifacts; no re-run needed). The baseline's version is read from its stamp, so the comparison is never mislabelled.Mobile execution
test_groupsare generated from one source of truth (test/integration/_benchmark-matrix.js) and are not committed. CI regenerates them before the Device Farm bundle and hard-fails if any are missing or have drifted from the matrix, so the benchmark can never run against a stale or partial shard set.test-groups.json; scheduled only via the workflow'stest_groupsoverride.Workflow inputs (no per-run configurability of the matrix — it's fixed in the scripts):
ref,run_desktop,run_mobile,summarize_only,artifact_run_id,compare_run_id. The sharedintegration-mobile-test-llm-llamacpp.ymlgains two additive optional inputs (job_timeout_minutesdefault 120,artifact_suffixdefault empty) — backward-compatible for other addon callers.🧪 How was it tested?
npx standardclean;validate-mobile-tests.jsin sync;verify:benchmark-shardsconfirms the matrix, the generatedintegration.auto.cjs(shard-file refs and run-function names), and the workflowtest_groupsare all in lockstep, so a generator change can't silently desync the Device Farm grep.test:integration:generateregenerates everything with zero drift in the committedintegration.auto.cjs; the mobile-only benchmark shards skip cleanly on desktop.mobile=3, mean ± stddev, tokens, best-config per device; 0 rows with anomalous variance; 83Crashedrows all on the Adreno quantized-KV path (78/83 are gpu + quantized).Crashedon iOS (Metal) and Samsung GPU, since those kernels are Vulkan + CPU only.Desktop (Tesla T4),desktop=5) validated in the earlier full run: https://github.com/tetherto/qvac/actions/runs/27183153829.💥 Known findings from the runs (data, not code issues)
gpu + kv=q4_0andgpu + kv=q8_0— confirmed and reported asCrashed. CPU path handles quantized KV fine.± stddevon those rows. This is genuine sustained-load throttling on real devices, not measurement error — the stddev reflects it honestly.📦 Notes
index.js/native or public-API change, so no version bump or CHANGELOG entry ([skiplog]).#2382(workflow infra, already merged).