feat[notask]: enhance LLM benchmark report — tokens, addon version, prompt size, GPU, cross-run diff by donriddo · Pull Request #2442 · tetherto/qvac

donriddo · 2026-06-04T10:20:53Z

🎯 What problem does this PR solve?

The benchmark report (render-report.js) was missing context that Gianfranco requested: generated token counts, prompt size, run count per config, which GPU was tested, addon version, and a way to compare results across workflow runs.

Additionally, Ridwan needed a way to re-render the report from a previous run's artifacts without re-triggering the full 6-hour benchmark.

📝 How does it solve it?

`render-report.js`

- Tokens column: adds generated tokens per config to every device table (desktop: metrics.generatedTokens, mobile: metrics.generated_tokens)
- Report header: one-liner showing addon version · prompt size (tokens) · runs per config, extracted automatically from the JSON data
- Addon version: reads from the mobile report's addon field; overridable via the --addon-version CLI flag
- --compare-dir: when provided, renders Δ TTFT / Δ TPS / Δ ppTPS columns against a baseline directory for cross-run regression comparison
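
The Tokens column and header line described above could be derived roughly as follows. This is a minimal sketch, not the code in `render-report.js`: only the field names `metrics.generatedTokens` and `metrics.generated_tokens` come from this description, while the helper names `generatedTokens` and `headerLine` and the shape of their arguments are assumptions made for illustration.

```javascript
// Hypothetical helper: read the generated-token count from either schema.
// Desktop sweep entries use metrics.generatedTokens, mobile entries use
// metrics.generated_tokens (both names taken from the PR description).
function generatedTokens(entry) {
  const metrics = (entry && entry.metrics) || {};
  const value = metrics.generatedTokens ?? metrics.generated_tokens;
  return Number.isFinite(value) ? value : null;
}

// Hypothetical helper: build the one-line report header.
// Parts that are unknown are skipped rather than printed as "undefined".
function headerLine({ addonVersion, promptTokens, runsPerConfig }) {
  const parts = [];
  if (addonVersion) parts.push(addonVersion);
  if (Number.isFinite(promptTokens)) parts.push(`Prompt: ${promptTokens} tokens`);
  if (Number.isFinite(runsPerConfig)) parts.push(`Runs per config: ${runsPerConfig}`);
  return parts.join(' · ');
}
```

Returning `null` for a missing count lets a table renderer decide how to display the gap instead of silently printing `0`.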

`benchmark-perf-llm-llamacpp.yml`

- GPU detection: nvidia-smi step captures the GPU name and passes it as --desktop-device to the renderer
- Addon version: reads from packages/llm-llamacpp/package.json and passes as --addon-version
- summarize_only + artifact_run_number inputs: re-render the report from a prior run's artifacts without running the 6-hour benchmark
- compare_run_number input: downloads baseline artifacts from a prior run and activates the --compare-dir regression view
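
The workflow steps above are YAML and shell, but the data flow into the renderer can be illustrated in JavaScript. The sketch below assumes the GPU name is taken from the first line of `nvidia-smi --query-gpu=name --format=csv,noheader` output and that the version is passed as `name@version` (modelled on the header shown in the testing section). Both are assumptions, as are the helper names `firstGpuName` and `rendererArgs`; the actual workflow may do this differently.

```javascript
// Hypothetical helper: pick the first non-empty line of nvidia-smi output,
// e.g. from `nvidia-smi --query-gpu=name --format=csv,noheader`.
function firstGpuName(nvidiaSmiOutput) {
  const line = String(nvidiaSmiOutput)
    .split(/\r?\n/)
    .map((s) => s.trim())
    .find(Boolean);
  return line || null;
}

// Hypothetical helper: assemble the renderer flags named in this PR.
// packageJsonText is the raw content of packages/llm-llamacpp/package.json.
function rendererArgs({ gpuOutput, packageJsonText, compareDir }) {
  const args = [];
  const gpu = firstGpuName(gpuOutput);
  if (gpu) args.push('--desktop-device', gpu);
  const { name, version } = JSON.parse(packageJsonText);
  // Assumption: the version string is passed as "name@version".
  if (version) args.push('--addon-version', `${name}@${version}`);
  if (compareDir) args.push('--compare-dir', compareDir);
  return args;
}
```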

🧪 How was it tested?

Tested render-report.js locally against real artifacts from run #9 (26917522463):

- Header shows @qvac/llm-llamacpp@0.23.2 · Prompt: 510 tokens · Runs per config: 5
- Tokens column populates correctly (e.g. 646, 1024, 351)
- Comparison mode with the same directory as baseline produces +0.00 deltas, as expected
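
The +0.00 result for a self-comparison follows directly if each Δ column is the signed difference between the current and baseline value. The sketch below assumes an absolute difference rendered with two decimals; whether `render-report.js` uses absolute or relative deltas is not stated here, and the names `formatDelta`, `deltaRow`, and the `ttft` / `tps` / `ppTps` keys are hypothetical.

```javascript
// Hypothetical helper: signed two-decimal delta, current minus baseline.
function formatDelta(current, baseline) {
  if (!Number.isFinite(current) || !Number.isFinite(baseline)) return 'n/a';
  // Round first so tiny negative noise renders as "+0.00", not "-0.00".
  const rounded = Math.round((current - baseline) * 100) / 100;
  const sign = rounded < 0 ? '-' : '+';
  return `${sign}${Math.abs(rounded).toFixed(2)}`;
}

// Hypothetical helper: the three delta cells for one config row.
function deltaRow(current, baseline) {
  return {
    ttft: formatDelta(current.ttft, baseline.ttft),
    tps: formatDelta(current.tps, baseline.tps),
    ppTps: formatDelta(current.ppTps, baseline.ppTps),
  };
}
```

Comparing a directory with itself makes every `current - baseline` equal to zero, so each cell renders as `+0.00`.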

🔌 API Changes

render-report.js gains two new optional CLI flags: --addon-version <str> and --compare-dir <path>. Existing invocations without these flags produce the same output plus the new Tokens column and header line.
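
As an illustration of how the two optional flags could be handled, here is a minimal sketch. The flag names and the override rule (the CLI flag wins over the mobile report's `addon` field) come from this description; the function names `parseFlags` and `resolveAddonVersion` and the hand-rolled parsing are assumptions, not the script's actual code.

```javascript
// Hypothetical parser for the two new optional flags.
// Unknown arguments are ignored so existing invocations keep working.
function parseFlags(argv) {
  const flags = { addonVersion: null, compareDir: null };
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === '--addon-version' && i + 1 < argv.length) {
      flags.addonVersion = argv[++i];
    } else if (argv[i] === '--compare-dir' && i + 1 < argv.length) {
      flags.compareDir = argv[++i];
    }
  }
  return flags;
}

// The explicit flag overrides the addon field of the mobile report.
function resolveAddonVersion(flags, mobileReport) {
  return flags.addonVersion || (mobileReport && mobileReport.addon) || null;
}
```

In a script this would typically be called as `parseFlags(process.argv.slice(2))`.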

Workflow gains three new optional dispatch inputs: summarize_only, artifact_run_number, compare_run_number. When they are omitted, the workflow behaves as before.

… sweep, crash reporting, unified report

30 mobile shards in one reused-workflow call exceed that workflow's 120-min job timeout (Android was already at 119 min with 10 shards). Split the groups into three batches by KV-cache type (10 each — the proven in-budget load) via a max-parallel:1 matrix so each batch runs in isolation with no Device Farm pool contention. Add an optional artifact_suffix input to the mobile workflow (default empty, so other addons' artifact names are unchanged) to keep the three batches' perf-report artifacts from colliding; summarize aggregates all three into the unified report.

Replace the 3-batch KV-cache split with a single 30-group mobile call and add an optional job_timeout_minutes input to the mobile workflow (default 120, so every other caller is unchanged). Benchmark passes 240; observed mobile wall for the full matrix is ~140-160 min, comfortably inside it. One app build instead of three, single run, ~2.5-3h.

The single 30-shard mobile job fails on both platforms: Android serializes 30 runs against its Device Farm pool (>240 min) and the macOS runner fills its disk collecting 30 runs' logs (no space left on device). Split into three KV-cache batches (10 groups each — the proven ~119 min load) run sequentially (max-parallel: 1) to avoid pool contention. Each batch passes job_timeout_minutes 180 for headroom and a distinct artifact_suffix so the three perf-reports don't collide; summarize aggregates all three. Both new mobile-workflow inputs are optional and default to current behaviour, so other callers are unchanged.
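
The batching described in this commit can be pictured as grouping the mobile groups by KV-cache type and attaching the per-batch inputs. The sketch below is illustrative only: the real split lives in the workflow's matrix, and the `kv` property, the suffix format, and the function name `batchByKvCache` are assumptions. Only the input names `artifact_suffix` and `job_timeout_minutes` and the value 180 come from the commit message.

```javascript
// Hypothetical helper: one batch per KV-cache type, in first-seen order.
function batchByKvCache(groups) {
  const byKv = new Map();
  for (const group of groups) {
    if (!byKv.has(group.kv)) byKv.set(group.kv, []);
    byKv.get(group.kv).push(group);
  }
  return [...byKv].map(([kv, members]) => ({
    // Assumed suffix format; it only needs to be distinct per batch.
    artifact_suffix: `-${kv}`,
    job_timeout_minutes: 180,
    groups: members,
  }));
}
```

Running the resulting batches one at a time corresponds to `max-parallel: 1` in the workflow matrix.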

…to best-config

When cache-type-k == cache-type-v, the config label now renders [kv=f16], not [kv=f16/f16], making desktop consistent with mobile. Also add a Lowest TTFT column to the best-config-per-device summary.
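
The label rule in this commit amounts to collapsing the pair when both cache types match. A minimal sketch, with the function name `kvLabel` chosen here for illustration:

```javascript
// Collapse "[kv=K/V]" to "[kv=K]" when the K and V cache types are equal.
function kvLabel(cacheTypeK, cacheTypeV) {
  return cacheTypeK === cacheTypeV
    ? `[kv=${cacheTypeK}]`
    : `[kv=${cacheTypeK}/${cacheTypeV}]`;
}
```

For example, `kvLabel('f16', 'f16')` gives `[kv=f16]`, while a mixed pair such as `kvLabel('f16', 'q8_0')` keeps both types as `[kv=f16/q8_0]`.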

Gianfranco's spec was highest TPS and highest ppTPS only. Lowest TTFT was not requested and is removed.

…, prompt size, run count, GPU, cross-run diff

render-report.js:
- Add Tokens column (generated tokens per config) to all device tables
- Add report header line: addon version, prompt size (tokens), runs per config
- Extract addon version from mobile JSON's `addon` field; accept --addon-version CLI override
- Extract promptTokens and generatedTokens from desktop sweep and mobile schemas
- Add --compare-dir flag: when provided, renders Δ TTFT / Δ TPS / Δ ppTPS columns against a baseline directory (cross-run regression view)

benchmark-perf-llm-llamacpp.yml:
- Add nvidia-smi step to detect GPU name; pass it as --desktop-device to render-report
- Extract addon version from package.json and pass as --addon-version
- Add summarize_only + artifact_run_number inputs: re-render report from a previous run's artifacts without re-running the 6-hour benchmark
- Add compare_run_number input: download baseline artifacts from a prior run and pass --compare-dir to render-report for the Δ regression view
- Guard desktop/mobile jobs with !inputs.summarize_only

donriddo · 2026-06-04T10:35:18Z

Closing — changes moved into PR #2400 (the main benchmark suite PR).

github-actions · 2026-06-04T10:35:46Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ❌ (0/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

donriddo added 10 commits June 3, 2026 16:09

feat: LLM benchmark perf suite (Qwen3.5) — desktop + mobile, KV-cache…

da90058

… sweep, crash reporting, unified report

fix(render-report): revert Lowest TTFT from best-config summary

74e4c76

Gianfranco's spec was highest TPS and highest ppTPS only. Lowest TTFT was not requested and is removed.

feat: add Qwen3-1.7B to mobile benchmark (comparison baseline)

71e2676

tmp: 1.7B-only mobile dispatch (3 batches x 3 groups)

f1c4f77

chore: restore full Qwen3.5 matrix after 1.7B dispatch

511f50a

donriddo requested review from a team as code owners June 4, 2026 10:20

donriddo closed this Jun 4, 2026

donriddo wants to merge 10 commits into tetherto:main from donriddo:tmp-bench-local
