feat[notask]: enhance LLM benchmark report — tokens, addon version, prompt size, GPU, cross-run diff #2442
Closed
donriddo wants to merge 10 commits into
… sweep, crash reporting, unified report
30 mobile shards in one reused-workflow call exceed that workflow's 120-min job timeout (Android was already at 119 min with 10 shards). Split the groups into three batches by KV-cache type (10 each — the proven in-budget load) via a max-parallel:1 matrix so each batch runs in isolation with no Device Farm pool contention. Add an optional artifact_suffix input to the mobile workflow (default empty, so other addons' artifact names are unchanged) to keep the three batches' perf-report artifacts from colliding; summarize aggregates all three into the unified report.
Replace the 3-batch KV-cache split with a single 30-group mobile call and add an optional job_timeout_minutes input to the mobile workflow (default 120, so every other caller is unchanged). Benchmark passes 240; observed mobile wall for the full matrix is ~140-160 min, comfortably inside it. One app build instead of three, single run, ~2.5-3h.
The single 30-shard mobile job fails on both platforms: Android serializes 30 runs against its Device Farm pool (>240 min) and the macOS runner fills its disk collecting 30 runs' logs (no space left on device). Split into three KV-cache batches (10 groups each — the proven ~119 min load) run sequentially (max-parallel: 1) to avoid pool contention. Each batch passes job_timeout_minutes 180 for headroom and a distinct artifact_suffix so the three perf-reports don't collide; summarize aggregates all three. Both new mobile-workflow inputs are optional and default to current behaviour, so other callers are unchanged.
…to best-config

When cache-type-k == cache-type-v the config label now renders [kv=f16] not [kv=f16/f16], making desktop consistent with mobile. Also add a Lowest TTFT column to the best-config-per-device summary.
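The label rule described in this commit can be sketched as a small helper. This is an illustration only: the function name `kvLabel` is hypothetical, and the mixed-type format is assumed to keep the old `k/v` form.

```javascript
// Hypothetical helper; the real render-report.js may structure this differently.
function kvLabel(cacheTypeK, cacheTypeV) {
  // Identical K and V cache types collapse to a single value.
  if (cacheTypeK === cacheTypeV) {
    return `[kv=${cacheTypeK}]`;
  }
  // Assumption: mixed types keep the K/V form used by the old label.
  return `[kv=${cacheTypeK}/${cacheTypeV}]`;
}

console.log(kvLabel("f16", "f16"));  // [kv=f16]
console.log(kvLabel("f16", "q8_0")); // [kv=f16/q8_0]
```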
Gianfranco's spec was highest TPS and highest ppTPS only. Lowest TTFT was not requested and is removed.
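A best-config-per-device summary limited to highest TPS and highest ppTPS might look roughly like the following. The row shape (`device`, `config`, `tps`, `ppTps`) and the function name are assumptions for illustration, and the sample numbers are made up, not benchmark results.

```javascript
// Hypothetical row shape: { device, config, tps, ppTps }.
function bestConfigPerDevice(rows) {
  const best = new Map();
  for (const row of rows) {
    // The first row seen for a device seeds both "best" slots.
    const entry = best.get(row.device) ?? { highestTps: row, highestPpTps: row };
    if (row.tps > entry.highestTps.tps) entry.highestTps = row;
    if (row.ppTps > entry.highestPpTps.ppTps) entry.highestPpTps = row;
    best.set(row.device, entry);
  }
  return best;
}

// Made-up sample data, for illustration only.
const sampleRows = [
  { device: "device-a", config: "[kv=f16]", tps: 20, ppTps: 300 },
  { device: "device-a", config: "[kv=q8_0]", tps: 22, ppTps: 280 },
];
const summary = bestConfigPerDevice(sampleRows).get("device-a");
console.log(summary.highestTps.config, summary.highestPpTps.config);
```

Tracking the two winners separately matters because the configuration with the highest generation TPS is not necessarily the one with the highest prompt-processing TPS.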
…, prompt size, run count, GPU, cross-run diff

render-report.js:
- Add Tokens column (generated tokens per config) to all device tables
- Add report header line: addon version, prompt size (tokens), runs per config
- Extract addon version from mobile JSON's `addon` field; accept --addon-version CLI override
- Extract promptTokens and generatedTokens from desktop sweep and mobile schemas
- Add --compare-dir flag: when provided, renders Δ TTFT / Δ TPS / Δ ppTPS columns against a baseline directory (cross-run regression view)

benchmark-perf-llm-llamacpp.yml:
- Add nvidia-smi step to detect GPU name; pass it as --desktop-device to render-report
- Extract addon version from package.json and pass as --addon-version
- Add summarize_only + artifact_run_number inputs: re-render report from a previous run's artifacts without re-running the 6-hour benchmark
- Add compare_run_number input: download baseline artifacts from a prior run and pass --compare-dir to render-report for the Δ regression view
- Guard desktop/mobile jobs with !inputs.summarize_only
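The Δ columns enabled by --compare-dir amount to a per-metric difference between the current run and the baseline. A minimal sketch follows, assuming each config's metrics are available as plain numbers; the metric keys and function names are hypothetical and the real schemas may differ.

```javascript
// Hypothetical metric keys; the actual desktop and mobile schemas may differ.
const METRICS = ["ttft", "tps", "ppTps"];

// Format a signed two-decimal delta, e.g. "+0.00" when both runs match.
function formatDelta(current, baseline) {
  if (!Number.isFinite(current) || !Number.isFinite(baseline)) return "n/a";
  // Round first so a tiny negative difference does not render as "-0.00".
  const rounded = Number((current - baseline).toFixed(2));
  const sign = rounded < 0 ? "-" : "+";
  return `${sign}${Math.abs(rounded).toFixed(2)}`;
}

// Build the Δ cells for one config; a missing baseline yields "n/a" everywhere.
function deltaColumns(current, baseline) {
  const cells = {};
  for (const metric of METRICS) {
    cells[metric] = baseline ? formatDelta(current[metric], baseline[metric]) : "n/a";
  }
  return cells;
}

console.log(deltaColumns({ ttft: 1.5, tps: 20, ppTps: 300 }, { ttft: 1.5, tps: 20, ppTps: 300 }));
```

Comparing a run against itself gives `+0.00` in every cell, which makes a convenient sanity check for the comparison path.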
Closing — changes moved into PR #2400 (the main benchmark suite PR).
🎯 What problem does this PR solve?

The benchmark report (`render-report.js`) was missing context that Gianfranco requested: generated token counts, prompt size, run count per config, which GPU was tested, addon version, and a way to compare results across workflow runs.

Additionally, Ridwan needed a way to re-render the report from a previous run's artifacts without re-triggering the full 6-hour benchmark.

📝 How does it solve it?

`render-report.js`
- Tokens column: generated tokens per config (desktop: `metrics.generatedTokens`, mobile: `metrics.generated_tokens`)
- Addon version extracted from the mobile JSON's `addon` field; overridable via `--addon-version` CLI flag
- `--compare-dir`: when provided, renders `Δ TTFT / Δ TPS / Δ ppTPS` columns against a baseline directory for cross-run regression comparison

`benchmark-perf-llm-llamacpp.yml`
- `nvidia-smi` step captures the GPU name and passes it as `--desktop-device` to the renderer
- Addon version is read from `packages/llm-llamacpp/package.json` and passed as `--addon-version`
- `summarize_only` + `artifact_run_number` inputs: re-render report from a prior run's artifacts without running the 6-hour benchmark
- `compare_run_number` input: downloads baseline artifacts from a prior run and activates the `--compare-dir` regression view

🧪 How was it tested?

Tested `render-report.js` locally against real artifacts from run #9 (26917522463):
- Header line: `@qvac/llm-llamacpp@0.23.2 · Prompt: 510 tokens · Runs per config: 5`
- … `+0.00` deltas as expected

🔌 API Changes

`render-report.js` gains two new optional CLI flags: `--addon-version <str>` and `--compare-dir <path>`. Existing invocations without these flags produce the same output plus the new Tokens column and header line.

Workflow gains three new optional dispatch inputs: `summarize_only`, `artifact_run_number`, `compare_run_number`. All default to their previous behaviour when omitted.
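As a rough illustration of how the override, the two token schemas, and the header line could fit together, here is a minimal sketch. All function names are hypothetical, the `addon` field is assumed to hold a ready-to-print string, and the `"unknown"` fallback is an assumption rather than documented behaviour.

```javascript
// Assumption: the CLI override wins, then the mobile JSON's `addon` field.
function resolveAddonVersion(cliOverride, mobileResult) {
  return cliOverride ?? mobileResult?.addon ?? "unknown";
}

// Read generated tokens from either schema:
// desktop uses metrics.generatedTokens, mobile uses metrics.generated_tokens.
function generatedTokens(result) {
  return result?.metrics?.generatedTokens ?? result?.metrics?.generated_tokens ?? null;
}

// Join the header fields with the " · " separator seen in the tested output.
function renderHeader({ addonVersion, promptTokens, runsPerConfig }) {
  return [
    addonVersion,
    `Prompt: ${promptTokens} tokens`,
    `Runs per config: ${runsPerConfig}`,
  ].join(" · ");
}

const header = renderHeader({
  addonVersion: resolveAddonVersion(undefined, { addon: "@qvac/llm-llamacpp@0.23.2" }),
  promptTokens: 510,
  runsPerConfig: 5,
});
console.log(header); // @qvac/llm-llamacpp@0.23.2 · Prompt: 510 tokens · Runs per config: 5
```

Because both flags are optional, a call that passes no override falls through to the value found in the artifacts.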