feat[notask]: enhance LLM benchmark report — tokens, addon version, prompt size, GPU, cross-run diff#2442

Closed
donriddo wants to merge 10 commits into
tetherto:mainfrom
donriddo:tmp-bench-local


@donriddo donriddo commented Jun 4, 2026

Contributor

🎯 What problem does this PR solve?

The benchmark report (render-report.js) was missing context that Gianfranco requested: generated token counts, prompt size, run count per config, which GPU was tested, addon version, and a way to compare results across workflow runs.

Additionally, Ridwan needed a way to re-render the report from a previous run's artifacts without re-triggering the full 6-hour benchmark.

📝 How does it solve it?

render-report.js

  • Tokens column: adds generated tokens per config to every device table (desktop: metrics.generatedTokens, mobile: metrics.generated_tokens)
  • Report header: one-liner showing addon version · prompt size (tokens) · runs per config — extracted automatically from the JSON data
  • Addon version: reads from mobile report's addon field; overridable via --addon-version CLI flag
  • --compare-dir: when provided, renders Δ TTFT / Δ TPS / Δ ppTPS columns against a baseline directory for cross-run regression comparison
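
As a rough illustration of the Tokens column and header bullets above, the sketch below reads the generated-token count from either schema and assembles the one-line header. Only `metrics.generatedTokens` and `metrics.generated_tokens` come from this PR; the function names and the `promptTokens` / `runsPerConfig` parameter names are assumptions, not the actual render-report.js code.

```javascript
// Sketch only: helper and parameter names are hypothetical.

// Return the generated-token count from either schema, or null if absent.
function generatedTokens(entry) {
  const m = (entry && entry.metrics) || {};
  // Desktop reports use generatedTokens, mobile reports use generated_tokens.
  const value = m.generatedTokens ?? m.generated_tokens;
  return Number.isFinite(value) ? value : null;
}

// Build the one-line report header; parts with no data are left out.
function headerLine({ addonVersion, promptTokens, runsPerConfig }) {
  const parts = [];
  if (addonVersion) parts.push(addonVersion);
  if (Number.isFinite(promptTokens)) parts.push(`Prompt: ${promptTokens} tokens`);
  if (Number.isFinite(runsPerConfig)) parts.push(`Runs per config: ${runsPerConfig}`);
  return parts.join(' · ');
}
```

Returning `null` for a missing count lets a table renderer show a placeholder instead of a misleading `0`.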

benchmark-perf-llm-llamacpp.yml

  • GPU detection: nvidia-smi step captures the GPU name and passes it as --desktop-device to the renderer
  • Addon version: reads from packages/llm-llamacpp/package.json and passes as --addon-version
  • summarize_only + artifact_run_number inputs: re-render report from a prior run's artifacts without running the 6-hour benchmark
  • compare_run_number input: downloads baseline artifacts from a prior run and activates the --compare-dir regression view
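
To make the GPU-detection and addon-version bullets concrete, here is a minimal sketch of turning the two detected values into renderer arguments. It assumes the GPU name comes from `nvidia-smi --query-gpu=name --format=csv,noheader` (the PR does not say which invocation is used) and that the version string is `name@version`; the helper names are hypothetical, and the real workflow may well do this in shell.

```javascript
// Hypothetical helpers, not the workflow's actual steps.

// First non-empty line of the nvidia-smi name query (one line per GPU),
// or a fallback when nothing was detected.
function gpuNameFromSmi(stdout, fallback = 'unknown GPU') {
  const first = String(stdout || '')
    .split(/\r?\n/)
    .map((line) => line.trim())
    .find((line) => line.length > 0);
  return first || fallback;
}

// "<name>@<version>" built from the text of a package.json file.
function addonVersionFromPackageJson(text) {
  const pkg = JSON.parse(text);
  return `${pkg.name}@${pkg.version}`;
}

// Argument list for the renderer, using the two flags named in this PR.
function rendererArgs(gpuName, addonVersion) {
  return ['--desktop-device', gpuName, '--addon-version', addonVersion];
}
```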

🧪 How was it tested?

Tested render-report.js locally against real artifacts from run #9 (26917522463):

  • Header shows @qvac/llm-llamacpp@0.23.2 · Prompt: 510 tokens · Runs per config: 5
  • Tokens column populates correctly (e.g. 646, 1024, 351)
  • Comparison mode with same dir produces +0.00 deltas as expected
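
The `+0.00` result in the last bullet is what a signed, two-decimal difference gives when a directory is compared with itself. A minimal sketch of such a formatter follows; it assumes the Δ columns are absolute differences (the PR does not say whether they are absolute or relative), and the function names and the `ttft` / `tps` / `ppTps` keys are hypothetical.

```javascript
// Sketch only: assumes absolute (current - baseline) deltas.

// Signed difference with two decimals; "n/a" when either side is missing.
function formatDelta(current, baseline) {
  if (!Number.isFinite(current) || !Number.isFinite(baseline)) return 'n/a';
  // Round first so tiny negative differences print as +0.00, not -0.00.
  const rounded = Number((current - baseline).toFixed(2));
  const sign = rounded < 0 ? '-' : '+';
  return `${sign}${Math.abs(rounded).toFixed(2)}`;
}

// Delta cells for one config; metric keys here are illustrative.
function deltaColumns(current, baseline) {
  return {
    ttft: formatDelta(current.ttft, baseline.ttft),
    tps: formatDelta(current.tps, baseline.tps),
    ppTps: formatDelta(current.ppTps, baseline.ppTps),
  };
}
```

Printing `n/a` for a config that is absent from the baseline keeps a missing measurement from being mistaken for a zero delta.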

🔌 API Changes

render-report.js gains two new optional CLI flags: --addon-version <str> and --compare-dir <path>. Existing invocations without these flags produce the same output plus the new Tokens column and header line.
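
For illustration, optional flags of this kind can be parsed so that an invocation without them takes the old code path. The sketch below is an assumption about shape only: `parseOptionalFlags` and the returned field names are hypothetical, and the real script may use a different parser.

```javascript
// Sketch only: not the actual render-report.js argument handling.

// Pull "--flag value" pairs for the two optional flags out of an argv array.
// Flags that are absent stay null, so callers can keep the old behaviour.
function parseOptionalFlags(argv) {
  const opts = { addonVersion: null, compareDir: null };
  for (let i = 0; i < argv.length; i++) {
    const hasValue = i + 1 < argv.length;
    if (argv[i] === '--addon-version' && hasValue) {
      opts.addonVersion = argv[++i];
    } else if (argv[i] === '--compare-dir' && hasValue) {
      opts.compareDir = argv[++i];
    }
  }
  return opts;
}
```

In a Node script this would typically be called as `parseOptionalFlags(process.argv.slice(2))`.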

The workflow gains three new optional dispatch inputs: summarize_only, artifact_run_number, and compare_run_number. Omitting them leaves the workflow's behaviour unchanged.

donriddo added 10 commits June 3, 2026 16:09
30 mobile shards in one reused-workflow call exceed that workflow's 120-min
job timeout (Android was already at 119 min with 10 shards). Split the groups
into three batches by KV-cache type (10 each — the proven in-budget load) via
a max-parallel:1 matrix so each batch runs in isolation with no Device Farm
pool contention. Add an optional artifact_suffix input to the mobile workflow
(default empty, so other addons' artifact names are unchanged) to keep the
three batches' perf-report artifacts from colliding; summarize aggregates all
three into the unified report.

Replace the 3-batch KV-cache split with a single 30-group mobile call and add
an optional job_timeout_minutes input to the mobile workflow (default 120, so
every other caller is unchanged). Benchmark passes 240; observed mobile wall
for the full matrix is ~140-160 min, comfortably inside it. One app build
instead of three, single run, ~2.5-3h.

The single 30-shard mobile job fails on both platforms: Android serializes 30
runs against its Device Farm pool (>240 min) and the macOS runner fills its
disk collecting 30 runs' logs (no space left on device). Split into three
KV-cache batches (10 groups each — the proven ~119 min load) run sequentially
(max-parallel: 1) to avoid pool contention. Each batch passes job_timeout_minutes
180 for headroom and a distinct artifact_suffix so the three perf-reports don't
collide; summarize aggregates all three. Both new mobile-workflow inputs are
optional and default to current behaviour, so other callers are unchanged.
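
The batching described in this commit message can be sketched as a grouping step: one batch per KV-cache type, each with its own artifact suffix and the 180-minute timeout. This is an illustration only; the `cacheType` field, the suffix format and the function name are assumptions, and the real split lives in the workflow matrix rather than in JavaScript.

```javascript
// Sketch only: models the three-batch split, not the workflow YAML itself.

// Group shard configs by KV-cache type; each group becomes one batch
// that would be run sequentially (max-parallel: 1 in the workflow).
function batchesByCacheType(groups) {
  const byType = new Map();
  for (const g of groups) {
    if (!byType.has(g.cacheType)) byType.set(g.cacheType, []);
    byType.get(g.cacheType).push(g);
  }
  return [...byType.entries()].map(([cacheType, items]) => ({
    cacheType,
    artifactSuffix: `-${cacheType}`, // hypothetical suffix format
    jobTimeoutMinutes: 180,          // headroom value from the commit message
    groups: items,
  }));
}
```

Deriving the suffix from the cache type keeps the three perf-report artifact names distinct without any extra bookkeeping.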
…to best-config

When cache-type-k == cache-type-v the config label now renders [kv=f16] not
[kv=f16/f16], making desktop consistent with mobile. Also add a Lowest TTFT
column to the best-config-per-device summary.
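
The label rule in this commit can be sketched as a single comparison. The function name is hypothetical, and `q8_0` in the usage note is just an example value for a mixed configuration.

```javascript
// Sketch only: collapses the label when K and V cache types match.
function kvLabel(cacheTypeK, cacheTypeV) {
  return cacheTypeK === cacheTypeV
    ? `[kv=${cacheTypeK}]`
    : `[kv=${cacheTypeK}/${cacheTypeV}]`;
}
```

With this rule, `kvLabel('f16', 'f16')` yields `[kv=f16]`, while a mixed pair such as `kvLabel('f16', 'q8_0')` keeps both parts as `[kv=f16/q8_0]`.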
Gianfranco's spec was highest TPS and highest ppTPS only. Lowest TTFT was not
requested and is removed.

…, prompt size, run count, GPU, cross-run diff

render-report.js:
- Add Tokens column (generated tokens per config) to all device tables
- Add report header line: addon version, prompt size (tokens), runs per config
- Extract addon version from mobile JSON's `addon` field; accept --addon-version CLI override
- Extract promptTokens and generatedTokens from desktop sweep and mobile schemas
- Add --compare-dir flag: when provided, renders Δ TTFT / Δ TPS / Δ ppTPS columns
  against a baseline directory (cross-run regression view)

benchmark-perf-llm-llamacpp.yml:
- Add nvidia-smi step to detect GPU name; pass it as --desktop-device to render-report
- Extract addon version from package.json and pass as --addon-version
- Add summarize_only + artifact_run_number inputs: re-render report from a previous
  run's artifacts without re-running the 6-hour benchmark
- Add compare_run_number input: download baseline artifacts from a prior run and
  pass --compare-dir to render-report for the Δ regression view
- Guard desktop/mobile jobs with !inputs.summarize_only
@donriddo donriddo requested review from a team as code owners June 4, 2026 10:20

donriddo commented Jun 4, 2026

Contributor Author

Closing — changes moved into PR #2400 (the main benchmark suite PR).

@donriddo donriddo closed this Jun 4, 2026

github-actions Bot commented Jun 4, 2026

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ❌ (0/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*
