
QVAC-18111 infra[notask]: scaffold Benchmark Performance (LLM) workflow_dispatch #1839

Merged
tobi-legan merged 6 commits into main from infra/qvac-18111-benchmark-llm-workflow
May 4, 2026

tobi-legan commented Apr 30, 2026

Summary

Scaffolds the dedicated Benchmark Performance (LLM) workflow_dispatch on main so the QVAC-17830 perf-metrics feature branch can be dispatched against it. Per the perf policy agreed on Slack, the umbrella on-pr workflow keeps the cheap iteration default; this dedicated workflow is the only place we crank up QVAC_PERF_RUNS to get mean ± std numbers.

GitHub requires a workflow_dispatch to exist on the default branch before it shows up in the Actions tab and becomes triggerable with --ref <feature-branch> — that's why this small infra PR lands ahead of the main perf PR.
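
For orientation, a dispatch-only workflow of this shape could be sketched as below. The workflow name, the input names and their defaults, and the desktop-benchmarks / summarize job names come from this PR. The input descriptions, the `.github/workflows/` path, `secrets: inherit`, the `always()` condition, and the placeholder summary step are assumptions for illustration, not the merged file.

```yaml
# Sketch only: not the merged workflow file.
name: Benchmark Performance (LLM)

on:
  workflow_dispatch:
    inputs:
      repository:
        description: Repository to benchmark (illustrative wording)
        type: string
        required: false
      ref:
        description: Git ref to benchmark (illustrative wording)
        type: string
        required: false
      qvac_perf_runs:
        description: Measured iterations per benchmark
        type: string
        default: "3"
      qvac_perf_warmup_runs:
        description: Warmup iterations per benchmark
        type: string
        default: "1"
      run_desktop:
        description: Run the desktop benchmark matrix
        type: boolean
        default: true

jobs:
  # Jobs that run before desktop-benchmarks in the real workflow are omitted.
  desktop-benchmarks:
    if: ${{ inputs.run_desktop }}
    uses: ./.github/workflows/integration-test-qvac-lib-infer-llamacpp-llm.yml
    with:
      qvac_perf_runs: ${{ inputs.qvac_perf_runs }}
      qvac_perf_warmup_runs: ${{ inputs.qvac_perf_warmup_runs }}
    secrets: inherit  # assumption: the called workflow needs repo secrets

  summarize:
    needs: desktop-benchmarks
    # Assumption: keep running even when desktop-benchmarks is skipped,
    # so run_desktop=false still yields an (empty) summary.
    if: ${{ always() }}
    runs-on: ubuntu-latest
    steps:
      - name: Write step summary (placeholder body)
        run: |
          echo "## Benchmark Performance (LLM)" >> "$GITHUB_STEP_SUMMARY"
```

Because the trigger is `workflow_dispatch` only, nothing here runs on push or pull_request; the file only has to exist on main for the workflow to be listed and dispatched against another ref.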

Changes

  • benchmark-performance-qvac-lib-infer-llamacpp-llm.yml (new)
    • Manual workflow_dispatch only, mirrors the existing Parakeet / Whispercpp benchmark workflows
    • Inputs: repository, ref, qvac_perf_runs (default 3), qvac_perf_warmup_runs (default 1), run_desktop (default true)
    • Jobs: context → prebuild → desktop-benchmarks (gated by run_desktop, calls integration-test-...yml) → summarize (aggregates desktop artifacts into combined HTML + GitHub step summary)
    • Phase-1 scope: desktop only. Mobile (Device Farm) requires a build-time hook in the test app to thread env vars through to bare. That lands as a stacked follow-up PR, #1840 ("QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile"), which adds mobile-benchmarks and a matching run_mobile toggle so the two matrices can be triggered independently
  • integration-test-qvac-lib-infer-llamacpp-llm.yml
    • Thread qvac_perf_runs / qvac_perf_warmup_runs through workflow_call + workflow_dispatch
    • Surface as QVAC_PERF_RUNS / QVAC_PERF_WARMUP_RUNS env on the Linux/macOS and Windows run-test steps
    • Empty string ⇒ unset, so the umbrella PR workflow continues to honour the test-side default. Existing PR runs are unaffected.
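
The input threading described for integration-test-qvac-lib-infer-llamacpp-llm.yml could look roughly like the excerpt below. Only the input and env-var names come from this PR. The job name, runner, `npm test` command, and the explicit `unset` guard are hypothetical; the real workflow may instead rely on the test code treating an empty value as unset.

```yaml
# Hypothetical excerpt: illustrates the empty-string-means-unset convention.
on:
  workflow_call:
    inputs:
      qvac_perf_runs:
        type: string
        required: false
        default: ""   # empty: fall back to the test-side default
      qvac_perf_warmup_runs:
        type: string
        required: false
        default: ""
  workflow_dispatch:
    inputs:
      qvac_perf_runs:
        type: string
        required: false
        default: ""
      qvac_perf_warmup_runs:
        type: string
        required: false
        default: ""

jobs:
  run-tests:                 # hypothetical job name
    runs-on: ubuntu-latest   # hypothetical runner
    steps:
      - name: Run tests (Linux/macOS)
        shell: bash
        env:
          QVAC_PERF_RUNS: ${{ inputs.qvac_perf_runs }}
          QVAC_PERF_WARMUP_RUNS: ${{ inputs.qvac_perf_warmup_runs }}
        run: |
          # Drop empty values so the tests see the variables as truly unset.
          [ -n "$QVAC_PERF_RUNS" ] || unset QVAC_PERF_RUNS
          [ -n "$QVAC_PERF_WARMUP_RUNS" ] || unset QVAC_PERF_WARMUP_RUNS
          npm test   # hypothetical test command
```

A caller that passes no inputs, such as the umbrella PR workflow, gets the empty defaults and so keeps the test-side iteration counts. The Windows run-test step would need an equivalent guard in its own shell.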

Test plan

  • Land this PR on main so Benchmark Performance (LLM) appears in the Actions tab
  • Dispatch with --ref feature-qvac-17830-vlm-perf-metrics (the perf-metrics branch carries the actual env-var consumption in _image-common.js / bitnet.test.js / tool-calling.test.js) to confirm the bench-mode 3 + 1 iteration counts surface in the combined report
  • Dispatch with run_desktop=false to verify the desktop matrix is skipped. Until #1840 ("QVAC-18111 infra[notask]: bridge QVAC_PERF_RUNS to mobile test app via pushFile") lands and adds run_mobile, this dispatch is a no-op that produces an empty summary, which is the expected behaviour
  • Confirm the umbrella on-pr LLM workflow stays unchanged (PR runs use 1 + 1)
  • actionlint / GitHub workflow validation passes

…ow_dispatch

GitHub requires a `workflow_dispatch` workflow to exist on the
default branch before it shows up in the Actions tab and becomes
triggerable with `--ref <feature-branch>`. This lands the LLM
benchmark workflow on `main` so the QVAC-17830 perf-metrics feature
branch can be dispatched against it for end-to-end validation.

Changes:
- `benchmark-performance-qvac-lib-infer-llamacpp-llm.yml` (new):
  manual `workflow_dispatch` only — mirrors the structure of the
  existing Parakeet / Whispercpp benchmark workflows. Calls
  `prebuilds-...yml` then `integration-test-...yml` with
  bench-mode iteration counts (`QVAC_PERF_RUNS=3`,
  `QVAC_PERF_WARMUP_RUNS=1` by default), then aggregates desktop
  artifacts into a combined HTML / step-summary. Phase-1 scope is
  desktop only — mobile (Device Farm) needs a build-time hook in
  the test app to thread env vars through to bare and is tracked
  as a QVAC-18111 follow-up.
- `integration-test-qvac-lib-infer-llamacpp-llm.yml`: thread
  `qvac_perf_runs` / `qvac_perf_warmup_runs` through `workflow_call`
  + `workflow_dispatch` and surface them as `QVAC_PERF_RUNS` /
  `QVAC_PERF_WARMUP_RUNS` on the Linux/macOS and Windows test run
  steps. Empty string => unset, so the umbrella PR workflow
  continues to honour the test-side default and PR runs are
  unaffected by this change.

Per the perf policy agreed on Slack (2026-04-30): the umbrella
on-pr workflow runs perf tests at the cheap default so we don't pay
full perf cost on every PR; this dedicated workflow is the only
place we crank up the iteration counts to produce mean ± std
numbers.

Made-with: Cursor
olyasir commented May 1, 2026

/review

@tobi-legan tobi-legan requested a review from a team as a code owner May 4, 2026 11:39
@tobi-legan

/review

3 participants