Merged
43 commits
bd3bd69
QVAC-17830 feat: add VLM perf metrics with multi-run averaging
tobi-legan Apr 24, 2026
9f8962b
QVAC-17830 feat: wire Mobile LLM into perf-report.yml weekly aggregator
tobi-legan Apr 24, 2026
5739149
QVAC-17830 feat: add per-run joint perf reporter to mobile LLM workflow
tobi-legan Apr 24, 2026
37afd3d
QVAC-17830 fix: preserve mobile perf data on OOM, split image tests p…
tobi-legan Apr 24, 2026
9788468
QVAC-17830 fix: preserve mobile perf data under iOS V8 Zone OOM
tobi-legan Apr 24, 2026
b11e612
QVAC-17830 fix: plug combined perf report gaps (desktop race + artifa…
tobi-legan Apr 24, 2026
2dbabba
QVAC-17830 fix: iOS fruit plate retry + consolidate Android images + …
tobi-legan Apr 24, 2026
c1cff75
QVAC-17830 fix: inline crash flush-delay, drop duplicated pull helper
tobi-legan Apr 24, 2026
562ac14
QVAC-17830 fix: iOS fruit plate warmup + merge linux legs + mobile de…
tobi-legan Apr 24, 2026
398292d
QVAC-17830 fix: iOS fruit plate 1-iter override + HTML detail tables
tobi-legan Apr 24, 2026
e422652
QVAC-17830 fix: warm process iOS heavy7 + dedupe perf legs + drop retry
tobi-legan Apr 24, 2026
d691a15
QVAC-17830 fix: warm iOS heavy7 with elephant instead of api-behavior
tobi-legan Apr 25, 2026
14b1c48
QVAC-17830 fix: shrink iOS fruit-plate to 2 inferences cold
tobi-legan Apr 25, 2026
86ef719
QVAC-17830 feat: scenario grouping, GPU probe, squashed PR summary, p…
tobi-legan Apr 28, 2026
5fabbac
feat: surface per-device detail tables in PR summary with mean ±std c…
tobi-legan Apr 28, 2026
15d06e7
refactor: drop image_prefill_time_ms from perf report
tobi-legan Apr 28, 2026
7f17c52
Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…
tobi-legan Apr 28, 2026
4f23a05
Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…
tobi-legan Apr 29, 2026
fa2e76d
Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…
tobi-legan Apr 30, 2026
973b744
QVAC-17830 fix: tighten combined perf report layout (column filtering…
tobi-legan Apr 30, 2026
0fba1d0
QVAC-18111 feat: env-driven perf iterations + Benchmark Performance (…
tobi-legan Apr 30, 2026
670ee24
QVAC-17830 fix: tool-calling EP label honours NO_GPU on linux-x64-cpu…
tobi-legan Apr 30, 2026
45caf66
QVAC-17830 fix: terser perf-report legend + full metric breakdown in …
tobi-legan Apr 30, 2026
7cf5f34
Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…
tobi-legan Apr 30, 2026
47239ce
QVAC-17830 fix: use bare-os getEnv() for QVAC_PERF_RUNS / NO_GPU lookups
tobi-legan Apr 30, 2026
a561656
QVAC-18111 chore: align Benchmark Performance (LLM) workflow with the…
tobi-legan Apr 30, 2026
102f9cd
QVAC-17830 fix: address CodeQL security findings on combined-perf-rep…
tobi-legan Apr 30, 2026
a6751c5
QVAC-17830 fix: add shell + security note on Generate combined report…
tobi-legan Apr 30, 2026
5793a04
Merge branch 'main' into feature-qvac-17830-vlm-perf-metrics
tobi-legan May 4, 2026
da03bb5
QVAC-17830 feat: bridge QVAC_PERF_RUNS overrides into mobile bare run…
tobi-legan May 4, 2026
e6f04bf
QVAC-17830 feat: gate Benchmark Performance (LLM) to perf-emitting te…
tobi-legan May 4, 2026
e0ac869
Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…
tobi-legan May 4, 2026
c3db5cb
Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…
tobi-legan May 4, 2026
e67fe3a
Merge remote-tracking branch 'origin/main' into feature-qvac-17830-vl…
tobi-legan May 5, 2026
706b0d4
fix[ci]: drop PR-head checkout in combine-perf-reports to clear CodeQ…
tobi-legan May 6, 2026
3102ddf
Merge branch 'main' into feature-qvac-17830-vlm-perf-metrics
tobi-legan May 7, 2026
6665416
Merge branch 'main' into feature-qvac-17830-vlm-perf-metrics
tobi-legan May 11, 2026
8f30064
fix[ci]: add runGemma4Test + runOcrPaddleTest to iOS lightB
tobi-legan May 11, 2026
cb0a21c
fix[ci]: bound mobile monitor when AWS API permanently fails
tobi-legan May 11, 2026
23e23f7
mod[notask]: isolate iOS Gemma4 and OcrPaddle into their own Device F…
tobi-legan May 11, 2026
5d16ad2
QVAC-17830 fix: apply fruit-plate iOS OOM mitigation to high-res aurora
tobi-legan May 11, 2026
03e8346
Merge branch 'main' into feature-qvac-17830-vlm-perf-metrics
gianni-cor May 12, 2026
66c24fa
Merge branch 'main' into feature-qvac-17830-vlm-perf-metrics
gianni-cor May 12, 2026
@@ -0,0 +1,204 @@
name: Benchmark Performance (LLM)

# QVAC-18111: dedicated benchmarking workflow for the LLM addon —
# manually triggered only.
#
# Per the perf policy agreed on Slack (2026-04-30, @Olya / @Gianfranco):
# the umbrella PR workflow runs perf tests at the cheap default
# (1 warmup + 1 counted, no averaging) so we don't pay full perf
# cost on every PR. This workflow is the only place we crank
# QVAC_PERF_RUNS up to produce mean ± std numbers.
#
# Phase-1 scope: desktop matrix only. Mobile (Android / iOS Device
# Farm) needs a build-time hook in the test app to pass env vars
# through to bare — tracked as a QVAC-18111 follow-up. Mobile rows
# in PR runs continue to use the cheap 1+1 default.
#
# Mirrors the structure of the existing `Benchmark Performance
# (Parakeet)` and `Benchmark Performance (Whispercpp)` workflows on
# main: a `context` job derives repo/ref from optional inputs, then
# dispatches `prebuilds-...yml` followed by `integration-test-...yml`
# with the bench-mode iteration counts, and a `summarize` job
# aggregates the artifacts into a single combined HTML + GitHub
# step summary.

on:
  workflow_dispatch:
    inputs:
      repository:
        description: "Repository to benchmark"
        required: false
        type: string
      ref:
        description: "Git ref (branch/tag/SHA) to benchmark"
        required: false
        type: string
      qvac_perf_runs:
        description: "QVAC_PERF_RUNS — counted iterations per perf test"
        required: false
        type: string
        default: "3"
      qvac_perf_warmup_runs:
        description: "QVAC_PERF_WARMUP_RUNS — warmup iterations per perf test"
        required: false
        type: string
        default: "1"

permissions:
  contents: read
  packages: read
  id-token: write

jobs:
  context:
    runs-on: ubuntu-latest
    outputs:
      repository: ${{ steps.ctx.outputs.repository }}
      ref: ${{ steps.ctx.outputs.ref }}
    steps:
      - id: ctx
        shell: bash
        env:
          INPUT_REPO: ${{ inputs.repository }}
          INPUT_REF: ${{ inputs.ref }}
          REPO: ${{ github.repository }}
          REF_NAME: ${{ github.ref_name }}
        run: |
          repo="${INPUT_REPO:-$REPO}"
          ref="${INPUT_REF:-$REF_NAME}"
          echo "repository=$repo" >> "$GITHUB_OUTPUT"
          echo "ref=$ref" >> "$GITHUB_OUTPUT"

  prebuild:
    needs: context
    permissions:
      contents: write
      packages: write
      pull-requests: write
      id-token: write
    uses: ./.github/workflows/prebuilds-qvac-lib-infer-llamacpp-llm.yml
    secrets: inherit
    with:
      repository: ${{ needs.context.outputs.repository }}
      ref: ${{ needs.context.outputs.ref }}

  desktop-benchmarks:
    needs: [context, prebuild]
    permissions:
      contents: read
      packages: read
      id-token: write
    uses: ./.github/workflows/integration-test-qvac-lib-infer-llamacpp-llm.yml
    secrets: inherit
    with:
      repository: ${{ needs.context.outputs.repository }}
      ref: ${{ needs.context.outputs.ref }}
      qvac_perf_runs: ${{ inputs.qvac_perf_runs }}
      qvac_perf_warmup_runs: ${{ inputs.qvac_perf_warmup_runs }}

  summarize:
    needs: [context, desktop-benchmarks]
    if: always()
    runs-on: ubuntu-latest
    timeout-minutes: 10
    permissions:
      contents: read
    steps:
      - name: Checkout repository
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # 6.0.2
        with:
          repository: ${{ needs.context.outputs.repository }}
          ref: ${{ needs.context.outputs.ref }}
          token: ${{ secrets.PAT_TOKEN }}
          sparse-checkout: |
            scripts/perf-report
            packages/qvac-lib-infer-llamacpp-llm/media

      - name: Setup Node.js
        uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # 4.4.0
        with:
          node-version: lts/*

      - name: Download all perf report artifacts
        uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c # 8.0.1
        with:
          pattern: perf-report-llamacpp-llm-*-${{ github.run_number }}
          path: combined-reports
        continue-on-error: true

      - name: Fix desktop device names
        shell: bash
        run: |
          # Same fold as the umbrella combine-perf-reports step:
          # sibling matrix legs (linux-x64-cpu+linux-x64-gpu,
          # linux-arm64-u22+linux-arm64-u24) collapse onto one device
          # name so [CPU]/[GPU] rows sit in the same column.
          for dir in combined-reports/perf-report-llamacpp-llm-*/; do
            [ -d "$dir" ] || continue
            base=$(basename "$dir")
            platform=$(echo "$base" | sed "s/^perf-report-llamacpp-llm-//" | sed "s/-${{ github.run_number }}$//")

            case "$platform" in Android|iOS) continue ;; esac

            case "$platform" in
              linux-x64-cpu|linux-x64-gpu) device_name="linux-x64" ;;
              linux-arm64-u22|linux-arm64-u24) device_name="linux-arm64" ;;
              *) device_name="$platform" ;;
            esac

            for json in $(find "$dir" -name "performance-report.json" 2>/dev/null); do
              if command -v jq >/dev/null 2>&1; then
                jq --arg name "$device_name" '.device.name = $name' "$json" > "${json}.tmp" && mv "${json}.tmp" "$json"
                echo "Patched device name in $json -> $device_name (was matrix label $platform)"
              fi
            done
          done

      - name: Generate consolidated benchmark report
        run: |
          if ! find combined-reports -name "performance-report.json" -type f 2>/dev/null | grep -q .; then
            echo "No performance reports found."
            exit 0
          fi

          mkdir -p benchmark-artifacts

          node scripts/perf-report/aggregate.js \
            --dir combined-reports \
            --addon-type vision \
            --device-details \
            --output-html benchmark-artifacts/llamacpp-llm-performance-findings.html \
            --output-json benchmark-artifacts/llamacpp-llm-performance-findings.json \
            --output benchmark-artifacts/llamacpp-llm-performance-findings.md

      - name: Add summary
        if: always()
        shell: bash
        run: |
          set +e
          MD_FILE="benchmark-artifacts/llamacpp-llm-performance-findings.md"
          {
            echo "## LLM / VLM Benchmark Report (Desktop)"
            echo ""
            echo "> Triggered manually via \`workflow_dispatch\` — \`QVAC_PERF_RUNS=${{ inputs.qvac_perf_runs }}\`, \`QVAC_PERF_WARMUP_RUNS=${{ inputs.qvac_perf_warmup_runs }}\`."
            echo ""
            echo "> Mobile (Android / iOS) is **not** covered by this workflow yet — bench-mode iteration counts need a build-time hook in the mobile test app (QVAC-18111 follow-up). Mobile rows shown in PR runs continue to use 1 + 1."
            echo ""
            if [ -f "$MD_FILE" ]; then
              cat "$MD_FILE"
            else
              echo "No combined performance report available."
            fi
          } >> "$GITHUB_STEP_SUMMARY"

      - name: Upload consolidated benchmark report
        if: always()
        uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # 7.0.0
        with:
          name: llamacpp-llm-performance-findings
          path: |
            benchmark-artifacts/llamacpp-llm-performance-findings.md
            benchmark-artifacts/llamacpp-llm-performance-findings.json
            benchmark-artifacts/llamacpp-llm-performance-findings.html
          retention-days: 30
          if-no-files-found: ignore
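
Review note: the "Fix desktop device names" step does two things inline: it recovers the matrix label from the artifact directory name, then folds sibling legs onto one device name. A minimal sketch of that logic as standalone bash functions follows, which may make the mapping easier to check locally. The function names `platform_from_artifact` and `fold_device_name` are hypothetical and do not exist in the workflow; the sketch also swaps the workflow's `sed` calls for bash parameter expansion, which should behave the same for these names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the PR) that mirror the inline logic
# of the "Fix desktop device names" step.

# Recover the matrix label from an artifact directory name by stripping
# the "perf-report-llamacpp-llm-" prefix and the "-<run_number>" suffix.
platform_from_artifact() {
  local base="$1" run_number="$2"
  base="${base#perf-report-llamacpp-llm-}"
  printf '%s\n' "${base%-"$run_number"}"
}

# Fold sibling matrix legs onto one device name; any other label passes
# through unchanged, as in the workflow's catch-all case branch.
fold_device_name() {
  case "$1" in
    linux-x64-cpu|linux-x64-gpu)     printf '%s\n' "linux-x64" ;;
    linux-arm64-u22|linux-arm64-u24) printf '%s\n' "linux-arm64" ;;
    *)                               printf '%s\n' "$1" ;;
  esac
}

# Example: the directory name and run number 57 are made up for illustration.
fold_device_name "$(platform_from_artifact perf-report-llamacpp-llm-linux-arm64-u24-57 57)"
# prints "linux-arm64"
```

Unlike the workflow step, this sketch does not skip the `Android` and `iOS` labels; a caller would need to keep that guard before patching any JSON.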