
Restructure benchmark workflow with per-test jobs and dual runner groups #9217

Merged

michaelstaib merged 9 commits into main from mst/benchmarks-wf on Feb 25, 2026
Conversation

@michaelstaib
Member

Summary

  • Split the monolithic benchmark workflow into 12 separate matrix jobs: 3 tests (no-recursion, deep-recursion, variable-batch) x 2 modes (constant, ramping) x 2 runner groups (Benchmarking, Benchmarking-2); the full combination space is sketched after this list
  • Each job runs independently on its assigned runner group and progressively updates the PR comment as it completes, providing immediate feedback
  • The performance report distinguishes results by runner group: Constant 1 (Benchmarking) and Constant 2 (Benchmarking-2), etc.
  • Benchmarks whose job has not yet completed are shown as pending in the report
  • Baseline results are stored per runner group in the external performance data repository
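
For orientation, this is the 3 x 2 x 2 combination space the matrix expands into. The loop below only enumerates the twelve combinations named in this PR; the actual expansion is done by the workflow's matrix strategy, not by a script.

```bash
# Enumerate the 12 test/mode/runner-group combinations (illustration only;
# the workflow's matrix strategy performs the real expansion).
for test in no-recursion deep-recursion variable-batch; do
  for mode in constant ramping; do
    for runner in Benchmarking Benchmarking-2; do
      echo "$test / $mode / $runner"
    done
  done
done
```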

New files

  • run-single-benchmark.sh - Runs a single test+mode combination with median calculation for constant mode (a sketch of the median step follows this list)
  • generate-report.sh - Merges all available result JSONs into a combined markdown report
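
A minimal sketch of the constant-mode median step, assuming each pass writes a small summary JSON. The helper name run-k6-pass.sh and the .reqs_per_sec field are hypothetical placeholders, not the actual contract of run-single-benchmark.sh.

```bash
#!/usr/bin/env bash
# Sketch only: run the constant-mode benchmark several times and report the
# median req/s. The per-pass helper and the JSON field are assumptions.
set -euo pipefail

RUNS=3
values=()

for i in $(seq 1 "$RUNS"); do
  ./run-k6-pass.sh --out "result-$i.json"   # hypothetical single k6 pass
  values+=("$(jq -r '.reqs_per_sec' "result-$i.json")")
done

# Sort numerically and take the middle value (assumes an odd number of runs).
median=$(printf '%s\n' "${values[@]}" | sort -n | awk -v n="$RUNS" 'NR == int((n + 1) / 2)')
echo "Median req/s over $RUNS runs: $median"
```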

Test plan

  • Verify workflow matrix generates 12 jobs (3 tests x 2 modes x 2 runner groups)
  • Verify each job runs on the correct runner group
  • Verify PR comment is created/updated as each job completes
  • Verify the report shows "pending" for incomplete benchmarks
  • Verify final report shows all 12 results with correct runner labels
  • Verify baseline storage works on push to main

🤖 Generated with Claude Code

…nner groups

Split the monolithic benchmark workflow into separate jobs per benchmark
(no-recursion, deep-recursion, variable-batch) x mode (constant, ramping)
x runner group (Benchmarking, Benchmarking-2) using a matrix strategy.

Each of the 12 jobs runs independently and progressively updates the PR
comment as it completes, giving immediate feedback. The report table
distinguishes runner groups as "Constant 1" / "Constant 2" etc.

New scripts:
- run-single-benchmark.sh: runs a single test+mode combination
- generate-report.sh: merges available results into a combined report

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented Feb 25, 2026

Fusion Gateway Performance Results

Simple Composite Query

| Scenario | Req/s | Err% |
| --- | --- | --- |
| Constant 1 (50 VUs) | 5058.48 | 0.00% |
| Constant 2 (50 VUs) | 3764.65 | 0.00% |
| Ramping 1 (0-500-0 VUs) | 5605.69 | 0.00% |
| Ramping 2 (0-500-0 VUs) | 3829.16 | 0.00% |

Response Times

| Scenario | Min | Med | Avg | P90 | P95 | Max |
| --- | --- | --- | --- | --- | --- | --- |
| Constant 1 | 0.58ms | 8.08ms | 9.74ms | 15.90ms | 21.66ms | 179.55ms |
| Constant 2 | 1.05ms | 10.58ms | 13.05ms | 21.70ms | 30.05ms | 251.05ms |
| Ramping 1 | 0.62ms | 34.61ms | 39.88ms | 74.96ms | 104.18ms | 250.85ms |
| Ramping 2 | 1.33ms | 51.55ms | 57.08ms | 100.06ms | 142.75ms | 289.04ms |

Deep Recursion Query

| Scenario | Req/s | Err% |
| --- | --- | --- |
| Constant 1 (50 VUs) | 953.66 | 0.00% |
| Constant 2 (50 VUs) | 676.05 | 0.00% |
| Ramping 1 (0-500-0 VUs) | 1146.77 | 0.00% |
| Ramping 2 (0-500-0 VUs) | 770.77 | 0.00% |

Response Times

| Scenario | Min | Med | Avg | P90 | P95 | Max |
| --- | --- | --- | --- | --- | --- | --- |
| Constant 1 | 4.87ms | 46.72ms | 50.97ms | 64.96ms | 76.68ms | 490.79ms |
| Constant 2 | 11.39ms | 64.51ms | 71.21ms | 91.26ms | 108.62ms | 765.07ms |
| Ramping 1 | 1.92ms | 159.64ms | 184.76ms | 400.69ms | 449.50ms | 707.73ms |
| Ramping 2 | 3.10ms | 241.07ms | 267.35ms | 556.33ms | 620.58ms | 982.39ms |

Variable Batching Throughput

| Scenario | Req/s | Err% |
| --- | --- | --- |
| Constant 1 (50 VUs) | 11772.28 | 0.00% |
| Constant 2 (50 VUs) | 5761.46 | 0.00% |
| Ramping 1 (0-500-0 VUs) | 9229.55 | 0.00% |
| Ramping 2 (0-500-0 VUs) | 5197.23 | 0.00% |

Response Times

| Scenario | Min | Med | Avg | P90 | P95 | Max |
| --- | --- | --- | --- | --- | --- | --- |
| Constant 1 | 0.09ms | 3.87ms | 4.20ms | 6.86ms | 9.17ms | 50.46ms |
| Constant 2 | 0.16ms | 8.10ms | 8.58ms | 14.35ms | 17.31ms | 60.32ms |
| Ramping 1 | 0.10ms | 21.74ms | 25.01ms | 47.16ms | 64.48ms | 169.17ms |
| Ramping 2 | 0.20ms | 39.49ms | 44.19ms | 82.60ms | 106.80ms | 229.01ms |

Runner 1 = Benchmarking, Runner 2 = Benchmarking-2

Run 22403122842 • Commit 60ac985 • Wed, 25 Feb 2026 19:48:42 GMT

michaelstaib and others added 8 commits February 25, 2026 13:29
Per-job updates now track a completion count in a hidden HTML marker and
only overwrite the PR comment if they have more data than what's already
posted. A separate "Final Performance Report" job runs after all
benchmarks complete (using needs + if: always()) and posts the definitive
result with all available data, guaranteeing correctness.
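
In shell terms, the "more data than what's already posted" check could look like the sketch below. The marker name benchmark-count is made up for illustration; the actual workflow performs this comparison inside a github-script step.

```bash
# Sketch of the optimistic-concurrency check: only overwrite the PR comment
# when this job has accumulated more results than the comment already shows.
set -euo pipefail

existing_comment="$1"   # body of the current PR comment
my_completed="$2"       # number of results this job has accumulated

posted=$(grep -oP '(?<=<!-- benchmark-count: )\d+' <<<"$existing_comment" || echo 0)

if (( my_completed > posted )); then
  echo "Updating comment: $my_completed > $posted completed benchmarks"
else
  echo "Skipping update: comment already reflects $posted completed benchmarks"
fi
```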

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benchmarking-2 machines have 8 cores instead of 16. The script now
selects the CPU pinning profile based on the runner group:

Benchmarking (16 cores):
  k6: 0-1, Gateway: 2-4/2-5, Sources: 5-15/6-15, Inventory: 2-5

Benchmarking-2 (8 cores):
  k6: 0, Gateway: 1-2, Sources: 3-7, Inventory: 1-7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pproach

The benchmark jobs were wasting runner time on artifact downloads, report
generation, and comment updates. Now each benchmark job only does:
1. Run the benchmark
2. Upload the artifact
3. Run a lightweight github-script step that reads the local result.json,
   pulls accumulated data from a hidden JSON block in the PR comment,
   merges in its own result, regenerates the markdown inline, and updates
   the comment.

No artifact downloads, no shell script execution for reporting — just a
few GitHub API calls. The benchmark runner is freed immediately.

The accumulated results are stored as base64-encoded JSON in a hidden
HTML comment (<!-- benchmark-data:... -->) so each job can read what
previous jobs posted and build on it.
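
A shell-flavoured equivalent of what that step does, assuming the marker is literally `<!-- benchmark-data:<base64> -->` and the accumulated object is keyed by a per-job identifier. Both details are assumptions; the real step runs as JavaScript via actions/github-script.

```bash
# Sketch: pull the hidden base64 JSON out of the PR comment, merge this job's
# result under its own key, and re-encode the block for the next job to read.
set -euo pipefail

comment_body="$1"                              # current PR comment body
my_result_json="result.json"                   # this job's k6 result
my_key="no-recursion-constant-benchmarking"    # hypothetical job identifier

# Decode whatever previous jobs accumulated (empty object if nothing yet).
accumulated=$(grep -oP '(?<=<!-- benchmark-data:)[A-Za-z0-9+/=]+' <<<"$comment_body" \
  | base64 -d 2>/dev/null || echo '{}')

# Merge our own result, then re-encode for the updated comment.
merged=$(jq --arg k "$my_key" --slurpfile r "$my_result_json" \
  '. + {($k): $r[0]}' <<<"$accumulated")
echo "<!-- benchmark-data:$(base64 -w0 <<<"$merged") -->"
```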

The final "report" job on ubuntu-latest still runs after all benchmarks
complete to post the definitive result from artifacts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous pinning gave the gateway only 3 cores on an 8-core machine
while source schemas got the rest — starving the component being measured.
Cores 8-15 were also referenced on machines that only have 0-7.

New pinning matches the gateways benchmark repo:
  Constant (50 VUs):  k6 core 0, Gateway cores 1-2, Sources unpinned
  Ramping (500 VUs):  k6 core 0, Gateway cores 1-3, Sources unpinned

Same layout for all runner groups — pinning is by mode, not machine size.
Helper scripts now default to no pinning when env vars are unset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detects total cores via nproc and pins source schemas to whatever is
left after k6 (core 0) and gateway (1-2 or 1-3):

  Constant: k6=0, Gateway=1-2, Sources=3-(N-1)
  Ramping:  k6=0, Gateway=1-3, Sources=4-(N-1)

On an 8-core machine this gives sources cores 3-7 (constant) or
4-7 (ramping). Clean separation with no overlap.
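
A sketch of that computation, assuming the helper takes the mode as an argument; variable names and the taskset example are illustrative.

```bash
# Derive core ranges from the machine size: k6 on core 0, gateway on 1-2
# (constant) or 1-3 (ramping), source schemas on everything that remains.
set -euo pipefail

MODE="${1:-constant}"        # "constant" or "ramping"
TOTAL=$(nproc)
LAST=$((TOTAL - 1))

K6_CPUS="0"
if [[ "$MODE" == "ramping" ]]; then
  GATEWAY_CPUS="1-3"
  SOURCES_CPUS="4-$LAST"
else
  GATEWAY_CPUS="1-2"
  SOURCES_CPUS="3-$LAST"
fi

echo "k6=$K6_CPUS gateway=$GATEWAY_CPUS sources=$SOURCES_CPUS"
# e.g. taskset -c "$GATEWAY_CPUS" dotnet run ...   (invocation is illustrative)
```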

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The final report job used completed:999 which always won over progressive
updates. When cancel-in-progress killed jobs mid-run, the final report
had fewer artifacts than the accumulated comment data and overwrote it.

Now the final report job:
1. Reads artifacts AND the accumulated JSON from the existing PR comment
2. Merges both (artifacts win for conflicts, comment fills gaps)
3. Uses the same optimistic concurrency check — only updates if it has
   more data than what's already posted
4. Preserves the accumulated data block for future updates
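
In jq terms, "artifacts win for conflicts, comment fills gaps" can be sketched as a per-key object merge where the artifact side takes precedence. The file names are assumptions.

```bash
# comment-data.json  : results recovered from the hidden PR-comment block
# artifact-data.json : results rebuilt from the uploaded artifacts
set -euo pipefail

# Object addition keeps every key from both sides; where a key exists in both,
# the right-hand (artifact) value wins.
jq -s '.[0] + .[1]' comment-data.json artifact-data.json > merged.json

echo "Merged report covers $(jq 'length' merged.json) of 12 benchmark results"
```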

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a new commit triggers a new workflow run, the PR comment now
detects the mismatched run ID and discards accumulated data from
the previous run, preventing stale results from persisting.
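
A small sketch of that guard, assuming the accumulated block stores the originating run id under a top-level run_id key (the key name is an assumption; GITHUB_RUN_ID is the standard Actions environment variable).

```bash
# Discard accumulated data that belongs to a previous workflow run.
set -euo pipefail

current="$GITHUB_RUN_ID"
stored=$(jq -r '.run_id // empty' accumulated.json)

if [[ "$stored" != "$current" ]]; then
  echo "Run id changed ($stored -> $current); discarding accumulated data"
  printf '{"run_id": "%s"}\n' "$current" > accumulated.json
fi
```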

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of relying on run ID detection to reset stale data, a
dedicated setup job on ubuntu-latest now posts the all-pending
comment before any benchmark jobs run. Progressive updates only
accumulate data from the matching run ID as a safety net.
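
A sketch of what the setup job could run; the comment body and the empty data marker are illustrative, and the actual step may call the GitHub API directly instead of the gh CLI.

```bash
# Post the all-pending comment before any benchmark job starts, so the
# progressive updates only ever fill in results for the current run.
set -euo pipefail

PR_NUMBER="$1"

body="## Fusion Gateway Performance Results

All 12 benchmarks are pending for run $GITHUB_RUN_ID...

<!-- benchmark-data: -->"

gh pr comment "$PR_NUMBER" --body "$body"
```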

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>