Restructure benchmark workflow with per-test jobs and dual runner groups #9217
Merged
michaelstaib merged 9 commits into main on Feb 25, 2026
…nner groups

Split the monolithic benchmark workflow into separate jobs per benchmark (no-recursion, deep-recursion, variable-batch) x mode (constant, ramping) x runner group (Benchmarking, Benchmarking-2) using a matrix strategy.

Each of the 12 jobs runs independently and progressively updates the PR comment as it completes, giving immediate feedback. The report table distinguishes runner groups as "Constant 1" / "Constant 2" etc.

New scripts:
- run-single-benchmark.sh: runs a single test+mode combination
- generate-report.sh: merges available results into a combined report

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
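As a rough illustration of how the 3 x 2 x 2 matrix expands into 12 jobs, the sketch below enumerates the combinations in Python. The benchmark, mode, and runner group names come from the commit message; the dictionary keys and the `column_label` helper are assumptions made for this example and are not taken from the workflow file.

```python
from itertools import product

# Names as listed in the commit message.
BENCHMARKS = ["no-recursion", "deep-recursion", "variable-batch"]
MODES = ["constant", "ramping"]
RUNNER_GROUPS = ["Benchmarking", "Benchmarking-2"]


def build_matrix():
    """Return one dict per job: every benchmark x mode x runner group combination."""
    # The keys "test", "mode", "runner_group" are hypothetical.
    return [
        {"test": test, "mode": mode, "runner_group": group}
        for test, mode, group in product(BENCHMARKS, MODES, RUNNER_GROUPS)
    ]


def column_label(mode, runner_group):
    """Build a report column label such as "Constant 1" (hypothetical helper)."""
    index = RUNNER_GROUPS.index(runner_group) + 1
    return f"{mode.capitalize()} {index}"


jobs = build_matrix()
print(len(jobs))                                   # 12
print(column_label("constant", "Benchmarking-2"))  # Constant 2
```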
Fusion Gateway Performance Results
Simple Composite Query
Response Times
Deep Recursion Query
Response Times
Variable Batching Throughput
Response Times
Runner 1 = Benchmarking, Runner 2 = Benchmarking-2
Run 22403122842 • Commit 60ac985 • Wed, 25 Feb 2026 19:48:42 GMT
Per-job updates now track a completion count in a hidden HTML marker and only overwrite the PR comment if they have more data than what's already posted.

A separate "Final Performance Report" job runs after all benchmarks complete (using needs + if: always()) and posts the definitive result with all available data, guaranteeing correctness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
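The "only overwrite if we have more data" rule can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the marker name `benchmark-completed` and both function names are assumptions, since the commit message does not spell out the marker format.

```python
import re

# Hypothetical marker format; the real marker text is not given in the commit.
MARKER_RE = re.compile(r"<!--\s*benchmark-completed:(\d+)\s*-->")


def posted_count(comment_body):
    """Return the completion count stored in the comment, or 0 if none is found."""
    match = MARKER_RE.search(comment_body or "")
    return int(match.group(1)) if match else 0


def should_overwrite(existing_body, new_count):
    """Overwrite only when this job knows about strictly more completed results."""
    return new_count > posted_count(existing_body)


existing = "## Results\n<!-- benchmark-completed:5 -->"
print(should_overwrite(existing, 6))  # True
print(should_overwrite(existing, 5))  # False
```

Under this rule, a slower job that finishes its update late cannot replace a fuller comment with a less complete one.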
Benchmarking-2 machines have 8 cores instead of 16. The script now selects the CPU pinning profile based on the runner group:

Benchmarking (16 cores): k6: 0-1, Gateway: 2-4/2-5, Sources: 5-15/6-15, Inventory: 2-5
Benchmarking-2 (8 cores): k6: 0, Gateway: 1-2, Sources: 3-7, Inventory: 1-7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pproach

The benchmark jobs were wasting runner time on artifact downloads, report generation, and comment updates. Now each benchmark job only does:

1. Run the benchmark
2. Upload the artifact
3. A lightweight github-script step that reads the local result.json, pulls accumulated data from a hidden JSON block in the PR comment, merges its own result, regenerates the markdown inline, and updates.

No artifact downloads and no shell script execution for reporting, just a few GitHub API calls. The benchmark runner is freed immediately.

The accumulated results are stored as base64-encoded JSON in a hidden HTML comment (<!-- benchmark-data:... -->) so each job can read what previous jobs posted and build on it.

The final "report" job on ubuntu-latest still runs after all benchmarks complete to post the definitive result from artifacts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
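The hidden data block described above could work roughly like this sketch. The actual step is a github-script (JavaScript) step; Python is used here only for illustration. The `<!-- benchmark-data:... -->` wrapper comes from the commit message, while the function names, the result keys, and the shape of the JSON payload are assumptions.

```python
import base64
import json
import re

# Wrapper taken from the commit message; the payload layout is assumed.
DATA_RE = re.compile(r"<!--\s*benchmark-data:([A-Za-z0-9+/=]+)\s*-->")


def decode_data(comment_body):
    """Extract the accumulated results from a comment body, or {} if absent."""
    match = DATA_RE.search(comment_body or "")
    if not match:
        return {}
    return json.loads(base64.b64decode(match.group(1)).decode("utf-8"))


def encode_data(results):
    """Serialize results into the hidden HTML comment."""
    raw = json.dumps(results, sort_keys=True).encode("utf-8")
    return f"<!-- benchmark-data:{base64.b64encode(raw).decode('ascii')} -->"


def merge_own_result(comment_body, key, result):
    """Read what earlier jobs posted and add this job's result under `key`."""
    data = decode_data(comment_body)
    data[key] = result
    return data


# Hypothetical keys and metric names, for illustration only.
body = "## Results\n" + encode_data({"no-recursion/constant/1": {"p95": 12.5}})
merged = merge_own_result(body, "no-recursion/ramping/1", {"p95": 20.1})
print(sorted(merged))
```

Standard base64 output contains only letters, digits, `+`, `/`, and `=`, so the payload cannot contain `-->` and close the HTML comment early.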
The previous pinning gave the gateway only 3 cores on an 8-core machine while source schemas got the rest, starving the component being measured. Cores 8-15 were also referenced on machines that only have 0-7.

New pinning matches the gateways benchmark repo:

Constant (50 VUs): k6 core 0, Gateway cores 1-2, Sources unpinned
Ramping (500 VUs): k6 core 0, Gateway cores 1-3, Sources unpinned

Same layout for all runner groups: pinning is by mode, not machine size. Helper scripts now default to no pinning when env vars are unset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detects total cores via nproc and pins source schemas to whatever is left after k6 (core 0) and gateway (1-2 or 1-3):

Constant: k6=0, Gateway=1-2, Sources=3-(N-1)
Ramping: k6=0, Gateway=1-3, Sources=4-(N-1)

On an 8-core machine this gives sources cores 3-7 (constant) or 4-7 (ramping). Clean separation with no overlap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
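The core layout above can be worked through with a short sketch. The real script is a shell script that reads the core count from nproc; here the count is passed in as a parameter, and the function names and returned dictionary are assumptions made for the example.

```python
def cpu_range(first, last):
    """Format a CPU list such as "3-7", or "3" when the range is a single core."""
    return str(first) if first == last else f"{first}-{last}"


def pinning(mode, total_cores):
    """Compute the pinning layout for a mode on a machine with `total_cores` cores.

    k6 gets core 0, the gateway gets 1-2 (constant) or 1-3 (ramping),
    and the source schemas get every remaining core up to N-1.
    """
    if mode == "constant":
        gateway_last = 2
    elif mode == "ramping":
        gateway_last = 3
    else:
        raise ValueError(f"unknown mode: {mode}")

    sources_first = gateway_last + 1
    if total_cores <= sources_first:
        raise ValueError("not enough cores left for the source schemas")

    return {
        "k6": "0",
        "gateway": cpu_range(1, gateway_last),
        "sources": cpu_range(sources_first, total_cores - 1),
    }


print(pinning("constant", 8))  # sources: 3-7
print(pinning("ramping", 8))   # sources: 4-7
```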
The final report job used completed:999 which always won over progressive updates. When cancel-in-progress killed jobs mid-run, the final report had fewer artifacts than the accumulated comment data and overwrote it.

Now the final report job:

1. Reads artifacts AND the accumulated JSON from the existing PR comment
2. Merges both (artifacts win for conflicts, comment fills gaps)
3. Uses the same optimistic concurrency check: only updates if it has more data than what's already posted
4. Preserves the accumulated data block for future updates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
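The merge precedence in the list above (artifacts win for conflicts, comment fills gaps) can be sketched as a dictionary merge. The function names and the flat key-to-result layout are assumptions; the commit message does not describe the stored structure.

```python
def merge_final(artifact_results, comment_results):
    """Merge both sources: comment data fills gaps, artifacts win on conflicts."""
    merged = dict(comment_results)   # start from what the PR comment already holds
    merged.update(artifact_results)  # artifact values overwrite on shared keys
    return merged


def should_post(merged, posted_results):
    """Optimistic check: post only when there is more data than already posted."""
    return len(merged) > len(posted_results)


# Hypothetical keys and values, for illustration only.
from_comment = {"a/constant/1": {"p95": 10}, "a/ramping/1": {"p95": 30}}
from_artifacts = {"a/constant/1": {"p95": 11}, "b/constant/1": {"p95": 15}}

final = merge_final(from_artifacts, from_comment)
print(final["a/constant/1"])             # {'p95': 11}
print(should_post(final, from_comment))  # True
```

With this check, a final job that sees fewer artifacts after a cancellation leaves the fuller comment in place instead of overwriting it.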
When a new commit triggers a new workflow run, the PR comment now detects the mismatched run ID and discards accumulated data from the previous run, preventing stale results from persisting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of relying on run ID detection to reset stale data, a dedicated setup job on ubuntu-latest now posts the all-pending comment before any benchmark jobs run. Progressive updates only accumulate data from the matching run ID as a safety net.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
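The run ID safety net might look like the sketch below. The stored structure (a `run_id` field next to a `results` map) and the function name are assumptions for illustration; the commit message only says that data from a non-matching run is not accumulated.

```python
def accumulated_for_run(stored, current_run_id):
    """Return previously accumulated results only if they belong to this run."""
    if stored.get("run_id") != current_run_id:
        return {}  # stale data from an earlier workflow run is ignored
    return dict(stored.get("results", {}))


# Hypothetical stored payload; the run ID is the one shown in the report footer.
stored = {"run_id": "22403122842", "results": {"a/constant/1": {"p95": 10}}}
print(accumulated_for_run(stored, "22403122842"))  # {'a/constant/1': {'p95': 10}}
print(accumulated_for_run(stored, "some-other-run"))  # {}
```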
This was referenced May 1, 2026
Summary
New files
- run-single-benchmark.sh - Runs a single test+mode combination with median calculation for constant mode
- generate-report.sh - Merges all available result JSONs into a combined markdown report

Test plan
🤖 Generated with Claude Code