
perf(ci): Improve benchmark stability with interleaved execution #1348

Merged
yamadashy merged 5 commits into main from perf/benchmark-interleave-extract-scripts
Mar 28, 2026

Conversation

yamadashy (Owner) commented Mar 28, 2026

Improve performance benchmark reliability by reducing variance in the PR vs main comparison, and improve maintainability by extracting inline scripts.

Changes

Interleaved execution

Switch from sequential execution (all PR runs → all main runs) to interleaved execution (PR → main alternating each iteration). This ensures both branches experience similar runner load conditions at each measurement point, significantly reducing variance in the difference between PR and main timings.

# Before (sequential): runner load changes between blocks skew the diff
PR, PR, PR, ..., main, main, main, ...

# After (interleaved): both branches share the same conditions per iteration
PR, main, PR, main, PR, main, ...
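
The alternation above can be sketched as a small schedule generator (a hypothetical helper for illustration, not the actual bench-run.mjs code):

```javascript
// Sketch of the interleaved schedule (hypothetical helper, not the
// actual bench-run.mjs implementation). Each iteration measures PR and
// main back to back, so transient runner load hits both branches
// roughly equally and mostly cancels out of the PR-minus-main diff.
function interleavedSchedule(runs) {
  const order = [];
  for (let i = 0; i < runs; i++) {
    order.push('PR', 'main');
  }
  return order;
}

console.log(interleavedSchedule(3).join(', ')); // → PR, main, PR, main, PR, main
```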

Increased measurement runs

  • Ubuntu: 10 → 20
  • macOS: 20 → 30
  • Windows: 10 → 20

More samples improve statistical stability, which matters especially because the benchmark runs on shared CI runners. The additional time (~30-50s per OS) is well within the 15-minute timeout.
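The benchmark comment reports each OS as median ± IQR, which is robust to the occasional outlier run on a shared runner. A rough sketch of those statistics (hypothetical helpers, assuming simple half-sample quartiles rather than whatever interpolation the actual scripts use):

```javascript
// Median and IQR over benchmark samples (hypothetical helpers; the
// actual scripts may use a different quartile interpolation).
function median(values) {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function iqr(values) {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  const lower = s.slice(0, mid);                       // below the median
  const upper = s.slice(s.length % 2 ? mid + 1 : mid); // above the median
  return median(upper) - median(lower);                // Q3 - Q1
}

const samples = [2.1, 2.0, 2.3, 2.1, 2.2, 2.0, 2.1];
console.log(`${median(samples)}s (±${iqr(samples).toFixed(2)}s)`); // → 2.1s (±0.20s)
```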

Extract scripts to separate files

Move inline Node.js scripts from YAML into .github/scripts/perf-benchmark/:

  • bench-run.mjs — Benchmark execution (interleaved measurement)
  • bench-pending.mjs — Pending comment generation
  • bench-comment.mjs — Results comment generation

This reduces the workflow YAML from ~370 lines to ~160 lines, enables proper syntax highlighting/linting, and makes the scripts easier to review and maintain. Jobs that only need the scripts use sparse-checkout for fast checkout.

Checklist

  • Run npm run test
  • Run npm run lint


…extract scripts

- Switch from sequential (all PR then all main) to interleaved execution
  (PR→main alternating) so both branches experience similar runner load
  conditions, reducing variance in the measured difference
- Increase measurement runs from 10/20/10 to 20/30/20 for better
  statistical stability
- Extract inline Node.js scripts from YAML into separate .mjs files
  under .github/scripts/perf-benchmark/ for maintainability
- Use sparse-checkout for jobs that only need the scripts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions bot (Contributor) commented Mar 28, 2026

⚡ Performance Benchmark

Latest commit: 50a2cc0 fix(ci): Exit with error when all benchmark runs fail
Status: ✅ Benchmark complete!
Ubuntu: 2.08s (±0.04s) → 2.08s (±0.04s) · +0.00s (+0.0%)
macOS: 1.44s (±0.18s) → 1.39s (±0.17s) · -0.05s (-3.4%)
Windows: 2.30s (±0.12s) → 2.30s (±0.13s) · -0.00s (-0.2%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded), interleaved execution
  • Measurement: 20 runs / 30 on macOS (median ± IQR)
  • Workflow run
History

1dbaebc refactor(ci): Address review feedback and fix lint errors

Ubuntu: 1.96s (±0.01s) → 1.96s (±0.02s) · +0.00s (+0.2%)
macOS: 1.70s (±0.17s) → 1.69s (±0.11s) · -0.01s (-0.6%)
Windows: 2.32s (±0.03s) → 2.33s (±0.05s) · +0.01s (+0.3%)

8e7cfe9 refactor(ci): Move history benchmark script to perf-benchmark-history/

Ubuntu: 2.16s (±0.02s) → 2.17s (±0.02s) · +0.01s (+0.3%)
macOS: 1.69s (±0.19s) → 1.73s (±0.17s) · +0.04s (+2.1%)
Windows: 2.81s (±0.06s) → 2.79s (±0.06s) · -0.02s (-0.7%)

coderabbitai bot (Contributor) commented Mar 28, 2026

Review skipped

Auto incremental reviews are disabled on this repository. To trigger a single review, invoke the @coderabbitai review command.
📝 Walkthrough

The PR refactors performance benchmarking infrastructure by extracting inline benchmark scripts from GitHub Actions workflows into dedicated Node.js files in .github/scripts/perf-benchmark and .github/scripts/perf-benchmark-history directories. Workflows are updated to invoke these external scripts instead of embedding logic inline, alongside adjustments to benchmark run counts and artifact handling.

Changes

Changes by cohort:

  • Benchmark Runner Scripts (.github/scripts/perf-benchmark-history/bench-run.mjs, .github/scripts/perf-benchmark/bench-run.mjs): Two benchmark runner scripts: one executes Repomix against a repo directory with warmup and statistical analysis (median/IQR); the other compares PR vs. main branches with per-run timing and writes results to JSON. Both perform warmup executions to stabilize the environment before measurement.
  • Benchmark Comment & History Scripts (.github/scripts/perf-benchmark/bench-comment.mjs, .github/scripts/perf-benchmark/bench-pending.mjs): Comment generation scripts: bench-comment.mjs generates completed benchmark results with history tables; bench-pending.mjs creates in-progress comments with embedded JSON history (max 50 entries) for future extraction and accumulation.
  • Workflow Updates (.github/workflows/perf-benchmark-history.yml, .github/workflows/perf-benchmark.yml): Replaced inline Node scripts with external script invocations; adjusted benchmark run counts (Ubuntu 10→20, macOS 20→30, Windows 10→20); added sparse-checkout steps in comment-generation jobs; simplified workflow logic by delegating to dedicated scripts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 passed

  • Title check: ✅ Passed. The title clearly and specifically describes the main change (improving benchmark stability through interleaved execution), which is the primary innovation in this changeset.
  • Description check: ✅ Passed. The description provides comprehensive context: it motivates the interleaved execution approach with examples, details measurement run increases per OS, explains script extraction benefits, and includes the required checklist items.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate; skipping docstring coverage check.


- Extract inline benchmark script to bench-run-history.mjs
- Increase measurement runs from 10/20/10 to 20/30/20 to match
  perf-benchmark.yml for consistency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cloudflare-workers-and-pages bot commented Mar 28, 2026

Deploying repomix with Cloudflare Pages

Latest commit: 50a2cc0
Status: ✅  Deploy successful!
Preview URL: https://44a49c4b.repomix.pages.dev
Branch Preview URL: https://perf-benchmark-interleave-ex.repomix.pages.dev

View logs

gemini-code-assist[bot]

This comment was marked as resolved.

@claude

This comment has been minimized.

devin-ai-integration bot (Contributor) left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.


@claude

This comment has been minimized.

codecov bot commented Mar 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.13%. Comparing base (fe6da90) to head (50a2cc0).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1348   +/-   ##
=======================================
  Coverage   87.13%   87.13%           
=======================================
  Files         116      116           
  Lines        4393     4393           
  Branches     1020     1020           
=======================================
  Hits         3828     3828           
  Misses        565      565           

☔ View full report in Codecov by Sentry.

coderabbitai[bot]

This comment was marked as resolved.

- Alternate PR/main execution order on even/odd iterations to neutralize
  ordering bias from CPU/filesystem cache warming
- Add try/catch in measurement loops so a single failure doesn't lose
  all data; abort if all runs fail
- Extract shared esc(), extractHistory(), renderHistory() into
  bench-utils.mjs to eliminate duplication between pending and comment
- Add error logging for JSON parse failures instead of silent catch
- Fix biome lint: use template literals, sort imports, expand
  single-line try/catch blocks, avoid assignment in expressions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
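Two of the ideas in this commit (alternating the PR/main order per iteration and surviving single-run failures) can be sketched as follows. This is hypothetical illustration code, not the actual bench-run.mjs, which spawns the CLI asynchronously; here the run functions just return a timing:

```javascript
// Sketch: alternate PR/main order on even/odd iterations to cancel
// cache-warming bias, tolerate individual run failures, and abort only
// when ALL runs of a branch fail (instead of silently reporting 0ms).
function measure(runs, runPr, runMain) {
  const pr = [];
  const main = [];
  for (let i = 0; i < runs; i++) {
    // Even iterations run PR first, odd iterations run main first.
    const order =
      i % 2 === 0 ? [[runPr, pr], [runMain, main]] : [[runMain, main], [runPr, pr]];
    for (const [run, samples] of order) {
      try {
        samples.push(run()); // one timing sample
      } catch (err) {
        console.error(`run ${i} failed: ${err.message}`); // keep going
      }
    }
  }
  if (pr.length === 0 || main.length === 0) {
    throw new Error('all benchmark runs failed'); // fail the workflow step
  }
  return { pr, main };
}

const calls = [];
const result = measure(
  2,
  () => { calls.push('PR'); return 1.0; },
  () => { calls.push('main'); return 1.1; },
);
console.log(calls.join(' ')); // → PR main main PR
```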
coderabbitai[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

Add early exit guard matching bench-run-history.mjs behavior, so a
broken build fails the workflow step instead of silently reporting 0ms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yamadashy yamadashy merged commit bd9f343 into main Mar 28, 2026
61 checks passed
@yamadashy yamadashy deleted the perf/benchmark-interleave-extract-scripts branch March 28, 2026 15:01