perf(daemon): add sync bench comparator #397
Open
andreabadesso wants to merge 1 commit into
Conversation
Reads two bench-results JSON files and emits a markdown report comparing per-metric medians, with a 95% CI on the median delta via bootstrap resampling. Informational only — no exit gating, since CI-runner variance is too high for a hard threshold to be reliable at the run counts we can afford.

Also extends `bench-sync.ts` output to include raw per-run samples, which the comparator needs for bootstrapping (summary stats alone are not enough).

Self-test on identical code (5 runs × 2) shows 18 of 19 metrics as ⚪ noise and 1 false-positive 🔴 — matching the ~1 false positive expected from 19 independent 95% CIs. The `totalMs` CI (`[-14.5%, +19.5%]`) confirms the known scaling limitation: scenarios need to be in the thousands-of-events range before sub-20% deltas become detectable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
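The bootstrap described above can be sketched as follows. This is a minimal illustration of a percentile-bootstrap 95% CI on the relative median delta, not the PR's actual code; all function names here are hypothetical.

```typescript
// Sketch (assumed, not the PR's implementation): percentile bootstrap of the
// relative median delta between baseline and candidate samples.
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Resample with replacement, same size as the input.
function resample(xs: number[]): number[] {
  return xs.map(() => xs[Math.floor(Math.random() * xs.length)]);
}

/** 95% CI on (median(candidate) - median(baseline)) / median(baseline). */
function bootstrapMedianDeltaCI(
  baseline: number[],
  candidate: number[],
  iterations = 10_000,
): { lo: number; hi: number } {
  const deltas: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const b = median(resample(baseline));
    deltas.push((median(resample(candidate)) - b) / b);
  }
  deltas.sort((a, b) => a - b);
  return {
    lo: deltas[Math.floor(iterations * 0.025)], // 2.5th percentile
    hi: deltas[Math.floor(iterations * 0.975)], // 97.5th percentile
  };
}
```

This is why the raw `samples` arrays must be in the bench output: bootstrapping resamples the individual runs, which a precomputed median or mean cannot provide.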
andreabadesso added a commit that referenced this pull request on Apr 17, 2026:
Third piece of the benchmarking infrastructure. On PRs that touch packages/daemon, runs the bench against both master and the PR branch in parallel, then posts (or updates) a sticky PR comment with the comparator output.

Key design choices (rationale in #396 review thread):

- Matrix strategy — two runners in parallel, each with its own MySQL + simulator containers, no cross-run state bleed.
- Bench scripts from the PR head are overlaid onto the baseline checkout (via refs/pull/N/head, so fork PRs work too), so the measurement tool stays constant across the comparison and the baseline works even before the harness has landed on master.
- No exit gating. continue-on-error on every bench step — CI-runner variance is too high for a hard threshold to mean anything at the run counts we can afford.
- Report also emitted to the job summary, so fork PRs (where GITHUB_TOKEN is read-only) still surface results.
- concurrency.cancel-in-progress: true to avoid stacking stale runs when a PR is pushed to repeatedly.
- Starts at 5 runs × 1 warmup per side; dial up once the scenario grows past its current 66-event ceiling.

Also adds a warning comment at the top of bench-sync.ts flagging the overlay constraint: any symbol this script references must also exist on master.

Depends on #396 (harness) and #397 (comparator). Targets #397 so the three PRs can be reviewed as a stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
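The workflow shape those design choices imply can be sketched as a GitHub Actions fragment. This is illustrative only — the job names, paths, and step commands are assumptions, not the actual workflow file from that commit:

```yaml
# Illustrative fragment; names and paths are assumed, not the PR's actual file.
name: sync-bench
on:
  pull_request:
    paths:
      - 'packages/daemon/**'

# Cancel stale runs when the PR is pushed to repeatedly.
concurrency:
  group: sync-bench-${{ github.event.pull_request.number }}
  cancel-in-progress: true

jobs:
  bench:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        side: [baseline, candidate]   # two runners in parallel, no shared state
    steps:
      - uses: actions/checkout@v4
        with:
          # baseline measures master; candidate measures the PR head.
          # refs/pull/N/head also resolves for fork PRs.
          ref: ${{ matrix.side == 'baseline' && 'master' || format('refs/pull/{0}/head', github.event.pull_request.number) }}
      - run: yarn bench:sync
        continue-on-error: true   # informational only; never gates the PR
```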
This was referenced Apr 17, 2026
Motivation
Second piece of the sync-benchmarking infrastructure (follows #396). Given two `bench-results-*.json` files — one for master, one for a PR branch — we need a rigorous way to decide whether a change actually moved the needle. A single point-estimate comparison is misleading at the run counts this harness produces; this PR bootstraps a 95% CI on the median delta for each metric instead.

Stacked on #396 — targets `feat/daemon-sync-bench-harness`. Once #396 lands, rebase this onto `master`.

Acceptance Criteria
- `yarn bench:compare --baseline a.json --candidate b.json` reads two bench outputs and emits a markdown comparison to stdout
- markdown output can be posted via `gh pr comment`
- `bench-sync.ts` output now includes raw `samples: number[]` per metric (required for bootstrap)

Self-test (same code, two runs, 5 measured runs each)
Scenario: `VOIDED_TOKEN_AUTHORITY` (66 events). The one false positive: `markUtxosAsVoided` (+20.9%, CI [+2.0%, +37.8%]).

One false positive across 19 independent 95% CIs is exactly what theory predicts (≈0.95 expected). This validates the decision to keep the report informational. The `totalMs` CI of `[-14.5%, +19.5%]` quantifies the other known limitation: scenarios need to grow to the thousands-of-events range before sub-20% deltas are reliably detectable.

Example output
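An illustrative rendering of the report format, built only from the two CIs quoted above; the column layout is assumed, not the comparator's exact output, and the median columns are left blank where the values are not stated in this description:

```markdown
| Metric            | Δ median | 95% CI           | Verdict |
|-------------------|----------|------------------|---------|
| markUtxosAsVoided | +20.9%   | [+2.0%, +37.8%]  | 🔴      |
| totalMs           | …        | [-14.5%, +19.5%] | ⚪      |
```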
Implementation notes
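Two pieces the comparator relies on can be sketched in TypeScript: the extended per-metric output shape carrying raw samples, and the CI-to-emoji verdict implied by the 🔴/🟢/⚪ markers above. Field names and the interface itself are assumptions for illustration, not the PR's actual schema.

```typescript
// Assumed shape of the extended bench output; names are illustrative.
interface MetricResult {
  median: number;
  samples: number[]; // raw per-run values; summary stats alone can't be resampled
}

// 95% CI on the relative median delta,
// e.g. { lo: 0.02, hi: 0.378 } means [+2.0%, +37.8%].
interface DeltaCI {
  lo: number;
  hi: number;
}

// A CI that excludes zero flags a significant slowdown (🔴) or speedup (🟢);
// a CI straddling zero is indistinguishable from noise (⚪).
function classify(ci: DeltaCI): '🔴' | '🟢' | '⚪' {
  if (ci.lo > 0) return '🔴';
  if (ci.hi < 0) return '🟢';
  return '⚪';
}
```

Under this rule the self-test's `markUtxosAsVoided` CI of [+2.0%, +37.8%] is the lone 🔴, while the `totalMs` CI of [-14.5%, +19.5%] straddles zero and reads ⚪.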
Checklist
- If this is being merged into `master`, confirm this code is production-ready and can be included in future releases as soon as it gets merged

🤖 Generated with Claude Code