perf(daemon): add sync bench comparator#397

Open
andreabadesso wants to merge 1 commit into feat/daemon-sync-bench-harness from feat/daemon-bench-comparator

Conversation

@andreabadesso
Collaborator

Motivation

Second piece of the sync-benchmarking infrastructure (follows #396). Given two bench-results-*.json files — one for master, one for a PR branch — we need a rigorous way to decide whether a change actually moved the needle. A single point-estimate comparison is misleading at the run counts this harness produces; this PR bootstraps a 95% CI on the median delta for each metric instead.

Stacked on #396 — targets feat/daemon-sync-bench-harness. Once #396 lands, rebase this onto master.

Acceptance Criteria

  • yarn bench:compare --baseline a.json --candidate b.json reads two bench outputs and emits a markdown comparison to stdout
  • Bootstrap resampling (default 10,000 samples, deterministic via fixed seed) produces a 95% CI on median delta for each metric
  • Each metric is flagged 🟢 (CI fully below 0), 🔴 (CI fully above 0), ⚪ (CI crosses 0 — noise), or ⚠️ (missing/invalid)
  • Tool exits 0 regardless of regressions — output is informational, suitable for piping to gh pr comment
  • bench-sync.ts output now includes raw samples: number[] per metric (required for bootstrap)
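The bootstrap CI described in the criteria above can be sketched as follows. This is a minimal TypeScript sketch, not the comparator's actual API — function names and structure are illustrative; only the parameters (10,000 resamples, fixed seed, median delta, 2.5th/97.5th percentiles) come from the criteria:

```typescript
// mulberry32: tiny deterministic PRNG, so a fixed seed always
// reproduces the same resamples and therefore the same CI.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function median(xs: number[]): number {
  const s = [...xs].sort((x, y) => x - y);
  const m = s.length >> 1;
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// Resample both sides with replacement, compute the median delta per
// resample, and take the 2.5th/97.5th percentiles as the 95% CI.
function bootstrapMedianDeltaCI(
  baseline: number[],
  candidate: number[],
  samples = 10_000,
  seed = 42,
): [number, number] {
  const rand = mulberry32(seed);
  const resample = (xs: number[]) =>
    xs.map(() => xs[Math.floor(rand() * xs.length)]);
  const deltas: number[] = [];
  for (let i = 0; i < samples; i++) {
    deltas.push(median(resample(candidate)) - median(resample(baseline)));
  }
  deltas.sort((x, y) => x - y);
  return [
    deltas[Math.floor(samples * 0.025)],
    deltas[Math.floor(samples * 0.975)],
  ];
}
```

With only 5 runs per side, the resampled medians vary widely, which is exactly why the CI (rather than the point delta) is what gets reported.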

Self-test (same code, two runs, 5 measured runs each)

Scenario: VOIDED_TOKEN_AUTHORITY (66 events)

  • 19 metrics compared
  • 18 ⚪ noise (CI crosses 0 — as expected for identical code)
  • 1 🔴 false positive on markUtxosAsVoided (+20.9%, CI [+2.0%, +37.8%])

One false positive across 19 independent 95% CIs is exactly what theory predicts (≈0.95 expected). This validates the decision to keep the report informational. The totalMs CI of [-14.5%, +19.5%] quantifies the other known limitation: scenarios need to grow to the thousands-of-events range before sub-20% deltas are reliably detectable.
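The "≈0.95 expected" figure is just the multiple-comparisons arithmetic, sketched here for clarity (variable names are illustrative):

```typescript
// With 19 independent 95% CIs on identical code, each interval has a
// 5% chance of excluding zero purely by chance.
const metrics = 19;
const alpha = 0.05;

// Expected number of chance "significant" metrics: 19 × 0.05 = 0.95.
const expectedFalsePositives = metrics * alpha;

// Probability of seeing at least one: 1 − 0.95^19 ≈ 0.62.
const pAtLeastOne = 1 - Math.pow(1 - alpha, metrics);
```

So a run with exactly one 🔴 on identical code is the modal outcome, not a malfunction.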

Example output

## Sync benchmark comparison

**Scenario:** `VOIDED_TOKEN_AUTHORITY` (66 events)
**Runs:** baseline=5, candidate=5, warmup=1/1
**Bootstrap samples:** 10000, seed: 42
**Verdict:** 🟢 0 improvements · 🔴 1 regression · ⚪ 18 noise · ⚠️ 0 skipped

| metric | baseline p50 (ms) | candidate p50 (ms) | Δ | 95% CI | verdict |
|---|---:|---:|---:|---|:---:|
| totalMs | 1253.6 | 1194.6 | -4.7% | [-14.5%, +19.5%] | ⚪ |
| handleVertexAccepted | 490.6 | 438.1 | -10.7% | [-22.7%, +19.8%] | ⚪ |
| markUtxosAsVoided | 2.369 | 2.865 | +20.9% | [+2.0%, +37.8%] | 🔴 |
...

Implementation notes

  • Deterministic bootstrap (mulberry32 PRNG, fixed seed) so the same inputs always produce the same CIs — comparable across runs
  • Median rather than mean — less sensitive to the inevitable slow outlier run
  • Scenario mismatch between files aborts with exit 1
  • Missing spans on either side are reported as ⚠️ rather than silently dropped
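The flagging rule from the acceptance criteria depends only on where a metric's CI sits relative to zero. A hypothetical sketch (type and function names are illustrative, not the tool's actual code):

```typescript
type Verdict = "🟢" | "🔴" | "⚪" | "⚠️";

// Classify a metric's 95% CI on the median delta (in % or ms — the
// rule is the same): null or non-finite bounds mean missing/invalid.
function classify(ci: [number, number] | null): Verdict {
  if (ci === null || ci.some((x) => !Number.isFinite(x))) return "⚠️";
  const [lo, hi] = ci;
  if (hi < 0) return "🟢"; // CI fully below 0: improvement
  if (lo > 0) return "🔴"; // CI fully above 0: regression
  return "⚪"; // CI crosses 0: indistinguishable from noise
}
```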

Checklist

  • If you are requesting a merge into master, confirm this code is production-ready and can be included in future releases as soon as it gets merged
  • Make sure either the unit tests and/or the QA tests are capable of testing the new features (self-test on identical code documented above)
  • Make sure you do not include new dependencies in the project unless strictly necessary and do not include dev-dependencies as production ones. (no new deps)

🤖 Generated with Claude Code

Reads two bench-results JSON files and emits a markdown report comparing
per-metric medians, with 95% CI on the median delta via bootstrap
resampling. Informational only — no exit gating, since CI runner variance
is too high for a hard threshold to be reliable at the run counts we can
afford.

Also extends bench-sync.ts output to include raw per-run samples, which
the comparator needs for bootstrapping (summary stats alone are not
enough).

Self-test on identical code (5 runs × 2) shows 18 of 19 metrics as ⚪
noise, 1 false-positive 🔴 — matching the ~1 false positive expected
from 19 independent 95% CIs. The totalMs CI ([-14.5%, +19.5%]) confirms
the known scaling limitation: scenarios need to be in the thousands-of-events range before sub-20% deltas become detectable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 17, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@andreabadesso andreabadesso self-assigned this Apr 17, 2026
@andreabadesso andreabadesso added the enhancement New feature or request label Apr 17, 2026
@andreabadesso andreabadesso moved this from Todo to In Progress (WIP) in Hathor Network Apr 17, 2026
andreabadesso added a commit that referenced this pull request Apr 17, 2026
Third piece of the benchmarking infrastructure. On PRs that touch
packages/daemon, runs the bench against both master and the PR branch
in parallel, then posts (or updates) a sticky PR comment with the
comparator output.

Key design choices (rationale in #396 review thread):
- Matrix strategy — two runners in parallel, each with its own MySQL +
  simulator containers, no cross-run state bleed.
- Bench scripts from PR head are overlaid onto the baseline checkout
  (via refs/pull/N/head so fork PRs work too) so the measurement tool
  stays constant across the comparison and baseline works even before
  the harness has landed on master.
- No exit gating. continue-on-error on every bench step — CI runner
  variance is too high for a hard threshold to mean anything at the run
  counts we can afford.
- Report also emitted to the job summary, so fork PRs (where
  GITHUB_TOKEN is read-only) still surface results.
- concurrency.cancel-in-progress: true to avoid stacking stale runs
  when a PR is pushed to repeatedly.
- Starts at 5 runs × 1 warmup per side; dial up once the scenario
  grows past its current 66-event ceiling.

Also adds a warning comment at the top of bench-sync.ts flagging the
overlay constraint: any symbol this script references must also exist
on master.

Depends on #396 (harness) and #397 (comparator). Targets #397 so the
three PRs can be reviewed as a stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
