perf(daemon): add sync bench comparator#397

Open
andreabadesso wants to merge 1 commit into feat/daemon-sync-bench-harness from feat/daemon-bench-comparator

Conversation

@andreabadesso
Collaborator

Motivation

Second piece of the sync-benchmarking infrastructure (follows #396). Given two bench-results-*.json files — one for master, one for a PR branch — we need a rigorous way to decide whether a change actually moved the needle. A single point-estimate comparison is misleading at the run counts this harness produces; this PR bootstraps a 95% CI on the median delta for each metric instead.

Stacked on #396 — targets feat/daemon-sync-bench-harness. Once #396 lands, rebase this onto master.

Acceptance Criteria

  • yarn bench:compare --baseline a.json --candidate b.json reads two bench outputs and emits a markdown comparison to stdout
  • Bootstrap resampling (default 10,000 samples, deterministic via fixed seed) produces a 95% CI on median delta for each metric
  • Each metric is flagged 🟢 (CI fully below 0), 🔴 (CI fully above 0), ⚪ (CI crosses 0 — noise), or ⚠️ (missing/invalid)
  • Tool exits 0 regardless of regressions — output is informational, suitable for piping to gh pr comment
  • bench-sync.ts output now includes raw samples: number[] per metric (required for bootstrap)
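The bootstrap CI described in the criteria above can be sketched as follows. This is a minimal TypeScript sketch, not the comparator's actual API — function names and structure are illustrative; only the parameters (10,000 resamples, fixed seed, median delta, 2.5th/97.5th percentiles) come from the criteria:

```typescript
// mulberry32: tiny deterministic PRNG, so a fixed seed always
// reproduces the same resamples and therefore the same CI.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function median(xs: number[]): number {
  const s = [...xs].sort((x, y) => x - y);
  const m = s.length >> 1;
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// Resample both sides with replacement, compute the median delta per
// resample, and take the 2.5th/97.5th percentiles as the 95% CI.
function bootstrapMedianDeltaCI(
  baseline: number[],
  candidate: number[],
  samples = 10_000,
  seed = 42,
): [number, number] {
  const rand = mulberry32(seed);
  const resample = (xs: number[]) =>
    xs.map(() => xs[Math.floor(rand() * xs.length)]);
  const deltas: number[] = [];
  for (let i = 0; i < samples; i++) {
    deltas.push(median(resample(candidate)) - median(resample(baseline)));
  }
  deltas.sort((x, y) => x - y);
  return [
    deltas[Math.floor(samples * 0.025)],
    deltas[Math.floor(samples * 0.975)],
  ];
}
```

With only 5 runs per side, the resampled medians vary widely, which is exactly why the CI (rather than the point delta) is what gets reported.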

Self-test (same code, two runs, 5 measured runs each)

Scenario: VOIDED_TOKEN_AUTHORITY (66 events)

  • 19 metrics compared
  • 18 ⚪ noise (CI crosses 0 — as expected for identical code)
  • 1 🔴 false positive on markUtxosAsVoided (+20.9%, CI [+2.0%, +37.8%])

One false positive across 19 independent 95% CIs is exactly what theory predicts (≈0.95 expected). This validates the decision to keep the report informational. The totalMs CI of [-14.5%, +19.5%] quantifies the other known limitation: scenarios need to grow to the thousands-of-events range before sub-20% deltas are reliably detectable.
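The "≈0.95 expected" figure is just the multiple-comparisons arithmetic, sketched here for clarity (variable names are illustrative):

```typescript
// With 19 independent 95% CIs on identical code, each interval has a
// 5% chance of excluding zero purely by chance.
const metrics = 19;
const alpha = 0.05;

// Expected number of chance "significant" metrics: 19 × 0.05 = 0.95.
const expectedFalsePositives = metrics * alpha;

// Probability of seeing at least one: 1 − 0.95^19 ≈ 0.62.
const pAtLeastOne = 1 - Math.pow(1 - alpha, metrics);
```

So a run with exactly one 🔴 on identical code is the modal outcome, not a malfunction.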

Example output

## Sync benchmark comparison

**Scenario:** `VOIDED_TOKEN_AUTHORITY` (66 events)
**Runs:** baseline=5, candidate=5, warmup=1/1
**Bootstrap samples:** 10000, seed: 42
**Verdict:** 🟢 0 improvements · 🔴 1 regression · ⚪ 18 noise · ⚠️ 0 skipped

| metric | baseline p50 (ms) | candidate p50 (ms) | Δ | 95% CI | verdict |
|---|---:|---:|---:|---|:---:|
| totalMs | 1253.6 | 1194.6 | -4.7% | [-14.5%, +19.5%] | ⚪ |
| handleVertexAccepted | 490.6 | 438.1 | -10.7% | [-22.7%, +19.8%] | ⚪ |
| markUtxosAsVoided | 2.369 | 2.865 | +20.9% | [+2.0%, +37.8%] | 🔴 |
...

Implementation notes

  • Deterministic bootstrap (mulberry32 PRNG, fixed seed) so the same inputs always produce the same CIs — comparable across runs
  • Median rather than mean — less sensitive to the inevitable slow outlier run
  • Scenario mismatch between files aborts with exit 1
  • Missing spans on either side are reported as ⚠️ rather than silently dropped
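The flagging rule from the acceptance criteria depends only on where a metric's CI sits relative to zero. A hypothetical sketch (type and function names are illustrative, not the tool's actual code):

```typescript
type Verdict = "🟢" | "🔴" | "⚪" | "⚠️";

// Classify a metric's 95% CI on the median delta (in % or ms — the
// rule is the same): null or non-finite bounds mean missing/invalid.
function classify(ci: [number, number] | null): Verdict {
  if (ci === null || ci.some((x) => !Number.isFinite(x))) return "⚠️";
  const [lo, hi] = ci;
  if (hi < 0) return "🟢"; // CI fully below 0: improvement
  if (lo > 0) return "🔴"; // CI fully above 0: regression
  return "⚪"; // CI crosses 0: indistinguishable from noise
}
```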

Checklist

  • If you are requesting a merge into master, confirm this code is production-ready and can be included in future releases as soon as it gets merged
  • Make sure either the unit tests and/or the QA tests are capable of testing the new features (self-test on identical code documented above)
  • Make sure you do not include new dependencies in the project unless strictly necessary and do not include dev-dependencies as production ones. (no new deps)

🤖 Generated with Claude Code

Reads two bench-results JSON files and emits a markdown report comparing
per-metric medians, with 95% CI on the median delta via bootstrap
resampling. Informational only — no exit gating, since CI runner variance
is too high for a hard threshold to be reliable at the run counts we can
afford.

Also extends bench-sync.ts output to include raw per-run samples, which
the comparator needs for bootstrapping (summary stats alone are not
enough).

Self-test on identical code (5 runs × 2) shows 18 of 19 metrics as ⚪
noise, 1 false-positive 🔴 — matching the ~1 false positive expected
from 19 independent 95% CIs. The totalMs CI ([-14.5%, +19.5%]) confirms
the known scaling limitation: scenarios need to be in the thousands-of-events range before sub-20% deltas become detectable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 17, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@andreabadesso andreabadesso self-assigned this Apr 17, 2026
@andreabadesso andreabadesso added the enhancement New feature or request label Apr 17, 2026
@andreabadesso andreabadesso moved this from Todo to In Progress (WIP) in Hathor Network Apr 17, 2026
andreabadesso added a commit that referenced this pull request Apr 17, 2026
Third piece of the benchmarking infrastructure. On PRs that touch
packages/daemon, runs the bench against both master and the PR branch
in parallel, then posts (or updates) a sticky PR comment with the
comparator output.

Key design choices (rationale in #396 review thread):
- Matrix strategy — two runners in parallel, each with its own MySQL +
  simulator containers, no cross-run state bleed.
- Bench scripts from PR head are overlaid onto the baseline checkout
  (via refs/pull/N/head so fork PRs work too) so the measurement tool
  stays constant across the comparison and baseline works even before
  the harness has landed on master.
- No exit gating. continue-on-error on every bench step — CI runner
  variance is too high for a hard threshold to mean anything at the run
  counts we can afford.
- Report also emitted to the job summary, so fork PRs (where
  GITHUB_TOKEN is read-only) still surface results.
- concurrency.cancel-in-progress: true to avoid stacking stale runs
  when a PR is pushed to repeatedly.
- Starts at 5 runs × 1 warmup per side; dial up once the scenario
  grows past its current 66-event ceiling.

Also adds a warning comment at the top of bench-sync.ts flagging the
overlay constraint: any symbol this script references must also exist
on master.

Depends on #396 (harness) and #397 (comparator). Targets #397 so the
three PRs can be reviewed as a stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
