Add upstream protobuf-es as honest baseline in benchmark chart #20

Closed
intech wants to merge 1 commit into main from feat/upstream-baseline-in-benchmarks
Conversation


@intech intech commented Apr 20, 2026

Summary

The benchmark delta chart compared `toBinaryFast` against the fork's
in-tree `toBinary`, but that baseline already ships the L0
contiguous-writer optimisation from #8. That hid the cumulative
improvement over the original protobuf-es and understated the stacked
L0+L1+L2 gain.

This PR installs `@bufbuild/protobuf@2.11.0` under the npm alias
`upstream-protobuf-es` so the unmodified upstream encoder runs alongside
the fork's in-tree copy in the same Node process. Since both fork and
upstream are on v2.11.0 and the descriptor protocol is wire-compatible,
upstream's `toBinary` accepts fork-generated schemas and produces
byte-identical output — no second codegen pass required.
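
The parity claim can be spot-checked in a few lines. This is a sketch only, assuming the npm alias from the test plan; `SimpleMessageSchema` is a hypothetical stand-in for any fork-generated schema:

```ts
// Sketch of the byte-for-byte parity check described above (not the
// actual test wiring). "upstream-protobuf-es" is @bufbuild/protobuf@2.11.0
// installed under an npm alias.
import { create, toBinary } from "@bufbuild/protobuf"; // the fork's in-tree copy
import { toBinary as upstreamToBinary } from "upstream-protobuf-es";
import { SimpleMessageSchema } from "./gen/simple_pb.js"; // hypothetical fixture

const msg = create(SimpleMessageSchema, {});
const forkBytes = toBinary(SimpleMessageSchema, msg);
const upstreamBytes = upstreamToBinary(SimpleMessageSchema, msg);

// Upstream accepts the fork-generated schema, so one codegen pass drives both.
if (Buffer.compare(Buffer.from(forkBytes), Buffer.from(upstreamBytes)) !== 0) {
  throw new Error("fork and upstream toBinary diverged");
}
```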

Encoder matrix

The report now emits four encoder columns per fixture:

| column | what it measures |
| --- | --- |
| `upstream` | `@bufbuild/protobuf@2.11.0` unmodified — honest original baseline |
| `toBinary` | fork reflective encoder (includes L0 from #8) |
| `toBinaryFast` | fork codegen-generated fast path (L1 + L2) |
| `protobufjs` | cross-library reference |
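
As a sketch of how the four columns share one process (the exact wiring in the bench report differs; the fixture import is a placeholder):

```ts
// Hypothetical tinybench setup: fork and upstream encoders side by side.
import { Bench } from "tinybench";
import { create, toBinary } from "@bufbuild/protobuf";    // fork's in-tree copy
import { toBinary as upstreamToBinary } from "upstream-protobuf-es";
import { SimpleMessageSchema } from "./gen/simple_pb.js"; // hypothetical fixture

const msg = create(SimpleMessageSchema, {});
const bench = new Bench({ time: 600 }); // the 600 ms window used by the report

bench
  .add("upstream", () => void upstreamToBinary(SimpleMessageSchema, msg))
  .add("toBinary", () => void toBinary(SimpleMessageSchema, msg));
// toBinaryFast and the protobufjs encoder register the same way.

await bench.run();
for (const task of bench.tasks) {
  console.log(task.name, Math.round(task.result?.hz ?? 0), "ops/sec");
}
```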

Delta chart

`chart-delta.svg` now gains a third baseline row per fixture:

  1. vs upstream (red-orange) — honest cumulative gain from original protobuf-es
  2. vs toBinary (fork, L0) (grey) — incremental gain on top of L0
  3. vs protobufjs (blue) — cross-library reference

Bar width in the main chart shrinks from 20px to 18px so the four-bar
group stays legible in the same layout.

Sample numbers

Local run on Node v25.8.1, linux/x64 (same 600ms tinybench setting as
before, so variance is similar). ExportTraceRequest (100 spans) row:

| encoder | ops/sec |
| --- | --- |
| upstream | 445 |
| toBinary (fork, L0) | 1,633 |
| toBinaryFast | 2,094 |
| protobufjs | 2,262 |

StressMessage (depth=8, width=200):

| encoder | ops/sec |
| --- | --- |
| upstream | 2,083 |
| toBinary (fork, L0) | 10,017 |
| toBinaryFast | 11,321 |
| protobufjs | 16,655 |

On the 19-byte SimpleMessage the upstream/`toBinary` ordering flips at
the noise floor (800K vs 480K ops/sec) — expected for sub-microsecond
operations in a 600 ms measurement window. Realistic fixtures show the
full L0+L1+L2 stack clearly.

Test plan

  • upstream-protobuf-es alias installed via npm install --save-dev "upstream-protobuf-es@npm:@bufbuild/protobuf@2.11.0"
  • Byte-for-byte parity verified between fork toBinary and upstream toBinary across SimpleMessage, ExportTraceRequest, StressMessage fixtures
  • npx tsc --noEmit in benchmarks/ — clean
  • npm run lint -w @bufbuild/protobuf-benchmarks — clean
  • npm run format -w @bufbuild/protobuf-benchmarks — no fixes applied
  • npm run bench:report -w @bufbuild/protobuf-benchmarks — emits 40 results = 10 fixtures × 4 encoders
  • BENCH_REPORT_READ_ONLY=1 npm run bench:report — re-renders charts from the new JSON shape
  • chart-delta.svg renders three stacked sub-bars per fixture row; fixtures without a pbjs stub still render two rows (vs upstream, vs toBinary) correctly

Draft — not for merge until a full pinned-core run on the CI host replaces the local-desktop numbers currently committed to the chart SVGs.

The delta chart previously compared `toBinaryFast` against the fork's
`toBinary`, but that baseline already ships the L0 contiguous-writer
optimisation from PR #8. The comparison therefore hid the cumulative
improvement relative to the original protobuf-es and understated the
stacked L0+L1+L2 gain.

Install `@bufbuild/protobuf@2.11.0` under the `upstream-protobuf-es`
npm alias so the unmodified upstream encoder runs alongside the fork's
in-tree copy in the same process. Verified that upstream's `toBinary`
accepts fork-generated schemas and emits byte-identical output, so a
single schema drives both bars without a second codegen pass.

The bench report now emits four encoder columns per fixture
(`upstream`, `toBinary`, `toBinaryFast`, `protobufjs`). The delta chart
grows a third baseline row: `vs upstream` (honest cumulative gain),
`vs toBinary` (incremental gain on top of L0), and `vs protobufjs`
(cross-library reference). Bar width in the main chart shrinks from
20px to 18px to keep the four-bar group legible at the same layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Benchmark: 7 regression(s)

Thresholds: throughput regression >5%, memory regression >10%. Runner pinned to CPU 0 via taskset. Current run on linux/x64, Node v22.22.2, captured 2026-04-20T20:34:31.464Z.
Baseline captured 2026-04-20T17:34:45.007Z on linux/x64, Node v22.22.2.

Summary: 7 regressed, 0 improved, 0 new, 13 unchanged.

| Fixture | Baseline ops/s | PR ops/s | Δ ops | Baseline B/op | PR B/op | Δ mem | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SimpleMessage :: toBinary (pre-built, 19 B) | 849,817 | 868,228 | +2.2% | | | | ok |
| ExportTraceRequest (100 spans) :: toBinary (pre-built, 32926 B) | 1,231 | 1,123 | -8.7% | | | | REGRESSION |
| ExportMetricsRequest (50 series) :: toBinary (pre-built, 17696 B) | 2,168 | 1,970 | -9.1% | | | | REGRESSION |
| ExportLogsRequest (100 records) :: toBinary (pre-built, 21319 B) | 2,171 | 1,968 | -9.3% | | | | REGRESSION |
| K8sPodList (20 pods) :: toBinary (pre-built, 28900 B) | 2,342 | 2,162 | -7.7% | | | | REGRESSION |
| GraphQLRequest :: toBinary (pre-built, 624 B) | 176,305 | 173,269 | -1.7% | | | | ok |
| GraphQLResponse :: toBinary (pre-built, 1366 B) | 236,876 | 239,713 | +1.2% | | | | ok |
| RpcRequest :: toBinary (pre-built, 501 B) | 296,046 | 294,321 | -0.6% | | | | ok |
| RpcResponse :: toBinary (pre-built, 602 B) | 434,888 | 437,922 | +0.7% | | | | ok |
| StressMessage (depth=8, width=200) :: toBinary (pre-built, 12868 B) | 7,860 | 7,290 | -7.3% | | | | REGRESSION |
| SimpleMessage :: fromBinary (19 B) | 1,020,891 | 1,038,777 | +1.8% | | | | ok |
| ExportTraceRequest (100 spans) :: fromBinary (32926 B) | 599.8 | 610.5 | +1.8% | | | | ok |
| ExportMetricsRequest (50 series) :: fromBinary (17696 B) | 1,149 | 1,171 | +2.0% | | | | ok |
| ExportLogsRequest (100 records) :: fromBinary (21319 B) | 1,073 | 1,091 | +1.7% | | | | ok |
| K8sPodList (20 pods) :: fromBinary (28900 B) | 1,398 | 1,312 | -6.1% | | | | REGRESSION |
| GraphQLRequest :: fromBinary (624 B) | 300,513 | 287,857 | -4.2% | | | | ok |
| GraphQLResponse :: fromBinary (1366 B) | 265,540 | 269,060 | +1.3% | | | | ok |
| RpcRequest :: fromBinary (501 B) | 269,405 | 256,502 | -4.8% | | | | ok |
| RpcResponse :: fromBinary (602 B) | 378,014 | 369,598 | -2.2% | | | | ok |
| StressMessage (depth=8, width=200) :: fromBinary (12868 B) | 4,046 | 3,752 | -7.3% | | | | REGRESSION |

Produced by benchmarks/scripts/compare-results.ts. Artifacts: bench-results-<pr> (current), bench-baseline-main (baseline).
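
The gate itself reduces to a per-fixture percentage diff against the stated thresholds. A minimal sketch, assuming a simple result shape (the real compare-results.ts may structure this differently):

```ts
// Hypothetical shape of one fixture's measurement in the results JSON.
interface BenchResult {
  fixture: string;      // e.g. "SimpleMessage :: toBinary (pre-built, 19 B)"
  opsPerSec: number;
  bytesPerOp?: number;  // memory column, absent in this run
}

// Thresholds from the comment header: throughput regression >5%, memory >10%.
function status(baseline: BenchResult, pr: BenchResult): "ok" | "REGRESSION" {
  const dOps = (pr.opsPerSec - baseline.opsPerSec) / baseline.opsPerSec;
  const dMem =
    baseline.bytesPerOp !== undefined && pr.bytesPerOp !== undefined
      ? (pr.bytesPerOp - baseline.bytesPerOp) / baseline.bytesPerOp
      : 0;
  return dOps < -0.05 || dMem > 0.1 ? "REGRESSION" : "ok";
}
```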

@intech intech self-assigned this Apr 20, 2026
@intech intech closed this Apr 21, 2026
intech added a commit that referenced this pull request Apr 21, 2026
* Drop toBinaryFast from main, archive L1+L2 to branch, add upstream baseline

Previous state on main: toBinary (L0 contiguous writer, PR #8) +
toBinaryFast (L1+L2 schema plans, opt-in). Merging L1+L2 into toBinary
broke extension conformance (PR #21). Keeping two encoders required
users to change their call sites to see L1+L2 gains — also unacceptable.

This change:

- Removes toBinaryFast export and L1+L2 source files from main
  (packages/protobuf/src/to-binary-fast.ts, its unit test, and all
  benchmark wiring that referenced it).
- Preserves the full L1+L2 implementation on branch
  archive/l1-l2-schema-plans-experimental for future iteration, not
  discarded. Closes PR #21.
- Adds upstream-protobuf-es (npm alias for @bufbuild/protobuf@latest)
  as a third baseline column in benchmarks (re-applying the
  infrastructure from the closed PR #20), so chart.svg and
  chart-delta.svg honestly show "this fork's toBinary vs original
  upstream vs protobufjs" instead of fork-internal self-comparison.
- Regenerates chart.svg and chart-delta.svg with the three-way layout
  and rewrites the README narrative around the new encoders and the
  "Current state" archival note.
- Updates correctness-matrix.test.ts comments to point at the archive
  branch so contributors know where L1+L2 lives.

Local run (Node v25.8.1, unpinned host — CI pinned numbers will differ):
- upstream toBinary (OTel 100 spans):     331 ops/s  (baseline)
- fork    toBinary (OTel 100 spans):    1,012 ops/s  (+206% vs upstream)
- protobufjs       (OTel 100 spans):    1,680 ops/s

L0 alone gets the fork past 3x original upstream on the real OTel
workload that drove this whole investigation, with byte-identical
wire output.

* Use median-of-5 in bench:report to stabilize per-fixture numbers

Single-run benchmark output from report.ts was vulnerable to host
jitter on small/fast fixtures — a SimpleMessage measurement spread
2-8x across back-to-back runs on the same machine. A committed chart
snapshot could therefore show fork's toBinary 19% slower than
upstream in one run and 15-20% faster in the next four.

Wraps the benchmark loop inside report.ts with a BENCH_REPORT_RUNS
counter (default 5) and takes the per-fixture per-encoder median.
Override via the env var for faster iteration or longer sweeps.

Fresh median-of-5 numbers (local unpinned host, taskset -c 0):
- SimpleMessage toBinary: upstream 1.27M, fork 1.35M (+6.3%)
- OTel 100 spans toBinary: upstream 536, fork 1634 (+205%)
- K8sPodList toBinary: upstream 837, fork 3140 (+275%)
- StressMessage toBinary: upstream 2873, fork 10476 (+265%)
- RpcResponse toBinary: fork 630K vs protobufjs 532K (fork ahead on this shape)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove leftover L3 / toBinaryFast references from test harness

Files deleted:
- packages/protobuf-test/src/schema-plan-adaptive.test.ts — L3 test
  file, imports toBinaryFast + @bufbuild/protobuf/wire/schema-plan-adaptive
  both of which were already removed from main
- packages/protobuf/src/wire/schema-plan-adaptive.ts — L3
  implementation, orphaned after its only consumer was removed
- benchmarks/src/bench-multishape.ts — L3-only benchmark, imports
  toBinaryFast

These files survived the initial cleanup because they were only
exercised by L3/L1+L2 code paths that were themselves removed. The
full L1+L2 + L3 implementation remains on archive/l1-l2-schema-plans-experimental.

Full test suite now passes: 2909 / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update benchmarks README to match L0-only main

Removed stale L1+L2 descriptions and the toBinaryFast 0.80x claim from
'Reading the results'. Replaced with L0 description, median-of-5 note,
and fresh numbers (OTel 100 spans 0.88x protobufjs, cumulative vs
upstream +6% to +275%).

Dropped bench-multishape reference from 'Future work' (file deleted in
2dfe2fa). Added 'Archived work' section pointing at
archive/l1-l2-schema-plans-experimental for L1+L2 and L3 prototypes.

* Drop stale wire/schema-plan-adaptive export from package.json

The subpath export + typesVersions alias survived the L3 cleanup even
though wire/schema-plan-adaptive.ts was deleted in 2442fd7. attw was
reporting 'Resolution failed' on the orphaned entry, failing PR #22 CI.

* Bench CI: same-runner baseline instead of cross-runner artifact

Previously the bench-matrix job downloaded the baseline JSON from the
most recent push-to-main workflow artifact. That baseline was captured
on whatever ubuntu-latest runner GitHub happened to assign at that time;
the PR run then happened on a different physical host. Even with
taskset -c 0 + median-of-5 on both sides, cross-host variance (different
P/E-core topologies, SMT neighbours, thermal state) remained 5-7%,
producing chronic false-positive regressions on PRs that did not touch
the encode/decode hot path at all.

Switch the PR job to benchmark origin/main and the PR merge commit in
sequence on the SAME runner within the SAME workflow invocation. Every
factor except the code under test is now held constant. The
'bench-baseline-main' artifact upload on push-to-main is preserved for
external trend consumers but is no longer read by PR comparison.

Mechanics:
  - Checkout PR as usual.
  - Record current + origin/main SHAs.
  - git checkout origin/main, install, build, generate, run matrix into
    baseline-results.json.
  - git checkout back to PR head, rerun install/build/generate (PR may
    have changed package-lock), run matrix into bench-results.json.
  - compare-results.ts diffs the two local JSONs, posts the sticky
    comment, flags regressions.

Doubled bench work pushes the job from ~9 min to ~18 min; timeout raised
25 -> 40 min with buffer for slow runners.

Also: compare-results.ts header text now notes 'baseline and current
are benchmarked on the same runner' so the PR comment reflects the new
guarantee, and .gitignore covers baseline-results.json / bench-report.md
to keep the PR working tree clean after a local bench pass.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intech intech deleted the feat/upstream-baseline-in-benchmarks branch April 21, 2026 11:11