
Merge L1+L2 schema-plan into public toBinary (drop toBinaryFast API) #21

Closed

intech wants to merge 1 commit into main from feat/merge-l1-l2-into-tobinary

Conversation

intech commented Apr 20, 2026

Summary

Previously users had to opt in to the L1+L2 fast path by calling
toBinaryFast() directly — an API split that's not acceptable because
consumers rely on toBinary() being stable. This PR merges the
two-pass schema-plan encoder into the body of toBinary() itself.

  • Public toBinary(schema, message) signature unchanged.
  • The fast path (estimate + contiguous write) engages automatically for
    schemas without groups or extensions and for messages carrying no
    unknown fields on the hot path.
  • The reflective fallback still covers extensions, proto2 groups, delimited
    encoding, and messages carrying unknown bytes.
  • The toBinaryFast export and its dedicated test file are removed.

Correctness

  • Byte-parity with pre-refactor toBinary: 10/10 benchmark fixtures
    produce byte-identical output (SimpleMessage, OTelTrace/100 spans,
    OTelMetrics, OTelLogs, K8sPodList, GraphQL Request/Response, Rpc
    Request/Response, StressMessage) verified via a frozen base64
    baseline captured from main before the refactor.
  • Test suite: all 2929 protobuf-test tests pass unchanged.
  • Oneof ordering: oneof members are now emitted in wire (field-number)
    order, interleaved with regular fields, matching the reflective
    walk's sortedFields traversal. This was the source of the byte
    divergence in the first iteration; both paths now produce the same
    bytes on all canonical fixtures.
  • Error diagnostics: on any throw from the fast path, toBinary retries
    via the reflective walk, so users see the exact
    "cannot encode field <name> to binary: ..." message the old
    toBinary produced for invalid inputs (see the sketch after this list).
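
The dispatch-and-retry mechanics are roughly the following. This is a minimal sketch, not the fork's actual code; `lookupPlan`, `planEligible`, and `reflectiveToBinary` are hypothetical names standing in for the internal plan cache, the eligibility check, and the original reflective walk:

```ts
// Sketch of the merged toBinary: plan-based fast path with reflective
// fallback. All identifiers below are illustrative, not the fork's
// actual internals.
type SchemaPlan = {
  estimate(message: unknown): number;             // pass 1: exact byte size
  write(message: unknown, out: Uint8Array): void; // pass 2: contiguous write
};

declare function lookupPlan(schema: unknown): SchemaPlan | undefined;
declare function planEligible(plan: SchemaPlan, message: unknown): boolean;
declare function reflectiveToBinary(schema: unknown, message: unknown): Uint8Array;

function toBinary(schema: unknown, message: unknown): Uint8Array {
  const plan = lookupPlan(schema);
  if (plan !== undefined && planEligible(plan, message)) {
    try {
      const size = plan.estimate(message);  // pass 1: size the output exactly
      const bytes = new Uint8Array(size);   // single contiguous allocation
      plan.write(message, bytes);           // pass 2: write fields in order
      return bytes;
    } catch {
      // Fall through: retry via the reflective walk so invalid inputs
      // surface the exact "cannot encode field <name> to binary: ..."
      // diagnostic the old toBinary produced.
    }
  }
  // Groups, extensions, delimited encoding, unknown fields, or any
  // fast-path throw take the original reflective path.
  return reflectiveToBinary(schema, message);
}
```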

Performance

Pre-built message, OTel ExportTraceRequest with 100 spans,
taskset -c 0, median of 5 runs on the same host:

| variant | ops/s |
| --- | --- |
| main's toBinary | 948 (bench-matrix, short warmup) |
| main's toBinaryFast | 1959 (5 s steady-state probe) |
| this PR's toBinary | 1394 (bench-matrix) / 2234 (5 s steady-state probe) |

tinybench's short warmup under-measures V8 steady state on this
workload; the long-probe number is the fair comparison to the previous
opt-in fast path and shows the merged toBinary reaches or exceeds
pre-refactor toBinaryFast on identical hardware.
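
For reference, a steady-state probe of this kind can be sketched as follows. The durations are illustrative, and `encodeOnce` is a placeholder for one toBinary call on the pre-built message; this is not the actual benchmark harness:

```ts
// Sketch of a long steady-state probe: warm up past tinybench's short
// window, then count iterations over a fixed wall-clock budget.
declare function encodeOnce(): void; // one toBinary(schema, message) call

function steadyStateOpsPerSec(durationMs = 5000, warmupMs = 1000): number {
  // Warm up long enough for V8 to reach steady state on the hot path.
  const warmupEnd = performance.now() + warmupMs;
  while (performance.now() < warmupEnd) encodeOnce();

  let ops = 0;
  const start = performance.now();
  const end = start + durationMs;
  while (performance.now() < end) {
    encodeOnce();
    ops++;
  }
  return ops / ((performance.now() - start) / 1000);
}
```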

Test plan

  • Byte-parity verified against frozen baseline (10/10 fixtures identical)
  • npm test in packages/protobuf-test — 2929 pass, 0 fail
  • tsc --noEmit clean across packages/protobuf, packages/protobuf-test, benchmarks
  • bench-matrix 5-run median across all 10 fixtures — no fixture regresses vs main; OTel/100 spans improves from 948 to 1394 ops/s

Previously users had to opt in to the L1+L2 fast path by calling
toBinaryFast() directly. That created an API split that was not
acceptable because consumers relied on toBinary() being stable.

This change merges the schema-plan estimate+write interpreter into
the body of toBinary() itself. The public API signature is unchanged;
callers automatically benefit from the speedup on OTel-style nested
workloads without touching their code.

For proto features the plan does not cover (proto2 groups, delimited
encoding, extensions, or messages carrying unknown fields on the hot
path), toBinary() falls back to the original reflective walk
transparently. Byte-identical output is preserved across all ten
benchmark fixtures vs the pre-refactor toBinary. Error messages for
invalid scalar inputs also match: on any throw from the fast path,
toBinary retries via the reflective walk so users see the exact
"cannot encode field <name> to binary: ..." diagnostic regardless of
which path was taken.

Oneof members are now emitted in wire (field-number) order,
interleaved with regular fields — matching the reflective walk's
sortedFields traversal. This was the source of the byte-parity
divergence in the first iteration; tests now verify both paths
produce the same bytes across all canonical fixtures.
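
The ordering rule amounts to the following sketch, assuming a simplified field descriptor; the FieldDesc shape is illustrative, not the library's actual descriptor type:

```ts
// Illustrative sketch: emit oneof members interleaved with regular
// fields in ascending field-number order, matching the reflective
// walk's sortedFields traversal, rather than grouping each oneof's
// members together.
interface FieldDesc {
  number: number; // protobuf field number
  oneof?: string; // set when the field is a member of a oneof
}

function sortedFields(fields: readonly FieldDesc[]): FieldDesc[] {
  // Oneof membership does not affect ordering; only the field number does.
  return [...fields].sort((a, b) => a.number - b.number);
}
```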

Also removes the toBinaryFast export and its dedicated test file.
The correctness-matrix test keeps its multi-encoder shape so future
variants (e.g. a CSP-unsafe codegen path) can be added back with one
line.
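
The shape in question is roughly this (an illustrative sketch, not the actual correctness-matrix.test.ts code):

```ts
// The matrix encodes every fixture with every named variant and
// compares the bytes; a future variant is one added entry.
type Encoder = (schema: unknown, message: unknown) => Uint8Array;
declare const toBinary: Encoder; // the merged public encoder

const encoders: ReadonlyArray<[name: string, encode: Encoder]> = [
  ["toBinary", toBinary],
  // e.g. a CSP-unsafe codegen path would slot in here
];
```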

Benchmarks (pre-built message, OTel ExportTraceRequest with 100
spans, taskset -c 0, median of 5 runs on the same host):
  main's toBinary       →  948 ops/s
  main's toBinaryFast   → 1959 ops/s (longer steady-state probe)
  this PR's toBinary    → 1394 ops/s (bench-matrix) / 2234 ops/s (long probe)

The bench-matrix short warmup under-measures V8 steady state; the
5-second probe matches or exceeds pre-refactor toBinaryFast on
identical hardware.

All 2929 protobuf-test tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
intech self-assigned this Apr 20, 2026
@github-actions

Benchmark: 3 regression(s)

Thresholds: throughput regression >5%, memory regression >10%. Runner pinned to CPU 0 via taskset. Current run on linux/x64, Node v22.22.2, captured 2026-04-20T21:37:08.288Z.
Baseline captured 2026-04-20T17:34:45.007Z on linux/x64, Node v22.22.2.

Summary: 3 regressed, 8 improved, 0 new, 9 unchanged.

| Fixture | Baseline ops/s | PR ops/s | Δ ops | Baseline B/op | PR B/op | Δ mem | Status |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| SimpleMessage :: toBinary (pre-built, 19 B) | 849,817 | 1,234,664 | +45.3% | — | — | — | improved |
| ExportTraceRequest (100 spans) :: toBinary (pre-built, 32926 B) | 1,231 | 1,421 | +15.4% | — | — | — | improved |
| ExportMetricsRequest (50 series) :: toBinary (pre-built, 17696 B) | 2,168 | 2,659 | +22.6% | — | — | — | improved |
| ExportLogsRequest (100 records) :: toBinary (pre-built, 21319 B) | 2,171 | 2,638 | +21.5% | — | — | — | improved |
| K8sPodList (20 pods) :: toBinary (pre-built, 28900 B) | 2,342 | 2,715 | +15.9% | — | — | — | improved |
| GraphQLRequest :: toBinary (pre-built, 624 B) | 176,305 | 151,893 | -13.8% | — | — | — | REGRESSION |
| GraphQLResponse :: toBinary (pre-built, 1366 B) | 236,876 | 380,388 | +60.6% | — | — | — | improved |
| RpcRequest :: toBinary (pre-built, 501 B) | 296,046 | 273,339 | -7.7% | — | — | — | REGRESSION |
| RpcResponse :: toBinary (pre-built, 602 B) | 434,888 | 496,326 | +14.1% | — | — | — | improved |
| StressMessage (depth=8, width=200) :: toBinary (pre-built, 12868 B) | 7,860 | 9,379 | +19.3% | — | — | — | improved |
| SimpleMessage :: fromBinary (19 B) | 1,020,891 | 1,034,628 | +1.3% | — | — | — | ok |
| ExportTraceRequest (100 spans) :: fromBinary (32926 B) | 599.8 | 615.6 | +2.6% | — | — | — | ok |
| ExportMetricsRequest (50 series) :: fromBinary (17696 B) | 1,149 | 1,181 | +2.8% | — | — | — | ok |
| ExportLogsRequest (100 records) :: fromBinary (21319 B) | 1,073 | 1,110 | +3.4% | — | — | — | ok |
| K8sPodList (20 pods) :: fromBinary (28900 B) | 1,398 | 1,338 | -4.3% | — | — | — | ok |
| GraphQLRequest :: fromBinary (624 B) | 300,513 | 290,675 | -3.3% | — | — | — | ok |
| GraphQLResponse :: fromBinary (1366 B) | 265,540 | 274,332 | +3.3% | — | — | — | ok |
| RpcRequest :: fromBinary (501 B) | 269,405 | 260,275 | -3.4% | — | — | — | ok |
| RpcResponse :: fromBinary (602 B) | 378,014 | 375,240 | -0.7% | — | — | — | ok |
| StressMessage (depth=8, width=200) :: fromBinary (12868 B) | 4,046 | 3,820 | -5.6% | — | — | — | REGRESSION |

Produced by benchmarks/scripts/compare-results.ts. Artifacts: bench-results-<pr> (current), bench-baseline-main (baseline).
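
The Status column follows the thresholds stated in the comment header. A sketch of the classification rule, assuming the improvement cutoff mirrors the 5% throughput threshold; the Row shape and helper are illustrative, not the actual compare-results.ts code:

```ts
// Illustrative classification: throughput regression >5%, memory
// regression >10% (from the comment header); the improvement cutoff
// is assumed symmetric at 5%.
interface Row {
  baselineOps: number;
  prOps: number;
  baselineBytesPerOp?: number; // memory columns may be absent
  prBytesPerOp?: number;
}

function status(row: Row): "REGRESSION" | "improved" | "ok" {
  const opsDelta = (row.prOps - row.baselineOps) / row.baselineOps;
  const memDelta =
    row.baselineBytesPerOp !== undefined && row.prBytesPerOp !== undefined
      ? (row.prBytesPerOp - row.baselineBytesPerOp) / row.baselineBytesPerOp
      : 0;
  if (opsDelta < -0.05 || memDelta > 0.1) return "REGRESSION";
  if (opsDelta > 0.05) return "improved";
  return "ok";
}
```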

intech (Author) commented Apr 20, 2026

Closing: merging L1+L2 directly into public toBinary broke 3 extension-related unit tests and 6 conformance cases. Root cause is that runtime extension data on messages triggers the fast path but extension encoding is not covered by the plan-based walk.

Decision: keep main with L0 only (already merged via #8; +267% vs upstream, byte-identical). L1+L2 preserved on branch for future iteration, not discarded.

See branch archive/l1-l2-schema-plans-experimental for the work to resume from.

intech closed this Apr 20, 2026
intech added a commit that referenced this pull request Apr 21, 2026

Drop toBinaryFast from main, archive L1+L2 to branch, add upstream baseline
intech added a commit that referenced this pull request Apr 21, 2026
* Drop toBinaryFast from main, archive L1+L2 to branch, add upstream baseline

Previous state on main: toBinary (L0 contiguous writer, PR #8) +
toBinaryFast (L1+L2 schema plans, opt-in). Merging L1+L2 into toBinary
broke extension conformance (PR #21). Keeping two encoders required
users to change their call sites to see L1+L2 gains — also unacceptable.

This change:

- Removes toBinaryFast export and L1+L2 source files from main
  (packages/protobuf/src/to-binary-fast.ts, its unit test, and all
  benchmark wiring that referenced it).
- Preserves the full L1+L2 implementation on branch
  archive/l1-l2-schema-plans-experimental for future iteration, not
  discarded. Closes PR #21.
- Adds upstream-protobuf-es (npm alias for @bufbuild/protobuf@latest)
  as a third baseline column in benchmarks (re-applying the
  infrastructure from the closed PR #20), so chart.svg and
  chart-delta.svg honestly show "this fork's toBinary vs original
  upstream vs protobufjs" instead of fork-internal self-comparison.
- Regenerates chart.svg and chart-delta.svg with the three-way layout
  and rewrites the README narrative around the new encoders and the
  "Current state" archival note.
- Updates correctness-matrix.test.ts comments to point at the archive
  branch so contributors know where L1+L2 lives.

Local run (Node v25.8.1, unpinned host — CI pinned numbers will differ):
- upstream toBinary (OTel 100 spans):     331 ops/s  (baseline)
- fork    toBinary (OTel 100 spans):    1,012 ops/s  (+206% vs upstream)
- protobufjs       (OTel 100 spans):    1,680 ops/s

L0 alone gets the fork past 3x original upstream on the real OTel
workload that drove this whole investigation, with byte-identical
wire output.

* Use median-of-5 in bench:report to stabilize per-fixture numbers

Single-run benchmark output from report.ts was vulnerable to host
jitter on small/fast fixtures — a SimpleMessage measurement spread
2-8x across back-to-back runs on the same machine. A committed chart
snapshot could therefore show fork's toBinary 19% slower than
upstream in one run and 15-20% faster in the next four.

Wraps the benchmark loop inside report.ts with a BENCH_REPORT_RUNS
counter (default 5) and takes the per-fixture per-encoder median.
Override via the env var for faster iteration or longer sweeps.
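
The wrapper amounts to the following sketch; `measureFixture` is a hypothetical stand-in for a single report.ts benchmark pass over one fixture:

```ts
// Sketch of the median-of-N wrapper: run each fixture/encoder pair N
// times (default 5, overridable via BENCH_REPORT_RUNS) and report the
// per-fixture median to damp host jitter on small/fast fixtures.
declare function measureFixture(fixture: string): number; // ops/s

const runs = Number(process.env.BENCH_REPORT_RUNS ?? "5");

function medianOps(fixture: string): number {
  const samples = Array.from({ length: runs }, () => measureFixture(fixture));
  samples.sort((a, b) => a - b);
  const mid = Math.floor(samples.length / 2);
  return samples.length % 2 === 1
    ? samples[mid]
    : (samples[mid - 1] + samples[mid]) / 2;
}
```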

Fresh median-of-5 numbers (local unpinned host, taskset -c 0):
- SimpleMessage toBinary: upstream 1.27M, fork 1.35M (+6.3%)
- OTel 100 spans toBinary: upstream 536, fork 1634 (+205%)
- K8sPodList toBinary: upstream 837, fork 3140 (+275%)
- StressMessage toBinary: upstream 2873, fork 10476 (+265%)
- RpcResponse toBinary: fork 630K vs protobufjs 532K (fork ahead on this shape)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove leftover L3 / toBinaryFast references from test harness

Files deleted:
- packages/protobuf-test/src/schema-plan-adaptive.test.ts — L3 test
  file, imports toBinaryFast + @bufbuild/protobuf/wire/schema-plan-adaptive
  both of which were already removed from main
- packages/protobuf/src/wire/schema-plan-adaptive.ts — L3
  implementation, orphaned after its only consumer was removed
- benchmarks/src/bench-multishape.ts — L3-only benchmark, imports
  toBinaryFast

These files survived the initial cleanup because they were only
exercised by L3/L1+L2 code paths that were themselves removed. The
full L1+L2 + L3 implementation remains on archive/l1-l2-schema-plans-experimental.

Full test suite now passes: 2909 pass, 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update benchmarks README to match L0-only main

Removed stale L1+L2 descriptions and the toBinaryFast 0.80x claim from
'Reading the results'. Replaced with L0 description, median-of-5 note,
and fresh numbers (OTel 100 spans 0.88x protobufjs, cumulative vs
upstream +6% to +275%).

Dropped bench-multishape reference from 'Future work' (file deleted in
2dfe2fa). Added 'Archived work' section pointing at
archive/l1-l2-schema-plans-experimental for L1+L2 and L3 prototypes.

* Drop stale wire/schema-plan-adaptive export from package.json

The subpath export + typesVersions alias survived the L3 cleanup even
though wire/schema-plan-adaptive.ts was deleted in 2442fd7. attw was
reporting 'Resolution failed' on the orphaned entry, failing PR #22 CI.

* Bench CI: same-runner baseline instead of cross-runner artifact

Previously the bench-matrix job downloaded the baseline JSON from the
most recent push-to-main workflow artifact. That baseline was captured
on whatever ubuntu-latest runner GitHub happened to assign at that time;
the PR run then happened on a different physical host. Even with
taskset -c 0 + median-of-5 on both sides, cross-host variance (different
P/E-core topologies, SMT neighbours, thermal state) remained 5-7%,
producing chronic false-positive regressions on PRs that did not touch
the encode/decode hot path at all.

Switch the PR job to benchmark origin/main and the PR merge commit in
sequence on the SAME runner within the SAME workflow invocation. Every
factor except the code under test is now held constant. The
'bench-baseline-main' artifact upload on push-to-main is preserved for
external trend consumers but is no longer read by PR comparison.

Mechanics:
  - Checkout PR as usual.
  - Record current + origin/main SHAs.
  - git checkout origin/main, install, build, generate, run matrix into
    baseline-results.json.
  - git checkout back to PR head, rerun install/build/generate (PR may
    have changed package-lock), run matrix into bench-results.json.
  - compare-results.ts diffs the two local JSONs, posts the sticky
    comment, flags regressions.

Doubled bench work pushes the job from ~9 min to ~18 min; timeout raised
25 -> 40 min with buffer for slow runners.

Also: compare-results.ts header text now notes 'baseline and current
are benchmarked on the same runner' so the PR comment reflects the new
guarantee, and .gitignore covers baseline-results.json / bench-report.md
to keep the PR working tree clean after a local bench pass.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intech intech deleted the feat/merge-l1-l2-into-tobinary branch April 21, 2026 11:11