Merged
Conversation
Replaces chunk-list + scratch-array state with a single growable Uint8Array plus an integer-offset stack for fork/join framing. Ports the OpenTelemetry PR #6390 ProtobufWriter pattern into the general-purpose protobuf-es BinaryWriter while preserving all 20 public methods.

Implementation follows analysis/p1-t1-l0-design-spec.md pinned decisions:

- D1/D2/D3: single Uint8Array, initial capacity 1024, 2x growth
- D4/D5/D6: placeholder-and-shift fork/join via copyWithin; stack holds integer offsets only (no per-fork object allocation)
- D7: single-pass ASCII probe in string() with UTF-8 fallback
- D8: finish() returns buf.subarray(0, pos) — no copy (lazy reset on next write keeps the returned slice stable if the writer is reused)
- D10: removed `protected buf: number[]` field from the legacy writer
- D11: cached DataView, rebuilt on grow
- D13: int64 family uses number/bigint/string tri-dispatch with protoInt64 fallback for out-of-range or invalid inputs (error-message parity)

Additive L0 API (for upcoming L1/L2 consumers):

- ensureCapacity(n): grow the backing buffer to hold n more bytes
- currentOffset(): return the current write position
- patchVarint32At(offset, value): back-patch a reserved varint32

Measurements (Node 25.8, 100-span ExportTraceRequest fixture):

- toBinary 100 spans: 525 → 2,278 ops/s (+334%)
- toBinary SimpleMessage (KeyValue): 733k → 784k ops/s (+7%, no regression)
- B/op 100 spans (5-run median): 6,338 → 7,616 B (±20% noise band; writer allocations dropped substantially but reflection dispatch still churns — the ≤1,500 B/op gate is blocked by Tier-B reflection work, tracked as P1-T2b)

Byte-parity verified via round-trip on the OTel fixture (bytesIdentical = true). All 2,875 existing tests plus 52 new L0 tests pass (edge cases: ensureCapacity growth, placeholder shift at varint-size boundaries, 10-deep fork/join nesting, ASCII + Unicode strings, int64 tri-dispatch parity, additive API contracts).
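The placeholder-and-shift scheme (D4/D5/D6) can be illustrated with a minimal standalone sketch. The `MiniWriter` class and all names below are hypothetical, not the fork's actual code: `fork()` optimistically reserves one byte for the length prefix, and `join()` uses `copyWithin` to shift the body right only when the final varint needs more bytes.

```typescript
// Hypothetical sketch of placeholder-and-shift fork/join, not the real
// protobuf-es implementation. fork() reserves one byte for the length
// prefix; join() shifts the body right if the varint needs more bytes.
class MiniWriter {
  private buf = new Uint8Array(1024);
  private pos = 0;
  private forks: number[] = []; // integer offsets only, no per-fork objects

  private ensure(n: number): void {
    if (this.pos + n <= this.buf.length) return;
    let cap = this.buf.length;
    while (cap < this.pos + n) cap *= 2; // 2x growth
    const next = new Uint8Array(cap);
    next.set(this.buf);
    this.buf = next;
  }

  byte(b: number): this {
    this.ensure(1);
    this.buf[this.pos++] = b;
    return this;
  }

  fork(): this {
    this.ensure(1);
    this.forks.push(this.pos); // remember where the placeholder sits
    this.pos++; // reserve one byte, optimistic 1-byte varint length
    return this;
  }

  join(): this {
    const start = this.forks.pop();
    if (start === undefined) throw new Error("join() without fork()");
    const len = this.pos - start - 1; // body length, excluding placeholder
    const size =
      len < 0x80 ? 1 : len < 0x4000 ? 2 : len < 0x200000 ? 3 : len < 0x10000000 ? 4 : 5;
    if (size > 1) {
      // shift the body right to make room for the larger varint
      this.ensure(size - 1);
      this.buf.copyWithin(start + size, start + 1, this.pos);
      this.pos += size - 1;
    }
    // write the length varint into the reserved gap
    let v = len;
    let i = start;
    while (v >= 0x80) {
      this.buf[i++] = (v & 0x7f) | 0x80;
      v >>>= 7;
    }
    this.buf[i] = v;
    return this;
  }

  finish(): Uint8Array {
    return this.buf.subarray(0, this.pos); // view, no copy
  }
}
```

Nested forks work because outer offsets sit before the point where an inner shift happens, so they stay valid.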
Breaking note: removed the undocumented `protected buf: number[]` field. No internal consumer (to-binary.ts, size-delimited.ts, extensions.ts) touches it; external subclassers (if any) must migrate to the public API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
P1-T2b — Gate #5 formally verified ✅

CPU-profile comparison captured on the OTel 100-span fixture:

| Branch | Total samples | finish + raw hits | Combined self-time |
|---|---|---|---|
| main (baseline) | 77,389 | 23,161 | 29.92 % |
| feat/l0-contiguous-writer | 18,495 | 11 | 0.05 % |
Gate #5 target: ≤ 10 % — measured 0.05 %. PASS (600× headroom).
Extended finish + raw + fork + join view: baseline 31.49 % → L0 2.34 %.
Throughput observed during profiling
main 414 ops/s → L0 1,767 ops/s (+327 %) — consistent with the
+334 % reported from .tmp/l0-bench/ in the PR body.
Top baseline hotspots eliminated

- `finish` 14.97 % → below profile cutoff
- `raw` 14.94 % → 0 hits in L0
- `fixed64` 3.50 % → 0.30 % (no more per-call `new Uint8Array(8)`)
- `TextEncoder.encode` 27.73 % → absorbed by the inlined ASCII fast-path in L0's `string()` (remaining work tracked under the `string` frame itself at 11.86 %, and the absolute hit count dropped ~10×)
New dominant hotspots in L0 (next-step vectors)

- Reflection (`ReflectMessageImpl`, `assertOwn`, `values`, `get sortedFields`, `unsafeIsSet`) — 29.66 % combined → L1 schema-plan work
- Dispatch (`writeMessageField`, `writeFields`, `writeField`, `writeListField`, `writeScalar`) — 19.10 % → L2 specialized writers (codegen)
- `string` UTF-8 path — 11.86 %
- GC — 2.10 % (vs. 1.34 % baseline; the relative share rose only because the rest of the pipeline got faster)
Full report and methodology: analysis/p1-t2b-profile-verification.md
(includes the jq commands needed to reproduce every number).
Profiles preserved at:
- `.tmp/p1-t2b-profile/baseline-main.cpuprofile` (906 KB)
- `.tmp/p1-t2b-profile/l0.cpuprofile` (362 KB)
Gate summary after P1-T2b
- Byte-parity — verified in PR (`verify-correctness.ts`)
- ≥ +30 % ops/s OTel — +334 % measured in PR (+327 % re-confirmed here)
- ≤ 1,500 B/op memory — deferred to reflection-layer work (P1-T2c)
- No SimpleMessage regression — +7 % in PR
- `finish + raw` ≤ 10 % self-time — 0.05 % measured (this report)
Replaces inline `0x20000000000000` (2^53) literals in the fast-path number guard of `signedInt64LoHi` / `unsignedInt64LoHi` with module-level `POW_2_53` and `NEG_POW_2_53` constants, computed via `2 ** 53` and annotated with `/*@__PURE__*/` so bundlers can inline or tree-shake them.

Resolves the TypeScript warning [80008] about numeric literals whose absolute values reach 2^53 not being representable accurately as integers.

Runtime behavior is unchanged: the guards accept the same set of values (finite integers in [-2^53, +2^53]), and the bigint path continues to use the existing `INT64_MIN_BI` / `INT64_MAX_BI` / `UINT64_MAX_BI` constants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
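The constant names `POW_2_53` / `NEG_POW_2_53` come from the commit above; the surrounding guard below is an illustrative sketch, not the exact source:

```typescript
// Module-level constants replacing the inline 2^53 literals (per the
// commit). The guard function itself is a hypothetical sketch.
const POW_2_53 = /*@__PURE__*/ 2 ** 53;
const NEG_POW_2_53 = /*@__PURE__*/ -(2 ** 53);

// Fast-path guard: true when v is a finite integer in [-2^53, +2^53],
// so the number path can encode it; otherwise callers take the
// bigint/protoInt64 fallback.
function fitsNumberFastPath(v: number): boolean {
  return Number.isInteger(v) && v >= NEG_POW_2_53 && v <= POW_2_53;
}
```

Using `2 ** 53` sidesteps warning [80008] because the compiler no longer sees a literal at the representability boundary.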
- Drop the inferrable `: number` on BinaryWriter's `initialCapacity` param to satisfy biome's noInferrableTypes rule (the `= 1024` default already narrows it to `number`).
- Apply biome format to the L0 writer test file (a single-expression `assert.deepStrictEqual` that had been split across lines).
- Regenerate the bundle-size baseline. The L0 contiguous-buffer writer adds ~9.6 KiB raw / ~3.5 KiB minified / ~730 B gzipped to the 1-file bundle and scales similarly across the matrix. This is the intended cost of the feature — we are accepting it as the new baseline rather than carrying the diff forever.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 19, 2026
intech added a commit that referenced this pull request (Apr 19, 2026)
Regenerated after merge of #6 (benchmark matrix), #8 (L0 contiguous writer), #10 (L1+L2 schema plans + specialized writers), #11 (correctness tests).

Key results (Node 25.8, log-scale chart):

- OTel 100 spans: 525 -> 2,501 ops/s (+376%), 0.80x pbjs (3,110)
- OTel Metrics 50: 891 -> 4,773 ops/s (+435%)
- OTel Logs 100: 880 -> 3,772 ops/s (+329%)
- K8sPodList 20: 712 -> 3,510 ops/s (+393%)
- Stress d=8 w=200: 2,568 -> 14,378 ops/s (+460%)
- SimpleMessage: 1.39M -> 1.81M ops/s (+30%)

Memory allocations per encode reduced proportionally via the L0 contiguous buffer + L1 schema-plan opcode interpreter + L2 specialized field writers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 20, 2026
intech added a commit that referenced this pull request (Apr 21, 2026)
…seline

Previous state on main: toBinary (L0 contiguous writer, PR #8) + toBinaryFast (L1+L2 schema plans, opt-in). Merging L1+L2 into toBinary broke extension conformance (PR #21). Keeping two encoders required users to change their call sites to see the L1+L2 gains — also unacceptable.

This change:

- Removes the toBinaryFast export and L1+L2 source files from main (packages/protobuf/src/to-binary-fast.ts, its unit test, and all benchmark wiring that referenced it).
- Preserves the full L1+L2 implementation on branch archive/l1-l2-schema-plans-experimental for future iteration, not discarded. Closes PR #21.
- Adds upstream-protobuf-es (an npm alias for @bufbuild/protobuf@latest) as a third baseline column in benchmarks (re-applying the infrastructure from the closed PR #20), so chart.svg and chart-delta.svg honestly show "this fork's toBinary vs original upstream vs protobufjs" instead of a fork-internal self-comparison.
- Regenerates chart.svg and chart-delta.svg with the three-way layout and rewrites the README narrative around the new encoders and the "Current state" archival note.
- Updates correctness-matrix.test.ts comments to point at the archive branch so contributors know where L1+L2 lives.

Local run (Node v25.8.1, unpinned host — CI pinned numbers will differ):

- upstream toBinary (OTel 100 spans): 331 ops/s (baseline)
- fork toBinary (OTel 100 spans): 1,012 ops/s (+206% vs upstream)
- protobufjs (OTel 100 spans): 1,680 ops/s

L0 alone gets the fork past 3x original upstream on the real OTel workload that drove this whole investigation, with byte-identical wire output.
intech added a commit that referenced this pull request (Apr 21, 2026)
* Drop toBinaryFast from main, archive L1+L2 to branch, add upstream baseline

Previous state on main: toBinary (L0 contiguous writer, PR #8) + toBinaryFast (L1+L2 schema plans, opt-in). Merging L1+L2 into toBinary broke extension conformance (PR #21). Keeping two encoders required users to change their call sites to see the L1+L2 gains — also unacceptable.

This change:

- Removes the toBinaryFast export and L1+L2 source files from main (packages/protobuf/src/to-binary-fast.ts, its unit test, and all benchmark wiring that referenced it).
- Preserves the full L1+L2 implementation on branch archive/l1-l2-schema-plans-experimental for future iteration, not discarded. Closes PR #21.
- Adds upstream-protobuf-es (an npm alias for @bufbuild/protobuf@latest) as a third baseline column in benchmarks (re-applying the infrastructure from the closed PR #20), so chart.svg and chart-delta.svg honestly show "this fork's toBinary vs original upstream vs protobufjs" instead of a fork-internal self-comparison.
- Regenerates chart.svg and chart-delta.svg with the three-way layout and rewrites the README narrative around the new encoders and the "Current state" archival note.
- Updates correctness-matrix.test.ts comments to point at the archive branch so contributors know where L1+L2 lives.

Local run (Node v25.8.1, unpinned host — CI pinned numbers will differ):

- upstream toBinary (OTel 100 spans): 331 ops/s (baseline)
- fork toBinary (OTel 100 spans): 1,012 ops/s (+206% vs upstream)
- protobufjs (OTel 100 spans): 1,680 ops/s

L0 alone gets the fork past 3x original upstream on the real OTel workload that drove this whole investigation, with byte-identical wire output.

* Use median-of-5 in bench:report to stabilize per-fixture numbers

Single-run benchmark output from report.ts was vulnerable to host jitter on small/fast fixtures — a SimpleMessage measurement spread 2-8x across back-to-back runs on the same machine. A committed chart snapshot could therefore show the fork's toBinary 19% slower than upstream in one run and 15-20% faster in the next four.

Wraps the benchmark loop inside report.ts with a BENCH_REPORT_RUNS counter (default 5) and takes the per-fixture, per-encoder median. Override via the env var for faster iteration or longer sweeps.

Fresh median-of-5 numbers (local unpinned host, taskset -c 0):

- SimpleMessage toBinary: upstream 1.27M, fork 1.35M (+6.3%)
- OTel 100 spans toBinary: upstream 536, fork 1,634 (+205%)
- K8sPodList toBinary: upstream 837, fork 3,140 (+275%)
- StressMessage toBinary: upstream 2,873, fork 10,476 (+265%)
- RpcResponse toBinary: fork 630K vs protobufjs 532K (fork ahead on this shape)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove leftover L3 / toBinaryFast references from test harness

Files deleted:

- packages/protobuf-test/src/schema-plan-adaptive.test.ts — L3 test file; imports toBinaryFast + @bufbuild/protobuf/wire/schema-plan-adaptive, both of which were already removed from main
- packages/protobuf/src/wire/schema-plan-adaptive.ts — L3 implementation, orphaned after its only consumer was removed
- benchmarks/src/bench-multishape.ts — L3-only benchmark, imports toBinaryFast

These files survived the initial cleanup because they were only exercised by L3/L1+L2 code paths that were themselves removed. The full L1+L2 + L3 implementation remains on archive/l1-l2-schema-plans-experimental.

Full test suite now passes: 2,909 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update benchmarks README to match L0-only main

Removed stale L1+L2 descriptions and the toBinaryFast 0.80x claim from 'Reading the results'. Replaced with the L0 description, a median-of-5 note, and fresh numbers (OTel 100 spans 0.88x protobufjs, cumulative vs upstream +6% to +275%). Dropped the bench-multishape reference from 'Future work' (file deleted in 2dfe2fa). Added an 'Archived work' section pointing at archive/l1-l2-schema-plans-experimental for the L1+L2 and L3 prototypes.

* Drop stale wire/schema-plan-adaptive export from package.json

The subpath export + typesVersions alias survived the L3 cleanup even though wire/schema-plan-adaptive.ts was deleted in 2442fd7. attw was reporting 'Resolution failed' on the orphaned entry, failing PR #22 CI.

* Bench CI: same-runner baseline instead of cross-runner artifact

Previously the bench-matrix job downloaded the baseline JSON from the most recent push-to-main workflow artifact. That baseline was captured on whatever ubuntu-latest runner GitHub happened to assign at that time; the PR run then happened on a different physical host. Even with taskset -c 0 + median-of-5 on both sides, cross-host variance (different P/E-core topologies, SMT neighbours, thermal state) remained 5-7%, producing chronic false-positive regressions on PRs that did not touch the encode/decode hot path at all.

Switch the PR job to benchmark origin/main and the PR merge commit in sequence on the SAME runner within the SAME workflow invocation. Every factor except the code under test is now held constant. The 'bench-baseline-main' artifact upload on push-to-main is preserved for external trend consumers but is no longer read by PR comparison.

Mechanics:

- Checkout PR as usual.
- Record current + origin/main SHAs.
- git checkout origin/main, install, build, generate, run matrix into baseline-results.json.
- git checkout back to PR head, rerun install/build/generate (the PR may have changed package-lock), run matrix into bench-results.json.
- compare-results.ts diffs the two local JSONs, posts the sticky comment, flags regressions.

Doubled bench work pushes the job from ~9 min to ~18 min; timeout raised 25 -> 40 min with buffer for slow runners.

Also: compare-results.ts header text now notes 'baseline and current are benchmarked on the same runner' so the PR comment reflects the new guarantee, and .gitignore covers baseline-results.json / bench-report.md to keep the PR working tree clean after a local bench pass.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the chunk-list + scratch-array state in `BinaryWriter` with a single growable `Uint8Array` plus an integer-offset stack for `fork()`/`join()` framing. Ports the OpenTelemetry PR #6390 `ProtobufWriter` pattern into the general-purpose protobuf-es writer while preserving all 20 public methods (signatures and wire-byte output are identical).
Implementation follows the pinned design decisions in `analysis/p1-t1-l0-design-spec.md` (13 decisions, D1-D13).

Pinned decisions applied

- Single growable `Uint8Array`, 1,024-byte initial capacity, 2x growth
- Placeholder-and-shift `fork/join` via `copyWithin`; the stack holds integer offsets only (no per-fork object allocation)
- Single-pass ASCII probe in `string()` with UTF-8 fallback
- `finish()` returns `buf.subarray(0, pos)` — no copy. A lazy-reset flag keeps the returned slice stable if the writer is reused.
- Removed the legacy `protected buf: number[]` field
- Cached `DataView`, rebuilt on grow
- int64 family: `typeof` tri-dispatch (number/bigint/string) with fallback to `protoInt64.enc`/`uEnc` for out-of-range or invalid inputs (error-message parity is maintained)
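The tri-dispatch shape can be sketched in isolation. This is a simplified, hypothetical version: the real writer splits values into lo/hi 32-bit halves and delegates range and parse errors to `protoInt64`, whereas the sketch below normalizes everything to a 64-bit two's-complement bigint.

```typescript
// Hypothetical sketch of the typeof tri-dispatch for int64 inputs.
// Not the fork's code: lo/hi splitting and protoInt64 are elided.
function toInt64Bits(value: number | bigint | string): bigint {
  let v: bigint;
  switch (typeof value) {
    case "number":
      // Fast path only handles exactly-representable integers; the real
      // writer falls back to protoInt64 here instead of throwing.
      if (!Number.isSafeInteger(value)) {
        throw new Error(`int64 invalid: ${value}`);
      }
      v = BigInt(value);
      break;
    case "bigint":
      v = value;
      break;
    default:
      v = BigInt(value); // string path; throws SyntaxError if not numeric
  }
  const INT64_MIN = -(2n ** 63n);
  const INT64_MAX = 2n ** 63n - 1n;
  if (v < INT64_MIN || v > INT64_MAX) {
    throw new Error(`int64 out of range: ${v}`);
  }
  return BigInt.asUintN(64, v); // two's-complement 64-bit representation
}
```

Dispatching on `typeof` keeps the hot number path monomorphic while still accepting the bigint and string inputs the public API allows.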
Additive L0 API
Three new public instance methods exposed for upcoming L1/L2 consumers:

- `ensureCapacity(n: number): void` — grow the backing buffer to hold `n` more bytes
- `currentOffset(): number` — return the current write position
- `patchVarint32At(offset: number, value: number): void` — back-patch an unsigned 32-bit varint at a previously reserved offset
Measurements
Node.js 25.8, 100-span `ExportTraceRequest` fixture (37,547 bytes on the wire), `.tmp/l0-bench/`.

Memory note. The per-encode heap delta varies heavily across runs as minor-GC timing shifts (baseline ranges 3k-42k B; L0 ranges 7k-15k B). L0 allocations from the writer itself dropped dramatically (no more 3,917 per-encode `Uint8Array` wrappers), but steady-state encode memory is now dominated by reflection dispatch allocations (`ReflectMessage` iterators, `KeyValue` attribute maps). The ≤1,500 B/op gate in the design spec targets post-Tier-B numbers; this PR unblocks the throughput gate but defers the memory gate to the reflection-layer work (P1-T2b follow-up).
Gates (per `analysis/p1-t1-l0-design-spec.md` §7)

- Byte-parity — identical wire output before and after (`.tmp/l0-bench/verify-correctness.ts` reports `bytesIdentical: true`)
- ≤ 1,500 B/op memory — writer allocations dropped as designed, but the global heap delta needs Tier-B work to clear the gate
- `finish + raw` self-time — 0.05 % for L0 vs. 29.92 % baseline (Node 25.8.1, 100-span OTel fixture, 30k iterations, 100 µs sampling). Full report: `analysis/p1-t2b-profile-verification.md`. See the gate-#5 verification comment.
Breaking notes

- Removed the undocumented `protected buf: number[]` field. No internal consumer (`to-binary.ts`, `size-delimited.ts`, `extensions.ts`) touches it; external subclassers must migrate to the public API (`raw()`, `ensureCapacity()`).
- `finish()` now returns a `Uint8Array` subarray view that aliases the writer's backing buffer. A lazy-reset flag swaps the buffer on the next write, so reuse works the same as the legacy writer — but callers must not mutate the returned slice while the writer is still in use.
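A caller-side illustration of the aliasing behavior, using only plain `Uint8Array` operations (no writer involved):

```typescript
// finish() hands back a subarray view, not a copy: mutations of the
// backing buffer remain visible through the view. Callers that must keep
// the bytes past the writer's next use should copy explicitly.
const backing = new Uint8Array([1, 2, 3, 4]);
const view = backing.subarray(0, 2); // what a finish()-style API returns
const stable = view.slice();         // slice() allocates an independent copy
backing[0] = 99;                     // view sees this; the copy does not
```

If the encoded bytes are consumed immediately (written to a socket, hashed, etc.), the zero-copy view is free; only long-lived retention needs the `slice()`.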
Test plan

- `@bufbuild/protobuf-test` — pass
- `@bufbuild/protoplugin-test` — pass
- `@bufbuild/protoplugin-example` — pass
- 52 new L0 tests in `binary-encoding-l0.test.ts`, covering:
  - `ensureCapacity` growth edges (1,024 → 4,096 → 100,000 B)
  - placeholder shift at varint-size boundaries (128, 16,383, 16,384, 2,097,151, 2,097,152)
  - 10-deep `fork/join` round-trip
  - `join()` without matching `fork()` throws
  - `string()` ASCII fast-path and UTF-8 fallback (empty, ASCII, long ASCII > initial capacity, é, emoji, mixed)
  - int64 tri-dispatch parity per family method (`uint64`, `fixed64`, `int64`, `sfixed64`)
  - additive API contracts (`currentOffset`, `ensureCapacity`, `patchVarint32At`)
  - `finish()` returns a stable view even after reuse

Scope
Internal PR within the `Connectum-Framework/protobuf-es` fork. Foundation for L1 (schema plans) and L2 (specialized writers).

Single-revert rollback if needed: `git revert <merge-sha>` restores the chunk-based writer in one action (no dual implementation to maintain, per spec §9).

Follow-ups are tracked as P1-T2b.