Add L0 contiguous-buffer BinaryWriter by intech · Pull Request #8 · Connectum-Framework/protobuf-es

intech · 2026-04-19T16:29:16Z

Summary

Replaces the chunk-list + scratch-array state in BinaryWriter with a
single growable Uint8Array plus an integer-offset stack for fork()/
join() framing. Ports the OpenTelemetry PR #6390 ProtobufWriter
pattern into the general-purpose protobuf-es writer while preserving
all 20 public methods (signatures and wire-byte output are identical).

Implementation follows the pinned design decisions in
analysis/p1-t1-l0-design-spec.md (13 decisions, D1-D13).

Pinned decisions applied

D1/D2/D3: single Uint8Array, 1,024-byte initial capacity, 2x growth
D4/D5/D6: placeholder-and-shift fork/join via copyWithin; the
stack holds integer offsets only (no per-fork object allocation)
D7: single-pass ASCII probe in string() with UTF-8 fallback
D8: finish() returns buf.subarray(0, pos) — no copy. A lazy-reset
flag keeps the returned slice stable if the writer is reused.
D10: removed the legacy protected buf: number[] field
D11: cached DataView, rebuilt on grow
D13: int64 family uses a typeof tri-dispatch (number/bigint/string)
with fallback to protoInt64.enc/uEnc for out-of-range or invalid
inputs (error-message parity is maintained)

Additive L0 API

Three new public instance methods exposed for upcoming L1/L2 consumers:

ensureCapacity(n: number): void — grow the backing buffer to hold
n more bytes
currentOffset(): number — return the current write position
patchVarint32At(offset: number, value: number): void — back-patch
an unsigned 32-bit varint at a previously reserved offset

Measurements

Node.js 25.8, 100-span ExportTraceRequest fixture (37,547 bytes on
the wire), .tmp/l0-bench/.

Workload	Baseline	L0	Delta
toBinary 100 spans	525 ops/s	2,278 ops/s	+334%
toBinary SimpleMessage (KeyValue)	733,279 ops/s	783,726 ops/s	+7%
B/op 100 spans (5-run median)	6,338 B	7,616 B	+20% (see note)

Memory note. The per-encode heap delta varies heavily across runs
as minor-GC timing shifts (baseline ranges 3k-42k B; L0 ranges 7k-15k
B). L0 allocations from the writer itself dropped dramatically (no
more 3,917 per-encode Uint8Array wrappers), but steady-state encode
memory is now dominated by reflection dispatch allocations
(ReflectMessage iterators, KeyValue attribute maps). The
≤1,500 B/op gate in the design spec targets post-Tier-B numbers; this
PR unblocks the throughput gate but defers the memory gate to the
reflection-layer work (P1-T2b follow-up).

Gates (per `analysis/p1-t1-l0-design-spec.md` §7)

Byte-parity — OTel fixture round-trip produces byte-identical
output before and after (.tmp/l0-bench/verify-correctness.ts
reports bytesIdentical: true)
≥+30% ops/s OTel — actual +334% (2,278 vs 525)
≤1,500 B/op memory — blocked by reflection dispatch; writer
allocations dropped as designed, but the global heap delta needs
Tier-B work to clear the gate
No SimpleMessage regression — +7% (well within ±10%)
finish + raw combined self-time ≤10% — measured 0.05% on
L0 vs. 29.92% baseline (Node 25.8.1, 100-span OTel fixture,
30k iterations, 100 µs sampling). Full report:
analysis/p1-t2b-profile-verification.md. See
gate-#5 verification comment.

Breaking notes

Removed the undocumented protected buf: number[] field. No internal
consumer (to-binary.ts, size-delimited.ts, extensions.ts)
touches it; external subclassers must migrate to the public API
(raw(), ensureCapacity()).
finish() now returns a Uint8Array subarray view that aliases the
writer's backing buffer. A lazy-reset flag swaps the buffer on the
next write, so reuse works the same as the legacy writer — but
callers must not mutate the returned slice while the writer is still
in use.

Test plan

All 2,823 existing tests in @bufbuild/protobuf-test pass
All 187 tests in @bufbuild/protoplugin-test pass
All 4 tests in @bufbuild/protoplugin-example pass
52 new L0-specific tests added in binary-encoding-l0.test.ts,
covering:
- ensureCapacity growth edges (1,024 → 4,096 → 100,000 B)
- Placeholder-and-shift at every varint-size boundary (0, 1, 127,
  128, 16,383, 16,384, 2,097,151, 2,097,152)
- 10-deep nested fork/join round-trip
- join() without matching fork() throws
- string() ASCII fast-path and UTF-8 fallback (empty, ASCII,
  long ASCII > initial capacity, é, emoji, mixed)
- int64 tri-dispatch parity (number / bigint / string) for every
  family method (uint64, fixed64, int64, sfixed64)
- Additive API contracts (currentOffset, ensureCapacity,
  patchVarint32At)
- finish() returns a stable view even after reuse

Scope

Internal PR within the Connectum-Framework/protobuf-es fork.
Foundation for L1 (schema plans) and L2 (specialized writers).

Single-revert rollback if needed: git revert <merge-sha> restores
the chunk-based writer in one action (no dual implementation to
maintain, per spec §9).

Follow-ups tracked as P1-T2b:

CPU profile capture to close gate prototype(protobuf): template-based per-schema encoder (H3A) #5 formally
Reflection-layer memory reduction to close gate prototype(protobuf): two-pass size estimator (+6.28x encode, -54% memory) #3

Replaces chunk-list + scratch-array state with a single growable Uint8Array plus an integer-offset stack for fork/join framing. Ports the OpenTelemetry PR #6390 ProtobufWriter pattern into the general-purpose protobuf-es BinaryWriter while preserving all 20 public methods. Implementation follows analysis/p1-t1-l0-design-spec.md pinned decisions: - D1/D2/D3: single Uint8Array, initial capacity 1024, 2x growth - D4/D5/D6: placeholder-and-shift fork/join via copyWithin; stack holds integer offsets only (no per-fork object allocation) - D7: single-pass ASCII probe in string() with UTF-8 fallback - D8: finish() returns buf.subarray(0, pos) — no copy (lazy reset on next write keeps returned slice stable if the writer is reused) - D10: removed `protected buf: number[]` field from the legacy writer - D11: cached DataView, rebuilt on grow - D13: int64 family uses number/bigint/string tri-dispatch with protoInt64 fallback for out-of-range or invalid inputs (error-message parity) Additive L0 API (for upcoming L1/L2 consumers): - ensureCapacity(n): grow backing buffer to hold N more bytes - currentOffset(): return current write position - patchVarint32At(offset, value): back-patch a reserved varint32 Measurements (Node 25.8, 100-span ExportTraceRequest fixture): - toBinary 100 spans: 525 → 2,278 ops/s (+334%) - toBinary SimpleMessage (KeyValue): 733k → 784k ops/s (+7%, no regression) - B/op 100 spans (5-run median): 6,338 → 7,616 B (±20% noise band; writer allocations dropped substantially but reflection dispatch still churns — the ≤1,500 B/op gate is blocked by Tier-B reflection work, tracked as P1-T2b) Byte-parity verified via round-trip on the OTel fixture (bytesIdentical = true). All 2,875 existing tests plus 52 new L0 tests pass (edge cases: ensureCapacity growth, placeholder shift at varint-size boundaries, 10-deep fork/join nesting, ASCII + Unicode strings, int64 tri-dispatch parity, additive API contracts). Breaking note: removed undocumented `protected buf: number[]` field. No internal consumer (to-binary.ts, size-delimited.ts, extensions.ts) touches it; external subclassers (if any) must migrate to public API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

intech · 2026-04-19T18:28:02Z

P1-T2b — Gate #5 formally verified ✅

CPU-profile comparison captured on the OTel 100-span ExportTraceRequest
fixture (32,926 B wire size, 30 k reflective toBinary iterations, Node
25.8.1 with --cpu-prof at 100 µs sampling).

`finish + raw` combined self-time

Branch	Total samples	`finish + raw` hits	Combined self-time
`main` (baseline)	77,389	23,161	29.92 %
`feat/l0-contiguous-writer`	18,495	11	0.05 %

Gate #5 target: ≤ 10 % — measured 0.05 %. PASS (600× headroom).

Extended finish + raw + fork + join view: baseline 31.49 % → L0 2.34 %.

Throughput observed during profiling

main 414 ops/s → L0 1,767 ops/s (+327 %) — consistent with the
+334 % reported from .tmp/l0-bench/ in the PR body.

Top baseline hotspots eliminated

finish 14.97 % → below profile cutoff
raw 14.94 % → 0 hits in L0
fixed64 3.50 % → 0.30 % (no more per-call new Uint8Array(8))
TextEncoder.encode 27.73 % → absorbed by the inlined ASCII fast-path
in L0's string() (remaining work tracked under the string frame
itself at 11.86 %, and absolute hit count dropped ~10×)

New dominant hotspots in L0 (next-step vectors)

Reflection (ReflectMessageImpl, assertOwn, values,
get sortedFields, unsafeIsSet) — 29.66 % combined → L1
schema-plan work
Dispatch (writeMessageField, writeFields, writeField,
writeListField, writeScalar) — 19.10 % → L2 specialized
writers (codegen)
string UTF-8 path — 11.86 %
GC — 2.10 % (vs. 1.34 % baseline; the relative share rose only
because the rest of the pipeline got faster)

Full report and methodology: analysis/p1-t2b-profile-verification.md
(includes the jq commands needed to reproduce every number).

Profiles preserved at:

.tmp/p1-t2b-profile/baseline-main.cpuprofile (906 KB)
.tmp/p1-t2b-profile/l0.cpuprofile (362 KB)

Gate summary after P1-T2b

Byte-parity — verified in PR (verify-correctness.ts)
≥ +30 % ops/s OTel — +334 % measured in PR (+327 % re-confirmed here)
≤ 1,500 B/op memory — deferred to reflection-layer work (P1-T2c)
No SimpleMessage regression — +7 % in PR
finish + raw ≤ 10 % self-time — 0.05 % measured (this report)

Replaces inline 0x20000000000000 (2^53) literals in the fast-path number guard of signedInt64LoHi / unsignedInt64LoHi with module-level POW_2_53 and NEG_POW_2_53 constants, computed via 2 ** 53 and annotated with /*@__PURE__*/ so bundlers can inline or tree-shake them. Resolves the TypeScript warning [80008] about numeric literals whose absolute values reach 2^53 being unable to be represented accurately as integers. Runtime behavior is unchanged: the guards accept the same set of values (finite integers in [-2^53, +2^53]), and the bigint path continues to use the existing INT64_MIN_BI / INT64_MAX_BI / UINT64_MAX_BI constants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Drop inferrable `: number` on BinaryWriter.initialCapacity param to satisfy biome's noInferrableTypes rule (the `= 1024` default already narrows it to `number`). - Apply biome format to the L0 writer test file (single-expression `assert.deepStrictEqual` that had been split across lines). - Regenerate bundle-size baseline. The L0 contiguous-buffer writer adds ~9.6 KiB raw / ~3.5 KiB minified / ~730 B gzipped to the 1-file bundle and scales similarly across the matrix. This is the intended cost of the feature — we are accepting it as the new baseline rather than carrying the diff forever. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Regenerated after merge of #6 (benchmark matrix), #8 (L0 contiguous writer), #10 (L1+L2 schema plans + specialized writers), #11 (correctness tests). Key results (Node 25.8, log-scale chart): - OTel 100 spans: 525 -> 2,501 ops/s (+376%), 0.80x pbjs (3,110) - OTel Metrics 50: 891 -> 4,773 ops/s (+435%) - OTel Logs 100: 880 -> 3,772 ops/s (+329%) - K8sPodList 20: 712 -> 3,510 ops/s (+393%) - Stress d=8 w=200: 2,568 -> 14,378 ops/s (+460%) - SimpleMessage: 1.39M -> 1.81M ops/s (+30%) Memory allocations per encode reduced proportionally via L0 contiguous buffer + L1 schema-plan opcode interpreter + L2 specialized field writers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Regenerated after merge of #6 (benchmark matrix), #8 (L0 contiguous writer), #10 (L1+L2 schema plans + specialized writers), #11 (correctness tests). Key results (Node 25.8, log-scale chart): - OTel 100 spans: 525 -> 2,501 ops/s (+376%), 0.80x pbjs (3,110) - OTel Metrics 50: 891 -> 4,773 ops/s (+435%) - OTel Logs 100: 880 -> 3,772 ops/s (+329%) - K8sPodList 20: 712 -> 3,510 ops/s (+393%) - Stress d=8 w=200: 2,568 -> 14,378 ops/s (+460%) - SimpleMessage: 1.39M -> 1.81M ops/s (+30%) Memory allocations per encode reduced proportionally via L0 contiguous buffer + L1 schema-plan opcode interpreter + L2 specialized field writers. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…seline Previous state on main: toBinary (L0 contiguous writer, PR #8) + toBinaryFast (L1+L2 schema plans, opt-in). Merging L1+L2 into toBinary broke extension conformance (PR #21). Keeping two encoders required users to change their call sites to see L1+L2 gains — also unacceptable. This change: - Removes toBinaryFast export and L1+L2 source files from main (packages/protobuf/src/to-binary-fast.ts, its unit test, and all benchmark wiring that referenced it). - Preserves the full L1+L2 implementation on branch archive/l1-l2-schema-plans-experimental for future iteration, not discarded. Closes PR #21. - Adds upstream-protobuf-es (npm alias for @bufbuild/protobuf@latest) as a third baseline column in benchmarks (re-applying the infrastructure from the closed PR #20), so chart.svg and chart-delta.svg honestly show "this fork's toBinary vs original upstream vs protobufjs" instead of fork-internal self-comparison. - Regenerates chart.svg and chart-delta.svg with the three-way layout and rewrites the README narrative around the new encoders and the "Current state" archival note. - Updates correctness-matrix.test.ts comments to point at the archive branch so contributors know where L1+L2 lives. Local run (Node v25.8.1, unpinned host — CI pinned numbers will differ): - upstream toBinary (OTel 100 spans): 331 ops/s (baseline) - fork toBinary (OTel 100 spans): 1,012 ops/s (+206% vs upstream) - protobufjs (OTel 100 spans): 1,680 ops/s L0 alone gets the fork past 3x original upstream on the real OTel workload that drove this whole investigation, with byte-identical wire output.

* Drop toBinaryFast from main, archive L1+L2 to branch, add upstream baseline Previous state on main: toBinary (L0 contiguous writer, PR #8) + toBinaryFast (L1+L2 schema plans, opt-in). Merging L1+L2 into toBinary broke extension conformance (PR #21). Keeping two encoders required users to change their call sites to see L1+L2 gains — also unacceptable. This change: - Removes toBinaryFast export and L1+L2 source files from main (packages/protobuf/src/to-binary-fast.ts, its unit test, and all benchmark wiring that referenced it). - Preserves the full L1+L2 implementation on branch archive/l1-l2-schema-plans-experimental for future iteration, not discarded. Closes PR #21. - Adds upstream-protobuf-es (npm alias for @bufbuild/protobuf@latest) as a third baseline column in benchmarks (re-applying the infrastructure from the closed PR #20), so chart.svg and chart-delta.svg honestly show "this fork's toBinary vs original upstream vs protobufjs" instead of fork-internal self-comparison. - Regenerates chart.svg and chart-delta.svg with the three-way layout and rewrites the README narrative around the new encoders and the "Current state" archival note. - Updates correctness-matrix.test.ts comments to point at the archive branch so contributors know where L1+L2 lives. Local run (Node v25.8.1, unpinned host — CI pinned numbers will differ): - upstream toBinary (OTel 100 spans): 331 ops/s (baseline) - fork toBinary (OTel 100 spans): 1,012 ops/s (+206% vs upstream) - protobufjs (OTel 100 spans): 1,680 ops/s L0 alone gets the fork past 3x original upstream on the real OTel workload that drove this whole investigation, with byte-identical wire output. * Use median-of-5 in bench:report to stabilize per-fixture numbers Single-run benchmark output from report.ts was vulnerable to host jitter on small/fast fixtures — a SimpleMessage measurement spread 2-8x across back-to-back runs on the same machine. A committed chart snapshot could therefore show fork's toBinary 19% slower than upstream in one run and 15-20% faster in the next four. Wraps the benchmark loop inside report.ts with a BENCH_REPORT_RUNS counter (default 5) and takes the per-fixture per-encoder median. Override via the env var for faster iteration or longer sweeps. Fresh median-of-5 numbers (local unpinned host, taskset -c 0): - SimpleMessage toBinary: upstream 1.27M, fork 1.35M (+6.3%) - OTel 100 spans toBinary: upstream 536, fork 1634 (+205%) - K8sPodList toBinary: upstream 837, fork 3140 (+275%) - StressMessage toBinary: upstream 2873, fork 10476 (+265%) - RpcResponse toBinary: fork 630K vs protobufjs 532K (fork ahead on this shape) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Remove leftover L3 / toBinaryFast references from test harness Files deleted: - packages/protobuf-test/src/schema-plan-adaptive.test.ts — L3 test file, imports toBinaryFast + @bufbuild/protobuf/wire/schema-plan-adaptive both of which were already removed from main - packages/protobuf/src/wire/schema-plan-adaptive.ts — L3 implementation, orphaned after its only consumer was removed - benchmarks/src/bench-multishape.ts — L3-only benchmark, imports toBinaryFast These files survived the initial cleanup because they were only exercised by L3/L1+L2 code paths that were themselves removed. The full L1+L2 + L3 implementation remains on archive/l1-l2-schema-plans-experimental. Full test suite now passes: 2909 / 0 fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Update benchmarks README to match L0-only main Removed stale L1+L2 descriptions and the toBinaryFast 0.80x claim from 'Reading the results'. Replaced with L0 description, median-of-5 note, and fresh numbers (OTel 100 spans 0.88x protobufjs, cumulative vs upstream +6% to +275%). Dropped bench-multishape reference from 'Future work' (file deleted in 2dfe2fa). Added 'Archived work' section pointing at archive/l1-l2-schema-plans-experimental for L1+L2 and L3 prototypes. * Drop stale wire/schema-plan-adaptive export from package.json The subpath export + typesVersions alias survived the L3 cleanup even though wire/schema-plan-adaptive.ts was deleted in 2442fd7. attw was reporting 'Resolution failed' on the orphaned entry, failing PR #22 CI. * Bench CI: same-runner baseline instead of cross-runner artifact Previously the bench-matrix job downloaded the baseline JSON from the most recent push-to-main workflow artifact. That baseline was captured on whatever ubuntu-latest runner GitHub happened to assign at that time; the PR run then happened on a different physical host. Even with taskset -c 0 + median-of-5 on both sides, cross-host variance (different P/E-core topologies, SMT neighbours, thermal state) remained 5-7%, producing chronic false-positive regressions on PRs that did not touch the encode/decode hot path at all. Switch the PR job to benchmark origin/main and the PR merge commit in sequence on the SAME runner within the SAME workflow invocation. Every factor except the code under test is now held constant. The 'bench-baseline-main' artifact upload on push-to-main is preserved for external trend consumers but is no longer read by PR comparison. Mechanics: - Checkout PR as usual. - Record current + origin/main SHAs. - git checkout origin/main, install, build, generate, run matrix into baseline-results.json. - git checkout back to PR head, rerun install/build/generate (PR may have changed package-lock), run matrix into bench-results.json. - compare-results.ts diffs the two local JSONs, posts the sticky comment, flags regressions. Doubled bench work pushes the job from ~9 min to ~18 min; timeout raised 25 -> 40 min with buffer for slow runners. Also: compare-results.ts header text now notes 'baseline and current are benchmarked on the same runner' so the PR comment reflects the new guarantee, and .gitignore covers baseline-results.json / bench-report.md to keep the PR working tree clean after a local bench pass. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

intech mentioned this pull request Apr 19, 2026

Add L1 schema plan and L2 specialized writer #10

Merged

6 tasks

intech changed the title ~~feat(protobuf): L0 contiguous-buffer BinaryWriter (+334% encode, SimpleMessage parity)~~ Add L0 contiguous-buffer BinaryWriter Apr 19, 2026

intech self-assigned this Apr 19, 2026

intech merged commit 0febe99 into main Apr 19, 2026
26 checks passed

intech mentioned this pull request Apr 19, 2026

Refresh benchmark chart with finalized L0+L1+L2 stack #12

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add L0 contiguous-buffer BinaryWriter#8

Add L0 contiguous-buffer BinaryWriter#8
intech merged 3 commits intomainfrom
feat/l0-contiguous-writer

intech commented Apr 19, 2026 •

edited

Loading

Uh oh!

intech commented Apr 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

intech commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Pinned decisions applied

Additive L0 API

Measurements

Gates (per analysis/p1-t1-l0-design-spec.md §7)

Breaking notes

Test plan

Scope

Uh oh!

intech commented Apr 19, 2026

P1-T2b — Gate #5 formally verified ✅

finish + raw combined self-time

Throughput observed during profiling

Top baseline hotspots eliminated

New dominant hotspots in L0 (next-step vectors)

Gate summary after P1-T2b

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

intech commented Apr 19, 2026 •

edited

Loading

Gates (per `analysis/p1-t1-l0-design-spec.md` §7)

`finish + raw` combined self-time