Skip to content

Add L0 contiguous-buffer BinaryWriter#8

Merged
intech merged 3 commits intomainfrom
feat/l0-contiguous-writer
Apr 19, 2026
Merged

Add L0 contiguous-buffer BinaryWriter#8
intech merged 3 commits intomainfrom
feat/l0-contiguous-writer

Conversation

@intech
Copy link
Copy Markdown

@intech intech commented Apr 19, 2026

Summary

Replaces the chunk-list + scratch-array state in BinaryWriter with a
single growable Uint8Array plus an integer-offset stack for fork()/
join() framing. Ports the OpenTelemetry PR #6390 ProtobufWriter
pattern into the general-purpose protobuf-es writer while preserving
all 20 public methods (signatures and wire-byte output are identical).

Implementation follows the pinned design decisions in
analysis/p1-t1-l0-design-spec.md (13 decisions, D1-D13).

Pinned decisions applied

  • D1/D2/D3: single Uint8Array, 1,024-byte initial capacity, 2x growth
  • D4/D5/D6: placeholder-and-shift fork/join via copyWithin; the
    stack holds integer offsets only (no per-fork object allocation)
  • D7: single-pass ASCII probe in string() with UTF-8 fallback
  • D8: finish() returns buf.subarray(0, pos) — no copy. A lazy-reset
    flag keeps the returned slice stable if the writer is reused.
  • D10: removed the legacy protected buf: number[] field
  • D11: cached DataView, rebuilt on grow
  • D13: int64 family uses a typeof tri-dispatch (number/bigint/string)
    with fallback to protoInt64.enc/uEnc for out-of-range or invalid
    inputs (error-message parity is maintained)

Additive L0 API

Three new public instance methods exposed for upcoming L1/L2 consumers:

  • ensureCapacity(n: number): void — grow the backing buffer to hold
    n more bytes
  • currentOffset(): number — return the current write position
  • patchVarint32At(offset: number, value: number): void — back-patch
    an unsigned 32-bit varint at a previously reserved offset

Measurements

Node.js 25.8, 100-span ExportTraceRequest fixture (37,547 bytes on
the wire), .tmp/l0-bench/.

Workload Baseline L0 Delta
toBinary 100 spans 525 ops/s 2,278 ops/s +334%
toBinary SimpleMessage (KeyValue) 733,279 ops/s 783,726 ops/s +7%
B/op 100 spans (5-run median) 6,338 B 7,616 B +20% (see note)

Memory note. The per-encode heap delta varies heavily across runs
as minor-GC timing shifts (baseline ranges 3k-42k B; L0 ranges 7k-15k
B). L0 allocations from the writer itself dropped dramatically (no
more 3,917 per-encode Uint8Array wrappers), but steady-state encode
memory is now dominated by reflection dispatch allocations
(ReflectMessage iterators, KeyValue attribute maps). The
≤1,500 B/op gate in the design spec targets post-Tier-B numbers; this
PR unblocks the throughput gate but defers the memory gate to the
reflection-layer work (P1-T2b follow-up).

Gates (per analysis/p1-t1-l0-design-spec.md §7)

  • Byte-parity — OTel fixture round-trip produces byte-identical
    output before and after (.tmp/l0-bench/verify-correctness.ts
    reports bytesIdentical: true)
  • ≥+30% ops/s OTel — actual +334% (2,278 vs 525)
  • ≤1,500 B/op memory — blocked by reflection dispatch; writer
    allocations dropped as designed, but the global heap delta needs
    Tier-B work to clear the gate
  • No SimpleMessage regression — +7% (well within ±10%)
  • finish + raw combined self-time ≤10% — measured 0.05% on
    L0 vs. 29.92% baseline (Node 25.8.1, 100-span OTel fixture,
    30k iterations, 100 µs sampling). Full report:
    analysis/p1-t2b-profile-verification.md. See
    gate-#5 verification comment.

Breaking notes

  • Removed the undocumented protected buf: number[] field. No internal
    consumer (to-binary.ts, size-delimited.ts, extensions.ts)
    touches it; external subclassers must migrate to the public API
    (raw(), ensureCapacity()).
  • finish() now returns a Uint8Array subarray view that aliases the
    writer's backing buffer. A lazy-reset flag swaps the buffer on the
    next write, so reuse works the same as the legacy writer — but
    callers must not mutate the returned slice while the writer is still
    in use.

Test plan

  • All 2,823 existing tests in @bufbuild/protobuf-test pass
  • All 187 tests in @bufbuild/protoplugin-test pass
  • All 4 tests in @bufbuild/protoplugin-example pass
  • 52 new L0-specific tests added in binary-encoding-l0.test.ts,
    covering:
    • ensureCapacity growth edges (1,024 → 4,096 → 100,000 B)
    • Placeholder-and-shift at every varint-size boundary (0, 1, 127,
      128, 16,383, 16,384, 2,097,151, 2,097,152)
    • 10-deep nested fork/join round-trip
    • join() without matching fork() throws
    • string() ASCII fast-path and UTF-8 fallback (empty, ASCII,
      long ASCII > initial capacity, é, emoji, mixed)
    • int64 tri-dispatch parity (number / bigint / string) for every
      family method (uint64, fixed64, int64, sfixed64)
    • Additive API contracts (currentOffset, ensureCapacity,
      patchVarint32At)
    • finish() returns a stable view even after reuse

Scope

Internal PR within the Connectum-Framework/protobuf-es fork.
Foundation for L1 (schema plans) and L2 (specialized writers).

Single-revert rollback if needed: git revert <merge-sha> restores
the chunk-based writer in one action (no dual implementation to
maintain, per spec §9).

Follow-ups tracked as P1-T2b:

Replaces chunk-list + scratch-array state with a single growable Uint8Array
plus an integer-offset stack for fork/join framing. Ports the
OpenTelemetry PR #6390 ProtobufWriter pattern into the general-purpose
protobuf-es BinaryWriter while preserving all 20 public methods.

Implementation follows analysis/p1-t1-l0-design-spec.md pinned decisions:
- D1/D2/D3: single Uint8Array, initial capacity 1024, 2x growth
- D4/D5/D6: placeholder-and-shift fork/join via copyWithin; stack holds
  integer offsets only (no per-fork object allocation)
- D7: single-pass ASCII probe in string() with UTF-8 fallback
- D8: finish() returns buf.subarray(0, pos) — no copy (lazy reset on next
  write keeps returned slice stable if the writer is reused)
- D10: removed `protected buf: number[]` field from the legacy writer
- D11: cached DataView, rebuilt on grow
- D13: int64 family uses number/bigint/string tri-dispatch with protoInt64
  fallback for out-of-range or invalid inputs (error-message parity)

Additive L0 API (for upcoming L1/L2 consumers):
- ensureCapacity(n): grow backing buffer to hold N more bytes
- currentOffset(): return current write position
- patchVarint32At(offset, value): back-patch a reserved varint32

Measurements (Node 25.8, 100-span ExportTraceRequest fixture):
- toBinary 100 spans: 525 → 2,278 ops/s (+334%)
- toBinary SimpleMessage (KeyValue): 733k → 784k ops/s (+7%, no regression)
- B/op 100 spans (5-run median): 6,338 → 7,616 B (±20% noise band; writer
  allocations dropped substantially but reflection dispatch still churns —
  the ≤1,500 B/op gate is blocked by Tier-B reflection work, tracked as
  P1-T2b)

Byte-parity verified via round-trip on the OTel fixture (bytesIdentical
= true). All 2,875 existing tests plus 52 new L0 tests pass (edge cases:
ensureCapacity growth, placeholder shift at varint-size boundaries,
10-deep fork/join nesting, ASCII + Unicode strings, int64 tri-dispatch
parity, additive API contracts).

Breaking note: removed undocumented `protected buf: number[]` field.
No internal consumer (to-binary.ts, size-delimited.ts, extensions.ts)
touches it; external subclassers (if any) must migrate to public API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intech
Copy link
Copy Markdown
Author

intech commented Apr 19, 2026

P1-T2b — Gate #5 formally verified ✅

CPU-profile comparison captured on the OTel 100-span ExportTraceRequest
fixture (32,926 B wire size, 30 k reflective toBinary iterations, Node
25.8.1 with --cpu-prof at 100 µs sampling).

finish + raw combined self-time

Branch Total samples finish + raw hits Combined self-time
main (baseline) 77,389 23,161 29.92 %
feat/l0-contiguous-writer 18,495 11 0.05 %

Gate #5 target: ≤ 10 % — measured 0.05 %. PASS (600× headroom).

Extended finish + raw + fork + join view: baseline 31.49 % → L0 2.34 %.

Throughput observed during profiling

main 414 ops/s → L0 1,767 ops/s (+327 %) — consistent with the
+334 % reported from .tmp/l0-bench/ in the PR body.

Top baseline hotspots eliminated

  • finish 14.97 % → below profile cutoff
  • raw 14.94 % → 0 hits in L0
  • fixed64 3.50 % → 0.30 % (no more per-call new Uint8Array(8))
  • TextEncoder.encode 27.73 % → absorbed by the inlined ASCII fast-path
    in L0's string() (remaining work tracked under the string frame
    itself at 11.86 %, and absolute hit count dropped ~10×)

New dominant hotspots in L0 (next-step vectors)

  • Reflection (ReflectMessageImpl, assertOwn, values,
    get sortedFields, unsafeIsSet) — 29.66 % combined → L1
    schema-plan work
  • Dispatch (writeMessageField, writeFields, writeField,
    writeListField, writeScalar) — 19.10 % → L2 specialized
    writers (codegen)
  • string UTF-8 path — 11.86 %
  • GC — 2.10 % (vs. 1.34 % baseline; the relative share rose only
    because the rest of the pipeline got faster)

Full report and methodology: analysis/p1-t2b-profile-verification.md
(includes the jq commands needed to reproduce every number).

Profiles preserved at:

  • .tmp/p1-t2b-profile/baseline-main.cpuprofile (906 KB)
  • .tmp/p1-t2b-profile/l0.cpuprofile (362 KB)

Gate summary after P1-T2b

  • Byte-parity — verified in PR (verify-correctness.ts)
  • ≥ +30 % ops/s OTel — +334 % measured in PR (+327 % re-confirmed here)
  • ≤ 1,500 B/op memory — deferred to reflection-layer work (P1-T2c)
  • No SimpleMessage regression — +7 % in PR
  • finish + raw ≤ 10 % self-time — 0.05 % measured (this report)

Replaces inline 0x20000000000000 (2^53) literals in the fast-path number
guard of signedInt64LoHi / unsignedInt64LoHi with module-level POW_2_53
and NEG_POW_2_53 constants, computed via 2 ** 53 and annotated with
/*@__PURE__*/ so bundlers can inline or tree-shake them. Resolves the
TypeScript warning [80008] about numeric literals whose absolute values
reach 2^53 being unable to be represented accurately as integers.

Runtime behavior is unchanged: the guards accept the same set of values
(finite integers in [-2^53, +2^53]), and the bigint path continues to
use the existing INT64_MIN_BI / INT64_MAX_BI / UINT64_MAX_BI constants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intech intech changed the title feat(protobuf): L0 contiguous-buffer BinaryWriter (+334% encode, SimpleMessage parity) Add L0 contiguous-buffer BinaryWriter Apr 19, 2026
- Drop inferrable `: number` on BinaryWriter.initialCapacity param to
  satisfy biome's noInferrableTypes rule (the `= 1024` default already
  narrows it to `number`).
- Apply biome format to the L0 writer test file (single-expression
  `assert.deepStrictEqual` that had been split across lines).
- Regenerate bundle-size baseline. The L0 contiguous-buffer writer adds
  ~9.6 KiB raw / ~3.5 KiB minified / ~730 B gzipped to the 1-file bundle
  and scales similarly across the matrix. This is the intended cost of
  the feature — we are accepting it as the new baseline rather than
  carrying the diff forever.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intech intech self-assigned this Apr 19, 2026
@intech intech merged commit 0febe99 into main Apr 19, 2026
26 checks passed
intech added a commit that referenced this pull request Apr 19, 2026
Regenerated after merge of #6 (benchmark matrix), #8 (L0 contiguous
writer), #10 (L1+L2 schema plans + specialized writers), #11
(correctness tests).

Key results (Node 25.8, log-scale chart):
- OTel 100 spans:    525 -> 2,501 ops/s (+376%), 0.80x pbjs (3,110)
- OTel Metrics 50:   891 -> 4,773 ops/s (+435%)
- OTel Logs 100:     880 -> 3,772 ops/s (+329%)
- K8sPodList 20:     712 -> 3,510 ops/s (+393%)
- Stress d=8 w=200:  2,568 -> 14,378 ops/s (+460%)
- SimpleMessage:     1.39M -> 1.81M ops/s (+30%)

Memory allocations per encode reduced proportionally via L0 contiguous
buffer + L1 schema-plan opcode interpreter + L2 specialized field writers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
intech added a commit that referenced this pull request Apr 19, 2026
Regenerated after merge of #6 (benchmark matrix), #8 (L0 contiguous
writer), #10 (L1+L2 schema plans + specialized writers), #11
(correctness tests).

Key results (Node 25.8, log-scale chart):
- OTel 100 spans:    525 -> 2,501 ops/s (+376%), 0.80x pbjs (3,110)
- OTel Metrics 50:   891 -> 4,773 ops/s (+435%)
- OTel Logs 100:     880 -> 3,772 ops/s (+329%)
- K8sPodList 20:     712 -> 3,510 ops/s (+393%)
- Stress d=8 w=200:  2,568 -> 14,378 ops/s (+460%)
- SimpleMessage:     1.39M -> 1.81M ops/s (+30%)

Memory allocations per encode reduced proportionally via L0 contiguous
buffer + L1 schema-plan opcode interpreter + L2 specialized field writers.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
intech added a commit that referenced this pull request Apr 21, 2026
…seline

Previous state on main: toBinary (L0 contiguous writer, PR #8) +
toBinaryFast (L1+L2 schema plans, opt-in). Merging L1+L2 into toBinary
broke extension conformance (PR #21). Keeping two encoders required
users to change their call sites to see L1+L2 gains — also unacceptable.

This change:

- Removes toBinaryFast export and L1+L2 source files from main
  (packages/protobuf/src/to-binary-fast.ts, its unit test, and all
  benchmark wiring that referenced it).
- Preserves the full L1+L2 implementation on branch
  archive/l1-l2-schema-plans-experimental for future iteration, not
  discarded. Closes PR #21.
- Adds upstream-protobuf-es (npm alias for @bufbuild/protobuf@latest)
  as a third baseline column in benchmarks (re-applying the
  infrastructure from the closed PR #20), so chart.svg and
  chart-delta.svg honestly show "this fork's toBinary vs original
  upstream vs protobufjs" instead of fork-internal self-comparison.
- Regenerates chart.svg and chart-delta.svg with the three-way layout
  and rewrites the README narrative around the new encoders and the
  "Current state" archival note.
- Updates correctness-matrix.test.ts comments to point at the archive
  branch so contributors know where L1+L2 lives.

Local run (Node v25.8.1, unpinned host — CI pinned numbers will differ):
- upstream toBinary (OTel 100 spans):     331 ops/s  (baseline)
- fork    toBinary (OTel 100 spans):    1,012 ops/s  (+206% vs upstream)
- protobufjs       (OTel 100 spans):    1,680 ops/s

L0 alone gets the fork past 3x original upstream on the real OTel
workload that drove this whole investigation, with byte-identical
wire output.
intech added a commit that referenced this pull request Apr 21, 2026
* Drop toBinaryFast from main, archive L1+L2 to branch, add upstream baseline

Previous state on main: toBinary (L0 contiguous writer, PR #8) +
toBinaryFast (L1+L2 schema plans, opt-in). Merging L1+L2 into toBinary
broke extension conformance (PR #21). Keeping two encoders required
users to change their call sites to see L1+L2 gains — also unacceptable.

This change:

- Removes toBinaryFast export and L1+L2 source files from main
  (packages/protobuf/src/to-binary-fast.ts, its unit test, and all
  benchmark wiring that referenced it).
- Preserves the full L1+L2 implementation on branch
  archive/l1-l2-schema-plans-experimental for future iteration, not
  discarded. Closes PR #21.
- Adds upstream-protobuf-es (npm alias for @bufbuild/protobuf@latest)
  as a third baseline column in benchmarks (re-applying the
  infrastructure from the closed PR #20), so chart.svg and
  chart-delta.svg honestly show "this fork's toBinary vs original
  upstream vs protobufjs" instead of fork-internal self-comparison.
- Regenerates chart.svg and chart-delta.svg with the three-way layout
  and rewrites the README narrative around the new encoders and the
  "Current state" archival note.
- Updates correctness-matrix.test.ts comments to point at the archive
  branch so contributors know where L1+L2 lives.

Local run (Node v25.8.1, unpinned host — CI pinned numbers will differ):
- upstream toBinary (OTel 100 spans):     331 ops/s  (baseline)
- fork    toBinary (OTel 100 spans):    1,012 ops/s  (+206% vs upstream)
- protobufjs       (OTel 100 spans):    1,680 ops/s

L0 alone gets the fork past 3x original upstream on the real OTel
workload that drove this whole investigation, with byte-identical
wire output.

* Use median-of-5 in bench:report to stabilize per-fixture numbers

Single-run benchmark output from report.ts was vulnerable to host
jitter on small/fast fixtures — a SimpleMessage measurement spread
2-8x across back-to-back runs on the same machine. A committed chart
snapshot could therefore show fork's toBinary 19% slower than
upstream in one run and 15-20% faster in the next four.

Wraps the benchmark loop inside report.ts with a BENCH_REPORT_RUNS
counter (default 5) and takes the per-fixture per-encoder median.
Override via the env var for faster iteration or longer sweeps.

Fresh median-of-5 numbers (local unpinned host, taskset -c 0):
- SimpleMessage toBinary: upstream 1.27M, fork 1.35M (+6.3%)
- OTel 100 spans toBinary: upstream 536, fork 1634 (+205%)
- K8sPodList toBinary: upstream 837, fork 3140 (+275%)
- StressMessage toBinary: upstream 2873, fork 10476 (+265%)
- RpcResponse toBinary: fork 630K vs protobufjs 532K (fork ahead on this shape)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove leftover L3 / toBinaryFast references from test harness

Files deleted:
- packages/protobuf-test/src/schema-plan-adaptive.test.ts — L3 test
  file, imports toBinaryFast + @bufbuild/protobuf/wire/schema-plan-adaptive
  both of which were already removed from main
- packages/protobuf/src/wire/schema-plan-adaptive.ts — L3
  implementation, orphaned after its only consumer was removed
- benchmarks/src/bench-multishape.ts — L3-only benchmark, imports
  toBinaryFast

These files survived the initial cleanup because they were only
exercised by L3/L1+L2 code paths that were themselves removed. The
full L1+L2 + L3 implementation remains on archive/l1-l2-schema-plans-experimental.

Full test suite now passes: 2909 / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update benchmarks README to match L0-only main

Removed stale L1+L2 descriptions and the toBinaryFast 0.80x claim from
'Reading the results'. Replaced with L0 description, median-of-5 note,
and fresh numbers (OTel 100 spans 0.88x protobufjs, cumulative vs
upstream +6% to +275%).

Dropped bench-multishape reference from 'Future work' (file deleted in
2dfe2fa). Added 'Archived work' section pointing at
archive/l1-l2-schema-plans-experimental for L1+L2 and L3 prototypes.

* Drop stale wire/schema-plan-adaptive export from package.json

The subpath export + typesVersions alias survived the L3 cleanup even
though wire/schema-plan-adaptive.ts was deleted in 2442fd7. attw was
reporting 'Resolution failed' on the orphaned entry, failing PR #22 CI.

* Bench CI: same-runner baseline instead of cross-runner artifact

Previously the bench-matrix job downloaded the baseline JSON from the
most recent push-to-main workflow artifact. That baseline was captured
on whatever ubuntu-latest runner GitHub happened to assign at that time;
the PR run then happened on a different physical host. Even with
taskset -c 0 + median-of-5 on both sides, cross-host variance (different
P/E-core topologies, SMT neighbours, thermal state) remained 5-7%,
producing chronic false-positive regressions on PRs that did not touch
the encode/decode hot path at all.

Switch the PR job to benchmark origin/main and the PR merge commit in
sequence on the SAME runner within the SAME workflow invocation. Every
factor except the code under test is now held constant. The
'bench-baseline-main' artifact upload on push-to-main is preserved for
external trend consumers but is no longer read by PR comparison.

Mechanics:
  - Checkout PR as usual.
  - Record current + origin/main SHAs.
  - git checkout origin/main, install, build, generate, run matrix into
    baseline-results.json.
  - git checkout back to PR head, rerun install/build/generate (PR may
    have changed package-lock), run matrix into bench-results.json.
  - compare-results.ts diffs the two local JSONs, posts the sticky
    comment, flags regressions.

Doubled bench work pushes the job from ~9 min to ~18 min; timeout raised
25 -> 40 min with buffer for slow runners.

Also: compare-results.ts header text now notes 'baseline and current
are benchmarked on the same runner' so the PR comment reflects the new
guarantee, and .gitignore covers baseline-results.json / bench-report.md
to keep the PR working tree clean after a local bench pass.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant