
prototype(protobuf): template-based per-schema encoder (H3A)#5

Closed
intech wants to merge 1 commit into feat/prototype-estimator-map-oneof from feat/prototype-per-schema-codegen

Conversation


@intech intech commented Apr 19, 2026

Summary

Replaces the switch(fieldKind) / switch(scalar) dispatch inside toBinaryFast with a pre-built array of closures per DescMessage. Each closure pre-captures tagBytes: Uint8Array, localName, and the scalar-specific writer, so the inner encode loop becomes for (const step of steps) step(c, msg, sizes) with no branch tables and no tag re-encoding on the hot path. CSP-safe — no eval, no new Function(), no dynamic source generation.

Step arrays are built on first touch of a schema and cached in WeakMap<DescMessage, Step[]>, amortized for the lifetime of the process. Stacked on #4.
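A minimal sketch of that cache-and-replay shape, assuming hypothetical names (`Step`, `getEncodeSteps`, and the `Writer`/`SizeMap` aliases are illustrative stand-ins, not the identifiers in the branch):

```ts
import type { DescMessage } from "@bufbuild/protobuf";

type Writer = { buf: number[] };                 // stand-in for the real byte writer
type SizeMap = Map<object, number>;              // body sizes keyed by object identity
type Step = (c: Writer, msg: Record<string, unknown>, sizes: SizeMap) => void;

declare function buildEncodeSteps(desc: DescMessage): Step[]; // one descriptor walk

const stepCache = new WeakMap<DescMessage, Step[]>();

function getEncodeSteps(desc: DescMessage): Step[] {
  let steps = stepCache.get(desc);
  if (steps === undefined) {
    steps = buildEncodeSteps(desc);   // first touch of this schema
    stepCache.set(desc, steps);       // cached for the process lifetime
  }
  return steps;
}

// The entire hot path: no switch on fieldKind, no tag re-encoding.
function writeMessageInto(
  c: Writer,
  desc: DescMessage,
  msg: Record<string, unknown>,
  sizes: SizeMap,
): void {
  for (const step of getEncodeSteps(desc)) step(c, msg, sizes);
}
```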

Approach

  • buildSizeSteps(desc) walks desc.fields and desc.oneofs once, produces SizeStep[] — each element a closure specialized to one field (scalar variant, enum, message, list-of-T, map<K,V>, oneof dispatch table)
  • buildEncodeSteps(desc) does the same for writes, with pre-encoded tagBytes: Uint8Array (see the sketch after this list)
  • computeMessageSize(desc, msg, sizes) and writeMessageInto(c, desc, msg, sizes) become 3-line iterators over the cached step array
  • Oneof dispatch is table-driven via a per-case Map<string, Step> built at compile time; no linear scan at runtime
  • Map field steps pre-compute outer tag bytes, key tag bytes, and value tag bytes; per-entry body size is still recomputed inline (caching it would require a second identity-keyed cache)
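For concreteness, a hypothetical sketch of two step builders, reusing the `Step`/`Writer` stand-ins from the earlier sketch; the helpers (`encodeTag`, `writeBytes`, `writeVarint32`) are illustrative, not the branch's real ones:

```ts
declare function encodeTag(fieldNo: number, wireType: number): Uint8Array;
declare function writeBytes(c: Writer, bytes: Uint8Array): void;
declare function writeVarint32(c: Writer, value: number): void;

const WIRE_VARINT = 0;

// One specialized closure for a singular int32 field: tag bytes are
// encoded once at build time, then copied verbatim on every encode.
function int32Step(fieldNo: number, localName: string): Step {
  const tagBytes = encodeTag(fieldNo, WIRE_VARINT);
  return (c, msg) => {
    const value = msg[localName] as number;
    if (value === 0) return;       // proto3 implicit default: omit
    writeBytes(c, tagBytes);
    writeVarint32(c, value);
  };
}

// Oneof dispatch: a per-case Map built once at step-construction time;
// at runtime a single Map lookup selects the step for the populated case.
function oneofStep(localName: string, cases: Map<string, Step>): Step {
  return (c, msg, sizes) => {
    const sel = msg[localName] as { case: string } | undefined;
    if (sel !== undefined) cases.get(sel.case)?.(c, msg, sizes);
  };
}
```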

Results

Node 25.8, benchmarks/, run locally:

| Workload                              | ops/s | Δ vs H2 |
|---------------------------------------|------:|--------:|
| SimpleMessage toBinary                | 1.16M | —       |
| SimpleMessage toBinaryFast (H3)       | 1.93M | +66%    |
| OTel 100-span toBinary                |   463 | —       |
| OTel 100-span toBinaryFast (H3)       |   472 | wash    |
| OTel 100-span protobufjs (reference)  | 2,413 | 5.1x    |

The big win lands on flat-scalar schemas (SimpleMessage), where eliminating the dispatch hops dominates. On the OTel shape the bottleneck has moved off dispatch entirely — the remaining ~5x gap vs the pbjs-generated encoder is now dominated by:

  1. protoInt64.enc(...) for every startTimeUnixNano / endTimeUnixNano and every int64 attribute (100 spans × 2 timestamps = 200 bigint→(lo,hi) conversions per encode, plus 100 × ~1 int attribute)
  2. UTF-8 encoding for attribute strings (~1000 strings per 100-span payload)
  3. Uint8Array.set() for 16-byte trace IDs and 8-byte span IDs
  4. SizeMap bookkeeping (new Map() + many sizes.set() calls) during the deep nesting

Closing that gap would need specialization of those paths, not more dispatch removal.
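To illustrate item 1: the unavoidable per-value work is splitting a bigint into 32-bit halves. A bigint-only fast path (hypothetical; the library's protoInt64.enc must also handle string and number inputs, hence the type-detection branch mentioned in the follow-ups) would look roughly like:

```ts
// Hypothetical bigint-only split into (lo, hi) 32-bit halves.
// BigInt bitwise ops use two's-complement semantics, so negative
// 64-bit values round-trip correctly through the | 0 coercion.
function encBigInt(value: bigint): { lo: number; hi: number } {
  const lo = Number(value & 0xffffffffn) | 0;          // low 32 bits, as int32
  const hi = Number((value >> 32n) & 0xffffffffn) | 0; // high 32 bits, as int32
  return { lo, hi };
}
```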

Memory

Per 1000 iterations on the 100-span OTel fixture:

| Variant                 | Heap delta (MB) |
|-------------------------|----------------:|
| toBinary (baseline)     |            14.2 |
| toBinaryFast (H3)       |            54.2 |
| protobufjs (reference)  |            47.5 |

The rise in toBinaryFast comes from transient objects retained by SizeMap and ADT wrappers during a single encode; the per-schema step arrays themselves are stable.

Follow-ups (not in this PR)

  • Specialize protoInt64.enc callsite for the common bigint input (avoid the type-detect branch)
  • Replace SizeMap object-identity keying with an integer handle pre-assigned during the size pass (see the sketch after this list)
  • Inline an ASCII-known branch for string fields whose values the schema marks as canonically ASCII
  • Package-level: emit a codegen variant that bakes the step array at build time (pbjs-style), skipping even the first-touch walk
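A rough sketch of the integer-handle idea (hypothetical `SizePlan` name): the size pass appends body sizes in visit order, and the write pass, which traverses messages in the same order, replays them by cursor, so no Map is allocated per encode:

```ts
// Hypothetical replacement for identity-keyed SizeMap bookkeeping.
class SizePlan {
  private sizes: number[] = [];
  private cursor = 0;

  record(size: number): void {     // size pass: push in visit order
    this.sizes.push(size);
  }

  next(): number {                 // write pass: replay in the same order
    return this.sizes[this.cursor++];
  }

  reset(): void {                  // reuse across encodes, no reallocation
    this.sizes.length = 0;
    this.cursor = 0;
  }
}
```

This trades the per-call Map allocation and identity hashing for a deterministic traversal order shared by the two passes.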

Test plan

  • 2,839 existing tests pass (packages/protobuf-test suite)
  • Byte-identical output on both fixtures:
    • ExportTraceRequest 100 spans: 32,926 B identical
    • SimpleMessage: 19 B identical
  • benchmarks/src/verify-correctness.ts green
  • biome lint clean
  • tsc --noEmit clean

Scope

Internal PR within Connectum-Framework fork. Not proposed upstream.

🤖 Generated with Claude Code

Replaces switch-by-fieldKind dispatch in toBinaryFast with a pre-built
array of closures per DescMessage. Each closure pre-captures
(tagBytes as Uint8Array, localName, scalar-specific writer), eliminating
dispatch overhead and tag re-encoding on the hot path. CSP-safe — no
eval, no new Function(), no dynamic source generation.

Step arrays are built on first touch of a schema and cached in
WeakMap<DescMessage, Step[]>, so the walk of descriptor.fields /
descriptor.oneofs runs exactly once per schema for the lifetime of the
process.

Measurements (Node 25.8, benchmarks/, 1 iteration average):

| Workload / variant                  | ops/s    | Δ vs H2 |
|-------------------------------------|---------:|--------:|
| SimpleMessage toBinary              |    1.16M | —       |
| SimpleMessage toBinaryFast (H3)     |    1.93M | +66%    |
| OTel 100-span toBinary              |      463 | —       |
| OTel 100-span toBinaryFast (H3)     |      472 | wash    |
| OTel 100-span protobufjs (ref)      |    2,413 | 5.1x    |

The big win lands on flat-scalar schemas (SimpleMessage), where
eliminating the switch(fieldKind)/switch(scalar) hops dominates. On the
OTel shape the bottleneck has moved off dispatch entirely — the profile
now points at bigint/varint64 work for ns timestamps, UTF-8 encoding for
attribute strings, Uint8Array.set() for trace/span IDs, and SizeMap
bookkeeping for the deep nesting. Closing the remaining gap vs the
pbjs-generated encoder would need specialization of *those* paths
(ASCII-known branches inlined per field, per-entry size cache avoiding
Map allocation, or a switch to a pre-scanned descriptor plan), not
further dispatch removal.

The H3 cache allocation itself shows up in the memory benchmark — per
1000 iterations the heap delta for toBinaryFast rose from ~19.5 MB
(H2) to ~54 MB. Most of that is transient ADT/value objects retained
by SizeMap during an encode; the per-schema step arrays are amortized
and stable. Follow-ups: (1) replace Object.keys() in map steps with an
iterator-free loop, (2) shrink SizeMap footprint by keying body sizes
on a per-call integer handle instead of the object identity.

Existing 2,839 tests pass. Byte-identical output maintained on both
the OTel 100-span fixture (32,926 bytes) and SimpleMessage (19 bytes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intech intech self-assigned this Apr 19, 2026
@intech intech mentioned this pull request Apr 19, 2026

intech commented Apr 19, 2026

Superseded by L0 (#8) + L1+L2 (#10). The H3 approach hit V8's megamorphic cliff.

@intech intech closed this Apr 19, 2026
@intech intech deleted the feat/prototype-per-schema-codegen branch April 21, 2026 11:11