
Add L3 runtime monomorphization over L1+L2 fast path #18

Closed
intech wants to merge 3 commits into main from feat/l3-runtime-monomorphization

Conversation


intech commented Apr 20, 2026

Summary

Adds L3 runtime monomorphization as an opt-in overlay on top of the L1+L2 fast path in toBinaryFast. Implements the design committed in analysis/p1-t6-l3-design-spec.md (12 pinned decisions D1-D12), adapted to the actual L1+L2 surface on main (direct estimate/write helpers rather than the opcode interpreter described by the spec).

Default behaviour is unchanged. Opt in per call with toBinaryFast(schema, msg, { adaptive: true }) or globally via PROTOBUF_ES_L3=1.
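
For illustration, a minimal opt-in sketch (import paths and the fixture schema below are placeholders; only the `toBinaryFast(schema, msg, { adaptive: true })` signature and the `PROTOBUF_ES_L3` variable come from this PR):

```ts
// Hypothetical usage sketch; import paths and the fixture schema are placeholders.
import { create } from "@bufbuild/protobuf";
import { toBinaryFast } from "./to-binary-fast.js"; // assumed internal path
import { SimpleMessageSchema } from "./gen/simple_pb.js"; // placeholder schema

const msg = create(SimpleMessageSchema, { name: "example" });

// Default: L1+L2 fast path, behaviour unchanged.
const plain = toBinaryFast(SimpleMessageSchema, msg);

// Per-call opt-in to L3 adaptive monomorphization.
const adaptive = toBinaryFast(SimpleMessageSchema, msg, { adaptive: true });

// Global opt-in via the environment: PROTOBUF_ES_L3=1 node app.js
```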

What L3 does

  • Observes message shapes per DescMessage via a slot-presence bitmap (bigint).
  • After L3_WARMUP = 10 observations of the same shape, graduates a specialized plan variant that skips the generic isFieldSet presence gate for known-present fields and drops opcodes for known-absent slots.
  • 4-variant cap (D3): the fifth unique shape seals the record; further novel shapes flow through the generic plan, while already-graduated shapes keep being served.
  • Shape drift after seal remains byte-parity correct (falls back to generic).
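
A minimal sketch of the observation and graduation flow, under simplified types: only `L3_WARMUP`, the bigint presence bitmap, the 4-variant cap, and the seal-on-breach rule come from this PR; `ShapeRecord`, `observe`, and the slot layout are hypothetical.

```ts
// Sketch of the per-DescMessage shape observation described above.
const L3_WARMUP = 10;
const VARIANT_CAP = 4;

interface ShapeRecord {
  count: number;                           // observations of this exact shape
  variant?: (msg: unknown) => Uint8Array;  // graduated specialized encoder
}

interface AdaptiveState {
  shapes: Map<bigint, ShapeRecord>;
  sealed: boolean;                         // set once a fifth unique shape appears
}

// One bit per field slot: set when the field is present on this message.
function shapeHash(presence: boolean[]): bigint {
  let bits = 0n;
  for (let i = 0; i < presence.length; i++) {
    if (presence[i]) bits |= 1n << BigInt(i);
  }
  return bits;
}

function observe(state: AdaptiveState, bits: bigint): ShapeRecord | undefined {
  let rec = state.shapes.get(bits);
  if (rec === undefined) {
    // Novel shape after the seal, or the shape that breaches the cap:
    // it flows through the generic plan.
    if (state.sealed) return undefined;
    if (state.shapes.size >= VARIANT_CAP) {
      state.sealed = true;
      return undefined;
    }
    rec = { count: 0 };
    state.shapes.set(bits, rec);
  }
  rec.count++;
  if (rec.count >= L3_WARMUP && rec.variant === undefined) {
    // Graduate: compile a variant that skips isFieldSet for known-present
    // slots and drops work for known-absent slots (compile step elided).
  }
  return rec;
}
```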

Two execution modes

  • Mode A (CSP-safe, default) — variant = pre-computed VariantStep[]; the executor is a statically-imported interpreter that delegates to the L1+L2 estimate*/write* helpers. Safe under strict CSP.
  • Mode B (CSP-unsafe, opt-in) — per-variant new Function() executor with fully template-generated source (no user data in the source). Enabled by globalThis[Symbol.for('@bufbuild/protobuf.adaptive-codegen')] = true.
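
To make the contrast concrete, a sketch of the two executor strategies; `VariantStep` and the helper signature here are assumptions, not the PR's exact types.

```ts
// Illustrative contrast between the two executors; types are assumptions.
type WriteHelper = (out: number[], value: unknown) => void;

interface VariantStep {
  slot: number;       // field slot known to be present for this shape
  write: WriteHelper; // one of the L1+L2 write* helpers
}

// Mode A: statically-imported interpreter over the precomputed step list.
// No dynamic code generation, so it works under a strict CSP.
function runModeA(steps: VariantStep[], values: unknown[], out: number[]): void {
  for (const step of steps) {
    // No isFieldSet gate: the step list only contains known-present slots.
    step.write(out, values[step.slot]);
  }
}

// Mode B: per-variant generated source with unrolled call sites, giving each
// variant its own monomorphic inline caches. Requires new Function(), hence
// the explicit CSP-unsafe opt-in.
function buildModeB(
  steps: VariantStep[],
): (values: unknown[], out: number[], helpers: WriteHelper[]) => void {
  const body = steps
    .map((s, i) => `helpers[${i}](out, values[${s.slot}]);`) // template only, no user data
    .join("\n");
  return new Function("values", "out", "helpers", body) as (
    values: unknown[],
    out: number[],
    helpers: WriteHelper[],
  ) => void;
}
```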

Gates (5-run median, pinned CPU)

| Fixture | Delta vs L1+L2 | Target |
| --- | --- | --- |
| SimpleMessage multi-shape | +55.5% | >= +10% |
| SimpleMessage single-shape | +40.8% | regression <= 3% |
| Span multi-shape | +19.0% | >= +10% |
| Span single-shape | +12.2% | regression <= 3% |

Byte-parity preserved across:

  • 11 new schema-plan-adaptive tests (shape hashing, graduation, cap, drift, oneof, Mode B)
  • All 16 pre-existing toBinaryFast feature-coverage tests
  • correctness-matrix.test.ts + byte-identity.test.ts

Spec adaptations for current main

The L1+L2 reference implementation on main is the direct estimator/writer function set in to-binary-fast.ts, not the opcode-based schema-plan.ts assumed by the spec (that lives on feat/l1-l2-schema-plans and was not merged to main). The adaptation:

  • VariantPlan carries a compact VariantStep[] rather than a trimmed Int32Array opcode stream. Same monomorphization effect (known-present slot list + unrolled dispatch), fewer moving parts.
  • buildVariantExecutor is replaced by two closures inside compileVariantPlan (Mode A static / Mode B codegen).

All 12 pinned decisions (D1-D12) are honoured. D7 side-table sharing is achieved implicitly because variants delegate back into the generic helpers (no table duplication).

Files changed

  • packages/protobuf/src/wire/schema-plan-adaptive.ts — new (+503 LOC)
  • packages/protobuf/src/to-binary-fast.ts — adaptive routing (+86 LOC)
  • packages/protobuf/package.json — internal subpath export for tests
  • packages/protobuf-test/src/schema-plan-adaptive.test.ts — new (+369 LOC, 11 tests)
  • benchmarks/src/bench-multishape.ts — new (+258 LOC)

Test plan

  • node_modules/.bin/tsc --noEmit --project packages/protobuf/tsconfig.json
  • node_modules/.bin/tsc --noEmit --project packages/protobuf-test/tsconfig.json
  • tsx --test src/schema-plan-adaptive.test.ts — 11/11 pass
  • tsx --test src/to-binary-fast.test.ts — 16/16 pass (no regression)
  • tsx --test src/correctness-matrix.test.ts src/byte-identity.test.ts — all pass
  • taskset -c 0 npx tsx src/bench-multishape.ts x 5 runs — gates pass on median

Draft status

Keeping as draft for per-PR user review before merge. Internal fork only.

intech and others added 2 commits April 20, 2026 23:16
Per design spec (analysis/p1-t6-l3-design-spec.md). Observes message
shapes across first 10 encodes per schema; graduates frequent shapes
to specialized plan variants that skip the generic `isFieldSet`
presence gate for known-present fields and known-absent slots.
4-variant cap with seal-on-breach prevents cache explosion.

Two execution modes:
- Mode A (CSP-safe, default): variant = pre-computed `VariantStep[]`
  list of known-present slots; executor is a statically-imported
  interpreter that delegates to the L1+L2 `estimate*/write*`
  helpers. Safe under strict CSP.
- Mode B (CSP-unsafe, opt-in): per-variant `new Function()` executor
  with unrolled call sites for per-variant IC isolation. Enabled
  via `globalThis[Symbol.for('@bufbuild/protobuf.adaptive-codegen')] = true`.

Spec adaptations for current main:
- The L1+L2 reference implementation on main is the direct estimate/
  write function set in `to-binary-fast.ts` rather than the opcode-
  based `schema-plan.ts` assumed by the spec. The variant plan shape
  therefore drops the opcode trim/filter step and instead carries a
  compact `VariantStep[]` — same monomorphization effect, fewer
  moving parts for this code base.
- `buildVariantExecutor` is replaced by the two closures (Mode A
  static / Mode B codegen) in `compileVariantPlan` with identical
  semantics.

Gates (5-run median on pinned CPU):
- Byte-parity: preserved across 11 new L3 tests + 16 pre-existing
  toBinaryFast tests + correctness-matrix.
- SimpleMessage multi-shape:   +55.5%  (spec target: >= +10%)
- SimpleMessage single-shape:  +40.8%  (spec target: regression <= 3%)
- Span multi-shape:            +19.0%  (spec target: >= +10%)
- Span single-shape:           +12.2%  (spec target: regression <= 3%)
- Memory overhead: bounded by D3 + D7 (shared side tables).

Opt-in: `toBinaryFast(schema, msg, { adaptive: true })` or
`PROTOBUF_ES_L3=1`. Default behaviour unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…chains

- schema-plan-adaptive.ts: remove suppressions for a rule biome isn't
  enforcing in this project config
- schema-plan-adaptive.ts + to-binary-fast.ts: collapse defensive
  globalThis process env lookups into optional chain form
- biome.json: ignore gen/gen-protobufjs/.tmp from root scope so turbo
  lint doesn't catch pbjs-generated files and scratch dirs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented Apr 20, 2026

Benchmark: no regressions

Thresholds: throughput regression >5%, memory regression >10%. Runner pinned to CPU 0 via taskset. Current run on linux/x64, Node v22.22.2, captured 2026-04-20T19:55:39.913Z.
Baseline captured 2026-04-20T17:34:45.007Z on linux/x64, Node v22.22.2.

Summary: 0 regressed, 3 improved, 0 new, 17 unchanged.

| Fixture | Baseline ops/s | PR ops/s | Δ ops | Baseline B/op | PR B/op | Δ mem | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SimpleMessage :: toBinary (pre-built, 19 B) | 849,817 | 891,706 | +4.9% | | | | ok |
| ExportTraceRequest (100 spans) :: toBinary (pre-built, 32926 B) | 1,231 | 1,282 | +4.1% | | | | ok |
| ExportMetricsRequest (50 series) :: toBinary (pre-built, 17696 B) | 2,168 | 2,242 | +3.4% | | | | ok |
| ExportLogsRequest (100 records) :: toBinary (pre-built, 21319 B) | 2,171 | 2,235 | +3.0% | | | | ok |
| K8sPodList (20 pods) :: toBinary (pre-built, 28900 B) | 2,342 | 2,498 | +6.6% | | | | improved |
| GraphQLRequest :: toBinary (pre-built, 624 B) | 176,305 | 184,415 | +4.6% | | | | ok |
| GraphQLResponse :: toBinary (pre-built, 1366 B) | 236,876 | 245,317 | +3.6% | | | | ok |
| RpcRequest :: toBinary (pre-built, 501 B) | 296,046 | 314,074 | +6.1% | | | | improved |
| RpcResponse :: toBinary (pre-built, 602 B) | 434,888 | 448,977 | +3.2% | | | | ok |
| StressMessage (depth=8, width=200) :: toBinary (pre-built, 12868 B) | 7,860 | 8,299 | +5.6% | | | | improved |
| SimpleMessage :: fromBinary (19 B) | 1,020,891 | 1,046,590 | +2.5% | | | | ok |
| ExportTraceRequest (100 spans) :: fromBinary (32926 B) | 599.8 | 621.6 | +3.6% | | | | ok |
| ExportMetricsRequest (50 series) :: fromBinary (17696 B) | 1,149 | 1,187 | +3.4% | | | | ok |
| ExportLogsRequest (100 records) :: fromBinary (21319 B) | 1,073 | 1,106 | +3.0% | | | | ok |
| K8sPodList (20 pods) :: fromBinary (28900 B) | 1,398 | 1,410 | +0.8% | | | | ok |
| GraphQLRequest :: fromBinary (624 B) | 300,513 | 303,615 | +1.0% | | | | ok |
| GraphQLResponse :: fromBinary (1366 B) | 265,540 | 271,515 | +2.2% | | | | ok |
| RpcRequest :: fromBinary (501 B) | 269,405 | 273,347 | +1.5% | | | | ok |
| RpcResponse :: fromBinary (602 B) | 378,014 | 385,593 | +2.0% | | | | ok |
| StressMessage (depth=8, width=200) :: fromBinary (12868 B) | 4,046 | 4,033 | -0.3% | | | | ok |

Produced by benchmarks/scripts/compare-results.ts. Artifacts: bench-results-<pr> (current), bench-baseline-main (baseline).


intech commented Apr 20, 2026

Closing as not-ready after CI evaluation.

CI bench-matrix (single-shape repeated encode, pinned median-of-5) showed L3 net negative on realistic workloads:

| Fixture | Δ ops/s |
| --- | --- |
| ExportTraceRequest toBinary | -5.9% |
| ExportLogsRequest toBinary | -5.3% |
| K8sPodList fromBinary | -5.3% |
| StressMessage fromBinary | -6.6% |
| SimpleMessage toBinary | +8.0% |
| RpcResponse toBinary | +5.1% |

Root cause: L3 adds `shapeHash()` + `variants.get()` overhead per encode. On single-shape workloads (which dominate real traffic for a given schema) this overhead is not amortized by the variant specialization.
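
A rough illustration of that per-encode cost (not the PR's code; names are placeholders):

```ts
// On a single-shape workload, every call still pays these two steps before
// it can delegate to the unchanged L1+L2 helpers.
function adaptiveOverhead(presence: boolean[], variants: Map<bigint, unknown>): unknown {
  // 1. bigint shape hash: one shift/or per field slot, plus BigInt allocation
  let bits = 0n;
  for (let i = 0; i < presence.length; i++) {
    if (presence[i]) bits |= 1n << BigInt(i);
  }
  // 2. Map lookup keyed by the bigint
  return variants.get(bits);
  // Only after both steps does the call reach the variant or generic encoder.
}
```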

Custom multi-shape bench on my host showed +19-55%, but that's a niche scenario and the matrix — which mirrors typical deployments — regressed on nested-message fixtures.

Deferred, not abandoned

  • Cheaper shape hashing (Int32Array of field-presence bits, no BigInt, no Map lookup) could eliminate the per-encode overhead; a sketch follows this list
  • Conditional activation only after observing ≥3 distinct shapes (not on first encode)
  • bench-multishape.ts should land in CI matrix separately before another L3 attempt
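
A sketch of the cheaper-hashing idea from the first bullet, assuming an Int32Array scratch buffer sized to the schema's field count; the mixing constant and folding scheme are assumptions, not a committed design.

```ts
// Illustrative only: pack field presence into a reusable Int32Array instead
// of a bigint, then fold it into a plain 32-bit key.
function presenceHash32(presence: boolean[], scratch: Int32Array): number {
  scratch.fill(0);
  for (let i = 0; i < presence.length; i++) {
    if (presence[i]) scratch[i >> 5] |= 1 << (i & 31); // 32 field slots per word
  }
  // Fold the words into one 32-bit key that could index a small flat variant
  // table, avoiding both the BigInt allocation and the Map lookup per encode.
  let h = 0;
  for (let w = 0; w < scratch.length; w++) {
    h = Math.imul(h ^ scratch[w], 0x9e3779b1) | 0;
  }
  return h >>> 0;
}
```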

Next steps

Keeping the L0+L1+L2 stack on main as the shipped improvement. Current status vs pbjs: 0.80x on OTel traces without codegen (up from the 0.18x baseline before PR #8). L3 micro-optimizations are not worth the additional scope given this floor.

@intech intech closed this Apr 20, 2026
@intech intech deleted the feat/l3-runtime-monomorphization branch April 21, 2026 11:10