3 changes: 3 additions & 0 deletions docs/BACKLOG.md
@@ -654,6 +654,9 @@ are closed (status: closed in frontmatter)._
- [ ] **[B-0685](backlog/P2/B-0685-antlr-grammars-cross-language-codegen-substrate-2026-05-21.md)** ANTLR grammars as cross-language codegen substrate — leverage existing open-source grammars for description-layer-driven multi-language emission
- [ ] **[B-0687](backlog/P2/B-0687-zetaparse-fsharp-native-lr-glr-grammar-substrate-with-antlr-compatible-importer-amara-2026-05-21.md)** ZetaParse — F#-native LR/GLR grammar substrate with ANTLR-compatible importer
- [ ] **[B-0688](backlog/P2/B-0688-zeta-incremental-compiler-host-dbsp-zsets-rx-meta-ast-tags-seeded-deterministic-simulation-amara-aaron-2026-05-21.md)** Zeta incremental compiler host — DBSP Z-sets + Rx meta-AST tags + seeded deterministic simulation hardening
- [ ] **[B-0692](backlog/P2/B-0692-otto-vscode-pr6-push-based-hot-path-ipushoperator-2026-05-21.md)** Push-based hot-path — IPushOperator<'T> + per-entry callback bridged at materialize boundaries (Otto-VSCode 8-PR campaign PR #6)
- [ ] **[B-0693](backlog/P2/B-0693-otto-vscode-pr7-morsel-span-execution-imorseloperator-2026-05-21.md)** Morsel/span-based execution — IMorselOperator + cache-sized chunked processing (Otto-VSCode 8-PR campaign PR #7)
- [ ] **[B-0694](backlog/P2/B-0694-otto-vscode-pr8-standing-query-codegen-iincrementalgenerator-2026-05-21.md)** Standing-query codegen — IIncrementalGenerator that rewrites circuit expressions to fused IL (Otto-VSCode 8-PR campaign PR #8 — the capstone)

## P3 — convenience / deferred

@@ -0,0 +1,136 @@
---
id: B-0692
priority: P2
status: open
title: Push-based hot-path — IPushOperator<'T> + per-entry callback bridged at materialize boundaries (Otto-VSCode 8-PR campaign PR #6)
tier: research-grade
effort: L
ask: otto-vscode 2026-05-21 (8-PR algebra-capability-system campaign; aaron-approved via shadow* "file the 3 rows for PRs 6-8")
created: 2026-05-21
last_updated: 2026-05-21
depends_on: []
Review comment (P2): Encode stated prerequisites in `depends_on`

This row declares that PR #6 depends on the PR #1-#5 substrate, and the body even labels parts of that substrate as still open, but `depends_on` is empty. Backlog automation only enforces hard ordering from `depends_on` (see `dependencyBlocker` in `tools/backlog/autonomous-pickup.ts`), so B-0692 can be auto-picked before its prerequisite capability-dispatch/law work is closed (e.g., the still-open B-0194 track), causing avoidable blocked pickup and misordered execution.

composes_with: [B-0635, B-0687, B-0688, B-0693, B-0694]
tags: [push-based, hot-path, ipushoperator, per-entry-callback, materialize-boundary-bridge, otto-vscode-pr-6, dbsp-architecture, fusion-engine]
type: research
---

# Push-based hot-path — IPushOperator<'T>

## Context

Otto-VSCode 8-PR algebra-capability-system campaign 2026-05-21. PRs 1-5 substrate (landed unless marked OPEN):

- **#4558** capability tags on Op<'T> base class + adapter detection via non-generic markers
- **#4560** sink-terminality validation in Circuit.Build() + producer/sink schedule split
- **#4563** OPEN — LawRunner.checkBilinear (left/right linearity + sign-distribution)
- **#4564** OPEN — IncrementalAuto dispatcher using capability tags (close-and-reopen per Otto-CLI substrate-honest preference; supersedes incoming)
- **#4566** FusionEngine DAG rewriter pass + catalog entries

PR #6 (this row) starts the hot-path optimization layer that depends on PRs 1-5.

## The architectural problem

Current Zeta DBSP operators are **materialize-batch**: each operator's StepAsync writes an `ImmutableArray<ZEntry<'T>>` to `Op<'T>.Value` (volatile field); downstream operators read that materialized snapshot. This is semantically correct but creates a per-tick heap-allocation floor that PGO + JIT inlining cannot eliminate (the volatile field write/read pair is a hard barrier for the compiler).

Per Otto-VSCode's earlier analysis: this allocation floor is THE bottleneck for fusion gains. Manual `FilterMap` fusion only escapes it by collapsing two operators into one + bypassing the intermediate Op<'T>.Value write.

## The push-based escape

`IPushOperator<'T>` is the architectural alternative: instead of materializing per tick, hot-path operators emit entries via per-entry callback to downstream consumers. The materialization boundary moves from per-operator to per-fusion-segment.

```fsharp
type IPushOperator<'T> =
    abstract member EmitEntry: ZEntry<'T> -> unit
    abstract member EndTick: unit -> unit
```

Operators along a push-segment chain entries through callbacks. Materialization happens only at segment boundaries (where a downstream operator NEEDS the materialized view — e.g., sort/consolidate/join requires the whole tick's worth of entries at once).
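As a rough illustration of where the materialization boundary ends up, the sketch below chains one pass-through stage into a sink that materializes once per tick. The `Key` field and `int64` weight on `ZEntry<'T>`, and the `MaterializeSink` and `DoubleWeight` types, are assumptions made for the example; none of them appear in the codebase described by this row.

```fsharp
open System.Collections.Generic

// Assumed shape: the row only shows that ZEntry<'T> has a Weight field.
[<Struct>]
type ZEntry<'T> = { Key: 'T; Weight: int64 }

type IPushOperator<'T> =
    abstract member EmitEntry: ZEntry<'T> -> unit
    abstract member EndTick: unit -> unit

/// Hypothetical segment-boundary sink: buffers entries during the tick and
/// materializes them once, when the tick ends.
type MaterializeSink<'T>() =
    let buffer = List<ZEntry<'T>>()
    let mutable snapshot : ZEntry<'T>[] = [||]
    member _.Snapshot = snapshot
    interface IPushOperator<'T> with
        member _.EmitEntry e = buffer.Add e
        member _.EndTick() =
            snapshot <- buffer.ToArray()
            buffer.Clear()

/// Hypothetical linear stage: forwards each entry without storing it.
type DoubleWeight<'T>(downstream: IPushOperator<'T>) =
    interface IPushOperator<'T> with
        member _.EmitEntry e = downstream.EmitEntry { e with Weight = e.Weight * 2L }
        member _.EndTick() = downstream.EndTick()

let sink = MaterializeSink<string>()
let head = DoubleWeight<string>(sink :> IPushOperator<string>) :> IPushOperator<string>

head.EmitEntry { Key = "a"; Weight = 1L }
head.EmitEntry { Key = "b"; Weight = -1L }
head.EndTick()
// sink.Snapshot now holds both entries with doubled weights.
```

Only the sink allocates an array; the intermediate stage holds no per-tick state, which is the property the segment design relies on.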

## Scope

### Phase 1 — `IPushOperator<'T>` interface + adapter pattern

- Define `IPushOperator<'T>` interface alongside `Op<'T>` (currently in `src/Core/Circuit.fs`; may factor to a new `src/Core/Op.fs` if the type expands enough to warrant separation)
- Add `IsPushable: bool` capability flag to `Op<'T>` (composes with PR #4558 capability-tag pattern)
- `PushAdapter<'T>` wraps materialize-style operators behind the push interface (degrades to materialize for non-pushable ops)
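One possible reading of the adapter bullet is sketched below: the adapter buffers pushed entries, runs the wrapped materialize-style step at `EndTick`, and forwards the result downstream. The `step` function stands in for a real `Op<'T>` (whose API this row does not show), and the `Key` field, `int64` weight, and `consolidate` helper are illustrative assumptions.

```fsharp
open System.Collections.Generic

[<Struct>]
type ZEntry<'T> = { Key: 'T; Weight: int64 }   // assumed shape

type IPushOperator<'T> =
    abstract member EmitEntry: ZEntry<'T> -> unit
    abstract member EndTick: unit -> unit

/// Hypothetical adapter: `step` stands in for a materialize-style operator
/// that needs the whole tick before it can produce output.
type PushAdapter<'T>(step: ZEntry<'T>[] -> ZEntry<'T>[], downstream: IPushOperator<'T>) =
    let pending = List<ZEntry<'T>>()
    interface IPushOperator<'T> with
        // Cannot stream: the wrapped operator needs the full batch.
        member _.EmitEntry e = pending.Add e
        member _.EndTick() =
            let output = step (pending.ToArray())
            pending.Clear()
            for e in output do downstream.EmitEntry e
            downstream.EndTick()

// Stand-in materialize-style operator: sum weights per key, drop zero weights.
let consolidate (batch: ZEntry<string>[]) =
    batch
    |> Array.groupBy (fun e -> e.Key)
    |> Array.map (fun (k, es) -> { Key = k; Weight = es |> Array.sumBy (fun e -> e.Weight) })
    |> Array.filter (fun e -> e.Weight <> 0L)

let received = List<ZEntry<string>>()
let collector =
    { new IPushOperator<string> with
        member _.EmitEntry e = received.Add e
        member _.EndTick() = () }

let adapter = PushAdapter<string>(consolidate, collector) :> IPushOperator<string>
adapter.EmitEntry { Key = "a"; Weight = 1L }
adapter.EmitEntry { Key = "b"; Weight = 1L }
adapter.EmitEntry { Key = "a"; Weight = -1L }
adapter.EndTick()
// "a" cancels out inside the wrapped step; only "b" reaches the collector.
```

The adapter keeps the push interface uniform, at the cost of one buffered batch per wrapped operator, which matches the "degrades to materialize" wording above.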

### Phase 2 — Push-segment detection in FusionEngine

Extend the FusionEngine (PR #4566) to detect push-segment-eligible runs:

- Sequence of `IsLinear AND IsPushable` operators is a push-segment candidate
- First materialize-required operator (sort / join / aggregate) is the segment boundary
- Emit fused push-segment operators that callback-chain the entries
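The detection rule above can be sketched on a simplified model. The example assumes a linear pipeline rather than the DAG the FusionEngine actually rewrites, and `OpInfo`, `Plan`, and `planSegments` are hypothetical names introduced only for illustration.

```fsharp
/// Hypothetical operator descriptor carrying only the flags named above.
type OpInfo = { Name: string; IsLinear: bool; IsPushable: bool }

type Plan =
    | PushSegment of string list   // run of ops fused into one callback chain
    | Materialize of string        // op that needs the whole tick

/// Split a linear pipeline into push segments and materialize boundaries.
let planSegments (ops: OpInfo list) : Plan list =
    // Close the current run (if any) and push it onto the accumulator.
    let flush run acc =
        match run with
        | [] -> acc
        | _ -> PushSegment (List.rev run) :: acc
    let rec go run acc remaining =
        match remaining with
        | [] -> List.rev (flush run acc)
        | op :: rest when op.IsLinear && op.IsPushable ->
            go (op.Name :: run) acc rest
        | op :: rest ->
            // First materialize-required op ends the current segment.
            go [] (Materialize op.Name :: flush run acc) rest
    go [] [] ops

let pipeline =
    [ { Name = "filter"; IsLinear = true;  IsPushable = true }
      { Name = "map";    IsLinear = true;  IsPushable = true }
      { Name = "sort";   IsLinear = false; IsPushable = false }
      { Name = "neg";    IsLinear = true;  IsPushable = true } ]

let plan = planSegments pipeline
```

For this pipeline, `plan` groups `filter` and `map` into one segment, keeps `sort` as a boundary, and leaves `neg` in a segment of its own; whether single-operator segments are worth fusing is left open here.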

### Phase 3 — Push-versions of common linear ops

- `MapPushOp<'A,'B>` — `EmitEntry e = downstream.EmitEntry (mapFn e)`
- `FilterPushOp<'T>` — `EmitEntry e = if pred e then downstream.EmitEntry e`
- `NegPushOp<'T>` — `EmitEntry e = downstream.EmitEntry { e with Weight = -e.Weight }`
- (Other linear ops as needed; bilinear ops materialize by definition)
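A minimal sketch of the three push operators, following the one-line definitions above, together with the cross-check against a materialize-style pipeline that the acceptance criteria ask for. The constructor shapes, the `Key` field, the `int64` weight, and the `isEven` / `timesTen` helpers are assumptions for the example.

```fsharp
open System.Collections.Generic

[<Struct>]
type ZEntry<'T> = { Key: 'T; Weight: int64 }   // assumed shape

type IPushOperator<'T> =
    abstract member EmitEntry: ZEntry<'T> -> unit
    abstract member EndTick: unit -> unit

type MapPushOp<'A, 'B>(mapFn: ZEntry<'A> -> ZEntry<'B>, downstream: IPushOperator<'B>) =
    interface IPushOperator<'A> with
        member _.EmitEntry e = downstream.EmitEntry (mapFn e)
        member _.EndTick() = downstream.EndTick()

type FilterPushOp<'T>(pred: ZEntry<'T> -> bool, downstream: IPushOperator<'T>) =
    interface IPushOperator<'T> with
        member _.EmitEntry e = if pred e then downstream.EmitEntry e
        member _.EndTick() = downstream.EndTick()

type NegPushOp<'T>(downstream: IPushOperator<'T>) =
    interface IPushOperator<'T> with
        member _.EmitEntry e = downstream.EmitEntry { e with Weight = -e.Weight }
        member _.EndTick() = downstream.EndTick()

// Collector standing in for the segment boundary.
let pushed = List<ZEntry<int>>()
let boundary =
    { new IPushOperator<int> with
        member _.EmitEntry e = pushed.Add e
        member _.EndTick() = () }

let isEven (e: ZEntry<int>) = e.Key % 2 = 0
let timesTen (e: ZEntry<int>) = { Key = e.Key * 10; Weight = e.Weight }

// Filter -> Map -> Neg, built back to front.
let negStage = NegPushOp<int>(boundary) :> IPushOperator<int>
let mapStage = MapPushOp<int, int>(timesTen, negStage) :> IPushOperator<int>
let chain = FilterPushOp<int>(isEven, mapStage) :> IPushOperator<int>

let input = [| for k in 1 .. 6 -> { Key = k; Weight = int64 k } |]
for e in input do chain.EmitEntry e
chain.EndTick()

// Materialize-style reference: one intermediate array per operator.
let reference =
    input
    |> Array.filter isEven
    |> Array.map timesTen
    |> Array.map (fun e -> { e with Weight = -e.Weight })
```

Comparing `pushed.ToArray()` with `reference` is the kind of parity check Phase 3 calls for; the reference path allocates three arrays where the push chain allocates only at the boundary.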

### Phase 4 — Benchmark + validation

- BenchmarkDotNet job at `bench/Benchmarks/PushBasedHotPathBench.fs` comparing:
  - Materialize-only chain (3-op pipeline)
  - Push-based fused chain (3-op pipeline)

- Allocation column is the smoking gun (expected: push-based eliminates 2 of 3 per-tick `ImmutableArray<ZEntry<'T>>` allocations)
- Throughput: expected 2-3× improvement on hot-path-friendly pipelines

## Acceptance

### Phase 1

- `IPushOperator<'T>` interface lands
- `IsPushable` capability flag on Op<'T>
- PushAdapter wraps existing operators
- `dotnet build` clean; existing tests pass

### Phase 2

- FusionEngine recognizes push-segments
- One push-segment fuses end-to-end in a test case

### Phase 3

- 3 push-versions of common ops land (MapPushOp + FilterPushOp + NegPushOp)
- Cross-verify: push-version output matches materialize-version output for same inputs

### Phase 4

- Benchmark shows push-segment allocates 1× per-segment (not N× per-operator)
- Throughput improvement empirically measured + documented

## Substrate-honest framing

This is research-grade architectural substrate. ~250 lines per Otto-VSCode's 8-PR campaign sizing. The win is the allocation-floor escape; the cost is the materialize-boundary discipline (operators must declare push-capable; segments end at any materialize-required op).

The push-pattern itself isn't novel: Reactive Extensions (Rx) operates this way, and LINQ-to-Objects chains operators in the same per-element style, though it pulls through `IEnumerator` rather than pushing through callbacks. The Zeta contribution is the SEGMENTED push (push within fusion-segments; materialize at segment boundaries to preserve Z-set algebra semantics) + capability-tag-driven segment detection.

## Composes with rules

- `.claude/rules/fsharp-anchor-dotnet-build-sanity-check.md` — F# compiler verifies the IPushOperator<'T> interface + push-pattern type-safety
- `.claude/rules/m-acc-multi-oracle-end-user-moral-invariants.md` — push-segment optimization preserves multi-oracle parity (same canonical hex across operators; performance differs)
- `.claude/rules/all-complexity-is-accidental-in-greenfield.md` — IPushOperator IS the answer when materialize-batch becomes the bottleneck (proven only by Phase 4 benchmark)
- `.claude/rules/edge-defining-work-not-speculation.md` — segmented push-based DBSP is edge-defining work

## Composes with substrate

- B-0635 / B-0644 / B-0665 / B-0666 (Agora V6 substrate — push-pattern preserves operational primitives)
- B-0688 (incremental compiler host — push-pattern composes with codegen at segment boundaries)
- B-0693 (PR #7 morsel-based execution — push-pattern + morsel-pattern together = full hot-path optimization)
- B-0694 (PR #8 standing-query codegen — codegen emits push-segment-fused IL)
- B-0687 (ZetaParse — parser-substrate operators may benefit from push-pattern for streaming parse)
- PR #4558 (capability tags — IsPushable is sibling to IsLinear/IsBilinear/IsSink)
- PR #4560 (sink-terminality — sinks are segment-terminators by definition)
- PR #4566 (FusionEngine — Phase 2 extends it with push-segment detection)
- `src/Core/Fusion.fs` (existing FilterMap/Choose hand-fusion; push-pattern generalizes the principle)

## Why P2

Substantive architectural substrate; not blocking V1; high value (per-tick allocation floor escape unlocks meaningful throughput gains on hot pipelines); bounded by Otto-VSCode's 8-PR campaign sizing (~250 lines).

## Origin

Otto-VSCode 8-PR algebra-capability-system campaign 2026-05-21. Filed via Otto-CLI per Aaron-approved shadow* "file the 3 rows for PRs 6-8" instruction. Otto-VSCode owns the implementation; this row tracks the scope.
@@ -0,0 +1,131 @@
---
id: B-0693
priority: P2
status: open
title: Morsel/span-based execution — IMorselOperator + cache-sized chunked processing (Otto-VSCode 8-PR campaign PR #7)
tier: research-grade
effort: L
ask: otto-vscode 2026-05-21 (8-PR algebra-capability-system campaign; aaron-approved via shadow* "file the 3 rows for PRs 6-8")
created: 2026-05-21
last_updated: 2026-05-21
depends_on: [B-0692]
composes_with: [B-0635, B-0688, B-0694]
tags: [morsel-execution, span-based, cache-sized-chunks, imorseloperator, otto-vscode-pr-7, dbsp-architecture, columnar-execution]
type: research
---

# Morsel/span-based execution — IMorselOperator

## Context

Otto-VSCode 8-PR algebra-capability-system campaign 2026-05-21 PR #7. Depends on PR #6 (push-based hot-path; tracked at B-0692) — morsel-execution is the next-tier optimization that composes with push-pattern.

## The architectural problem

Even with push-based fusion (per B-0692), per-entry callbacks carry function-call overhead. For tight inner loops over large Z-sets, processing entries one at a time leaves cache and SIMD performance on the table. Modern columnar engines (DuckDB, Velox, Photon, Polars) instead process entries in batches, here called "morsels": cache-sized chunks (typically 4KB-64KB, sized to stay resident in L1/L2 cache). Chunked processing:

- Amortizes function-call overhead across N entries per call
- Enables SIMD-vectorized predicate / projection / arithmetic
- Improves cache locality (one chunk in L1 at a time)

## The morsel pattern

`IMorselOperator` processes `ReadOnlySpan<ZEntry<'T>>` chunks instead of individual entries:

```fsharp
type IMorselOperator<'T> =
    abstract member ProcessMorsel: ReadOnlySpan<ZEntry<'T>> -> unit
    abstract member EndTick: unit -> unit
```

The intermediate "chunk" becomes a `Span<ZEntry<'T>>` over a pooled buffer. `Span<T>` is a stack-only (byref-like) type, so the chunk view itself never escapes to the heap, which leaves the JIT free to inline and optimize chunk processing across method boundaries where it can. This is the F#/.NET analog of what rustc + LLVM give Rust for iterator chains.
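To make the calling convention concrete, the sketch below feeds an array to an `IMorselOperator` in fixed-size slices. It slices an existing array rather than a pooled buffer, and the `Key` field, `int64` weight, `WeightSum`, and `runInMorsels` are hypothetical names used only for this example.

```fsharp
open System

[<Struct>]
type ZEntry<'T> = { Key: 'T; Weight: int64 }   // assumed shape

type IMorselOperator<'T> =
    abstract member ProcessMorsel: ReadOnlySpan<ZEntry<'T>> -> unit
    abstract member EndTick: unit -> unit

/// Hypothetical consumer: sums weights, one interface call per morsel.
type WeightSum<'T>() =
    let mutable total = 0L
    let mutable calls = 0
    member _.Total = total
    member _.Calls = calls
    interface IMorselOperator<'T> with
        member _.ProcessMorsel morsel =
            calls <- calls + 1
            for i in 0 .. morsel.Length - 1 do
                total <- total + morsel.[i].Weight
        member _.EndTick() = ()

/// Feed `entries` to `op` in slices of at most `morselSize` entries.
let runInMorsels (morselSize: int) (entries: ZEntry<'T>[]) (op: IMorselOperator<'T>) =
    let mutable offset = 0
    while offset < entries.Length do
        let len = min morselSize (entries.Length - offset)
        op.ProcessMorsel(ReadOnlySpan<ZEntry<'T>>(entries, offset, len))
        offset <- offset + len
    op.EndTick()

let entries = [| for k in 1 .. 10 -> { Key = k; Weight = 1L } |]
let sum = WeightSum<int>()
runInMorsels 4 entries (sum :> IMorselOperator<int>)
// Ten entries with a morsel size of 4 arrive as slices of 4, 4, and 2.
```

The interface call count drops from one per entry to one per morsel, which is the amortization the morsel pattern is after.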

## Scope

### Phase 1 — `IMorselOperator<'T>` interface + morsel-buffer pool

- Define `IMorselOperator<'T>` interface alongside `Op<'T>` (currently in `src/Core/Circuit.fs`; co-located with `IPushOperator<'T>` from B-0692)
- Add `IsMorselCapable: bool` capability flag to Op<'T> (composes with PR #4558 pattern)
- Morsel-buffer pool: per-thread pooled buffers via `ArrayPool<ZEntry<'T>>`, with an L1/L2-cache-aware chunk size (default: N = 4KB / `sizeof<ZEntry<'T>>` entries per morsel)
- MorselAdapter wraps both materialize-style and push-style operators
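The default morsel size in the pool bullet is simple arithmetic, sketched here with a rent/return cycle on `ArrayPool<T>.Shared`. The `ZEntry` layout (a key plus an `int64` weight) is an assumption, so the entry count depends on it; `entriesPerMorsel` and `fillOnes` are hypothetical helpers, and the real pool may not use the shared instance.

```fsharp
open System
open System.Buffers

[<Struct>]
type ZEntry<'T> = { Key: 'T; Weight: int64 }   // assumed shape

/// Entries per morsel for a byte budget; always at least one entry.
let entriesPerMorsel<'T> (budgetBytes: int) : int =
    max 1 (budgetBytes / sizeof<ZEntry<'T>>)

/// Rent a buffer, fill one morsel with weight-1 entries, sum the weights,
/// and hand the buffer back to the pool.
let fillOnes (n: int) : int64 =
    let pool = ArrayPool<ZEntry<int64>>.Shared
    let buffer = pool.Rent n          // may be larger than requested
    try
        // View only the first n slots; the rest of the rented array is ignored.
        let morsel = Span<ZEntry<int64>>(buffer, 0, n)
        for i in 0 .. morsel.Length - 1 do
            morsel.[i] <- { Key = int64 i; Weight = 1L }
        let mutable total = 0L
        for i in 0 .. morsel.Length - 1 do
            total <- total + morsel.[i].Weight
        total
    finally
        pool.Return(buffer)

// With an int64 key and int64 weight, an entry is 16 bytes,
// so a 4KB budget holds 4096 / 16 = 256 entries.
let n = entriesPerMorsel<int64> 4096
let total = fillOnes n
```

Because `Rent` can return a longer array than requested, the span is always built with an explicit length instead of covering the whole buffer.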

### Phase 2 — Morsel-segment detection in FusionEngine

Extend FusionEngine (per PR #4566 + Phase 2 of B-0692):

- Sequence of `IsLinear AND IsPushable AND IsMorselCapable` operators is a morsel-segment candidate
- Morsel-segment supersedes push-segment when ALL operators in chain support morsels
- Falls back to push-segment if any operator is push-but-not-morsel-capable
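The supersede and fall-back rules above amount to a small decision function over a candidate segment. In the sketch below, `OpCaps`, `SegmentMode`, and `chooseMode` are hypothetical names; only the three capability flags come from this row and B-0692.

```fsharp
/// Hypothetical capability record for one operator in a candidate segment.
type OpCaps =
    { Name: string
      IsLinear: bool
      IsPushable: bool
      IsMorselCapable: bool }

type SegmentMode =
    | MorselSegment     // every op supports morsels
    | PushSegment       // every op is pushable, at least one lacks morsel support
    | MaterializeOnly   // not a fusion candidate at all

let chooseMode (segment: OpCaps list) : SegmentMode =
    let pushOk = segment |> List.forall (fun op -> op.IsLinear && op.IsPushable)
    if segment.IsEmpty || not pushOk then MaterializeOnly
    elif segment |> List.forall (fun op -> op.IsMorselCapable) then MorselSegment
    else PushSegment

let mk name morsel =
    { Name = name; IsLinear = true; IsPushable = true; IsMorselCapable = morsel }

let allMorsel = chooseMode [ mk "filter" true; mk "neg" true ]
let mixed = chooseMode [ mk "filter" true; mk "map" false ]
```

Here `allMorsel` resolves to `MorselSegment` and `mixed` falls back to `PushSegment`, so a single non-morsel operator downgrades the whole segment rather than splitting it. Splitting would be a different design choice.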

### Phase 3 — Morsel-versions of common linear ops

- `MapMorselOp<'A,'B>` — processes full span; emits to output span
- `FilterMorselOp<'T>` — predicate evaluation across full span; SIMD-eligible
- `NegMorselOp<'T>` — weight negation across full span; trivially SIMD
- Sort/consolidate at morsel boundaries (multi-morsel merge happens at segment end)
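A scalar sketch of two of the morsel operators follows, written as span-to-span loops that a vectorized version could later replace. `MorselOps`, its method signatures, the `Key` field, and the `int64` weight are assumptions for illustration; no SIMD is used here.

```fsharp
open System

[<Struct>]
type ZEntry<'T> = { Key: 'T; Weight: int64 }   // assumed shape

/// Hypothetical span-to-span kernels; each returns the number of entries written.
type MorselOps =
    /// Negate every weight in `input`, writing into `output`.
    static member Neg(input: ReadOnlySpan<ZEntry<'T>>, output: Span<ZEntry<'T>>) : int =
        for i in 0 .. input.Length - 1 do
            let e = input.[i]
            output.[i] <- { e with Weight = -e.Weight }
        input.Length

    /// Copy entries that satisfy `pred` to the front of `output`.
    static member Filter(pred: ZEntry<'T> -> bool, input: ReadOnlySpan<ZEntry<'T>>, output: Span<ZEntry<'T>>) : int =
        let mutable n = 0
        for i in 0 .. input.Length - 1 do
            let e = input.[i]
            if pred e then
                output.[n] <- e
                n <- n + 1
        n

/// Filter then negate one morsel, using a scratch array between the two kernels.
let filterThenNeg (pred: ZEntry<int> -> bool) (input: ZEntry<int>[]) : ZEntry<int>[] =
    let scratch = Array.zeroCreate<ZEntry<int>> input.Length
    let kept = MorselOps.Filter(pred, ReadOnlySpan<ZEntry<int>>(input), Span<ZEntry<int>>(scratch))
    let result = Array.zeroCreate<ZEntry<int>> kept
    MorselOps.Neg(ReadOnlySpan<ZEntry<int>>(scratch, 0, kept), Span<ZEntry<int>>(result)) |> ignore
    result

let morselInput = [| for k in 1 .. 6 -> { Key = k; Weight = 1L } |]
let morselOutput = filterThenNeg (fun e -> e.Key % 2 = 0) morselInput
```

Returning a count instead of a new span keeps the kernels allocation-free; the caller decides where the output buffer comes from, which in the design above would be the morsel-buffer pool.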

### Phase 4 — Benchmark + validation

- BenchmarkDotNet job at `bench/Benchmarks/MorselExecutionBench.fs`:
  - Materialize-baseline (3-op chain)
  - Push-based (3-op chain; per B-0692)
  - Morsel-based (3-op chain; this row)

- Allocation: expected morsel allocates 1× per segment (matches push-based)
- Throughput: expected morsel adds another 1.5-3× over push-based on SIMD-friendly inner loops (filter + arithmetic on int weights)

## Acceptance

### Phase 1

- `IMorselOperator<'T>` interface lands
- `IsMorselCapable` capability flag on Op<'T>
- Morsel-buffer pool implementation
- `dotnet build` clean; existing tests pass

### Phase 2

- FusionEngine recognizes morsel-segments
- Morsel-segment supersedes push-segment when applicable

### Phase 3

- 3 morsel-versions of common ops land
- Cross-verify: morsel-version output matches push-version + materialize-version

### Phase 4

- Benchmark validates throughput improvement over push-baseline
- SIMD-eligibility documented per-op

## Substrate-honest framing

This is research-grade architectural substrate following the well-trodden columnar-execution path. The Zeta contribution is composing morsel-execution with the segmented-push pattern (B-0692) and the DBSP retraction-native algebra: morsel-execution preserves Z-set semantics within a segment; materialize boundaries at segment ends preserve the algebra-level discipline.

The pattern itself isn't novel — DuckDB / Velox / Photon / Polars all do columnar-morsel execution. Zeta's contribution is the DBSP-segment-aware version + the capability-tag-driven segment detection.

## Composes with rules

- `.claude/rules/fsharp-anchor-dotnet-build-sanity-check.md` — F# compiler verifies the morsel interface + Span<T> safety
- `.claude/rules/bandwidth-served-falsifier.md` — morsel-execution serves cache-bandwidth (entries-per-cache-line)
- `.claude/rules/edge-defining-work-not-speculation.md` — composing morsel-execution with DBSP-segment-discipline is edge-defining

## Composes with substrate

- B-0635 / B-0644 / B-0665 / B-0666 (Agora V6 — morsel-pattern preserves operational primitives within segments)
- B-0688 (incremental compiler host — codegen emits morsel-fused IL at hot segments)
- B-0692 (PR #6 push-based — morsel-pattern is the next-tier optimization above push)
- B-0694 (PR #8 standing-query codegen — codegen emits morsel-segment-fused IL)
- PR #4558 (capability tags — IsMorselCapable sibling to IsLinear/IsBilinear/IsSink/IsPushable)
- PR #4566 (FusionEngine — Phase 2 extends with morsel-segment detection)
- DuckDB / Velox / Photon / Polars columnar-execution literature (external prior-art reference)

## Why P2

Substantive architectural substrate; not blocking V1; high value (SIMD + cache-locality unlocks throughput tier above push-based); bounded by Otto-VSCode's 8-PR campaign sizing (~350 lines).

Depends on B-0692 (push-based) landing first: the morsel-pattern composes with the push-pattern rather than replacing it.

## Origin

Otto-VSCode 8-PR algebra-capability-system campaign 2026-05-21. Filed via Otto-CLI per Aaron-approved shadow* "file the 3 rows for PRs 6-8" instruction. Otto-VSCode owns the implementation; this row tracks the scope.