diff --git a/docs/backlog/P3/B-0203-deepseek-v4-csa-hca-zset-algebra-composability-aaron-2026-05-05.md b/docs/backlog/P3/B-0203-deepseek-v4-csa-hca-zset-algebra-composability-aaron-2026-05-05.md new file mode 100644 index 000000000..1bd5bb6fe --- /dev/null +++ b/docs/backlog/P3/B-0203-deepseek-v4-csa-hca-zset-algebra-composability-aaron-2026-05-05.md @@ -0,0 +1,440 @@ +--- +id: B-0203 +priority: P3 +status: open +title: DeepSeek V4 CSA+HCA architecture composability analysis with Zeta's Z-set algebra -- attention-as-Z-set-operators isomorphism (Aaron 2026-05-05) +tier: research+architecture-composition +effort: L +ask: Aaron 2026-05-05 forwarding Claude.ai conversation that named DeepSeek V4 as architecturally-orthogonal to Google TurboQuant + Aaron's *"by deep seek in similar orthognal areas"* + *"and the deep seek stuff is just as substantial"* +created: 2026-05-05 +last_updated: 2026-05-05 +depends_on: [] +composes_with: [B-0152, B-0196, B-0202, B-0026] +tags: [deepseek, deepseek-v4, csa, hca, compressed-sparse-attention, heavily-compressed-attention, mla, mixture-of-experts, four-property-hodl, dbsp, zset-algebra, fp8, manifold-constrained-hyper-connections, mhc, kv-cache, attention-architecture, mit-license, open-weights] +--- + +# B-0203 -- DeepSeek V4 CSA+HCA composability with Zeta's Z-set algebra + +## Source + +Aaron 2026-05-05 forwarded a Claude.ai conversation that surfaced +**DeepSeek V4** (released April 22-24, 2026) as a major architectural +movement parallel-and-orthogonal to Google's TurboQuant. Aaron's +verbatim framing landed two complementary points: + +1. *"by deep seek in similar orthognal areas"* -- DeepSeek V4 sits in + a similar-orthogonal relation to TurboQuant. Both move the long- + context efficiency frontier; they do so at different layers of the + stack and therefore compose multiplicatively rather than competing. +2. *"and the deep seek stuff is just as substantial"* -- the V4 + announcement is not a sidebar to the TurboQuant cluster (B-0202 + companion). It is a substrate-level architectural redesign that + merits its own analysis lane, not a footnote. + +Full verbatim research-doc preservation: +[`docs/research/2026-05-05-claudeai-tinygrad-uop-turboquant-deepseek-v4-symbolica-categorical-aaron-forwarded-preservation.md`](../../research/2026-05-05-claudeai-tinygrad-uop-turboquant-deepseek-v4-symbolica-categorical-aaron-forwarded-preservation.md) +(lands via PR #1610). + +## Why P3 not P2 + +This row is **research+architecture-composition** with bounded scope: + +- Not blocking Zeta delivery -- the four-property hodl algebra (DST- + safe / retraction-aware / scale-free / DBSP-native) ships independent + of whether DeepSeek V4's CSA+HCA attention layer is dissected, + composed-with, or eventually internalized into the algebra. +- Aaron's 2026-05-05 framing positioned this as research-grade-not- + operational alongside the TurboQuant + tinygrad lanes. None of the + three is the answer; they compose. +- Bounded scope per the substance-test discipline: one PoC of one + attention-layer-as-Z-set-operator isomorphism is enough to move + the architectural-composability claim from speculation to evidence. + No replication of V4 from scratch; no port of the full inference + stack; no engagement with DeepSeek before substance-tests complete. 
+- The composability claim *"CSA+HCA could land in the algebra itself, + not just the runtime layer"* is **load-bearing** -- treat as + substance-test required, not asserted-as-fact, per the engagement- + gate substantive-claim-level discipline. + +If the substance-test (a) below succeeds and the isomorphism turns out +clean enough that V4-Flash's attention layer compiles to UOp graphs +(acceptance criterion (d)), this row is a candidate for P2-promotion +via the renegotiation protocol -- not by default. + +## DeepSeek V4 -- the artifact + +- **V4-Pro**: 1.6T total parameters, 49B active per token + ([deepseek-ai/DeepSeek-V4-Pro on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro)) +- **V4-Flash**: 284B total parameters, 13B active per token + ([deepseek-ai/DeepSeek-V4-Flash on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)) +- **Context length**: 1M tokens native on both checkpoints +- **License**: MIT, open weights (full dissection permitted) +- **Release**: April 22-24, 2026 (announced April 22; full open + release April 24 per + [DeepSeek API Docs preview release note](https://api-docs.deepseek.com/news/news260424)) +- **Folds the previously-separate R reasoning line into a single model + with switchable Thinking / Non-Thinking modes** -- earlier DeepSeek + generations shipped a base + R reasoning checkpoint pair; V4 + collapses the two into one weight set with a mode toggle. +- **V4-Pro is currently the largest open-weights model in existence.** + +URLs verified via WebSearch per Otto-364 search-first authority (2026- +05-05). The HuggingFace blog writeup at +[huggingface.co/blog/deepseekv4](https://huggingface.co/blog/deepseekv4) +and Simon Willison's hands-on writeup at +[simonwillison.net/2026/Apr/24/deepseek-v4/](https://simonwillison.net/2026/Apr/24/deepseek-v4/) +are the most useful third-party entry points; the official GitHub +organization is at +[github.com/deepseek-ai](https://github.com/deepseek-ai) with the +inference code shipped inside the HuggingFace repo at +[the V4-Pro inference folder](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/tree/main/inference). + +## Attention architecture (CSA+HCA, NOT "DSA") + +The Claude.ai conversation Aaron forwarded initially used the label +*"DSA"* for the V4 attention architecture. The correct labels per the +DeepSeek paper and HuggingFace announcement are **CSA** and **HCA**; +the architecture is a **hybrid of two distinct attention shapes**, not +one. Future-Otto: when reading older Claude.ai forwards, expect the +DSA shorthand and translate to CSA+HCA at absorb time. + +### Compressed Sparse Attention (CSA) + +Compact KV with top-k sparse selector. CSA compresses KV entries by +4x along the sequence dimension using softmax-gated pooling with a +learned positional bias. A **lightning indexer** (FP4, ReLU-scored +multi-head dot product) picks the top-k compressed blocks per query. +The structural shape: *compress along sequence, sparsely select per +query*. + +### Heavily Compressed Attention (HCA) + +Folds many tokens into single entries. HCA compresses KV entries by +**128x** and **drops the sparse selection** -- every query attends +densely to every compressed block. Because the compressed sequence is +short enough at 128x compression, dense attention is cheap. The +structural shape: *aggressive compression, dense attention over the +short result*. + +### Interleaved across layers + +The V4-Pro 61-layer stack alternates CSA and HCA. Layers 0-1 are HCA; +layers 2-60 alternate CSA / HCA. 
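+
+A minimal F# sketch of the two shapes and the layer schedule described
+above; every name here (`CompressedBlock`, `csaSelect`, `hcaAttend`,
+the `score` stand-in for the lightning indexer) and the CSA-first
+phase of the alternation are illustrative assumptions, not DeepSeek's
+code:
+
+```fsharp
+// Illustrative only: toy types and hand-rolled scoring, not the
+// actual V4 kernels.
+
+/// One KV entry after compression: a pooled embedding plus the count
+/// of raw tokens folded into it (4 for CSA, 128 for HCA).
+type CompressedBlock = { Pooled: float[]; SourceTokens: int }
+
+/// CSA shape: compress along the sequence, then keep only the top-k
+/// compressed blocks per query (`score` stands in for the lightning
+/// indexer's ReLU-scored dot product).
+let csaSelect (score: float[] -> CompressedBlock -> float)
+              (k: int) (query: float[]) (blocks: CompressedBlock[]) =
+    blocks
+    |> Array.sortByDescending (score query)
+    |> Array.truncate k              // sparse: only k blocks survive
+
+/// HCA shape: heavier compression, no selection -- the query attends
+/// densely to every compressed block, cheap because the compressed
+/// sequence is short.
+let hcaAttend (weight: float[] -> CompressedBlock -> float)
+              (query: float[]) (blocks: CompressedBlock[]) =
+    blocks |> Array.map (fun b -> weight query b, b)
+
+/// The interleaving described above: layers 0-1 HCA, then alternating
+/// (CSA-first phase assumed for illustration).
+type AttnKind = CSA | HCA
+let layerKind layer =
+    if layer <= 1 then HCA
+    elif (layer - 2) % 2 = 0 then CSA
+    else HCA
+```
+
+The structural contrast is the point: CSA pays for a scorer and keeps
+only k blocks; HCA pays for heavier pooling and then attends to
+everything that survives.
+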
Different layers carry different +attention patterns; alternating gives the model both fine-grained +sparse selection (CSA) and broad cheap-dense aggregation (HCA) without +forcing all layers into one shape. + +### Performance vs V3 + +- **90% KV cache reduction** at typical context lengths. +- **73% per-token FLOPs reduction**. +- **At 1M context, V4-Pro: 27% V3.2 FLOPs, 10% V3.2 KV.** +- **At 1M context, V4-Flash: 10% V3.2 FLOPs, 7% V3.2 KV.** + +V4-Flash is the more aggressive efficiency point in absolute terms; +V4-Pro is the more capable model with still-substantial efficiency +gains. + +## Architectural-redesign vs compress-on-top -- the substantive divergence + +This is the load-bearing distinction between DeepSeek V4 and Google's +TurboQuant cluster: + +- **TurboQuant compresses an existing KV cache POST-HOC.** It is a + runtime-layer optimization that applies to attention as conventionally + computed; the cache shape is the same as before, the bytes-per-entry + are smaller. Wins compose at the runtime layer. +- **DeepSeek REDESIGNS attention so the cache is structurally smaller + from the start.** CSA's 4x sequence-compression and HCA's 128x + block-compression are architectural -- the cache shape itself + shrinks, before any quantization runs. Wins compose at the + architecture layer. + +**They compose multiplicatively.** Run V4 + TurboQuant on top of +CSA+HCA and you stack the wins: V4 makes the cache structurally smaller +(architecture-layer); TurboQuant compresses what's left (runtime-layer). +The two interventions live at different layers of the stack and the +multiplications don't double-count. + +**This composes with B-0202 (tinygrad UOp IR) at the kernel layer.** +UOp runs the kernels regardless of which approach is used -- whether +V4's architectural CSA+HCA, TurboQuant's runtime compression, or a +classic dense attention. The three lanes are layered (architecture / +runtime / kernel) and Zeta's substrate sits orthogonal to all three +via the operator algebra. + +## Composability claim with Zeta's Z-set algebra (the headline) + +The structural fit between CSA+HCA and Zeta's Z-set algebra is closer +than between TurboQuant's post-hoc compression and the algebra: + +- **Sparse selectors = filter operators.** CSA's top-k lightning + indexer is a *signed Z-set restriction*: select the top-k blocks + with multiplicity +1, leave the rest with multiplicity 0. The +k / + -k awareness of Z-set algebra means that *retracting* a selection + (e.g. unselecting a stale top-k pick under DBSP retract semantics) + is the same primitive as selecting it in the first place, with the + sign flipped. The selector becomes a Z-set-typed function rather + than a destructive in-place update. +- **Compressed entries = aggregation operators.** HCA's 128x + compression is a sum/fold over the compression window. Z-set + algebra's abelian-group structure means the aggregation preserves + retraction: if a token was folded into a compressed entry and later + needs to be retracted, the inverse operation exists (subtract the + contribution) without having to re-run the full forward pass. +- **Interleaved layers = sequence of incremental rewrites.** The + alternating CSA / HCA stack is a sequence of operator applications; + in DBSP terms, each layer is a circuit node and the composition is + a circuit. Layer-by-layer recomputation under input changes is + natural under DBSP retract semantics; the alternation pattern is + just the static circuit structure. 
+- **Switchable Thinking / Non-Thinking = mode-conditioned dataflow + branching.** V4's mode toggle maps onto Zeta's `View@clock` + paraconsistent-superposition shape: two modes coexist as parallel + views over the same underlying weights, with the active mode + selected at clock-tick rather than via destructive weight-swap. + +**This is a stronger compositional fit than TurboQuant's post-hoc +compression. CSA+HCA could land in the algebra itself, not just the +runtime layer.** + +The claim is **load-bearing** and **substance-test required**. It is +not asserted-as-fact in this row; acceptance criterion (a) below is +the gate. If the isomorphism breaks the four-property hodl, the claim +is weakened and the row updates accordingly per future-self-not-bound +discipline. + +## Four-property hodl preservation + +Zeta's substrate algebra has four invariants the architectural +composition above must preserve. Status per invariant: + +- **DST-safe**: V4's attention computation is deterministic given + fixed weights + sparse-selector seed. The lightning-indexer top-k + is a deterministic function of the query and the compressed block + embeddings; ties broken by stable ordering preserve bit-exactness + across replay. Substance-test required to verify under retraction + -- specifically, that retracting and re-applying a top-k selection + produces the same hidden state to bit-exact equality. +- **Lock-free**: attention is data-parallel by construction. No + mutex / locks; the only synchronization point is the across-layer + reduction, which is a pure aggregation step. +- **Scale-free**: works at any context length (1M native on V4-Pro). + Composes with retract semantics -- the algebra's scale-freedom is + inherited; the cost of a retraction at 1M context is the cost of + recomputing the affected sub-circuit, not the full attention matrix. +- **DBSP-native**: open research question -- this is the substance- + test. The composability claim above sketches the isomorphism; + acceptance criterion (a) is the gate that turns it from sketch to + evidence. + +## Acceptance criteria + +(a) **Dissect V4-Flash attention + write the explicit Z-set +isomorphism.** V4-Flash (284B total / 13B active) is more tractable +than V4-Pro (1.6T) for substance-test purposes. Download the weights +from +[deepseek-ai/DeepSeek-V4-Flash on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash); +read the inference code at the V4-Flash inference folder; write out +the explicit Z-set isomorphism for at least the CSA selector + the +HCA aggregation. +**Verifier**: a written F# (or markdown-pseudocode) signature for +each of CSA-selector-as-filter and HCA-aggregation-as-fold, with the +abelian-group / retraction-preservation properties stated and a small +worked example (one query, one block) showing forward + inverse + +identity. **Pass**: signatures + worked example landed in +`docs/research/` with cross-link from this row. **Fail-falsifier**: +if the isomorphism breaks the four-property hodl somewhere along the +trajectory, document where and why; the structural claim is then +weakened and the row updates. + +(b) **Cross-reference with MLA (Multi-head Latent Attention) lineage +from V2 / V3.** MLA achieved approximately 93% KV reduction via +latent compression in earlier DeepSeek generations. CSA+HCA achieves +a comparable 90% reduction via architectural redesign. 
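+
+As an illustration of the signature shape (a) asks for -- not the
+deliverable itself, which has to be written against the actual
+V4-Flash inference code -- a minimal F# sketch with a toy Z-set
+carrier. The `Map`-backed `ZSet` representation, the +1 / -1
+multiplicity convention, and every function name here are assumptions:
+
+```fsharp
+// Toy Z-set carrier: element -> signed multiplicity. +1 = asserted,
+// -1 = retracted; adding multiplicities is the abelian-group op.
+type ZSet<'a when 'a : comparison> = Map<'a, int>
+
+module ZSet =
+    /// Add a signed multiplicity; exact cancellation removes the key.
+    let add (x: 'a) (m: int) (z: ZSet<'a>) : ZSet<'a> =
+        match (z |> Map.tryFind x |> Option.defaultValue 0) + m with
+        | 0 -> Map.remove x z
+        | n -> Map.add x n z
+
+    /// CSA-selector-as-filter: top-k selection emitted as a Z-set
+    /// delta (+1 per selected block). Retracting a stale selection is
+    /// the same delta with the sign flipped, not a destructive update.
+    let selectTopK (score: 'a -> float) (k: int)
+                   (blocks: 'a list) : ZSet<'a> =
+        blocks
+        |> List.sortByDescending score
+        |> List.truncate k
+        |> List.fold (fun acc b -> add b 1 acc) Map.empty
+
+    /// HCA-aggregation-as-fold: sum signed contributions into one
+    /// compressed entry. Abelian-group structure means retracting a
+    /// token is subtracting its contribution, not a forward re-run.
+    let aggregate (contrib: 'a -> float) (z: ZSet<'a>) : float =
+        z |> Map.fold (fun acc tok m -> acc + float m * contrib tok) 0.0
+
+// Usage shape, forward + inverse + identity (score, blockA, blockB
+// assumed; one query, two candidate blocks, k = 1):
+//   let sel  = ZSet.selectTopK score 1 [blockA; blockB]   // forward
+//   let undo = sel |> Map.fold (fun acc b m -> ZSet.add b (-m) acc) sel
+//   // undo = Map.empty: select then retract is the identity delta.
+```
+
+The sketch states the retraction property only at the Z-set level;
+whether it survives contact with the real FP4 indexer scores and the
+128x pooling arithmetic is exactly what (a) has to establish.
+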
+For (b), document whether MLA's latent compression and CSA+HCA's
+architectural compression are *independent-additive* (the latent
+compression composes on top of the architectural compression for
+further wins) or *substitutable* (one supplants the other and running
+both is double-counting). **Verifier**: a written analysis in the same
+`docs/research/` doc as (a), with reference to the DeepSeek V2 / V3
+papers and the V4 paper's own discussion of the lineage. **Pass**:
+analysis lands with an explicit independent-additive-or-substitutable
+verdict + reasoning. **Fail-falsifier**: if the analysis cannot
+resolve the question without running both stacks empirically, document
+that explicitly and gate the verdict on the empirical run rather than
+guess.
+
+(c) **Engagement gate per the engagement-gate substantive-claim-level
+discipline**
+([`memory/feedback_engagement_gate_substantive_claim_level_discipline_aaron_otto_2026_05_05.md`](../../../memory/feedback_engagement_gate_substantive_claim_level_discipline_aaron_otto_2026_05_05.md)).
+The MIT license means dissection is fully permitted; engagement is the
+constrained surface. Do NOT engage DeepSeek (Discord, GitHub issues,
+direct researcher contact) with substantive proposals before (a) +
+(b) substance-tests complete. **Verifier**: zero outbound engagement
+artifacts (issue comments, Discord posts, direct messages, paper
+citations soliciting reply) attributable to Zeta substrate before (a)
+and (b) land. **Pass**: substance-test artifacts exist, then
+engagement is a separate decision per the gate framework.
+**Fail-falsifier**: if engagement happens before substance-tests, that
+IS the failure mode -- log a meta-win for catching it, retract the
+engagement, finish the substance-tests.
+
+(d) **Composability check with B-0202 (tinygrad UOp IR).** Can V4's
+attention layer compile to UOp graphs? The CSA selector and HCA
+aggregator should both express cleanly as UOp graphs (UOp's
+universal-IR claim is the load-bearing piece -- B-0202 covers it). If
+yes, the bridge between architecture-redesign (DeepSeek) +
+universal-IR (tinygrad) lands cleanly through Zeta's substrate, and
+the three lanes (architecture / runtime / kernel) all compose at the
+operator-algebra boundary. **Verifier**: a small UOp graph that
+expresses the CSA top-k selection over a toy compressed-block array
+(3 blocks, 2 queries), verified to produce numerically-equivalent
+output to a direct reference implementation. **Pass**: UOp graph +
+reference impl + bit-exact (or numerical-tolerance) match for the toy
+case. **Fail-falsifier**: if UOp's IR cannot express the CSA selector
+cleanly, the *"three-layer composition through Zeta"* claim weakens
+and the row updates -- in particular, B-0202's universal-IR claim
+takes a hit and needs cross-row revision.
+
+## DeepSeek's broader lineage (parallel substrate worth tracking)
+
+V4 is not isolated; it sits in a connected series of architectural
+moves over the last two years. Tracking the lineage matters because
+the substance-tests above may need to compose with prior pieces:
+
+### MLA (Multi-head Latent Attention, V2 / V3)
+
+Approximately 93% KV reduction via latent compression. Predates
+CSA+HCA; the lineage is direct -- MLA's latent compression is the
+architectural pressure that motivates V4's CSA+HCA evolution.
+Acceptance criterion (b) is specifically the substance-test for the
+MLA / CSA+HCA composability question.
+
+### DeepSeekMoE
+
+Fine-grained expert segmentation + shared expert isolation.
Composes +naturally with the four-property hodl analysis: MoE expert dispatch +is itself a Z-set operation. Experts are sparse rows in a +Z-set-typed expert matrix; gating is a filter (top-k selection over +the expert axis); retraction = unselect an expert + reselect under +DST replay. **The same compositional shape that makes CSA+HCA fit +the algebra makes DeepSeekMoE fit the algebra -- both are top-k +sparse selectors expressed as Z-set filters.** This is one of the +strongest cross-rows for B-0196 (BigInt + four-property hodl) -- the +binding-acceptance-test for hodl preservation under top-k sparse +ops surfaces in both DeepSeek branches simultaneously. + +### DeepGEMM + FP8 mixed-precision training framework + +V3 was the first open-source frontier-scale model trained natively +in FP8. The DeepGEMM kernel library is one of the strongest signals +that Zeta's BigInt / BigNumber substrate (B-0196) needs to compose +with arbitrary-precision *down* as well as *up* -- FP8 is below the +default precision floor for most arithmetic substrates, and four- +property hodl preservation under FP8 is a non-trivial test of the +algebra's scale-free invariant. + +### mHC (Manifold-Constrained Hyper-Connections, January 2026) + +Signal propagation stability at scale -- the residual-stream piece. +mHC keeps the residual stream geometry well-conditioned across deep +stacks; this is adjacent-to-but-distinct-from CSA+HCA, which is the +attention-layer piece. Tracking it matters because the four-property +hodl analysis at deep-stack composition (61 layers in V4-Pro) needs +to know whether residual-stream stability composes cleanly with the +attention-layer compression -- if mHC is required for CSA+HCA to +hold at scale, the substance-test (a) needs to model residual-stream +behavior too. + +## Human anchors / engagement plan + +DeepSeek's open-research posture (MIT license on weights, technical +reports published alongside releases, papers archived on HuggingFace +and arXiv) means **dissection IS the engagement-readiness path**. +Substantive engagement happens after substrate exists to point at, +not before, per the engagement-gate substantive-claim-level discipline. + +Specific anchors: + +- The V4 paper at + [DeepSeek_V4.pdf on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) + is the canonical technical reference; cite it directly rather than + via summaries. +- The + [HuggingFace blog writeup](https://huggingface.co/blog/deepseekv4) + and Simon Willison's + [hands-on writeup](https://simonwillison.net/2026/Apr/24/deepseek-v4/) + are useful third-party reads but not load-bearing per Otto-364. +- Don't engage on Discord or open issues until something concrete to + contribute. *"Have you considered reading our writeup of how this + composes with our operator algebra"* is a credible engagement; *"we + think your work is interesting"* is not. + +## Out of scope + +This row is bounded; the following are **out of scope** for B-0203 +specifically: + +- **Replicating V4 from scratch.** That is training-not-substrate- + engineering; not a Zeta capability and not the point of this row. + Zeta is not a foundation-model training shop. +- **Supporting multiple DeepSeek versions simultaneously.** V4-Flash + is the substance-test target; V2 / V3 / R / V4-Pro are reference + context, not substance-test targets. If the V4-Flash substance-test + surfaces something requiring V3 cross-check, that's a follow-up row, + not this one. 
+- **Engaging DeepSeek before substance-tests complete.** Per the + engagement-gate framework, engagement happens after substrate + exists. (c) above is the explicit acceptance criterion for this. +- **Porting DeepSeek's inference code to F#.** The substance-test is + about whether the *algebraic shape* of CSA+HCA fits Zeta's algebra. + A full F# port is downstream of the substance-test concluding that + the fit is clean enough to be worth the porting cost. +- **Settling whether DeepSeek V4 or Google TurboQuant is "the + answer".** Per Aaron's no-kill-paths discipline, both are alive; + they compose multiplicatively at different layers. Picking one + kills the other's contribution and is the failure mode this row + exists to prevent. + +## Composes with + +- **B-0152** + ([P2 row](../P2/B-0152-topological-quantum-emulation-via-bayesian-inference-zeta-seed-executor-aaron-2026-05-01.md)) + -- *Topological-quantum emulation via Bayesian inference, Zeta + seed executor*. The substrate V4's attention layer could run on + with four-property hodl preserved. The emulation-inside-the-algebra + discipline is the load-bearing infrastructure for substance-test + (a). +- **B-0196** + ([P2 row](../P2/B-0196-bigint-and-bignumber-integration-aaron-2026-05-05.md)) + -- *BigInt + BigNumber integration*. The numeric substrate the + four-property algebra depends on; the binding-acceptance-test + gating composability surfaces in both this row and B-0196 (top-k + sparse selector preserves abelian-group structure). +- **B-0202** -- *tinygrad UOp IR kernel-layer companion*. Companion + architecture; UOp could be the kernel layer V4's CSA+HCA compiles + to. Acceptance criterion (d) is the substance-test for the + composition. (B-0202 lands as a sibling backlog row this same week + -- pre-merge file path: + `docs/backlog/P3/B-0202-tinygrad-uop-ir-kernel-layer-aaron-2026-05-05.md`; + cross-link will resolve once both merge.) +- **B-0026** + ([P2 row](../P2/B-0026-embodiment-grounding-analysis-isaac-sim-and-other-robotics-sim-platforms-otto-340-counter.md)) + -- *Embodiment-grounding analysis*. V4's switchable Thinking / + Non-Thinking modes parallel embodiment's action-space split (slow + deliberation vs. fast reaction). Adjacent axis; not on the critical + path for B-0203's substance-tests but worth tracking for follow-on + composition. +- The research-doc preservation at + [`docs/research/2026-05-05-claudeai-tinygrad-uop-turboquant-deepseek-v4-symbolica-categorical-aaron-forwarded-preservation.md`](../../research/2026-05-05-claudeai-tinygrad-uop-turboquant-deepseek-v4-symbolica-categorical-aaron-forwarded-preservation.md) + -- the upstream artifact this row absorbs from. Lands via PR #1610. + +## The carved sentence + +*DeepSeek V4 redesigns attention so the KV cache is structurally +smaller from the start; TurboQuant compresses what remains; tinygrad +UOp runs the kernels. Three layers, multiplicative composition, no +kill. The Zeta-relevant test is whether CSA+HCA's sparse-selector + +compressed-aggregation pair is a Z-set operator pair preserving +four-property hodl -- if it is, the architecture lands in the algebra +itself, not just the runtime.*