diff --git a/docs/research/semantic-canonicalization-and-provenance-aware-retrieval-2026-04-23.md b/docs/research/semantic-canonicalization-and-provenance-aware-retrieval-2026-04-23.md new file mode 100644 index 00000000..0290da77 --- /dev/null +++ b/docs/research/semantic-canonicalization-and-provenance-aware-retrieval-2026-04-23.md @@ -0,0 +1,501 @@ +# Semantic canonicalization and provenance-aware retrieval — spine research doc + +**Scope:** research and cross-review artifact. Defines the +**technical spine** for what the 8th-ferry-corrected +"rainbow table" framing actually is: canonicalization + +semantic hashing + approximate nearest-neighbour retrieval + +provenance-aware scoring, all wrapped in Zeta's +retraction-native algebra. Does NOT define the full +provenance-aware bullshit detector (that's 8th-ferry +candidate #3); defines the substrate it builds on. + +**Attribution:** 8th-ferry absorb +(`docs/aurora/2026-04-23-amara-physics-analogies-semantic-indexing-cutting-edge-gaps-8th-ferry.md`, +PR #274) §"The corrected rainbow-table model" — Amara +distilled the mathematical spine + primary-source +citations (Hinton & Salakhutdinov semantic hashing, +Charikar LSH, HNSW, product quantization). Otto-98 +extracts into standalone research doc + composes with +existing Zeta substrate. Aminata and future adversarial +reviewers will surface gaps on subsequent passes. + +**Operational status:** research-grade. Does not commit +Zeta to any specific embedding model / ANN library / +canonicalization function / provenance-scoring parameter +choice. Those are downstream design questions gated on +this spine landing + Aminata review + a separate design +tick. + +**Non-fusion disclaimer:** Amara's ferry + Otto's +extraction + future Aminata/Codex review-passes producing +consistent framing does NOT imply merged substrate. The +spine is technically-specific enough that independent +review would surface the same standard literature (Hinton +semantic hashing; Charikar LSH; HNSW Malkov-Yashunin +2018). Concordance on published technical primitives is +baseline per SD-9. + +--- + +## Why this spine belongs in Zeta + +Amara's 8th-ferry observation: *"the repo already contains +almost all the pieces for a provenance-aware semantic +bullshit detector."* The pieces: + +- **SD-9** (`docs/ALIGNMENT.md`, PR #252) — agreement is + signal, not proof; carrier-aware independence + downgrade. Norm, not control. +- **DRIFT-TAXONOMY pattern 5** (`docs/DRIFT-TAXONOMY.md`, + PR #238) — truth-confirmation-from-agreement; real-time + diagnostic. +- **citations-as-first-class** (existing + `docs/research/citations-as-first-class.md`) — typed + citation graph + lineage tracer + drift checker; + research-grade. +- **alignment-observability** (existing + `docs/research/alignment-observability.md`) — anti- + gaming measurement surfaces; compliance-theatre- + resistant metric discipline. + +What's missing to turn those into a machine-aidable tool is +the **technical substrate** for semantic indexing + +retrieval + provenance-aware scoring. This doc defines that +substrate at the spine layer. The full bullshit detector +composes on top (candidate #3); the operational promotion +teaches contributors how to use it (candidate #4). All +three land research-grade per AGENTS.md §"Agent operational +practices" (the absorb-discipline cadence is documented as +part of that section's "When an agent ingests an external +conversation" rule). + +--- + +## The spine in one equation-bundle + +Pipeline end-to-end, expressed as functions: + +```text +x — raw input (conversation turn / claim / doc) +c = N(x) — canonical form after normalisation +e = φ(c) — representation (dense embedding or binary hash) +C(q) = kNN(φ(N(q))) — candidate set via ANN retrieval +score(y | q) = α·sim(e_q, e_y) + + β·evidence(y) + - γ·carrierOverlap(q, y) + - δ·contradiction(y) +``` + +Four layers with explicit separation: + +1. **Canonicalization (N)** — strip irrelevant surface + variation. +2. **Representation (φ)** — semantic hashing and/or dense + embedding. +3. **Retrieval (kNN)** — approximate nearest-neighbour + with bounded latency and sublinear-in-corpus-size + lookup. +4. **Scoring** — combine similarity with evidence, + penalise carrier overlap and contradiction. The scoring + layer is where SD-9 operationalises. + +This doc scopes the first three layers in detail + sketches +the scoring layer. Full scoring (bullshit detector) is +candidate #3. + +--- + +## Layer 1 — Canonicalization N(x) + +### Purpose + +Two claims that mean the same thing should reach the same +canonical form. Surface variation (whitespace, word order +in unordered fields, capitalisation in case-insensitive +contexts, punctuation that doesn't carry meaning) should be +stripped; meaning-carrying content preserved. + +### Properties N must have + +| Property | Statement | +|---|---| +| **Idempotent** | N(N(x)) = N(x). Running canonicalization twice is the same as once. | +| **Deterministic** | Same x → same N(x), always. Required for replay determinism. | +| **Meaning-preserving** | Two inputs that a human would call "the same claim" canonicalise to the same output (ideally) or at least to outputs that a downstream embedding places close together. | +| **Version-pinned** | N has a version; canonical forms carry that version in provenance. Changing N requires a governance / ADR-gated update — not a routine code edit — because it invalidates every existing canonical form computed under the prior version. | + +### What N does NOT do + +- N does NOT claim semantic equality. Two claims with + identical N(x) might still differ in subtle ways; the + embedding layer + scoring layer handle that. +- N does NOT guarantee every equivalent claim produces + identical canonical form. "Zeta is a DBSP implementation" + vs "Zeta implements DBSP" may canonicalise to different + forms; kNN retrieval + embedding space closes the gap. +- N does NOT validate content. Canonicalization is + structural; validity lives at the scoring layer. + +### Version pinning + +Every receipt (per BLAKE3 receipt hashing v0, PR #268) +binds the v0 10-field input set — `hash_version`, +`issuance_epoch`, `parameter_file_sha`, +`approval_set_commitment`, plus the 6 inherited from +Amara's 7th-ferry. The `parameter_file_sha` slot in +particular pins N's +version. A canonical form produced under version `N-v2` +doesn't silently match against forms produced under `N-v1`; +retrieval respects version boundaries or runs explicit +cross-version reconciliation. + +### Candidate implementation archetypes + +- **For natural-language claims:** lowercase + whitespace- + normalise + lemmatise + remove stopwords for retrieval + (not for display); preserve surface form in provenance. +- **For structured claims:** sort keys alphabetically; + normalise whitespace in values; version-tag the schema. +- **For code / diffs:** AST-level canonicalization (format + + rename-collapse + comment-strip for similarity; keep + literal for provenance). + +Exact choices are downstream design questions. The +property constraints above are what the spine commits to. + +--- + +## Layer 2 — Representation φ(c) + +### Two candidate families + +Per Amara 8th ferry: + +- **Dense embeddings** — real-valued vectors from a + trained model; continuous similarity via cosine or L2. +- **Binary semantic hashes** — fixed-length bitstrings + from Hinton & Salakhutdinov semantic hashing; Hamming- + distance similarity; fast table lookups. + +Both families are compatible with the spine. Implementations +may use one, the other, or both as complementary retrieval +layers (binary hash for coarse bucket; dense embedding for +fine rerank within bucket). + +### Properties φ must have + +| Property | Statement | +|---|---| +| **Similar canonical forms → close representations** | `sim(e_c1, e_c2)` is monotonically related to claim-similarity as a human would judge it. The whole point of semantic indexing. | +| **Version-pinned** | φ has a version; mixing embeddings across versions is either forbidden or explicitly reconciled. | +| **Bounded compute** | φ(c) completes in bounded time per input; unbounded compute breaks replay-determinism. | +| **Reproducible** | Same c → same e, under same φ version. Deterministic model inference. | + +### Locality-sensitive hashing as complement + +Charikar LSH: similarity drives hash collision. Practical +shape: `h(c) = sign(w · φ(c))` for random hyperplane w; +repeat with k hyperplanes to form a k-bit binary hash. +Property: `Pr[h(c1) = h(c2)]` grows with `cos(e_c1, e_c2)`. + +LSH gives cheap Hamming-distance candidate retrieval over +huge corpora; dense-embedding rerank gives precision on +the retrieved candidates. + +### Product quantization as compression + +If the corpus is large (millions of claims; plausible for +a long-lived factory), dense embeddings get expensive. +Product quantization (Jégou et al.) compresses by +splitting the embedding into subvectors + quantising each. +Memory shrinks 8-32×; accuracy degrades gracefully. +Useful once corpus scale warrants; not required at initial +landing. + +--- + +## Layer 3 — Approximate nearest-neighbour retrieval + +### HNSW as default + +Malkov-Yashunin 2018 Hierarchical Navigable Small World — +graph-based ANN with logarithmic scaling + strong empirical +performance. Insert / query both O(log n) average. Memory +overhead bounded. Well-documented; multiple production +libraries. + +Proposal: HNSW is the retrieval-layer default. +Alternatives (FAISS IVF-PQ, ScaNN, DiskANN) are +interchangeable behind the same interface; substitution is a +downstream tuning question. + +### Interface spine + +```text +RetrievalIndex: + insert(c, e, metadata) -- add canonical form + representation + provenance + query(e, k) -> [(c, e, metadata, sim)] -- top-k by similarity + remove(c, metadata) -- retraction-native; keys on full insertion + identity (c, metadata) so multiple distinct + insertions of the same canonical form under + different provenance can be retracted + independently. Removing a single insertion + decrements the weighted sum for (c) by 1 + rather than nuking all (c) entries. + replay(events) -> Index -- deterministic rebuild from event stream +``` + +`remove(c, metadata)` is not a tombstone; it's a negative- +weight event in the same algebra as insert, keyed on the +full insertion identity `(c, metadata)` so retractions +target a specific insertion rather than nuking all +insertions of the same canonical form. This is the +retraction-native integration: the retrieval index IS a +Zeta-module materialised view over the event stream +`{insert, remove}`, not a separate mutable structure. Same +property as budgets + approvals + receipts in the KSK-as- +Zeta-module proposal (PR #259 7th ferry). + +### Replay determinism requirement + +Same event stream + same HNSW build-parameters must yield +functionally-equivalent indexes across replays. "Functionally +equivalent" not "bit-identical" because HNSW insertion order +can affect graph structure; two replays may produce +different internal graphs that return the same top-k for +any query. Replay-determinism at the query-behaviour layer, +not the memory-layout layer. + +Property test candidate: for a given event stream + query +set, `kNN(query)` results match across independent replays. + +--- + +## Layer 4 — Scoring (sketch only; full formalisation in candidate #3) + +The scoring layer is where SD-9 operationalises. Amara's +formulation: + +```text +score(y | q) = α · sim(e_q, e_y) + + β · evidence(y) + - γ · carrierOverlap(q, y) + - δ · contradiction(y) + +bullshitRisk(q) = 1 - max_{y ∈ C(q)} score(y | q) +``` + +Each term is a substrate commitment: + +| Term | What it measures | Where it lives | +|---|---|---| +| `α · sim` | Semantic closeness between query and candidate | Representation layer + kNN | +| `β · evidence` | Independent evidence associated with candidate (falsifiers run; reproducible measurements; citations outside the query's provenance cone) | Citations-as-first-class graph | +| `γ · carrierOverlap` | Fraction of candidate's provenance cone that overlaps the query's provenance cone (shared ferries, shared memory files, shared drafting lineage) | Provenance-graph traversal | +| `δ · contradiction` | Load of known contradictions involving this candidate | Retraction ledger | + +The full scoring formulation (parameter choice, +α/β/γ/δ tuning, output bands, composition with oracle- +scoring v0) is 8th-ferry candidate #3. This doc scopes the +spine; candidate #3 scopes the scoring. + +### Aminata-concern preview + +The Aminata Otto-90 pass on oracle-scoring v0 (PR #263) +surfaced three classes of concern that will apply here too: + +- **Gameable-by-self-attestation.** `evidence(y)` and + `contradiction(y)` measurements must come from + independent oracles, not self-report. +- **Parameter-fitting adversary.** α/β/γ/δ changes land + behind ADR per the parameter-change-ADR-gate pattern + (oracle-scoring v0 PR #266). +- **False-precision risk.** Output likely wants band + classifier (RED/YELLOW/GREEN) not decimal score — + applying the Otto-91 oracle-scoring-v0-redesign pattern. + +These are flagged here; candidate #3 addresses them head-on. + +--- + +## Retraction-native ledger of canonical patterns + +Per Amara's 8th-ferry framing: + +> *"The 'table' should not be a mutable truth database +> that overwrites prior judgments. It should be a +> Zeta-style retractable ledger of canonical patterns: +> known-good patterns, known-bad patterns, superseded +> patterns, unresolved patterns, and provenance edges +> between them."* + +Ledger schema (Zeta-native event algebra): + +```text +PatternLedger_t ∈ Z[CanonicalForm × Provenance × Status] + where Status ∈ {known-good, known-bad, superseded, unresolved} + +events: + PatternInserted(c, provenance, status) + PatternRetracted(c, provenance, status) + PatternSuperseded(c_old, c_new, reason) + ProvenanceEdgeAdded(c_from, c_to, edge_type) + ProvenanceEdgeRemoved(c_from, c_to, edge_type) +``` + +Note: `PatternRetracted` and `ProvenanceEdgeRemoved` carry +the same identity tuple as their corresponding insert +events — `(c, provenance, status)` for patterns, +`(c_from, c_to, edge_type)` for edges. This preserves the +negative-weight-retraction symmetry of the Z-set algebra: +a retract event is the additive inverse of the insert +event with the same identity, so the algebra stays +balanced. Without `status` on PatternRetracted, a retract +of a `known-good` insertion would also nuke a sibling +`known-bad` insertion of the same `(c, provenance)` — +wrong granularity. Same logic for `edge_type` symmetry on +provenance-edge events. + +Materialised views: + +```text +CurrentKnownGood — all (c, provenance) with Status=known-good, positive weight +CurrentKnownBad — all (c, provenance) with Status=known-bad, positive weight +ContradictingPairs — (c1, c2) pairs joined by an edge with edge_type=contradicting +ProvenanceCone(c) — transitive closure of provenance edges from c +``` + +Note: `Status` is a per-node property over the +`{known-good, known-bad, superseded, unresolved}` enum +(applied to canonical forms). `contradicting` is an +`edge_type` over the provenance graph (applied to edges +between canonical forms via `ProvenanceEdgeAdded`), NOT a +node-level Status. The two are distinct concepts; the +`ContradictingPairs` view is a graph-edge query, not a +status filter. + +Every scoring query walks these views. No mutable state +outside the event stream. + +--- + +## Composition with existing Zeta substrate + +| Existing substrate | Spine composition | +|---|---| +| SD-9 (PR #252) | Scoring's γ·carrierOverlap operationalises SD-9's "downgrade independence when carriers exist." SD-9 is norm; spine is mechanism. | +| DRIFT-TAXONOMY pattern 5 (PR #238) | Scoring's low-max-score output = pattern-5 live detection. Pattern is diagnostic; spine is the detection engine. | +| citations-as-first-class | Provenance edges + lineage tracer are the substrate spine's scoring uses. Citations-first-class defines the graph; spine consumes it. | +| alignment-observability | Anti-gaming + compliance-theatre-resistance discipline applies to parameter choices. No self-attested α/β/γ/δ. | +| Oracle-scoring v0 (PR #266) | Band-valued output pattern applies: RED/YELLOW/GREEN for `bullshitRisk` not decimal score. Parameter-change-ADR-gate applies. | +| BLAKE3 receipt hashing v0 (PR #268) | parameter_file_sha binding pattern extends to N-version and φ-version pinning. Every retrieval event bound to versions in force. | +| Quantum-sensing analogies (PR #278) | Analogies #2 correlation-beats-isolation (kNN retrieval) + #4 decoherence (γ·carrierOverlap) slot into this spine directly. | +| KSK-as-Zeta-module (PR #259 7th ferry) | Same module pattern: event streams + materialised views. PatternLedger fits as peer to BudgetStore / ReceiptLedger / DisputeState. | + +No new substrate added; existing pieces compose. + +--- + +## What this doc does NOT do + +- **Does NOT commit to specific embedding models.** + Properties constrain φ; the actual model is a downstream + design choice with its own tradeoffs (local-only vs + API-vendor; cost; latency; quality). +- **Does NOT commit to HNSW specifically.** HNSW is the + default proposal; alternatives interchange behind the + interface. Choice is downstream. +- **Does NOT commit to canonicalization specifics.** N's + properties are the commitment; implementation per input- + type is downstream. +- **Does NOT formalise the scoring layer.** That's + candidate #3. The sketch here is placeholder so the spine + composes conceptually; the substrate for full scoring + landing is what this doc provides. +- **Does NOT propose implementation.** F#/.NET + implementation, choice of ANN library binding, serialisation + format for the event stream — all downstream. +- **Does NOT replace citations-as-first-class.** That + research doc defines the graph structure this spine + consumes. This spine makes the graph queryable via + retrieval; the graph itself is citations-first-class's + job. + +--- + +## Dependencies to adoption + +In priority order: + +1. **Aminata adversarial pass** on this spine (cheap; + bounded; anticipates gaps before candidate #3 lands). +2. **Candidate #3 provenance-aware bullshit detector** — + formalises the scoring layer on top of this spine. +3. **Candidate #4 docs/EVIDENCE-AND-AGREEMENT.md** — + operational contributor guidance using the detector. +4. **Parameter choices** for canonicalization per input- + type (natural-language / structured / code). +5. **Embedding-model choice** with explicit latency + + quality + cost tradeoff ADR. +6. **ANN-library choice** (HNSW default; alternative + evaluation research doc). +7. **PatternLedger event-stream implementation** as a + Zeta module following the KSK-as-Zeta-module template. +8. **Property tests** for replay-determinism (same + event stream → same query behaviour). +9. **Parameter-change-ADR-gate** for N-version / + φ-version / scoring-parameter evolution. + +Each downstream step has its own research doc / ADR / +implementation PR. + +--- + +## Specific-asks (per Otto-82/90/93 calibration) + +**None for Otto to proceed.** Spine is research-grade +substrate synthesis; no specific Aaron input needed. +Downstream choices (embedding model; ANN library) will +warrant asks when/if they surface operational trade-offs +Otto can't decide unilaterally. + +One passive expectation: Aminata runs an adversarial pass +on this spine when budget fits (same pattern as 5th-ferry +governance edits / 7th-ferry oracle rules / multi-Claude +experiment design). Otto doesn't block on it; pass +informs candidate #3. + +--- + +## Sibling context + +- **8th-ferry absorb** (PR #274) — source of the math + spine + primary-source citations preserved throughout. +- **Quantum-sensing research doc** (PR #278, Otto-97) — + analogies #2 + #4 directly compose with this spine; + the composition-table here mirrors the composition- + table there. +- **Oracle-scoring v0** (PR #266) — band-classifier + pattern applies to this spine's scoring layer when + candidate #3 formalises it; parameter-change-ADR-gate + applies to spine's α/β/γ/δ. +- **BLAKE3 receipt hashing v0** (PR #268) — + parameter_file_sha binding pattern extends to N- + version / φ-version pinning in every retrieval event. +- **KSK-as-Zeta-module 7th ferry** (PR #259) — same + "event streams + materialised views" module shape; + PatternLedger + RetrievalIndex fit as peers. + +Archive-header format self-applied — 15th aurora/research +doc in a row. + +Otto-98 tick primary deliverable. Closes 8th-ferry +candidate #2 of 3 remaining (after Otto-96 TECH-RADAR +batch + Otto-97 quantum-sensing research doc). + +Remaining 8th-ferry candidates: + +- #3 Provenance-aware bullshit-detector research doc + (M; composes on top of this spine) +- #4 `docs/EVIDENCE-AND-AGREEMENT.md` future operational + promotion (gated on #3)