feat(B-0914.6): proximity-agent substrate-engineering substrate de-duplication (canonical-form + Jaccard clustering); 19 tests pass#5772

Merged
AceHack merged 3 commits into main from otto-cli/b-0914-6-proximity-agent-substrate-engineering-substrate-deduplication-canonical-form-normalization-2026-05-28 on May 28, 2026

Conversation

AceHack (Member) commented May 28, 2026

Summary

Generalizes the Google co-scientist proximity-agent pattern to the TS-side substrate, with two de-dup mechanisms: canonical-form normalization (deterministic) and Jaccard-similarity clustering (lightweight; no embedding model).

19 tests pass / 0 fail.

Composes with

🤖 Generated with Claude Code

…plication (canonical-form + Jaccard-similarity clustering); 19 tests pass

Per Google co-scientist proximity agent (Nature 2026): maps ideas into
high-dimensional space + groups similar variants to prevent wasting
compute on substantively-identical proposals. Generalized to TS-side
substrate with two de-dup mechanisms.

What this adds:
- ProximityFeedback discriminated union + ProximityResult<T> Result-shape
- Cluster<T> with representative + members + canonicalForm
- clusterByCanonical<T>(corpus, canonicalFn) — deterministic dedup
- jaccardSimilarity(tokensA, tokensB) — Jaccard coefficient
- defaultTokenize(text) — lowercase + stop-word filter
- clusterBySimilarity<T>(context) — greedy clustering by Jaccard threshold
- uniqueRepresentatives<T>(result) — drop duplicates convenience
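
A minimal sketch of how the deterministic path in the list above could fit together. The real definitions live in `tools/workflow-engine/proximity.ts` and are not reproduced in this PR text, so the field names (`ok`, `value`, `feedback`, `kind`) and the return type of `uniqueRepresentatives` are assumptions made for illustration only.

```typescript
// Hypothetical Result shape; only the variant names come from the PR text.
type ProximityFeedback =
  | { kind: "EmptyCorpus" }
  | { kind: "InvalidThreshold"; threshold: number };

type ProximityResult<T> =
  | { ok: true; value: T }
  | { ok: false; feedback: ProximityFeedback };

interface Cluster<T> {
  representative: T; // first-seen member of the cluster
  members: T[];
  canonicalForm: string;
}

// Group items whose canonical form is identical. Deterministic: same input
// order and same canonicalFn always give the same clusters.
function clusterByCanonical<T>(
  corpus: readonly T[],
  canonicalFn: (item: T) => string,
): ProximityResult<Cluster<T>[]> {
  if (corpus.length === 0) {
    return { ok: false, feedback: { kind: "EmptyCorpus" } };
  }
  // Map keeps insertion order, so clusters come out in first-seen order.
  const byForm = new Map<string, Cluster<T>>();
  for (const item of corpus) {
    const canonicalForm = canonicalFn(item);
    const existing = byForm.get(canonicalForm);
    if (existing) {
      existing.members.push(item);
    } else {
      byForm.set(canonicalForm, { representative: item, members: [item], canonicalForm });
    }
  }
  return { ok: true, value: Array.from(byForm.values()) };
}

// Convenience: keep one item per cluster (assumed to return [] on failure).
function uniqueRepresentatives<T>(result: ProximityResult<Cluster<T>[]>): T[] {
  return result.ok ? result.value.map((c) => c.representative) : [];
}

const demo = clusterByCanonical(["Foo", "foo ", "bar"], (s) => s.trim().toLowerCase());
const demoReps = uniqueRepresentatives(demo); // ["Foo", "bar"]
```

Because the first-seen item becomes the representative, sorting the corpus by score before clustering makes the top-ranked variant the one that survives de-duplication.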

Tests (19; all pass):
- clusterByCanonical groups same-canonical items
- first-seen is representative (pre-sort by score for top-ranked rep)
- empty corpus → EmptyCorpus
- all unique → N clusters of size 1
- jaccardSimilarity edge cases (identical / disjoint / partial / empty)
- defaultTokenize lowercase + stop-word filter
- clusterBySimilarity threshold catches near-duplicates
- High threshold keeps all distinct; low threshold clusters aggressively
- Invalid threshold → InvalidThreshold
- uniqueRepresentatives extracts rep-only list
- Compose with evolution substrate: pre-sort by score → rep is best
- ProximityFeedback exhaustive switch
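
The similarity path exercised by these tests can be sketched as follows. This is an illustration under stated assumptions, not the code from `proximity.ts`: the stop-word list, the context field names (`corpus`, `textFn`, `threshold`), the valid threshold range of 0 to 1 inclusive, and the choice to return 0 for two empty token sets are all hypothetical.

```typescript
// Assumed stop-word list; the real list is not shown in this PR text.
const STOP_WORDS = new Set(["a", "an", "and", "the", "of", "to", "in", "for"]);

// Lowercase, split on non-alphanumerics, drop empty strings and stop words.
function defaultTokenize(text: string): Set<string> {
  const words = text.toLowerCase().split(/[^a-z0-9]+/);
  return new Set(words.filter((w) => w.length > 0 && !STOP_WORDS.has(w)));
}

// Jaccard coefficient |A ∩ B| / |A ∪ B|.
// Assumption: two empty sets score 0 rather than 1.
function jaccardSimilarity(a: ReadonlySet<string>, b: ReadonlySet<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((t) => {
    if (b.has(t)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

// Hypothetical context and result shapes.
interface SimilarityContext<T> {
  corpus: readonly T[];
  textFn: (item: T) => string;
  threshold: number;
}

type SimilarityResult<T> =
  | { ok: true; clusters: { representative: T; members: T[] }[] }
  | { ok: false; feedback: "EmptyCorpus" | "InvalidThreshold" };

// Greedy clustering: each item joins the first cluster whose representative
// is at least `threshold` similar; otherwise it starts a new cluster.
function clusterBySimilarity<T>(ctx: SimilarityContext<T>): SimilarityResult<T> {
  // Written as a negated range check so NaN is rejected too.
  if (!(ctx.threshold >= 0 && ctx.threshold <= 1)) {
    return { ok: false, feedback: "InvalidThreshold" };
  }
  if (ctx.corpus.length === 0) return { ok: false, feedback: "EmptyCorpus" };

  const working: { representative: T; members: T[]; tokens: Set<string> }[] = [];
  for (const item of ctx.corpus) {
    const tokens = defaultTokenize(ctx.textFn(item));
    const home = working.find((c) => jaccardSimilarity(c.tokens, tokens) >= ctx.threshold);
    if (home) home.members.push(item);
    else working.push({ representative: item, members: [item], tokens });
  }
  return {
    ok: true,
    clusters: working.map(({ representative, members }) => ({ representative, members })),
  };
}
```

Greedy assignment is order-dependent: an item is compared only against existing representatives, so a different corpus order can yield different clusters. In this sketch a threshold of 0 places every item in the first cluster, which matches the "low threshold clusters aggressively" behavior the tests describe.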

Composes with substrate:
- B-0914.6 backlog row
- B-0914.5 PR #5767 evolution (de-dup Survivor list before mash)
- B-0914.2 PR #5769 closed-loop (de-dup pre-CI-dispatch saves cycles)
- verify-existing-substrate-before-authoring rule (proximity IS
  substrate-inventory at runtime scope)
- grep-substrate-anchors-before-razor-as-metaphysical rule (substrate-
  anchor check at runtime scope)
- additive-not-zero-sum + monad-propagation + asymmetric-authorship

Real semantic embeddings (TF-IDF / sentence-BERT) deferred; current PoC
handles structural dedup case (substrate-engineering work often produces
variants that differ only in serialization order, key casing, attribute
ordering — canonical-form normalization catches these without embeddings).
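
To illustrate the structural case described above, here is one possible canonical-form function for JSON-like values. It is a hypothetical example of something a caller might pass as `canonicalFn`, not code from the PR: it lowercases object keys and sorts them, and it keeps array order and string values as-is (an assumption about which differences are meaningful).

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize a value so that key order and key casing no longer matter.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value); // primitives serialize as usual
  }
  if (Array.isArray(value)) {
    // Array order is treated as meaningful and preserved.
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  const obj: { [key: string]: Json } = value;
  const entries = Object.keys(obj)
    .map((k): [string, string] => [k.toLowerCase(), canonicalize(obj[k])])
    .sort((x, y) => (x[0] < y[0] ? -1 : x[0] > y[0] ? 1 : 0));
  return "{" + entries.map(([k, v]) => JSON.stringify(k) + ":" + v).join(",") + "}";
}

// Two variants that differ only in key order and key casing.
const variantA = canonicalize({ Name: "x", id: 1 });
const variantB = canonicalize({ ID: 1, name: "x" });
// Both are '{"id":1,"name":"x"}'
```

One caveat of this sketch: keys that differ only by case within the same object (for example `id` and `ID`) both survive as separate entries with the same lowercased name, so a production canonicalizer would need a rule for that collision.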

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 11:26
@AceHack AceHack enabled auto-merge (squash) May 28, 2026 11:26
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copilot AI left a comment

Pull request overview

Adds a TypeScript proximity de-duplication substrate for workflow-engine experiments, supporting deterministic canonical-form clustering and lightweight Jaccard/token similarity clustering for near-duplicate hypotheses before ranking/evolution/CI dispatch.

Changes:

  • Adds proximity.ts with Result-shaped clustering APIs, tokenization, Jaccard similarity, and representative extraction.
  • Adds proximity.test.ts with 19 Bun tests covering canonical clustering, similarity clustering, tokenizer behavior, errors, and evolution-substrate composition.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tools/workflow-engine/proximity.ts | Implements proximity de-duplication primitives and public API types. |
| tools/workflow-engine/proximity.test.ts | Adds invariant and behavior coverage for the new proximity substrate. |

Comment thread tools/workflow-engine/proximity.ts Outdated
Comment thread tools/workflow-engine/proximity.ts Outdated
AceHack and others added 2 commits May 28, 2026 07:56
…engineering-substrate-deduplication-canonical-form-normalization-2026-05-28
…nonicalForm semantic divergence (Copilot threads)

Two threads from Copilot on tools/workflow-engine/proximity.ts:

1. Docblock cross-reference "B-0914.6 backlog row" was misleading — the
   seven .N subtasks (.1-.7) are sections within the parent B-0914 row
   file, NOT separate B-0914.N row files. Reworded to "B-0914 subtask .6"
   with explicit parent-row pointer + cross-reference clarification for
   subtasks .5 and .2 as well.

2. Cluster.canonicalForm field semantically divergent between
   clusterByCanonical (real canonical-form string from CanonicalFn<T>)
   and clusterBySimilarity (synthesized "[similarity:<threshold>]:<tokens>"
   label). Added interface docblock that documents the divergence
   explicitly + names the discriminator (`[similarity:` prefix) callers
   can use + notes future-substrate rename path.
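
A caller-side check based on the discriminator named in item 2 might look like the sketch below. The helper name `clusterOrigin` is hypothetical, and the token formatting after the `]:` separator in the sample label is illustrative only, since the commit message gives the label template but not a concrete value.

```typescript
// Prefix documented in the Cluster interface docblock per the commit message.
const SIMILARITY_PREFIX = "[similarity:";

type ClusterOrigin = "canonical" | "similarity";

// Hypothetical helper: tell which clustering function produced a canonicalForm.
function clusterOrigin(canonicalForm: string): ClusterOrigin {
  return canonicalForm.startsWith(SIMILARITY_PREFIX) ? "similarity" : "canonical";
}

const fromSimilarity = clusterOrigin("[similarity:0.6]:keys sort"); // "similarity"
const fromCanonical = clusterOrigin('{"id":1,"name":"x"}'); // "canonical"
```

A string prefix is a weak discriminator: a genuine canonical form that happens to begin with `[similarity:` would be misclassified, which is one reason a dedicated field or a rename is the sturdier long-term option.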

Non-breaking: same field name + same type + same behavior; only docblock
expanded. Composes with asymmetric-authorship + monad-propagation rules
unchanged.

Autonomous-loop tick 2026-05-28T12:08Z resolution of PR #5772 BLOCKED
gate (unresolved Copilot threads only blocker; required checks all green).

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 12:13
@AceHack AceHack merged commit dc66cef into main May 28, 2026
31 of 33 checks passed
@AceHack AceHack deleted the otto-cli/b-0914-6-proximity-agent-substrate-engineering-substrate-deduplication-canonical-form-normalization-2026-05-28 branch May 28, 2026 12:16
AceHack added a commit that referenced this pull request May 28, 2026
… pass — completes 7-of-7 B-0914 candidate substrate-engineering gap substrate (#5773)

* feat(B-0914.7): Falcon-style auto-research-doc template substrate (8-section scaffold + Markdown renderer); 19 tests pass — completes 7-of-7 B-0914 candidate gap substrate

Per Sakana Robin Falcon agent (Nature 2026): takes drug proposal + does
deep-dive literature review + writes comprehensive research report. TS-
side scaffold provides 8-section template structure that downstream LLM
substrate-engineering work populates (header / framing / background /
mechanism / evidence / risks / composes-with / test-plan).

What this adds:
- ResearchDocSection discriminated union (9 section kinds)
- ResearchDoc structure (id + proposalId + sections + composesWith)
- ResearchDocFeedback + ResearchDocResult<T> Result-shape
- renderSection(section): string — pure-function Markdown serializer
- renderResearchDoc(doc): ResearchDocResult<string> — full doc rendering
- buildSkeleton(context): ResearchDocResult<ResearchDoc> — 8-section scaffold
- buildAndRender(context): ResearchDocResult<string> — end-to-end convenience

Falcon-stage pending markers preserved (substrate-honest about what's
not yet auto-generated by LLM substrate-engineering):
- '[PENDING LITERATURE REVIEW — Falcon-stage auto-generated]'
- '[PENDING MECHANISM ANALYSIS — Falcon-stage auto-generated]'
- etc. (per section)

Tests (19; all pass):
- EmptyProposalId validation
- 8-section Falcon scaffold structure
- proposalId sanitized to filename-safe id
- composesWith pass-through to skeleton + composes-with section
- All 9 section-kind renderings tested (header/framing/background/
  mechanism/evidence/risks/composes-with/test-plan/raw)
- renderResearchDoc empty → NoSectionsRendered
- buildAndRender end-to-end
- Pending markers preserved (substrate-honest)
- ResearchDocSection exhaustive switch
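
The scaffold and the exhaustive-switch test can be pictured with the simplified sketch below. It is not the code in `research-doc.ts`: the variant fields, the `SkeletonContext` shape, the sanitization rule, and the mapping of the two quoted pending labels to the `background` and `mechanism` sections are all assumptions. Labels for the remaining sections are elided in the commit message, so the fallback label here is a placeholder.

```typescript
type TextKind = "framing" | "background" | "mechanism" | "evidence" | "risks" | "test-plan";

// Simplified stand-in for the 9-kind discriminated union; field names are assumed.
type ResearchDocSection =
  | { kind: "header"; title: string }
  | { kind: TextKind; body: string }
  | { kind: "composes-with"; items: string[] }
  | { kind: "raw"; markdown: string };

interface ResearchDoc {
  id: string;
  proposalId: string;
  sections: ResearchDocSection[];
  composesWith: string[];
}

type ResearchDocResult<T> =
  | { ok: true; value: T }
  | { ok: false; feedback: { kind: "EmptyProposalId" } | { kind: "NoSectionsRendered" } };

// Only these two labels are quoted in the commit message; their section mapping is assumed.
const KNOWN_LABELS: Partial<Record<TextKind, string>> = {
  background: "LITERATURE REVIEW",
  mechanism: "MECHANISM ANALYSIS",
};

function pendingMarker(kind: TextKind): string {
  const label = KNOWN_LABELS[kind] ?? kind.toUpperCase(); // fallback label is hypothetical
  return `[PENDING ${label} \u2014 Falcon-stage auto-generated]`; // \u2014 is the dash in the quoted markers
}

function renderSection(section: ResearchDocSection): string {
  switch (section.kind) {
    case "header":
      return `# ${section.title}`;
    case "composes-with":
      return ["## composes-with", ...section.items.map((i) => `- ${i}`)].join("\n");
    case "raw":
      return section.markdown;
    case "framing":
    case "background":
    case "mechanism":
    case "evidence":
    case "risks":
    case "test-plan":
      return `## ${section.kind}\n\n${section.body}`;
    default: {
      // Compile-time exhaustiveness check: adding a kind without a case fails here.
      const unreachable: never = section;
      return unreachable;
    }
  }
}

interface SkeletonContext { proposalId: string; composesWith?: string[] } // hypothetical

function buildSkeleton(ctx: SkeletonContext): ResearchDocResult<ResearchDoc> {
  if (ctx.proposalId.trim() === "") return { ok: false, feedback: { kind: "EmptyProposalId" } };
  // Assumed sanitization: lowercase, runs of other characters become "-".
  const id = ctx.proposalId.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
  const composesWith = ctx.composesWith ?? [];
  const text = (kind: TextKind): ResearchDocSection => ({ kind, body: pendingMarker(kind) });
  const sections: ResearchDocSection[] = [
    { kind: "header", title: ctx.proposalId },
    text("framing"), text("background"), text("mechanism"), text("evidence"), text("risks"),
    { kind: "composes-with", items: composesWith },
    text("test-plan"),
  ];
  return { ok: true, value: { id, proposalId: ctx.proposalId, sections, composesWith } };
}

function renderResearchDoc(doc: ResearchDoc): ResearchDocResult<string> {
  if (doc.sections.length === 0) return { ok: false, feedback: { kind: "NoSectionsRendered" } };
  return { ok: true, value: doc.sections.map(renderSection).join("\n\n") };
}
```

The `never` assignment in the `default` branch is what makes the switch exhaustive at the type level: if a tenth section kind is added to the union without a matching `case`, the file stops compiling instead of silently rendering nothing.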

Composes with substrate:
- B-0914.7 backlog row (Falcon extension target)
- tools/save-ai-memory/ skill (existing substrate; future integration for
  auto-write to docs/research/ + composes-with citation discipline)
- Amara consolidation ferry pattern (PR #5757)
- B-0914.2 PR #5769 closed-loop orchestrator (research-doc generation
  at any cycle stage; template provides structure)
- substrate-or-it-didn't-happen + honor-those-that-came-before rules
- asymmetric-authorship + monad-propagation rules

**B-0914 7-of-7 candidate substrate-engineering gap substrate complete:**
- B-0914.1 PR #5764 TrueSkill ranking (S/M/L: ranking)
- B-0914.2 PR #5769 closed-loop orchestrator (S/M/L: L)
- B-0914.3 PR #5770 n-parallel + consensus (8-parallel-Finch)
- B-0914.4 PR #5768 generation-reflection pairing (S/M/L: M)
- B-0914.5 PR #5767 evolution mash-refine (S/M/L: S)
- B-0914.6 PR #5772 proximity-dedup (canonical + Jaccard clustering)
- B-0914.7 THIS PR Falcon-style auto-research-doc template

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(PR #5773): full rule paths + remove unreachable InvalidOperationalStatus variant (Copilot threads)

Two threads on tools/workflow-engine/research-doc.ts:

1. Composes-with docblock referenced rule files by short form
   (`asymmetric-authorship`, `monad-propagation-pattern`) — actual
   filenames are longer + .md-suffixed:
     `.claude/rules/asymmetric-authorship-substrate-entity-defines-consent-channel-recipient-acknowledges.md`
     `.claude/rules/monad-propagation-pattern-cross-language-substrate-shape.md`
   Updated to full paths so cross-refs stay greppable + don't drift.

2. ResearchDocFeedback.InvalidOperationalStatus variant was
   structurally unreachable: `operationalStatus` is a string-literal
   union (`"research-grade" | "operational"`) at the type level, the
   only constructor (line 179) fixes it to `"research-grade"`, and
   no untrusted-string parse path exists. Variant was dead substrate.
   Removed + added docblock naming the conditions under which a
   future caller should add it back (JSON import of external
   research-doc with operationalStatus parsed from untrusted input —
   add validator AT THE PARSE BOUNDARY first, then add this variant).
   Composes with asymmetric-authorship discipline: every TFeedback
   variant should correspond to a real code path that can produce it.
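
If a future caller does import research docs from untrusted JSON, the parse-boundary validator that item 2 describes could look roughly like this. The function name `parseOperationalStatus`, the result shape, and the `received` field are hypothetical; only the two status literals and the variant name come from the commit message.

```typescript
type OperationalStatus = "research-grade" | "operational";

const OPERATIONAL_STATUSES: readonly OperationalStatus[] = ["research-grade", "operational"];

// Hypothetical result shape for the parse boundary.
type StatusParseResult =
  | { ok: true; value: OperationalStatus }
  | { ok: false; feedback: { kind: "InvalidOperationalStatus"; received: unknown } };

// Validate an untrusted value once, at the boundary. Past this point the
// type system guarantees the status is one of the two literals.
function parseOperationalStatus(input: unknown): StatusParseResult {
  const match = OPERATIONAL_STATUSES.find((s) => s === input);
  return match !== undefined
    ? { ok: true, value: match }
    : { ok: false, feedback: { kind: "InvalidOperationalStatus", received: input } };
}

const parsedGood = parseOperationalStatus(JSON.parse('"operational"'));
const parsedBad = parseOperationalStatus(JSON.parse('"production"'));
```

With a validator like this in place, the `InvalidOperationalStatus` variant would correspond to a real code path, which is the condition the commit sets for adding it back.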

Non-breaking: no callers reference the removed variant (grep clean).
Type-system continues to rule out invalid operationalStatus at
construction time.

Autonomous-loop tick 2026-05-28T12:16Z resolution of PR #5773 BLOCKED
gate (unresolved Copilot threads only blocker; required checks all green).

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@AceHack AceHack review requested due to automatic review settings May 28, 2026 12:34

2 participants