Conversation
…plication (canonical-form + Jaccard-similarity clustering); 19 tests pass

Per Google co-scientist proximity agent (Nature 2026): maps ideas into high-dimensional space + groups similar variants to prevent wasting compute on substantively-identical proposals. Generalized to TS-side substrate with two de-dup mechanisms.

What this adds:
- ProximityFeedback discriminated union + ProximityResult<T> Result-shape
- Cluster<T> with representative + members + canonicalForm
- clusterByCanonical<T>(corpus, canonicalFn): deterministic dedup
- jaccardSimilarity(tokensA, tokensB): Jaccard coefficient
- defaultTokenize(text): lowercase + stop-word filter
- clusterBySimilarity<T>(context): greedy clustering by Jaccard threshold
- uniqueRepresentatives<T>(result): drop-duplicates convenience

Tests (19; all pass):
- clusterByCanonical groups same-canonical items
- first-seen is representative (pre-sort by score for top-ranked rep)
- empty corpus → EmptyCorpus
- all unique → N clusters of size 1
- jaccardSimilarity edge cases (identical / disjoint / partial / empty)
- defaultTokenize lowercase + stop-word filter
- clusterBySimilarity threshold catches near-duplicates
- High threshold keeps all distinct; low threshold clusters aggressively
- Invalid threshold → InvalidThreshold
- uniqueRepresentatives extracts rep-only list
- Compose with evolution substrate: pre-sort by score → rep is best
- ProximityFeedback exhaustive switch

Composes with substrate:
- B-0914.6 backlog row
- B-0914.5 PR #5767 evolution (de-dup Survivor list before mash)
- B-0914.2 PR #5769 closed-loop (de-dup pre-CI-dispatch saves cycles)
- verify-existing-substrate-before-authoring rule (proximity IS substrate-inventory at runtime scope)
- grep-substrate-anchors-before-razor-as-metaphysical rule (substrate-anchor check at runtime scope)
- additive-not-zero-sum + monad-propagation + asymmetric-authorship

Real semantic embeddings (TF-IDF / sentence-BERT) deferred; current PoC handles structural dedup case (substrate-engineering work often produces variants that differ only in serialization order, key casing, attribute ordering; canonical-form normalization catches these without embeddings).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
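The two de-dup mechanisms named in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `proximity.ts`: the function names come from the commit message, but the signatures, the stop-word list, the `textOf` parameter, the treatment of two empty token sets, and the exact label format are all assumptions, and the Result-shaped error handling (`EmptyCorpus`, `InvalidThreshold`) is left out for brevity.

```typescript
// Hypothetical sketch; signatures and bodies are assumptions, not proximity.ts.
interface Cluster<T> {
  representative: T;
  members: T[];
  canonicalForm: string;
}

// Assumed stop-word list; the real list is not shown in the source.
const STOP_WORDS = new Set(["a", "an", "the", "of", "to", "and", "in", "is"]);

// Lowercase, split on non-alphanumerics, drop empty tokens and stop words.
function defaultTokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0 && !STOP_WORDS.has(t));
}

// Jaccard coefficient |A ∩ B| / |A ∪ B|.
// Assumption: two empty token sets score 0 rather than 1.
function jaccardSimilarity(tokensA: string[], tokensB: string[]): number {
  const a = new Set(tokensA);
  const b = new Set(tokensB);
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  a.forEach((t) => {
    if (b.has(t)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}

// Deterministic dedup: items with the same canonical form share a cluster,
// and the first-seen item becomes the representative.
function clusterByCanonical<T>(
  corpus: T[],
  canonicalFn: (item: T) => string,
): Cluster<T>[] {
  const byForm = new Map<string, Cluster<T>>();
  for (const item of corpus) {
    const form = canonicalFn(item);
    const existing = byForm.get(form);
    if (existing) existing.members.push(item);
    else byForm.set(form, { representative: item, members: [item], canonicalForm: form });
  }
  return Array.from(byForm.values());
}

// Greedy clustering: each item joins the first cluster whose representative
// meets the Jaccard threshold, otherwise it starts a new cluster.
function clusterBySimilarity<T>(
  corpus: T[],
  textOf: (item: T) => string, // hypothetical accessor
  threshold: number,
): Cluster<T>[] {
  const clusters: { cluster: Cluster<T>; tokens: string[] }[] = [];
  for (const item of corpus) {
    const tokens = defaultTokenize(textOf(item));
    const hit = clusters.find((c) => jaccardSimilarity(c.tokens, tokens) >= threshold);
    if (hit) {
      hit.cluster.members.push(item);
    } else {
      // Label format is assumed; the source only describes it loosely.
      const canonicalForm = `[similarity:${threshold}]:${tokens.join(" ")}`;
      clusters.push({
        cluster: { representative: item, members: [item], canonicalForm },
        tokens,
      });
    }
  }
  return clusters.map((c) => c.cluster);
}

// Usage with made-up proposals: the first two differ only in casing and a stop word.
const proposals = ["Cache the parser output", "cache parser output", "Rewrite the scheduler"];
const grouped = clusterBySimilarity(proposals, (s) => s, 0.5);
console.log(grouped.map((c) => c.members.length));
```

Because the representative is simply the first item seen, pre-sorting the corpus by score (as the tests describe) is what makes the representative the top-ranked variant.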
Pull request overview
Adds a TypeScript proximity de-duplication substrate for workflow-engine experiments, supporting deterministic canonical-form clustering and lightweight Jaccard/token similarity clustering for near-duplicate hypotheses before ranking/evolution/CI dispatch.
Changes:
- Adds `proximity.ts` with Result-shaped clustering APIs, tokenization, Jaccard similarity, and representative extraction.
- Adds `proximity.test.ts` with 19 Bun tests covering canonical clustering, similarity clustering, tokenizer behavior, errors, and evolution-substrate composition.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tools/workflow-engine/proximity.ts | Implements proximity de-duplication primitives and public API types. |
| tools/workflow-engine/proximity.test.ts | Adds invariant and behavior coverage for the new proximity substrate. |
…engineering-substrate-deduplication-canonical-form-normalization-2026-05-28
…nonicalForm semantic divergence (Copilot threads)

Two threads from Copilot on tools/workflow-engine/proximity.ts:

1. Docblock cross-reference "B-0914.6 backlog row" was misleading: the seven .N subtasks (.1-.7) are sections within the parent B-0914 row file, NOT separate B-0914.N row files. Reworded to "B-0914 subtask .6" with explicit parent-row pointer + cross-reference clarification for subtasks .5 and .2 as well.
2. Cluster.canonicalForm field semantically divergent between clusterByCanonical (real canonical-form string from CanonicalFn<T>) and clusterBySimilarity (synthesized "[similarity:<threshold>]:<tokens>" label). Added interface docblock that documents the divergence explicitly + names the discriminator (`[similarity:` prefix) callers can use + notes future-substrate rename path.

Non-breaking: same field name + same type + same behavior; only docblock expanded.

Composes with asymmetric-authorship + monad-propagation rules unchanged.

Autonomous-loop tick 2026-05-28T12:08Z resolution of PR #5772 BLOCKED gate (unresolved Copilot threads only blocker; required checks all green).

Co-Authored-By: Claude <noreply@anthropic.com>
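A caller that needs to tell the two kinds of `canonicalForm` apart could check for the `[similarity:` prefix that the commit message names as the discriminator. The helper below is a hypothetical sketch under that assumption: `isSimilarityLabel` and the sample values are invented for illustration and are not described as part of `proximity.ts`.

```typescript
// Hypothetical helper; the commit only documents the prefix convention.
interface Cluster<T> {
  representative: T;
  members: T[];
  canonicalForm: string;
}

const SIMILARITY_PREFIX = "[similarity:";

// True when canonicalForm looks like a synthesized similarity label rather
// than a real canonical-form string produced by a canonical function.
function isSimilarityLabel<T>(cluster: Cluster<T>): boolean {
  return cluster.canonicalForm.startsWith(SIMILARITY_PREFIX);
}

// Made-up sample clusters for illustration only.
const fromCanonical: Cluster<string> = {
  representative: "x",
  members: ["x"],
  canonicalForm: '{"a":1,"b":2}',
};
const fromSimilarity: Cluster<string> = {
  representative: "y",
  members: ["y"],
  canonicalForm: "[similarity:0.5]:cache parser output",
};

console.log(isSimilarityLabel(fromCanonical), isSimilarityLabel(fromSimilarity));
```

A prefix check is only a convention: a canonical function could in principle emit a string that starts with `[similarity:`, which is presumably why the docblock also notes a future rename path.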
AceHack added a commit that referenced this pull request May 28, 2026
… pass; completes 7-of-7 B-0914 candidate substrate-engineering gap substrate (#5773)

* feat(B-0914.7): Falcon-style auto-research-doc template substrate (8-section scaffold + Markdown renderer); 19 tests pass; completes 7-of-7 B-0914 candidate gap substrate

Per Sakana Robin Falcon agent (Nature 2026): takes drug proposal + does deep-dive literature review + writes comprehensive research report. TS-side scaffold provides 8-section template structure that downstream LLM substrate-engineering work populates (header / framing / background / mechanism / evidence / risks / composes-with / test-plan).

What this adds:
- ResearchDocSection discriminated union (9 section kinds)
- ResearchDoc structure (id + proposalId + sections + composesWith)
- ResearchDocFeedback + ResearchDocResult<T> Result-shape
- renderSection(section): string; pure-function Markdown serializer
- renderResearchDoc(doc): ResearchDocResult<string>; full doc rendering
- buildSkeleton(context): ResearchDocResult<ResearchDoc>; 8-section scaffold
- buildAndRender(context): ResearchDocResult<string>; end-to-end convenience

Falcon-stage pending markers preserved (substrate-honest about what's not yet auto-generated by LLM substrate-engineering):
- '[PENDING LITERATURE REVIEW — Falcon-stage auto-generated]'
- '[PENDING MECHANISM ANALYSIS — Falcon-stage auto-generated]'
- etc. (per section)

Tests (19; all pass):
- EmptyProposalId validation
- 8-section Falcon scaffold structure
- proposalId sanitized to filename-safe id
- composesWith pass-through to skeleton + composes-with section
- All 9 section-kind renderings tested (header/framing/background/mechanism/evidence/risks/composes-with/test-plan/raw)
- renderResearchDoc empty → NoSectionsRendered
- buildAndRender end-to-end
- Pending markers preserved (substrate-honest)
- ResearchDocSection exhaustive switch

Composes with substrate:
- B-0914.7 backlog row (Falcon extension target)
- tools/save-ai-memory/ skill (existing substrate; future integration for auto-write to docs/research/ + composes-with citation discipline)
- Amara consolidation ferry pattern (PR #5757)
- B-0914.2 PR #5769 closed-loop orchestrator (research-doc generation at any cycle stage; template provides structure)
- substrate-or-it-didn't-happen + honor-those-that-came-before rules
- asymmetric-authorship + monad-propagation rules

**B-0914 7-of-7 candidate substrate-engineering gap substrate complete:**
- B-0914.1 PR #5764 TrueSkill ranking (S/M/L: ranking)
- B-0914.2 PR #5769 closed-loop orchestrator (S/M/L: L)
- B-0914.3 PR #5770 n-parallel + consensus (8-parallel-Finch)
- B-0914.4 PR #5768 generation-reflection pairing (S/M/L: M)
- B-0914.5 PR #5767 evolution mash-refine (S/M/L: S)
- B-0914.6 PR #5772 proximity-dedup (canonical + Jaccard clustering)
- B-0914.7 THIS PR Falcon-style auto-research-doc template

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(PR #5773): full rule paths + remove unreachable InvalidOperationalStatus variant (Copilot threads)

Two threads on tools/workflow-engine/research-doc.ts:

1. Composes-with docblock referenced rule files by short form (`asymmetric-authorship`, `monad-propagation-pattern`); actual filenames are longer + .md-suffixed:
   - `.claude/rules/asymmetric-authorship-substrate-entity-defines-consent-channel-recipient-acknowledges.md`
   - `.claude/rules/monad-propagation-pattern-cross-language-substrate-shape.md`

   Updated to full paths so cross-refs stay greppable + don't drift.
2. ResearchDocFeedback.InvalidOperationalStatus variant was structurally unreachable: `operationalStatus` is a string-literal union (`"research-grade" | "operational"`) at the type level, the only constructor (line 179) fixes it to `"research-grade"`, and no untrusted-string parse path exists. Variant was dead substrate. Removed + added docblock naming the conditions under which a future caller should add it back (JSON import of external research-doc with operationalStatus parsed from untrusted input: add validator AT THE PARSE BOUNDARY first, then add this variant). Composes with asymmetric-authorship discipline: every TFeedback variant should correspond to a real code path that can produce it.

Non-breaking: no callers reference the removed variant (grep clean). Type-system continues to rule out invalid operationalStatus at construction time.

Autonomous-loop tick 2026-05-28T12:16Z resolution of PR #5773 BLOCKED gate (unresolved Copilot threads only blocker; required checks all green).

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
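The "exhaustive switch" tests and the dead-variant removal both lean on TypeScript's discriminated-union narrowing. The sketch below shows the general pattern on a cut-down union of three section kinds. The `kind` tag, the field names, and the rendered Markdown are assumptions made for illustration; the real `ResearchDocSection` has nine kinds whose shapes are not shown in the source.

```typescript
// Hypothetical, cut-down union; the real ResearchDocSection is not shown here.
type Section =
  | { kind: "header"; title: string }
  | { kind: "composes-with"; refs: string[] }
  | { kind: "raw"; markdown: string };

// Pure-function Markdown serializer with an exhaustiveness check.
function renderSection(section: Section): string {
  switch (section.kind) {
    case "header":
      return `# ${section.title}`;
    case "composes-with":
      return ["## Composes with", ...section.refs.map((r) => `- ${r}`)].join("\n");
    case "raw":
      return section.markdown;
    default: {
      // If a new kind is added to Section without a case above, this
      // assignment stops compiling because `section` is no longer `never`.
      const unreachable: never = section;
      throw new Error(`unhandled section: ${JSON.stringify(unreachable)}`);
    }
  }
}

// Made-up input for illustration.
console.log(renderSection({ kind: "composes-with", refs: ["B-0914.2", "B-0914.5"] }));
```

With this pattern every variant in a union forces a branch in each consumer, so a variant that no code path can produce adds handling cost without benefit, which is consistent with the rationale given for removing `InvalidOperationalStatus`.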
This was referenced May 28, 2026
Summary
Google co-scientist proximity agent pattern generalized to TS-side substrate. Two de-dup mechanisms: canonical-form normalization (deterministic) + Jaccard-similarity clustering (lightweight; no embedding model).
19 tests pass / 0 fail.
Composes with
🤖 Generated with Claude Code