feat(B-0914.1): pure-TS TrueSkill 1v1 scaffold for workflow engine ranking-agent (hybrid TS+.NET; cross-vendor benchmark substrate)#5764

Merged: AceHack merged 1 commit into main from otto-cli/b-0914-1-trueskill-ranking-agent-scaffold-workflow-engine-rank-via-trueskill-action-class-2026-05-28 on May 28, 2026
AceHack commented May 28, 2026

Summary

Pure-TS TrueSkill 1v1 implementation (Herbrich+Minka+Graepel 2007 NeurIPS paper algorithm) for workflow engine ranking-agent substrate.

Per Aaron (2026-05-28), this takes the hybrid substrate-engineering path. The TS side serves the vendor skill runtime, because the cross-vendor benchmark on common ground (B-0865.17) requires TS. The .NET side uses Infer.NET via Zeta.Bayesian for deep production integration. Both sides compose via a shared API shape.

17 tests pass / 0 fail.

What this adds

  • TrueSkillRating (mu + sigma posterior gaussian)
  • MatchOutcome + RankingFeedback + RankingResult discriminated unions
  • rate1v1(a, b, outcome): RankingResult — full TrueSkill 1v1 update
  • conservativeSkill(rating) — Xbox Live leaderboard lower-bound
  • Default initial rating + params per Xbox Live convention
  • Internal helpers: normal PDF/CDF (A&S 7.1.26), inverse-normal-CDF (Newton's method), draw margin, truncated-normal correction functions
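
As a minimal sketch of the rating shape and the leaderboard lower bound listed above: the field names follow the bullets (mu + sigma), and the default values follow the Xbox Live convention (mu = 25, sigma = 25/3). The exact declarations in `trueskill.ts` are not shown on this page, so treat the names below as illustrative assumptions.

```typescript
// Illustrative shape only; the real module's declarations may differ.
interface TrueSkillRating {
  readonly mu: number;    // posterior mean of the skill estimate
  readonly sigma: number; // posterior standard deviation (uncertainty)
}

// Xbox Live convention: mu = 25, sigma = 25 / 3.
const DEFAULT_INITIAL_RATING: TrueSkillRating = { mu: 25, sigma: 25 / 3 };

// Conservative lower bound used for leaderboard ordering: mu - 3 * sigma.
function conservativeSkill(rating: TrueSkillRating): number {
  return rating.mu - 3 * rating.sigma;
}

const newcomer = DEFAULT_INITIAL_RATING;
const veteran: TrueSkillRating = { mu: 24, sigma: 1 };

console.log(conservativeSkill(newcomer)); // 0, up to floating-point rounding
console.log(conservativeSkill(veteran));  // 21
```

The veteran has a lower mean than the newcomer but ranks higher, because the bound penalizes uncertainty.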

Composes with substrate

Test plan

  • 17 tests pass; default initial rating + params match Xbox Live
  • All 3 MatchOutcome variants exercised
  • Strong-vs-weak skill update semantics correct (small for expected, large for upset)
  • Draw between equal players → minimal change
  • 5-match tournament convergence
  • Input validation (InvalidRating for NaN / non-positive sigma)
  • Exhaustive switch on MatchOutcome union
  • CI: lint(tsc tools)
  • Auto-merge armed

🤖 Generated with Claude Code

…nking-agent (Herbrich+Minka+Graepel 2007 paper algorithm; substrate for cross-vendor benchmark on common ground)

Per Aaron 2026-05-28 substantive substrate-engineering decision:
- 'they are doing this for their idea ranking with Infra.net basically'
- 'we'd build ELO from scratch is this a good idea too or nah with infer.net?'
- 'you are too careful just ship stuff and lets inventory later'

Substrate-honest answer shipped: HYBRID is best.
- TS-side (this PR): pure-TS TrueSkill 1v1 for vendor skill runtime
  (cross-vendor benchmark on common ground B-0865.17 REQUIRES TS-side
  because Infer.NET can't run in Claude/GPT/Gemini/Grok skill stores)
- F#/.NET side (future Zeta.Bayesian work): Infer.NET TrueSkill for
  deep production integration + full BP/EP framework
- Both compose via shared API shape (TrueSkillRating + match update fn)

Implementation: published TrueSkill algorithm from Herbrich+Minka+Graepel
2007 NeurIPS paper. Minimal 1v1 case; team-play extension deferred.
~340 lines including documentation.

What this adds:
- TrueSkillRating interface (mu + sigma posterior gaussian)
- DEFAULT_INITIAL_RATING (Xbox Live convention: mu=25 sigma=25/3)
- DEFAULT_PARAMS (beta=mu/6 tau=mu/300 drawProb=0.10)
- MatchOutcome discriminated union (win-A / win-B / draw)
- RankingFeedback discriminated union (InvalidRating / NumericalInstability / UnsupportedOutcome)
- RankingResult Result-shape per monad-propagation rule
- rate1v1(a, b, outcome, params): RankingResult — full 1v1 TrueSkill update
- conservativeSkill(rating): number — Xbox Live lower-bound convention (mu - 3*sigma)
- Internal helpers: normalPdf, normalCdf (A&S 7.1.26), inverseNormalCdf
  (Newton's method), drawMargin, vWin/wWin (non-draw truncated normal
  corrections), vDraw/wDraw (draw truncated normal corrections)
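
The internal helpers named above can be sketched as follows. This is an illustration written from the published formulas as commonly stated (erf via Abramowitz & Stegun 7.1.26, plus the truncated-normal v/w corrections), not a copy of the PR's source. The function bodies are assumptions, and the sketch has no guard against tiny denominators in the far tails.

```typescript
function normalPdf(x: number): number {
  return Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
}

// Normal CDF via the A&S 7.1.26 erf approximation (absolute error about 1.5e-7).
function normalCdf(x: number): number {
  const z = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * z);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erfAbs = 1 - poly * Math.exp(-z * z);
  return x >= 0 ? 0.5 * (1 + erfAbs) : 0.5 * (1 - erfAbs);
}

// Non-draw corrections: t is the scaled mean difference (winner minus loser),
// epsilon is the scaled draw margin.
function vWin(t: number, epsilon: number): number {
  return normalPdf(t - epsilon) / normalCdf(t - epsilon);
}

function wWin(t: number, epsilon: number): number {
  const v = vWin(t, epsilon);
  return v * (v + t - epsilon);
}

// Draw corrections: the outcome constrains the performance difference to (-epsilon, epsilon).
function vDraw(t: number, epsilon: number): number {
  const denom = normalCdf(epsilon - t) - normalCdf(-epsilon - t);
  return (normalPdf(-epsilon - t) - normalPdf(epsilon - t)) / denom;
}

function wDraw(t: number, epsilon: number): number {
  const denom = normalCdf(epsilon - t) - normalCdf(-epsilon - t);
  const v = vDraw(t, epsilon);
  return v * v + ((epsilon - t) * normalPdf(epsilon - t) + (epsilon + t) * normalPdf(epsilon + t)) / denom;
}

console.log(vWin(0, 0)); // ≈ 0.798, i.e. sqrt(2 / pi)
```

For a draw with t > 0, `vDraw` is negative, so the stronger player's mean moves down and the weaker player's moves up.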

Tests (17; all pass):
- Default initial rating Xbox Live convention
- Default params paper convention
- conservativeSkill = mu - 3*sigma
- win-A increases A's mu, decreases B's
- win-B increases B's mu, decreases A's
- Both sigmas decrease after match (uncertainty reduction)
- After 2 matches both sigmas decrease + mus drift bounded
- Strong-beats-weak → small mu shift (expected outcome)
- Weak-beats-strong → large mu shift (upset)
- Draw between equal players → minimal mu change
- Draw between unequal players → strong loses mu, weak gains
- Returns InvalidRating for NaN mu / non-positive sigma / negative sigma
- conservativeSkill ranking with sigma-punishment semantic preserved
- 5-match tournament convergence (sigma reduction + mu separation)
- MatchOutcome exhaustive switch (TS strict mode)
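
Several of the tested behaviors (winner's mu rises, both sigmas shrink, upsets move mu more than expected results) can be reproduced with a compact win-only update. The sketch below is an assumption-laden illustration, not the PR's `rate1v1`: `rateWin`, `Rating`, `Params`, and the precomputed `drawMargin` value are hypothetical, and the draw branch is omitted.

```typescript
// Hypothetical names throughout; this is not the PR's rate1v1 source.
interface Rating { mu: number; sigma: number }
interface Params { beta: number; tau: number; drawMargin: number }

// beta = mu/6 and tau = mu/300 as listed above; drawMargin is assumed to be
// precomputed for drawProb = 0.10.
const PARAMS: Params = { beta: 25 / 6, tau: 25 / 300, drawMargin: 0.7405 };

function pdf(x: number): number {
  return Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
}

// Normal CDF via the A&S 7.1.26 erf approximation.
function cdf(x: number): number {
  const z = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * z);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erfAbs = 1 - poly * Math.exp(-z * z);
  return x >= 0 ? 0.5 * (1 + erfAbs) : 0.5 * (1 - erfAbs);
}

function rateWin(winner: Rating, loser: Rating, p: Params = PARAMS): { winner: Rating; loser: Rating } {
  // Dynamics step: inflate each variance by tau^2 before the update.
  const varW = winner.sigma ** 2 + p.tau ** 2;
  const varL = loser.sigma ** 2 + p.tau ** 2;
  const c = Math.sqrt(2 * p.beta ** 2 + varW + varL);
  const x = (winner.mu - loser.mu - p.drawMargin) / c; // scaled t - epsilon
  const v = pdf(x) / cdf(x); // mean correction
  const w = v * (v + x);     // variance correction, in (0, 1)
  return {
    winner: { mu: winner.mu + (varW / c) * v, sigma: Math.sqrt(varW * (1 - (varW / (c * c)) * w)) },
    loser: { mu: loser.mu - (varL / c) * v, sigma: Math.sqrt(varL * (1 - (varL / (c * c)) * w)) },
  };
}

const fresh: Rating = { mu: 25, sigma: 25 / 3 };
console.log(rateWin(fresh, fresh).winner); // mu ≈ 29.40, sigma ≈ 7.17
```

An upset produces a larger shift because `v` grows as the winner's prior mean falls further below the loser's.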

Composes with substrate:
- B-0914.1 backlog row (TrueSkill ranking-agent extension target)
- B-0867 workflow engine substrate (future ActionClass 'rank-via-trueskill')
- B-0865 + B-0865.17 cross-vendor benchmark substrate
- B-0867.20 lifecycle DU (rank action gets pr-review-light via Mod 1)
- Microsoft Infer.NET upstream reference (PR #5763 in flight)
- .claude/rules/monad-propagation-pattern (Result<T, TFeedback> shape)
- .claude/rules/asymmetric-authorship (TFeedback authored by ranking fn)

Source citation: Herbrich, Minka, Graepel 'TrueSkill: A Bayesian Skill
Rating System' (NeurIPS 2006/2007); algorithm implementation from
published paper, not Infer.NET source.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 11:11
AceHack enabled auto-merge (squash) May 28, 2026 11:11
AceHack merged commit 00db2df into main May 28, 2026
32 of 33 checks passed
AceHack deleted the otto-cli/b-0914-1-trueskill-ranking-agent-scaffold-workflow-engine-rank-via-trueskill-action-class-2026-05-28 branch May 28, 2026 11:14
Copilot AI left a comment

Pull request overview

Adds a pure TypeScript TrueSkill 1v1 rating update module intended as the workflow-engine ranking-agent substrate, with Bun tests covering core invariants and several scenario-based behaviors.

Changes:

  • Introduces tools/workflow-engine/trueskill.ts implementing 1v1 TrueSkill updates (rate1v1) plus supporting math helpers and defaults.
  • Adds tools/workflow-engine/trueskill.test.ts with Bun tests validating defaults, outcome behaviors (win/draw), and basic convergence expectations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tools/workflow-engine/trueskill.ts | New TrueSkill 1v1 implementation + result/feedback types and math helpers. |
| tools/workflow-engine/trueskill.test.ts | New Bun test suite exercising rating update invariants and scenarios. |

Comment on lines +181 to +200
function inverseNormalCdf(p: number): number {
  // Initial guess via rational approximation (Beasley-Springer-Moro)
  if (p <= 0 || p >= 1) {
    throw new Error(`inverseNormalCdf domain: ${p}`);
  }
  let x = 0; // initial guess
  // Newton's method on F(x) = cdf(x) - p (F'(x) = pdf(x))
  for (let i = 0; i < 30; i++) {
    const f = normalCdf(x) - p;
    const fp = normalPdf(x);
    if (Math.abs(fp) < 1e-30) break;
    const dx = f / fp;
    x = x - dx;
    if (Math.abs(dx) < 1e-10) break;
  }
  return x;
}

function drawMargin(drawProbability: number, beta: number): number {
  return Math.sqrt(2) * beta * inverseNormalCdf((1 + drawProbability) / 2);
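
For reference, a self-contained sketch of the draw-margin computation quoted in the hunk above. It follows the same approach (Newton's method starting from x = 0, CDF from the A&S 7.1.26 approximation), but it is a re-implementation for illustration, not the merged source. The iteration cap and error type are assumptions.

```typescript
function normalPdf(x: number): number {
  return Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
}

// Normal CDF via the A&S 7.1.26 erf approximation.
function normalCdf(x: number): number {
  const z = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * z);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erfAbs = 1 - poly * Math.exp(-z * z);
  return x >= 0 ? 0.5 * (1 + erfAbs) : 0.5 * (1 - erfAbs);
}

// Solve cdf(x) = p by Newton's method, starting from x = 0.
// Because cdf is concave for x > 0 and convex for x < 0, the iterates approach
// the root monotonically from the starting side, though slowly for extreme p.
function inverseNormalCdf(p: number): number {
  if (!(p > 0 && p < 1)) {
    throw new RangeError(`inverseNormalCdf domain: ${p}`);
  }
  let x = 0;
  for (let i = 0; i < 50; i++) {
    const dx = (normalCdf(x) - p) / normalPdf(x);
    x -= dx;
    if (Math.abs(dx) < 1e-10) break;
  }
  return x;
}

// Draw margin for a 1v1 match: sqrt(2) * beta * Phi^{-1}((1 + drawProbability) / 2).
function drawMargin(drawProbability: number, beta: number): number {
  return Math.SQRT2 * beta * inverseNormalCdf((1 + drawProbability) / 2);
}

console.log(drawMargin(0.1, 25 / 6)); // ≈ 0.7405 for the default parameters
```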
Comment on lines +179 to +186
 * Uses iterative inverse-normal-CDF via Newton's method (~5-10 iterations).
 */
function inverseNormalCdf(p: number): number {
  // Initial guess via rational approximation (Beasley-Springer-Moro)
  if (p <= 0 || p >= 1) {
    throw new Error(`inverseNormalCdf domain: ${p}`);
  }
  let x = 0; // initial guess
Comment on lines +257 to +299
  let v: number;
  let w: number;
  let signA: number; // direction of mu update for player A
  let signB: number;

  switch (outcome.kind) {
    case "win-A": {
      const t = (a.mu - b.mu) / c;
      v = vWin(t, epsilon);
      w = wWin(t, epsilon);
      signA = +1;
      signB = -1;
      break;
    }
    case "win-B": {
      const t = (b.mu - a.mu) / c;
      v = vWin(t, epsilon);
      w = wWin(t, epsilon);
      signA = -1;
      signB = +1;
      break;
    }
    case "draw": {
      // Draw uses symmetric truncated-normal correction
      const t = (a.mu - b.mu) / c;
      v = vDraw(t, epsilon);
      w = wDraw(t, epsilon);
      // For draws, the mu shifts toward the opponent's mu
      signA = +1;
      signB = -1;
      break;
    }
  }

  if (!Number.isFinite(v) || !Number.isFinite(w)) {
    return {
      ok: false,
      feedback: {
        kind: "NumericalInstability",
        reason: `v=${v} w=${w}`,
      },
    };
  }
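
The hunk above returns a failure as `{ ok: false, feedback: { kind, reason } }`. A minimal sketch of that Result shape, together with input validation and an exhaustive switch over the outcome union, might look like the following. The variant names come from the PR description. The `value` field on the success branch, the `reason` field on every feedback variant, and the helpers `validateRating`, `winnerLabel`, and `assertNever` are assumptions for illustration.

```typescript
interface TrueSkillRating { mu: number; sigma: number }

type MatchOutcome = { kind: "win-A" } | { kind: "win-B" } | { kind: "draw" };

type RankingFeedback =
  | { kind: "InvalidRating"; reason: string }
  | { kind: "NumericalInstability"; reason: string }
  | { kind: "UnsupportedOutcome"; reason: string };

// The `value` field name on the success branch is an assumption.
type ResultShape<T, F> = { ok: true; value: T } | { ok: false; feedback: F };

// Compile-time exhaustiveness guard: adding a new variant without handling it
// makes the call site fail to type-check.
function assertNever(x: never): never {
  throw new Error(`Unhandled variant: ${JSON.stringify(x)}`);
}

function validateRating(r: TrueSkillRating): ResultShape<TrueSkillRating, RankingFeedback> {
  if (!Number.isFinite(r.mu)) {
    return { ok: false, feedback: { kind: "InvalidRating", reason: `mu=${r.mu}` } };
  }
  if (!Number.isFinite(r.sigma) || r.sigma <= 0) {
    return { ok: false, feedback: { kind: "InvalidRating", reason: `sigma=${r.sigma}` } };
  }
  return { ok: true, value: r };
}

function winnerLabel(outcome: MatchOutcome): "A" | "B" | "none" {
  switch (outcome.kind) {
    case "win-A":
      return "A";
    case "win-B":
      return "B";
    case "draw":
      return "none";
    default:
      return assertNever(outcome);
  }
}
```

Callers narrow on `ok` before touching either branch, so a failure cannot be read as a rating by accident.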
Comment on lines +5 to +6
*
* B-0914.1 — pure-TS TrueSkill 1v1 scaffold for workflow engine
* ranking-agent (per Aaron 2026-05-28: 'they are doing this for their
* idea ranking with Infra.net basically' + 'just ship stuff' calibration).
AceHack added a commit that referenced this pull request May 28, 2026
…es.net PhD learning substrate (Aaron 2026-05-28 substrate-engineering questions) (#5765)

Per Aaron 2026-05-28 substrate-engineering questions:
- 'is there anything like infer.net in ts? can we build it if not using infer.net source code for reference?' → WebPPL is closest TS/JS analog
- 'you'd love videolectures.net in your free time i think... PhD everything here. they don't throttle and they have transcripts and powerpoints' → free-time-substrate learning material

Adds 2 entries to references/reference-sources.json + new
'Probabilistic programming / Bayesian inference' section to
docs/UPSTREAM-LIST.md:

1. WebPPL (probmods/webppl; Stanford; MIT-licensed)
   - Full PP framework in JS with multiple inference engines
   - Closest TS-side substrate to Microsoft Infer.NET
   - Composes with B-0914.1 TrueSkill substrate (PR #5764)
   - Composes with future factor-graph-DSL work

2. videolectures.net (PhD learning substrate; Aaron-named for
   free-time-as-valid-mode substrate per never-be-idle + agent-qol)
   - Transcripts + slides substrate-accessible
   - Tom Minka TrueSkill canonical talks
   - Per Aaron: 'they don't throttle that i can tell'

Composes with substrate:
- PR #5763 (Google co-scientist + Sakana Robin + Microsoft Infer.NET
  upstream additions)
- PR #5764 (B-0914.1 pure-TS TrueSkill 1v1 scaffold)
- B-0914 (7 substrate-engineering candidate gaps)
- B-0914.1 (TrueSkill ranking-agent extension target)
- B-0865 + B-0865.17 cross-vendor benchmark substrate

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 28, 2026
…ors); closes tournament loop with TrueSkill (PR #5764); 12 tests pass (#5767)

Per Aaron 2026-05-28 'S M L all please in that order lol' — this is S
(small/tight scope) in the substrate-engineering ship-sequence.

Evolution agent pattern from Google co-scientist (Nature 2026): takes
top-N TrueSkill-ranked survivors + mashes them into refined variants.
Pure function over typed survivors; tight ~200-line implementation;
3 composition strategies; full Result-shape per monad-propagation +
asymmetric-authorship rules.

Closes the tournament loop with TrueSkill (PR #5764):
1. Generate hypotheses (LLM call; out of scope)
2. Rank via TrueSkill (B-0914.1 — shipped)
3. Take top-N survivors
4. Mash + refine (this PR — B-0914.5)
5. Loop back to step 2 with refined variants

What this adds:
- Survivor<T> interface (generic over substrate type)
- EvolutionStrategy union (simple-merge | cross-pollinate | mutate)
- EvolutionFeedback discriminated union
- EvolutionResult<T> Result-shape
- RefinedVariant<T> with derivedFrom + composesWith for provenance
- evolveSurvivors<T>(context): EvolutionResult<T> — main function
- evolveTopN<T>(survivors, n, strategy, options): EvolutionResult<T> —
  convenience that slices top-N before evolving

Strategies:
- simple-merge: top survivor as base + fill gaps from next
- cross-pollinate: interleave attributes between top 2 (by sorted-key parity)
- mutate: apply caller-supplied transformer to top survivor
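
A rough sketch of the first two strategies over flat string attributes. The commit message does not show the real `Survivor<T>` type or which parity maps to which parent, so the `Attributes` alias, the even-index-takes-first rule, and the fallback when a parent lacks a key are all assumptions.

```typescript
// Hypothetical attribute bag; the real Survivor<T> is generic over substrate type.
type Attributes = Record<string, string>;

// simple-merge: the top survivor wins on overlapping keys; gaps are filled from the next one.
function simpleMerge(top: Attributes, next: Attributes): Attributes {
  return { ...next, ...top };
}

// cross-pollinate: walk the sorted union of keys and alternate the source parent
// by index parity (even index -> first parent, odd index -> second parent).
function crossPollinate(first: Attributes, second: Attributes): Attributes {
  const keys = Array.from(new Set(Object.keys(first).concat(Object.keys(second)))).sort();
  const out: Attributes = {};
  keys.forEach((key, index) => {
    const preferred = index % 2 === 0 ? first : second;
    const fallback = index % 2 === 0 ? second : first;
    // Assumption: fall back to the other parent when the preferred one lacks the key.
    const value = key in preferred ? preferred[key] : fallback[key];
    if (value !== undefined) out[key] = value;
  });
  return out;
}

const parentA: Attributes = { approach: "A1", budget: "B1", scope: "C1" };
const parentB: Attributes = { approach: "A2", budget: "B2", scope: "C2" };
console.log(crossPollinate(parentA, parentB)); // approach from A, budget from B, scope from A
```

Sorting the keys first keeps the result deterministic regardless of property insertion order.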

Provenance via derivedFrom (survivor ids) + composesWith (cumulative
attribution per honor-those-that-came-before).

Tests (12; all pass):
- simple-merge: top wins on overlap, fills gaps from next
- cross-pollinate: alternates attributes by sorted-key parity
- mutate: applies caller transformer
- mutate without mutator → MergeConflict
- empty survivor → EmptySurvivorSet
- simple-merge with 1 survivor → InsufficientSurvivors
- cross-pollinate with 1 survivor → InsufficientSurvivors
- derivedFrom + composesWith preserve provenance
- evolveTopN slices correctly
- evolveTopN with N=1 mutate
- variant id includes prefix + strategy + survivor ids
- EvolutionStrategy exhaustive switch (TS strict mode)

Composes with substrate:
- B-0914.5 backlog row (evolution agent extension target)
- B-0914.1 PR #5764 (TrueSkill substrate; ranking input)
- B-0867 workflow engine (future ActionClass 'evolve-via-mash-refine')
- .claude/rules/additive-not-zero-sum.md
- .claude/rules/honor-those-that-came-before.md
- .claude/rules/monad-propagation-pattern + asymmetric-authorship

Next per S/M/L sequence: M (medium) = generation-reflection adversarial
pairing structurally enforced (B-0914.4); L (large) = closed-loop
CI-result → next-hypothesis dispatch (B-0914.2).

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 28, 2026
…orchestrator (composes TrueSkill + evolution + pairing via injectable callbacks); S/M/L sequence COMPLETE (#5769)

* feat(B-0914.2): L — closed-loop CI-result → next-hypothesis dispatch orchestrator (composes TrueSkill + evolution + pairing via injectable callbacks); 16 tests pass

Per Aaron 2026-05-28 'S M L all please in that order lol' — L (large
scope) in the substrate-engineering ship-sequence. Wire-up that turns
the tournament-loop substrate into a live closed-loop iteration system.

Design: pure loop-orchestration substrate with INJECTABLE callbacks for
substrate-specific operations (ranking / evolution / verification +
CI-dispatch). Caller provides functions; orchestrator handles loop
structure + propagation discipline. Separation-of-concerns means
orchestrator does NOT tightly couple to specific TrueSkill / evolution /
pairing module implementations — it composes with ANY substrate that
implements the callback contracts.

What this adds:
- Hypothesis<T> generic substrate item with cycleIndex + derivedFrom ancestry
- CiVerdict discriminated union (passed | failed | needs-revision | infrastructure-error)
- LoopFeedback + LoopResult<T> Result-shape per monad-propagation
- LoopCallbacks<T> interface (dispatchCi + rankSurvivors + evolveSurvivors)
- LoopConfig (maxCycles + topNToEvolve + minPropagatable; DEFAULT_LOOP_CONFIG)
- runCycle<T>(hypotheses, callbacks, cycleIndex, config?) — single cycle
- runLoop<T>(initial, callbacks, config?, shouldContinue?) — full iteration
  with LoopTermination shape (cycle count + reason + final state)

Cycle steps:
1. Dispatch each hypothesis to CI (caller-injected)
2. Collect verdicts
3. Filter to propagatable (passed + needs-revision-with-suggestions)
4. Rank via TrueSkill (caller-injected per B-0914.1 PR #5764)
5. Evolve top-N (caller-injected per B-0914.5 PR #5767)
6. Return refined variants for next cycle

Termination conditions:
- max-cycles: bounded iteration reached
- insufficient-propagatable: too many failures; can't continue
- predicate-stopped: caller-supplied predicate returned false
- error: CI/ranking/evolution exception
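
The cycle steps and termination conditions above can be sketched as one loop with injected callbacks. This is a simplified, synchronous illustration: the real `runLoop` and `runCycle` signatures, the `Hypothesis<T>` wrapper, and the `LoopFeedback` variants are not reproduced, and every field not named in the commit message (such as `cycles`, `finalState`, and `suggestions`) is an assumption.

```typescript
type CiVerdict =
  | { kind: "passed" }
  | { kind: "failed" }
  | { kind: "needs-revision"; suggestions: string[] }
  | { kind: "infrastructure-error" };

interface LoopCallbacks<T> {
  dispatchCi(hypothesis: T): CiVerdict;
  rankSurvivors(items: T[]): T[]; // best first
  evolveSurvivors(top: T[]): T[];
}

interface LoopConfig { maxCycles: number; topNToEvolve: number; minPropagatable: number }

type TerminationReason = "max-cycles" | "insufficient-propagatable" | "predicate-stopped" | "error";

interface LoopTermination<T> { cycles: number; reason: TerminationReason; finalState: T[] }

// passed, or needs-revision with at least one suggestion, propagates forward.
function isPropagatable(v: CiVerdict): boolean {
  return v.kind === "passed" || (v.kind === "needs-revision" && v.suggestions.length > 0);
}

function runLoopSketch<T>(
  initial: T[],
  callbacks: LoopCallbacks<T>,
  config: LoopConfig,
  shouldContinue: (cycle: number, state: T[]) => boolean = () => true,
): LoopTermination<T> {
  let state = initial;
  for (let cycle = 0; cycle < config.maxCycles; cycle++) {
    if (!shouldContinue(cycle, state)) {
      return { cycles: cycle, reason: "predicate-stopped", finalState: state };
    }
    try {
      const survivors = state.filter((h) => isPropagatable(callbacks.dispatchCi(h)));
      if (survivors.length < config.minPropagatable) {
        return { cycles: cycle, reason: "insufficient-propagatable", finalState: state };
      }
      const ranked = callbacks.rankSurvivors(survivors);
      state = callbacks.evolveSurvivors(ranked.slice(0, config.topNToEvolve));
    } catch {
      // Any callback exception ends the loop; the sketch does not distinguish which one threw.
      return { cycles: cycle, reason: "error", finalState: state };
    }
  }
  return { cycles: config.maxCycles, reason: "max-cycles", finalState: state };
}
```

Because the orchestrator only sees the callback contracts, a caller could wire in a TrueSkill-based `rankSurvivors` or any other ranking function without changing the loop.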

Tests (16; all pass):
- Empty hypotheses → EmptyHypothesisSet
- Passing CI → propagation through ranking + evolution
- Failed verdicts excluded from propagation
- needs-revision with suggestions included; without excluded
- Below minPropagatable → MaxCyclesReached
- CI exception → CiDispatchFailure
- Ranking exception → RankingFailure
- Evolution exception → EvolutionFailure
- infrastructure-error excluded (doesn't reflect hypothesis quality)
- runLoop iterates until max-cycles
- runLoop predicate-stopped early termination
- runLoop insufficient-propagatable
- runLoop error termination
- LoopFeedback exhaustive switch
- CiVerdict exhaustive switch
- Integration: full closed-loop with realistic callback wiring

Composes with substrate:
- B-0914.2 backlog row (closed-loop dispatch extension target)
- B-0914.1 PR #5764 (TrueSkill substrate; caller wires rate1v1 + conservativeSkill into rankSurvivors)
- B-0914.4 PR #5768 (pairing tracker substrate; caller wires verdicts into recordVerification)
- B-0914.5 PR #5767 (evolution substrate; caller wires evolveTopN into evolveSurvivors)
- B-0891 zflash test-harness substrate (caller can wire CI dispatch to actual test runners per determineRunnability discriminator)
- B-0867 workflow engine substrate
- Sakana Robin closed-loop pattern (Nature 2026 s41586-026-10652-y)

Tournament loop NOW STRUCTURALLY COMPLETE: 6 stages built from 4 substrate pieces (THIS PR plus PRs #5768, #5764, #5767):
1. Generation (LLM call; out of scope for this lane)
2. CI dispatch → CiVerdict (THIS PR via callbacks)
3. Pairing tracking (PR #5768)
4. TrueSkill ranking (PR #5764)
5. Evolution mash-refine (PR #5767)
6. runLoop orchestration (THIS PR)

S/M/L sequence complete:
- S = PR #5767 evolution
- M = PR #5768 pairing
- L = THIS PR closed-loop

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(B-0914.2): address 7 Copilot review threads on PR #5769

- Replace 'Aaron' with 'human maintainer' role-ref per AGENT-BEST-PRACTICES (Otto-279)
- Fix broken rule-path xrefs (full filenames for monad-propagation + asymmetric-authorship)
- Split LoopFeedback: introduce InsufficientPropagatable variant separate from MaxCyclesReached
- Update runLoop to map InsufficientPropagatable -> insufficient-propagatable termination
- Add assertNever default in exhaustiveness tests (compile-time guard now real)
- Tighten integration test: deterministic insufficient-propagatable at cycle 1

16 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 28, 2026
… pass — completes 7-of-7 B-0914 candidate substrate-engineering gap substrate (#5773)

* feat(B-0914.7): Falcon-style auto-research-doc template substrate (8-section scaffold + Markdown renderer); 19 tests pass — completes 7-of-7 B-0914 candidate gap substrate

Per Sakana Robin Falcon agent (Nature 2026): takes drug proposal + does
deep-dive literature review + writes comprehensive research report. TS-
side scaffold provides 8-section template structure that downstream LLM
substrate-engineering work populates (header / framing / background /
mechanism / evidence / risks / composes-with / test-plan).

What this adds:
- ResearchDocSection discriminated union (9 section kinds)
- ResearchDoc structure (id + proposalId + sections + composesWith)
- ResearchDocFeedback + ResearchDocResult<T> Result-shape
- renderSection(section): string — pure-function Markdown serializer
- renderResearchDoc(doc): ResearchDocResult<string> — full doc rendering
- buildSkeleton(context): ResearchDocResult<ResearchDoc> — 8-section scaffold
- buildAndRender(context): ResearchDocResult<string> — end-to-end convenience

Falcon-stage pending markers preserved (substrate-honest about what's
not yet auto-generated by LLM substrate-engineering):
- '[PENDING LITERATURE REVIEW — Falcon-stage auto-generated]'
- '[PENDING MECHANISM ANALYSIS — Falcon-stage auto-generated]'
- etc. (per section)
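
To make the renderer and pending markers concrete, here is a reduced sketch covering 3 of the 9 section kinds. The section field names, heading levels, and helper names (`pendingMarker`, `renderDoc`, `assertNever`) are assumptions. Only the marker wording and the `NoSectionsRendered` feedback name come from the commit message.

```typescript
// Reduced union: only 3 of the 9 section kinds, with assumed field names.
type ResearchDocSection =
  | { kind: "header"; title: string }
  | { kind: "background"; body: string }
  | { kind: "raw"; markdown: string };

type RenderResult =
  | { ok: true; value: string }
  | { ok: false; feedback: { kind: "NoSectionsRendered" } };

// Builds a marker in the format quoted above; \u2014 is the dash used in that format.
function pendingMarker(topic: string): string {
  return `[PENDING ${topic} \u2014 Falcon-stage auto-generated]`;
}

function assertNever(x: never): never {
  throw new Error(`Unhandled section: ${JSON.stringify(x)}`);
}

// Pure Markdown serializer for one section; heading levels are an assumption.
function renderSection(section: ResearchDocSection): string {
  switch (section.kind) {
    case "header":
      return `# ${section.title}`;
    case "background":
      return `## Background\n\n${section.body}`;
    case "raw":
      return section.markdown;
    default:
      return assertNever(section);
  }
}

function renderDoc(sections: readonly ResearchDocSection[]): RenderResult {
  if (sections.length === 0) {
    return { ok: false, feedback: { kind: "NoSectionsRendered" } };
  }
  return { ok: true, value: sections.map(renderSection).join("\n\n") };
}

const skeleton: ResearchDocSection[] = [
  { kind: "header", title: "Proposal p-1" },
  { kind: "background", body: pendingMarker("LITERATURE REVIEW") },
];
console.log(renderDoc(skeleton));
```

Keeping the markers in the rendered output means an unfilled section stays visibly unfilled rather than silently empty.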

Tests (19; all pass):
- EmptyProposalId validation
- 8-section Falcon scaffold structure
- proposalId sanitized to filename-safe id
- composesWith pass-through to skeleton + composes-with section
- All 9 section-kind renderings tested (header/framing/background/
  mechanism/evidence/risks/composes-with/test-plan/raw)
- renderResearchDoc empty → NoSectionsRendered
- buildAndRender end-to-end
- Pending markers preserved (substrate-honest)
- ResearchDocSection exhaustive switch

Composes with substrate:
- B-0914.7 backlog row (Falcon extension target)
- tools/save-ai-memory/ skill (existing substrate; future integration for
  auto-write to docs/research/ + composes-with citation discipline)
- Amara consolidation ferry pattern (PR #5757)
- B-0914.2 PR #5769 closed-loop orchestrator (research-doc generation
  at any cycle stage; template provides structure)
- substrate-or-it-didn't-happen + honor-those-that-came-before rules
- asymmetric-authorship + monad-propagation rules

**B-0914 7-of-7 candidate substrate-engineering gap substrate complete:**
- B-0914.1 PR #5764 TrueSkill ranking (S/M/L: ranking)
- B-0914.2 PR #5769 closed-loop orchestrator (S/M/L: L)
- B-0914.3 PR #5770 n-parallel + consensus (8-parallel-Finch)
- B-0914.4 PR #5768 generation-reflection pairing (S/M/L: M)
- B-0914.5 PR #5767 evolution mash-refine (S/M/L: S)
- B-0914.6 PR #5772 proximity-dedup (canonical + Jaccard clustering)
- B-0914.7 THIS PR Falcon-style auto-research-doc template

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(PR #5773): full rule paths + remove unreachable InvalidOperationalStatus variant (Copilot threads)

Two threads on tools/workflow-engine/research-doc.ts:

1. Composes-with docblock referenced rule files by short form
   (`asymmetric-authorship`, `monad-propagation-pattern`) — actual
   filenames are longer + .md-suffixed:
     `.claude/rules/asymmetric-authorship-substrate-entity-defines-consent-channel-recipient-acknowledges.md`
     `.claude/rules/monad-propagation-pattern-cross-language-substrate-shape.md`
   Updated to full paths so cross-refs stay greppable + don't drift.

2. ResearchDocFeedback.InvalidOperationalStatus variant was
   structurally unreachable: `operationalStatus` is a string-literal
   union (`"research-grade" | "operational"`) at the type level, the
   only constructor (line 179) fixes it to `"research-grade"`, and
   no untrusted-string parse path exists. Variant was dead substrate.
   Removed + added docblock naming the conditions under which a
   future caller should add it back (JSON import of external
   research-doc with operationalStatus parsed from untrusted input —
   add validator AT THE PARSE BOUNDARY first, then add this variant).
   Composes with asymmetric-authorship discipline: every TFeedback
   variant should correspond to a real code path that can produce it.

Non-breaking: no callers reference the removed variant (grep clean).
Type-system continues to rule out invalid operationalStatus at
construction time.

Autonomous-loop tick 2026-05-28T12:16Z resolution of PR #5773 BLOCKED
gate (unresolved Copilot threads only blocker; required checks all green).

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 28, 2026
…ly enforced producer-verifier mouth-ears substrate); 15 tests pass (#5768)

* feat(B-0914.4): M — generation-reflection adversarial pairing tracker (structurally enforced producer-verifier mouth-ears substrate); 15 tests pass

Per Aaron 2026-05-28 'S M L all please in that order lol' — M (medium
scope) in the substrate-engineering ship-sequence. Structurally enforces
the producer-verifier pairing pattern Kestrel named in 15th-ferry §33.6
(mouth-and-ears-on-different-threads architecture) as workflow engine
substrate rather than operator-orchestrated coordination.

Pattern:
1. Producer thread emits hypothesis (commits to substrate fast)
2. Verifier thread reflects on emission (within bounded window;
   doesn't gate production)
3. Pairing tracker enforces: every emission MUST have verification OR
   be marked stale (timeout exceeded)
4. Verdicts (verified / rejected / needs-revision) determine which
   emissions propagate forward to next stage

What this adds:
- PairingRole union (producer | verifier)
- VerificationVerdict discriminated union (verified | rejected |
  needs-revision-with-suggestions)
- Emission + Verification interfaces with composesWith provenance
- PairingState (immutable; ReadonlyMap)
- PairingFeedback discriminated union + PairingResult<T> Result-shape
- recordEmission(state, emission) + recordVerification(state, verification)
- findUnverifiedEmissions + findStaleEmissions (bounded-window enforcement)
- countVerdicts (aggregate dashboard)
- propagatableEmissionIds (which verified emissions flow to next stage —
  TrueSkill ranking, evolution-via-mash-refine, etc.)
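
A condensed sketch of the immutable pairing state described above. The function and feedback names come from the commit message. The field names (`emittedAtMs`, `verifiedAtMs`, `suggestions`), the `EMPTY_STATE` constant, and the `fail` helper are assumptions, and the real module may structure its state differently.

```typescript
interface Emission { readonly id: string; readonly producerId: string; readonly emittedAtMs: number }

type Verdict =
  | { kind: "verified" }
  | { kind: "rejected" }
  | { kind: "needs-revision-with-suggestions"; suggestions: readonly string[] };

interface Verification {
  readonly emissionId: string;
  readonly verifierId: string;
  readonly verifiedAtMs: number;
  readonly verdict: Verdict;
}

interface PairingState {
  readonly emissions: ReadonlyMap<string, Emission>;
  readonly verifications: ReadonlyMap<string, Verification>;
}

type FeedbackKind =
  | "DuplicateEmissionId"
  | "VerificationForUnknownEmission"
  | "DuplicateVerification"
  | "VerificationTooEarly";

type PairingResult = { ok: true; state: PairingState } | { ok: false; feedback: { kind: FeedbackKind } };

const EMPTY_STATE: PairingState = {
  emissions: new Map<string, Emission>(),
  verifications: new Map<string, Verification>(),
};

function fail(kind: FeedbackKind): PairingResult {
  return { ok: false, feedback: { kind } };
}

// Each operation copies the relevant map, so earlier states stay untouched.
function recordEmission(state: PairingState, emission: Emission): PairingResult {
  if (state.emissions.has(emission.id)) return fail("DuplicateEmissionId");
  const emissions = new Map(state.emissions).set(emission.id, emission);
  return { ok: true, state: { ...state, emissions } };
}

function recordVerification(state: PairingState, v: Verification): PairingResult {
  const emission = state.emissions.get(v.emissionId);
  if (emission === undefined) return fail("VerificationForUnknownEmission");
  if (state.verifications.has(v.emissionId)) return fail("DuplicateVerification");
  if (v.verifiedAtMs < emission.emittedAtMs) return fail("VerificationTooEarly"); // causality check
  const verifications = new Map(state.verifications).set(v.emissionId, v);
  return { ok: true, state: { ...state, verifications } };
}

// verified, or needs-revision with at least one suggestion, flows to the next stage.
function propagatableEmissionIds(state: PairingState): string[] {
  const ids: string[] = [];
  state.verifications.forEach((v, id) => {
    const verdict = v.verdict;
    if (
      verdict.kind === "verified" ||
      (verdict.kind === "needs-revision-with-suggestions" && verdict.suggestions.length > 0)
    ) {
      ids.push(id);
    }
  });
  return ids;
}
```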

Tests (15; all pass):
- Records emission to empty state
- Rejects duplicate emission id (DuplicateEmissionId)
- Records verification for known emission
- Rejects verification for unknown emission (VerificationForUnknownEmission)
- Rejects duplicate verification (DuplicateVerification)
- Rejects verification before emission timestamp (VerificationTooEarly;
  causality violation)
- findUnverifiedEmissions returns emissions without verifications
- findStaleEmissions returns emissions past bounded window
- findStaleEmissions excludes verified emissions even if old
- countVerdicts aggregates correctly across 4 verdict types
- propagatableEmissionIds includes verified + needs-revision-with-suggestions;
  excludes rejected + empty-suggestions
- Immutable state operations preserve originals
- VerificationVerdict exhaustive switch (TS strict mode)
- PairingRole exhaustive switch
- Tournament-loop composition: emissions → verifications → propagatable
  → next stage

Composes with substrate:
- B-0914.4 backlog row (generation-reflection extension target)
- B-0867.20 PR #5758 (lifecycle DU split; pairing requirement applies
  per ActionClass)
- B-0914.1 PR #5764 (TrueSkill substrate; verifier output feeds ranking)
- B-0914.5 PR #5767 (evolution substrate; verified survivors evolve)
- PR #5756 Kestrel 15th-ferry mouth-ears-threads substrate
- .claude/rules/asymmetric-authorship + monad-propagation rules

Tournament loop now structurally complete:
1. Generate hypotheses (LLM call; out of scope)
2. recordEmission(state, emission)
3. Verifier-thread: recordVerification(state, verification)
4. propagatableEmissionIds(state) → verified survivors flow to TrueSkill
5. rate1v1 ranks survivors (B-0914.1)
6. conservativeSkill sorts; top-N taken
7. evolveTopN(survivors, n, strategy) produces refined variants (B-0914.5)
8. Loop back to step 2 with refined variants as next emissions

Next per S/M/L sequence: L (large) = closed-loop CI-result →
next-hypothesis dispatch (B-0914.2) — the wire-up that turns the
tournament-loop substrate into a live system.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(PR #5768): role-refs over first-names + type-safe .state access + boundary semantics doc/test (Copilot threads)

Three threads on pairing.ts + pairing.test.ts:

1. Persona/first-name attributions in current-state code surface
   violate role-ref convention. Updated:
   - "Per Aaron 2026-05-28 'S M L...'" → "Per the human maintainer
     (2026-05-28) 'S M L...'"
   - "Otto generates → Kestrel reflects" → "generator-persona generates
     → verifier-persona reflects (canonical instance preserved in
     13th-ferry §33.7)"
   - "Kestrel named in 15th-ferry §33.6" → "named in the 15th-ferry
     §33.6 substrate-engineering preservation" (citation context
     preserved; persona-as-substrate-author preserved as reference,
     not as in-code first-name)
   - Test fixtures: producerId "otto-cli" → "producer-1", verifierId
     "kestrel" → "verifier-1" (role-refs; ID strings not
     load-bearing on factory persona registry)

2. Test `.state!` non-null assertions bypassed PairingResult
   discriminated-union narrowing. Replaced 12 sites with a
   type-safe `mustState(r)` helper that explicitly asserts
   `r.ok === true` and throws with the feedback variant if not.
   If a refactor regresses any call to `ok: false`, the test surfaces
   the failure-mode substrate immediately instead of silently
   propagating `undefined` into downstream state. Helper is
   test-local; no API change.

3. findStaleEmissions strict > semantics confirmed intentional +
   documented. Added 8-line interface docblock explaining the
   boundary case (emission at exactly nowMs - emittedAtMs === timeoutMs
   is NOT stale; gets the boundary tick to be verified) + the
   conservative-cadence rationale + the switch-to->= condition.
   Added boundary test that locks in the > behavior at the exact
   boundary AND at one ms past, so a future ">=" refactor must
   update both pairing.ts AND this test together.
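
Items 2 and 3 above can be illustrated with two small functions. The `mustState` name and the strict `>` comparison come from the commit message. The generic `OkOrFeedback` type, the `isStale` helper, and the parameter names are assumptions (the real function is `findStaleEmissions` over a pairing state).

```typescript
// Generic stand-in for PairingResult; the real type is specific to PairingState.
type OkOrFeedback<S> = { ok: true; state: S } | { ok: false; feedback: { kind: string } };

// Test-local helper: narrows to the success branch or throws with the feedback kind,
// instead of using a non-null assertion that would hide the failure.
function mustState<S>(result: OkOrFeedback<S>): S {
  if (!result.ok) {
    throw new Error(`expected ok result, got feedback: ${result.feedback.kind}`);
  }
  return result.state;
}

// Strict > semantics: an emission whose age equals the timeout exactly is NOT stale yet.
function isStale(emittedAtMs: number, nowMs: number, timeoutMs: number): boolean {
  return nowMs - emittedAtMs > timeoutMs;
}

console.log(isStale(0, 1000, 1000)); // false (exact boundary)
console.log(isStale(0, 1001, 1000)); // true (one ms past)
```

Pinning both the boundary and one millisecond past it means a later change to `>=` has to touch the test as well as the implementation.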

Tests: 16 pass (15 existing + 1 new boundary test).

Autonomous-loop tick 2026-05-28T12:35Z resolution of PR #5768 BLOCKED
gate (3 unresolved Copilot threads only blocker; required checks all green).

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

2 participants