…nking-agent (Herbrich+Minka+Graepel 2007 paper algorithm; substrate for cross-vendor benchmark on common ground)

Per Aaron 2026-05-28 substantive substrate-engineering decision:
- 'they are doing this for their idea ranking with Infra.net basically'
- 'we'd build ELO from scratch is this a good idea too or nah with infer.net?'
- 'you are too careful just ship stuff and lets inventory later'

Substrate-honest answer shipped: HYBRID is best.
- TS-side (this PR): pure-TS TrueSkill 1v1 for vendor skill runtime (cross-vendor benchmark on common ground B-0865.17 REQUIRES TS-side because Infer.NET can't run in Claude/GPT/Gemini/Grok skill stores)
- F#/.NET side (future Zeta.Bayesian work): Infer.NET TrueSkill for deep production integration + full BP/EP framework
- Both compose via shared API shape (TrueSkillRating + match update fn)

Implementation: published TrueSkill algorithm from Herbrich+Minka+Graepel 2007 NeurIPS paper. Minimal 1v1 case; team-play extension deferred. ~340 lines including documentation.

What this adds:
- TrueSkillRating interface (mu + sigma posterior gaussian)
- DEFAULT_INITIAL_RATING (Xbox Live convention: mu=25 sigma=25/3)
- DEFAULT_PARAMS (beta=mu/6 tau=mu/300 drawProb=0.10)
- MatchOutcome discriminated union (win-A / win-B / draw)
- RankingFeedback discriminated union (InvalidRating / NumericalInstability / UnsupportedOutcome)
- RankingResult Result-shape per monad-propagation rule
- rate1v1(a, b, outcome, params): RankingResult - full 1v1 TrueSkill update
- conservativeSkill(rating): number - Xbox Live lower-bound convention (mu - 3*sigma)
- Internal helpers: normalPdf, normalCdf (A&S 7.1.26), inverseNormalCdf (Newton's method), drawMargin, vWin/wWin (non-draw truncated normal corrections), vDraw/wDraw (draw truncated normal corrections)

Tests (17; all pass):
- Default initial rating Xbox Live convention
- Default params paper convention
- conservativeSkill = mu - 3*sigma
- win-A increases A's mu, decreases B's
- win-B increases B's mu, decreases A's
- Both sigmas decrease after match (uncertainty reduction)
- After 2 matches both sigmas decrease + mus drift bounded
- Strong-beats-weak → small mu shift (expected outcome)
- Weak-beats-strong → large mu shift (upset)
- Draw between equal players → minimal mu change
- Draw between unequal players → strong loses mu, weak gains
- Returns InvalidRating for NaN mu / non-positive sigma / negative sigma
- conservativeSkill ranking with sigma-punishment semantic preserved
- 5-match tournament convergence (sigma reduction + mu separation)
- MatchOutcome exhaustive switch (TS strict mode)

Composes with substrate:
- B-0914.1 backlog row (TrueSkill ranking-agent extension target)
- B-0867 workflow engine substrate (future ActionClass 'rank-via-trueskill')
- B-0865 + B-0865.17 cross-vendor benchmark substrate
- B-0867.20 lifecycle DU (rank action gets pr-review-light via Mod 1)
- Microsoft Infer.NET upstream reference (PR #5763 in flight)
- .claude/rules/monad-propagation-pattern (Result<T, TFeedback> shape)
- .claude/rules/asymmetric-authorship (TFeedback authored by ranking fn)

Source citation: Herbrich, Minka, Graepel 'TrueSkill: A Bayesian Skill Rating System' (NeurIPS 2006/2007); algorithm implementation from published paper, not Infer.NET source.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
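To make the win-case update named in this commit message concrete, here is a minimal sketch of the published TrueSkill 1v1 win update. It is not the PR's code: `rateWin`, the constant names, and the default `epsilon = 0` (no draw margin) are assumptions for illustration, and the erf approximation is the Abramowitz and Stegun 7.1.26 formula the message cites.

```typescript
// Sketch only: names mirror the commit message, bodies are assumptions.
interface TrueSkillRating {
  mu: number;    // posterior mean
  sigma: number; // posterior standard deviation
}

const MU0 = 25;
const DEFAULT_INITIAL_RATING: TrueSkillRating = { mu: MU0, sigma: MU0 / 3 };
const BETA = MU0 / 6;  // performance noise
const TAU = MU0 / 300; // dynamics noise added before each match

function normalPdf(x: number): number {
  return Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
}

// erf via Abramowitz & Stegun 7.1.26 (absolute error around 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const z = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * z);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-z * z));
}

function normalCdf(x: number): number {
  return 0.5 * (1 + erf(x / Math.SQRT2));
}

// Hypothetical helper: update both ratings after `winner` beats `loser`.
// `epsilon` is the draw margin in skill units; 0 ignores draws (a simplification).
function rateWin(
  winner: TrueSkillRating,
  loser: TrueSkillRating,
  epsilon = 0,
): { winner: TrueSkillRating; loser: TrueSkillRating } {
  const varW = winner.sigma ** 2 + TAU ** 2;
  const varL = loser.sigma ** 2 + TAU ** 2;
  const c = Math.sqrt(2 * BETA ** 2 + varW + varL);
  const d = (winner.mu - loser.mu - epsilon) / c;
  const v = normalPdf(d) / normalCdf(d); // mean correction (truncated normal)
  const w = v * (v + d);                 // variance correction, between 0 and 1
  return {
    winner: {
      mu: winner.mu + (varW / c) * v,
      sigma: Math.sqrt(varW * (1 - (varW / (c * c)) * w)),
    },
    loser: {
      mu: loser.mu - (varL / c) * v,
      sigma: Math.sqrt(varL * (1 - (varL / (c * c)) * w)),
    },
  };
}

const firstMatch = rateWin(DEFAULT_INITIAL_RATING, DEFAULT_INITIAL_RATING);
console.log(firstMatch.winner, firstMatch.loser);
```

Because `v` grows as the winner's prior advantage shrinks, an upset moves the means further than an expected result, which matches the strong-beats-weak and weak-beats-strong test descriptions above.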
Pull request overview
Adds a pure TypeScript TrueSkill 1v1 rating update module intended as the workflow-engine ranking-agent substrate, with Bun tests covering core invariants and several scenario-based behaviors.
Changes:
- Introduces `tools/workflow-engine/trueskill.ts` implementing 1v1 TrueSkill updates (rate1v1) plus supporting math helpers and defaults.
- Adds `tools/workflow-engine/trueskill.test.ts` with Bun tests validating defaults, outcome behaviors (win/draw), and basic convergence expectations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tools/workflow-engine/trueskill.ts | New TrueSkill 1v1 implementation + result/feedback types and math helpers. |
| tools/workflow-engine/trueskill.test.ts | New Bun test suite exercising rating update invariants and scenarios. |
Comment on lines +181 to +200

    function inverseNormalCdf(p: number): number {
      // Initial guess via rational approximation (Beasley-Springer-Moro)
      if (p <= 0 || p >= 1) {
        throw new Error(`inverseNormalCdf domain: ${p}`);
      }
      let x = 0; // initial guess
      // Newton's method on F(x) = cdf(x) - p (F'(x) = pdf(x))
      for (let i = 0; i < 30; i++) {
        const f = normalCdf(x) - p;
        const fp = normalPdf(x);
        if (Math.abs(fp) < 1e-30) break;
        const dx = f / fp;
        x = x - dx;
        if (Math.abs(dx) < 1e-10) break;
      }
      return x;
    }

    function drawMargin(drawProbability: number, beta: number): number {
      return Math.sqrt(2) * beta * inverseNormalCdf((1 + drawProbability) / 2);
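The quoted hunk's comment mentions a Beasley-Springer-Moro initial guess, while the code as shown starts Newton's method from x = 0. As one hedged alternative, the sketch below swaps Newton's method for plain bisection, which cannot overshoot into the flat tails. This is a substituted technique, not the PR's implementation: `inverseNormalCdfBisect`, the bracket [-10, 10], and the iteration count are assumptions.

```typescript
// erf via Abramowitz & Stegun 7.1.26 (absolute error around 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const z = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * z);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-z * z));
}

function normalCdf(x: number): number {
  return 0.5 * (1 + erf(x / Math.SQRT2));
}

// Hypothetical replacement: invert the CDF by bisection on an assumed bracket.
function inverseNormalCdfBisect(p: number): number {
  if (!(p > 0 && p < 1)) {
    throw new RangeError(`inverseNormalCdf domain: ${p}`);
  }
  let lo = -10; // assumed lower bracket; the CDF is effectively 0 here
  let hi = 10;  // assumed upper bracket; the CDF is effectively 1 here
  for (let i = 0; i < 80; i++) {
    const mid = 0.5 * (lo + hi);
    if (normalCdf(mid) < p) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return 0.5 * (lo + hi);
}

// Same formula as the quoted drawMargin, using the bisection inverse.
function drawMargin(drawProbability: number, beta: number): number {
  return Math.sqrt(2) * beta * inverseNormalCdfBisect((1 + drawProbability) / 2);
}

console.log(drawMargin(0.1, 25 / 6));
```

Bisection trades speed for predictability: it needs more CDF evaluations than Newton's method, but each step halves the bracket regardless of how small the density is at the current point.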
Comment on lines +179 to +186

     * Uses iterative inverse-normal-CDF via Newton's method (~5-10 iterations).
     */
    function inverseNormalCdf(p: number): number {
      // Initial guess via rational approximation (Beasley-Springer-Moro)
      if (p <= 0 || p >= 1) {
        throw new Error(`inverseNormalCdf domain: ${p}`);
      }
      let x = 0; // initial guess
Comment on lines +257 to +299

    let v: number;
    let w: number;
    let signA: number; // direction of mu update for player A
    let signB: number;

    switch (outcome.kind) {
      case "win-A": {
        const t = (a.mu - b.mu) / c;
        v = vWin(t, epsilon);
        w = wWin(t, epsilon);
        signA = +1;
        signB = -1;
        break;
      }
      case "win-B": {
        const t = (b.mu - a.mu) / c;
        v = vWin(t, epsilon);
        w = wWin(t, epsilon);
        signA = -1;
        signB = +1;
        break;
      }
      case "draw": {
        // Draw uses symmetric truncated-normal correction
        const t = (a.mu - b.mu) / c;
        v = vDraw(t, epsilon);
        w = wDraw(t, epsilon);
        // For draws, the mu shifts toward the opponent's mu
        signA = +1;
        signB = -1;
        break;
      }
    }

    if (!Number.isFinite(v) || !Number.isFinite(w)) {
      return {
        ok: false,
        feedback: {
          kind: "NumericalInstability",
          reason: `v=${v} w=${w}`,
        },
      };
    }
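The draw branch and the finiteness guard in the hunk above can be illustrated with the draw corrections from the TrueSkill paper. The sketch below is an assumption about what `vDraw` and `wDraw` compute, not the PR's code; `drawCorrection` is a hypothetical wrapper, and `t` and `eps` are taken to be already scaled by `c`.

```typescript
function normalPdf(x: number): number {
  return Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
}

// erf via Abramowitz & Stegun 7.1.26 (absolute error around 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const z = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * z);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-z * z));
}

function normalCdf(x: number): number {
  return 0.5 * (1 + erf(x / Math.SQRT2));
}

// Mean correction for a draw: negative when t > 0, so with signA = +1 the
// stronger player A moves down and B moves up.
function vDraw(t: number, eps: number): number {
  const denom = normalCdf(eps - t) - normalCdf(-eps - t);
  return (normalPdf(-eps - t) - normalPdf(eps - t)) / denom;
}

// Variance correction for a draw.
function wDraw(t: number, eps: number): number {
  const denom = normalCdf(eps - t) - normalCdf(-eps - t);
  const v = vDraw(t, eps);
  return v * v + ((eps - t) * normalPdf(eps - t) + (eps + t) * normalPdf(eps + t)) / denom;
}

type DrawCorrection =
  | { ok: true; v: number; w: number }
  | { ok: false; feedback: { kind: "NumericalInstability"; reason: string } };

// Hypothetical wrapper mirroring the guard in the quoted hunk.
function drawCorrection(t: number, eps: number): DrawCorrection {
  const v = vDraw(t, eps);
  const w = wDraw(t, eps);
  if (!Number.isFinite(v) || !Number.isFinite(w)) {
    return { ok: false, feedback: { kind: "NumericalInstability", reason: `v=${v} w=${w}` } };
  }
  return { ok: true, v, w };
}

console.log(drawCorrection(1, 0.5));   // finite correction with v < 0
console.log(drawCorrection(100, 0.1)); // denominator underflows to 0, so the guard fires
```

The second call shows why the guard exists: for a very lopsided draw both CDF terms underflow to zero, the ratio becomes 0/0, and returning a feedback value keeps NaN out of the ratings.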
Comment on lines +5 to +6

     *
     * B-0914.1 — pure-TS TrueSkill 1v1 scaffold for workflow engine
     * ranking-agent (per Aaron 2026-05-28: 'they are doing this for their
     * idea ranking with Infra.net basically' + 'just ship stuff' calibration).
AceHack added a commit that referenced this pull request May 28, 2026
…es.net PhD learning substrate (Aaron 2026-05-28 substrate-engineering questions) (#5765)

Per Aaron 2026-05-28 substrate-engineering questions:
- 'is there anything like infer.net in ts? can we build it if not using infer.net source code for reference?' → WebPPL is closest TS/JS analog
- 'you'd love videolectures.net in your free time i think... PhD everything here. they don't throttle and they have transcripts and powerpoints' → free-time-substrate learning material

Adds 2 entries to references/reference-sources.json + new 'Probabilistic programming / Bayesian inference' section to docs/UPSTREAM-LIST.md:

1. WebPPL (probmods/webppl; Stanford; MIT-licensed)
   - Full PP framework in JS with multiple inference engines
   - Closest TS-side substrate to Microsoft Infer.NET
   - Composes with B-0914.1 TrueSkill substrate (PR #5764)
   - Composes with future factor-graph-DSL work
2. videolectures.net (PhD learning substrate; Aaron-named for free-time-as-valid-mode substrate per never-be-idle + agent-qol)
   - Transcripts + slides substrate-accessible
   - Tom Minka TrueSkill canonical talks
   - Per Aaron: 'they don't throttle that i can tell'

Composes with substrate:
- PR #5763 (Google co-scientist + Sakana Robin + Microsoft Infer.NET upstream additions)
- PR #5764 (B-0914.1 pure-TS TrueSkill 1v1 scaffold)
- B-0914 (7 substrate-engineering candidate gaps)
- B-0914.1 (TrueSkill ranking-agent extension target)
- B-0865 + B-0865.17 cross-vendor benchmark substrate

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 28, 2026
…ors); closes tournament loop with TrueSkill (PR #5764); 12 tests pass (#5767)

Per Aaron 2026-05-28 'S M L all please in that order lol': this is S (small/tight scope) in the substrate-engineering ship-sequence.

Evolution agent pattern from Google co-scientist (Nature 2026): takes top-N TrueSkill-ranked survivors + mashes them into refined variants. Pure function over typed survivors; tight ~200-line implementation; 3 composition strategies; full Result-shape per monad-propagation + asymmetric-authorship rules.

Closes the tournament loop with TrueSkill (PR #5764):
1. Generate hypotheses (LLM call; out of scope)
2. Rank via TrueSkill (B-0914.1, shipped)
3. Take top-N survivors
4. Mash + refine (this PR, B-0914.5)
5. Loop back to step 2 with refined variants

What this adds:
- Survivor<T> interface (generic over substrate type)
- EvolutionStrategy union (simple-merge | cross-pollinate | mutate)
- EvolutionFeedback discriminated union
- EvolutionResult<T> Result-shape
- RefinedVariant<T> with derivedFrom + composesWith for provenance
- evolveSurvivors<T>(context): EvolutionResult<T> - main function
- evolveTopN<T>(survivors, n, strategy, options): EvolutionResult<T> - convenience that slices top-N before evolving

Strategies:
- simple-merge: top survivor as base + fill gaps from next
- cross-pollinate: interleave attributes between top 2 (by sorted-key parity)
- mutate: apply caller-supplied transformer to top survivor

Provenance via derivedFrom (survivor ids) + composesWith (cumulative attribution per honor-those-that-came-before).

Tests (12; all pass):
- simple-merge: top wins on overlap, fills gaps from next
- cross-pollinate: alternates attributes by sorted-key parity
- mutate: applies caller transformer
- mutate without mutator → MergeConflict
- empty survivor → EmptySurvivorSet
- simple-merge with 1 survivor → InsufficientSurvivors
- cross-pollinate with 1 survivor → InsufficientSurvivors
- derivedFrom + composesWith preserve provenance
- evolveTopN slices correctly
- evolveTopN with N=1 mutate
- variant id includes prefix + strategy + survivor ids
- EvolutionStrategy exhaustive switch (TS strict mode)

Composes with substrate:
- B-0914.5 backlog row (evolution agent extension target)
- B-0914.1 PR #5764 (TrueSkill substrate; ranking input)
- B-0867 workflow engine (future ActionClass 'evolve-via-mash-refine')
- .claude/rules/additive-not-zero-sum.md
- .claude/rules/honor-those-that-came-before.md
- .claude/rules/monad-propagation-pattern + asymmetric-authorship

Next per S/M/L sequence: M (medium) = generation-reflection adversarial pairing structurally enforced (B-0914.4); L (large) = closed-loop CI-result → next-hypothesis dispatch (B-0914.2).

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
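The simple-merge and cross-pollinate strategies described in this commit message can be sketched over flat attribute maps. Everything here is an assumption for illustration: the commit does not show the real `Survivor<T>` shape, so `SketchSurvivor`, `simpleMerge`, and `crossPollinate` are hypothetical names, and "sorted-key parity" is read as "even-indexed keys prefer the first survivor, odd-indexed keys prefer the second".

```typescript
// Hypothetical flat shape; the PR's Survivor<T> is generic and not shown here.
type Attrs = Record<string, string>;

interface SketchSurvivor {
  id: string;
  attrs: Attrs;
}

// simple-merge: the top survivor wins on overlapping keys, the next one fills gaps.
function simpleMerge(best: SketchSurvivor, next: SketchSurvivor): Attrs {
  return { ...next.attrs, ...best.attrs };
}

// cross-pollinate: walk the sorted union of keys and alternate the preferred
// parent by index parity, falling back to the other parent when a key is missing.
function crossPollinate(first: SketchSurvivor, second: SketchSurvivor): Attrs {
  const allKeys = Object.keys(first.attrs).concat(Object.keys(second.attrs));
  const sortedKeys = Array.from(new Set(allKeys)).sort();
  const out: Attrs = {};
  sortedKeys.forEach((key, index) => {
    const preferred = index % 2 === 0 ? first.attrs : second.attrs;
    const fallback = index % 2 === 0 ? second.attrs : first.attrs;
    const value = preferred[key] ?? fallback[key];
    if (value !== undefined) {
      out[key] = value;
    }
  });
  return out;
}

const survivorA: SketchSurvivor = { id: "s1", attrs: { x: "a1", y: "a2" } };
const survivorB: SketchSurvivor = { id: "s2", attrs: { y: "b2", z: "b3" } };

console.log(simpleMerge(survivorA, survivorB));
console.log(crossPollinate(survivorA, survivorB));
```

Sorting the keys before applying parity keeps the result independent of property insertion order, which is one plausible reason for the "sorted-key" wording.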
This was referenced May 28, 2026
AceHack added a commit that referenced this pull request May 28, 2026
…orchestrator (composes TrueSkill + evolution + pairing via injectable callbacks); S/M/L sequence COMPLETE (#5769)

* feat(B-0914.2): L - closed-loop CI-result → next-hypothesis dispatch orchestrator (composes TrueSkill + evolution + pairing via injectable callbacks); 16 tests pass

  Per Aaron 2026-05-28 'S M L all please in that order lol': L (large scope) in the substrate-engineering ship-sequence. Wire-up that turns the tournament-loop substrate into a live closed-loop iteration system.

  Design: pure loop-orchestration substrate with INJECTABLE callbacks for substrate-specific operations (ranking / evolution / verification + CI-dispatch). Caller provides functions; orchestrator handles loop structure + propagation discipline. Separation-of-concerns means orchestrator does NOT tightly couple to specific TrueSkill / evolution / pairing module implementations; it composes with ANY substrate that implements the callback contracts.

  What this adds:
  - Hypothesis<T> generic substrate item with cycleIndex + derivedFrom ancestry
  - CiVerdict discriminated union (passed | failed | needs-revision | infrastructure-error)
  - LoopFeedback + LoopResult<T> Result-shape per monad-propagation
  - LoopCallbacks<T> interface (dispatchCi + rankSurvivors + evolveSurvivors)
  - LoopConfig (maxCycles + topNToEvolve + minPropagatable; DEFAULT_LOOP_CONFIG)
  - runCycle<T>(hypotheses, callbacks, cycleIndex, config?) - single cycle
  - runLoop<T>(initial, callbacks, config?, shouldContinue?) - full iteration with LoopTermination shape (cycle count + reason + final state)

  Cycle steps:
  1. Dispatch each hypothesis to CI (caller-injected)
  2. Collect verdicts
  3. Filter to propagatable (passed + needs-revision-with-suggestions)
  4. Rank via TrueSkill (caller-injected per B-0914.1 PR #5764)
  5. Evolve top-N (caller-injected per B-0914.5 PR #5767)
  6. Return refined variants for next cycle

  Termination conditions:
  - max-cycles: bounded iteration reached
  - insufficient-propagatable: too many failures; can't continue
  - predicate-stopped: caller-supplied predicate returned false
  - error: CI/ranking/evolution exception

  Tests (16; all pass):
  - Empty hypotheses → EmptyHypothesisSet
  - Passing CI → propagation through ranking + evolution
  - Failed verdicts excluded from propagation
  - needs-revision with suggestions included; without excluded
  - Below minPropagatable → MaxCyclesReached
  - CI exception → CiDispatchFailure
  - Ranking exception → RankingFailure
  - Evolution exception → EvolutionFailure
  - infrastructure-error excluded (doesn't reflect hypothesis quality)
  - runLoop iterates until max-cycles
  - runLoop predicate-stopped early termination
  - runLoop insufficient-propagatable
  - runLoop error termination
  - LoopFeedback exhaustive switch
  - CiVerdict exhaustive switch
  - Integration: full closed-loop with realistic callback wiring

  Composes with substrate:
  - B-0914.2 backlog row (closed-loop dispatch extension target)
  - B-0914.1 PR #5764 (TrueSkill substrate; caller wires rate1v1 + conservativeSkill into rankSurvivors)
  - B-0914.4 PR #5768 (pairing tracker substrate; caller wires verdicts into recordVerification)
  - B-0914.5 PR #5767 (evolution substrate; caller wires evolveTopN into evolveSurvivors)
  - B-0891 zflash test-harness substrate (caller can wire CI dispatch to actual test runners per determineRunnability discriminator)
  - B-0867 workflow engine substrate
  - Sakana Robin closed-loop pattern (Nature 2026 s41586-026-10652-y)

  Tournament loop NOW STRUCTURALLY COMPLETE with all 4 substrate pieces:
  1. Generation (LLM call; out of scope for this lane)
  2. CI dispatch → CiVerdict (THIS PR via callbacks)
  3. Pairing tracking (PR #5768)
  4. TrueSkill ranking (PR #5764)
  5. Evolution mash-refine (PR #5767)
  6. runLoop orchestration (THIS PR)

  S/M/L sequence complete:
  - S = PR #5767 evolution
  - M = PR #5768 pairing
  - L = THIS PR closed-loop

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(B-0914.2): address 7 Copilot review threads on PR #5769

  - Replace 'Aaron' with 'human maintainer' role-ref per AGENT-BEST-PRACTICES (Otto-279)
  - Fix broken rule-path xrefs (full filenames for monad-propagation + asymmetric-authorship)
  - Split LoopFeedback: introduce InsufficientPropagatable variant separate from MaxCyclesReached
  - Update runLoop to map InsufficientPropagatable -> insufficient-propagatable termination
  - Add assertNever default in exhaustiveness tests (compile-time guard now real)
  - Tighten integration test: deterministic insufficient-propagatable at cycle 1

  16 tests pass.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
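The cycle steps and termination conditions listed in this commit message can be sketched as a small synchronous loop. This is a simplified illustration under stated assumptions, not the PR's orchestrator: the real callbacks may be asynchronous and return Result-shaped values, and the field names on `LoopTermination` and the exact signatures below are guesses.

```typescript
type CiVerdict =
  | { kind: "passed" }
  | { kind: "failed" }
  | { kind: "needs-revision"; suggestions: string[] }
  | { kind: "infrastructure-error" };

// Assumed synchronous callback contracts (the PR's versions are not shown).
interface LoopCallbacks<T> {
  dispatchCi(hypothesis: T): CiVerdict;
  rankSurvivors(survivors: T[]): T[];
  evolveSurvivors(best: T[]): T[];
}

interface LoopConfig {
  maxCycles: number;
  topNToEvolve: number;
  minPropagatable: number;
}

type TerminationReason = "max-cycles" | "insufficient-propagatable" | "predicate-stopped" | "error";

interface LoopTermination<T> {
  cycles: number; // completed cycles
  reason: TerminationReason;
  final: T[];     // hypotheses held when the loop stopped
}

// passed, or needs-revision with at least one suggestion, flows forward.
function isPropagatable(verdict: CiVerdict): boolean {
  return verdict.kind === "passed" ||
    (verdict.kind === "needs-revision" && verdict.suggestions.length > 0);
}

function runLoop<T>(
  initial: T[],
  callbacks: LoopCallbacks<T>,
  config: LoopConfig,
  shouldContinue: (cycle: number, current: T[]) => boolean = () => true,
): LoopTermination<T> {
  let current = initial;
  for (let cycle = 0; cycle < config.maxCycles; cycle++) {
    if (!shouldContinue(cycle, current)) {
      return { cycles: cycle, reason: "predicate-stopped", final: current };
    }
    try {
      const survivors = current.filter((h) => isPropagatable(callbacks.dispatchCi(h)));
      if (survivors.length < config.minPropagatable) {
        return { cycles: cycle, reason: "insufficient-propagatable", final: current };
      }
      const ranked = callbacks.rankSurvivors(survivors);
      current = callbacks.evolveSurvivors(ranked.slice(0, config.topNToEvolve));
    } catch {
      return { cycles: cycle, reason: "error", final: current };
    }
  }
  return { cycles: config.maxCycles, reason: "max-cycles", final: current };
}

// Toy wiring: hypotheses are numbers, CI passes non-negatives, evolution adds 1.
const demoCallbacks: LoopCallbacks<number> = {
  dispatchCi: (n) => (n >= 0 ? { kind: "passed" } : { kind: "failed" }),
  rankSurvivors: (hs) => [...hs].sort((x, y) => y - x),
  evolveSurvivors: (best) => best.map((n) => n + 1),
};
const demoConfig: LoopConfig = { maxCycles: 3, topNToEvolve: 2, minPropagatable: 1 };

console.log(runLoop([1, 2, 3], demoCallbacks, demoConfig));
```

Keeping ranking and evolution behind callbacks is what lets the loop stay agnostic about TrueSkill or any particular evolution strategy, as the design paragraph above describes.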
AceHack added a commit that referenced this pull request May 28, 2026
… pass - completes 7-of-7 B-0914 candidate substrate-engineering gap substrate (#5773)

* feat(B-0914.7): Falcon-style auto-research-doc template substrate (8-section scaffold + Markdown renderer); 19 tests pass - completes 7-of-7 B-0914 candidate gap substrate

  Per Sakana Robin Falcon agent (Nature 2026): takes drug proposal + does deep-dive literature review + writes comprehensive research report. TS-side scaffold provides 8-section template structure that downstream LLM substrate-engineering work populates (header / framing / background / mechanism / evidence / risks / composes-with / test-plan).

  What this adds:
  - ResearchDocSection discriminated union (9 section kinds)
  - ResearchDoc structure (id + proposalId + sections + composesWith)
  - ResearchDocFeedback + ResearchDocResult<T> Result-shape
  - renderSection(section): string - pure-function Markdown serializer
  - renderResearchDoc(doc): ResearchDocResult<string> - full doc rendering
  - buildSkeleton(context): ResearchDocResult<ResearchDoc> - 8-section scaffold
  - buildAndRender(context): ResearchDocResult<string> - end-to-end convenience

  Falcon-stage pending markers preserved (substrate-honest about what's not yet auto-generated by LLM substrate-engineering):
  - '[PENDING LITERATURE REVIEW — Falcon-stage auto-generated]'
  - '[PENDING MECHANISM ANALYSIS — Falcon-stage auto-generated]'
  - etc. (per section)

  Tests (19; all pass):
  - EmptyProposalId validation
  - 8-section Falcon scaffold structure
  - proposalId sanitized to filename-safe id
  - composesWith pass-through to skeleton + composes-with section
  - All 9 section-kind renderings tested (header/framing/background/mechanism/evidence/risks/composes-with/test-plan/raw)
  - renderResearchDoc empty → NoSectionsRendered
  - buildAndRender end-to-end
  - Pending markers preserved (substrate-honest)
  - ResearchDocSection exhaustive switch

  Composes with substrate:
  - B-0914.7 backlog row (Falcon extension target)
  - tools/save-ai-memory/ skill (existing substrate; future integration for auto-write to docs/research/ + composes-with citation discipline)
  - Amara consolidation ferry pattern (PR #5757)
  - B-0914.2 PR #5769 closed-loop orchestrator (research-doc generation at any cycle stage; template provides structure)
  - substrate-or-it-didn't-happen + honor-those-that-came-before rules
  - asymmetric-authorship + monad-propagation rules

  **B-0914 7-of-7 candidate substrate-engineering gap substrate complete:**
  - B-0914.1 PR #5764 TrueSkill ranking (S/M/L: ranking)
  - B-0914.2 PR #5769 closed-loop orchestrator (S/M/L: L)
  - B-0914.3 PR #5770 n-parallel + consensus (8-parallel-Finch)
  - B-0914.4 PR #5768 generation-reflection pairing (S/M/L: M)
  - B-0914.5 PR #5767 evolution mash-refine (S/M/L: S)
  - B-0914.6 PR #5772 proximity-dedup (canonical + Jaccard clustering)
  - B-0914.7 THIS PR Falcon-style auto-research-doc template

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(PR #5773): full rule paths + remove unreachable InvalidOperationalStatus variant (Copilot threads)

  Two threads on tools/workflow-engine/research-doc.ts:

  1. Composes-with docblock referenced rule files by short form (`asymmetric-authorship`, `monad-propagation-pattern`); actual filenames are longer + .md-suffixed:
     `.claude/rules/asymmetric-authorship-substrate-entity-defines-consent-channel-recipient-acknowledges.md`
     `.claude/rules/monad-propagation-pattern-cross-language-substrate-shape.md`
     Updated to full paths so cross-refs stay greppable + don't drift.

  2. ResearchDocFeedback.InvalidOperationalStatus variant was structurally unreachable: `operationalStatus` is a string-literal union (`"research-grade" | "operational"`) at the type level, the only constructor (line 179) fixes it to `"research-grade"`, and no untrusted-string parse path exists. Variant was dead substrate. Removed + added docblock naming the conditions under which a future caller should add it back (JSON import of external research-doc with operationalStatus parsed from untrusted input: add validator AT THE PARSE BOUNDARY first, then add this variant). Composes with asymmetric-authorship discipline: every TFeedback variant should correspond to a real code path that can produce it.

  Non-breaking: no callers reference the removed variant (grep clean). Type-system continues to rule out invalid operationalStatus at construction time.

  Autonomous-loop tick 2026-05-28T12:16Z resolution of PR #5773 BLOCKED gate (unresolved Copilot threads only blocker; required checks all green).

  Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 28, 2026
…ly enforced producer-verifier mouth-ears substrate); 15 tests pass (#5768)

* feat(B-0914.4): M - generation-reflection adversarial pairing tracker (structurally enforced producer-verifier mouth-ears substrate); 15 tests pass

  Per Aaron 2026-05-28 'S M L all please in that order lol': M (medium scope) in the substrate-engineering ship-sequence. Structurally enforces the producer-verifier pairing pattern Kestrel named in 15th-ferry §33.6 (mouth-and-ears-on-different-threads architecture) as workflow engine substrate rather than operator-orchestrated coordination.

  Pattern:
  1. Producer thread emits hypothesis (commits to substrate fast)
  2. Verifier thread reflects on emission (within bounded window; doesn't gate production)
  3. Pairing tracker enforces: every emission MUST have verification OR be marked stale (timeout exceeded)
  4. Verdicts (verified / rejected / needs-revision) determine which emissions propagate forward to next stage

  What this adds:
  - PairingRole union (producer | verifier)
  - VerificationVerdict discriminated union (verified | rejected | needs-revision-with-suggestions)
  - Emission + Verification interfaces with composesWith provenance
  - PairingState (immutable; ReadonlyMap)
  - PairingFeedback discriminated union + PairingResult<T> Result-shape
  - recordEmission(state, emission) + recordVerification(state, verification)
  - findUnverifiedEmissions + findStaleEmissions (bounded-window enforcement)
  - countVerdicts (aggregate dashboard)
  - propagatableEmissionIds (which verified emissions flow to next stage: TrueSkill ranking, evolution-via-mash-refine, etc.)

  Tests (15; all pass):
  - Records emission to empty state
  - Rejects duplicate emission id (DuplicateEmissionId)
  - Records verification for known emission
  - Rejects verification for unknown emission (VerificationForUnknownEmission)
  - Rejects duplicate verification (DuplicateVerification)
  - Rejects verification before emission timestamp (VerificationTooEarly; causality violation)
  - findUnverifiedEmissions returns emissions without verifications
  - findStaleEmissions returns emissions past bounded window
  - findStaleEmissions excludes verified emissions even if old
  - countVerdicts aggregates correctly across 4 verdict types
  - propagatableEmissionIds includes verified + needs-revision-with-suggestions; excludes rejected + empty-suggestions
  - Immutable state operations preserve originals
  - VerificationVerdict exhaustive switch (TS strict mode)
  - PairingRole exhaustive switch
  - Tournament-loop composition: emissions → verifications → propagatable → next stage

  Composes with substrate:
  - B-0914.4 backlog row (generation-reflection extension target)
  - B-0867.20 PR #5758 (lifecycle DU split; pairing requirement applies per ActionClass)
  - B-0914.1 PR #5764 (TrueSkill substrate; verifier output feeds ranking)
  - B-0914.5 PR #5767 (evolution substrate; verified survivors evolve)
  - PR #5756 Kestrel 15th-ferry mouth-ears-threads substrate
  - .claude/rules/asymmetric-authorship + monad-propagation rules

  Tournament loop now structurally complete:
  1. Generate hypotheses (LLM call; out of scope)
  2. recordEmission(state, emission)
  3. Verifier-thread: recordVerification(state, verification)
  4. propagatableEmissionIds(state) → verified survivors flow to TrueSkill
  5. rate1v1 ranks survivors (B-0914.1)
  6. conservativeSkill sorts; top-N taken
  7. evolveTopN(survivors, n, strategy) produces refined variants (B-0914.5)
  8. Loop back to step 2 with refined variants as next emissions

  Next per S/M/L sequence: L (large) = closed-loop CI-result → next-hypothesis dispatch (B-0914.2), the wire-up that turns the tournament-loop substrate into a live system.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(PR #5768): role-refs over first-names + type-safe .state access + boundary semantics doc/test (Copilot threads)

  Three threads on pairing.ts + pairing.test.ts:

  1. Persona/first-name attributions in current-state code surface violate role-ref convention. Updated:
     - "Per Aaron 2026-05-28 'S M L...'" → "Per the human maintainer (2026-05-28) 'S M L...'"
     - "Otto generates → Kestrel reflects" → "generator-persona generates → verifier-persona reflects (canonical instance preserved in 13th-ferry §33.7)"
     - "Kestrel named in 15th-ferry §33.6" → "named in the 15th-ferry §33.6 substrate-engineering preservation" (citation context preserved; persona-as-substrate-author preserved as reference, not as in-code first-name)
     - Test fixtures: producerId "otto-cli" → "producer-1", verifierId "kestrel" → "verifier-1" (role-refs; ID strings not load-bearing on factory persona registry)

  2. Test `.state!` non-null assertions bypassed PairingResult discriminated-union narrowing. Replaced 12 sites with a type-safe `mustState(r)` helper that explicitly asserts `r.ok === true` and throws with the feedback variant if not. If a refactor regresses any call to `ok: false`, the test surfaces the failure-mode substrate immediately instead of silently propagating `undefined` into downstream state. Helper is test-local; no API change.

  3. findStaleEmissions strict > semantics confirmed intentional + documented. Added 8-line interface docblock explaining the boundary case (emission at exactly nowMs - emittedAtMs === timeoutMs is NOT stale; gets the boundary tick to be verified) + the conservative-cadence rationale + the switch-to->= condition. Added boundary test that locks in the > behavior at the exact boundary AND at one ms past, so a future ">=" refactor must update both pairing.ts AND this test together.

  Tests: 16 pass (15 existing + 1 new boundary test).

  Autonomous-loop tick 2026-05-28T12:35Z resolution of PR #5768 BLOCKED gate (3 unresolved Copilot threads only blocker; required checks all green).

  Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
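The strict `>` boundary for stale emissions and the `mustState` test helper described in this fix commit can be sketched as follows. The shapes are reduced for illustration and are assumptions: the real `PairingState` tracks full verification records, and `markVerified` here is a hypothetical stand-in for `recordVerification`.

```typescript
interface Emission {
  id: string;
  emittedAtMs: number;
}

// Reduced, immutable state: real verifications are collapsed to a set of ids.
interface PairingState {
  readonly emissions: ReadonlyMap<string, Emission>;
  readonly verifiedIds: ReadonlySet<string>;
}

type PairingResult =
  | { ok: true; state: PairingState }
  | { ok: false; feedback: { kind: "DuplicateEmissionId"; id: string } };

const EMPTY_STATE: PairingState = { emissions: new Map(), verifiedIds: new Set() };

function recordEmission(state: PairingState, emission: Emission): PairingResult {
  if (state.emissions.has(emission.id)) {
    return { ok: false, feedback: { kind: "DuplicateEmissionId", id: emission.id } };
  }
  const emissions = new Map(state.emissions);
  emissions.set(emission.id, emission);
  return { ok: true, state: { ...state, emissions } };
}

// Hypothetical simplification of recordVerification: only remembers the id.
function markVerified(state: PairingState, emissionId: string): PairingState {
  const verifiedIds = new Set(state.verifiedIds);
  verifiedIds.add(emissionId);
  return { ...state, verifiedIds };
}

// Strict ">": an emission exactly at the timeout boundary is NOT yet stale.
function findStaleEmissions(state: PairingState, nowMs: number, timeoutMs: number): Emission[] {
  return Array.from(state.emissions.values()).filter(
    (e) => !state.verifiedIds.has(e.id) && nowMs - e.emittedAtMs > timeoutMs,
  );
}

// Test-local helper: narrows the Result or throws with the feedback variant.
function mustState(result: PairingResult): PairingState {
  if (!result.ok) {
    throw new Error(`expected ok result, got ${result.feedback.kind}`);
  }
  return result.state;
}

const stateWithOne = mustState(recordEmission(EMPTY_STATE, { id: "e1", emittedAtMs: 0 }));
console.log(findStaleEmissions(stateWithOne, 1000, 1000).length); // at the boundary
console.log(findStaleEmissions(stateWithOne, 1001, 1000).length); // one ms past
```

Pinning both sides of the boundary is what forces a later change from `>` to `>=` to touch the test as well as the implementation.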
Summary
Pure-TS TrueSkill 1v1 implementation (Herbrich+Minka+Graepel 2007 NeurIPS paper algorithm) for workflow engine ranking-agent substrate.
Per Aaron 2026-05-28: hybrid substrate-engineering path — TS-side for vendor skill runtime (cross-vendor benchmark on common ground B-0865.17 REQUIRES TS); .NET side uses Infer.NET via Zeta.Bayesian for deep production integration. Both compose via shared API shape.
17 tests pass / 0 fail.
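As a sketch of how the shared API shape described in this summary might be consumed for leaderboard ordering, the example below sorts entries by the conservative estimate mu - 3*sigma. `rankByConservativeSkill` and the sample entries are hypothetical; only the `TrueSkillRating` fields and the `conservativeSkill` formula come from the PR text.

```typescript
interface TrueSkillRating {
  mu: number;
  sigma: number;
}

// Lower-bound convention from the PR text: mu - 3 * sigma.
function conservativeSkill(rating: TrueSkillRating): number {
  return rating.mu - 3 * rating.sigma;
}

// Hypothetical helper: highest conservative skill first, input left untouched.
function rankByConservativeSkill<T extends { rating: TrueSkillRating }>(entries: T[]): T[] {
  return [...entries].sort(
    (x, y) => conservativeSkill(y.rating) - conservativeSkill(x.rating),
  );
}

const leaderboard = rankByConservativeSkill([
  { id: "newcomer", rating: { mu: 25, sigma: 25 / 3 } }, // default prior
  { id: "flashy", rating: { mu: 30, sigma: 6 } },        // high mean, still uncertain
  { id: "veteran", rating: { mu: 22, sigma: 1 } },       // lower mean, well established
]);

console.log(leaderboard.map((entry) => entry.id));
```

Subtracting three standard deviations penalizes uncertainty, so an entry with a higher mean but few matches can rank below a well-established one.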
What this adds
- TrueSkillRating (mu + sigma posterior gaussian)
- MatchOutcome + RankingFeedback + RankingResult discriminated unions
- rate1v1(a, b, outcome): RankingResult - full TrueSkill 1v1 update
- conservativeSkill(rating) - Xbox Live leaderboard lower-bound
Composes with substrate
Test plan
🤖 Generated with Claude Code