backlog: B-0174 cross-model tool-review convergence-rate replay [architectural-intent-emergence] #1306
Merged: AceHack merged 1 commit into `main` from `backlog/b-0174-cross-model-tool-review-convergence-replay-otto-2026-05-03` on May 3, 2026.
85 additions, 0 deletions: `...og/P3/B-0174-cross-model-tool-review-convergence-rate-replay-otto-2026-05-03.md`
---
id: B-0174
priority: P3
status: open
title: Cross-model tool-review convergence-rate replay protocol — measure how many review rounds different models need to settle on a tool-authoring PR (Otto 2026-05-03 sibling-instance of multi-harness convergence skill domain)
tier: research
effort: M
ask: Otto self-derived 2026-05-03 from v0.5 substrate-claim-checker review-cycle empirics + Aaron 2026-05-03 nudge to formalize as backlog row (autonomous-loop maintainer channel)
created: 2026-05-03
last_updated: 2026-05-03
depends_on: [B-0170]
composes_with: [B-0169, B-0173]
tags: [calibration, multi-harness, cross-model, frontier-ability, convergence, code-review, tool-authoring, p3-research]
---

# Cross-model tool-review convergence-rate replay protocol

## Origin

Otto, 2026-05-03, observed the v0.5 substrate-claim-checker review cycle (PR #1298) accumulate 19+ substantive findings across 5 rounds with a clear convergence trend (8→5→2→2→2, stabilizing at 2/round). Per Aaron's 2026-05-03 nudge, this is formalized as a backlog row rather than left as a memo observation.

This is a **sibling-instance** of the multi-harness alignment convergence future-skill-domain memo (`memory/feedback_multi_harness_alignment_convergence_design_future_skill_domain_aaron_2026_05_03.md`). The parent memo covered design-time convergence (run designs through N harnesses × M models). This row covers iteration-time convergence: how many review rounds does a model need to settle on a tool-authoring PR?

## What it measures

For a given AI model M and a tool-authoring task T:

1. Give M the initial draft of T (e.g., the v0.5 substrate-claim-checker `check-existence.ts`)
2. Run a fixed code-review prompt against the draft (using Otto's own reviewer style or a standard reviewer harness)
3. M produces a revised draft addressing the findings
4. Run the code-review prompt again
5. Repeat until 0 findings (convergence) or N rounds (timeout)
6. Record: per-round finding count, total rounds to converge, total findings, finding categories per round, code-quality delta (LOC, test coverage, etc.)
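
The steps above can be sketched as a harness-agnostic loop. This is a minimal sketch, not the protocol implementation: `Finding`, `review`, and `revise` are hypothetical names standing in for the real review-prompt invocation and the model's revision call.

```typescript
// One review finding from a single round (shape is illustrative).
type Finding = { category: string; detail: string };

// The per-run record from step 6 (category breakdown omitted for brevity).
interface ReplayRecord {
  perRoundFindingCount: number[]; // the convergence-rate signature
  totalRounds: number;
  totalFindings: number;
  converged: boolean;
}

function replay(
  draft: string,
  review: (draft: string) => Finding[],                  // steps 2/4: fixed review prompt
  revise: (draft: string, findings: Finding[]) => string, // step 3: model addresses findings
  maxRounds = 10,                                         // step 5: N-round timeout
): ReplayRecord {
  const perRoundFindingCount: number[] = [];
  let current = draft;
  for (let round = 1; round <= maxRounds; round++) {
    const findings = review(current);
    perRoundFindingCount.push(findings.length);
    if (findings.length === 0) {
      // step 5: convergence at 0 findings
      return {
        perRoundFindingCount,
        totalRounds: round,
        totalFindings: perRoundFindingCount.reduce((a, b) => a + b, 0),
        converged: true,
      };
    }
    current = revise(current, findings);
  }
  // step 5: timeout without reaching 0 findings
  return {
    perRoundFindingCount,
    totalRounds: maxRounds,
    totalFindings: perRoundFindingCount.reduce((a, b) => a + b, 0),
    converged: false,
  };
}
```

Injecting `review` and `revise` as functions keeps the loop identical across models and harnesses, which is what makes the runs directly comparable.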

The **convergence-rate signature** is `[findings_round_1, findings_round_2, ..., 0]` — a per-model fingerprint of code-authoring quality.
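
A signature can end three ways: at 0 findings (converged), at a repeating nonzero count (plateaued, like the Otto baseline), or still trending. A minimal classifier sketch; the `summarize` name and the 3-round plateau window are illustrative choices, not protocol-mandated.

```typescript
// Classify a convergence-rate signature. A run "converges" when the last
// round has 0 findings, and "plateaus" when the last `window` rounds all
// repeat the same nonzero count (the 8→5→2→2→2 pattern).
function summarize(signature: number[], window = 3): string {
  const last = signature[signature.length - 1];
  if (last === 0) return `converged in ${signature.length} rounds`;
  const tail = signature.slice(-window);
  if (tail.length === window && tail.every((n) => n === last)) {
    return `plateaued at ${last}/round after ${signature.length} rounds`;
  }
  return `not yet converged after ${signature.length} rounds`;
}
```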

## Comparison points

- **Otto baseline (this session)**: v0.5 `check-existence.ts`: 8→5→2→2→2 (5 rounds and counting; not yet converged). Categories: security, cross-platform, false-positive scope, regex edge cases, test discipline + extension whitelist
- **Other models**: GPT-5.x (Codex), Claude.ai (different harness), Gemini, Cursor's models. Each gets the same initial draft + same review prompt; observe their convergence rate

## Why this is research-grade calibration data

- **Reproducible** — same tool draft + same review prompt = directly comparable across models
- **Quantitative** — finding counts are measurable; convergence rounds are measurable
- **Frontier-ability signal** — models with a faster convergence rate produce higher-quality code per cycle
- **Composes with the calibration protocol** — Otto's in-the-moment guesses (B-0XXXX backlog rows) measure architectural-intent inference; this row measures code-implementation quality

## Acceptance criteria

This row closes when:

1. The protocol is documented (per step, with example fixtures): how to feed a tool draft to a model, how to run the review prompt, how to record the finding stream, how to detect convergence
2. At least 3 cross-model runs are executed against a fixture tool draft (e.g., the v0.5 initial `check-existence.ts`)
3. Results are published in `docs/research/cross-model-convergence-rate-replay/` with the per-model convergence signature + categorical breakdown
4. The protocol is referenced from the multi-harness convergence skill-domain memo as a worked-example seed

## Composes with

- **B-0170** (substrate-claim-checker tool) — depends_on; the v0.5 review cycle is the empirical seed
- **B-0169** (decision-archaeology skill) — composes_with; Otto's convergence signature is part of the decision-archaeology data for "how this tool came to be"
- **B-0173** (hook authoring) — composes_with; hooks could automate the review-prompt invocation
- `memory/feedback_multi_harness_alignment_convergence_design_future_skill_domain_aaron_2026_05_03.md` — parent skill domain
- `memory/feedback_guess_then_verify_architectural_intent_calibration_protocol_aaron_2026_05_03.md` — sibling protocol (architectural-intent inference vs code-implementation quality)
- `memory/architectural-intent-guesses/` — sibling calibration-data directory

## Why P3

This is research-grade frontier-ability measurement, not a production-blocking concern: P3 (low-priority research) per the project priority taxonomy. Promotion to P2 would be appropriate when:

- 3+ cross-model runs are scheduled to run as a routine (not one-shot)
- The convergence-signature data starts informing model-selection decisions for routine work
- Aaron names it as a recurring need rather than a research curiosity

## Effort sizing — M (medium)

- Designing the review prompt + fixtures: small
- Running the first cross-model batch: medium (depends on harness availability + per-model latency)
- Documenting the protocol + writing the per-model results: small-to-medium

Total: M. Bounded; not multi-month. Mostly composing existing infrastructure (harnesses + review prompt + tool fixtures).

## Carved sentence

**"Cross-model tool-review convergence-rate replay measures how many review rounds different AI models need to settle on a tool-authoring PR. Sibling-instance of the multi-harness alignment convergence future-skill-domain. Otto's v0.5 substrate-claim-checker review cycle (8→5→2→2→2, stabilizing at 2/round) is the empirical seed. Reproducible, quantitative, frontier-ability-revealing. Closes when the protocol is documented, 3+ cross-model runs are executed against a fixture, and results are published in `docs/research/cross-model-convergence-rate-replay/`."**