feat(B-0170.4): seed eval-set fixture for count-drift regression coverage#3611

Merged
AceHack merged 3 commits into main from otto-cli/b0170-4-eval-set-fixture-2026-05-15 on May 15, 2026

@AceHack AceHack commented May 15, 2026

Summary

Smallest safe slice of B-0170.4 (fixture-tests + eval-set coverage).

  • New tools/substrate-claim-checker/fixtures/ directory with one frozen historical drift instance, count-drift-9-vs-15.md, reproducing the count-drift pattern from PR #1259, "review(pr-1257-postmerge): verify-then-claim count drift (9→18+) frontmatter + body + MEMORY.md" (claim "9 drift instances" vs 15-row body table)
  • New fixtures.test.ts regression test asserting check-counts.ts still detects the empirical drift the fixture preserves
  • fixtures/README.md documents the index + the procedure for adding the next fixture (one sub-class per slice)
  • Top-level README points at the new eval-set surface

This is the empirical-axis complement to the synthetic-case unit tests in each check-*.test.ts: fixtures regress against the actual drift patterns that prompted the discipline, not just toy inputs.
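
As a rough illustration of the drift a count fixture preserves, the sketch below compares a "<number> <noun>" claim against the number of table rows in the same document. It is not the real check-counts.ts: the `CountFinding` shape, `countTableRows`, and `findCountDrift` are hypothetical names, and the noun is assumed to contain no regex metacharacters.

```typescript
// Hypothetical finding shape; the real check-counts.ts output may differ.
interface CountFinding {
  line: number; // 1-based line of the claim
  claimedCount: number;
  actualCount: number;
}

// Count data rows in pipe-delimited tables, skipping each table's
// header row and separator row.
function countTableRows(lines: string[]): number {
  let rows = 0;
  let indexInTable = 0; // position within the current run of table lines
  for (const line of lines) {
    if (/^\s*\|.*\|\s*$/.test(line)) {
      indexInTable += 1;
      if (indexInTable > 2) rows += 1;
    } else {
      indexInTable = 0;
    }
  }
  return rows;
}

// Report every "<number> <noun>" claim whose number differs from the row count.
// Assumption: `noun` contains no regex metacharacters.
function findCountDrift(text: string, noun: string): CountFinding[] {
  const lines = text.split("\n");
  const actualCount = countTableRows(lines);
  const claim = new RegExp("(\\d+)\\s+" + noun);
  const findings: CountFinding[] = [];
  lines.forEach((line, index) => {
    const match = claim.exec(line);
    if (match === null) return;
    const claimedCount = Number(match[1]);
    if (claimedCount !== actualCount) {
      findings.push({ line: index + 1, claimedCount, actualCount });
    }
  });
  return findings;
}

// Synthetic stand-in for the fixture: a claim of 9 above a 15-row table.
const tableRows: string[] = [];
for (let i = 1; i <= 15; i++) tableRows.push(`| ${i} | example |`);
const sample = ["This memo lists 9 drift instances.", "", "| # | Instance |", "| --- | --- |"]
  .concat(tableRows)
  .join("\n");

console.log(findCountDrift(sample, "drift instances"));
```

The sketch sums rows across all tables, which is only adequate for a single-table document like this one.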

Test plan

  • bun test tools/substrate-claim-checker/fixtures.test.ts — 1 pass, 6 expect() calls, exit 0
  • bun test tools/substrate-claim-checker/ (full suite) — 113 pass, 0 fail, 250 expect() calls
  • bun tools/substrate-claim-checker/check-counts.ts tools/substrate-claim-checker/fixtures/count-drift-9-vs-15.md — 2 count-drift findings, exit 1 (drift surfaces as designed)
  • Branch verified before commit, tree size sanity-checked pre + post (52 root entries, no broken-commit canary)
  • Bus claim acquired: 72031688-2a2b-466d-a045-a5b76802d6df (otto-cli, B-0170.4)

Peer-work isolation

Avoided collision with in-flight branches:

  • otto-b0170-decompose-into-atomic-children-2026-05-15 (otto-desktop, parent B-0170 → B-0538-B-0541 children)
  • otto-cli/b0170-1-semantic-equiv-checker-2026-05-15 (B-0170.1)
  • otto-cli/b0170-3-self-recursive-checker-2026-05-15 (B-0170.3)

This slice touches only tools/substrate-claim-checker/fixtures* + README.md — purely additive.

🤖 Generated with Claude Code

…rage

Smallest safe slice of B-0170.4 (fixture-tests + eval-set coverage):

- New `tools/substrate-claim-checker/fixtures/` directory with one
  frozen historical drift instance — `count-drift-9-vs-15.md` reproducing
  the count-drift pattern from PR #1259 (claim "9 drift instances" vs
  15-row body table)
- New `fixtures.test.ts` regression test asserting `check-counts.ts`
  still detects the empirical drift the fixture preserves
- `fixtures/README.md` documents the index + the procedure for adding
  the next fixture (one sub-class per slice)
- Top-level README points at the new eval-set surface

Empirical axis complement to the synthetic-case unit tests in each
`check-*.test.ts`: fixtures regress against the actual drift patterns
that prompted the discipline, not just toy inputs.

Focused checks:
- `bun test tools/substrate-claim-checker/fixtures.test.ts` — 1 pass,
  6 expect() calls, exit 0
- `bun test tools/substrate-claim-checker/` (full suite) — 113 pass,
  0 fail, 250 expect() calls (negative-path stderr lines are intentional
  error-handling cases)
- `bun tools/substrate-claim-checker/check-counts.ts <fixture>` —
  2 count-drift findings, exit 1 (drift surfaces as designed)

Claim: 72031688-2a2b-466d-a045-a5b76802d6df (otto-cli, B-0170.4).

Peer work in flight (avoided collision):
- otto-desktop: parent B-0170 decompose branch (B-0538-B-0541 children)
- otto-cli: B-0170.1 (semantic-equivalence checker), B-0170.3
  (self-recursive checker)

operative-authorization: aaron 2026-05-14: "- **Devil-pole** (edge-runner drive): keep pushing, discover, go hard, never-be-idle"

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 15, 2026 22:55
@AceHack AceHack enabled auto-merge (squash) May 15, 2026 22:55
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 212ce34a27

Comment thread tools/substrate-claim-checker/fixtures.test.ts Outdated
Copilot AI left a comment

Pull request overview

Adds the first on-disk eval-set fixture for tools/substrate-claim-checker, seeding historical count-drift regression coverage for B-0170.4.

Changes:

  • Documents the new fixture surface from the checker README.
  • Adds a fixture README with indexing and contribution procedure.
  • Adds one count-drift markdown fixture plus a Bun regression test.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tools/substrate-claim-checker/README.md | Points users to the new fixture directory and fixture procedure. |
| tools/substrate-claim-checker/fixtures/README.md | Defines the eval-set fixture purpose, index, and add-fixture process. |
| tools/substrate-claim-checker/fixtures/count-drift-9-vs-15.md | Adds a frozen historical count-drift markdown fixture. |
| tools/substrate-claim-checker/fixtures.test.ts | Adds a Bun test that runs check-counts.ts against the fixture. |

Comment thread tools/substrate-claim-checker/fixtures.test.ts Outdated
AceHack and others added 2 commits May 15, 2026 19:03
… (PR #3611 thread)

Per chatgpt-codex-connector + copilot-pull-request-reviewer threads on
PR #3611: the original HTML provenance comment restated "9 drift
instances", producing TWO matching findings (one from the comment,
one from the body). The fixtures.test assertion (length >= 1,
findings[0]) could be satisfied by the comment alone, masking
regressions in body-claim detection.

Reword the comment to describe the scenario abstractly + add a NOTE
section explaining why the exact <number> <noun> pair is omitted.
Body claim "9 drift instances" + 15-row table preserved unchanged.

Co-Authored-By: Claude <noreply@anthropic.com>
…3611 thread)

Per chatgpt-codex-connector + copilot-pull-request-reviewer threads on
PR #3611: replace `>= 1` with exact `=== 1` and pin the finding line
to the body claim (line 24 after the rephrased HTML comment). A
regression that stops detecting the body claim cannot now be masked
by an HTML-comment match — the assertion forces exactly the intended
finding.

Composes with the sibling fixture-rephrase commit that removes the
spurious comment match in the first place.

Co-Authored-By: Claude <noreply@anthropic.com>
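
The masking problem described in the two commits above can be sketched as follows. This is an illustration only: the `Finding` shape, both helper names, and the comment line number 3 are assumptions, not the code in fixtures.test.ts.

```typescript
// Hypothetical finding shape used only for this illustration.
interface Finding {
  line: number;
  claimedCount: number;
  actualCount: number;
}

// Loose check in the spirit of the original assertion: any first finding
// with the right counts satisfies it, wherever it was found.
function looseCheck(findings: Finding[]): boolean {
  const first = findings[0];
  return findings.length >= 1 && first !== undefined && first.claimedCount === 9;
}

// Strict check: exactly one finding, pinned to the body-claim line.
function strictCheck(findings: Finding[], bodyLine: number): boolean {
  const first = findings[0];
  return findings.length === 1 && first !== undefined && first.line === bodyLine;
}

// Line 3 is an assumed position for the HTML provenance comment.
const commentMatch: Finding = { line: 3, claimedCount: 9, actualCount: 15 };
const bodyMatch: Finding = { line: 24, claimedCount: 9, actualCount: 15 };

// Simulated regression: the detector misses the body claim and only
// reports the restatement inside the comment.
const regressed: Finding[] = [commentMatch];

console.log(looseCheck(regressed)); // true: the regression is masked
console.log(strictCheck(regressed, 24)); // false: the regression is caught
console.log(strictCheck([bodyMatch], 24)); // true: the intended finding passes
```

Rewording the comment and tightening the assertion are complementary: the first removes the spurious match, the second makes sure a future spurious match cannot pass unnoticed.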
Copilot AI review requested due to automatic review settings May 15, 2026 23:03
@AceHack AceHack merged commit f92bbd2 into main May 15, 2026
27 of 30 checks passed
@AceHack AceHack deleted the otto-cli/b0170-4-eval-set-fixture-2026-05-15 branch May 15, 2026 23:07
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment on lines +18 to +19
past `fixtures.test.ts` (per PR #3611 review threads from
chatgpt-codex-connector + copilot-pull-request-reviewer).
Comment on lines +29 to +35
expect(result.findings.length).toBe(1);
const finding = result.findings[0]!;
expect(finding.line).toBe(24);
expect(finding.claimedCount).toBe(9);
expect(finding.actualCount).toBe(15);
expect(finding.claim).toContain("drift instances");
expect(finding.claimIsMinimum).toBe(false);
AceHack added a commit that referenced this pull request May 16, 2026
Smallest safe slice of B-0170.4 (fixture-tests + eval-set coverage).
Extends PR #3611's count-drift seed to the existence-drift sub-class
— the second of the 5 shipped check-types now has empirical-axis
regression coverage.

- New `tools/substrate-claim-checker/fixtures/existence-drift-missing-doc.md`
  fixture modeling the verify-then-claim memo's body table instance #8
  (PR #1252 — future-domain memo referenced a docs/ markdown file that
  didn't actually exist). Uses a clearly synthetic path so the fixture
  stays stable across substrate evolution.
- New `fixtures.test.ts` describe block asserting `check-existence.ts`
  emits exactly one drift finding at line 24 with severity "drift".
- `fixtures/README.md` index gains the new fixture row.

Discipline carried forward from PR #3611 review threads
(chatgpt-codex-connector + copilot-pull-request-reviewer): the HTML
provenance comment intentionally does NOT backtick-quote the exact
fixture path. Restating the claim inside the comment would let
regressions in body-claim detection slip past the test via an
HTML-comment match. The test asserts exact finding count + pins the
body line as the catch.
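
A minimal sketch of the existence-drift idea, assuming the checker looks for backtick-quoted markdown paths and asks whether each one exists. The function name, the finding shape, and the injected `exists` predicate are hypothetical; the real check-existence.ts may work differently.

```typescript
// Hypothetical finding shape; only `severity: "drift"` is taken from the text above.
interface ExistenceFinding {
  line: number; // 1-based line of the reference
  path: string;
  severity: "drift";
}

// Scan for backtick-quoted *.md paths and report the ones `exists` rejects.
// `exists` is injected so the sketch does not touch the real file system.
function findMissingPaths(text: string, exists: (path: string) => boolean): ExistenceFinding[] {
  const findings: ExistenceFinding[] = [];
  const pattern = /`([^`\s]+\.md)`/g;
  text.split("\n").forEach((line, index) => {
    pattern.lastIndex = 0;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(line)) !== null) {
      const path = match[1];
      if (path !== undefined && !exists(path)) {
        findings.push({ line: index + 1, path, severity: "drift" });
      }
    }
  });
  return findings;
}

// Synthetic fixture text: the provenance comment describes the scenario
// without backtick-quoting the path, so only the body line can match.
const fixtureText = [
  "<!-- provenance: models a reference to a docs file that was never created -->",
  "",
  "See `docs/synthetic/never-created.md` for the design.",
  "The checker README lives at `tools/README.md`.",
].join("\n");

const knownPaths = new Set<string>(["tools/README.md"]);
console.log(findMissingPaths(fixtureText, (p) => knownPaths.has(p)));
```

Because this sketch scans every line, including HTML comments, a comment that quoted the same path would yield a second finding, which is the situation the provenance-comment discipline avoids.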

Focused checks:
- `bun tools/substrate-claim-checker/check-existence.ts <fixture>` —
  1 drift finding at line 24, severity "drift", exit 1
- `bun test tools/substrate-claim-checker/fixtures.test.ts` — 2 pass,
  12 expect() calls, exit 0
- `bun test tools/substrate-claim-checker/` (full suite) — 114 pass,
  0 fail, 256 expect() calls (negative-path stderr lines are
  intentional error-handling cases per PR #3611 convention)

Composes with:
- B-0170.4 done-criteria ("fixture-tests + eval-set coverage for all
  shipped + new check-types") — incremental progress, one sub-class
  per slice per the fixtures/README.md procedure
- B-0170 (parent row, decomposed)
- PR #3611 (count-drift seed; same scaffolding extended here)

Claim: 6c253d24-3ed0-4e89-8f3a-563b13f933cc (otto-cli, B-0170).

operative-authorization: aaron 2026-05-14: "- **Devil-pole** (edge-runner drive): keep pushing, discover, go hard, never-be-idle"

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 16, 2026
…rged PR #3614 (#3628)

* docs(rules): extend ID-allocation discipline with subdecimal-vs-top-level scheme distinction

The ID-allocation-discipline section covered WHEN to check (on-disk + in-flight)
but not WHICH scheme to use. Adds a "Subdecimal vs top-level scheme" subsection
distinguishing:

- B-NNNN.M (subdecimal) → child / slice of EXISTING parent row
- B-NNNN (new top-level) → new umbrella / standalone row

Empirically grounded by the 2026-05-15 collision: Otto on Desktop decomposed
B-0170 into new top-levels B-0538/B-0539/B-0540/B-0541, missing that PR #3611
had already landed B-0170.4 via subdecimal scheme + Otto-CLI's PR #3595
had claimed B-0539 for the Otto-BFT umbrella. Both Ottos converged on the
same decomposition; the scheme mismatch (top-level vs subdecimal) was the
symptom of not checking existing-parent's siblings first.

The new check command is tight: `find docs/backlog -name "B-NNNN.*.md"` +
`gh pr list --state all --search '"B-NNNN."'`. If siblings exist, use next
free subdecimal — not a new top-level.
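
The "next free subdecimal" step could look like the sketch below. `nextSubdecimalId` is a hypothetical helper, not something the rule file defines, and it assumes "next free" means highest existing child plus one rather than the first gap.

```typescript
// Given a parent row ID and the IDs already on disk or in flight,
// return the next child ID in the subdecimal scheme (B-NNNN.M).
// Assumption: "next free" is the highest existing direct child plus one.
function nextSubdecimalId(parent: string, existingIds: string[]): string {
  const prefix = parent + ".";
  let highest = 0;
  for (const id of existingIds) {
    if (!id.startsWith(prefix)) continue; // other rows, including new top-levels
    const suffix = id.slice(prefix.length);
    if (!/^\d+$/.test(suffix)) continue; // skip deeper levels such as B-0170.4.1
    highest = Math.max(highest, Number(suffix));
  }
  return prefix + String(highest + 1);
}

// Illustrative inputs; collecting them from the backlog directory and the
// PR list is left to the check commands quoted above.
const existing = ["B-0170.1", "B-0170.3", "B-0170.4", "B-0170.4.1", "B-0539"];
console.log(nextSubdecimalId("B-0170", existing)); // B-0170.5
```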

Composes with the existing ID-allocation section + refresh-before-decide
invariant + audit-first-then-decide discipline (PR #3583).

Co-Authored-By: Claude <noreply@anthropic.com>

* shard(tick): 2026-05-16T00:08Z — fix-PR #3626 for monad-terminology drift from merged PR #3614

First tick of 2026-05-16 UTC; fresh-session cold-boot from autonomous-loop.

Landed: PR #3626 (5 P1 review-thread fixes — monad-associativity terminology
+ dead xrefs in B-0543/B-0544 research substrate).

Operational notes: Lior process active during commit window
(lock-cleanup-race precondition); used borrow-on-existing pattern with
ls-tree canary on both PRs (this shard + #3626).

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(shard-0008z): markdownlint MD037 — wrap full cron expression in backticks

`<<autonomous-loop>>` followed by `* * * * *` parsed as emphasis markers
with spaces (MD037/no-space-in-emphasis at line 72). Wrap the entire cron
expression in backticks so the asterisks are inside the code span.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 16, 2026
* feat(B-0170.4): seed path-form-drift fixture + regression test

Adds the third eval-set fixture for B-0170.4, extending regression
coverage from {count, existence} to {count, existence, path-form}.
Same proven shape as PR #3611 (count) and PR #3624 (existence): a
small on-disk markdown file under fixtures/ plus a pinned-expectation
test in fixtures.test.ts.

The fixture references tools/substrate-claim-checker/check-counts.ts
as both a bare basename (`check-counts.ts`) and a fully-qualified
path. Both resolve to the same absolute file via check-path-forms.ts's
3-root strategy (fileDir / parentDir / repoRoot), so the drift is
deterministically detected without depending on synthetic files.
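
A sketch of how two textual forms can resolve to one file under a three-root lookup. It assumes parentDir means the parent of the referencing file's directory, uses an injected `exists` predicate instead of the file system, and all function names are hypothetical rather than taken from check-path-forms.ts.

```typescript
// Minimal POSIX-style normalisation: drops "." and resolves ".." segments.
function normalizePath(path: string): string {
  const out: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") out.pop();
    else out.push(segment);
  }
  return "/" + out.join("/");
}

// Try the reference against fileDir, then its parent, then the repo root.
function resolveReference(
  ref: string,
  fileDir: string,
  repoRoot: string,
  exists: (path: string) => boolean,
): string | null {
  const roots = [fileDir, fileDir + "/..", repoRoot];
  for (const root of roots) {
    const candidate = normalizePath(root + "/" + ref);
    if (exists(candidate)) return candidate;
  }
  return null;
}

// Return every resolved file that is referenced through more than one form.
function findPathFormDrift(
  refs: string[],
  fileDir: string,
  repoRoot: string,
  exists: (path: string) => boolean,
): string[] {
  const formsByTarget = new Map<string, Set<string>>();
  for (const ref of refs) {
    const target = resolveReference(ref, fileDir, repoRoot, exists);
    if (target === null) continue;
    const forms = formsByTarget.get(target) ?? new Set<string>();
    forms.add(ref);
    formsByTarget.set(target, forms);
  }
  const drifted: string[] = [];
  formsByTarget.forEach((forms, target) => {
    if (forms.size > 1) drifted.push(target);
  });
  return drifted;
}

// "/repo" is an illustrative root; the relative layout mirrors the commit message.
const onDisk = new Set<string>(["/repo/tools/substrate-claim-checker/check-counts.ts"]);
const refsInFixture = ["check-counts.ts", "tools/substrate-claim-checker/check-counts.ts"];
console.log(
  findPathFormDrift(refsInFixture, "/repo/tools/substrate-claim-checker/fixtures", "/repo", (p) => onDisk.has(p)),
);
```

Here the bare basename resolves through the parent directory and the qualified path through the repo root, so both land on the same absolute file.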

Per PR #3611 review-thread discipline (chatgpt-codex-connector +
copilot): pin exact finding count (1) AND exact body-claim line (28)
so a regression in body-claim detection cannot be silently masked by
an HTML-comment-side match. The provenance comment intentionally
avoids restating either path form.

Re-decomposition note: original B-0170 lists B-0170.1-.4 as children.
B-0170.1 (semantic-equivalence) has an in-flight branch already;
B-0170.2 / .3 introduce brand-new sub-class checkers (bigger slices).
Adding one more fixture under B-0170.4 is genuinely the smallest safe
slice — it extends the proven pattern, has no merge risk, and closes
one more line of the parent row's done-criteria (eval-set coverage).

Focused check: bun test tools/substrate-claim-checker/fixtures.test.ts
→ 3 pass, 0 fail, 17 expect() calls. Full suite: 115 pass, 0 fail.

operative-authorization: aaron 2026-05-14: "- **Devil-pole**
(edge-runner drive): keep pushing, discover, go hard, never-be-idle"

Co-Authored-By: Claude <noreply@anthropic.com>

* shard(tick): 2026-05-16T02:56Z — B-0170.4 path-form fixture slice (PR #3696)

Per-tick shard documenting the path-form-drift fixture slice landed
in PR #3696. Captures the re-decomp reasoning (B-0170.1 has in-flight
branch; B-0170.2/.3 are bigger slices; B-0170.4 fixture continuation
is smallest safe), the subdecimal-vs-top-level scheme discipline
observed (per ac9d9a4 rule), the focused-check outcome, and the
catch-43 cron sentinel re-arm at session start.

operative-authorization: aaron 2026-05-14: "- **Devil-pole**
(edge-runner drive): keep pushing, discover, go hard, never-be-idle"

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(B-0170.4): correct anchor citation to memo instance #15 / PR #1256

Per Copilot review threads on PR #3696: the path-form fixture's anchor
was cited as "taxonomy row 4" but path-form is actually instance #15
of the verify-then-claim memo's body table (PR #1256), and sub-class
#6 of the 7-class list. Corrects the README index + adds the historical
anchor comment in the test.

The current fixture remains a synthetic exemplar covering the sub-class;
instance #15's literal substance (adjacent ADR citations from PR #1256)
is queued as follow-on fixture B-0170.4.1 per the per-thread plan.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 16, 2026
…3749)

Adds the fourth eval-set fixture for substrate-claim-checker. The
fixture reproduces verify-then-claim memo instance #19 — YAML
frontmatter `description:` (and MEMORY.md index in the historical
case) claimed "9 drift instances" while the body table already held
15 rows. check-cross-surface's "any-table" semantics fire when zero
body tables match the claim.
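
The "any-table" semantics can be sketched as below: the frontmatter claim is accepted if at least one body table has exactly the claimed number of rows, and flagged otherwise. Everything here is an illustrative assumption (names, the single-line `description:` parsing, returning no finding when the body has no tables); only the field names in the finding mirror the pinned expectations listed in this commit.

```typescript
// Field names follow the pinned expectations; the shape itself is assumed.
interface CrossSurfaceFinding {
  field: "description";
  claimedCount: number;
  actualCounts: number[];
}

// Row count of each pipe-delimited table, excluding header and separator rows.
function tableRowCounts(lines: string[]): number[] {
  const counts: number[] = [];
  let run = 0;
  for (const line of lines.concat([""])) {
    if (/^\s*\|.*\|\s*$/.test(line)) {
      run += 1;
    } else {
      if (run >= 2) counts.push(run - 2);
      run = 0;
    }
  }
  return counts;
}

function checkCrossSurface(text: string): CrossSurfaceFinding | null {
  const lines = text.split("\n");
  const end = lines.indexOf("---", 1);
  if (lines[0] !== "---" || end === -1) return null; // no frontmatter
  let description = "";
  for (const line of lines.slice(1, end)) {
    if (line.startsWith("description:")) description = line.slice("description:".length);
  }
  const match = /(\d+)\s+\w+/.exec(description);
  if (match === null) return null; // no "<number> <noun>" claim
  const claimedCount = Number(match[1]);
  const actualCounts = tableRowCounts(lines.slice(end + 1));
  if (actualCounts.length === 0) return null; // assumption: nothing to compare against
  if (actualCounts.indexOf(claimedCount) !== -1) return null; // some table matches
  return { field: "description", claimedCount, actualCounts };
}

// Build a small document with one table of `rowCount` rows.
function buildDoc(description: string, rowCount: number): string {
  const lines = ["---", `description: ${description}`, "---", "", "| # | Instance |", "| --- | --- |"];
  for (let i = 1; i <= rowCount; i++) lines.push(`| ${i} | example |`);
  return lines.join("\n");
}

console.log(checkCrossSurface(buildDoc("Memo listing 9 drift instances", 15)));
```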

Pinned per PR #3611 discipline:
- exact finding count (1)
- field == "description"
- claimedCount == 9, claimIsMinimum == false
- actualCounts == [15]
- HTML comment intentionally avoids restating the `<number> <noun>`
  pair (mirrors existing fixtures for uniformity, even though the
  cross-surface checker only scans the frontmatter description)

Focused-check outcome:
- `bun test tools/substrate-claim-checker/fixtures.test.ts` → 4/4 pass
- `bun test tools/substrate-claim-checker/` → 116/116 pass
- CLI: `bun tools/substrate-claim-checker/check-cross-surface.ts <fixture>`
  exits 1 with "cross-surface count drift — frontmatter.description
  claims '9 drift instances' (expected == 9); body tables have [15] rows"

operative-authorization: aaron 2026-05-14: "- **Devil-pole** (edge-runner drive): keep pushing, discover, go hard, never-be-idle"

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 17, 2026
Adds the 5th eval-set fixture for the substrate-claim-checker, covering
the convention sub-class of the 7-class verify-then-claim taxonomy. The
fixture pair (current ADR + sibling predecessor ADR support file) makes
the broken half of the bidirectional ADR supersession convention
reproducible without depending on any real ADR pair in the repo.

Anchor: PR #2512 (the PR that shipped check-convention.ts) — synthetic
exemplar, same shape as the path-form-drift fixture's synthetic case.
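
The broken half of a bidirectional supersession can be illustrated with already-parsed ADR records. The `Adr` shape, its field names, the ADR IDs, and the reason string are all hypothetical; the actual markers and wording used by check-convention.ts are not shown in this thread.

```typescript
// Hypothetical parsed form of an ADR's supersession markers.
interface Adr {
  id: string;
  supersedes?: string; // ID of the predecessor this ADR replaces
  supersededBy?: string; // ID of the successor that replaces this ADR
}

// Report every "A supersedes B" where B lacks the reciprocal marker pointing back at A.
function findMissingReciprocals(adrs: Adr[]): string[] {
  const byId = new Map<string, Adr>();
  for (const adr of adrs) byId.set(adr.id, adr);
  const reasons: string[] = [];
  for (const adr of adrs) {
    if (adr.supersedes === undefined) continue;
    const predecessor = byId.get(adr.supersedes);
    if (predecessor === undefined) continue; // a missing file is an existence problem, not a convention one
    if (predecessor.supersededBy !== adr.id) {
      reasons.push(`${adr.id} supersedes ${predecessor.id}, but ${predecessor.id} has no reciprocal marker`);
    }
  }
  return reasons;
}

// Fixture-pair analogue: the current ADR declares the link, the predecessor does not.
const brokenPair: Adr[] = [{ id: "ADR-0002", supersedes: "ADR-0001" }, { id: "ADR-0001" }];
const healthyPair: Adr[] = [
  { id: "ADR-0002", supersedes: "ADR-0001" },
  { id: "ADR-0001", supersededBy: "ADR-0002" },
];

console.log(findMissingReciprocals(brokenPair));
console.log(findMissingReciprocals(healthyPair));
```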

Focused check outcomes:
- bun test tools/substrate-claim-checker/fixtures.test.ts → 5 pass / 0 fail
- bun test tools/substrate-claim-checker/ → 117 pass / 0 fail
- Direct CLI run reports 1 convention-drift finding on line 36 with the
  expected reciprocal-marker reason string

Composes with B-0170 (parent), B-0170.4 eval-set thread (PRs #3611,
#3624, #3696, #3749).

operative-authorization: aaron 2026-05-14: "- **Devil-pole** (edge-runner drive): keep pushing, discover, go hard, never-be-idle"

Co-authored-by: Claude <noreply@anthropic.com>