backlog: PR #1261 post-merge fixes (B-0172 plugin paths + B-0173 hook paths) #1262

Merged
AceHack merged 1 commit into main from
backlog/pr-1261-postmerge-canonical-plugin-hook-paths-aaron-2026-05-03
May 3, 2026

Conversation

@AceHack AceHack commented May 3, 2026

Summary

3 substantive Copilot post-merge findings on PR #1261 (already merged):

  1. B-0172 plugin location wrong: .claude/plugins/<name>/ → ~/.claude/plugins/cache/<plugin-name>/
  2. B-0172 manifest wrong: top-level plugin.json → .claude-plugin/plugin.json (Claude Code) / .codex-plugin/plugin.json (Codex)
  3. B-0173 hook path wrong: tools/git-hooks/ → tools/git/hooks/

All verified empirically against repo state + existing research docs (docs/research/codex-builtins-skills-vs-plugins-factory-integration-2026-04-24.md; ls tools/git/).

The 4th Copilot finding (B-0173 depends_on B-0170 not on main yet) resolves automatically when PR #1260 lands — B-0170 ships in that PR. False-positive on timing.

Pattern

These are verify-then-claim drift instances of the existence-drift sub-class: claimed locations/conventions without verifying canonical surfaces. Each would have been caught by the v1+ existence-check sub-class of substrate-claim-checker (B-0170).

Test plan

  • B-0172 plugin location updated to ~/.claude/plugins/cache/<plugin-name>/
  • B-0172 manifest path updated to .claude-plugin/plugin.json with Codex equivalent noted
  • B-0173 hook path updated to tools/git/hooks/ (2 occurrences)
  • CI green

🤖 Generated with Claude Code

… (B-0173) per repo conventions

3 substantive Copilot post-merge findings on PR #1261 (the 3
follow-up rows). Empirically verified each against repo state
+ existing docs:

1. **B-0172 plugin location wrong**:
   was: `.claude/plugins/<name>/`
   actual: `~/.claude/plugins/cache/<plugin-name>/` (per
   `docs/research/codex-builtins-skills-vs-plugins-factory-integration-2026-04-24.md`)

2. **B-0172 manifest path wrong**:
   was: top-level `plugin.json`
   actual: `.claude-plugin/plugin.json` (Claude Code) /
   `.codex-plugin/plugin.json` (Codex), per the same
   research doc

3. **B-0173 hook path wrong**:
   was: `tools/git-hooks/`
   actual: `tools/git/hooks/` (verified via `ls tools/git/`
   showing existing batch-resolve + push-with-retry scripts)

These are verify-then-claim drift instances of the
existence-drift sub-class: I claimed locations/conventions
without checking the canonical surfaces (existing research
docs + tools/ directory layout). Each fix would have been
caught by the v1+ existence-check sub-class of
substrate-claim-checker.
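
The existence check described above is not shown in this PR. As a rough illustration only, a minimal sketch might extract backticked, repo-relative paths from a backlog row and report the ones that do not exist. The names `extractPathClaims` and `findMissingPaths` are hypothetical, and the filesystem lookup is injected so the logic can be exercised without touching disk:

```typescript
// Hypothetical sketch of an existence check; not the shipped substrate-claim-checker API.

/** Collect backticked tokens that look like repo-relative paths (they contain a "/"). */
export function extractPathClaims(markdown: string): string[] {
  const claims: string[] = [];
  const backticked = /`([^`\s]+)`/g;
  let match: RegExpExecArray | null;
  while ((match = backticked.exec(markdown)) !== null) {
    const token = match[1] ?? "";
    // Placeholders such as <name> and home-relative paths cannot be verified against the repo.
    const checkable = token.includes("/") && !token.includes("<") && !token.startsWith("~");
    if (checkable && !claims.includes(token)) claims.push(token);
  }
  return claims;
}

/** Return every claimed path for which the injected `exists` predicate is false. */
export function findMissingPaths(
  markdown: string,
  exists: (path: string) => boolean,
): string[] {
  return extractPathClaims(markdown).filter((path) => !exists(path));
}
```

In a real run the predicate could be `existsSync` from `node:fs`; under that assumption a row citing `tools/git-hooks/` would be reported as missing while `tools/git/hooks/` would pass.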

The 4th Copilot finding (depends_on:[B-0170] but B-0170 not
on main yet) resolves automatically when PR #1260 lands —
B-0170 ships in that PR. False-positive on timing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 3, 2026 01:32
@AceHack AceHack enabled auto-merge (squash) May 3, 2026 01:32
@AceHack AceHack merged commit 1104764 into main May 3, 2026
24 checks passed
@AceHack AceHack deleted the backlog/pr-1261-postmerge-canonical-plugin-hook-paths-aaron-2026-05-03 branch May 3, 2026 01:33
Copilot AI left a comment

Pull request overview

This PR updates two backlog row documents to correct post-merge drift in the planned plugin-packaging and hook-authoring work. In the broader codebase, these per-row backlog files are the source-of-truth planning artifacts for future tooling and workflow work, so path and packaging details here need to stay precise.

Changes:

  • Corrects the Claude Code plugin location and manifest references in backlog row B-0172.
  • Adds a Codex manifest-path note to the B-0172 scope description.
  • Fixes the committed git hook path references in backlog row B-0173.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| docs/backlog/P2/B-0172-skill-domain-plugin-packaging-aaron-2026-05-03.md | Updates the plugin-packaging row with revised Claude/Codex path and manifest details. |
| docs/backlog/P1/B-0173-hook-authoring-for-skill-creation-contracts-aaron-2026-05-03.md | Corrects the documented committed hook locations for the future hook-authoring work. |

> *"look at packaking skill domains a plugins or other packagin so we can take advantage of hooks in harnesses"*

Claude Code supports plugins under `.claude/plugins/`. When a skill domain matures (per the future-skill-domain memos' promotion-trigger criteria — 3+ worked examples per skill candidate + 1+ judgment-disagreement per expert candidate), packaging the whole domain as a plugin lets it ship as one unit including its hooks.
Claude Code installs plugins under `~/.claude/plugins/cache/<plugin-name>/` (per `docs/research/codex-builtins-skills-vs-plugins-factory-integration-2026-04-24.md`). When a skill domain matures (per the future-skill-domain memos' promotion-trigger criteria — 3+ worked examples per skill candidate + 1+ judgment-disagreement per expert candidate), packaging the whole domain as a plugin lets it ship as one unit including its hooks.
## Scope (when promotion-trigger fires)

Per Claude Code plugin convention (`.claude/plugins/<name>/`):
Per Claude Code plugin convention (installed at `~/.claude/plugins/cache/<plugin-name>/`; the source bundle has the manifest at `.claude-plugin/plugin.json`):
- Tools under `tools/` (TS files per Aaron skill-design rule 2)
- References to OpenSpec capabilities the plugin contracts against (per B-0171)
2. Plugin manifest (`plugin.json` per Anthropic spec) with description + dependencies + capabilities
2. Codex equivalent uses `.codex-plugin/plugin.json` with richer fields (semver + interface block + URLs + category) per the cross-harness research at `docs/research/codex-builtins-skills-vs-plugins-factory-integration-2026-04-24.md`
AceHack added a commit that referenced this pull request May 3, 2026
…ess; #1262 merged; replace-all-isn't-comprehensive

V0 → V0.4.4 substrate-claim-checker: 9 review iterations + 23+
substrate-quality findings. v0.4.4 fixes CommonMark fence-close
strictness + remaining count-claim drift that v0.4.3's
replace_all missed. Recursive verify-then-claim catches its own
remediation drift. v1+ existence-check would catch the
"removed all X" → grep should return 0 class.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 3, 2026
…1260)

* tools(substrate-claim-checker): v0 ship — count-drift detection + B-0170 backlog row

Builds the v0 of `tools/substrate-claim-checker/` per the
verify-then-claim discipline mechanization path. After 19+ drift
instances across 9+ PRs in a single session despite naming the
discipline, manual discipline provably insufficient — mechanization
is the only path.

V0 scope: ONE sub-class — count drift.

- `tools/substrate-claim-checker/check-counts.ts` (~150 lines, single-purpose)
  - Scans narrative for "N <noun>" patterns where <noun> is one of
    drift instances / rows / items / procedure skills / experts /
    tools / sub-classes
  - Counts data rows in the nearest markdown table within 50 lines
  - Reports drift if claimed N differs from actual
  - Exit 0 on no drift; exit 1 on drift detected

- `tools/substrate-claim-checker/README.md`
  - Usage + v0 scope + known limitations + composes-with
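
For orientation, the table side of the v0 scope above (count data rows, then pick the nearest table within 50 lines of a claim) could be sketched roughly as follows. This is an illustrative reconstruction, not the contents of `check-counts.ts`: the `Table` shape mirrors the `{startLine, rowCount}` interface mentioned later in the log, while `nearestTable` and the exact patterns are assumptions.

```typescript
// Illustrative sketch only; the patterns and helper names are assumptions.
interface Table {
  startLine: number; // zero-based index of the header row
  rowCount: number; // data rows, excluding header and separator
}

const ROW = /^\|.*\|\s*$/;
const SEPARATOR = /^\|[\s\-:|]*-[\s\-:|]*\|\s*$/;

/** Find markdown tables (header + separator + data rows) and count their data rows. */
export function findTables(lines: string[]): Table[] {
  const tables: Table[] = [];
  let i = 0;
  while (i + 1 < lines.length) {
    if (ROW.test(lines[i] ?? "") && SEPARATOR.test(lines[i + 1] ?? "")) {
      let j = i + 2;
      while (j < lines.length && ROW.test(lines[j] ?? "")) j++;
      tables.push({ startLine: i, rowCount: j - (i + 2) });
      i = j;
    } else {
      i++;
    }
  }
  return tables;
}

/** Pick the table whose header is closest to the claim line, within `window` lines. */
export function nearestTable(claimLine: number, tables: Table[], window = 50): Table | undefined {
  let best: Table | undefined;
  for (const table of tables) {
    const distance = Math.abs(table.startLine - claimLine);
    if (distance > window) continue;
    if (best === undefined || distance < Math.abs(best.startLine - claimLine)) best = table;
  }
  return best;
}
```

Because the match is purely positional, a claim can pair with an unrelated table, which is the nearest-table limitation the README documents.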

Self-test: runs cleanly on the verify-then-claim memo (which
catalogues 15 drift instances + has 15 table rows = consistent).
Synthetic test caught "5 drift instances" claim vs 3-row table.
Cross-scan of memory/feedback_*.md surfaced 7 findings: ~3 real
(multi-harness experts/skills counts) + ~4 false positives
(rhetorical "100 rows" in narrative, nearest-table heuristic
limitations).

V0 limitations documented in README:
- Nearest-table heuristic (no noun-to-table matching yet)
- Rhetorical number false positives
- Markdown-table data rows only (lists not counted)

V1 path covers remaining 6 sub-classes (existence / semantic-
equivalence / empirical-output / convention / path-form /
self-recursive); plus pre-commit + commit-msg + CI hook integration.

Per Aaron's no-dynamic-commands rule (skill-design memo): TS file
under tools/, single-purpose, type-checked, re-runnable. Per
hub-satellite separation: tool is hub-shaped; per-invocation
outputs are satellite-shaped.

B-0170 backlog row filed with done-criteria, depends_on:[],
composes_with [B-0169 decision-archaeology], canonical mapping
of v0 (1 sub-class shipped) to v1+ (6 remaining).

This PR breaks the drift-fix-meta-cycle from the past several ticks
by shipping the actual mechanization the cycle was pointing toward.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* hygiene(tick-history): 2026-05-03T00:55Z — drift-fix meta-cycle broken; substrate-claim-checker v0 shipped

After 19+ drift instances + 6+ ticks of drift-fix-on-fix producing
new drift faster than fixes land, the path forward is shipping the
mechanization the cycle was pointing at. V0 of substrate-claim-checker
ships with count-drift sub-class coverage; eval-set + sub-class
taxonomy made authoring mechanical.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* review(pr-1260): substrate-claim-checker v0.1 — address 6 Copilot findings + 2 lint fails

Iterating v0 → v0.1 on the same branch per the verify-then-claim
discipline applied to itself: tool needs to be substrate-quality
substrate before it gates substrate quality.

Lint fixes:
- **tsc strict-null** (4 errors at lines 57, 59, 64, 102) —
  added `?? ""` fallbacks for `lines[i]` and `m[N]` access under
  `noUncheckedIndexedAccess`; explicit `if (numStr === undefined
  || noun === undefined) continue` guard
- **markdownlint MD032** in B-0170 — added blank line before
  v0-limitations list (lists need blanks-around per MD032)

Copilot findings (6):

1. **P1 fail-fast on missing file** — `checkFile()` previously
   returned [] silently, allowing exit 0 even when inputs were
   missing. Refactored: returns `{findings, ok}`; `main()` tracks
   inputErrors separately and exits 1 if any input was missing.

2. **P2 preserve `+` semantics** — `"20+ drift instances"` was
   treated identically to `"20"`. Added `claimIsMinimum` field
   to Claim; drift fires only when `actual < claimed` for
   minimum-claims (vs strict-equal for non-plus claims). Output
   format shows `>=` vs `==` operator.

3. **(duplicate of #1)** Same issue, same fix.

4. **Hyphenated forms not caught** — `"13-row table"` didn't
   match `\d+\s+noun`. Updated regex to `\d+\+?[\s-]+noun` so
   both `"13 rows"` and `"13-row"` match.

5. **Skip fenced code + tables** — `findClaims()` previously
   scanned every line including code blocks + table data rows.
   Added inFence toggle on ` ``` ` / `~~~` lines; skip lines
   starting with `|` (table rows).

6. **Drop unused Table.endLine** — interface simplified to
   `{startLine, rowCount}` only.
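
Findings 2 and 4 above can be read together as a small change to claim parsing. The sketch below is an assumption-laden illustration: the noun list comes from the v0 scope, `claimIsMinimum` is the field name given in the commit message, and the rest (function names, regex assembly) is hypothetical. Fence skipping from finding 5 is left out to keep it short.

```typescript
// Illustrative sketch; not the shipped findClaims implementation.
interface Claim {
  line: number; // zero-based line index
  claimed: number;
  noun: string;
  claimIsMinimum: boolean; // true for "20+" style claims
}

const NOUNS = ["drift instance", "row", "item", "procedure skill", "expert", "tool", "sub-class"];
// Equivalent to \d+\+?[\s-]+noun, with capture groups for the number and the "+".
const CLAIM = new RegExp(`(\\d+)(\\+?)[\\s-]+(${NOUNS.join("|")})`, "g");

/** Scan narrative lines for "N <noun>", "N+ <noun>" and "N-<noun>" claims. */
export function findClaims(lines: string[]): Claim[] {
  const claims: Claim[] = [];
  lines.forEach((text, line) => {
    if (text.startsWith("|")) return; // table rows are data, not narrative claims
    CLAIM.lastIndex = 0;
    let match: RegExpExecArray | null;
    while ((match = CLAIM.exec(text)) !== null) {
      const numStr = match[1];
      const noun = match[3];
      if (numStr === undefined || noun === undefined) continue;
      claims.push({ line, claimed: Number(numStr), noun, claimIsMinimum: match[2] === "+" });
    }
  });
  return claims;
}

/** Minimum claims drift only when actual < claimed; exact claims drift on any difference. */
export function isDrift(claim: Claim, actual: number): boolean {
  return claim.claimIsMinimum ? actual < claim.claimed : actual !== claim.claimed;
}
```

Under these assumptions, "20+ drift instances" against a 23-row table is consistent, while "13-row table" against 12 rows is reported.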

Self-verified v0.1:
- Missing file → exit 1 with error ✓
- Verify-then-claim memo (15 rows + "15 instances" claim) → no drift ✓
- tsc --noEmit passes against full repo tsconfig ✓

The 4 Copilot tsc fixes + 6 review findings are themselves
worked examples of the verify-then-claim discipline: each fix
is a count/semantic claim that needed empirical verification
before publishing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* hygiene(tick-history): 2026-05-03T01:02Z — substrate-claim-checker v0→v0.1 iteration; 6 findings + 2 lint addressed

V0 of the discipline-mechanizer hit 6 Copilot findings + 2
lint failures; v0.1 addresses all in same PR. Recursive
composition of verify-then-claim discipline through tool
review IS the worked example.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* review(pr-1260): v0.2 — findTables skips fenced code blocks

P2 finding on PR #1260: `findTables()` previously matched any
`|...|` + separator sequence as a real table without checking
fenced-code-block context. If a memo's narrative contained a
fenced markdown example like:

    ```markdown
    | # | example |
    |---|---|
    | 1 | a |
    ```

...the tool would treat it as a real table. When followed by an
actual table, the nearest-table heuristic would pick the FENCED
example over the real one — false drift report.

Fix: added `inFence` toggle to `findTables()` matching the same
fence-tracking discipline `findClaims()` already uses. Tables
inside fenced code blocks are now ignored.

Verified via synthetic test: a memo with a 3-row fenced example
table + a 5-row real table + claim "5 drift instances" now
correctly reports no drift (v0.1 would have flagged because it
picked the 3-row fenced table first).

This finding is itself a worked example of the verify-then-claim
discipline: I claimed `findClaims` and `findTables` had the same
fence-tracking discipline (in v0.1's docstring), but only
`findClaims` actually had it. Empirical verification before
publishing the claim would have caught this.

tsc --noEmit passes against full repo tsconfig.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* hygiene(tick-history): 2026-05-03T01:08Z — v0.2 fence-asymmetry fix; substrate-claim-checker becomes its own primary user

Asymmetric fence-tracking between findClaims (skip fences) and
findTables (didn't) IS the bug class. Verify-then-claim applied
recursively: claim about parallel-discipline-between-functions
needed empirical verification, not docstring assertion. v0 → v0.2
caught 10 substrate-quality findings on the discipline-mechanizer
itself — the tool's recursive self-application IS the empirical
evidence that mechanization is correct.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* review(pr-1260): v0.3 — separator regex + import.meta.main + B-0170 sub-class accuracy + indented-table v1 doc

4 Copilot findings on PR #1260 addressed:

1. **Separator regex too lax** — `^\|[\s\-:|]+\|\s*$` accepted
   `|   |` and `||||` as valid table separators. GFM requires
   at least one `-` per separator cell. Tightened regex to
   require at least one `-`: `^\|[\s\-:|]*-[\s\-:|]*\|\s*$`.

2. **process.exit(main()) unconditional** — script couldn't be
   imported for testing. Refactored: exported `main` + `findTables`
   + `findClaims` + `checkFile` + types; wrapped invocation in
   `if (import.meta.main) { process.exit(main()); }` per Bun
   convention. Other tools/ scripts use this pattern.

3. **B-0170 sub-class table mis-claim** — row "Frontmatter ↔
   body ↔ index count drift" said "v0 covers" but v0 only checks
   narrative-vs-nearby-table within a single document, not
   cross-surface narrative-to-narrative comparison. Reclassified
   as v1 work; explicitly named the 5 surfaces (frontmatter
   description / body table / section heading / carved sentence /
   MEMORY.md index entry) per the 0106Z shard's 5-surface finding.

4. **Indented tables not matched** — `findTables` regex `^\|`
   requires column-1 anchor. Tables inside nested lists or
   blockquotes aren't recognized. Documented as v1 limitation
   in README; v1 fix is `^\s*\|`. Not fixed in v0 to avoid
   broadening false-positive surface before adding scope-aware
   matching.
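
The tightened separator pattern from finding 1 can be exercised directly. One caveat worth noting: as written it requires a `-` somewhere in the row rather than in every cell, so `|---|   |` still passes. The second function below is a hypothetical stricter per-cell variant, not something the commit claims to ship.

```typescript
// Row-level pattern quoted in the commit message: at least one "-" anywhere in the row.
export const SEPARATOR_ROW = /^\|[\s\-:|]*-[\s\-:|]*\|\s*$/;

// Hypothetical stricter variant: every cell must be hyphens with optional
// leading/trailing colons, which is the shape GFM gives a delimiter-row cell.
export function isStrictSeparator(line: string): boolean {
  const trimmed = line.trim();
  if (trimmed.length < 2 || !trimmed.startsWith("|") || !trimmed.endsWith("|")) return false;
  const cells = trimmed.slice(1, -1).split("|");
  return cells.every((cell) => /^\s*:?-+:?\s*$/.test(cell));
}
```

Whether the per-cell form is worth adopting depends on how often malformed separators appear in the real corpus; the row-level form already rejects the `|   |` and `||||` cases named in the finding.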

tsc clean + self-test (verify-then-claim memo) reports no drift.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* hygiene(tick-history): 2026-05-03T01:11Z — v0.3 iteration; #1259 merged with 5 post-merge threads triaged

V0 → V0.3 substrate-claim-checker iteration through 4 Copilot
review passes; 14 substrate-quality findings catalogued; recursive
discipline-mechanization application is itself the primary teacher.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* review(pr-1260): v0.4 — CommonMark fence delimiter tracking + directory rejection

2 Copilot findings on v0.3:

1. **P2 fence delimiter length** — `inFence` toggle on any
   ` ``` ` or `~~~` line is wrong per CommonMark: a fence
   closes only when the closing delimiter is the SAME char
   AND at-least-equal length. So a 3-backtick fence containing
   a longer block of backticks shouldn't close on the inner
   line. Refactored both `findTables` and `findClaims` to
   track `fenceChar` + `fenceLen`; close only on matching
   char + length>=open.

2. **P2 directory input** — `existsSync` returns true for
   directories, then `readFileSync` throws with cryptic error.
   Added `statSync(filePath).isFile()` check; reject directories
   with explicit "not a regular file" error.

Self-tested:
- `bun tools/substrate-claim-checker/check-counts.ts tools/`
  → "error: not a regular file (directory or other): tools/"
  → exit 1 with explicit message
- Verify-then-claim memo → no count drift detected (regression
  test for fence-tracking + table-counting)
- tsc --noEmit clean

Both fixes are CommonMark-spec compliance + filesystem-input
robustness — the kind of edge case the eventual deployed-tool
will hit on real corpus.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* hygiene(tick-history): 2026-05-03T01:14Z — v0.4 CommonMark + directory; 5 review passes; v0.x mature for count-drift

V0 → V0.4 substrate-claim-checker iteration: 5 Copilot review
passes catching 16 substrate-quality findings. Edge-case
absorption (CommonMark fence delimiter, directory rejection)
is the substrate-quality-maturity path — recursive review IS
the eval-set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* review(pr-1260): v0.4.1 — file header version label refresh + readFileSync error wrap

5 Copilot findings on v0.4 — 3 already-resolved or false-positive,
2 substantive:

1. **(stale)** Tick shard 0108Z says "v0.1 → v0.2" while file
   header (then) said v0.1. Tick shards are append-only history;
   they accurately recorded the version-label-at-write-time. The
   header had been v0.1 BEFORE that tick; the shard correctly
   notes the v0.1 → v0.2 transition. No retroactive edit.

2. **(false positive)** docs/BACKLOG.md flagged as
   "auto-generated, don't edit". Verified: BACKLOG.md WAS
   regenerated via `bash tools/backlog/generate-index.sh` when
   B-0170 was added; the diff is the auto-generated entry. No
   action needed.

3. **(already-resolved in v0.3)** `process.exit(...)` without
   `if (import.meta.main)` guard. Verified: lines 278-280 have
   the guard already. False positive on stale review state.

4. **(real, fixed)** `readFileSync` could throw on permission
   errors / transient IO. Wrapped in try/catch; emit explicit
   error message; return ok:false. Together with the prior
   directory check, all read-failure modes now produce clean
   error output rather than crash trace.

5. **(real, fixed)** File header docstring still said v0.1
   while the iteration is now v0.4. Updated header to v0.4 +
   added an iteration-history block listing each version's
   changes (v0 / v0.1 / v0.2 / v0.3 / v0.4).

The version-label drift in the file header was itself a drift
instance: a version-string-vs-iteration-state inconsistency.
Future tooling for substrate-claim-checker should add a check:
"file's docstring version label matches latest iteration commit
in git log."
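
A rough sketch of that proposed check, under stated assumptions: the header text and commit subjects are passed in as plain strings (how they are read from the file and from `git log` is left out), versions are written as `v0.4.1`, and the header is judged by the highest version it mentions. All function names are hypothetical.

```typescript
// Hypothetical sketch of a header-version check; not part of the shipped tool.

/** Parse "v0.4.1" style labels into numeric parts, e.g. [0, 4, 1]. */
export function parseVersions(text: string): number[][] {
  const versions: number[][] = [];
  const label = /\bv(\d+(?:\.\d+)*)\b/g;
  let match: RegExpExecArray | null;
  while ((match = label.exec(text)) !== null) {
    versions.push((match[1] ?? "").split(".").map(Number));
  }
  return versions;
}

/** Compare part by part; missing parts count as 0, so v0.4 equals v0.4.0. */
export function compareVersions(a: number[], b: number[]): number {
  const length = Math.max(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const diff = (a[i] ?? 0) - (b[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

/** Highest version label mentioned in the text, if any. */
export function latestVersion(text: string): number[] | undefined {
  let best: number[] | undefined;
  for (const version of parseVersions(text)) {
    if (best === undefined || compareVersions(version, best) > 0) best = version;
  }
  return best;
}

/** True when the header's highest label equals the highest label in the commit subjects. */
export function headerMatchesLatest(header: string, commitSubjects: string[]): boolean {
  const inHeader = latestVersion(header);
  const inLog = latestVersion(commitSubjects.join("\n"));
  return inHeader !== undefined && inLog !== undefined && compareVersions(inHeader, inLog) === 0;
}
```

Taking the highest label in the header is a deliberate simplification so that an iteration-history block listing older versions does not trip the check.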

tsc clean + self-test on verify-then-claim memo passes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* hygiene(tick-history): 2026-05-03T01:17Z — v0.4.1 + 5 findings triaged (3 stale/FP, 2 real)

Triage-as-substrate: empirically verify each finding's currency
BEFORE deciding to fix. 3 of 5 #1260 findings were stale or
false-positive after verification (tick-shard append-only history;
BACKLOG.md auto-gen verified; import.meta.main guard already in
v0.3). 2 real fixes: file header v0.1 → v0.4 with iteration
history; readFileSync error wrap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* review(pr-1260): v0.4.2 — collapse existsSync+statSync+readFileSync into single try/catch (eliminates TOCTOU race per CodeQL)

CodeQL flagged TOCTOU (time-of-check-to-time-of-use) race
condition: the existsSync() → statSync() → readFileSync()
sequence had two windows where the file could change between
check and use.

Fix: collapse into single readFileSync try/catch + categorize
the resulting NodeJS.ErrnoException by err.code:
- ENOENT → "error: file not found: <path>"
- EISDIR → "error: not a regular file (directory): <path>"
- other → "error: read failed for <path>: <msg>"

This produces equivalent user-facing error messages from a
single syscall — eliminates TOCTOU race while preserving the
explicit error categorization the prior v0.4 added.
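
The shape of that fix can be sketched as a single read attempt whose failure is classified afterwards. In this illustration the reader is injected so the error handling can be exercised without a filesystem; `readInput` and `ReadResult` are hypothetical names, and the messages follow the three categories listed above.

```typescript
// Illustrative sketch; the real checkFile signature is not shown in this log.
type ReadResult = { ok: true; content: string } | { ok: false; error: string };

/** Attempt the read once and classify any failure by its errno-style `code`. */
export function readInput(path: string, read: (path: string) => string): ReadResult {
  try {
    // No existsSync/statSync pre-checks: the read itself is the only check,
    // so there is no window between check and use.
    return { ok: true, content: read(path) };
  } catch (err) {
    const code =
      typeof err === "object" && err !== null && "code" in err
        ? String((err as { code?: unknown }).code)
        : "";
    const message = err instanceof Error ? err.message : String(err);
    if (code === "ENOENT") return { ok: false, error: `error: file not found: ${path}` };
    if (code === "EISDIR") {
      return { ok: false, error: `error: not a regular file (directory): ${path}` };
    }
    return { ok: false, error: `error: read failed for ${path}: ${message}` };
  }
}
```

In practice the reader would be something like `(p) => readFileSync(p, "utf8")` from `node:fs`, which reports missing files and directories through exactly these `code` values.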

Verified empirically (verify-then-claim discipline applied):
- missing file → "file not found" + exit 1 ✓
- directory → "not a regular file (directory)" + exit 1 ✓
- valid file → no count drift detected ✓
- tsc --noEmit clean ✓

This is the FIRST CodeQL-class finding caught on the tool —
distinct from the Copilot review pattern (CodeQL is static
analysis for security; Copilot is general code review). Both
should integrate as inputs to the eventual deployed
substrate-claim-checker for PR description / commit-msg /
file-content checking.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* hygiene(tick-history): 2026-05-03T01:19Z — v0.4.2 TOCTOU fix; CodeQL is a new review-input class

First CodeQL finding on substrate-claim-checker — TOCTOU race
between existsSync+statSync+readFileSync. Collapsed to single
readFileSync try/catch with err.code categorization. CodeQL is
distinct from Copilot review pattern; eventual deployed
substrate-claim-checker should integrate both as parallel
review-inputs with shared triage discipline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* review(pr-1260): v0.4.3 — bun:test unit tests + README/B-0170 count drift fixes

6 Copilot findings on v0.4.2:

1. **(real, fixed)** README "differs" missed `+` minimum-count
   semantics. Updated: "Reports drift if claimed N differs from
   actual. **Special case for `N+` minimum-count claims:** drift
   fires only when `actual < N`."

2. **(real, fixed)** README cited "19+" drift instances + "#19"
   as count-drift, but main memo enumerated 15. Switched to
   no-specific-count: "drift instances catalogued in the
   verify-then-claim memo's body table — see that file for
   current count." Avoids two-surface count drift between README
   + memo.

3. **(real, fixed)** B-0170 cited "19+" — same drift class.
   Replaced with "(the verify-then-claim memo's body table is
   canonical)". Two occurrences updated.

4. **(false-positive on stale review state)** v0.1 file header.
   Verified: file header is at v0.4.2 (since commit 464c086 +
   484cc48). Resolved as stale.

5. **(real, fixed)** No bun:test unit tests. Added 16 unit
   tests covering findTables (5 tests) + findClaims (5 tests)
   + checkFile (6 tests) including: separator-`-`-required,
   fenced-code-block skipping, CommonMark fence-delimiter
   length matching, hyphenated forms, minimum-count semantics
   (allows actual >= claimed; fires on actual < claimed),
   missing-file + directory rejection, drift detection +
   no-drift cases.

6. **(false-positive on stale review state)** Closing fence
   rules. Verified: v0.4 + v0.4.2 implement CommonMark same-char
   + at-least-equal-length closing. Resolved as stale.

Test results: 16/16 pass; tsc --noEmit clean.

The unit-test suite is the missing eval-set per Aarav's BP-14
review on B-0169 (worked-examples-are-the-dry-run-eval-set).
Each test fixture is a known-good or known-drift case the tool
should classify correctly. Future v1+ work extends the suite
as new sub-classes ship.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* hygiene(tick-history): 2026-05-03T01:28Z — v0.4.3 unit-test suite + count-drift fixes; "point at canonical" pattern

V0 → V0.4.3 substrate-claim-checker iteration: 8 review passes
catching 18+ findings. v0.4.3 adds 16-test bun:test suite
(findTables/findClaims/checkFile coverage) per Aarav's BP-14
worked-examples-are-the-eval-set finding. README + B-0170 count
claims switched from specific count to "memo's body table is
canonical" — hub-satellite separation applied to count-claim
sourcing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* hygiene(tick-history): 2026-05-03T01:33Z — #1261 merged + 4 findings triaged; #1260 rebased; existence-drift caught 3×

Existence-drift sub-class caught 3 times on #1261's follow-up
rows (plugin location + manifest path + hook directory). Each
fix verified empirically against repo state + existing research
docs. The substrate-claim-checker v1+ existence-check would
have caught all 3 pre-publish — empirical urgency for v1
mechanization continues.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* review(pr-1260): v0.4.4 — fence-close requires whitespace-only after delimiter; remove remaining 19+/20+ count claims; bump header

5 Copilot findings on v0.4.3:

1. **(real, fixed)** findTables fence-close: per CommonMark,
   closing fences must have ONLY whitespace after the delimiter.
   "```bash" was being treated as a closer; it's actually an
   info-string-bearing line that occurs INSIDE a fence.
   Refactored to use two regexes: fenceOpen (allows info string)
   and fenceClose (strict whitespace-only); only fenceClose
   triggers fence-close transitions.

2. **(real, fixed)** Same in findClaims; same fix.

3. **(real, fixed)** File header v0.4.2; bumped to v0.4.4 with
   iteration history block extended (v0.4.3 unit tests +
   count-cleanup; v0.4.4 fence-close strictness).

4. **(real, fixed)** BACKLOG.md auto-generated; regenerated to
   pick up B-0170 title from the per-row file (drift was caused
   by an earlier in-flight title rename — `19+` → `(memo's body
   table is canonical)` — that the prior regeneration didn't
   pick up post-rebase).

5. **(real, fixed)** Remaining 19+/20+ claims:
   - README line 73: "running 20+ as of late 2026-05-03 wake" →
     dropped specific count
   - B-0170 line 18: "catalogues 19+ distinct" → "catalogues N
     distinct"
   - B-0170 line 22: "19+ instances of substrate-authoring" →
     "N instances"
   - B-0170 line 23: "19 × 20min ≈ 6 hours" → "compound to many
     hours"
   - B-0170 line 71: "19+ historical drift instances" → "N
     historical drift instances"
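
The fence rules accumulated across v0.2, v0.4 and v0.4.4 (skip fenced content, close only on the same character with at-least-equal length, and only when nothing but whitespace follows the delimiter) could be sketched as below. This is an illustrative reconstruction: the regex names echo the `fenceOpen`/`fenceClose` split described in finding 1, while `linesOutsideFences` is a hypothetical helper.

```typescript
// Illustrative sketch of CommonMark-style fence tracking; not the shipped code.

// Opening fence: up to 3 spaces of indent, 3+ backticks or tildes, optional info string.
const FENCE_OPEN = /^ {0,3}(`{3,}|~{3,})(.*)$/;
// Closing fence: same delimiter shape, but only whitespace may follow it.
const FENCE_CLOSE = /^ {0,3}(`{3,}|~{3,})\s*$/;

/** Return the zero-based indices of lines that are neither fences nor fenced content. */
export function linesOutsideFences(lines: string[]): number[] {
  const outside: number[] = [];
  let fenceChar = "";
  let fenceLen = 0; // 0 means "not currently inside a fence"

  lines.forEach((text, index) => {
    if (fenceLen === 0) {
      const open = FENCE_OPEN.exec(text);
      const delimiter = open?.[1] ?? "";
      const info = open?.[2] ?? "";
      // A backtick fence's info string may not itself contain backticks.
      const validOpen = delimiter !== "" && !(delimiter.startsWith("`") && info.includes("`"));
      if (validOpen) {
        fenceChar = delimiter.charAt(0);
        fenceLen = delimiter.length;
      } else {
        outside.push(index);
      }
      return;
    }
    const close = FENCE_CLOSE.exec(text);
    const delimiter = close?.[1] ?? "";
    // Close only on the same character, at least as long as the opener.
    if (delimiter.charAt(0) === fenceChar && delimiter.length >= fenceLen) {
      fenceChar = "";
      fenceLen = 0;
    }
  });
  return outside;
}
```

Sharing one helper between `findTables` and `findClaims` is one way to avoid the asymmetry found in v0.2, though the commits describe the two functions as tracking fences separately.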

The replace_all pass on v0.4.3 caught some but missed others —
this is itself a verify-then-claim drift instance: I claimed
"removed all 19+/20+ counts" but actually only removed some.
v0.4.4 catches the rest. tsc clean; 16/16 tests pass.
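
A "removed all X" claim is cheap to verify mechanically: search for X and expect zero hits. A minimal sketch, assuming file contents are already loaded into a map (file names and the `findLeftovers` helper are hypothetical):

```typescript
// Hypothetical sketch of a post-removal check; not part of the shipped tool.
interface Leftover {
  file: string;
  line: number; // one-based, for readable reports
  text: string;
}

/** Report every line that still matches `pattern`; an empty result backs the "removed all" claim. */
export function findLeftovers(files: Record<string, string>, pattern: RegExp): Leftover[] {
  // Use a non-global copy so RegExp lastIndex state cannot cause skipped lines.
  const matcher = new RegExp(pattern.source, pattern.flags.replace("g", ""));
  const leftovers: Leftover[] = [];
  for (const [file, content] of Object.entries(files)) {
    content.split("\n").forEach((text, index) => {
      if (matcher.test(text)) leftovers.push({ file, line: index + 1, text });
    });
  }
  return leftovers;
}
```

For the case above, a pattern such as `/\b(?:19|20)\+/` run over the README and the B-0170 row would have listed the occurrences that the replace_all pass left behind.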

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* hygiene(tick-history): 2026-05-03T01:36Z — v0.4.4 fence-close strictness; #1262 merged; replace-all-isn't-comprehensive

V0 → V0.4.4 substrate-claim-checker: 9 review iterations + 23+
substrate-quality findings. v0.4.4 fixes CommonMark fence-close
strictness + remaining count-claim drift that v0.4.3's
replace_all missed. Recursive verify-then-claim catches its own
remediation drift. v1+ existence-check would catch the
"removed all X" → grep should return 0 class.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 3, 2026
…dent pattern refinement (#1283)

* free-memory: guess #2 — in-the-moment guess on B-0172 skill-domain-plugin-packaging (Otto 2026-05-03)

Second in-the-moment guess under the guess-then-verify architectural-intent
calibration protocol (PR #1278). Target: B-0172 skill-domain-plugin-
packaging row (P2). Otto has read row name only; not body.

**Guess summary:**

- Architectural intent (medium-high confidence): plugins-as-distribution-
  + isolation + composition units for skill domains; instantiates
  hub-satellite separation at the domain level
- Substrate-content (medium): plugin manifest format
  (.claude-plugin/plugin.json per recent path corrections); first
  packaging is decision-archaeology + substrate-claim-checker cluster
- Specific implementation (low): directory tree + dependencies
  declaration; GitHub-publishable
- Cross-row composition (medium): B-0169 + B-0170 + B-0173
  composition; B-0171 likely depends_on (OpenSpec specs precede
  plugin packaging)

**Pre-recovery self-prediction**: based on guess #1 pattern (principle-
strong + specific-weak), I predict architectural PARTIAL-MATCH +
substrate-content MIXED + specific MOSTLY-OFF. This pre-prediction
itself is calibration data: how well does Otto predict its own
accuracy BEFORE seeing the answer?

Ground truth + calibration delta sections deliberately empty — to be
filled in a SUBSEQUENT GROUND-TRUTH-RECOVERY commit after Otto reads
B-0172.

This is the second calibration data point under the protocol. Pattern-
recognition test: does the principle-strong + specific-weak pattern
generalize beyond the first guess?

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* GROUND-TRUTH-RECOVERY: B-0172 calibration delta (65%) — context-dependent pattern refinement

Second calibration data point under the guess-then-verify protocol.
Otto scored 26/40 = 65% on B-0172 plugin packaging, up from 48% on
guess #1 (B-0173 hook authoring).

**Calibration result by layer:**

- Architectural: 6/10 PARTIAL-MATCH — got distribution + composition;
  missed Aaron's "hooks-shipping" primary frame + promotion-trigger
  maturity-gate
- Substrate-content: 6/10 MIXED — got Claude-Code-side path; missed
  Codex equivalent format + cross-harness adapter design
- Specific implementation: 7/10 MOSTLY-MATCH — significantly stronger
  than guess #1's 3/10. Reason: recent specific-context from PR #1262
  path corrections taught the manifest path + install location
- Cross-row composition: 7/10 MOSTLY-MATCH — right rows; one
  mis-categorization (B-0173 depends_on vs composes_with)
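
The headline figure follows directly from the four layer scores: 6 + 6 + 7 + 7 = 26 out of 40, or 65%. As a small worked example (the function and key names are illustrative, and the 10-point scale per layer is taken from the scores above):

```typescript
// Worked arithmetic for the calibration score; names are illustrative.
type LayerScores = Record<string, number>;

/** Sum the per-layer scores and express them as a percentage of the maximum. */
export function calibrationPercent(scores: LayerScores, maxPerLayer = 10): number {
  const values = Object.values(scores);
  if (values.length === 0) return 0;
  const total = values.reduce((sum, score) => sum + score, 0);
  // Multiply before dividing to keep the common cases free of rounding noise.
  return (total * 100) / (values.length * maxPerLayer);
}

export const guess2: LayerScores = {
  architectural: 6,
  substrateContent: 6,
  specificImplementation: 7,
  crossRowComposition: 7,
};

export const guess2Percent = calibrationPercent(guess2); // 26 of 40 points
```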

**Pre-prediction validation**: I predicted 3 layers before research.
2/3 correct (architectural PARTIAL-MATCH ✓ + substrate-content MIXED ✓
+ specific MOSTLY-OFF predicted but actual MOSTLY-MATCH ✗). I
over-predicted weakness on specific-implementation when recent
specific-context was present.

**KEY NEW PATTERN FINDING — context-dependent calibration**:

The principle-strong + specific-weak pattern (observed in guess #1)
is CONTEXT-DEPENDENT. When prior specific-context is present (e.g.,
recent PR fixes, recent doc reads, recent commit context), the gap
between principle-layer and specific-layer accuracy narrows
substantially.

This is more useful than the original pattern observation: future-Otto
can predict specific-implementation accuracy as a function of recent
context-density, not as a fixed weakness.

**Pattern progression across 2 data points:**
- Guess #1 (B-0173): no prior specific-context → 3/10 specific
  (MOSTLY-OFF)
- Guess #2 (B-0172): recent PR #1262 path-correction context →
  7/10 specific (MOSTLY-MATCH)

The hypothesis: specific-context-density predicts specific-layer
accuracy. Future guesses will validate or invalidate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 3, 2026
…-intent-guesses/

Two real findings from #1282 review:

1. Grammar: "why packages skills as plugins" → "why package skills as
   plugins" (line 7)
2. Discoverability: architectural-intent-guesses/ directory had no
   MEMORY.md entry. Added newest-first entry pointing at the directory's
   README, with series progression note (guess #1 48% + guess #2
   65% with pattern observation)

Two findings deferred to PR-level reply:

3. PR description / frontmatter mismatch — explained in reply
4. PR-derived detail (PR #1262 path-correction context) "contaminating"
   the guess — methodological clarification: prior context is permitted
   under the protocol; declared in "Read state at guess time" so the
   calibration delta accounts for it. This is the discipline working as
   intended, not contamination

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 3, 2026
…termines-layer-ceiling pattern emerges

Third calibration data point under guess-then-verify protocol. Otto
scored 17-18/40 = ~44% on B-0166 chat-as-DBSP-event vision — lowest
of three so far. Trajectory: 48% → 65% → 44%.

**Calibration result by layer:**

- Architectural: 6/10 PARTIAL-MATCH — got ACID/DBSP/glass-halo angle;
  missed training-substrate angle (chat-event-stream as fine-tuning
  data for Anthropic's next-gen + training material for new AIs)
- Substrate-content: 5/10 MIXED — got basic schema; missed multi-source
  ingest (because B-0164 dual-loop wasn't in read-state)
- Specific implementation: 2-3/10 MOSTLY-OFF — wrong language (TS vs
  F# DBSP runtime); wrong storage (file vs runtime)
- Cross-row composition: 4/10 MOSTLY-OFF — missed B-0164 entirely
  (had zero read-state for the primary composition partner)
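
The total quoted above can be reproduced from the per-layer scores. A minimal sketch, assuming each layer is scored out of 10 and that the "~44%" figure is the midpoint of the 17-18 range (the midpoint rule is an assumption, not stated in the commit message):

```python
# Per-layer scores from the calibration result, as (low, high) out of 10.
# A single-valued score is written as an identical pair.
layer_scores = {
    "architectural": (6, 6),
    "substrate_content": (5, 5),
    "specific_implementation": (2, 3),
    "cross_row_composition": (4, 4),
}
MAX_PER_LAYER = 10  # assumption: every layer is scored on the same 0-10 scale


def total_range(scores):
    """Sum the low and high ends of every layer score."""
    low = sum(lo for lo, _ in scores.values())
    high = sum(hi for _, hi in scores.values())
    return low, high


low, high = total_range(layer_scores)
max_total = MAX_PER_LAYER * len(layer_scores)
# Assumption: the headline percentage is the midpoint of the range.
midpoint_pct = 100 * (low + high) / 2 / max_total
print(f"{low}-{high}/{max_total} = ~{midpoint_pct:.0f}%")
```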

**Pre-prediction**: 2/4 within range. I over-predicted accuracy on
layers requiring specific read-state I lacked.

**KEY NEW PATTERN — read-state-determines-layer-ceiling**:

| Layer | Driven by |
|---|---|
| Architectural | Aaron's framing + cross-disciplinary catalogue + principles |
| Substrate-content | Specific row context + recent PR context |
| Specific implementation | Recent PR context for exact implementation choices |
| Cross-row composition | DIRECT read-state for the composition partners |

Hypothesis: layer-level-accuracy ≈ min(principle-reasoning-quality,
read-state-coverage-for-that-layer).

When read-state is thin for a layer, accuracy degrades regardless of
principle-based reasoning. Future-Otto: predict that layer's score
CONSERVATIVELY when read-state is thin. Don't let principle-reasoning
quality bleed into layer-level confidence when read-state is the
actual ceiling.
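
The min-formula can be written down as a tiny function. This is a sketch only: the function name, the 0.0-1.0 scale, and the example inputs are all hypothetical, since the commit message does not define units for either quantity.

```python
def layer_accuracy_ceiling(principle_quality: float,
                           read_state_coverage: float) -> float:
    """Hypothesized layer accuracy: the smaller of the two inputs.

    Both arguments are assumed to lie on a 0.0-1.0 scale; that
    normalization is illustrative and not taken from the source.
    """
    for name, value in (("principle_quality", principle_quality),
                        ("read_state_coverage", read_state_coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return min(principle_quality, read_state_coverage)


# Hypothetical inputs, not measured values.
thin_read_state = layer_accuracy_ceiling(0.8, 0.3)
rich_read_state = layer_accuracy_ceiling(0.8, 0.9)
print(thin_read_state)  # read-state is the ceiling
print(rich_read_state)  # principle reasoning is the ceiling
```

With strong reasoning held fixed, only the read-state input changes which term binds, which is the behaviour the hypothesis asks future guesses to test.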

**3-data-point pattern progression**:

- #1 (B-0173, no recent PR context): 48% — principle-strong, specific-weak
- #2 (B-0172, recent PR #1262 context): 65% — context boosted specific
- #3 (B-0166, no read-state for primary composition partner): 44% —
  read-state thinness on cross-row layer dragged total down

The hypothesis is testable on future guesses. Pick rows where
read-state varies by layer and observe whether the min-formula holds.

Per Aaron 2026-05-03 *"we are defining the edge / that's the job"* —
this is edge-defining work, not idle-fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 3, 2026
…vent (44%, read-state-ceiling pattern) (#1296)

* free-memory: guess #3 — in-the-moment guess on B-0166 chat-input as ACID-durable DBSP event (Otto 2026-05-03)

Third in-the-moment guess under the calibration protocol. Target:
B-0166 chat-input-as-ACID-durable-DBSP-event row.

**Guess summary:**

- Architectural intent (medium confidence, predict 6-7/10): chat as
  source-of-architectural-intent; ACID-durable preserves what would
  otherwise be lost on compaction; DBSP-event semantics (Aaron's
  cross-disciplinary pattern); replayability composes with DST
- Substrate-content (medium, predict 5-6/10): chat-event schema +
  Z-set retraction semantics + replay tool
- Specific implementation (low, predict 3-4/10): auto-capture hook +
  docs/chat-events/ directory + TS replay tool
- Cross-row composition (medium-high, predict 6-7/10): Otto-363
  substrate-or-it-didn't-happen + Otto-272 DST + retraction-native +
  bidirectional alignment

**Pre-prediction at finer granularity**: this iteration tests whether
self-prediction calibration improves as data points accumulate. Guess
#3 predicts specific score ranges per layer (vs #2's coarser
predictions). Will validate or invalidate the calibration-improvement
hypothesis.

Ground truth + calibration delta sections deliberately empty — to be
filled in a SUBSEQUENT GROUND-TRUTH-RECOVERY commit after Otto reads
B-0166's row body.

Per Aaron 2026-05-03 *"we are defining the edge / that's the job"* —
this is edge-defining work, not idle-fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* GROUND-TRUTH-RECOVERY: B-0166 calibration delta (44%) — read-state-determines-layer-ceiling pattern emerges

Third calibration data point under guess-then-verify protocol. Otto
scored 17-18/40 = ~44% on B-0166 chat-as-DBSP-event vision — lowest
of three so far. Trajectory: 48% → 65% → 44%.

**Calibration result by layer:**

- Architectural: 6/10 PARTIAL-MATCH — got ACID/DBSP/glass-halo angle;
  missed training-substrate angle (chat-event-stream as fine-tuning
  data for Anthropic's next-gen + training material for new AIs)
- Substrate-content: 5/10 MIXED — got basic schema; missed multi-source
  ingest (because B-0164 dual-loop wasn't in read-state)
- Specific implementation: 2-3/10 MOSTLY-OFF — wrong language (TS vs
  F# DBSP runtime); wrong storage (file vs runtime)
- Cross-row composition: 4/10 MOSTLY-OFF — missed B-0164 entirely
  (had zero read-state for the primary composition partner)

**Pre-prediction**: 2/4 within range. I over-predicted accuracy on
layers requiring specific read-state I lacked.
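
The "2/4 within range" count can be checked against the predicted ranges in the guess summary. A minimal sketch, assuming a range-valued actual score (2-3/10) counts as a hit only when it lies entirely inside the predicted range; that counting rule is an assumption, chosen because it reproduces the stated 2/4:

```python
# Predicted score ranges from the guess #3 summary, as (low, high) out of 10.
predicted = {
    "architectural": (6, 7),
    "substrate_content": (5, 6),
    "specific_implementation": (3, 4),
    "cross_row_composition": (6, 7),
}
# Actual scores from the calibration result; single values as identical pairs.
actual = {
    "architectural": (6, 6),
    "substrate_content": (5, 5),
    "specific_implementation": (2, 3),
    "cross_row_composition": (4, 4),
}


def within(pred, act):
    """True when the whole actual range sits inside the predicted range."""
    return pred[0] <= act[0] and act[1] <= pred[1]


hits = [layer for layer in predicted if within(predicted[layer], actual[layer])]
print(f"{len(hits)}/{len(predicted)} within range: {hits}")
```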

**KEY NEW PATTERN — read-state-determines-layer-ceiling**:

| Layer | Driven by |
|---|---|
| Architectural | Aaron's framing + cross-disciplinary catalogue + principles |
| Substrate-content | Specific row context + recent PR context |
| Specific implementation | Recent PR context for exact implementation choices |
| Cross-row composition | DIRECT read-state for the composition partners |

Hypothesis: layer-level-accuracy ≈ min(principle-reasoning-quality,
read-state-coverage-for-that-layer).

When read-state is thin for a layer, accuracy degrades regardless of
principle-based reasoning. Future-Otto: predict that layer's score
CONSERVATIVELY when read-state is thin. Don't let principle-reasoning
quality bleed into layer-level confidence when read-state is the
actual ceiling.

**3-data-point pattern progression**:

- #1 (B-0173, no recent PR context): 48% — principle-strong, specific-weak
- #2 (B-0172, recent PR #1262 context): 65% — context boosted specific
- #3 (B-0166, no read-state for primary composition partner): 44% —
  read-state thinness on cross-row layer dragged total down

The hypothesis is testable on future guesses. Pick rows where
read-state varies by layer and observe whether the min-formula holds.

Per Aaron 2026-05-03 *"we are defining the edge / that's the job"* —
this is edge-defining work, not idle-fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>