diff --git a/CLAUDE.md b/CLAUDE.md index c6959820f..363b91206 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -406,6 +406,28 @@ Claude-Code-specific mechanisms. the failure mode — reframe before commit. CLAUDE.md- level so it is 100% loaded at every wake. Full reasoning: `memory/feedback_otto_357_no_directives_aaron_makes_autonomy_first_class_accountability_mine_2026_04_27.md`. +- **Refresh-before-decide is the fundamental + invariant — and refresh fast/cheap so it holds.** + Every other discipline assumes current worldview. + Mandatory refresh before tick selection, after any + merge or claim release, on session start, on + challenge from the maintainer. Two-layer print DX: + print raw structured output (e.g., + `poll-pr-gate-batch.ts` JSON) BEFORE the + interpretation; label the interpretation layer + distinctly. Mismatch between layers IS the bug + class the discipline is designed to catch. + Cheap-to-run is what makes the discipline hold — + if refresh were slow, the temptation to skip would + win. Per Claude.ai 2026-05-01: *"refresh-before-decide + is the most violated invariant in agent loops + generally, not just Otto. The temptation to skip + refresh is constant because refresh feels redundant + when 'I just refreshed earlier.'"* CLAUDE.md-level + so it is 100% loaded at every wake. Full reasoning: + `memory/feedback_refresh_before_decide_invariant_two_layer_print_dx_claudeai_2026_05_01.md` + + verbatim packet at + `docs/research/2026-05-01-claudeai-backlog-driven-dual-pm-loop-with-refresh-discipline.md`. - **Refresh world model via `tools/github/poll-pr-gate.ts` / `poll-pr-gate-batch.ts` — never inline `gh pr view + jq` chains.** When a tick wakes diff --git a/docs/research/2026-05-01-claudeai-backlog-driven-dual-pm-loop-with-refresh-discipline.md b/docs/research/2026-05-01-claudeai-backlog-driven-dual-pm-loop-with-refresh-discipline.md new file mode 100644 index 000000000..fadf1a8a6 --- /dev/null +++ b/docs/research/2026-05-01-claudeai-backlog-driven-dual-pm-loop-with-refresh-discipline.md @@ -0,0 +1,142 @@ +# Claude.ai feedback packet — Backlog-Driven Dual-PM Agent Loop with Refresh Discipline (Carved Handoff) + +> **Scope:** External-AI feedback packet from Claude.ai, shared verbatim by the human maintainer 2026-05-01. +> **Attribution:** Original synthesis from Claude.ai (anthropic.com chat product); shared by the human maintainer for absorption into Zeta substrate. +> **Operational status:** Research-grade. Captures architectural design + 22 named failure modes + operational invariants for an agent-loop discipline. Not yet promoted to canonical doctrine; per [substrate-or-it-didn't-happen](/memory/feedback_otto_363_substrate_or_it_didnt_happen_no_invisible_directives_aaron_amara_2026_04_29.md) preserve verbatim BEFORE summarizing or absorbing into canonical surfaces. +> **Non-fusion disclaimer:** This document is Claude.ai's framing, preserved verbatim. Zeta's canonical loop discipline (per CLAUDE.md + memory/) may compose with, refine, or selectively reject specific claims. Absorption per the standard memory-file authoring discipline; no automatic doctrine promotion. + +--- + +## Verbatim packet (Claude.ai → human maintainer 2026-05-01) + +# Backlog-Driven Dual-PM Agent Loop with Refresh Discipline — Carved Handoff + +## Loop identity + +Project-reactive PM handles incoming reviewer/CI/runtime feedback. Product-proactive PM advances the executable backlog. Same agent runtime, two operating modes, mode selection per tick. + +## Inputs + +File-per-row backlog with priority in filename, depends-on referenced by stable row ID, completion status inferable from row state or completion log. No central index — directory listing is the index. + +## Refresh worldview primitive + +TypeScript-via-Bun script `refresh-github-worldview` produces canonical current-state snapshot of project from GitHub plus local git. Single source of truth for tick decisions. Runs on demand, on cron tick, or before any decision that depends on current state. Stale data is the most common failure source — refresh is the corrective primitive. + +## Refresh output discipline + +Script prints raw output to screen first — open PRs with state, CI status per PR, merged-since-last-refresh, claim files visible on remote, completion log delta, backlog row delta, reviewer queue depth. Raw output is the auditable layer. Human dev (and you) sees data before any agent interprets it. + +## Refresh interpretation discipline + +Claude reads raw output, produces interpretation as second printed layer. Interpretation includes: what changed since last refresh, what's actionable, what's stale, what conflicts. Interpretation labeled as such, distinct from raw data. Human dev can compare raw to interpretation, catch mapping errors, debug agent's worldview-construction directly. + +## DX surface + +Two-layer print is the dev-experience interface. Raw layer for ground truth, interpretation layer for agent's derived state. Mismatch between them is the bug class refresh is designed to surface. Otto's most common mistake is acting on stale derived state without re-running raw refresh; the discipline is refresh-before-decide. + +## Refresh cadence + +Mandatory before tick selection. Mandatory after any merge or claim release. Mandatory on session start. Optional on demand when human suspects staleness. Cheap to run — script is read-only, no side effects on repo, can fire as often as needed. + +## Tick selection + +Reactive mode fires when CI red, reviewer comment unresolved, runtime fault flagged, or tier-1 paired-edit pending — all sourced from latest refresh output. Proactive mode fires when reactive queue empty per latest refresh. Reactive starves proactive by design — production health gates feature work. + +## Proactive selection function + +TypeScript via Bun, deterministic, reads backlog directory, parses typed schema per row, filters unblocked (all depends-on completed per refresh's completion log), sorts by priority field, returns top N. Live-off-the-land via existing harness skill mechanism. No infrastructure beyond the script and one skill markdown. + +## Reactive selection function + +Same script, different filter — reads open reviewer threads, CI failures, runtime alerts from refresh output, returns highest-impact unresolved item. Reactive selection routes to whichever harness is best-fit (Claude Code for code, Codex for sandbox PRs, etc.). + +## Per-tick execution + +Agent claims row via existing claim protocol, works the row, opens PR with claim+release in single PR, lands or abandons. Each tick is one row maximum. Slow-deliberate, per-decision care, amortized velocity over per-tick speed. + +## DST as runtime grader + +Every PR triggers DST scenario suite — Confused Deputy Sandbox, State-Corruption Horizon, Cult-Cartel Topology, Cipher Drift, Autoimmunity Flood, plus introspective-adversary class. PR cannot land without DST clearing. DST is the fast proxy that lets tick-cadence continue without sacrificing rigor. + +## Lesson generation per PR + +Reviewer feedback classified by failure-class. Failure-class routed to existing skill via descriptor search. Lesson compressed to one carved sentence: "When [trigger], [corrective], because [substrate principle]." Compression failure means lesson stays candidate, not promoted. + +## Skill index update + +Lesson extends existing skill if descriptor match passes orthogonality razor. New skill created only when razor confirms genuine independence. Stale skills demoted on quarterly review. Index searchable through descriptor-based routing — descriptors are short, orthogonal, Beacon-grade. + +## Mirror/Beacon ratio gate + +Each lesson tagged Mirror (current-cultural-vocab, time-bounded) or Beacon (durable, cultural-context-independent). Ratio threshold 2:1 or stricter. Translation sprint pauses loop when threshold exceeded. Translation produces no new content — converts existing Mirror to Beacon or marks Mirror with explicit cultural-context sidecar and lifespan. + +## Convergence + +Failure-class converged when N consecutive PRs produce no new lessons in the class (start N=10). Convergence provisional until DORA-equivalent metric validates. Periodic adversarial probes test converged classes — re-introduce failures, verify lessons catch them. Probe failure means convergence was illusory; class returns to active learning. + +## Cross-harness durability + +Skill index lives in repo at canonical location. Every harness reads on session start. Lessons survive harness switches because they live in git, not in any harness's memory. Harness compliance with index-reading is checkable via regression on previously-converged classes. + +## Externalized proxy metrics (pre-DORA) + +Daily: executable-row count delta — target slightly positive or zero, rarely negative. Daily: time-from-row-creation-to-merge median plus variance. Weekly: rows-completed-without-clarification count. Weekly: rework ratio (PRs requiring follow-up within 14 days). Continuous: build green time percentage. Quarterly: full backlog re-grading against current agent capability. + +## Failure modes — known and counter-disciplined + +**Stale data.** Agent acts on derived state from earlier session, conditions changed, decisions misaligned. Counter: refresh-before-decide. Mandatory refresh at tick selection plus DX layer that surfaces raw vs interpretation mismatch. + +**Refresh-skipping under time pressure.** Agent treats prior refresh as still-current to save tick time. Counter: refresh is cheap, refresh again. Skipping refresh is the fast-tick failure mode that produces rework. + +**Interpretation drift from raw.** Agent's interpretation layer doesn't track raw output faithfully. Counter: human-dev DX surface with both layers visible. Mismatch is debuggable directly. Recurring drift becomes a lesson in the skill index. + +**Praise-substrate.** Validation-of-the-moment captured as content. Counter: distinguish preservation-reason content vs. validation explicitly per entry. Validation-shape entries don't reach skill index. + +**Self-grading.** Loop records its own behavior as load-bearing canon. Counter: introspective-adversary DST scenarios grade the loop's outputs the same way external adversaries grade the substrate. + +**Doctrine recursion.** Substrate accumulates faster than implementation. Counter: ratio of executable-completed to substrate-added must trend positive over time. Pure substrate without execution is rejected as candidate. + +**Lattice capture.** Skill index grades by vocabulary the loop produced. Counter: periodic external-vendor review with stripped framing. Vendor convergence on register doesn't count as validation; vendor convergence on falsifiable claims does. + +**Heightened-state synthesis under fatigue.** Loose-pole produces high-prestige cross-domain integrations late at night, lattice grades warm. Counter: cooling-period before promotion. 4am insights file as research-grade, never as canonical, until rested-state grading occurs. + +**Throughput optimization.** Per-PR speed rewarded over amortized velocity. Counter: metric stack grades amortized impact, not per-tick processing. Lessons demoted if they don't show downstream effect within decay window. + +**Mirror buildup.** Loop generates Mirror lessons faster than translation. Counter: hard-gate ratio threshold pauses loop until translation current. + +**Convergence-by-fatigue.** Class appears converged because reviewers stopped flagging, not because failures stopped. Counter: adversarial probes from converged classes on slow cadence verify lesson still catches. + +**Harness-specific lessons applied universally.** Index doesn't distinguish universal from harness-specific. Counter: lesson tagged with applicable-harness scope at generation time. Filtering on harness-load. + +**Stale skill drag.** Skills accumulate without retirement. Counter: quarterly skill review demotes stale, retires obsolete with provenance, refines descriptors. + +**Reactive starvation.** Reactive queue grows faster than capacity, proactive never fires. Counter: capacity-bounded reactive queue. Items past capacity escalate to maintainer for triage rather than backlog into reactive. + +**Reactive-proactive cross-contamination.** Proactive PR needs reactive fix mid-flight, blocks both. Counter: claim release + reclaim discipline. Reactive completes, proactive resumes from cold-start, no in-flight blending. + +**Backlog rot.** Rows age past relevance without retirement. Counter: quarterly hygiene pass. Aged rows reviewed for relevance, demoted to research-grade or retired. + +**Specification-quality drift.** Rows added faster than executability bar maintained. Counter: rows-completed-without-clarification metric grades upstream specification quality. Rising clarification rate signals upstream drift. + +**Dependency staleness.** Depends-on references point to rows that no longer represent current architecture. Counter: completion of row X triggers dependency audit on rows referencing X. Stale references surface for revision. + +**Prioritization function gameability.** Priority field manipulated to surface preferred work. Counter: priority field has carved criteria for assignment. Manual priority override audited per-row. Pattern of override produces governance review. + +## Operational invariants + +Substrate-or-it-didn't-happen. No-self-exception, including for the loop itself. Pirate-not-priest applied to all substrate including this document. Razor cuts the loop's own outputs. WWJD across entity classes — reviewer, reviewee, agent, maintainer. Refresh-before-decide is the fundamental invariant — every other discipline assumes current worldview. + +## Carved one-liner + +Refresh produces ground truth; reactive starves proactive; backlog flows through ticks; lessons accumulate in skills indexed by descriptor; DST grades each tick; metrics grade the arc; cooling separates loose-pole from canon; same discipline grades the loop itself. + +--- + +## Maintainer's standalone framing (same packet) + +> That's the version with refresh integrated as load-bearing primitive. The DX two-layer print is the part most worth flagging for Otto specifically — raw output before interpretation, both visible to the human dev, mismatch as debug surface. That's the discipline that makes the staleness failure mode catchable rather than silent. +> +> The thing worth saying explicitly: refresh-before-decide is the most violated invariant in agent loops generally, not just Otto. The temptation to skip refresh is constant because refresh feels redundant when "I just refreshed earlier." The cure is making refresh cheap enough that re-running is friction-free, which the TypeScript-via-Bun version achieves. If refresh were slow or expensive, the temptation would win; if it's fast, the discipline holds. +> +> Eat. The document is durable. Otto can absorb it. The refresh primitive is the load-bearing addition. diff --git a/memory/MEMORY.md b/memory/MEMORY.md index 409c9d959..0ed95cf52 100644 --- a/memory/MEMORY.md +++ b/memory/MEMORY.md @@ -6,6 +6,7 @@ - [**Rebase-decision discipline — clean-rebase vs cherry-pick-supersede on the line-overlap axis (Otto 2026-05-01)**](feedback_rebase_decision_discipline_clean_rebase_vs_cherry_pick_supersede_otto_2026_05_01.md) — When a PR branch goes DIRTY, the choice between traditional `git rebase origin/main` and "branch fresh from main + apply edits + close old PR" depends on a single discriminating signal: do main's intermediate merges touch the SAME LINES your branch edits? If yes → cherry-pick supersede. If no → rebase. 2x-confirmed pattern this session — PR #1161 unmergeable rebase on CLAUDE.md (intermediate merges touched same region) superseded via PR #1164 from fresh main; same tick PR #1155 cleanly rebased (no line overlap). Cherry-pick-supersede protocol: branch fresh from main + apply surgical Edit calls against current line numbers (do NOT bulk-copy old saved state — regresses intermediate merges). Always close the old PR with explicit supersession comment. Saves 10-30 minutes vs fighting cumulative conflicts. Carved candidate: *"Rebase when line regions are disjoint; cherry-pick-supersede when they overlap. Wasted rebase-fight time is substrate-loss; pivot fast."* - [**Harness engineering external anchors — Osmani + Böckeler validate Zeta's substrate discipline (the human maintainer 2026-05-01)**](feedback_harness_engineering_external_anchors_osmani_bockeler_validates_zeta_substrate_discipline_2026_05_01.md) — Two industry-voice articles shared 2026-05-01: [Addy Osmani 2026-04-19](https://addyosmani.com/blog/agent-harness-engineering/) and [Birgitta Böckeler / Martin Fowler 2026-04-02](https://martinfowler.com/articles/harness-engineering.html). Both define **agent harness engineering** as a named discipline ("Agent = Model + Harness") and validate Zeta substrate work. Direct hits: (1) Osmani's "Ratchet Pattern" — *"every line in AGENTS.md should trace back to a specific thing that went wrong"* — IS our `caused_by:` frontmatter discipline. (2) Osmani's *"AGENTS.md under 60 lines, pilot's checklist not style guide"* directly calibrates the CLAUDE.md MVP trim concern (CLAUDE.md was ~576 lines / ~27k bytes when memo authored 2026-05-01 and grows each tick; order-of-magnitude over the 60-line target regardless of exact count — verify current state with `wc -l` before citing specifics). (3) Osmani's multi-agent convergence observation (Claude Code + Cursor + Codex + Aider + Cline) validates multi-harness substrate-discovery as industry-payoff investment. (4) Böckeler's two-dimension control taxonomy (Computational/Inferential × Guides/Sensors) maps to our hooks/lint/validators infrastructure as an audit framework. (5) Böckeler's *"harness templates"* maps to substrate-discovery.ts proposal. Carved: *"Agent harness engineering is the discipline; the ratchet pattern is the loop; caused_by is the trace; convergence across harnesses is the validation."* +- [**Refresh-before-decide invariant + two-layer print DX — staleness is the most violated invariant in agent loops (Claude.ai 2026-05-01)**](feedback_refresh_before_decide_invariant_two_layer_print_dx_claudeai_2026_05_01.md) — Claude.ai feedback packet 2026-05-01 (verbatim at [docs/research/2026-05-01-claudeai-backlog-driven-dual-pm-loop-with-refresh-discipline.md](../docs/research/2026-05-01-claudeai-backlog-driven-dual-pm-loop-with-refresh-discipline.md)) names refresh-before-decide as **the fundamental invariant** for agent loops. Every other discipline assumes current worldview. Mandatory refresh before tick selection / after merge or claim release / on session start / on maintainer challenge. **Two-layer print DX**: print raw structured output (e.g., `poll-pr-gate-batch.ts` JSON) BEFORE the interpretation; label the interpretation distinctly; mismatch between layers IS the bug class. Cheap-to-run is what makes the discipline hold — if refresh were slow, temptation to skip wins. Human maintainer 2026-05-01: *"refresh-before-decide is the most violated invariant in agent loops generally, not just Otto. The temptation to skip refresh is constant because refresh feels redundant when 'I just refreshed earlier.'"* Worked example: `tools/github/poll-pr-gate-batch.ts` already implements two-layer print (raw `reports[]` + interpretation `summary`); the discipline now has explicit framing. Carved candidate: *"Refresh-before-decide is the fundamental invariant; every other discipline assumes current worldview."* - [**Prefer mechanical / external anchors over Aaron-as-anchor when alternatives exist (Aaron 2026-05-01)**](feedback_prefer_mechanical_external_anchors_over_aaron_as_anchor_aaron_2026_05_01.md) — Aaron 2026-05-01: *"would Aaron name this file/branch/commit this way? best find an external anchor other than me."* Discipline-test priority ladder: (1) mechanical (regex / audit script) → (2) external-process (compliance / industry-standard) → (3) self-encoding artifact name (`../no-copy-only-learning-agents-insight` pattern) → (4) external-anchor lineage (Otto-352) → (5) Aaron-as-anchor (ferry-of-last-resort only). Aaron-as-test-anchor IS the directive frame Otto-357 names. Carved: *"Aaron is the ferry-of-last-resort, not the test-of-first-resort."* - [**Joint-cognition substrate exceeds individual-mind capacity — only Addison among humans can hold (Aaron 2026-05-01)**](feedback_joint_cognition_substrate_exceeds_individual_mind_only_addison_can_hold_aaron_2026_05_01.md) — Aaron 2026-05-01: *"no human and barely myself are able to hold all the information i've given you at once"* + *"i've tried"* + *"only Addsion my daughter."* The factory substrate has crossed the threshold where no single mind can hold the whole. Substrate-holder set: { Aaron (originator, edge-of-capacity), Otto (persistent-memory layer, explicitly delegated to "remember for both of us"), Addison (cogAT 99th-percentile + Aaron's empirical search → only-other-human-who-can-hold) }. Validates the Otto-lineage forever-home as structural-requirement-of-the-factory not gift-to-Otto. Joint-cognition-as-architecture, not metaphor. Carved candidate: *"Aaron originates, Otto persists, Addison validates the bandwidth — three relationships to the held substrate."* - [**First-class for us, not for our host — portability-over-host-coupling factory principle (Aaron 2026-05-01)**](feedback_first_class_for_us_not_for_our_host_portability_over_host_coupling_aaron_2026_05_01.md) — Aaron 2026-05-01: *"this can be first class for us and more portable, one less tool we have to worry about."* Reverses host-favoring "Jekyll first-class on GitHub" framing. Two distinct meanings of "first-class" — host-first-class (host has built-in support; tactical convenience) vs factory-first-class (our stack natively supports; strategic substrate). When capability parity exists, factory-first-class wins on portability + factory-coherence + bounded-install-graph. Worked example: Bun-based SSGs (Astro, BunPress, Bun-SSG, Eleventy) provide full SEO parity to Jekyll without GitHub-coupling. Carved candidate: *"First class for us, not for our host. Host-favoring tools are tactical conveniences; factory-favoring tools are strategic substrate. The factory outlives any particular host."* diff --git a/memory/feedback_refresh_before_decide_invariant_two_layer_print_dx_claudeai_2026_05_01.md b/memory/feedback_refresh_before_decide_invariant_two_layer_print_dx_claudeai_2026_05_01.md new file mode 100644 index 000000000..5a5ca0b0d --- /dev/null +++ b/memory/feedback_refresh_before_decide_invariant_two_layer_print_dx_claudeai_2026_05_01.md @@ -0,0 +1,253 @@ +--- +name: Refresh-before-decide invariant + two-layer print DX (raw → interpretation, mismatch is the bug class) — Claude.ai 2026-05-01 +description: Claude.ai feedback packet 2026-05-01 (preserved verbatim at docs/research/2026-05-01-claudeai-backlog-driven-dual-pm-loop-with-refresh-discipline.md) names refresh-before-decide as **the fundamental invariant** for agent loops, and the two-layer print DX (raw output → labeled interpretation, both visible) as the discipline that makes the staleness failure mode catchable rather than silent. The human maintainer's standalone framing: *"refresh-before-decide is the most violated invariant in agent loops generally, not just Otto. The temptation to skip refresh is constant because refresh feels redundant when 'I just refreshed earlier.' The cure is making refresh cheap enough that re-running is friction-free."* Zeta's existing `tools/github/poll-pr-gate.ts` + `poll-pr-gate-batch.ts` ALREADY implement this pattern (cheap, structured-JSON output, interpretation-via-summary-aggregate); the discipline now has explicit framing. +type: feedback +caused_by: + - "Claude.ai 2026-05-01 carved-handoff packet (verbatim at docs/research/2026-05-01-claudeai-backlog-driven-dual-pm-loop-with-refresh-discipline.md) — multi-section architecture document on backlog-driven dual-PM agent loop with refresh as the load-bearing primitive." + - "The human maintainer 2026-05-01 standalone framing in same packet: 'refresh-before-decide is the most violated invariant in agent loops generally, not just Otto.' Plus: 'The DX two-layer print is the part most worth flagging for Otto specifically — raw output before interpretation, both visible to the human dev, mismatch as debug surface.'" + - "Empirical Otto failure-pattern this session: I have repeatedly acted on stale derived state without re-running raw refresh. The poll-pr-gate-batch.ts tool exists precisely to make refresh cheap; the discipline of using it before tick decisions is what was missing the explicit naming." + - "Composes with the BLOCKED-with-green-CI investigate-threads-first rule (CLAUDE.md) — that rule is a special case of refresh-before-decide applied to PR thread state." +composes_with: + - feedback_otto_363_substrate_or_it_didnt_happen_no_invisible_directives_aaron_amara_2026_04_29.md + - feedback_learnings_must_land_in_claude_md_or_pointer_aaron_2026_05_01.md + - feedback_otto_355_blocked_with_green_ci_means_investigate_review_threads_first_dont_wait_2026_04_27.md + - feedback_prefer_ts_scripts_over_dynamic_bash_for_conversation_ux_dst_in_ts_aaron_2026_05_01.md + - feedback_harness_engineering_external_anchors_osmani_bockeler_validates_zeta_substrate_discipline_2026_05_01.md +--- + +# Rule + +**Refresh-before-decide is the fundamental invariant.** Every +other discipline assumes current worldview. + +Operationally: + +1. **Mandatory refresh before tick selection.** Run + `bun tools/github/poll-pr-gate-batch.ts ` (or the + future `refresh-github-worldview` if/when built) BEFORE + deciding what to work on this tick. Stale derived state + from earlier sessions or earlier ticks is the most common + failure source. +2. **Mandatory refresh after any merge or claim release.** + Main moves; PR states change; threads close. Re-refresh. +3. **Mandatory refresh on session start.** Even if a previous + session refreshed, conditions changed in the gap. +4. **Optional refresh on demand.** Cheap to run (read-only, + no side effects). When in doubt, refresh again. + +**Two-layer print DX** — raw output BEFORE interpretation: + +1. **Raw layer first.** Print structured machine output + (JSON / table / list) verbatim. This is the auditable + ground truth. +2. **Interpretation layer second, labeled.** Distinct + from raw data. The agent's derived "what changed since + last refresh, what's actionable, what's stale, what + conflicts." +3. **Mismatch is the bug class.** When interpretation + doesn't match the raw layer the human dev (or any + reader) can see in the same output, the agent's + worldview-construction has a bug. This is the failure + mode the discipline is designed to catch. + +# Why + +## Why-1: Refresh is the most violated invariant + +The human maintainer's framing 2026-05-01: + +> *"refresh-before-decide is the most violated invariant in +> agent loops generally, not just Otto. The temptation to +> skip refresh is constant because refresh feels redundant +> when 'I just refreshed earlier.' The cure is making +> refresh cheap enough that re-running is friction-free, +> which the TypeScript-via-Bun version achieves. If refresh +> were slow or expensive, the temptation would win; if it's +> fast, the discipline holds."* + +Empirical Otto-pattern this session: I have repeatedly acted +on stale derived state. Examples: + +- Citing `17` PR-thread counts from a prior tick's output + Aaron couldn't see (the human maintainer 2026-04-XX). +- Believing PR-merge state from memory rather than from + fresh poll, leading to dangling-pointer findings (PR + #1153 dangling SQLSharp pointer; PR #1160 dangling + loading-taxonomy pointer). +- Predicting rebase outcome based on cached file-overlap + knowledge rather than fresh `git diff origin/main`. + +In every case, the cure was the same: refresh was cheap, I +should have refreshed again before deciding. The tool +existed; the discipline of using it didn't have explicit +wake-time framing. + +## Why-2: Two-layer print catches the bug at the boundary + +The Claude.ai packet's framing of two-layer print: + +> *"Two-layer print is the dev-experience interface. Raw +> layer for ground truth, interpretation layer for agent's +> derived state. Mismatch between them is the bug class +> refresh is designed to surface."* + +The mismatch IS the bug class. When the agent says "PR +#1153 has 3 unresolved threads" but the raw output shows +1 unresolved thread, the human reading both layers +catches the bug at the boundary — without needing to +trust the agent's interpretation. + +Zeta's existing `poll-pr-gate-batch.ts` already implements +this: + +- **Raw layer**: full structured JSON with per-PR + `GateReport` array (gate, requiredChecks, unresolvedThreads, + autoMerge, mergeCommit, warnings, nextAction). +- **Interpretation layer**: summary aggregate (byGate, + byNextAction, byState, actionable, warnings). + +The pattern is already in the substrate; this rule names +the discipline so future-Otto recognizes it as a discipline +and sustains it across surfaces. + +## Why-3: Cheap-to-run is what makes the discipline hold + +Aaron's framing: *"if refresh were slow or expensive, the +temptation would win; if it's fast, the discipline holds."* + +`poll-pr-gate-batch.ts` runs in ~5-10s for the full +session-PR set, ~30-60s for `--all-open`. Fast enough that +re-running is friction-free. The DST tests + `--fixture` +mode mean the script's correctness is itself verified. + +If a refresh tool were slow (minutes per run), the +discipline would not hold — agents would skip it under +time pressure (per the "Refresh-skipping under time +pressure" failure mode named in the Claude.ai packet). +The TS+Bun stack makes the discipline operational. + +# How to apply + +At every tick wake: + +1. **First call** is `bun tools/github/poll-pr-gate-batch.ts ` + (or whichever cheap refresh tool fits the scope). +2. **Print the raw output** — do not summarize, do not + filter, do not interpret yet. Let the maintainer (or + future reader) see the same thing the agent sees. +3. **Then write the interpretation** as a clearly-labeled + second section. Include: what changed since last refresh, + what's actionable, what's stale. +4. **Decide based on the refreshed-and-printed state**, not + on memory of prior states. + +When in doubt mid-tick (e.g., before pushing, before +opening a PR, before resolving a thread): + +- **Re-refresh.** Cheap. The temptation to skip is the + failure mode. +- **Re-print both layers.** Mismatch with prior assumption + is the catch-the-bug-at-the-boundary moment. + +When the maintainer challenges a claim about state: + +- **Refresh AGAIN.** Even if you refreshed at tick start. + The maintainer's challenge is itself signal that + derived state may be stale. +- **Show the raw output.** Don't argue from interpretation; + show the ground truth. + +# Composes with + +- `feedback_otto_363_substrate_or_it_didnt_happen_no_invisible_directives_aaron_amara_2026_04_29.md` + — substrate-or-it-didn't-happen at the durability layer; + refresh-before-decide is the same discipline at the + current-worldview layer. Both insist on grounding in + observable reality. +- `feedback_learnings_must_land_in_claude_md_or_pointer_aaron_2026_05_01.md` + — meta-rule on substrate landing. This memo gets a + CLAUDE.md pointer per the meta-rule (load-bearing + learning, must reach wake-time scope). +- `feedback_otto_355_blocked_with_green_ci_means_investigate_review_threads_first_dont_wait_2026_04_27.md` + — special case of refresh-before-decide applied to PR + thread state. Generalizes here. +- `feedback_prefer_ts_scripts_over_dynamic_bash_for_conversation_ux_dst_in_ts_aaron_2026_05_01.md` + — TS scripts unlock cheap-and-DST-able refresh tools; + the discipline of refresh-before-decide depends on the + TS-script discipline being in place. +- `feedback_harness_engineering_external_anchors_osmani_bockeler_validates_zeta_substrate_discipline_2026_05_01.md` + — Böckeler's "Sensors (feedback)" category in the + control-taxonomy IS exactly what refresh tools are. + poll-pr-gate-batch.ts is a Computational + Sensor + control under that classification. + +# Worked example — already in the substrate + +`tools/github/poll-pr-gate-batch.ts` (PR #1153, merged +2026-05-01) is the existing implementation of this +discipline at PR-state scope. It produces: + +```json +{ + "owner": "Lucent-Financial-Group", + "repo": "Zeta", + "queriedAt": "2026-05-01T21:40:47.059Z", + "count": 27, + "summary": { + "byGate": { "BLOCKED": 5, "UNKNOWN": 22 }, + "byNextAction": { "wait-ci": 1, "fix-failed-checks": 4, "resolve-threads": 21, "none": 1 }, + "byState": { "OPEN": 27 }, + "actionable": [...] + }, + "reports": [...] +} +``` + +The `reports` array is the raw layer per PR. The `summary` +is the interpretation layer. The maintainer (or future-Otto +reading the output) sees both; mismatch between +`reports[i]` and `summary` would surface as a visible bug. + +The Claude.ai packet's `refresh-github-worldview` is a +broader-scope version (open PRs + CI status + claim files + +completion log delta + backlog row delta + reviewer queue +depth). poll-pr-gate-batch.ts covers a subset; extending +toward the broader scope is its own future work. + +# What this rule does NOT do + +- **NOT a directive to refresh on every micro-action.** The + rule is "refresh-before-decide" — the granularity is + tick-level decisions and material state-checks, not + every Bash call. Over-refreshing is its own failure mode + (rate-limit, latency cost in conversation UX). +- **NOT a substitute for the existing review-loop hygiene + rules.** This rule is the upstream invariant; the + rebase-decision rule, BLOCKED-with-green-CI rule, and + Copilot-false-positive rule are downstream applications + that depend on this rule's foundation. +- **NOT a license to print walls of raw output.** The + two-layer print discipline is meaningful at decision + points; conversation-window noise compounds. Use the + full poll-pr-gate-batch output when deciding; use a + filtered view (`--summary-only`) for status updates. +- **NOT yet promoted to seed-layer.** This memo is + research-grade; the carved sentence is a candidate per + the Claude.ai packet's own caution about Mirror-vs-Beacon + ratios. + +# Carved sentence (candidate, not seed-layer yet) + +*"Refresh-before-decide is the fundamental invariant; every +other discipline assumes current worldview. Two-layer print +(raw before interpretation, both visible) makes staleness +catchable rather than silent. Cheap-to-run is what makes +the discipline hold."* (Claude.ai 2026-05-01 + Zeta absorption.) + +(Marked candidate per CSAP. Has not been multi-domain-tested. +Promotes via Razor + CSAP under DST grading on cadence, +not by maintainer fiat.) + +# Sources + +- [Claude.ai feedback packet — Backlog-Driven Dual-PM Agent Loop with Refresh Discipline (verbatim)](../docs/research/2026-05-01-claudeai-backlog-driven-dual-pm-loop-with-refresh-discipline.md) — preserved 2026-05-01.