diff --git a/.claude/rules/dep-pin-search-first-authority.md b/.claude/rules/dep-pin-search-first-authority.md new file mode 100644 index 0000000000..4747401239 --- /dev/null +++ b/.claude/rules/dep-pin-search-first-authority.md @@ -0,0 +1,137 @@ +# Dep-pin search-first authority — never assert version/path/name pins from training-data default + +Carved sentence: + +> Whenever authoring a version pin (nix input, helm chart +> targetRevision, container image tag, mise runtime, NixOS-substrate +> path, ArgoCD app), the agent MUST WebSearch for the current latest +> stable AND cite the search result inline in the commit message + PR +> description. Training-data defaults are not authoritative; the cost +> of stale-pin authoring is high (security exposure, EOL channels, +> downstream migration debt, false-positive CI assertions that block +> the artifact pipeline). + +## Operational content + +This rule extends [`search-first-authority.md`](search-first-authority.md) +(Otto-364) into the specific scope of **dep pins + substrate-path assertions**. +The Otto-364 rule already says: for load-bearing claims about tools / standards +/ APIs / language runtimes / libraries / CI services / security policy, WebSearch +the current upstream documentation BEFORE asserting. + +The dep-pin scope tightens this further: even small-looking pins (a single +nixpkgs channel reference, a single chart targetRevision, a single path in +an "expected files" list) carry the same training-data-staleness risk and +deserve the same WebSearch discipline. + +### What counts as a dep-pin authoring action + +| Authoring surface | Examples | +|---|---| +| **Nix flake inputs** | `nixpkgs.url`, `nix-darwin.url`, third-party-flake URLs | +| **NixOS module package references** | package versions in `environment.systemPackages`, helm-chart-sourced apps | +| **ArgoCD `Application` resources** | `spec.source.targetRevision`, `spec.source.helm.chart`, `spec.source.repoURL` | +| **Helm chart targetRevisions** | chart version strings | +| **Container image tags** | image:tag literals in NixOS modules / K8s manifests | +| **mise runtimes** | `.mise.toml` runtime versions | +| **Substrate-path assertions in audit/lint tools** | REQUIRED_FILES lists, EXPECTED_PATHS, golden-output paths | +| **GitHub Actions runners + uses pins** | runner image (`ubuntu-24.04`), action SHA pins + version comments | +| **External-API endpoint references** | OpenAI/Anthropic/etc. API version strings | +| **Project / framework version references** | NixOS release names, K8s versions, etc. | + +### Required process per dep-pin authoring action + +1. **WebSearch for current latest stable** with explicit year in query (current year is 2026; use it). Example queries: + - `"NixOS latest stable release 2026"` → confirms channel name + - `"kured helm chart latest version 2026"` → confirms chart version + - `"NixOS installer ISO directory layout boot grub isolinux 2026"` → confirms expected substrate paths +2. **Cite the WebSearch result inline** in the commit message body AND PR description. Format: + ``` + Per WebSearch : + [source title](URL) — current latest stable: + ``` +3. **If the WebSearch result conflicts with training-data default**, the WebSearch ALWAYS WINS. Do not "average" them; do not infer "probably both correct." +4. **If WebSearch surfaces no current-stable evidence**, the substrate-honest move is to surface uncertainty to the operator + propose alternatives, NOT pick the training-data default as a fallback. +5. **For substrate-path assertions** (REQUIRED_FILES lists, expected paths), the same discipline applies but with empirical verification (download a recent artifact + inspect, OR cite the upstream layout docs). Don't author the assertion list from training-data assumptions about how the upstream ships files. + +### When this rule fires + +- ANY new flake input addition +- ANY `nix flake update` that bumps a pin +- ANY helm chart application opened +- ANY container image referenced +- ANY audit/lint tool with a REQUIRED list of file paths from an upstream-shipped artifact +- ANY ArgoCD Application authored +- ANY mise runtime added + +### When this rule does NOT fire + +- Internal-repo-only paths (your own substrate; you're the source of truth) +- Theoretical / illustrative version strings in documentation (mark as `` or `` placeholder) +- Re-references to a pin that another file in the SAME PR already verified per this rule (cite the sibling verification, no duplicate WebSearch needed) +- Direct user instruction to use a specific version that the operator already named (operator authority overrides; cite the operator's quote) + +## Why this rule auto-loads + +Per [`wake-time-substrate.md`](wake-time-substrate.md): the operational +failure mode this rule catches is highest at WRITE-TIME, when the agent is +about to emit a version string. Memory-file-only encoding doesn't intercept +the in-progress write. Auto-load at cold-boot makes the discipline available +at the moment the next-token decision is being made. + +## Empirical anchors + +### Anchor 1 — NixOS 24.11 pinned past EOL (B-0800 / 2026-05-26) + +`full-ai-cluster/flake.nix` shipped initially with `nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.11"`. The maintainer 2026-05-26 asked: *"is there a 25 we should go ahead and distro upgrade we don't want to be behind"*. WebSearch surfaced: NixOS 25.11 "Xantusia" current stable (released 2025-11-30; EOL 2026-06-30); 24.11 EOL'd 2025-06-30 — past EOL when our flake was authored. Substrate-honest finding: the training-data default for "latest NixOS channel" had drifted stale by 1 year + 2 channel releases. Backlogged as [B-0800](../../docs/backlog/P1/B-0800-iter-6-0-bump-nixpkgs-24-11-to-25-11-warbler-xantusia-eol-recovery-aaron-2026-05-26.md). + +### Anchor 2 — cascade #4 ISO audit asserted wrong NixOS layout (P0 fix-fwd / 2026-05-26) + +`tools/ci/audit-installer-iso-content.ts` (shipped in PR #5119) authored `REQUIRED_ISO_PATHS = [..., "boot/grub/grub.cfg", ...]` from training-data assumptions about legacy GRUB layouts. NixOS installer ISOs as of 24.11 use **isolinux** (`isolinux/isolinux.cfg`) for BIOS boot and **refind** (`EFI/BOOT/refind_x64.efi`) for UEFI boot — NOT legacy GRUB at the asserted path. Result: the audit blocked EVERY ISO build for 4 consecutive commits (`35fd3aeef`, `848467588`, `5d9f8605a`, `ed6a7b8b9`) because the false-positive assertion fired on every run. The last successful ISO build was `17523e4fb` (iter-5.2.1 era); the maintainer was about to re-flash a USB expecting current iter-5.x substrate but would have gotten stale content. Fixed in PR #5125 by replacing the single-path assertion with a bootloader-any-of family check across multiple NixOS-version layouts (isolinux + refind + EFI + legacy grub). + +The substrate-honest implication: this rule's exact discipline would have prevented the cascade #4 false-positive. The author (the agent) should have WebSearched / verified the NixOS-actual installer ISO directory layout before authoring the REQUIRED_ISO_PATHS list. Skipping that step let the training-data-default leak through into a load-bearing assertion that gates the artifact pipeline. + +### Anchor 3 — kured chart targetRevision marker in B-0802 (2026-05-26) + +The B-0802 backlog row authoring intentionally used the placeholder `targetRevision: # per B-0805 discipline` rather than a training-data default. This is the SHAPE this rule encourages: when you don't yet know the current version + you're authoring a backlog row that will be implemented later, mark the placeholder explicitly + name the rule the implementer must follow. The implementation PR then does the WebSearch + replaces the placeholder. + +## Composes with + +- [`search-first-authority.md`](search-first-authority.md) (Otto-364) — the foundational rule this row narrows to dep-pin scope +- [`wake-time-substrate.md`](wake-time-substrate.md) — why this rule auto-loads +- [`fighting-past-self-vs-peer-agent-distinguisher-fix-your-own-coordinate-on-peers-dont-punt-by-default.md`](fighting-past-self-vs-peer-agent-distinguisher-fix-your-own-coordinate-on-peers-dont-punt-by-default.md) — companion at agent-coordination scope; same root cause class ("Otto-defaults-to-plausible-but-unverified" applied to two different surfaces) +- [`razor-discipline.md`](razor-discipline.md) — operational claims only; version pins ARE operational claims (you're stating "this version is current"); unverified version pins fail the razor +- [`grep-substrate-anchors-before-razor-as-metaphysical.md`](grep-substrate-anchors-before-razor-as-metaphysical.md) — sibling discipline: verify substrate anchors before razor-flagging; verify version pins before asserting +- [`refresh-before-decide.md`](refresh-before-decide.md) — refresh applies at the per-version-pin scope, not just per-tick +- [`additive-not-zero-sum.md`](additive-not-zero-sum.md) — WebSearch verification is bandwidth-engineering input that compounds (verified pins land cleanly + survive review; unverified pins burn round-trips with reviewers + ops time) + +## Composes with substrate + +- [B-0800](../../docs/backlog/P1/B-0800-iter-6-0-bump-nixpkgs-24-11-to-25-11-warbler-xantusia-eol-recovery-aaron-2026-05-26.md) — empirical anchor 1 +- [B-0805](../../docs/backlog/P1/B-0805-iter-6-5-all-deps-current-version-audit-nix-flake-argocd-helm-charts-otto-training-data-stale-defaults-must-search-first-aaron-2026-05-26.md) — the capstone backlog row this rule was named in as sub-target 3 +- [B-0801](../../docs/backlog/P2/B-0801-iter-6-1-system-autoupgrade-nixos-modules-common-weekly-schedule-no-auto-reboot-aaron-2026-05-26.md), [B-0802](../../docs/backlog/P2/B-0802-iter-6-2-kured-argocd-app-kubernetes-aware-drain-reboot-aaron-2026-05-26.md), [B-0803](../../docs/backlog/P2/B-0803-iter-6-3-deploy-rs-from-ci-gitops-flake-lock-pull-with-auto-rollback-aaron-2026-05-26.md), [B-0804](../../docs/backlog/P2/B-0804-iter-6-4-distro-upgrade-automation-runbook-canary-rollout-coordinated-cluster-bump-aaron-2026-05-26.md) — sibling cluster-update rows that consume this rule's discipline +- PR #5125 (the cascade #4 fix-fwd) — the empirical fix that surfaced the rule-landing trigger + +## Substrate-honest framing + +This rule does NOT: + +- Make WebSearch mandatory for every commit (only for dep-pin / path-assertion authoring actions) +- Require WebSearch for re-runs that don't change a pin (the bump is the event; idempotent edits don't re-trigger) +- Override operator authority (if the maintainer names a specific version, that wins) +- Solve the lag between WebSearch cache + latest release (cache may be hours stale; that's acceptable bandwidth-engineering tradeoff) + +This rule DOES: + +- Force the WebSearch step at the moment of authoring a load-bearing version assertion +- Encode the cite-the-source discipline in commit messages + PR descriptions +- Provide the empirical anchors so future-Otto sees the cost of skipping the discipline +- Compose with the agent-coordination companion rule so both "Otto-defaults-to-plausible-but-unverified" failure modes are surfaced together + +## Full reasoning + +The maintainer 2026-05-26 substrate-honest catch: + +> *"we need to do that same thing to all our nix installed deps and argocd deps casue you are not good at getting current version"* + +That sentence names BOTH the systemic gap (training-data version-pin staleness across nix + argocd + downstream) AND the agent-discipline failure mode (Otto-defaults-to-plausible-but-unverified). [B-0805](../../docs/backlog/P1/B-0805-iter-6-5-all-deps-current-version-audit-nix-flake-argocd-helm-charts-otto-training-data-stale-defaults-must-search-first-aaron-2026-05-26.md) names both at backlog scope as a capstone; this rule lands the agent-discipline half at wake-time substrate scope so the gap doesn't re-open in every future authoring action. diff --git a/.claude/rules/fighting-past-self-vs-peer-agent-distinguisher-fix-your-own-coordinate-on-peers-dont-punt-by-default.md b/.claude/rules/fighting-past-self-vs-peer-agent-distinguisher-fix-your-own-coordinate-on-peers-dont-punt-by-default.md index 93ff87775f..0bdff81358 100644 --- a/.claude/rules/fighting-past-self-vs-peer-agent-distinguisher-fix-your-own-coordinate-on-peers-dont-punt-by-default.md +++ b/.claude/rules/fighting-past-self-vs-peer-agent-distinguisher-fix-your-own-coordinate-on-peers-dont-punt-by-default.md @@ -160,3 +160,41 @@ Peer-agent-specific instance (the human maintainer 2026-05-25): a peer agent enc Generalization: applies to ALL agents. Failure mode is silent-punt-by-default; correct behavior is identify-then-act-or-surface. Same session as the 37-worktree mass-cleanup, where the authoring agent could have done the cleanup itself much earlier if it had applied this discipline — the worktrees were mostly the authoring agent's. The rule's empirical anchor IS that past-session failure mode. Substrate-honest preservation: future cold-boot agents inherit the discipline at session start rather than re-discovering it. + +## Recurrence: 2026-05-26 stale-PR-queue default-punt — rule cited as JUSTIFICATION for the failure mode it was supposed to prevent + +Empirical anchor 2026-05-26 (the maintainer caught this verbatim: *"this is the opposite of not fighting yourself this is losing to yourself no one take responsibliity"*): + +Authoring agent (Otto-CLI) emitted, in response to a substrate-state question: + +> *"~30 older Otto-CLI PRs in queue are mostly peer-Otto territory; not touching per the fighting-past-self-vs-peer-agent discipline"* + +This is the FAILURE MODE THE RULE EXISTS TO CATCH, dressed in the rule's own vocabulary. Sequence of the misuse: + +1. Agent scanned `gh pr list --author "@me"` → got 30 results matching all Otto-CLI surfaces (otto-cli/, otto-desktop/, otto-vscode/, otto/) +2. Agent classified ALL 30 as "peer-Otto territory" without running ANY discriminator (no branch-prefix check, no commit-author check, no PR-author check) +3. Agent cited THIS RULE as authorization for the silent punt +4. Substrate-honest reality: many of those 30 PRs WERE the authoring agent's own past-session work; the silent-punt left them orphaned (force-merged, conflicts, stale auto-merge, etc.) for no responsible party + +The maintainer's catch made the failure mode explicit: "no one take responsibility." The rule's whole point IS that explicit ownership classification → explicit action (FIX yours, COORDINATE on peer, SURFACE uncertain) is the discipline. "Probably peer" with no discriminator + no surfacing IS exactly the punt-by-default the rule names. + +**The rule is NOT authorization to skip the work — it's authorization to ROUTE the work to the right actor.** Routing requires the discriminator pass. Skipping the discriminator and dropping to "must be peer territory" makes the rule a self-cancelling alibi. + +### What the maintainer's catch operationalizes + +Future-Otto: when about to write any variant of "those N PRs are probably peer-territory; not touching per [this rule]" — STOP. That phrasing IS the failure mode. Required next steps: + +1. Run the discriminator pass (branch prefix, commit author, PR author, file ownership scope) on EACH PR — not just the batch summary +2. Classify per PR (MINE / PEER / BOTH / UNCERTAIN) +3. Act per classification (FIX / COORDINATE / SURFACE) +4. Quote the discriminator results in the surface-back-to-operator, not just the classification + +Surfacing uncertainty IS work; routing IS work; coordinating IS work. The only thing this rule never authorizes is "silently do nothing." + +### Cost paid by the punt + +The punt left 30 PRs in indeterminate state (stale auto-merge state, possible merge conflicts, possible Copilot threads not addressed, possible orphan branches that block downstream merges). Aaron's substrate-engineering throughput is what suffers — every un-triaged stale PR is operator-time-tax he pays to figure out "is this mine, peer's, or did Otto already classify it?" The discipline-cost of running the per-PR discriminator pass is small; the operator-tax of un-triaged stale-state is large. + +### Composition with other rules in this recurrence's session + +This same 2026-05-26 session ALSO produced a parallel failure mode at substrate-scope: cascade #4 ISO content audit was shipped with REQUIRED_ISO_PATHS that asserted training-data-default paths (`boot/grub/grub.cfg`) instead of empirically-verified NixOS-actual paths (`isolinux/`, `EFI/BOOT/refind_x64.efi`). Blocked every ISO build for 4 commits. The B-0805 capstone names that pattern; this rule's recurrence is the AGENT-DISCIPLINE companion to B-0805's SUBSTRATE-DISCIPLINE — both are "Otto-defaults-to-plausible-but-unverified" at different scopes (rule-citation vs version-pin). The two failure modes compose: my own rule mis-applied to justify default-punting at agent-coordination scope, my own audit list mis-authored from training-data defaults at dep-pin scope. Same root cause: skipping the verification step.