backlog(B-0760): USB as repair tool for any node — identity preservation + no-disruption-at-3+-nodes invariant#5038
Conversation
…ion + no-disruption-at-3+-nodes invariant Aaron 2026-05-25 mid-iteration-1 prep: 'think of this usb as the easiest repair tool for any node in the cluster if it fails we should harden it like that since we are desired state / declarative / git native / ai native a full rebuild of a node should not stop normal operations once we get to 3 nodes.' Captures the repair-tool design: plug USB into failed node, mDNS- detects existing cluster, MAC-address-detects prior node identity, inherits hostname + role from cluster's known-nodes registry, reinstalls + rejoins as SAME identity. At 3+ nodes (B-0756 HA quorum), single-node rebuild is zero-disruption to cluster ops. Composes with B-0754 (zero-typing first-boot — substrate this extends), B-0756 (HA quorum prerequisite), B-0757 (mDNS auto- discovery — the 'is there a cluster?' detection reused for 'is this a known node?'), B-0755 (role inheritance must support all role variants), B-0758 (orthogonal — repair-tool semantics apply whether OS lives on disk OR USB). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
Pull request overview
Adds a new P2 backlog row (B-0760) describing the “USB as repair tool” desired-state semantics for cluster node rebuilds, including identity preservation and a “zero disruption at 3+ nodes” invariant, and registers it in the generated backlog index.
Changes:
- Adds new backlog row file
B-0760with problem statement, target flow, acceptance criteria, and related cross-references. - Updates
docs/BACKLOG.mdto include the new P2 entry.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| docs/backlog/P2/B-0760-usb-as-repair-tool-for-any-node-identity-preservation-across-rebuilds-no-disruption-at-3-plus-nodes-aaron-2026-05-25.md | New backlog row defining repair-USB semantics, acceptance criteria, and operational/security constraints. |
| docs/BACKLOG.md | Adds the B-0760 entry to the P2 index. |
Two findings: 1. Line 29: continuation "+ ZETA_AUTO_CONFIRM ..." parsed as markdown list item, breaking markdownlint. Rewrap as comma-separated list inside the parenthetical so no line starts with "+". 2. Line 142-143: "even counts split-brain on partition" inaccurate for quorum-based Raft/etcd. Even clusters lose quorum on equal- split partitions (becoming unavailable) rather than forming divergent leaders. Rewrite to reflect the actual fault-tolerance tradeoff: 3 and 4 both survive 1 failure, so even counts add a node without improving tolerance. Resolves lint (markdownlint) failure + both Copilot threads. Co-Authored-By: Claude <noreply@anthropic.com>
* docs(archive): preserve discussions for merged PRs * docs(shadow): log CI failure drift from otto-cli in PR #5032 * docs(shadow): update log with repeated drift from otto-cli in PR #5038 * fix(PR-5041): metadata block uses proper bullet list, not bold-with-leading-hyphen Per Copilot finding: `**- Date:**` renders a literal hyphen inside the bold text, inconsistent with the dominant `docs/research/**` shadow-lesson-log format. Switch to `- **Date:**` (proper Markdown bullet list). * docs(shadow): fix markdown rendering in PR #5041 (Copilot review) Two findings from copilot-pull-request-reviewer on PR #5041: 1. `## 2. Update: Repeated Drift Pattern` was misnumbered — appeared after `## 3. Lesson` and duplicated `## 2. Analysis`. Renumbered to `## 4.` to match document flow. 2. Three metadata lines used `**- Date:**` form, rendering a literal hyphen inside bold text rather than a proper list-item marker. Reformatted to `- **Date:**` matching the file's opening lines. No content change; only markdown structural rendering. Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Lior <lior@zeta.dev> Co-authored-by: Otto <otto@zeta.factory> Co-authored-by: Otto <noreply@anthropic.com>
|
Closing as substrate-stale + ID-collision per .claude/rules/pr-triage-tiers.md Tier 3 + the agent-roster-reference-card ID-allocation discipline. Discriminator pass per .claude/rules/fighting-past-self-vs-peer-agent-distinguisher...md:
Disposition: close. The substrate-honest re-land path is cherry-pick onto a fresh branch off current main, renumber the collided IDs to next-free (B-0806+ at filing time), and open a focused PR. This close is NOT a substrate punt — it's explicit ownership classification per the rule I just sharpened in #5126: routing the work to "re-land via cherry-pick with renumbered IDs" rather than leaving it in indeterminate state with operator paying triage tax. |
Pull request was closed
Aaron 2026-05-25 mid-B-0754-v1 testing prep, naming the USB-as-repair-tool design intent + the 3-node-zero-disruption invariant from desired-state/declarative/git-native/AI-native principles. Composes with B-0754 / B-0755 / B-0756 / B-0757 / B-0758.