docs(backlog): B-0831 — CI cascade #6 full-install + cluster-auto-join (eliminate routine human physical USB test)#5343

Merged
AceHack merged 3 commits into main from otto/b-0831-ci-cascade-6-full-install-cluster-auto-join-no-human-test-2026-05-26
May 26, 2026

@AceHack AceHack commented May 26, 2026

Summary

Files B-0831 as a P1 substrate-engineering target, capturing the operator direction from 2026-05-26: "zflash is the thing plus cluster auto joining after boot from iso use we want that in ci not needing human to test everytime."

3-slice decomposition

| Slice | Scope | Latency cost |
|---|---|---|
| 1 | Full-install-in-QEMU: boot installer ISO → first-boot service fires → greedy N-disk install → reboot → verify login banner | +5-10 min PR-build |
| 2 | Cluster-auto-join verification via mock cluster control-plane (capture + verify B-0812 self-registration payload) | +<1 min |
| 3 | ArgoCD reconciliation verification (most coupled to live cluster state; deferrable to push-to-main only) | TBD; possibly push-only |

Each slice ships independently. Overall acceptance: human physical-USB-test is no longer the routine gate for substrate landings.
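
As a rough illustration of what the Slice 1 check could look like, the sketch below scans a captured serial-console log and requires each install stage to appear in order. The stage names and marker strings are hypothetical placeholders; the real console output of the installer is not given in this PR, and this is not the code in `qemu-boot-test.ts`.

```typescript
// Sketch only: stage names and marker patterns are assumptions for illustration.
interface Stage {
  name: string;
  marker: RegExp;
}

const STAGES: Stage[] = [
  { name: "installer-iso-boot", marker: /installer ISO booted/ },
  { name: "first-boot-service", marker: /first-boot service started/ },
  { name: "disk-install", marker: /install complete on \d+ disk\(s\)/ },
  { name: "reboot", marker: /rebooting/ },
  { name: "login-banner", marker: /login:/ },
];

interface StageResult {
  ok: boolean;
  reached: string[];
  missing: string | null;
}

// Walk the log once, requiring each marker to appear after the previous one.
function verifyStages(log: string, stages: Stage[] = STAGES): StageResult {
  const reached: string[] = [];
  let offset = 0;
  for (const stage of stages) {
    const match = stage.marker.exec(log.slice(offset));
    if (match === null) {
      // Report the first stage that never showed up after its predecessor.
      return { ok: false, reached, missing: stage.name };
    }
    offset += match.index + match[0].length;
    reached.push(stage.name);
  }
  return { ok: true, reached, missing: null };
}
```

Reporting the first missing stage, rather than a bare pass/fail, is meant to make a failed PR build point at the step that stalled.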

What remains valuable for physical test

  • Real-hardware quirks (BIOS/UEFI variants; motherboard NICs; SAS controllers) that QEMU doesn't emulate
  • Periodic sanity-checks the maintainer chooses to do
  • First-time-on-new-hardware validation

Test plan

  • markdownlint clean (B-0831 row + BACKLOG.md regenerated)
  • No code changes (backlog row only)
  • Composes_with cross-refs to all relevant rows + skills + workflow files
  • Substrate-honest scope assessment (L effort; phased; latency trade-off named)

🤖 Generated with Claude Code

…n (eliminate routine human physical USB test)

Per Aaron 2026-05-26: "zflash is the thing plus cluster auto joining
after boot from iso use we want that in ci not needing human to test
everytime."

Files B-0831 as P1 substrate-engineering target. 3-slice decomposition:

- Slice 1: full-install-in-QEMU (boot installer ISO + auto-fire
  first-boot service + greedy N-disk install + reboot + verify login
  banner with auto-generated hostname). Extends cascade #5
  (qemu-boot-test.ts) to validate the full install flow not just the
  live-ISO boot.
- Slice 2: cluster-auto-join verification via mock cluster control-plane
  (capture + verify B-0812 iter-5.4.1 self-registration payload shape).
- Slice 3: ArgoCD reconciliation verification (most coupled to live
  cluster state; deferrable to push-to-main only for latency reasons).

Acceptance criteria scoped per slice; each ships independently.
Overall acceptance: human physical-USB-test is no longer the routine
gate for substrate landings; physical test reserved for real-hardware
quirks + periodic sanity-checks.
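
For Slice 2, a minimal sketch of the capture-and-verify idea, assuming the node posts a JSON body to the mock control-plane. The field names (`hostname`, `operator`, `diskCount`) are invented for illustration; B-0812 defines the real self-registration payload shape, which is not reproduced in this PR.

```typescript
// Hypothetical payload shape; the real fields come from B-0812.
interface RegistrationPayload {
  hostname: string;
  operator: string;
  diskCount: number;
}

const REQUIRED_STRINGS: (keyof RegistrationPayload)[] = ["hostname", "operator"];

// Return a list of shape errors; an empty list means the payload is acceptable.
function validatePayload(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["payload is not a JSON object"];
  }
  const record = value as Record<string, unknown>;
  const errors: string[] = [];
  for (const field of REQUIRED_STRINGS) {
    const v = record[field];
    if (typeof v !== "string" || v.length === 0) {
      errors.push(`${field} must be a non-empty string`);
    }
  }
  const disks = record["diskCount"];
  if (typeof disks !== "number" || !Number.isInteger(disks) || disks < 1) {
    errors.push("diskCount must be a positive integer");
  }
  return errors;
}

// Stand-in for the mock control-plane's request handler: it records every
// parseable body so the test can inspect it, then reports shape errors.
class MockControlPlane {
  readonly captured: unknown[] = [];

  receive(rawBody: string): string[] {
    let parsed: unknown;
    try {
      parsed = JSON.parse(rawBody);
    } catch {
      return ["body is not valid JSON"];
    }
    this.captured.push(parsed);
    return validatePayload(parsed);
  }
}
```

Keeping the handler free of any HTTP dependency keeps the shape check fast to run, and a thin server wrapper can be added where the QEMU guest needs a real endpoint.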

Composes_with cross-refs:
- tools/ci/qemu-boot-test.ts (cascade #5)
- build-ai-cluster-iso.yml workflow
- zeta-install.sh + zeta-first-boot.sh + configuration.nix
- B-0812 (node self-registration substrate)
- B-0813 (ArgoCD reconciliation substrate)
- B-0814 (deregister sibling)
- B-0816 (architectural principle: maximize ArgoCD + minimize NixOS-
  native lock-in)
- B-0754 (zero-typing first-boot scope; substrate exercised in Slice 1)
- B-0818 (isoName nixpkgs 25.11 regression; orthogonal)
- flash-cluster-iso SKILL.md (operator-side 0-human-typing analog of
  this CI-side cascade)
- 2026-05-26 USB test gate empirical anchor (PR #5324 + 179a8d2 build)

Operational implication: Slice 1 adds ~5-10 min to PR-build latency;
acceptable trade-off for eliminating human physical-test as routine gate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 22:53
@AceHack AceHack enabled auto-merge (squash) May 26, 2026 22:53

…rdware-support test (not eliminated; assigned distinct first-class scope)

Per the human maintainer 2026-05-26: "yes physcal test become actually
hardware support test".

Adds reframing section to B-0831:

- Routine substrate validation → CI cascade #6 (QEMU-emulatable)
- Hardware-support test → physical USB on real hardware (validates
  BIOS/UEFI variants, motherboard NIC drivers, SAS controller support,
  GPU detection on actual silicon, real-hardware quirks QEMU cannot
  emulate)

Physical test becomes first-class hardware-compatibility-matrix gate
fired for: onboarding new hardware, iterating on hardware-specific
code paths, periodic compatibility sanity-checks across fleet's
hardware diversity.

This composes with broader cluster-bringup substrate work: hardware-
support-test results inform hardware-compatibility-matrix that drives
provisioning decisions.

The reframing turns physical-test from "annoying routine gate" into
"valuable hardware-compatibility-matrix substrate."

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Adds a new P1 backlog row (B-0831) capturing the planned CI “cascade #6” work to validate a full installer run in QEMU plus post-boot cluster auto-join, with the goal of eliminating routine physical USB testing as the substrate gate.

Changes:

  • Adds new backlog row B-0831 describing a 3-slice CI verification plan (full install, mock join verification, optional ArgoCD reconciliation verification).
  • Updates docs/BACKLOG.md index to include B-0831.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| docs/backlog/P1/B-0831-ci-cascade-6-full-install-plus-cluster-auto-join-eliminate-routine-human-physical-usb-test-aaron-2026-05-26.md | New backlog row defining the problem statement, slices, acceptance criteria, and cross-references for CI cascade #6. |
| docs/BACKLOG.md | Adds B-0831 to the P1 backlog index list. |

…g-ambiguity (<10 / <1) wording

1+2. Lines 70+126: cluster-nodes path was `maintainers/cluster-nodes/`
     which doesn't match B-0794 + B-0812 per-maintainer convention.
     Corrected to `maintainers/<operator>/cluster-nodes/<hostname>/...`
     for the registration step + `maintainers/*/cluster-nodes/**`
     glob for the ArgoCD watch path.

3. Line 102: replaced `<10 min` with `under 10 min` to avoid HTML-tag-
   ambiguity in Markdown renderers; also restructured the parenthetical
   list to avoid `+` continuation in bullet (which markdownlint parses
   as nested list-item).

4. Line 173 (now 192): replaced `(<1 min ...)` with `(under 1 min ...)`
   for the same HTML-tag-ambiguity reason.
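
The path correction in item 1+2 can be sketched as follows. The helper names are hypothetical, the glob matcher is a simplified stand-in (`*` matches one path segment, `**` matches any remainder) rather than ArgoCD's own matching rules, and the part of the path after `<hostname>/` is left out because the commit message elides it.

```typescript
// Per-maintainer node directory: maintainers/<operator>/cluster-nodes/<hostname>
function nodeDir(operator: string, hostname: string): string {
  for (const part of [operator, hostname]) {
    if (part.length === 0 || part.includes("/")) {
      throw new Error(`invalid path segment: "${part}"`);
    }
  }
  return `maintainers/${operator}/cluster-nodes/${hostname}`;
}

// Simplified glob-to-regex conversion (an assumption, not ArgoCD's matcher).
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters other than `*`.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*/g, "\u0000") // protect `**` while single `*` is rewritten
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp(`^${pattern}$`);
}

// Watch glob quoted from the fix-up commit message.
const WATCH_GLOB = "maintainers/*/cluster-nodes/**";
```

Under these assumptions, a file beneath `nodeDir("alice", "node-a")` matches `WATCH_GLOB`, while a file under the old `maintainers/cluster-nodes/` prefix does not.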

No substantive scope changes; consistency-fixes + cross-renderer-safety.
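
The wording change in items 3 and 4 is mechanical enough to sketch as a small rewrite helper. The function name is hypothetical and no such script is part of this PR; it only covers the `<N min` form mentioned above.

```typescript
// Rewrite "<N min" to "under N min" so the text cannot be read as an HTML tag.
function avoidAngleDurations(text: string): string {
  return text.replace(/<\s*(\d+)\s*min\b/g, "under $1 min");
}
```

For example, `avoidAngleDurations("(<1 min ...)")` yields `(under 1 min ...)`, and text without a digit after `<` is left unchanged.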

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@AceHack AceHack merged commit 1072f56 into main May 26, 2026
29 checks passed
@AceHack AceHack deleted the otto/b-0831-ci-cascade-6-full-install-cluster-auto-join-no-human-test-2026-05-26 branch May 26, 2026 22:58