diff --git a/docs/BACKLOG.md b/docs/BACKLOG.md index f35ab6ed2e..bbf90fed46 100644 --- a/docs/BACKLOG.md +++ b/docs/BACKLOG.md @@ -392,6 +392,7 @@ are closed (status: closed in frontmatter)._ - [ ] **[B-0822](backlog/P1/B-0822-diamond-resolution-namespace-cardinality-multi-tenant-awareness-as-third-dimension-of-shared-chart-dependency-resolution-aaron-2026-05-26.md)** diamond-resolution namespace + cardinality + multi-tenant-awareness — three orthogonal properties on shared charts that determine whether the Maven-for-Helm graph deploys ONE shared instance or N per-consumer instances; substrate-engineering target for Ace package manager's chart-graph resolver (Aaron 2026-05-26) - [ ] **[B-0824](backlog/P1/B-0824-package-manager-of-package-managers-n-dimensional-dependency-space-holographic-projection-ai-rate-continuous-upstream-negotiation-aaron-2026-05-26.md)** Ace as "package manager of package managers" — N-dimensional dependency space (Maven is 2D; we're at least 3D / N-D) + holographic projection (merge 2D streams from each PM into higher-D views) + AI-rate continuous upstream negotiation (push-forward + absorb-forward at AI cadence — no existing PM does this); strategic-architectural substrate for the Ace meta-PM substrate (Aaron 2026-05-26) - [ ] **[B-0825](backlog/P1/B-0825-time-modeled-dependencies-for-helm-clusters-as-long-running-stateful-systems-require-temporal-axis-in-dependency-graph-aaron-2026-05-26.md)** time-modeled dependencies for Helm — clusters are long-running stateful systems; chart-graph needs temporal axis for revision history + migration phases + rolling-upgrade windows + concurrent-version overlap; Helm uniquely requires this among package managers; substrate-engineering target for Ace meta-PM (Aaron 2026-05-26) +- [ ] **[B-0831](backlog/P1/B-0831-ci-cascade-6-full-install-plus-cluster-auto-join-eliminate-routine-human-physical-usb-test-aaron-2026-05-26.md)** CI cascade #6 — full-install-and-cluster-auto-join (post-boot install completes; node self-registers; eliminates routine human physical USB test) (Aaron 2026-05-26) ## P2 — research-grade diff --git a/docs/backlog/P1/B-0831-ci-cascade-6-full-install-plus-cluster-auto-join-eliminate-routine-human-physical-usb-test-aaron-2026-05-26.md b/docs/backlog/P1/B-0831-ci-cascade-6-full-install-plus-cluster-auto-join-eliminate-routine-human-physical-usb-test-aaron-2026-05-26.md new file mode 100644 index 0000000000..38ea4ac78f --- /dev/null +++ b/docs/backlog/P1/B-0831-ci-cascade-6-full-install-plus-cluster-auto-join-eliminate-routine-human-physical-usb-test-aaron-2026-05-26.md @@ -0,0 +1,193 @@ +--- +id: B-0831 +priority: P1 +status: open +title: CI cascade #6 — full-install-and-cluster-auto-join (post-boot install completes; node self-registers; eliminates routine human physical USB test) (Aaron 2026-05-26) +effort: L +ask: aaron 2026-05-26 +created: 2026-05-26 +last_updated: 2026-05-26 +depends_on: + - B-0812 + - B-0813 +composes_with: + - B-0814 + - B-0816 +tags: [ci, qemu, cluster-bringup, auto-install, cluster-join, eliminates-human-physical-test, cascade-6] +--- + +## Problem + +Current CI has cascade #5 (QEMU boot smoke-test per +`tools/ci/qemu-boot-test.ts`): asserts the installer ISO boots to the +`zeta-installer login:` prompt. This validates kernel, initrd, getty, +console-output. But it does NOT validate: + +- First-boot service fires (`zeta-first-boot.service`) +- `zeta-install` greedy N-disk install completes +- Installed system reboots cleanly +- Auto-generated hostname appears in login banner +- Node self-registers with the cluster (per B-0812 iter-5.4.1) +- ArgoCD sees the new node + reconciles (per B-0813 iter-5.4.2) + +The human maintainer currently physically tests by booting from USB on +real hardware. Operator framing 2026-05-26: *"zflash is the thing plus +cluster auto joining after boot from iso use we want that in ci not +needing human to test everytime."* + +Eliminating routine human physical-USB-test as the gate for substrate +landings would unblock multiple downstream cascades and reduce +per-iteration latency dramatically. + +## Proposed cascade #6 (full-install-and-join in QEMU) + +Three phases, each a separately-shippable slice: + +### Slice 1 — full-install-in-QEMU (no cluster join) + +Extend `tools/ci/qemu-boot-test.ts` (or add `tools/ci/qemu-full-install-test.ts`): + +- Boot QEMU with installer ISO as CD-ROM +- Attach virtual disk as install target (qcow2; e.g., 20 GB; emulates a + single-NVMe greedy-install target per zeta-install logic) +- Wait for `zeta-first-boot` to fire +- Wait for `zeta-install` to complete (success marker in serial output + OR /mnt/etc/zeta presence visible via SSH) +- Reboot QEMU; verify installed system comes up to login prompt +- Assert auto-generated hostname appears in login banner (iter-5.2.2 + module) +- Assert cluster substrate is present on the installed disk + +### Slice 2 — cluster-auto-join verification (mock cluster control-plane) + +After Slice 1 completes, the installed node should attempt cluster +self-registration per B-0812 iter-5.4.1. Slice 2 adds: + +- Mock cluster control-plane in CI (or skip if real cluster API is + reachable from runner) +- Capture the join attempt's payload shape +- Verify the registration request matches schema +- Verify the new node would be added to + `maintainers//cluster-nodes//...` tree shape per + B-0794 + B-0812 per-maintainer convention + +This requires either: + +- A standalone TS mock server in `tools/ci/mock-cluster-control-plane.ts` +- OR a real cluster endpoint reachable from runner (less safe; + cluster-side state changes) + +Prefer mock; less coupling; more reproducible. + +### Slice 3 — ArgoCD reconciliation verification + +Once the node registers (Slice 2), ArgoCD watches `maintainers/ +cluster-nodes/` tree per B-0813 iter-5.4.2 and reconciles. Slice 3 +verifies: + +- Mock ArgoCD (or real ArgoCD instance) sees the new node +- Reconciliation produces expected k8s manifest delta +- No failures in reconcile loop + +This phase is the most coupled to live cluster state; can be deferred +until Slices 1 + 2 are landed and Slice 3 has a clean isolated test +surface. + +## Acceptance + +Phased acceptance (each slice ships independently): + +- **Slice 1 acceptance**: CI cascade #6 phase 1 step passes on PR + touching `full-ai-cluster/**`. Step runs in under 10 min total + (boot, install, reboot, login-verify). Captures full serial console + as workflow-artifact for debug. +- **Slice 2 acceptance**: CI cascade #6 phase 2 captures + verifies + cluster-join attempt payload. Mock-cluster substrate is reusable for + other CI tests (composes with cluster-bringup substrate). +- **Slice 3 acceptance**: CI cascade #6 phase 3 verifies ArgoCD + reconciliation shape. May be deferred OR run only on push-to-main + (skipped on PR builds for latency reasons). +- **Overall acceptance**: human physical-USB-test is no longer the + routine gate for substrate landings; physical test is reserved for + hardware-specific issues (real-hardware quirks that QEMU doesn't + emulate) AND for periodic sanity-checks the maintainer chooses to do. + +## Composes with + +- `tools/ci/qemu-boot-test.ts` (cascade #5; extend OR sibling-script) +- `.github/workflows/build-ai-cluster-iso.yml` (cascade orchestration) +- `full-ai-cluster/usb-nixos-installer/zeta-install.sh` (the install + flow the test exercises end-to-end) +- `full-ai-cluster/usb-nixos-installer/zeta-first-boot.sh` (the + auto-fire service) +- `full-ai-cluster/usb-nixos-installer/nixos/installer/configuration.nix` + (login banner + auto-hostname per iter-5.2.2) +- B-0812 iter-5.4.1 (node self-registration commit+push to maintainers/ + cluster-nodes; the substrate this CI cascade verifies end-to-end) +- B-0813 iter-5.4.2 (ArgoCD app watches `maintainers/*/cluster-nodes/**` tree per per-maintainer glob) +- B-0814 (tools/cluster-deregister-node.ts sibling) +- B-0816 (architectural principle: maximize ArgoCD scope + minimize + NixOS-native lock-in) +- B-0754 (zero-typing first-boot auto-install scope; the substrate + exercised in CI cascade #6 phase 1) +- B-0818 (isoName mkForce nixpkgs 25.11 regression; orthogonal but + composes at ISO-naming scope) +- `.claude/skills/flash-cluster-iso/SKILL.md` (Path B 0-human-typing + flow IS the operator-side analog; this CI cascade IS the CI-side + analog — eliminates need to flash physical USB just to test) +- The 2026-05-26 USB test gate empirical anchor (PR #5324 + + 179a8d2 ISO build; 2 validated cycles in record; this cascade + reduces the per-iteration latency) + +## Substrate-honest framing + +This is a P1 substrate-engineering target with substantial scope. The +3-slice decomposition allows incremental progress; Slice 1 alone delivers +material value (per-PR full-install validation in QEMU; eliminates 80% +of cases where human physical test was needed). + +Slice 3 (ArgoCD reconciliation in CI) is the most coupled to live cluster +state and may require additional substrate (mock-ArgoCD OR isolated test +cluster) that itself is substrate-engineering scope. + +The substrate-engineering work composes with the broader +"eliminate-human-physical-USB-test-as-routine-gate" direction the +operator named 2026-05-26. + +### Physical test BECOMES the hardware-support test (operator 2026-05-26 reframing) + +Operator's sharpening: *"yes physcal test become actually hardware +support test"*. Physical USB-test is REFRAMED — not eliminated, not +demoted, but assigned a different scope: + +| Test scope | Surface | Validates | +|---|---|---| +| Routine substrate validation | CI cascade #6 (this row) | Install flow + cluster-join shape (QEMU-emulatable) | +| **Hardware-support test** | Physical USB on real hardware | BIOS/UEFI variant compatibility, motherboard NIC drivers, SAS controller support, GPU detection on actual silicon, real-hardware quirks QEMU cannot emulate | + +The physical test is no longer the gate-of-last-resort for every +substrate landing; it becomes the **first-class hardware-compatibility- +matrix gate** — fired when (a) onboarding new hardware, (b) iterating on +hardware-specific code paths, (c) periodic compatibility sanity-checks +across the fleet's hardware diversity. + +This composes with the broader cluster-bringup substrate-engineering +work: hardware-support-test results inform the hardware-compatibility- +matrix that drives provisioning decisions (which boards/NICs/GPUs +are supported; which need driver substrate; which need to be excluded). + +The reframing turns physical-test from "annoying routine gate that +blocks substrate landings" INTO "valuable hardware-compatibility-matrix +substrate that informs cluster-provisioning decisions." + +## Operational implication for CI rate-limit + run-time + +Slice 1 adds ~5-10 minutes to the build-ai-cluster-iso.yml workflow +(boot + install + reboot + verify). Per the rate-limit operational tiers +discipline, this is a meaningful but bounded addition; PR-build latency +goes from ~current to ~current+10min. Acceptable trade-off for the +elimination of human physical-test as routine gate. + +Slice 3 may be deferred to push-to-main (not PR-build) for latency +reasons. Slice 2 should be PR-build-eligible since cluster-join shape +verification is fast (under 1 min after Slice 1 completes).