Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/BACKLOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -392,6 +392,7 @@ are closed (status: closed in frontmatter)._
- [ ] **[B-0822](backlog/P1/B-0822-diamond-resolution-namespace-cardinality-multi-tenant-awareness-as-third-dimension-of-shared-chart-dependency-resolution-aaron-2026-05-26.md)** diamond-resolution namespace + cardinality + multi-tenant-awareness — three orthogonal properties on shared charts that determine whether the Maven-for-Helm graph deploys ONE shared instance or N per-consumer instances; substrate-engineering target for Ace package manager's chart-graph resolver (Aaron 2026-05-26)
- [ ] **[B-0824](backlog/P1/B-0824-package-manager-of-package-managers-n-dimensional-dependency-space-holographic-projection-ai-rate-continuous-upstream-negotiation-aaron-2026-05-26.md)** Ace as "package manager of package managers" — N-dimensional dependency space (Maven is 2D; we're at least 3D / N-D) + holographic projection (merge 2D streams from each PM into higher-D views) + AI-rate continuous upstream negotiation (push-forward + absorb-forward at AI cadence — no existing PM does this); strategic-architectural substrate for the Ace meta-PM substrate (Aaron 2026-05-26)
- [ ] **[B-0825](backlog/P1/B-0825-time-modeled-dependencies-for-helm-clusters-as-long-running-stateful-systems-require-temporal-axis-in-dependency-graph-aaron-2026-05-26.md)** time-modeled dependencies for Helm — clusters are long-running stateful systems; chart-graph needs temporal axis for revision history + migration phases + rolling-upgrade windows + concurrent-version overlap; Helm uniquely requires this among package managers; substrate-engineering target for Ace meta-PM (Aaron 2026-05-26)
- [ ] **[B-0831](backlog/P1/B-0831-ci-cascade-6-full-install-plus-cluster-auto-join-eliminate-routine-human-physical-usb-test-aaron-2026-05-26.md)** CI cascade #6 — full-install-and-cluster-auto-join (post-boot install completes; node self-registers; eliminates routine human physical USB test) (Aaron 2026-05-26)

## P2 — research-grade

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,193 @@
---
id: B-0831
priority: P1
status: open
title: CI cascade #6 — full-install-and-cluster-auto-join (post-boot install completes; node self-registers; eliminates routine human physical USB test) (Aaron 2026-05-26)
effort: L
ask: aaron 2026-05-26
created: 2026-05-26
last_updated: 2026-05-26
depends_on:
- B-0812
- B-0813
composes_with:
- B-0814
- B-0816
tags: [ci, qemu, cluster-bringup, auto-install, cluster-join, eliminates-human-physical-test, cascade-6]
---

## Problem

Current CI has cascade #5 (QEMU boot smoke-test per
`tools/ci/qemu-boot-test.ts`): asserts the installer ISO boots to the
`zeta-installer login:` prompt. This validates kernel, initrd, getty,
console-output. But it does NOT validate:

- First-boot service fires (`zeta-first-boot.service`)
- `zeta-install` greedy N-disk install completes
- Installed system reboots cleanly
- Auto-generated hostname appears in login banner
- Node self-registers with the cluster (per B-0812 iter-5.4.1)
- ArgoCD sees the new node + reconciles (per B-0813 iter-5.4.2)

The human maintainer currently physically tests by booting from USB on
real hardware. Operator framing 2026-05-26: *"zflash is the thing plus
cluster auto joining after boot from iso use we want that in ci not
needing human to test everytime."*

Eliminating routine human physical-USB-test as the gate for substrate
landings would unblock multiple downstream cascades and reduce
per-iteration latency dramatically.

## Proposed cascade #6 (full-install-and-join in QEMU)

Three phases, each a separately-shippable slice:

### Slice 1 — full-install-in-QEMU (no cluster join)

Extend `tools/ci/qemu-boot-test.ts` (or add `tools/ci/qemu-full-install-test.ts`):

- Boot QEMU with installer ISO as CD-ROM
- Attach virtual disk as install target (qcow2; e.g., 20 GB; emulates a
single-NVMe greedy-install target per zeta-install logic)
- Wait for `zeta-first-boot` to fire
- Wait for `zeta-install` to complete (success marker in serial output
OR /mnt/etc/zeta presence visible via SSH)
- Reboot QEMU; verify installed system comes up to login prompt
- Assert auto-generated hostname appears in login banner (iter-5.2.2
module)
- Assert cluster substrate is present on the installed disk

### Slice 2 — cluster-auto-join verification (mock cluster control-plane)

After Slice 1 completes, the installed node should attempt cluster
self-registration per B-0812 iter-5.4.1. Slice 2 adds:

- Mock cluster control-plane in CI (or skip if real cluster API is
reachable from runner)
- Capture the join attempt's payload shape
- Verify the registration request matches schema
- Verify the new node would be added to
`maintainers/<operator>/cluster-nodes/<hostname>/...` tree shape per
B-0794 + B-0812 per-maintainer convention

This requires either:

- A standalone TS mock server in `tools/ci/mock-cluster-control-plane.ts`
- OR a real cluster endpoint reachable from runner (less safe;
cluster-side state changes)

Prefer mock; less coupling; more reproducible.

### Slice 3 — ArgoCD reconciliation verification

Once the node registers (Slice 2), ArgoCD watches `maintainers/
cluster-nodes/` tree per B-0813 iter-5.4.2 and reconciles. Slice 3
verifies:
Comment thread
AceHack marked this conversation as resolved.

- Mock ArgoCD (or real ArgoCD instance) sees the new node
- Reconciliation produces expected k8s manifest delta
- No failures in reconcile loop

This phase is the most coupled to live cluster state; can be deferred
until Slices 1 + 2 are landed and Slice 3 has a clean isolated test
surface.

## Acceptance

Phased acceptance (each slice ships independently):

- **Slice 1 acceptance**: CI cascade #6 phase 1 step passes on PR
touching `full-ai-cluster/**`. Step runs in under 10 min total
(boot, install, reboot, login-verify). Captures full serial console
as workflow-artifact for debug.
- **Slice 2 acceptance**: CI cascade #6 phase 2 captures + verifies
cluster-join attempt payload. Mock-cluster substrate is reusable for
other CI tests (composes with cluster-bringup substrate).
- **Slice 3 acceptance**: CI cascade #6 phase 3 verifies ArgoCD
reconciliation shape. May be deferred OR run only on push-to-main
(skipped on PR builds for latency reasons).
- **Overall acceptance**: human physical-USB-test is no longer the
routine gate for substrate landings; physical test is reserved for
hardware-specific issues (real-hardware quirks that QEMU doesn't
emulate) AND for periodic sanity-checks the maintainer chooses to do.

## Composes with

- `tools/ci/qemu-boot-test.ts` (cascade #5; extend OR sibling-script)
- `.github/workflows/build-ai-cluster-iso.yml` (cascade orchestration)
- `full-ai-cluster/usb-nixos-installer/zeta-install.sh` (the install
flow the test exercises end-to-end)
- `full-ai-cluster/usb-nixos-installer/zeta-first-boot.sh` (the
auto-fire service)
- `full-ai-cluster/usb-nixos-installer/nixos/installer/configuration.nix`
(login banner + auto-hostname per iter-5.2.2)
- B-0812 iter-5.4.1 (node self-registration commit+push to maintainers/
cluster-nodes; the substrate this CI cascade verifies end-to-end)
- B-0813 iter-5.4.2 (ArgoCD app watches `maintainers/*/cluster-nodes/**` tree per per-maintainer glob)
- B-0814 (tools/cluster-deregister-node.ts sibling)
- B-0816 (architectural principle: maximize ArgoCD scope + minimize
NixOS-native lock-in)
- B-0754 (zero-typing first-boot auto-install scope; the substrate
exercised in CI cascade #6 phase 1)
- B-0818 (isoName mkForce nixpkgs 25.11 regression; orthogonal but
composes at ISO-naming scope)
- `.claude/skills/flash-cluster-iso/SKILL.md` (Path B 0-human-typing
flow IS the operator-side analog; this CI cascade IS the CI-side
analog — eliminates need to flash physical USB just to test)
- The 2026-05-26 USB test gate empirical anchor (PR #5324 +
179a8d2 ISO build; 2 validated cycles in record; this cascade
reduces the per-iteration latency)

## Substrate-honest framing

This is a P1 substrate-engineering target with substantial scope. The
3-slice decomposition allows incremental progress; Slice 1 alone delivers
material value (per-PR full-install validation in QEMU; eliminates 80%
of cases where human physical test was needed).

Slice 3 (ArgoCD reconciliation in CI) is the most coupled to live cluster
state and may require additional substrate (mock-ArgoCD OR isolated test
cluster) that itself is substrate-engineering scope.

The substrate-engineering work composes with the broader
"eliminate-human-physical-USB-test-as-routine-gate" direction the
operator named 2026-05-26.

### Physical test BECOMES the hardware-support test (operator 2026-05-26 reframing)

Operator's sharpening: *"yes physcal test become actually hardware
support test"*. Physical USB-test is REFRAMED — not eliminated, not
demoted, but assigned a different scope:

| Test scope | Surface | Validates |
|---|---|---|
| Routine substrate validation | CI cascade #6 (this row) | Install flow + cluster-join shape (QEMU-emulatable) |
| **Hardware-support test** | Physical USB on real hardware | BIOS/UEFI variant compatibility, motherboard NIC drivers, SAS controller support, GPU detection on actual silicon, real-hardware quirks QEMU cannot emulate |

The physical test is no longer the gate-of-last-resort for every
substrate landing; it becomes the **first-class hardware-compatibility-
matrix gate** — fired when (a) onboarding new hardware, (b) iterating on
hardware-specific code paths, (c) periodic compatibility sanity-checks
across the fleet's hardware diversity.

This composes with the broader cluster-bringup substrate-engineering
Comment thread
AceHack marked this conversation as resolved.
work: hardware-support-test results inform the hardware-compatibility-
matrix that drives provisioning decisions (which boards/NICs/GPUs
are supported; which need driver substrate; which need to be excluded).

The reframing turns physical-test from "annoying routine gate that
blocks substrate landings" INTO "valuable hardware-compatibility-matrix
substrate that informs cluster-provisioning decisions."

## Operational implication for CI rate-limit + run-time

Slice 1 adds ~5-10 minutes to the build-ai-cluster-iso.yml workflow
(boot + install + reboot + verify). Per the rate-limit operational tiers
discipline, this is a meaningful but bounded addition; PR-build latency
goes from ~current to ~current+10min. Acceptable trade-off for the
elimination of human physical-test as routine gate.

Slice 3 may be deferred to push-to-main (not PR-build) for latency
reasons. Slice 2 should be PR-build-eligible since cluster-join shape
verification is fast (under 1 min after Slice 1 completes).
Loading