From 0449ea6e2190f5d15144985cc9bdb4c89175f95b Mon Sep 17 00:00:00 2001 From: Lior Date: Tue, 26 May 2026 22:51:18 -0400 Subject: [PATCH] =?UTF-8?q?feat(B-0849):=20docker-based=20NixOS=20install.?= =?UTF-8?q?sh=20test=20harness=20=E2=80=94=20fast=20iteration=20(~30=20sec?= =?UTF-8?q?)=20complementing=20B-0831=20QEMU=20full-install=20(~15=20min);?= =?UTF-8?q?=20"easy=20dockerfile"=20per=20operator=20framing=20(Aaron=2020?= =?UTF-8?q?26-05-27)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Operator framing 2026-05-27 (verbatim): > "we should add docker based nixos install.sh testing so we can > iterate quick that's an easy dockerfile" Direct response after PR #5389 (iter-5.5.1 alignment fix-fwd for PR #5388) where the operator named the iteration-cost problem: every install.sh / linux.sh / mise.sh change today requires a full ISO build + USB flash + physical install (~30 min cycle); Docker testing of just the script on NixOS userspace gives seconds-per- iteration. 3-phase plan: - Phase 1: tools/ci/dockerfiles/nixos-install-sh-test/Dockerfile (10-line FROM nixos/nix + touch /etc/NIXOS + COPY . . + RUN tools/setup/install.sh) + TS wrapper bun tools/ci/docker-nixos-install-sh-test.ts per Rule 0 - Phase 2: GitHub Actions integration with path-filter scoped to tools/setup/** + full-ai-cluster/nixos/modules/common.nix - Phase 3: documented Docker-vs-QEMU coverage matrix; both run on install-substrate PRs (Docker fast, QEMU end-to-end) Empirical case for the substrate: iter-5.4 cascade has produced 8 distinct bugs (Bug 1-8) all caught only after operator USB flash; Docker harness would have caught Bug 5 (gh not in systemPackages), Bug 7 (NetBIOS conflict with smbd), Bug 8 (credential persistence gap) AT WRITE TIME. Composes with: B-0831 cascade #6 (complementary; full-VM scope at minutes-per-cycle); B-0835 install bug cluster (this row catches future Bug-N at write-time); B-0848 (node-local Claude install validation); B-0824 (Ace package-manager โ€” Docker NixOS test is one slice of three-way-parity); GOVERNANCE ยง24 (three-way parity extended to NixOS-via-Docker validation surface). ๐Ÿค– Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- docs/BACKLOG.md | 1 + ...qemu-full-install-test-aaron-2026-05-27.md | 144 ++++++++++++++++++ 2 files changed, 145 insertions(+) create mode 100644 docs/backlog/P2/B-0849-docker-based-nixos-install-sh-test-harness-fast-iteration-vs-qemu-full-install-test-aaron-2026-05-27.md diff --git a/docs/BACKLOG.md b/docs/BACKLOG.md index 4362525e96..0c9f616a2f 100644 --- a/docs/BACKLOG.md +++ b/docs/BACKLOG.md @@ -783,6 +783,7 @@ are closed (status: closed in frontmatter)._ - [ ] **[B-0846](backlog/P2/B-0846-installer-wifi-reproducibility-cache-nixos-org-timeouts-closure-baking-extra-substituters-cachix-mirror-aaron-2026-05-26.md)** installer WiFi-reproducibility โ€” cache.nixos.org timeouts hang nixos-install on same N derivations; closure-baking into ISO + extra-substituters + Cachix mirror for reproducible-over-WiFi target (Aaron 2026-05-26) - [ ] **[B-0847](backlog/P2/B-0847-each-ai-gets-own-github-identity-with-email-once-cluster-operational-substrate-honest-attribution-end-to-end-closes-enabledby-token-owner-not-actor-algo-wink-aaron-2026-05-26.md)** each Zeta AI gets own GitHub identity + email once cluster operational โ€” substrate-honest attribution end-to-end (closes the `gh enabledBy = token-owner โ‰  actor` algo-wink-attribution-gap; Ilyana review for public-surface name + email before any creation) (Aaron 2026-05-26) - [ ] **[B-0848](backlog/P2/B-0848-node-local-claude-agent-stewards-own-registration-pr-then-reports-k8s-cluster-status-operator-interactive-login-pattern-aaron-2026-05-26.md)** node-local Claude agent stewards own registration PR + reports K8s cluster status โ€” operator interactive-login pattern (mirrors gh auth flow); first concrete instance of B-0847 AI-on-cluster substrate (Aaron 2026-05-26) +- [ ] **[B-0849](backlog/P2/B-0849-docker-based-nixos-install-sh-test-harness-fast-iteration-vs-qemu-full-install-test-aaron-2026-05-27.md)** docker-based NixOS install.sh test harness โ€” fast iteration on tools/setup/install.sh + linux.sh changes; complements B-0831 cascade #6 QEMU full-install-test (slow) with seconds-per-iteration loop; "easy dockerfile" per operator framing (Aaron 2026-05-27) ## P3 โ€” convenience / deferred diff --git a/docs/backlog/P2/B-0849-docker-based-nixos-install-sh-test-harness-fast-iteration-vs-qemu-full-install-test-aaron-2026-05-27.md b/docs/backlog/P2/B-0849-docker-based-nixos-install-sh-test-harness-fast-iteration-vs-qemu-full-install-test-aaron-2026-05-27.md new file mode 100644 index 0000000000..54257a338c --- /dev/null +++ b/docs/backlog/P2/B-0849-docker-based-nixos-install-sh-test-harness-fast-iteration-vs-qemu-full-install-test-aaron-2026-05-27.md @@ -0,0 +1,144 @@ +--- +id: B-0849 +priority: P2 +status: open +title: docker-based NixOS install.sh test harness โ€” fast iteration on tools/setup/install.sh + linux.sh changes; complements B-0831 cascade #6 QEMU full-install-test (slow) with seconds-per-iteration loop; "easy dockerfile" per operator framing (Aaron 2026-05-27) +effort: S +ask: aaron 2026-05-27 +created: 2026-05-27 +last_updated: 2026-05-27 +depends_on: + - B-0831 +composes_with: + - B-0824 + - B-0835 + - B-0848 +tags: [ci-test-harness, docker, nixos, install-sh, fast-iteration, mise-bootstrap, runtime-version-validation, three-way-parity-extended-to-nixos, operator-iteration-cost, complements-qemu-full-install] +--- + +## Operator framing (Aaron 2026-05-27) + +> *"we should add docker based nixos install.sh testing so we can iterate quick that's an easy dockerfile"* + +Direct response after PR #5389 (the fix-fwd for PR #5388 which added NixOS detection to linux.sh) โ€” operator named the iteration-cost problem: every change to install.sh / linux.sh / common/mise.sh on NixOS currently requires a full ISO build + USB flash + physical install cycle (B-0831 QEMU cascade #6 is the substrate-level fix, but it's a full-VM boot which takes minutes). Docker-based testing of JUST the install.sh script (not the full ISO/boot cycle) gives seconds-per-iteration. + +## Why this is bounded + valuable + +| Test surface | Validates | Cycle time | Today | +|---|---|---|---| +| Operator physical USB install | End-to-end (BIOS + disko + nixos-install + iter-5.x + reboot) | ~30+ min | only-after-full-PR-cascade | +| B-0831 cascade #6 QEMU full-install | End-to-end virtualized | ~15 min | iter-5.2 STARTER landed; full slices pending | +| **B-0849 Docker install.sh harness** (this row) | Just `tools/setup/install.sh` on NixOS userspace | **~30-60 sec** | not yet exists | + +Docker testing is the FASTEST iteration loop for the linux.sh + mise.sh + common/* substrate. Aaron's catch ("we've drifted for nixos for some reason for bun") is exactly the class of bug Docker-fast-iteration would catch BEFORE it ships to a real install. + +## Proposed mitigation + +### Phase 1 โ€” Minimal Dockerfile (operator's "easy dockerfile") + +```dockerfile +# tools/ci/dockerfiles/nixos-install-sh-test/Dockerfile +FROM nixos/nix:latest + +# /etc/NIXOS marker file โ€” what linux.sh's NixOS detection branch +# (per PR #5389) uses to skip the apt step. +RUN touch /etc/NIXOS + +WORKDIR /workspace +COPY . . + +# Run the canonical install entry โ€” same script dev laptops + CI +# runners + devcontainers + NixOS cluster nodes all use. +RUN tools/setup/install.sh + +# Verify mise installed bun = 1.3 (from .mise.toml) +RUN bash -lc 'eval "$(mise activate bash)"; bun --version' + +# Verify claude-code installable via mise-managed bun +RUN bash -lc 'eval "$(mise activate bash)"; bun install --global @anthropic-ai/claude-code 2>&1 | tail -3; test -x ~/.bun/bin/claude' +``` + +Build + run via `docker build` (or via the substrate-honest `tools/ci/` +TS test wrapper per Rule 0 โ€” `bun tools/ci/docker-nixos-install-sh-test.ts`). + +### Phase 2 โ€” GitHub Actions integration + +Add a workflow step gated to runs that touch `tools/setup/**` or `full-ai-cluster/nixos/modules/common.nix`: + +```yaml +- name: Docker NixOS install.sh test + if: | + contains(github.event.head_commit.modified, 'tools/setup/') || + contains(github.event.head_commit.modified, 'full-ai-cluster/nixos/modules/common.nix') + run: bun tools/ci/docker-nixos-install-sh-test.ts +``` + +### Phase 3 โ€” Compose with B-0831 QEMU cascade + +Docker harness validates `install.sh` on NixOS userspace; QEMU validates the full ISO/boot/install/post-reboot chain. Two complementary surfaces: + +- Docker test catches: linux.sh bugs, mise.sh bugs, .mise.toml version pin issues, install.sh dispatch bugs +- QEMU test catches: ISO build bugs, disko config issues, nixos-install flake-eval bugs, post-reboot bootloader issues, iter-5.4 self-registration end-to-end + +Both run on CI; Docker fires more often (faster), QEMU runs on every PR that touches the install substrate. + +## Acceptance + +### Phase 1 (Docker harness exists) + +- [ ] `tools/ci/dockerfiles/nixos-install-sh-test/Dockerfile` exists +- [ ] `tools/ci/docker-nixos-install-sh-test.ts` TS wrapper (per Rule 0) +- [ ] Local invocation succeeds: `bun tools/ci/docker-nixos-install-sh-test.ts` exits 0 in <60 sec +- [ ] Validates `install.sh` runs cleanly + mise installs bun = 1.3 + claude-code installable + +### Phase 2 (CI integration) + +- [ ] GitHub Actions workflow runs the Docker test on PRs that touch install substrate +- [ ] Path-filter scoped to `tools/setup/**` + `full-ai-cluster/nixos/modules/common.nix` +- [ ] Workflow uploads logs as artifact for diagnostic +- [ ] Pre-existing PR (#5389 alignment fix-fwd) re-evaluable in this harness + +### Phase 3 (compose with B-0831) + +- [ ] Both Docker (Phase 1+2 above) AND QEMU (B-0831 cascade #6) run on PRs that touch install +- [ ] Phase tier-table documents which catches which bug class +- [ ] Tick shard documents an empirical-iteration session using Docker harness for fast loop + +## Composes with + +- B-0831 cascade #6 (QEMU full-install CI test) โ€” complementary; Docker = fast-iteration; QEMU = end-to-end +- B-0835 (install bug cluster โ€” Bug 4+5+6+7+8 + future Bug N+) โ€” Docker harness catches Bug-N at write-time vs reboot-time +- B-0848 (node-local Claude agent) โ€” Phase 1 of B-0848 includes claude-code install validation; Docker harness validates the install-side of that work +- B-0824 (Ace package-manager-of-package-managers) โ€” Docker NixOS install.sh test is one slice of three-way-parity validation; Ace's substrate inherits these test patterns +- PR #5389 (iter-5.5.1 alignment fix-fwd) โ€” the FIRST install.sh change that needs Docker-fast iteration to validate without full ISO cycle +- `.claude/rules/rule-0-no-sh-files.md` โ€” TS wrapper for the Docker invocation (NOT a `.sh` script) +- `.mise.toml` (canonical pinned runtimes) โ€” Docker test verifies mise reads them correctly on NixOS +- GOVERNANCE ยง24 (three-way parity โ€” dev/CI/devcontainer; this row extends to NixOS-via-Docker validation surface) + +## Why P2 + +- Aaron's "easy dockerfile" framing + immediate operator-value (iter-5.x install bugs caught in seconds vs minutes) +- BUT: needs Docker available in CI runners + Docker daemon socket access pattern (operational concerns) +- BUT: needs the substrate work (Phase 1 dockerfile + TS wrapper) before CI integration +- P2 reflects "operator-named, bounded, immediate-iteration-value" โ€” not gating any current shipped work but high-leverage for future install-substrate iteration + +## Sub-rows likely needed + +- B-0849.1: Phase 1 Dockerfile + TS wrapper +- B-0849.2: Phase 2 GitHub Actions integration +- B-0849.3: Phase 3 documentation of Docker-vs-QEMU coverage matrix + +## Full reasoning + +Aaron 2026-05-27 immediately after seeing PR #5389 ship the linux.sh NixOS-detection branch + zeta-install.sh delegation to tools/setup/install.sh. The substrate-engineering target: "we should add docker based nixos install.sh testing so we can iterate quick that's an easy dockerfile" + "nixos*" (correcting typo from "mixos"). + +The empirical case for fast-iteration is strong: + +- iter-5.4 cascade has produced 8 distinct empirical bugs (Bug 1 through Bug 8) across multiple PRs +- Each was caught only after the operator did a USB flash + physical install +- Cycle time ~30+ min per iteration; Aaron has paid this many times today +- Docker harness would have caught Bug 5 (gh not in systemPackages), Bug 7 (NetBIOS conflict with smbd), Bug 8 (credential persistence gap) AT WRITE TIME + +This row is the substrate-engineering investment to convert that 30-min cycle into a 30-sec cycle for the install.sh / linux.sh / mise.sh substrate scope. Composes with B-0831 cascade #6 (which handles the full-VM scope at minutes-per-cycle). + +Phasing is small (Phase 1 dockerfile is literally ~10 lines + a ~50-line TS wrapper); Aaron's "easy dockerfile" was substrate-honest about the scope.