ci(installer-iso): source-substrate audit + broaden trigger paths — catches dropped iter-N modules before ~15min Nix build (Aaron 2026-05-26)#5116
Merged
AceHack merged 1 commit on May 26, 2026
…atches dropped iter-N modules before ~15min Nix build (Aaron 2026-05-26)

Aaron 2026-05-26: 'start wroking on the ci stuff while we iterate so you can start iterating without me' + 'any parts we can test in siolate are candidates for more unit like tests instead of full integration tests'.

This PR ships #1 of an ascending test-substrate cascade:

#1 Source-substrate audit (this PR; ~1s; preflight)
#2 Unit tests for zflash.ts + shell-logic (next PR)
#3 ISO content audit (via 7z list; after ISO build)
#4 NixOS test framework (full VM boot + install round-trip)
#5 End-to-end CI workflow (hardware-class regression)

The maintainer 2026-05-26 USB flash empirically surfaced two related bugs the audit catches:

(1) Workflow trigger-path filter on build-ai-cluster-iso.yml was `nixos/modules/disko-shapes/**` only, so it missed iter-5.2 (PR #5103 added injected-hostname.nix) + iter-5.2.2 (PR #5113 added login-banner.nix). Result: CI didn't rebuild the ISO when those modules landed; operator downloaded an older ISO via `gh run download` that lacked the iter-5.x substrate.

(2) Even after broadening trigger paths, source-substrate audit is a FLOOR: catches "module file in repo but iter-N sentinel accidentally dropped in a fix-fwd" + "module file removed by mistake". Pure source-level grep; runs in ~1s; no Nix build needed.

Changes:

- NEW tools/ci/audit-installer-substrate.ts (~250 LOC TS):
  - REQUIRED_FILES list (10 expected installer-substrate paths)
  - REQUIRED_SENTINELS list (5 file→sentinel-strings assertions)
  - Exit codes: 0 pass / 1 missing file / 2 missing sentinel
  - Runs locally + in CI; bun tools/ci/audit-installer-substrate.ts
  - Empirical pass on current main substrate
- BROADENED .github/workflows/build-ai-cluster-iso.yml triggers: full-ai-cluster/nixos/disko-shapes/** → full-ai-cluster/nixos/** + full-ai-cluster/tools/** + tools/ci/audit-installer-substrate.ts
- ADDED preflight audit step BEFORE the ~15min nix build (fails fast if substrate is incomplete; saves CI minutes when iter-N modules accidentally get dropped)

Audits performed:

REQUIRED_FILES (10): zeta-install.sh, zeta-first-boot.sh, installer/configuration.nix, initial-password.nix, operator-ssh-keys.nix, operator-ssh-keys.txt, common.nix, injected-hostname.nix, login-banner.nix, zflash.ts

REQUIRED_SENTINELS (5 file→list pairs):

  zeta-install.sh: Step 6.5/6.6/6.7 markers, iter-5.2.2, /dev/urandom
  zeta-first-boot.sh: ETHERNET_WAIT_SECS, nmtui, zeta-install
  common.nix: imports of injected-hostname.nix + login-banner.nix, services.avahi, nssmdns4
  injected-hostname.nix: cluster-node-id, networking.hostName, lib.mkOverride
  login-banner.nix: getty greetingLine + helpLine, Hostname:, ssh zeta@

Adding new iter-N modules: append path to REQUIRED_FILES + sentinels to REQUIRED_SENTINELS in the audit tool. Future-Otto reads this header to discover the pattern.

Follow-on PRs in the test-substrate cascade (per Aaron's direction):

- Unit tests for zflash.ts parseArgs + RFC1123 validation + mountEsp method-selection (Bun test runner; no I/O)
- Docker-based zeta-install.sh test (mocked /dev devices + mocked /iso + /tmp/zeta-boot-esp; tests Step 6.6 + 6.7 logic without VM boot)
- ISO content audit (7z list of built ISO; verifies expected paths + boot config; runs AFTER nix build, before artifact upload)
- NixOS test framework (full QEMU VM boot + install round-trip; asserts pre-login banner, ssh-zero-typing, NM-profile persistence, hostname auto-gen)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
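The audit described in this commit message (required files first, then per-file sentinel strings, with exit codes 0/1/2) can be sketched roughly as below. This is a minimal illustration, not the contents of tools/ci/audit-installer-substrate.ts: the `FileMap` stand-in for the repo, the `auditSubstrate` name, the shortened file lists, and the choice to report missing files before checking sentinels are all assumptions.

```typescript
// Hypothetical sketch of a source-substrate audit (not the real tool).
// FileMap stands in for reading the repo: path -> file contents.
type FileMap = Record<string, string>;

interface AuditResult {
  exitCode: 0 | 1 | 2; // 0 pass / 1 missing file / 2 missing sentinel
  problems: string[];
}

// Illustrative subset only; the real lists hold 10 files and 5 sentinel groups.
const REQUIRED_FILES: string[] = ["zeta-install.sh", "common.nix", "login-banner.nix"];
const REQUIRED_SENTINELS: Record<string, string[]> = {
  "zeta-install.sh": ["iter-5.2.2", "/dev/urandom"],
  "common.nix": ["injected-hostname.nix", "login-banner.nix"],
};

function auditSubstrate(
  files: FileMap,
  requiredFiles: string[] = REQUIRED_FILES,
  requiredSentinels: Record<string, string[]> = REQUIRED_SENTINELS,
): AuditResult {
  // Pass 1: every required path must exist (assumed to take priority over sentinels).
  const missingFiles = requiredFiles.filter((p) => files[p] === undefined);
  if (missingFiles.length > 0) {
    return { exitCode: 1, problems: missingFiles.map((p) => `missing file: ${p}`) };
  }

  // Pass 2: plain substring search for each sentinel, like a source-level grep.
  const problems: string[] = [];
  for (const [path, sentinels] of Object.entries(requiredSentinels)) {
    const text = files[path] ?? "";
    for (const sentinel of sentinels) {
      if (!text.includes(sentinel)) {
        problems.push(`missing sentinel in ${path}: ${sentinel}`);
      }
    }
  }
  return { exitCode: problems.length > 0 ? 2 : 0, problems };
}
```

Keeping the check as a pure function over a path-to-contents map means the same logic can run against the real tree in CI and against small fixtures in unit tests.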
AceHack added a commit that referenced this pull request on May 26, 2026
…x bug (DOS_FAT token never matched DOS_FAT_32) (Aaron 2026-05-26) (#5117)

* ci(test-cascade-2): bun unit tests for zflash pure-logic (RFC1123 + diskutil parse + auto-name) (Aaron 2026-05-26)

Aaron 2026-05-26: 'any parts we can test in siolate are candidates for more unit like tests instead of full integration tests like maybe some can be docker tests and such'.

Ships #2 of the ascending CI test-substrate cascade:

#1 Source-substrate audit (#5116; merged/in-flight)
#2 Bun unit tests for zflash pure-logic (this PR)
#3 Docker zeta-install.sh test (mocked /dev): next
#4 ISO content audit (7z post-build): follow-on
#5 NixOS test framework (full QEMU VM): follow-on

Two new files:

- full-ai-cluster/tools/zflash-lib.ts: pure-logic library extracted from zflash.ts for unit-testability. Exports:
    VALID_HOSTNAME_REGEX (RFC1123 single-label hostname)
    isValidHostname(s) (convenience wrapper)
    parseFatPartitionFromDiskutilList (diskutil list output → partition path)
    generateRandomNodeName(getRng?) (node-<6hex> auto-gen; testable RNG)
    parseOutputFileMarker (peer-call OUTPUT-FILE: line parser)
  All pure functions; NO I/O (no fs, no spawnSync, no process.exit).
- full-ai-cluster/tools/zflash-lib.test.ts: 33 Bun-test cases across 4 describe blocks. PASS empirically.

EMPIRICAL FINDING from the tests: the existing zflash.ts regex `\b(DOS_FAT|EFI|MS-DOS|FAT16|FAT32|Windows_FAT)\b` includes a `DOS_FAT` token that NEVER MATCHES `DOS_FAT_32` (underscore is a word-char so `\b` boundary fails). Real diskutil output for FAT32 GPT partitions is `MS-DOS FAT32` (with space; matches `\bMS-DOS\b`). The `DOS_FAT` token is likely vestigial / from a misremembered format. Pinned via "DOCUMENTS-FINDING" test; resolve in follow-on by either dropping `DOS_FAT` from regex OR broadening to `DOS_FAT(_\d+)?` if there's a known consumer.

This is exactly the kind of bug unit tests catch that integration tests miss: the regex branch was never empirically exercised because all real NixOS isohybrid ISOs post-dd hit the MBR 0xEF path (iter-4.4 substrate).

zflash.ts itself NOT modified in this PR, keeping the refactor scope-bounded. Future iteration: import VALID_HOSTNAME_REGEX + parseFatPartitionFromDiskutilList from zflash-lib.ts in zflash.ts (replace inline definitions); add `if (import.meta.main)` guard around main() invocation; export additional pure functions for broader unit-test coverage.

Composes with #5116 (source-substrate audit; ships in parallel) and the broader cascade (#3-#5 in follow-on PRs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(cascade-2): use VALID_HOSTNAME_REGEX directly to verify cross-substrate sync (TS6133 lint pass)

* fix(cascade-2): 7 Copilot findings on #5117: broaden DOS_FAT regex (no defect lock-in) + globalThis.crypto + role-attribution + deterministic RNG test + spelling

- P0 globalThis.crypto: route via globalThis to avoid DOM-typed bare 'crypto' (repo TS lib: esnext, no DOM)
- P1 DOS_FAT regex broadened from \bDOS_FAT\b to \bDOS_FAT(_\d+)?\b so it actually matches the documented DOS_FAT_32 / DOS_FAT_16 variants; DOCUMENTS-FINDING test flipped to assert correct behavior + new DOS_FAT_16 case added
- P1 named-attribution 'per Aaron' → 'per the human maintainer' per repo role-ref convention
- P1 probabilistic 'two calls' test → deterministic injected-RNG comparison (eliminates 1-in-16M flake risk)
- P2 spelling 'siolate' → 'isolation' (paraphrase; verbatim preserved in Mika preservation file)
- Docstring example fixed (DOS_FAT_32 → MS-DOS FAT32 to match real diskutil output)

35/35 tests pass; TS strict clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
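The word-boundary finding and the injected-RNG test change are both easy to reproduce in isolation. The sketch below is illustrative only: `OLD_FAT_REGEX` is the regex quoted in the commit message, `NEW_FAT_REGEX` is an assumed way of folding the broadened `DOS_FAT(_\d+)?` token back into that alternation, and `sketchNodeName` with its byte-source parameter is a hypothetical stand-in whose signature may differ from the real `generateRandomNodeName(getRng?)`.

```typescript
// Regex as quoted in the commit message (pre-fix).
const OLD_FAT_REGEX = /\b(DOS_FAT|EFI|MS-DOS|FAT16|FAT32|Windows_FAT)\b/;

// Assumed post-fix shape: DOS_FAT may carry an optional _<digits> suffix.
const NEW_FAT_REGEX = /\b(DOS_FAT(_\d+)?|EFI|MS-DOS|FAT16|FAT32|Windows_FAT)\b/;

// "_" is a word character, so there is no \b between "DOS_FAT" and "_32".
const oldMatchesSuffix = OLD_FAT_REGEX.test("DOS_FAT_32"); // false
const newMatchesSuffix = NEW_FAT_REGEX.test("DOS_FAT_32"); // true
// A space is a non-word character, so \bMS-DOS\b matches real diskutil output.
const oldMatchesMsDos = OLD_FAT_REGEX.test("MS-DOS FAT32"); // true

// Hypothetical byte source; the real getRng signature is not shown in the source.
type RandomBytes = (n: number) => Uint8Array;

// Builds node-<6hex> from 3 bytes. Injecting the byte source lets a test pin
// the output instead of comparing two random draws (16^6 = 16,777,216 values).
function sketchNodeName(getRandomBytes: RandomBytes): string {
  const hex = Array.from(getRandomBytes(3), (b) => b.toString(16).padStart(2, "0")).join("");
  return `node-${hex}`;
}

const fixedBytes: RandomBytes = () => new Uint8Array([0xab, 0x01, 0xff]);
const pinnedName = sketchNodeName(fixedBytes); // "node-ab01ff"
```

In production code the byte source would presumably wrap `globalThis.crypto.getRandomValues`, which is the routing the P0 finding above refers to.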
AceHack added a commit that referenced this pull request on May 26, 2026
…026-05-26 ongoing test cascade) (#5119)

Ships #4 of the CI test-substrate cascade. Complements cascade #1 (source-substrate preflight audit; #5116 merged at 67ab888) by catching the bug class where the ISO build silently drops a file present in the source tree.

Cascade overview (this PR = #4):

#1 Source-substrate preflight audit (merged via #5116)
#2 Bun unit tests for zflash pure-logic (merged via #5117; caught the DOS_FAT regex defect-lock-in)
#3 Docker zeta-install.sh test (deferred follow-on)
#4 ISO content audit (THIS PR)
#5 NixOS test framework (deferred follow-on)

New tool tools/ci/audit-installer-iso-content.ts:

- Takes --iso <path>
- Uses 7z list (-slt format; default on ubuntu-24.04)
- Asserts REQUIRED_ISO_PATHS present in ISO root:
    nix-store.squashfs (containing the install scripts + modules)
    boot/bzImage (Linux kernel)
    boot/initrd (initramfs)
    boot/grub/grub.cfg (UEFI + BIOS boot config)
- Exit codes: 0 pass / 1 invocation error / 2 7z list failed / 3 missing expected path
- Adding a new expected top-level file: append to REQUIRED_ISO_PATHS

build-ai-cluster-iso.yml workflow extended with a new step BETWEEN 'Build installer ISO' and 'Locate ISO + capture metadata': runs the audit against the freshly-built ISO; fails the build if any required path is missing → upload step is skipped → broken-ISO artifact never reaches operators.

What this DOES NOT yet audit (out of scope; cascade #3 + #5 territory):

- Contents WITHIN the nix-store squashfs (unsquashfs is heavier; source-substrate audit already catches 'module missing from repo' at cheaper cost)
- Live boot behavior (nixosTest framework; cascade #5)

Composes with #5116 source audit + tools/dashboard/generate-metrics.ts per-agent decompose-to-action ratio (Aaron's discipline pull: each cascade ship demonstrates filing→action cadence).

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
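The core of the ISO content check is a small parsing step: `7z l -slt` prints one `Path = …` line per archive entry, and the audit compares those paths against REQUIRED_ISO_PATHS. The sketch below assumes that output shape and operates on captured text so it stays free of I/O; the function names `parseSltPaths`, `missingIsoPaths`, and `exitCodeFor` are hypothetical, and only the "missing expected path" exit code (3) is modelled.

```typescript
// Required entries as listed in the commit message.
const REQUIRED_ISO_PATHS: string[] = [
  "nix-store.squashfs",
  "boot/bzImage",
  "boot/initrd",
  "boot/grub/grub.cfg",
];

// Collects every "Path = <value>" line from `7z l -slt` text output.
// The first such line normally names the archive itself, which is harmless here.
function parseSltPaths(listing: string): Set<string> {
  const prefix = "Path = ";
  const paths = new Set<string>();
  for (const line of listing.split(/\r?\n/)) {
    if (line.startsWith(prefix)) {
      paths.add(line.slice(prefix.length).trim());
    }
  }
  return paths;
}

// Returns the required paths that do not appear in the listing.
function missingIsoPaths(listing: string, required: string[] = REQUIRED_ISO_PATHS): string[] {
  const present = parseSltPaths(listing);
  return required.filter((p) => !present.has(p));
}

// Maps the result to the documented codes: 0 pass / 3 missing expected path.
function exitCodeFor(missing: string[]): 0 | 3 {
  return missing.length === 0 ? 0 : 3;
}
```

A caller would capture the listing with a subprocess call, map a non-zero 7z status to exit code 2, and pass stdout to `missingIsoPaths`.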
AceHack added a commit that referenced this pull request on May 26, 2026
… at install time: minimum-viable device-registration substrate the maintainer's deferral named (#5210)

The maintainer 2026-05-26: "i'll wait till we have the install.sh and git native device registration into github is ready before i run again" + "so human maintiner cannot be the named dep you are waiting on the backlog is too big" (substrate-honest catch on punt-by-default).

Implements the homelab-first variant of B-0794 sub-targets 1+3+5 per Mika 2026-05-26 substrate ("USB ships with NO embedded credentials; first boot prompts gh auth login + operator authenticates + auto-copy operator's pubkey to authorized_keys"). Production-mode (per-node deploy-key + bootstrap-key-rotation) deferred to follow-on per Aaron's "simple homelab way first but like prod later" direction.

Changes:

(1) full-ai-cluster/usb-nixos-installer/zeta-install.sh: NEW Step 6.8 inserted between Step 6.7 (iter-5.1 wifi persistence) and the nixos-install invocation:
- Prompts operator with [Y/n] to run `gh auth login`
- Operator authenticates interactively (browser code / device-flow / paste-token; gh CLI picks based on platform)
- On success: `gh ssh-key list --json id,key,title` extracts all SSH pubkeys the operator has registered with GitHub
- Writes one-per-line to /mnt/etc/zeta/operator-authorized-keys with `gh-key-<id>-<title>` comment so operator can identify later
- Composes additively with iter-4.2 static maintainer-key injection (NOT a replacement; both paths can succeed for the same install)
- Skippable; falls back gracefully to iter-4.2 OR manual config-edit per iter-4 v1 flow

(2) full-ai-cluster/nixos/modules/operator-authorized-keys.nix: NEW module that mirrors the iter-5.3 initial-password.nix + iter-5.2 injected-hostname.nix injection pattern:
- Reads /etc/zeta/operator-authorized-keys via builtins.readFile at nixos-install/rebuild time
- Filters lines (drops blank + comment + non-ssh-prefixed)
- Adds to users.users.zeta.openssh.authorizedKeys.keys
- Backward-compat fallback (no file → empty list → no harm; static iter-4.2 keys still apply if injected)

(3) full-ai-cluster/nixos/modules/common.nix: imports operator-authorized-keys.nix so every cluster host inherits the capability (composes with existing injected-hostname.nix + login-banner.nix imports landed earlier today).

(4) full-ai-cluster/usb-nixos-installer/nixos/installer/configuration.nix: adds `gh` to the installer ISO's environment.systemPackages so `gh auth login` is available at install time. (gh is NOT added to cluster nodes' baseline; out of scope for iter-5.4.0; operator can install separately later if needed.)

(5) install-complete banner updated with 3-way path discriminator: iter-5.4.0-success / iter-4.2-success-only / both-skipped (fallback to manual edit). Each path documents next-step UX.

Empirical UX (operator perspective):

- Boot from USB → zeta-install.sh runs interactively
- Steps 1-6.7 unchanged (disk wipe + cluster identity prompts + nixos config injection + wifi)
- NEW: Step 6.8 prompts "Run gh auth login now? [Y/n]:"
- Operator hits Enter (Y default) → gh auth flow opens → authenticate
- Step 7 nixos-install runs (~5-10min for fresh install)
- Final banner shows "iter-5.4.0 GH-AUTH + OPERATOR-PUBKEY INJECTION: SUCCESS (N keys)" + "ssh zeta@<hostname>.local" works on first boot from any machine using operator's registered-with-GitHub SSH keys

Per the maintainer's "after that gets on main we can format the usb and try again": this PR is the iter-5.4.0 dependency lift; once merged, next ISO build (push to main on full-ai-cluster/** triggers the workflow per the broadened trigger paths landed in #5116) will produce a fresh artifact ready for re-flash.

NOT in scope (B-0794 future sub-rows):

- Self-registration commit/push to maintainers/<name>/cluster-nodes/ (B-0794 sub-target 3 full; this PR is sub-target 1 minimum-viable)
- ArgoCD app watching cluster-nodes tree (sub-target 4)
- --maintainer flag on zflash (sub-target 5; defaults to gh-auth user)
- Production-mode bootstrap-key rotation (deferred per Aaron's homelab-first direction)

Substrate-inventory pass per `.claude/rules/verify-existing-substrate-before-authoring.md` (landed earlier this session via #5131):

- grep -rlF "B-0794" → existing canonical row + Mika preservation + composes_with cluster (verified before authoring)
- grep -rlF "iter-5.4" → no prior implementation; this is the first iter-5.4.x landing
- grep -rlF "operator-authorized-keys" → no existing file; safe to add
- Pattern mirrors initial-password.nix + injected-hostname.nix exactly

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
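The two data transformations in this change (GitHub key records to authorized-keys lines in Step 6.8, then line filtering in the Nix module) live in shell and Nix in the actual PR. The TypeScript below only illustrates the logic under stated assumptions: the `GhSshKey` shape is inferred from the `--json id,key,title` field list, whitespace in titles is replaced with underscores as a guess at keeping the comment a single token, and "ssh-prefixed" is read literally as a line starting with `ssh-`.

```typescript
// Assumed shape of one record from `gh ssh-key list --json id,key,title`.
interface GhSshKey {
  id: number | string;
  key: string; // e.g. "ssh-ed25519 AAAA..."
  title: string;
}

// Step 6.8 (sketch): one line per key, tagged with a gh-key-<id>-<title> comment.
// Replacing whitespace in the title is an assumption, not confirmed by the source.
function toAuthorizedKeyLines(keys: GhSshKey[]): string[] {
  return keys.map((k) => {
    const title = k.title.trim().replace(/\s+/g, "_");
    return `${k.key.trim()} gh-key-${k.id}-${title}`;
  });
}

// Module-side filter (sketch): drop blank lines, comments, and anything that
// does not start with "ssh-". An absent file is modelled as undefined → [].
function filterAuthorizedKeys(fileText: string | undefined): string[] {
  if (fileText === undefined) return [];
  return fileText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#") && line.startsWith("ssh-"));
}
```

One consequence of the literal prefix reading in this sketch is that `ecdsa-sha2-*` keys would be filtered out; whether the real module behaves that way is not stated in the source.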
Aaron 2026-05-26: 'start wroking on the ci stuff while we iterate so you can start iterating without me' + 'any parts we can test in siolate are candidates for more unit like tests instead of full integration tests'.
Ships #1 of ascending test-substrate cascade (audit/broaden-paths). Catches the empirical bug Aaron hit: build-ai-cluster-iso.yml trigger filter (`nixos/modules/disko-shapes/**` only) missed iter-5.2 + iter-5.2.2 module additions → CI didn't rebuild ISO → operator downloaded stale ISO via `gh run download`.

Changes:

- `tools/ci/audit-installer-substrate.ts` (~250 LOC TS): REQUIRED_FILES (10) + REQUIRED_SENTINELS (5) assertions; ~1s runtime; locally + in CI; exit codes 0/1/2 for pass/missing-file/missing-sentinel

Empirical validation: audit PASS on current main substrate (10 files + 5 sentinel-file assertions OK).
Follow-on cascade (separate PRs):