
ci(installer-iso): source-substrate audit + broaden trigger paths — catches dropped iter-N modules before ~15min Nix build (Aaron 2026-05-26)#5116

Merged
AceHack merged 1 commit into main from otto-cli/ci-installer-substrate-audit-broaden-trigger-paths-2026-05-26
May 26, 2026

@AceHack AceHack commented May 26, 2026

Aaron 2026-05-26: 'start working on the ci stuff while we iterate so you can start iterating without me' + 'any parts we can test in isolation are candidates for more unit like tests instead of full integration tests'.

Ships #1 of ascending test-substrate cascade (audit/broaden-paths). Catches the empirical bug Aaron hit: build-ai-cluster-iso.yml trigger filter (nixos/modules/disko-shapes/** only) missed iter-5.2 + iter-5.2.2 module additions → CI didn't rebuild ISO → operator downloaded stale ISO via gh run download.

Changes:

  • NEW tools/ci/audit-installer-substrate.ts (~250 LOC TS): REQUIRED_FILES (10) + REQUIRED_SENTINELS (5) assertions; ~1s runtime; runs locally + in CI; exit codes 0/1/2 for pass/missing-file/missing-sentinel
  • BROADENED workflow triggers: nixos/disko-shapes/** → all nixos/** + tools/** + the audit tool
  • ADDED preflight audit step BEFORE the ~15min nix build (fail-fast)

Empirical validation: audit PASS on current main substrate (10 files + 5 sentinel-file assertions OK).

Follow-on cascade (separate PRs):

…atches dropped iter-N modules before ~15min Nix build (Aaron 2026-05-26)

Aaron 2026-05-26: 'start working on the ci stuff while we iterate so
you can start iterating without me' + 'any parts we can test in
isolation are candidates for more unit like tests instead of full
integration tests'.

This PR ships #1 of an ascending test-substrate cascade:

  #1 Source-substrate audit (this PR; ~1s; preflight)
  #2 Unit tests for zflash.ts + shell-logic (next PR)
  #3 ISO content audit (via 7z list; after ISO build)
  #4 NixOS test framework (full VM boot + install round-trip)
  #5 End-to-end CI workflow (hardware-class regression)

The maintainer's 2026-05-26 USB flash surfaced two related bugs
that the audit catches:

(1) Workflow trigger-path filter on build-ai-cluster-iso.yml was
    `nixos/modules/disko-shapes/**` only — missed iter-5.2 (PR
    #5103 added injected-hostname.nix) + iter-5.2.2 (PR #5113
    added login-banner.nix). Result: CI didn't rebuild the ISO
    when those modules landed; operator downloaded an older ISO
    via `gh run download` that lacked the iter-5.x substrate.

(2) Even after broadening trigger paths, the source-substrate
    audit is a FLOOR: it catches "module file in repo but iter-N
    sentinel accidentally dropped in a fix-fwd" + "module file
    removed by mistake". Pure source-level grep; runs in ~1s; no
    Nix build needed.
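
To make bug (1) concrete, here is a minimal sketch of why the narrow filter never fired. It is not GitHub's path-filter implementation: `matchesFilter` and `wouldTrigger` are hypothetical helpers that handle only exact paths and a trailing `/**`, and the changed-file path is illustrative.

```typescript
// Hypothetical helper: supports exact paths and a trailing "/**" only.
// GitHub Actions' real `paths:` matching supports richer glob syntax.
function matchesFilter(filter: string, changedPath: string): boolean {
  if (filter.endsWith("/**")) {
    const prefix = filter.slice(0, -2); // "a/b/**" -> "a/b/"
    return changedPath.startsWith(prefix);
  }
  return filter === changedPath;
}

// A workflow runs when any changed path matches any filter.
function wouldTrigger(filters: string[], changedPaths: string[]): boolean {
  return changedPaths.some((p) => filters.some((f) => matchesFilter(f, p)));
}

const narrowFilters = ["nixos/modules/disko-shapes/**"];
const broadFilters = ["nixos/**", "tools/**"];
const changedFiles = ["nixos/modules/injected-hostname.nix"]; // illustrative

console.log(wouldTrigger(narrowFilters, changedFiles)); // false
console.log(wouldTrigger(broadFilters, changedFiles)); // true
```

With the narrow filter, a push that only adds a sibling module produces no workflow run, so the newest downloadable artifact predates the module.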

Changes:

- NEW tools/ci/audit-installer-substrate.ts (~250 LOC TS):
  - REQUIRED_FILES list (10 expected installer-substrate paths)
  - REQUIRED_SENTINELS list (5 file→sentinel-strings assertions)
  - Exit codes: 0 pass / 1 missing file / 2 missing sentinel
  - Runs locally + in CI via `bun tools/ci/audit-installer-substrate.ts`
  - Empirical pass on current main substrate

- BROADENED .github/workflows/build-ai-cluster-iso.yml triggers:
  full-ai-cluster/nixos/disko-shapes/** → full-ai-cluster/nixos/**
  + full-ai-cluster/tools/** + tools/ci/audit-installer-substrate.ts

- ADDED preflight audit step BEFORE the ~15min nix build (fails
  fast if substrate is incomplete; saves CI minutes when iter-N
  modules accidentally get dropped)

Audits performed:

  REQUIRED_FILES (10):
    zeta-install.sh, zeta-first-boot.sh, installer/configuration.nix,
    initial-password.nix, operator-ssh-keys.nix, operator-ssh-keys.txt,
    common.nix, injected-hostname.nix, login-banner.nix, zflash.ts

  REQUIRED_SENTINELS (5 file→list pairs):
    zeta-install.sh:    Step 6.5/6.6/6.7 markers, iter-5.2.2,
                        /dev/urandom
    zeta-first-boot.sh: ETHERNET_WAIT_SECS, nmtui, zeta-install
    common.nix:         imports of injected-hostname.nix +
                        login-banner.nix, services.avahi, nssmdns4
    injected-hostname.nix: cluster-node-id, networking.hostName,
                           lib.mkOverride
    login-banner.nix:   getty greetingLine + helpLine, Hostname:,
                        ssh zeta@

Adding new iter-N modules: append path to REQUIRED_FILES + sentinels
to REQUIRED_SENTINELS in the audit tool. Future-Otto reads this
header to discover the pattern.
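
The pattern described above can be sketched roughly as follows. This is not the code in tools/ci/audit-installer-substrate.ts: the lists are shortened subsets, bare file names stand in for the real repo paths, and `ReadFile`, `SentinelRule`, and `auditSubstrate` are hypothetical names. The file reader is passed in so the logic stays pure and testable.

```typescript
// Hypothetical reader: returns file text, or undefined when the file is missing.
type ReadFile = (path: string) => string | undefined;

interface SentinelRule {
  file: string;
  sentinels: string[];
}

// Illustrative subsets only; the real lists hold 10 files and 5 rules.
const REQUIRED_FILES = ["common.nix", "injected-hostname.nix", "login-banner.nix"];
const REQUIRED_SENTINELS: SentinelRule[] = [
  { file: "common.nix", sentinels: ["services.avahi", "nssmdns4"] },
  { file: "injected-hostname.nix", sentinels: ["networking.hostName", "lib.mkOverride"] },
];

// Returns the exit code listed above: 0 pass, 1 missing file, 2 missing sentinel.
function auditSubstrate(read: ReadFile): number {
  const missingFiles = REQUIRED_FILES.filter((f) => read(f) === undefined);
  if (missingFiles.length > 0) {
    console.error(`missing files: ${missingFiles.join(", ")}`);
    return 1;
  }
  for (const rule of REQUIRED_SENTINELS) {
    const text = read(rule.file) ?? "";
    const absent = rule.sentinels.filter((s) => !text.includes(s));
    if (absent.length > 0) {
      console.error(`${rule.file}: missing sentinels: ${absent.join(", ")}`);
      return 2;
    }
  }
  return 0;
}

// Usage with an in-memory tree standing in for the repo checkout.
const demoTree: Record<string, string> = {
  "common.nix": "services.avahi.enable = true; nssmdns4 = true;",
  "injected-hostname.nix": "networking.hostName = lib.mkOverride 50 name;",
  "login-banner.nix": "# banner module",
};
console.log(auditSubstrate((p) => demoTree[p])); // 0
```

Under this shape, registering a new iter-N module is one entry appended to each list, and the exit code distinguishes a deleted file from a dropped sentinel.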

Follow-on PRs in the test-substrate cascade (per Aaron's direction):

  - Unit tests for zflash.ts parseArgs + RFC1123 validation +
    mountEsp method-selection (Bun test runner; no I/O)
  - Docker-based zeta-install.sh test (mocked /dev devices +
    mocked /iso + /tmp/zeta-boot-esp; tests Step 6.6 + 6.7 logic
    without VM boot)
  - ISO content audit (7z list of built ISO; verifies expected
    paths + boot config; runs AFTER nix build, before artifact
    upload)
  - NixOS test framework (full QEMU VM boot + install round-trip;
    asserts pre-login banner, ssh-zero-typing, NM-profile
    persistence, hostname auto-gen)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 06:51
@AceHack AceHack enabled auto-merge (squash) May 26, 2026 06:51

@AceHack AceHack merged commit 67ab888 into main May 26, 2026
34 of 35 checks passed
@AceHack AceHack deleted the otto-cli/ci-installer-substrate-audit-broaden-trigger-paths-2026-05-26 branch May 26, 2026 06:54
AceHack added a commit that referenced this pull request May 26, 2026
…x bug (DOS_FAT token never matched DOS_FAT_32) (Aaron 2026-05-26) (#5117)

* ci(test-cascade-2): bun unit tests for zflash pure-logic (RFC1123 + diskutil parse + auto-name) (Aaron 2026-05-26)

Aaron 2026-05-26: 'any parts we can test in isolation are candidates
for more unit like tests instead of full integration tests like
maybe some can be docker tests and such'.

Ships #2 of the ascending CI test-substrate cascade:

  #1 Source-substrate audit (#5116; merged/in-flight)
  #2 Bun unit tests for zflash pure-logic (this PR)
  #3 Docker zeta-install.sh test (mocked /dev) — next
  #4 ISO content audit (7z post-build) — follow-on
  #5 NixOS test framework (full QEMU VM) — follow-on

Two new files:

- full-ai-cluster/tools/zflash-lib.ts: pure-logic library extracted
  from zflash.ts for unit-testability. Exports:
    VALID_HOSTNAME_REGEX               (RFC1123 single-label hostname)
    isValidHostname(s)                 (convenience wrapper)
    parseFatPartitionFromDiskutilList  (diskutil list output → partition path)
    generateRandomNodeName(getRng?)    (node-<6hex> auto-gen; testable RNG)
    parseOutputFileMarker              (peer-call OUTPUT-FILE: line parser)
  All pure functions; NO I/O (no fs, no spawnSync, no process.exit).

- full-ai-cluster/tools/zflash-lib.test.ts: 33 Bun-test cases across
  4 describe blocks. PASS empirically.
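
As a rough illustration of two of the exports listed above, the sketch below shows an RFC 1123 single-label check and a node-name generator with an injectable RNG. The regex body, the `GetRandomBytes` signature, and the `WebCryptoLike` cast are assumptions; the definitions in zflash-lib.ts may differ.

```typescript
// Assumed RFC 1123 single-label pattern: 1-63 chars, alphanumeric at both
// ends, hyphens allowed in between. The real regex may differ.
const VALID_HOSTNAME_REGEX = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/i;

function isValidHostname(s: string): boolean {
  return VALID_HOSTNAME_REGEX.test(s);
}

// Assumed RNG shape: a function returning n random bytes.
type GetRandomBytes = (n: number) => Uint8Array;

// Minimal local type so the sketch type-checks without the DOM lib.
type WebCryptoLike = { getRandomValues(a: Uint8Array): Uint8Array };

const defaultRng: GetRandomBytes = (n) =>
  (globalThis as unknown as { crypto: WebCryptoLike }).crypto.getRandomValues(
    new Uint8Array(n),
  );

// node-<6hex>: three random bytes rendered as six lowercase hex digits.
function generateRandomNodeName(getRng: GetRandomBytes = defaultRng): string {
  const hex = Array.from(getRng(3), (b) => b.toString(16).padStart(2, "0")).join("");
  return `node-${hex}`;
}

// Injecting a fixed RNG makes the result deterministic in tests.
const fixedRng: GetRandomBytes = () => Uint8Array.from([0x0a, 0xbc, 0x01]);
console.log(generateRandomNodeName(fixedRng)); // node-0abc01
console.log(isValidHostname("node-0abc01")); // true
console.log(isValidHostname("-bad")); // false
```

Passing the RNG as a parameter lets a test compare exact strings without touching real randomness or any I/O.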

EMPIRICAL FINDING from the tests: the existing zflash.ts regex
`\b(DOS_FAT|EFI|MS-DOS|FAT16|FAT32|Windows_FAT)\b` includes a
`DOS_FAT` token that NEVER MATCHES `DOS_FAT_32` (underscore is a
word-char so `\b` boundary fails). Real diskutil output for FAT32
GPT partitions is `MS-DOS FAT32` (with space; matches `\bMS-DOS\b`).
The `DOS_FAT` token is likely vestigial / from a misremembered
format. Pinned via "DOCUMENTS-FINDING" test; resolve in follow-on
by either dropping `DOS_FAT` from regex OR broadening to
`DOS_FAT(_\d+)?` if there's a known consumer.

This is exactly the kind of bug unit tests catch that integration
tests miss — the regex branch was never empirically exercised
because all real NixOS isohybrid ISOs post-dd hit the MBR 0xEF
path (iter-4.4 substrate).
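
The word-boundary behaviour behind this finding can be reproduced in a few lines. The first regex is the one quoted above; the second applies the `DOS_FAT(_\d+)?` broadening proposed above. The input strings are bare tokens, not full diskutil output lines.

```typescript
// Regex quoted in the commit message above.
const originalFatRegex = /\b(DOS_FAT|EFI|MS-DOS|FAT16|FAT32|Windows_FAT)\b/;
// Broadened form proposed above: optional "_<digits>" suffix on DOS_FAT.
const broadenedFatRegex = /\b(DOS_FAT(_\d+)?|EFI|MS-DOS|FAT16|FAT32|Windows_FAT)\b/;

// "_" is a word character, so there is no \b between "DOS_FAT" and "_32".
console.log(originalFatRegex.test("DOS_FAT_32")); // false
console.log(originalFatRegex.test("MS-DOS FAT32")); // true
console.log(broadenedFatRegex.test("DOS_FAT_32")); // true
console.log(broadenedFatRegex.test("DOS_FAT_16")); // true
```

The same boundary rule applies to any token followed by an underscore suffix: the quoted regex would likewise not match an input such as `Windows_FAT_32`.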

zflash.ts itself NOT modified in this PR — keeping the refactor
scope-bounded. Future iteration: import VALID_HOSTNAME_REGEX +
parseFatPartitionFromDiskutilList from zflash-lib.ts in zflash.ts
(replace inline definitions); add `if (import.meta.main)` guard
around main() invocation; export additional pure functions for
broader unit-test coverage.

Composes with #5116 (source-substrate audit; ships in parallel)
and the broader cascade (#3-#5 in follow-on PRs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(cascade-2): use VALID_HOSTNAME_REGEX directly to verify cross-substrate sync (TS6133 lint pass)

* fix(cascade-2): 7 Copilot findings on #5117 — broaden DOS_FAT regex (no defect lock-in) + globalThis.crypto + role-attribution + deterministic RNG test + spelling

- P0 globalThis.crypto: route via globalThis to avoid DOM-typed
  bare 'crypto' (repo TS lib: esnext, no DOM)
- P1 DOS_FAT regex broadened from \bDOS_FAT\b to \bDOS_FAT(_\d+)?\b
  so it actually matches the documented DOS_FAT_32 / DOS_FAT_16
  variants; DOCUMENTS-FINDING test flipped to assert correct
  behavior + new DOS_FAT_16 case added
- P1 named-attribution 'per Aaron' → 'per the human maintainer'
  per repo role-ref convention
- P1 probabilistic 'two calls' test → deterministic injected-RNG
  comparison (eliminates 1-in-16M flake risk)
- P2 spelling 'siolate' → 'isolation' (paraphrase; verbatim
  preserved in Mika preservation file)
- Docstring example fixed (DOS_FAT_32 → MS-DOS FAT32 to match
  real diskutil output)

35/35 tests pass; TS strict clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 26, 2026
…026-05-26 ongoing test cascade) (#5119)

Ships #4 of the CI test-substrate cascade. Complements cascade #1
(source-substrate preflight audit; #5116 merged at 67ab888) by
catching the bug class where the ISO build silently drops a file
present in the source tree.

Cascade overview (this PR = #4):

  #1 Source-substrate preflight audit (merged via #5116)
  #2 Bun unit tests for zflash pure-logic (merged via #5117;
     caught the DOS_FAT regex defect-lock-in)
  #3 Docker zeta-install.sh test (deferred follow-on)
  #4 ISO content audit (THIS PR)
  #5 NixOS test framework (deferred follow-on)

New tool tools/ci/audit-installer-iso-content.ts:

  - Takes --iso <path>
  - Uses 7z list (-slt format; default on ubuntu-24.04)
  - Asserts REQUIRED_ISO_PATHS present in ISO root:
      nix-store.squashfs   (containing the install scripts + modules)
      boot/bzImage         (Linux kernel)
      boot/initrd          (initramfs)
      boot/grub/grub.cfg   (UEFI + BIOS boot config)
  - Exit codes: 0 pass / 1 invocation error / 2 7z list failed /
    3 missing expected path
  - Adding a new expected top-level file: append to REQUIRED_ISO_PATHS
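
A minimal sketch of the path assertion, assuming the tool reads `Path = ...` lines from `7z l -slt` style output. `parseListedPaths` and `missingIsoPaths` are hypothetical names, the sample listing is hand-written, and the 7z invocation itself is left out so the logic stays pure.

```typescript
const REQUIRED_ISO_PATHS = [
  "nix-store.squashfs",
  "boot/bzImage",
  "boot/initrd",
  "boot/grub/grub.cfg",
];

// Collects every "Path = <value>" line from the technical listing.
function parseListedPaths(sltOutput: string): Set<string> {
  const marker = "Path = ";
  const paths = new Set<string>();
  for (const line of sltOutput.split(/\r?\n/)) {
    if (line.startsWith(marker)) {
      // Normalise separators in case the listing uses backslashes.
      paths.add(line.slice(marker.length).replace(/\\/g, "/"));
    }
  }
  return paths;
}

// Returns the required paths that the listing does not contain.
function missingIsoPaths(
  sltOutput: string,
  required: string[] = REQUIRED_ISO_PATHS,
): string[] {
  const listed = parseListedPaths(sltOutput);
  return required.filter((p) => !listed.has(p));
}

// Hand-written sample in the shape of a technical listing.
const sampleListing = [
  "Path = installer.iso",
  "Type = Iso",
  "",
  "----------",
  "Path = boot/bzImage",
  "Path = boot/initrd",
  "Path = nix-store.squashfs",
].join("\n");

console.log(missingIsoPaths(sampleListing).join(", ")); // boot/grub/grub.cfg
```

In terms of the exit codes above, a non-empty result would map to exit code 3, while a failed listing (exit code 2) would be detected before this function is called.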

build-ai-cluster-iso.yml workflow extended with a new step BETWEEN
'Build installer ISO' and 'Locate ISO + capture metadata': runs
the audit against the freshly-built ISO; fails the build if any
required path is missing → upload step is skipped → broken-ISO
artifact never reaches operators.

What this DOES NOT yet audit (out of scope; cascade #3 + #5
territory):

  - Contents WITHIN the nix-store squashfs (unsquashfs is heavier;
    source-substrate audit already catches 'module missing from
    repo' at cheaper cost)
  - Live boot behavior (nixosTest framework; cascade #5)

Composes with #5116 source audit + tools/dashboard/generate-metrics.ts
per-agent decompose-to-action ratio (Aaron's discipline pull —
each cascade ship demonstrates filing→action cadence).

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@AceHack AceHack review requested due to automatic review settings May 26, 2026 07:16
AceHack added a commit that referenced this pull request May 26, 2026
… at install time — minimum-viable device-registration substrate the maintainer's deferral named (#5210)

The maintainer 2026-05-26: "i'll wait till we have the install.sh and git
native device registration into github is ready before i run again" +
"so human maintainer cannot be the named dep you are waiting on the
backlog is too big" (substrate-honest catch on punt-by-default).

Implements the homelab-first variant of B-0794 sub-targets 1+3+5 per
Mika 2026-05-26 substrate ("USB ships with NO embedded credentials;
first boot prompts gh auth login + operator authenticates + auto-copy
operator's pubkey to authorized_keys"). Production-mode (per-node
deploy-key + bootstrap-key-rotation) deferred to follow-on per Aaron's
"simple homelab way first but like prod later" direction.

Changes:

(1) full-ai-cluster/usb-nixos-installer/zeta-install.sh — NEW Step 6.8
    inserted between Step 6.7 (iter-5.1 wifi persistence) and the
    nixos-install invocation:
    - Prompts operator with [Y/n] to run `gh auth login`
    - Operator authenticates interactively (browser code / device-flow /
      paste-token — gh CLI picks based on platform)
    - On success: `gh ssh-key list --json id,key,title` extracts all
      SSH pubkeys the operator has registered with GitHub
    - Writes one-per-line to /mnt/etc/zeta/operator-authorized-keys
      with `gh-key-<id>-<title>` comment so operator can identify
      later
    - Composes additively with iter-4.2 static maintainer-key injection
      (NOT a replacement; both paths can succeed for the same install)
    - Skippable; falls back gracefully to iter-4.2 OR manual config-edit
      per iter-4 v1 flow

(2) full-ai-cluster/nixos/modules/operator-authorized-keys.nix — NEW
    module that mirrors the iter-5.3 initial-password.nix +
    iter-5.2 injected-hostname.nix injection pattern:
    - Reads /etc/zeta/operator-authorized-keys via builtins.readFile
      at nixos-install/rebuild time
    - Filters lines (drops blank + comment + non-ssh-prefixed)
    - Adds to users.users.zeta.openssh.authorizedKeys.keys
    - Backward-compat fallback (no file → empty list → no harm; static
      iter-4.2 keys still apply if injected)

(3) full-ai-cluster/nixos/modules/common.nix — imports
    operator-authorized-keys.nix so every cluster host inherits the
    capability (composes with existing injected-hostname.nix +
    login-banner.nix imports landed earlier today).

(4) full-ai-cluster/usb-nixos-installer/nixos/installer/configuration.nix
    — adds `gh` to the installer ISO's environment.systemPackages so
    `gh auth login` is available at install time. (gh is NOT added to
    cluster nodes' baseline; out of scope for iter-5.4.0; operator can
    install separately later if needed.)

(5) install-complete banner updated with 3-way path discriminator:
    iter-5.4.0-success / iter-4.2-success-only / both-skipped (fallback
    to manual edit). Each path documents next-step UX.
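
The key-handling logic in changes (1) and (2) lives in shell and Nix. The TypeScript below only illustrates the data flow under stated assumptions: the `GhSshKey` shape is inferred from the `id,key,title` field names, and `toAuthorizedKeysLines` and `filterAuthorizedKeys` are hypothetical names.

```typescript
// Assumed shape of one entry from `gh ssh-key list --json id,key,title`.
interface GhSshKey {
  id: number | string;
  key: string;
  title: string;
}

// Step 6.8 side: one line per key, tagged with a gh-key-<id>-<title> comment.
function toAuthorizedKeysLines(keys: GhSshKey[]): string[] {
  return keys.map((k) => `${k.key.trim()} gh-key-${k.id}-${k.title}`);
}

// Module side: drop blank lines, comments, and anything that does not
// start with "ssh-", mirroring the filter described in change (2).
function filterAuthorizedKeys(fileText: string): string[] {
  return fileText
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#") && line.startsWith("ssh-"));
}

// Usage: the key material is a placeholder, not a real key.
const written = toAuthorizedKeysLines([
  { id: 1, key: "ssh-ed25519 AAAAexample", title: "laptop" },
]);
const fileText = ["# written at install time", "", ...written, "not-a-key"].join("\n");
console.log(filterAuthorizedKeys(fileText).join("\n"));
// ssh-ed25519 AAAAexample gh-key-1-laptop
```

An empty string passed to `filterAuthorizedKeys` yields an empty list, which corresponds to the "no file → empty list" fallback in change (2).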

Empirical UX (operator perspective):
- Boot from USB → zeta-install.sh runs interactively
- Steps 1-6.7 unchanged (disk wipe + cluster identity prompts + nixos
  config injection + wifi)
- NEW: Step 6.8 prompts "Run gh auth login now? [Y/n]:"
- Operator hits Enter (Y default) → gh auth flow opens → authenticate
- Step 7 nixos-install runs (~5-10min for fresh install)
- Final banner shows "iter-5.4.0 GH-AUTH + OPERATOR-PUBKEY INJECTION:
  SUCCESS (N keys)" + "ssh zeta@<hostname>.local" works on first boot
  from any machine using operator's registered-with-GitHub SSH keys

Per the maintainer's "after that gets on main we can format the usb
and try again" — this PR is the iter-5.4.0 dependency lift; once
merged, next ISO build (push to main on full-ai-cluster/** triggers
the workflow per the broadened trigger paths landed in #5116) will
produce a fresh artifact ready for re-flash.

NOT in scope (B-0794 future sub-rows):
- Self-registration commit/push to maintainers/<name>/cluster-nodes/
  (B-0794 sub-target 3 full; this PR is sub-target 1 minimum-viable)
- ArgoCD app watching cluster-nodes tree (sub-target 4)
- --maintainer flag on zflash (sub-target 5; defaults to gh-auth user)
- Production-mode bootstrap-key rotation (deferred per Aaron's
  homelab-first direction)

Substrate-inventory pass per `.claude/rules/verify-existing-substrate-
before-authoring.md` (landed earlier this session via #5131):
- grep -rlF "B-0794" → existing canonical row + Mika preservation +
  composes_with cluster (verified before authoring)
- grep -rlF "iter-5.4" → no prior implementation; this is the first
  iter-5.4.x landing
- grep -rlF "operator-authorized-keys" → no existing file; safe to add
- Pattern mirrors initial-password.nix + injected-hostname.nix exactly

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>