feat(B-0789 iter-4.3): zflash stale-checkout detection + auto-download fresh ISO from CI#5091
Merged
AceHack merged 1 commit intoMay 26, 2026
Conversation
…d fresh ISO from CI — closes 2 gaps surfaced by 2026-05-26 empirical iter-4.2 test Per maintainer 2026-05-26 *"any fixes lets make sure they make it in main"* + *"does the script not auto download the latest?"* + *"we want to run what a contributor will run"* + *"no rush we can wait on main we are going for right not fast"*. Two gaps surfaced when I ran zflash from the operator's local checkout (HEAD 89a39ea = pre-iter-4.2) with an iter-4.2 ISO: 1. Flash ran the OLD zflash (no --no-eject + no inject step). USB came out bootable but silently WITHOUT operator-ssh-keys.txt populated. The iter-4.2 zero-typing target failed silently because the local zflash code itself was stale, not because the iter-4.2 substrate on main was wrong. 2. The May 25 ISO in ~/Downloads was iter-3-era. I had to manually `gh run download` the fresh CI artifact (run 26432433541) before zflash had anything usable to flash. Contributor flow today requires the operator to remember to pull fresh ISOs. iter-4.3 closes both gaps in full-ai-cluster/tools/zflash.ts: checkLocalCheckoutFreshness(): - findRepoRoot() walks up from zflash.ts location to find .git - Best-effort `git fetch origin main --quiet` (offline = warn + skip) - For each INSTALL_SUBSTRATE_FILES entry (zflash.ts, flash-usb.ts, zeta-install.sh, flake.nix, the 3 nix modules + .txt), checks `git diff --quiet HEAD origin/main -- <file>` → status 1 = stale - If ANY file is stale → bail loud with specific remediation: "git -C <root> pull --rebase origin main" or "zflash --skip-freshness-check" escape hatch - Eliminates the silent-stale-code class autoDownloadFreshIsoIfNeeded(): - Queries `gh api .../actions/workflows/build-ai-cluster-iso.yml/ runs?branch=main&status=success&per_page=1` - If latest run's updated_at > local newest ISO's mtime, pulls via `gh run download <run-id> --dir /tmp/zflash-ci-iso-<run-id>` - Walks the dl dir to find the .iso file (artifact dir nesting) - Copies to ~/Downloads/zeta-installer-24.11-ci<run-id>-<date>.iso so future autoDiscoverIso() picks it as newest - Skipped when explicit ISO path passed OR --skip-iso-pull set - Offline / gh failure → falls back to local newest with warning Two new opt-out flags + allowlist updated: - --skip-freshness-check: bypass stale-check (not recommended) - --skip-iso-pull: bypass CI ISO auto-pull (use local newest) Help text updated to surface the new flags + the auto-pull behavior. Composes with B-0759 (first-time-CLI-user UX), B-0780 (tier-2 dev experience — Max + Addison's onboarding will benefit), iter-4.2 substrate (#5083 / #5086 / #5088). Substrate-honest follow-on: captures the lessons-learned from the maintainer's first actual test of the iter-4.2 flow. Co-Authored-By: Claude <noreply@anthropic.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
Pull request overview
Adds iter-4.3 safeguards to the zflash macOS USB flasher to prevent two common “stale inputs” failure modes: running an outdated checkout (missing installer-substrate behavior) and flashing an outdated local ISO when CI has produced a newer one.
Changes:
- Add a local-checkout freshness gate that fetches
origin/mainand refuses to proceed if install-substrate files differ (with an escape-hatch flag). - Add CI-backed ISO freshness logic that queries the latest successful workflow run and auto-downloads a newer ISO artifact when needed (with an opt-out flag).
- Extend CLI flag allowlist +
--helpoutput to include--skip-freshness-checkand--skip-iso-pull.
4 tasks
AceHack
pushed a commit
that referenced
this pull request
May 26, 2026
… TS strict + 1 P0 auth-misclassification + 1 P1 ENOENT diagnostic) Three findings from Copilot on the iter-4.3 fixfwd PR #5093 that landed post-merge of #5091 but pre-merge of #5093 itself. P0 (PRRT_kwDOSF9kNM6Er1uQ) — TypeScript strict mode: `let dlDest: string | null = null` then `return dlDest` violates the function's `string` return type. CI's tsc check would have failed merge. Fix: initialize `dlDest: string = localIso` so it's always string; overwrite on copy success. P0 (PRRT_kwDOSF9kNM6Er1un) — auth-as-network misclassification: the freshness-fetch failure classifier matched "could not read from remote repository" as a network signal, but that string also appears on auth / permission failures (private repo, expired credential, etc.). Auth failures would have been silently downgraded to "offline" warn, defeating the freshness guard. Fix: (a) auth signals matched FIRST with explicit auth-pattern list ("permission denied", "authentication failed", "fatal: authentication", "invalid credentials", "bad credentials", "could not read username", "support for password authentication was removed", "403 forbidden", "401 unauthorized"); (b) network signals require actual host/connectivity word (removed "could not read from remote repository" from network list — needs a separate host/connectivity signal to qualify as network); (c) auth- flavored bail surfaces "(AUTH-flavored)" tag + remediation suggests re-auth via gh auth login. P1 (PRRT_kwDOSF9kNM6Er1uw) — ENOENT diagnostic: when git binary is missing, the exec throws without stderr. Prior diagnostic said "(no stderr captured)" — hides the actual cause (ENOENT). Fix: capture `e.message` + `e.code` (when present) + surface in the bail diagnostic. Common-causes list also adds "git not installed / not on PATH (e.code === 'ENOENT')" as the first entry. All 3 fixes layered onto the existing iter-4.3 freshness-check function. The auth-misclassification is the highest-impact (security- meaningful) because it would have made the freshness guard fail-open for auth-failure scenarios. Co-Authored-By: Claude <noreply@anthropic.com>
AceHack
added a commit
that referenced
this pull request
May 26, 2026
…-guard-defeat + 2 P1 + 1 P2) (#5093) * fix(B-0789 iter-4.3 fixfwd): 4 Copilot findings on #5091 (1 P0 silent-guard-defeat + 2 P1 + 1 P2) PR #5091 (iter-4.3 stale-checkout + auto-ISO-download) auto-merged with required checks green; 4 substantive Copilot findings landed post-merge. The P0 silently defeats the safety guard the whole PR exists to provide — must fix-forward before maintainer's next iter-4.2 test. P0 (PRRT_kwDOSF9kNM6Eryew): Per-file `git diff` loop silently skipped non-0/1 exit codes (git error = file missing, repo bad state, origin/ main ref missing). Result: false "up-to-date" verdict when the safety guard should have surfaced a real problem. Now treats non-0/1 status as HARD FAIL: collects (file, status, stderr) tuples; if any errored, bails with the errored files + common-cause diagnostic + escape hatch. P1 (PRRT_kwDOSF9kNM6Eryei): `git fetch` failure was masked as "offline"; real errors (git missing, no origin, auth failure, rate-limit) got swallowed. Now captures stderr; discriminates known network signals ("could not resolve host", "connection refused", "connection timed out", "network is unreachable", "no route to host", "operation timed out", "could not read from remote repository", "ssh: connect to host") from other failures; only the network class degrades to "offline" warn. Non-network failures bail with stderr + common causes + escape hatch suggestion (--skip-freshness-check). P1 (PRRT_kwDOSF9kNM6Erye3): Error text said "behind origin/main" but content-diff fires on behind OR ahead OR diverged. Reworded to "differs from origin/main" + remediation now lists all three cases: behind → git pull --rebase; ahead/diverged → log HEAD..origin/main + log origin/main..HEAD + push as PR OR rebase / merge as appropriate; escape hatch unchanged. P2 (PRRT_kwDOSF9kNM6Erye8): `/tmp/zflash-ci-iso-<runId>` was a stable path; never cleaned up; could re-use partial download from interrupted run. Now uses mkdtempSync(`zflash-ci-iso-<runId>-`) + try/finally with rmSync to guarantee cleanup whether copy succeeded or threw. Imports added: mkdtempSync, rmSync, tmpdir. Composes with iter-4.3 substrate (#5091); this fix-forward makes the freshness guard actually trustworthy + makes the auto-download path properly hygienic. Co-Authored-By: Claude <noreply@anthropic.com> * fix(B-0789 iter-4.3 fixfwd-2): 3 more Copilot findings on #5093 (1 P0 TS strict + 1 P0 auth-misclassification + 1 P1 ENOENT diagnostic) Three findings from Copilot on the iter-4.3 fixfwd PR #5093 that landed post-merge of #5091 but pre-merge of #5093 itself. P0 (PRRT_kwDOSF9kNM6Er1uQ) — TypeScript strict mode: `let dlDest: string | null = null` then `return dlDest` violates the function's `string` return type. CI's tsc check would have failed merge. Fix: initialize `dlDest: string = localIso` so it's always string; overwrite on copy success. P0 (PRRT_kwDOSF9kNM6Er1un) — auth-as-network misclassification: the freshness-fetch failure classifier matched "could not read from remote repository" as a network signal, but that string also appears on auth / permission failures (private repo, expired credential, etc.). Auth failures would have been silently downgraded to "offline" warn, defeating the freshness guard. Fix: (a) auth signals matched FIRST with explicit auth-pattern list ("permission denied", "authentication failed", "fatal: authentication", "invalid credentials", "bad credentials", "could not read username", "support for password authentication was removed", "403 forbidden", "401 unauthorized"); (b) network signals require actual host/connectivity word (removed "could not read from remote repository" from network list — needs a separate host/connectivity signal to qualify as network); (c) auth- flavored bail surfaces "(AUTH-flavored)" tag + remediation suggests re-auth via gh auth login. P1 (PRRT_kwDOSF9kNM6Er1uw) — ENOENT diagnostic: when git binary is missing, the exec throws without stderr. Prior diagnostic said "(no stderr captured)" — hides the actual cause (ENOENT). Fix: capture `e.message` + `e.code` (when present) + surface in the bail diagnostic. Common-causes list also adds "git not installed / not on PATH (e.code === 'ENOENT')" as the first entry. All 3 fixes layered onto the existing iter-4.3 freshness-check function. The auth-misclassification is the highest-impact (security- meaningful) because it would have made the freshness guard fail-open for auth-failure scenarios. Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Lior <lior@zeta.dev> Co-authored-by: Claude <noreply@anthropic.com>
4 tasks
AceHack
added a commit
that referenced
this pull request
May 26, 2026
…t_msdos fallback for NixOS isohybrid post-dd ESP (#5099) 2026-05-26 empirical test of #5093 (iter-4.3 fix-forward) surfaced TWO bugs in the iter-4.2 inject step that the auto-diagnostic substrate caught loud + photo-friendly: (1) ESP detection regex at line 481 missed MBR 0xEF partition type. NixOS install ISOs use isohybrid layout; after dd to USB, macOS's diskutil reads the FDisk MBR partition table and reports the FAT ESP with TYPE='0xEF' (the MBR numeric type code for "EFI System Partition") rather than diskutil's GPT-style 'EFI EFI' or 'DOS_FAT_32' labels. The existing regex \b(DOS_FAT|EFI|MS-DOS|FAT16|FAT32|Windows_FAT)\b matched the GPT cases but not the MBR cases. (2) Even with detection fixed, 'diskutil mount /dev/disk6s2' fails on MBR 0xEF partitions because macOS's diskutil auto-probe doesn't recognize the FAT12 filesystem-in-0xEF combination as mountable. diskutil info reports 'File System: None' even though 'sudo file -s /dev/disk6s2' clearly shows the FAT12 EFIBOOT volume is there + writable. The fix: explicit mount_msdos against a mkdtemp mount point as fallback. Changes: - findFatPartition() regex extended to ALSO match MBR partition type codes 0x(EF|0C|0E|06|0B|0F) covering EFI System Partition + FAT16 + FAT32 + FAT32-LBA + FAT16-LBA + Extended-LBA. \b on both sides prevents accidental match inside longer hex strings. - New mountEsp(espPart) helper: tries 'diskutil mount' first (GPT case), falls back to 'sudo mount_msdos -o nodev,nosuid <part> <mkdtemp>' for MBR 0xEF case. Returns method tag so unmount matches. - New unmountEsp(espPart, result) helper: dispatches to 'diskutil unmount' or 'sudo umount + rmSync(tmpdir)' based on method. - injectPubkeyToUsb() simplified: single mountEsp call replaces the diskutil-mount + getMountPoint + error-path-unmount duplication; single unmountEsp call at success + error paths. - Diagnostic message updated: 'mountEsp ${espPart} failed (both diskutil + mount_msdos paths)' tells the operator BOTH paths were exhausted before bail. Empirical validation 2026-05-26: manually ran 'sudo mount_msdos -o nodev,nosuid /dev/disk6s2 /tmp/zeta-esp-mount' on the post-dd USB (124GB PNY, 3.1MB 0xEF ESP labeled EFIBOOT) — mount succeeded; wrote zeta-authorized-keys.pub via 'sudo cp ~/.ssh/ id_ed25519.pub /tmp/zeta-esp-mount/...'; unmounted + ejected cleanly. The substrate the script now ships matches what worked empirically. Composes with iter-4.3 (#5091) + iter-4.3 fixfwd (#5093) + B-0789 parent + B-0790 end-state (zero-typing SSH after first cluster boot is one of the load-bearing steps toward zero-dev-machine homelab persona). Co-authored-by: Lior <lior@zeta.dev> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
5 tasks
AceHack
added a commit
that referenced
this pull request
May 26, 2026
…Avahi mDNS + per-node hostname injection (decouple from role-stack) (Aaron 2026-05-26) (#5103) * feat(B-0792 iter-5.1): persist NetworkManager profiles from live installer to installed system + enable Avahi mDNS publishing Aaron 2026-05-26 surfaced after iter-4.2 PC1 empirical test: > "we won't have ethernet for most machines it needs to > remember the wifi on setup" > "completely self contained usb we already try eth for 30 > seconds and then ask for wifi we just need to remember it > afterwards" REVISED iter-5.1 design (NO operator-side credential pipeline; no keychain extract; no JSON file; no CLI flags): completely self-contained on USB. Existing flow already works through nmtui: 1. zeta-first-boot.sh waits 30s for ethernet DHCP 2. If absent, launches nmtui (single TUI form on cluster console) 3. Operator picks wifi SSID + enters password ONCE 4. NetworkManager writes profile to /etc/NetworkManager/system-connections/<ssid>.nmconnection 5. Installer connects + runs zeta-install.sh 6. **BUG (today)**: nixos-install installs fresh system that inherits NetworkManager service but NOT the operator's connection profile 7. **FIX (iter-5.1)**: zeta-install.sh copies *.nmconnection files from live installer to /mnt before nixos-install runs 8. Reboot → installed NixOS NetworkManager loads the profile → wifi reconnects automatically Two changes: 1. full-ai-cluster/usb-nixos-installer/zeta-install.sh: New Step 6.5 (before nixos-install): detect + copy /etc/NetworkManager/system-connections/*.nmconnection from live installer to /mnt/etc/NetworkManager/system-connections/. chmod 0600 + chown root:root (NM requires; else profiles silently ignored at boot with "permissions not strict enough" warning in journalctl). Photo-friendly disclosure per profile: "[iter-5.1] persisted: <name>.nmconnection (ssid=<ssid>)". Never prints psk. Skips cleanly if /etc/NetworkManager/ system-connections doesn't exist OR has no .nmconnection files (ethernet-DHCP path; no profile to copy). 2. full-ai-cluster/nixos/modules/common.nix: Enable services.avahi for mDNS publishing so `ssh zeta@control-plane.local` resolves from operator Mac (Bonjour) and Linux peers (nss-mdns) on the LAN without IP-discovery step. Empirical anchor: 2026-05-26 iter-4.2 test surfaced the gap when SSH-by-hostname.local failed even though node was up. services.avahi = { enable = true; nssmdns4 = true; openFirewall = true; # 5353/udp publish = { enable = true; addresses = true; workstation = true; domain = true; }; }; Composes with iter-4.x (#5080→#5083→#5086→#5088→#5091→#5093→ #5099) substrate. Acceptance: empirical wifi-only mini-PC bring- up → nmtui-once at install → reboot → ssh zeta@<hostname>.local zero-typing zero-console from operator Mac. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(B-0792 iter-5.2): per-node hostname injection — decouple hostname from role-stack (Aaron 2026-05-26) Aaron 2026-05-26 architectural framing: > "make any multi node changes we need to like think though > mdns names when we have two control planes" > "since our different roles are multi install you can be > control plane AND gpu node AND cpu node these distinctions > are not very elegant and host names tied to them are not > great either" Bug: every node installed from --flake .#control-plane gets hostname "control-plane" (baked in flake config); two such nodes collide on mDNS (Avahi auto-renames second to "control-plane-2.local" but underlying NixOS hostname stays "control-plane" — confusing in logs / journalctl / kubectl / node-labels). And role-tied hostname pattern is architecturally broken — a single node can be control-plane AND gpu-worker AND storage simultaneously. iter-5.2 fix: SEPARATE hostname identity from role-stack selection. Three changes: 1. nixos/modules/injected-hostname.nix (NEW): NixOS module that reads /etc/zeta/cluster-node-id at evaluation time + overrides networking.hostName via lib.mkOverride 50. If file doesn't exist OR is empty OR invalid, the per-host flake config default stays in effect — backward-compatible with single-node zero-typing path. 2. nixos/modules/common.nix: import injected-hostname.nix so EVERY host (control-plane, worker-gpu, worker-template, future configs) gets the override capability transitively via common.nix's existing import-chain. 3. tools/zflash.ts: add --host <name> flag with RFC1123 validation at flag-parse time (alphanumeric + hyphens, 1-63 chars, no leading/trailing hyphen). When passed, write zeta-hostname.txt to USB ESP in the same mount session as zeta-authorized-keys.pub (covered by same sudo timestamp window; no additional Touch ID). 4. usb-nixos-installer/zeta-install.sh: new Step 6.4 (before nixos-install) — probe USB for zeta-hostname.txt; if present + valid RFC1123, write to /mnt/etc/zeta/ cluster-node-id (mode 0644). injected-hostname.nix module picks it up at NixOS evaluation time. Backward-compatible: if no zeta-hostname.txt, flake default stays. Empirical UX: # Single-node, zero-typing (today's path; unchanged): zflash # → hostname stays 'control-plane'; ssh zeta@control-plane.local # Multi-node, one short flag per USB: zflash --host pikachu # → ssh zeta@pikachu.local zflash --host charizard # → ssh zeta@charizard.local zflash --host bulbasaur # → ssh zeta@bulbasaur.local # No flake explosion; all three install from .#control-plane # role-stack but each gets unique hostname + mDNS announcement. The architectural concern Aaron raised (role-as-capability composition; one node = control-plane AND gpu-worker AND storage) is BEYOND iter-5.2 scope — refactor of nixos/hosts/<role>/configuration.nix → composable nixos/modules/role-*.nix capability modules — filed separately as B-0793 follow-up. Composes with iter-5.1 in same PR; together they ship "completely self-contained USB" per Aaron's discipline: nmtui-once-at-install for wifi, --host <name>-at-zflash for multi-node identity, NetworkManager profile persistence + hostname injection at install time, mDNS publishing for zero-IP-discovery SSH-by-hostname.local from operator Mac. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(B-0792 iter-5.1+5.2): 4 Copilot findings — Step 6.5 dup, personal-name attribution, nullglob comment, SSID truncation on '=' - P1: rename Step 6.4/6.5 to Step 6.6/6.7 (existing Step 6.5 for iter-4.2 pubkey probe at line 229; renumber my additions to avoid ambiguous labels in install logs) - P1: replace "Aaron 2026-05-26" with "the maintainer 2026-05-26" in 2 comment blocks (repo convention: role-based attribution in non-history surfaces) - P2: update nullglob comment — code uses find not glob; describe that find + filter handles empty-dir naturally without nullglob shell option - P2: SSID extraction from .nmconnection — replace `awk -F= '/^ssid=/{print $2}'` (truncates at first '=') with `sed -n 's/^ssid=//p'` (preserves SSIDs containing '=' per 802.11 spec). Log accuracy fix for all SSIDs. shellcheck clean post-fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Lior <lior@zeta.dev> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes 2 gaps surfaced by the 2026-05-26 empirical iter-4.2 test run. Per maintainer signals:
Two gaps closed
Gap 1 — stale-checkout silent failure
When zflash runs from a checkout that's behind
origin/mainon install-substrate files, it executes the OLD code (which doesn't have the iter-4.2 inject step). USB comes out bootable but silently WITHOUToperator-ssh-keys.txtpopulated. Zero-typing target fails for non-obvious reason.Fix:
checkLocalCheckoutFreshness()— at zflash start:git fetch origin main --quiet(offline → warn + skip)INSTALL_SUBSTRATE_FILESentry (zflash.ts, flash-usb.ts, zeta-install.sh, flake.nix, 3 nix modules + .txt), rungit diff --quiet HEAD origin/main -- <file>git pull --rebase origin mainor--skip-freshness-checkescape hatch)Gap 2 — stale local ISO
Contributor has to manually
gh run downloadfresh CI ISO when the workflow regenerates. Today nobody does this until something breaks.Fix:
autoDownloadFreshIsoIfNeeded()— after ISO discovery, before flash:gh api .../workflows/build-ai-cluster-iso.yml/runs?branch=main&status=success&per_page=1updated_at> local newest ISO's mtime, pulls viagh run download→ walks artifact dir → copies to~/Downloads/zeta-installer-24.11-ci<run-id>-<date>.iso--skip-iso-pullsetNew flags
--skip-freshness-check: bypass stale-checkout detection (escape hatch; warn)--skip-iso-pull: bypass CI-ISO auto-download (use local newest)Composes with
Test plan
--bogus→ "unknown flag(s): --bogus"])After this lands, the iter-4.2 maintainer test gets a second attempt with auto-protections in place.
🤖 Generated with Claude Code