feat(B-0789 iter-4.2): zflash auto-inject SSH pubkey to boot USB ESP + zeta-install.sh probe — zero-typing SSH on first boot#5083
Merged
…SP + zeta-install.sh USB probe → zero-typing SSH access on first cluster boot
The maintainer's actually-usable iter-4 path (v1 was scaffolding-only;
this PR is the workflow Aaron will test against). Per Aaron's 2026-05-26
discipline signals:
1. "we can do what's going to make cluster setup eaiser for me and not
users if that's ssh lets do that first cause we want to get ai
running the cluster asap" — iter-4 authorized
2. "i can wait for 4.2 or whatever version before we try again" —
downgraded v1 to scaffolding; this PR is what Aaron flashes
3. "--no-creds is basically useless right?" — opt-out removed from
recommended path (kept as --no-inject escape hatch only)
4. "whenever i have to ferry commands by reading and typing i'm going
to avoid it like the plague and try to get like pictures and auto
run and short commands pre built in" — design discipline: ALL
diagnostics auto-fire in-place + are photo-friendly; zero operator-
typed commands beyond `zflash`
Files:
- full-ai-cluster/tools/flash-usb.ts: added `--no-eject` flag (4 lines)
so zflash can do post-flash ESP-mount-and-write before the USB ejects.
Allowlist updated per the Copilot P0 catch about destructive-tool
flag validation. Help text mentions iter-4.2 use case
- full-ai-cluster/tools/zflash.ts: extended with post-flash macOS-side
ESP-mount-and-write step:
* Default reads ~/.ssh/id_ed25519.pub
* --ssh-key <path> overrides
* --no-inject opt-out (escape hatch only)
* Re-scans external disks post-flash (flash-usb's single-USB-only
requirement guarantees exactly one external disk)
* Identifies FAT/EFI partition via `diskutil list` regex match
(DOS_FAT / EFI / MS-DOS / FAT16 / FAT32 / Windows_FAT)
* Mounts via `diskutil mount`; gets mount point from `diskutil info`
* Writes <mount>/zeta-authorized-keys.pub via `sudo tee` (stdin
avoids shell-quoting hazards)
* Unmounts + ejects when done
* dumpDiagnostics() helper auto-runs on any failure path:
diskutil list external + mounted /Volumes/* + "what to do next"
suggestions. Compact + photo-friendly per the design discipline
- full-ai-cluster/usb-nixos-installer/zeta-install.sh: added step 6.5
pre-install pubkey probe + injection:
* Try 1: scan /iso /run /mnt /boot for zeta-authorized-keys.pub
via `sudo find -maxdepth 5`
* Try 2: probe USB partitions (/dev/sd? /dev/nvme?n? /dev/vd?
/dev/mmcblk?, minus install targets) via vfat-readonly mount +
file existence check. Partition suffix handling: 1/2 on sd/vd;
p1/p2 on nvme/mmcblk
* If found: writes operator-ssh-keys.nix with valid ssh-* lines
from the file BEFORE nixos-install
* If not found: diagnostics auto-fire (external block devices,
install targets, full lsblk, "what to do next") + falls back to
v1 stub
* Post-install credentials echo branches on INJECT_OK: success
path says "SSH works immediately"; fallback keeps v1 manual-
edit + nixos-rebuild instructions
* shellcheck clean (fixed SC2261 redundant stderr redirect)
- docs/backlog/P1/B-0789-*.md: updated iter-4.2 acceptance to reflect
what shipped: [x] flash-usb.ts --no-eject; [x] zflash.ts ESP inject;
[x] zeta-install.sh probe + inject + branched credentials echo;
[ ] maintainer flashes + tests on PC; [ ] if failure: photo-driven
fix-forward workflow per the maintainer's explicit design choice
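A minimal sketch of the FAT/EFI partition identification described for zflash.ts above, assuming the usual column layout of `diskutil list` text output. The helper name `findFatPartition` and both regexes are hypothetical; they are not taken from the actual zflash.ts source.

```typescript
// Hypothetical helper, not the actual zflash.ts code.
// Matches the partition-type spellings listed in the commit message.
const FAT_TYPE = /\b(DOS_FAT\w*|EFI|MS-DOS|FAT16|FAT32|Windows_FAT\w*)\b/;
// Partition identifiers look like "disk4s1" and sit in the last column.
const PARTITION_ID = /\b(disk\d+s\d+)\s*$/;

function findFatPartition(diskutilListOutput: string): string | null {
  for (const line of diskutilListOutput.split("\n")) {
    // Partition rows start with an index such as "   1:".
    if (!/^\s*\d+:/.test(line)) continue;
    const id = PARTITION_ID.exec(line);
    // Caveat: the type regex is tested against the whole row, so a volume
    // NAME containing e.g. "EFI" would also match. Good enough for a sketch.
    if (id && FAT_TYPE.test(line)) return id[1];
  }
  return null;
}
```

Returning `null` rather than throwing leaves the caller free to route the miss into the auto-diagnostics path instead of aborting.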
Composes with PR #5080 (iter-4 v1 scaffolding: initial-password.nix +
operator-ssh-keys.nix stub + per-host imports) which this PR builds on.
The zero-typing target: `zflash` → boot USB on PC → install → SSH-able
as zeta@<hostname> from the maintainer's Mac using the existing
~/.ssh/id_ed25519 key. Failure path: photo of auto-diagnostics →
AI fixes-forward.
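The partition-suffix rule used by the zeta-install.sh probe (1/2 on sd/vd; p1/p2 on nvme/mmcblk) follows the Linux naming convention that a `p` separator is inserted when the disk name itself ends in a digit. The real probe is shell; the TypeScript below is only an illustration, and `partitionPath` is a hypothetical name.

```typescript
// Hypothetical helper illustrating the suffix rule; the real probe is shell.
function partitionPath(device: string, index: number): string {
  // If the device name ends in a digit (nvme0n1, mmcblk0), a "p" separates
  // the partition number; otherwise (sda, vdb) the number is appended directly.
  const needsSeparator = /\d$/.test(device);
  return `${device}${needsSeparator ? "p" : ""}${index}`;
}
```

For example, `partitionPath("/dev/nvme0n1", 2)` yields `/dev/nvme0n1p2`, while `partitionPath("/dev/sda", 1)` yields `/dev/sda1`.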
Co-Authored-By: Claude <noreply@anthropic.com>
Pull request overview
This PR extends the AI-cluster USB flashing + installer workflow to enable “zero-typing” SSH access on first boot by automatically copying the operator’s SSH public key onto the flashed USB (macOS-side) and injecting it into the installed NixOS config during zeta-install.sh.
Changes:
- Add `--no-eject` to `flash-usb.ts` so downstream tooling can mount/write the USB ESP before ejection.
- Extend `zflash.ts` to (optionally) mount the flashed USB’s FAT/EFI partition and write `zeta-authorized-keys.pub`, with photo-friendly diagnostics on failures.
- Extend `zeta-install.sh` to probe for `zeta-authorized-keys.pub` and generate `operator-ssh-keys.nix` prior to `nixos-install`, with branched post-install messaging.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| full-ai-cluster/usb-nixos-installer/zeta-install.sh | Adds step 6.5 USB pubkey probe + injection into operator-ssh-keys.nix, plus updated post-install messaging. |
| full-ai-cluster/tools/zflash.ts | Adds iter-4.2 post-flash ESP mount/write of zeta-authorized-keys.pub and diagnostics; adds --ssh-key / --no-inject. |
| full-ai-cluster/tools/flash-usb.ts | Adds --no-eject flag and skips eject when requested to support post-flash ESP writes. |
| docs/backlog/P1/B-0789-iter4-ssh-key-and-hashedpassword-substrate-for-cluster-bringup-2026-05-26.md | Updates iter-4.2 acceptance checklist to reflect the shipped auto-inject/probe behavior. |
AceHack added a commit that referenced this pull request on May 26, 2026
…5083 (3 P0 + 2 P1) before maintainer test (#5086)

PR #5083 auto-merged with required checks green; 5 substantive Copilot
findings landed post-merge. All real; the Nix-injection P0 is
security-relevant; install-script P0s would abort the install on real
hardware. Fix-forward before the maintainer tests iter-4.2 end-to-end.

P0 fixes in zeta-install.sh:
1. PRRT_kwDOSF9kNM6Erhtf: `find /iso /run /mnt /boot` under
   `set -euo pipefail` aborts the install if any start-path doesn't
   exist (e.g., /iso on some installer ISOs). Fix: filter to existing
   dirs first via SEARCH_DIRS array; only invoke find if non-empty;
   append `|| true` to swallow find's own exit code defensively
2. PRRT_kwDOSF9kNM6Erhto: `while read line < $PUBKEY_FILE` reads
   without sudo; fails on root-owned mounts (/mnt/* or
   /tmp/zeta-boot-esp from the readonly vfat probe) and aborts the
   install under `set -e`. Fix: read via `sudo cat`
   process-substitution
3. PRRT_kwDOSF9kNM6Erhty: NIX CODE INJECTION HAZARD. Pubkey lines
   interpolated into `"..."` Nix strings without escaping. SSH key
   comment containing `"` or `\` produces invalid Nix; a
   maliciously-crafted line on the USB could inject Nix code at
   install time (operator-ssh-keys.nix is imported by
   configuration.nix). Fix: sed escape `\\` → `\\\\` then `"` → `\"`
   (Nix double-quoted-string escape rules; ordering matters:
   backslash first)

P1 fixes:
4. PRRT_kwDOSF9kNM6ErhuB (zflash.ts): `resolve(next)` in Node doesn't
   expand `~/`. `--ssh-key ~/.ssh/id_ed25519.pub` would resolve to a
   literal `~/.ssh/...` path under cwd and fail existence checks.
   Fix: expand leading `~/` (and bare `~`) to homedir() before resolve
5. PRRT_kwDOSF9kNM6ErhuK (zflash.ts + zeta-install.sh): pubkey type
   regex/glob only matched `ssh-(ed25519|rsa|ecdsa|dss)`; missed
   `ecdsa-sha2-nistp{256,384,521}` (no ssh- prefix; what ssh-keygen
   defaults to for ECDSA) and FIDO/security keys
   `sk-ssh-ed25519@openssh.com` / `sk-ecdsa-sha2-*`.
   Fix: broaden to OpenSSH-spec prefixes per sshd(8) AuthorizedKeysFile

Also resolves the 5 review threads (handled separately via
resolveReviewThread mutation).

Tests:
- shellcheck clean on zeta-install.sh
- zflash.ts --help parses cleanly post-fix

Composes with #5083 (iter-4.2 substrate that this PR fix-forwards).

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
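The three input-handling fixes in this commit (key-type matching, Nix string escaping, and `~` expansion) can be sketched in TypeScript as below. This is an illustration under assumptions, not the shipped code: the actual escaping fix is a sed expression in zeta-install.sh, and the names `isPublicKeyLine`, `escapeNixString`, and `expandTilde` are hypothetical. The extra `${` escape is an addition beyond what the commit message describes; Nix treats `${` inside a double-quoted string as interpolation, so it seems worth covering as well.

```typescript
import { homedir } from "node:os";
import { resolve } from "node:path";

// Hypothetical helpers; the real fixes live in zeta-install.sh and zflash.ts.

// Key-type prefixes, including the ECDSA and FIDO (sk-) types that an
// `ssh-*`-only pattern misses, followed by a base64 blob and optional comment.
const KEY_LINE =
  /^(ssh-(ed25519|rsa|dss)|ecdsa-sha2-nistp(256|384|521)|sk-ssh-ed25519@openssh\.com|sk-ecdsa-sha2-nistp256@openssh\.com) [A-Za-z0-9+/]+={0,2}( .*)?$/;

function isPublicKeyLine(line: string): boolean {
  return KEY_LINE.test(line.trim());
}

// Escape text for a Nix double-quoted string. Backslashes go first so the
// backslashes added by the later steps are not doubled again.
function escapeNixString(s: string): string {
  return s
    .replace(/\\/g, () => "\\\\")
    .replace(/"/g, () => '\\"')
    // Assumption beyond the commit text: also neutralise `${` interpolation.
    .replace(/\$\{/g, () => "\\${");
}

// path.resolve() does not expand "~"; that is normally the shell's job.
function expandTilde(p: string): string {
  if (p === "~") return homedir();
  if (p.startsWith("~/")) return resolve(homedir(), p.slice(2));
  return resolve(p);
}
```

Function replacers are used in `escapeNixString` so that `$` sequences in the replacement text are never interpreted by `String.prototype.replace`.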
AceHack added a commit that referenced this pull request on May 26, 2026
…d fresh ISO from CI: closes 2 gaps surfaced by 2026-05-26 empirical iter-4.2 test (#5091)

Per maintainer 2026-05-26 *"any fixes lets make sure they make it in
main"* + *"does the script not auto download the latest?"* + *"we want
to run what a contributor will run"* + *"no rush we can wait on main we
are going for right not fast"*.

Two gaps surfaced when I ran zflash from the operator's local checkout
(HEAD 89a39ea = pre-iter-4.2) with an iter-4.2 ISO:
1. Flash ran the OLD zflash (no --no-eject + no inject step). USB came
   out bootable but silently WITHOUT operator-ssh-keys.txt populated.
   The iter-4.2 zero-typing target failed silently because the local
   zflash code itself was stale, not because the iter-4.2 substrate on
   main was wrong.
2. The May 25 ISO in ~/Downloads was iter-3-era. I had to manually
   `gh run download` the fresh CI artifact (run 26432433541) before
   zflash had anything usable to flash. Contributor flow today requires
   the operator to remember to pull fresh ISOs.

iter-4.3 closes both gaps in full-ai-cluster/tools/zflash.ts:

checkLocalCheckoutFreshness():
- findRepoRoot() walks up from zflash.ts location to find .git
- Best-effort `git fetch origin main --quiet` (offline = warn + skip)
- For each INSTALL_SUBSTRATE_FILES entry (zflash.ts, flash-usb.ts,
  zeta-install.sh, flake.nix, the 3 nix modules + .txt), checks
  `git diff --quiet HEAD origin/main -- <file>` → status 1 = stale
- If ANY file is stale → bail loud with specific remediation:
  "git -C <root> pull --rebase origin main" or
  "zflash --skip-freshness-check" escape hatch
- Eliminates the silent-stale-code class

autoDownloadFreshIsoIfNeeded():
- Queries `gh api .../actions/workflows/build-ai-cluster-iso.yml/runs?branch=main&status=success&per_page=1`
- If latest run's updated_at > local newest ISO's mtime, pulls via
  `gh run download <run-id> --dir /tmp/zflash-ci-iso-<run-id>`
- Walks the dl dir to find the .iso file (artifact dir nesting)
- Copies to ~/Downloads/zeta-installer-24.11-ci<run-id>-<date>.iso so
  future autoDiscoverIso() picks it as newest
- Skipped when explicit ISO path passed OR --skip-iso-pull set
- Offline / gh failure → falls back to local newest with warning

Two new opt-out flags + allowlist updated:
- --skip-freshness-check: bypass stale-check (not recommended)
- --skip-iso-pull: bypass CI ISO auto-pull (use local newest)

Help text updated to surface the new flags + the auto-pull behavior.

Composes with B-0759 (first-time-CLI-user UX), B-0780 (tier-2 dev
experience; Max + Addison's onboarding will benefit), iter-4.2
substrate (#5083 / #5086 / #5088). Substrate-honest follow-on: captures
the lessons-learned from the maintainer's first actual test of the
iter-4.2 flow.

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
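The two decisions at the heart of this commit can be sketched as follows: whether a tracked file differs from `origin/main`, and whether the latest CI run is newer than the newest local ISO. The function names and signatures are hypothetical and only approximate what `checkLocalCheckoutFreshness()` and `autoDownloadFreshIsoIfNeeded()` are described as doing; the fallback choices (treat an unparseable timestamp as "keep local") are assumptions.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical sketch; names and shapes are assumptions, not zflash.ts code.

// `git diff --quiet` exits 0 when there is no difference and 1 when there is.
// Any other outcome (git missing, bad ref) is treated here as "not stale".
function isStaleAgainstMain(repoRoot: string, file: string): boolean {
  const result = spawnSync(
    "git",
    ["-C", repoRoot, "diff", "--quiet", "HEAD", "origin/main", "--", file],
    { stdio: "ignore" },
  );
  return result.status === 1;
}

// Compare the CI run's `updated_at` (ISO 8601) with the newest local ISO's
// mtime in milliseconds; `null` means no local ISO exists.
function needsIsoDownload(runUpdatedAt: string, localIsoMtimeMs: number | null): boolean {
  if (localIsoMtimeMs === null) return true;
  const runMs = Date.parse(runUpdatedAt);
  if (Number.isNaN(runMs)) return false; // unparseable timestamp: keep local ISO
  return runMs > localIsoMtimeMs;
}
```

Keeping the timestamp comparison as a pure function makes the offline fallback easy to reason about: a failed `gh` query simply never reaches it.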
AceHack added a commit that referenced this pull request on May 26, 2026
…Avahi mDNS + per-node hostname injection (decouple from role-stack) (Aaron 2026-05-26) (#5103)

* feat(B-0792 iter-5.1): persist NetworkManager profiles from live installer to installed system + enable Avahi mDNS publishing

Aaron 2026-05-26 surfaced after iter-4.2 PC1 empirical test:

> "we won't have ethernet for most machines it needs to
> remember the wifi on setup"
> "completely self contained usb we already try eth for 30
> seconds and then ask for wifi we just need to remember it
> afterwards"

REVISED iter-5.1 design (NO operator-side credential pipeline; no
keychain extract; no JSON file; no CLI flags): completely
self-contained on USB. Existing flow already works through nmtui:
1. zeta-first-boot.sh waits 30s for ethernet DHCP
2. If absent, launches nmtui (single TUI form on cluster console)
3. Operator picks wifi SSID + enters password ONCE
4. NetworkManager writes profile to
   /etc/NetworkManager/system-connections/<ssid>.nmconnection
5. Installer connects + runs zeta-install.sh
6. **BUG (today)**: nixos-install installs fresh system that inherits
   NetworkManager service but NOT the operator's connection profile
7. **FIX (iter-5.1)**: zeta-install.sh copies *.nmconnection files
   from live installer to /mnt before nixos-install runs
8. Reboot → installed NixOS NetworkManager loads the profile → wifi
   reconnects automatically

Two changes:
1. full-ai-cluster/usb-nixos-installer/zeta-install.sh: New Step 6.5
   (before nixos-install): detect + copy
   /etc/NetworkManager/system-connections/*.nmconnection from live
   installer to /mnt/etc/NetworkManager/system-connections/.
   chmod 0600 + chown root:root (NM requires; else profiles silently
   ignored at boot with "permissions not strict enough" warning in
   journalctl). Photo-friendly disclosure per profile:
   "[iter-5.1] persisted: <name>.nmconnection (ssid=<ssid>)". Never
   prints psk. Skips cleanly if /etc/NetworkManager/system-connections
   doesn't exist OR has no .nmconnection files (ethernet-DHCP path; no
   profile to copy).
2. full-ai-cluster/nixos/modules/common.nix: Enable services.avahi for
   mDNS publishing so `ssh zeta@control-plane.local` resolves from
   operator Mac (Bonjour) and Linux peers (nss-mdns) on the LAN without
   IP-discovery step. Empirical anchor: 2026-05-26 iter-4.2 test
   surfaced the gap when SSH-by-hostname.local failed even though node
   was up.

    services.avahi = {
      enable = true;
      nssmdns4 = true;
      openFirewall = true; # 5353/udp
      publish = {
        enable = true;
        addresses = true;
        workstation = true;
        domain = true;
      };
    };

Composes with iter-4.x (#5080→#5083→#5086→#5088→#5091→#5093→#5099)
substrate. Acceptance: empirical wifi-only mini-PC bring-up →
nmtui-once at install → reboot → ssh zeta@<hostname>.local zero-typing
zero-console from operator Mac.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(B-0792 iter-5.2): per-node hostname injection: decouple hostname from role-stack (Aaron 2026-05-26)

Aaron 2026-05-26 architectural framing:

> "make any multi node changes we need to like think though
> mdns names when we have two control planes"
> "since our different roles are multi install you can be
> control plane AND gpu node AND cpu node these distinctions
> are not very elegant and host names tied to them are not
> great either"

Bug: every node installed from --flake .#control-plane gets hostname
"control-plane" (baked in flake config); two such nodes collide on mDNS
(Avahi auto-renames second to "control-plane-2.local" but underlying
NixOS hostname stays "control-plane"; confusing in logs / journalctl /
kubectl / node-labels). And role-tied hostname pattern is
architecturally broken: a single node can be control-plane AND
gpu-worker AND storage simultaneously.

iter-5.2 fix: SEPARATE hostname identity from role-stack selection.

Four changes:
1. nixos/modules/injected-hostname.nix (NEW): NixOS module that reads
   /etc/zeta/cluster-node-id at evaluation time + overrides
   networking.hostName via lib.mkOverride 50. If file doesn't exist OR
   is empty OR invalid, the per-host flake config default stays in
   effect; backward-compatible with single-node zero-typing path.
2. nixos/modules/common.nix: import injected-hostname.nix so EVERY host
   (control-plane, worker-gpu, worker-template, future configs) gets
   the override capability transitively via common.nix's existing
   import-chain.
3. tools/zflash.ts: add --host <name> flag with RFC1123 validation at
   flag-parse time (alphanumeric + hyphens, 1-63 chars, no
   leading/trailing hyphen). When passed, write zeta-hostname.txt to
   USB ESP in the same mount session as zeta-authorized-keys.pub
   (covered by same sudo timestamp window; no additional Touch ID).
4. usb-nixos-installer/zeta-install.sh: new Step 6.4 (before
   nixos-install): probe USB for zeta-hostname.txt; if present + valid
   RFC1123, write to /mnt/etc/zeta/cluster-node-id (mode 0644).
   injected-hostname.nix module picks it up at NixOS evaluation time.
   Backward-compatible: if no zeta-hostname.txt, flake default stays.

Empirical UX:

    # Single-node, zero-typing (today's path; unchanged):
    zflash
    # → hostname stays 'control-plane'; ssh zeta@control-plane.local

    # Multi-node, one short flag per USB:
    zflash --host pikachu    # → ssh zeta@pikachu.local
    zflash --host charizard  # → ssh zeta@charizard.local
    zflash --host bulbasaur  # → ssh zeta@bulbasaur.local

    # No flake explosion; all three install from .#control-plane
    # role-stack but each gets unique hostname + mDNS announcement.

The architectural concern Aaron raised (role-as-capability composition;
one node = control-plane AND gpu-worker AND storage) is BEYOND iter-5.2
scope. It requires a refactor of
nixos/hosts/<role>/configuration.nix → composable
nixos/modules/role-*.nix capability modules, filed separately as B-0793
follow-up.

Composes with iter-5.1 in same PR; together they ship "completely
self-contained USB" per Aaron's discipline: nmtui-once-at-install for
wifi, --host <name>-at-zflash for multi-node identity, NetworkManager
profile persistence + hostname injection at install time, mDNS
publishing for zero-IP-discovery SSH-by-hostname.local from operator
Mac.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(B-0792 iter-5.1+5.2): 4 Copilot findings: Step 6.5 dup, personal-name attribution, nullglob comment, SSID truncation on '='

- P1: rename Step 6.4/6.5 to Step 6.6/6.7 (existing Step 6.5 for
  iter-4.2 pubkey probe at line 229; renumber my additions to avoid
  ambiguous labels in install logs)
- P1: replace "Aaron 2026-05-26" with "the maintainer 2026-05-26" in 2
  comment blocks (repo convention: role-based attribution in
  non-history surfaces)
- P2: update nullglob comment: code uses find not glob; describe that
  find + filter handles empty-dir naturally without nullglob shell
  option
- P2: SSID extraction from .nmconnection: replace
  `awk -F= '/^ssid=/{print $2}'` (truncates at first '=') with
  `sed -n 's/^ssid=//p'` (preserves SSIDs containing '=' per 802.11
  spec). Log accuracy fix for all SSIDs.

shellcheck clean post-fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
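Two of the string rules in this commit are easy to get subtly wrong: the RFC 1123 label check for `--host` (alphanumeric + hyphens, 1-63 chars, no leading/trailing hyphen) and SSID extraction that must not truncate at an embedded `=`. The sketch below illustrates both. `isValidHostname` and `extractSsid` are hypothetical names, and the real SSID extraction is a `sed` expression in zeta-install.sh rather than TypeScript.

```typescript
// Hypothetical helpers, not the shipped zflash.ts / zeta-install.sh code.

// One RFC 1123 label: 1-63 chars, letters/digits/hyphens,
// with no hyphen in the first or last position.
const RFC1123_LABEL = /^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$/;

function isValidHostname(name: string): boolean {
  return RFC1123_LABEL.test(name);
}

// Return the value of the first `ssid=` line in .nmconnection text.
// Slicing off the fixed prefix (instead of splitting on "=") keeps
// SSIDs that themselves contain "=" intact.
function extractSsid(nmconnection: string): string | null {
  for (const line of nmconnection.split("\n")) {
    if (line.startsWith("ssid=")) return line.slice("ssid=".length);
  }
  return null;
}
```

Validating at flag-parse time, as the commit describes, means a bad `--host` value is rejected before anything is flashed.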
Summary
The maintainer's actually-usable iter-4 path. Builds on PR #5080 (v1 scaffolding: initial-password.nix + operator-ssh-keys.nix stub + per-host imports). Result: `zflash` on macOS → boot USB on PC → install → SSH-able as `zeta@<hostname>` from the maintainer's Mac using the existing `~/.ssh/id_ed25519` key. Zero operator-typed commands beyond `zflash`.
Design discipline (per Aaron 2026-05-26 four signals)
… `--no-inject` kept as escape hatch only)
Files
- `full-ai-cluster/tools/flash-usb.ts`: added `--no-eject` flag (allowlist + skip-eject branch) so zflash can do post-flash ESP-mount-and-write before the USB ejects
- `full-ai-cluster/tools/zflash.ts`: post-flash macOS-side ESP-mount-and-write:
  - Default reads `~/.ssh/id_ed25519.pub`; `--ssh-key <path>` override; `--no-inject` escape hatch
  - Identifies the FAT/EFI partition via `diskutil list` regex; mounts; gets mount point via `diskutil info`; writes via `sudo tee`; unmounts + ejects
  - `dumpDiagnostics()` auto-fires on any failure: `diskutil list external` + mounted `/Volumes/*` + "what to do next" suggestions. Photo-friendly compact block
- `full-ai-cluster/usb-nixos-installer/zeta-install.sh`: step 6.5 pre-install probe:
  - Scans `/iso /run /mnt /boot` for `zeta-authorized-keys.pub`
  - If found: writes `operator-ssh-keys.nix` with valid `ssh-*` lines before `nixos-install`
  - Post-install credentials echo branches on `INJECT_OK`: success says "SSH works immediately"; fallback keeps v1 manual instructions
- `docs/backlog/P1/B-0789-*.md`: updated iter-4.2 acceptance to mark what shipped + the maintainer-test-pending checkpoint
End-to-end zero-typing flow
Failure-path workflow (per Aaron's photo-driven design)
If anything in zflash's ESP-mount or zeta-install.sh's probe fails, photo-friendly diagnostics auto-fire in-place. Aaron photographs + sends → AI fixes-forward against the actual substrate the photo reveals. No "now run this command to debug" — the diagnostic IS the in-place output.
Test plan
`ssh zeta@<hostname>` works immediately ← end-to-end success criterion
🤖 Generated with Claude Code