Skip to content

fix(B-0789 iter-4.2 fixfwd): 5 Copilot findings on #5083 (3 P0 incl Nix-injection + 2 P1) before maintainer test#5086

Merged
AceHack merged 1 commit into
mainfrom
otto-cli/iter42-fixfwd-5-copilot-findings-2026-05-26
May 26, 2026
Merged

fix(B-0789 iter-4.2 fixfwd): 5 Copilot findings on #5083 (3 P0 incl Nix-injection + 2 P1) before maintainer test#5086
AceHack merged 1 commit into
mainfrom
otto-cli/iter42-fixfwd-5-copilot-findings-2026-05-26

Conversation

@AceHack
Copy link
Copy Markdown
Member

@AceHack AceHack commented May 26, 2026

Summary

PR #5083 (iter-4.2 substrate) auto-merged with required checks green; 5 substantive Copilot findings landed post-merge. All real; the Nix-injection P0 is security-relevant; install-script P0s would abort the install on real hardware under set -euo pipefail. Fix-forward before maintainer tests iter-4.2 end-to-end on PC 1.

P0 fixes (would actually break install or open security hole)

Thread Fix
PRRT_kwDOSF9kNM6Erhtf find /iso /run /mnt /boot aborts install when start-path missing → filter to existing dirs only via SEARCH_DIRS array + || true defense
PRRT_kwDOSF9kNM6Erhto while read < $PUBKEY_FILE fails on root-owned mounts → read via sudo cat process substitution
PRRT_kwDOSF9kNM6Erhty NIX CODE INJECTION in operator-ssh-keys.nix if pubkey comment has " or \ → sed-escape \\\\\\ then "\" (Nix double-quoted string rules; backslash first)

P1 fixes

Thread Fix
PRRT_kwDOSF9kNM6ErhuB resolve('~/path') doesn't expand ~/ in Node → expand leading ~/ (and bare ~) to homedir() before resolve()
PRRT_kwDOSF9kNM6ErhuK Pubkey regex/glob missed ecdsa-sha2-nistp{256,384,521} + FIDO sk-ssh-ed25519@* / sk-ecdsa-sha2-* → broaden to OpenSSH-spec prefixes per sshd(8) AuthorizedKeysFile

Files

  • full-ai-cluster/tools/zflash.ts: ~/ expansion + broader pubkey regex
  • full-ai-cluster/usb-nixos-installer/zeta-install.sh: filtered find paths, sudo cat read, Nix string escape, broader pubkey glob

Test plan

🤖 Generated with Claude Code

…5083 (3 P0 + 2 P1) before maintainer test

PR #5083 auto-merged with required checks green; 5 substantive Copilot
findings landed post-merge. All real; the Nix-injection P0 is security-
relevant; install-script P0s would abort the install on real hardware.
Fix-forward before the maintainer tests iter-4.2 end-to-end.

P0 fixes in zeta-install.sh:

1. PRRT_kwDOSF9kNM6Erhtf — `find /iso /run /mnt /boot` under `set -euo
   pipefail` aborts the install if any start-path doesn't exist (e.g.,
   /iso on some installer ISOs). Fix: filter to existing dirs first via
   SEARCH_DIRS array; only invoke find if non-empty; append `|| true`
   to swallow find's own exit code defensively

2. PRRT_kwDOSF9kNM6Erhto — `while read line < $PUBKEY_FILE` reads
   without sudo; fails on root-owned mounts (/mnt/* or /tmp/zeta-boot-
   esp from the readonly vfat probe) and aborts the install under
   `set -e`. Fix: read via `sudo cat` process-substitution

3. PRRT_kwDOSF9kNM6Erhty — NIX CODE INJECTION HAZARD. Pubkey lines
   interpolated into `"..."` Nix strings without escaping. SSH key
   comment containing `"` or `\` produces invalid Nix; a maliciously-
   crafted line on the USB could inject Nix code at install time
   (operator-ssh-keys.nix is imported by configuration.nix). Fix: sed
   escape `\\` → `\\\\` then `"` → `\"` (Nix double-quoted-string
   escape rules; ordering matters — backslash first)

P1 fixes:

4. PRRT_kwDOSF9kNM6ErhuB (zflash.ts) — `resolve(next)` in Node doesn't
   expand `~/`. `--ssh-key ~/.ssh/id_ed25519.pub` would resolve to a
   literal `~/.ssh/...` path under cwd and fail existence checks.
   Fix: expand leading `~/` (and bare `~`) to homedir() before resolve

5. PRRT_kwDOSF9kNM6ErhuK (zflash.ts + zeta-install.sh) — pubkey type
   regex/glob only matched `ssh-(ed25519|rsa|ecdsa|dss)` — missed
   `ecdsa-sha2-nistp{256,384,521}` (no ssh- prefix; what ssh-keygen
   defaults to for ECDSA) and FIDO/security keys `sk-ssh-ed25519@
   openssh.com` / `sk-ecdsa-sha2-*`. Fix: broaden to OpenSSH-spec
   prefixes per sshd(8) AuthorizedKeysFile

Also resolves the 5 review threads (handled separately via
resolveReviewThread mutation).

Tests:
- shellcheck clean on zeta-install.sh
- zflash.ts --help parses cleanly post-fix

Composes with #5083 (iter-4.2 substrate that this PR fix-forwards).

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 04:26
@AceHack AceHack enabled auto-merge (squash) May 26, 2026 04:26
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fix-forward PR addressing five Copilot review findings from #5083 (iter-4.2 SSH pubkey injection for the cluster install USB). Three P0s would break the install under set -euo pipefail or open a Nix code-injection hole; two P1s improve path/key handling on the macOS side.

Changes:

  • zeta-install.sh: filter find start-paths to existing dirs (with || true), read pubkey via sudo cat process substitution, sed-escape \ then " before interpolating into the generated Nix file, and broaden the case-glob to match ecdsa-sha2-* and FIDO sk-*@* key prefixes.
  • zflash.ts: expand leading ~ / ~/ to homedir() before resolve() for --ssh-key, and broaden the OpenSSH key-type regex to match ecdsa-sha2-*, sk-ssh-ed25519@*, sk-ecdsa-sha2-* (dropping the bogus ssh-ecdsa token).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
full-ai-cluster/usb-nixos-installer/zeta-install.sh Hardens probe step against missing dirs, root-owned mounts, and Nix-string injection; broadens accepted key types.
full-ai-cluster/tools/zflash.ts Expands ~/~/ in --ssh-key argument and broadens pubkey-type validation regex.

Verification spot-checks:

  • Nix escape order is correct: s/\\/\\\\/g then s/"/\\"/g — second pass does not re-double the backslashes added in the first (sequential -e apply once each), so "\" and \\\ in the emitted Nix double-quoted string.
  • ~ expansion: next === "~" slices 1 (empty tail) → homedir(); ~/foo slices 2 → join(homedir(), "foo"). Both correct.
  • find … || true plus prior [ -d "$d" ] guard removes the set -e abort path.
  • done < <(sudo cat "$PUBKEY_FILE") works even when the file is on a root-owned mount.
  • Case glob ecdsa-sha2-*\ * and sk-{ssh-ed25519,ecdsa-sha2-*}@*\ * match the OpenSSH token forms; TS regex mirrors these structurally.

No new issues found in the diff.

@AceHack AceHack merged commit 0b9f5ea into main May 26, 2026
34 checks passed
@AceHack AceHack deleted the otto-cli/iter42-fixfwd-5-copilot-findings-2026-05-26 branch May 26, 2026 04:29
AceHack added a commit that referenced this pull request May 26, 2026
…d fresh ISO from CI — closes 2 gaps surfaced by 2026-05-26 empirical iter-4.2 test (#5091)

Per maintainer 2026-05-26 *"any fixes lets make sure they make it in
main"* + *"does the script not auto download the latest?"* + *"we want
to run what a contributor will run"* + *"no rush we can wait on main
we are going for right not fast"*.

Two gaps surfaced when I ran zflash from the operator's local checkout
(HEAD 89a39ea = pre-iter-4.2) with an iter-4.2 ISO:

1. Flash ran the OLD zflash (no --no-eject + no inject step). USB came
   out bootable but silently WITHOUT operator-ssh-keys.txt populated.
   The iter-4.2 zero-typing target failed silently because the local
   zflash code itself was stale, not because the iter-4.2 substrate
   on main was wrong.

2. The May 25 ISO in ~/Downloads was iter-3-era. I had to manually
   `gh run download` the fresh CI artifact (run 26432433541) before
   zflash had anything usable to flash. Contributor flow today
   requires the operator to remember to pull fresh ISOs.

iter-4.3 closes both gaps in full-ai-cluster/tools/zflash.ts:

checkLocalCheckoutFreshness():
- findRepoRoot() walks up from zflash.ts location to find .git
- Best-effort `git fetch origin main --quiet` (offline = warn + skip)
- For each INSTALL_SUBSTRATE_FILES entry (zflash.ts, flash-usb.ts,
  zeta-install.sh, flake.nix, the 3 nix modules + .txt), checks
  `git diff --quiet HEAD origin/main -- <file>` → status 1 = stale
- If ANY file is stale → bail loud with specific remediation:
  "git -C <root> pull --rebase origin main" or
  "zflash --skip-freshness-check" escape hatch
- Eliminates the silent-stale-code class

autoDownloadFreshIsoIfNeeded():
- Queries `gh api .../actions/workflows/build-ai-cluster-iso.yml/
  runs?branch=main&status=success&per_page=1`
- If latest run's updated_at > local newest ISO's mtime, pulls via
  `gh run download <run-id> --dir /tmp/zflash-ci-iso-<run-id>`
- Walks the dl dir to find the .iso file (artifact dir nesting)
- Copies to ~/Downloads/zeta-installer-24.11-ci<run-id>-<date>.iso
  so future autoDiscoverIso() picks it as newest
- Skipped when explicit ISO path passed OR --skip-iso-pull set
- Offline / gh failure → falls back to local newest with warning

Two new opt-out flags + allowlist updated:
- --skip-freshness-check: bypass stale-check (not recommended)
- --skip-iso-pull: bypass CI ISO auto-pull (use local newest)

Help text updated to surface the new flags + the auto-pull behavior.

Composes with B-0759 (first-time-CLI-user UX), B-0780 (tier-2 dev
experience — Max + Addison's onboarding will benefit), iter-4.2
substrate (#5083 / #5086 / #5088). Substrate-honest follow-on:
captures the lessons-learned from the maintainer's first actual
test of the iter-4.2 flow.

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 26, 2026
…Avahi mDNS + per-node hostname injection (decouple from role-stack) (Aaron 2026-05-26) (#5103)

* feat(B-0792 iter-5.1): persist NetworkManager profiles from live installer to installed system + enable Avahi mDNS publishing

Aaron 2026-05-26 surfaced after iter-4.2 PC1 empirical test:

> "we won't have ethernet for most machines it needs to
> remember the wifi on setup"

> "completely self contained usb we already try eth for 30
> seconds and then ask for wifi we just need to remember it
> afterwards"

REVISED iter-5.1 design (NO operator-side credential pipeline;
no keychain extract; no JSON file; no CLI flags): completely
self-contained on USB. Existing flow already works through nmtui:

1. zeta-first-boot.sh waits 30s for ethernet DHCP
2. If absent, launches nmtui (single TUI form on cluster console)
3. Operator picks wifi SSID + enters password ONCE
4. NetworkManager writes profile to
   /etc/NetworkManager/system-connections/<ssid>.nmconnection
5. Installer connects + runs zeta-install.sh
6. **BUG (today)**: nixos-install installs fresh system that
   inherits NetworkManager service but NOT the operator's
   connection profile
7. **FIX (iter-5.1)**: zeta-install.sh copies *.nmconnection
   files from live installer to /mnt before nixos-install runs
8. Reboot → installed NixOS NetworkManager loads the profile →
   wifi reconnects automatically

Two changes:

1. full-ai-cluster/usb-nixos-installer/zeta-install.sh:
   New Step 6.5 (before nixos-install): detect + copy
   /etc/NetworkManager/system-connections/*.nmconnection from
   live installer to /mnt/etc/NetworkManager/system-connections/.
   chmod 0600 + chown root:root (NM requires; else profiles
   silently ignored at boot with "permissions not strict enough"
   warning in journalctl). Photo-friendly disclosure per profile:
   "[iter-5.1]   persisted: <name>.nmconnection (ssid=<ssid>)".
   Never prints psk. Skips cleanly if /etc/NetworkManager/
   system-connections doesn't exist OR has no .nmconnection
   files (ethernet-DHCP path; no profile to copy).

2. full-ai-cluster/nixos/modules/common.nix:
   Enable services.avahi for mDNS publishing so
   `ssh zeta@control-plane.local` resolves from operator Mac
   (Bonjour) and Linux peers (nss-mdns) on the LAN without
   IP-discovery step. Empirical anchor: 2026-05-26 iter-4.2
   test surfaced the gap when SSH-by-hostname.local failed
   even though node was up.

   services.avahi = {
     enable = true;
     nssmdns4 = true;
     openFirewall = true;  # 5353/udp
     publish = {
       enable = true;
       addresses = true;
       workstation = true;
       domain = true;
     };
   };

Composes with iter-4.x (#5080#5083#5086#5088#5091#5093#5099) substrate. Acceptance: empirical wifi-only mini-PC bring-
up → nmtui-once at install → reboot → ssh zeta@<hostname>.local
zero-typing zero-console from operator Mac.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(B-0792 iter-5.2): per-node hostname injection — decouple hostname from role-stack (Aaron 2026-05-26)

Aaron 2026-05-26 architectural framing:

> "make any multi node changes we need to like think though
> mdns names when we have two control planes"

> "since our different roles are multi install you can be
> control plane AND gpu node AND cpu node these distinctions
> are not very elegant and host names tied to them are not
> great either"

Bug: every node installed from --flake .#control-plane gets
hostname "control-plane" (baked in flake config); two such
nodes collide on mDNS (Avahi auto-renames second to
"control-plane-2.local" but underlying NixOS hostname stays
"control-plane" — confusing in logs / journalctl / kubectl /
node-labels). And role-tied hostname pattern is architecturally
broken — a single node can be control-plane AND gpu-worker
AND storage simultaneously.

iter-5.2 fix: SEPARATE hostname identity from role-stack
selection. Three changes:

1. nixos/modules/injected-hostname.nix (NEW): NixOS module that
   reads /etc/zeta/cluster-node-id at evaluation time + overrides
   networking.hostName via lib.mkOverride 50. If file doesn't
   exist OR is empty OR invalid, the per-host flake config
   default stays in effect — backward-compatible with
   single-node zero-typing path.

2. nixos/modules/common.nix: import injected-hostname.nix so
   EVERY host (control-plane, worker-gpu, worker-template,
   future configs) gets the override capability transitively
   via common.nix's existing import-chain.

3. tools/zflash.ts: add --host <name> flag with RFC1123
   validation at flag-parse time (alphanumeric + hyphens,
   1-63 chars, no leading/trailing hyphen). When passed,
   write zeta-hostname.txt to USB ESP in the same mount
   session as zeta-authorized-keys.pub (covered by same sudo
   timestamp window; no additional Touch ID).

4. usb-nixos-installer/zeta-install.sh: new Step 6.4 (before
   nixos-install) — probe USB for zeta-hostname.txt; if
   present + valid RFC1123, write to /mnt/etc/zeta/
   cluster-node-id (mode 0644). injected-hostname.nix module
   picks it up at NixOS evaluation time. Backward-compatible:
   if no zeta-hostname.txt, flake default stays.

Empirical UX:

  # Single-node, zero-typing (today's path; unchanged):
  zflash
  # → hostname stays 'control-plane'; ssh zeta@control-plane.local

  # Multi-node, one short flag per USB:
  zflash --host pikachu      # → ssh zeta@pikachu.local
  zflash --host charizard    # → ssh zeta@charizard.local
  zflash --host bulbasaur    # → ssh zeta@bulbasaur.local
  # No flake explosion; all three install from .#control-plane
  # role-stack but each gets unique hostname + mDNS announcement.

The architectural concern Aaron raised (role-as-capability
composition; one node = control-plane AND gpu-worker AND
storage) is BEYOND iter-5.2 scope — refactor of
nixos/hosts/<role>/configuration.nix → composable
nixos/modules/role-*.nix capability modules — filed separately
as B-0793 follow-up.

Composes with iter-5.1 in same PR; together they ship
"completely self-contained USB" per Aaron's discipline:
nmtui-once-at-install for wifi, --host <name>-at-zflash for
multi-node identity, NetworkManager profile persistence +
hostname injection at install time, mDNS publishing for
zero-IP-discovery SSH-by-hostname.local from operator Mac.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(B-0792 iter-5.1+5.2): 4 Copilot findings — Step 6.5 dup, personal-name attribution, nullglob comment, SSID truncation on '='

- P1: rename Step 6.4/6.5 to Step 6.6/6.7 (existing Step 6.5 for
  iter-4.2 pubkey probe at line 229; renumber my additions to
  avoid ambiguous labels in install logs)
- P1: replace "Aaron 2026-05-26" with "the maintainer 2026-05-26"
  in 2 comment blocks (repo convention: role-based attribution
  in non-history surfaces)
- P2: update nullglob comment — code uses find not glob; describe
  that find + filter handles empty-dir naturally without
  nullglob shell option
- P2: SSID extraction from .nmconnection — replace
  `awk -F= '/^ssid=/{print $2}'` (truncates at first '=') with
  `sed -n 's/^ssid=//p'` (preserves SSIDs containing '=' per
  802.11 spec). Log accuracy fix for all SSIDs.

shellcheck clean post-fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants