Skip to content

fix(B-0789 iter-4.3 fixfwd): 4 Copilot findings on #5091 (1 P0 silent-guard-defeat + 2 P1 + 1 P2)#5093

Merged
AceHack merged 2 commits into
mainfrom
otto-cli/iter43-fixfwd-4-copilot-findings-2026-05-26
May 26, 2026
Merged

fix(B-0789 iter-4.3 fixfwd): 4 Copilot findings on #5091 (1 P0 silent-guard-defeat + 2 P1 + 1 P2)#5093
AceHack merged 2 commits into
mainfrom
otto-cli/iter43-fixfwd-4-copilot-findings-2026-05-26

Conversation

@AceHack
Copy link
Copy Markdown
Member

@AceHack AceHack commented May 26, 2026

Summary

PR #5091 auto-merged with required checks green; 4 substantive Copilot findings landed post-merge. The P0 silently defeats the safety guard the whole PR exists to provide — fix-forward before the maintainer's next iter-4.2 test.

Fixes

Thread Severity Fix
PRRT_kwDOSF9kNM6Eryew P0 Per-file git diff loop silently skipped non-0/1 exit codes → false "up-to-date" verdict. Now collects (file, status, stderr) tuples + bails HARD with diagnostic if any errored
PRRT_kwDOSF9kNM6Eryei P1 git fetch failure masked as "offline"; auth / missing-git / no-origin all silently swallowed. Now captures stderr + discriminates 8 network-signal strings; non-network = bail with cause + escape hatch suggestion
PRRT_kwDOSF9kNM6Erye3 P1 Error text said "behind origin/main" but content-diff fires on behind OR ahead OR diverged. Reworded to "differs from origin/main" + remediation covers all three cases
PRRT_kwDOSF9kNM6Erye8 P2 /tmp/zflash-ci-iso-<runId> stable path never cleaned up; could re-use partial download. Now uses mkdtempSync + try/finally with rmSync

Test plan

  • zflash --help still parses (no flag changes)
  • All 4 fixes in single file (zflash.ts; 163 insertions / 58 deletions)
  • CI passes (gate workflow + CodeQL)
  • Maintainer's next iter-4.2 test uses the now-trustworthy freshness guard

🤖 Generated with Claude Code

…-guard-defeat + 2 P1 + 1 P2)

PR #5091 (iter-4.3 stale-checkout + auto-ISO-download) auto-merged
with required checks green; 4 substantive Copilot findings landed
post-merge. The P0 silently defeats the safety guard the whole PR
exists to provide — must fix-forward before maintainer's next iter-4.2
test.

P0 (PRRT_kwDOSF9kNM6Eryew): Per-file `git diff` loop silently skipped
non-0/1 exit codes (git error = file missing, repo bad state, origin/
main ref missing). Result: false "up-to-date" verdict when the safety
guard should have surfaced a real problem. Now treats non-0/1 status
as HARD FAIL: collects (file, status, stderr) tuples; if any errored,
bails with the errored files + common-cause diagnostic + escape hatch.

P1 (PRRT_kwDOSF9kNM6Eryei): `git fetch` failure was masked as "offline";
real errors (git missing, no origin, auth failure, rate-limit) got
swallowed. Now captures stderr; discriminates known network signals
("could not resolve host", "connection refused", "connection timed
out", "network is unreachable", "no route to host", "operation timed
out", "could not read from remote repository", "ssh: connect to host")
from other failures; only the network class degrades to "offline" warn.
Non-network failures bail with stderr + common causes + escape hatch
suggestion (--skip-freshness-check).

P1 (PRRT_kwDOSF9kNM6Erye3): Error text said "behind origin/main" but
content-diff fires on behind OR ahead OR diverged. Reworded to
"differs from origin/main" + remediation now lists all three cases:
behind → git pull --rebase; ahead/diverged → log HEAD..origin/main +
log origin/main..HEAD + push as PR OR rebase / merge as appropriate;
escape hatch unchanged.

P2 (PRRT_kwDOSF9kNM6Erye8): `/tmp/zflash-ci-iso-<runId>` was a stable
path; never cleaned up; could re-use partial download from interrupted
run. Now uses mkdtempSync(`zflash-ci-iso-<runId>-`) + try/finally with
rmSync to guarantee cleanup whether copy succeeded or threw. Imports
added: mkdtempSync, rmSync, tmpdir.

Composes with iter-4.3 substrate (#5091); this fix-forward makes the
freshness guard actually trustworthy + makes the auto-download path
properly hygienic.

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 04:57
@AceHack AceHack enabled auto-merge (squash) May 26, 2026 04:57
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fix-forward for post-merge findings from #5091 to restore the intended safety guarantees in zflash’s iter-4.3 “freshness guard” and CI-ISO auto-download flow.

Changes:

  • Makes the stale-checkout guard fail hard on unexpected git diff exit codes (instead of silently skipping errors).
  • Refines git fetch failure handling to distinguish “offline/network-ish” errors from other failures, and updates user-facing remediation text to match “differs from origin/main” semantics.
  • Replaces a stable /tmp download directory with a mkdtempSync temp dir plus try/finally cleanup to avoid reusing partial downloads and leaving clutter behind.

Comment thread full-ai-cluster/tools/zflash.ts
Comment thread full-ai-cluster/tools/zflash.ts Outdated
Comment thread full-ai-cluster/tools/zflash.ts
… TS strict + 1 P0 auth-misclassification + 1 P1 ENOENT diagnostic)

Three findings from Copilot on the iter-4.3 fixfwd PR #5093 that
landed post-merge of #5091 but pre-merge of #5093 itself.

P0 (PRRT_kwDOSF9kNM6Er1uQ) — TypeScript strict mode: `let dlDest:
string | null = null` then `return dlDest` violates the function's
`string` return type. CI's tsc check would have failed merge.
Fix: initialize `dlDest: string = localIso` so it's always string;
overwrite on copy success.

P0 (PRRT_kwDOSF9kNM6Er1un) — auth-as-network misclassification: the
freshness-fetch failure classifier matched "could not read from remote
repository" as a network signal, but that string also appears on auth
/ permission failures (private repo, expired credential, etc.).
Auth failures would have been silently downgraded to "offline" warn,
defeating the freshness guard. Fix: (a) auth signals matched FIRST
with explicit auth-pattern list ("permission denied", "authentication
failed", "fatal: authentication", "invalid credentials", "bad
credentials", "could not read username", "support for password
authentication was removed", "403 forbidden", "401 unauthorized");
(b) network signals require actual host/connectivity word (removed
"could not read from remote repository" from network list — needs a
separate host/connectivity signal to qualify as network); (c) auth-
flavored bail surfaces "(AUTH-flavored)" tag + remediation suggests
re-auth via gh auth login.

P1 (PRRT_kwDOSF9kNM6Er1uw) — ENOENT diagnostic: when git binary is
missing, the exec throws without stderr. Prior diagnostic said "(no
stderr captured)" — hides the actual cause (ENOENT). Fix: capture
`e.message` + `e.code` (when present) + surface in the bail
diagnostic. Common-causes list also adds "git not installed / not on
PATH (e.code === 'ENOENT')" as the first entry.

All 3 fixes layered onto the existing iter-4.3 freshness-check
function. The auth-misclassification is the highest-impact (security-
meaningful) because it would have made the freshness guard fail-open
for auth-failure scenarios.

Co-Authored-By: Claude <noreply@anthropic.com>
@AceHack AceHack merged commit 3fae46d into main May 26, 2026
32 checks passed
@AceHack AceHack deleted the otto-cli/iter43-fixfwd-4-copilot-findings-2026-05-26 branch May 26, 2026 05:06
AceHack added a commit that referenced this pull request May 26, 2026
…t_msdos fallback for NixOS isohybrid post-dd ESP (#5099)

2026-05-26 empirical test of #5093 (iter-4.3 fix-forward) surfaced
TWO bugs in the iter-4.2 inject step that the auto-diagnostic
substrate caught loud + photo-friendly:

(1) ESP detection regex at line 481 missed MBR 0xEF partition type.
    NixOS install ISOs use isohybrid layout; after dd to USB, macOS's
    diskutil reads the FDisk MBR partition table and reports the FAT
    ESP with TYPE='0xEF' (the MBR numeric type code for "EFI System
    Partition") rather than diskutil's GPT-style 'EFI EFI' or
    'DOS_FAT_32' labels. The existing regex
    \b(DOS_FAT|EFI|MS-DOS|FAT16|FAT32|Windows_FAT)\b matched the
    GPT cases but not the MBR cases.

(2) Even with detection fixed, 'diskutil mount /dev/disk6s2' fails
    on MBR 0xEF partitions because macOS's diskutil auto-probe
    doesn't recognize the FAT12 filesystem-in-0xEF combination as
    mountable. diskutil info reports 'File System: None' even
    though 'sudo file -s /dev/disk6s2' clearly shows the FAT12
    EFIBOOT volume is there + writable. The fix: explicit
    mount_msdos against a mkdtemp mount point as fallback.

Changes:

- findFatPartition() regex extended to ALSO match MBR partition
  type codes 0x(EF|0C|0E|06|0B|0F) covering EFI System Partition +
  FAT16 + FAT32 + FAT32-LBA + FAT16-LBA + Extended-LBA. \b on both
  sides prevents accidental match inside longer hex strings.
- New mountEsp(espPart) helper: tries 'diskutil mount' first (GPT
  case), falls back to 'sudo mount_msdos -o nodev,nosuid <part>
  <mkdtemp>' for MBR 0xEF case. Returns method tag so unmount
  matches.
- New unmountEsp(espPart, result) helper: dispatches to 'diskutil
  unmount' or 'sudo umount + rmSync(tmpdir)' based on method.
- injectPubkeyToUsb() simplified: single mountEsp call replaces
  the diskutil-mount + getMountPoint + error-path-unmount duplication;
  single unmountEsp call at success + error paths.
- Diagnostic message updated: 'mountEsp ${espPart} failed (both
  diskutil + mount_msdos paths)' tells the operator BOTH paths
  were exhausted before bail.

Empirical validation 2026-05-26: manually ran 'sudo mount_msdos
-o nodev,nosuid /dev/disk6s2 /tmp/zeta-esp-mount' on the
post-dd USB (124GB PNY, 3.1MB 0xEF ESP labeled EFIBOOT) — mount
succeeded; wrote zeta-authorized-keys.pub via 'sudo cp ~/.ssh/
id_ed25519.pub /tmp/zeta-esp-mount/...'; unmounted + ejected
cleanly. The substrate the script now ships matches what worked
empirically.

Composes with iter-4.3 (#5091) + iter-4.3 fixfwd (#5093) + B-0789
parent + B-0790 end-state (zero-typing SSH after first cluster
boot is one of the load-bearing steps toward zero-dev-machine
homelab persona).

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 26, 2026
…Avahi mDNS + per-node hostname injection (decouple from role-stack) (Aaron 2026-05-26) (#5103)

* feat(B-0792 iter-5.1): persist NetworkManager profiles from live installer to installed system + enable Avahi mDNS publishing

Aaron 2026-05-26 surfaced after iter-4.2 PC1 empirical test:

> "we won't have ethernet for most machines it needs to
> remember the wifi on setup"

> "completely self contained usb we already try eth for 30
> seconds and then ask for wifi we just need to remember it
> afterwards"

REVISED iter-5.1 design (NO operator-side credential pipeline;
no keychain extract; no JSON file; no CLI flags): completely
self-contained on USB. Existing flow already works through nmtui:

1. zeta-first-boot.sh waits 30s for ethernet DHCP
2. If absent, launches nmtui (single TUI form on cluster console)
3. Operator picks wifi SSID + enters password ONCE
4. NetworkManager writes profile to
   /etc/NetworkManager/system-connections/<ssid>.nmconnection
5. Installer connects + runs zeta-install.sh
6. **BUG (today)**: nixos-install installs fresh system that
   inherits NetworkManager service but NOT the operator's
   connection profile
7. **FIX (iter-5.1)**: zeta-install.sh copies *.nmconnection
   files from live installer to /mnt before nixos-install runs
8. Reboot → installed NixOS NetworkManager loads the profile →
   wifi reconnects automatically

Two changes:

1. full-ai-cluster/usb-nixos-installer/zeta-install.sh:
   New Step 6.5 (before nixos-install): detect + copy
   /etc/NetworkManager/system-connections/*.nmconnection from
   live installer to /mnt/etc/NetworkManager/system-connections/.
   chmod 0600 + chown root:root (NM requires; else profiles
   silently ignored at boot with "permissions not strict enough"
   warning in journalctl). Photo-friendly disclosure per profile:
   "[iter-5.1]   persisted: <name>.nmconnection (ssid=<ssid>)".
   Never prints psk. Skips cleanly if /etc/NetworkManager/
   system-connections doesn't exist OR has no .nmconnection
   files (ethernet-DHCP path; no profile to copy).

2. full-ai-cluster/nixos/modules/common.nix:
   Enable services.avahi for mDNS publishing so
   `ssh zeta@control-plane.local` resolves from operator Mac
   (Bonjour) and Linux peers (nss-mdns) on the LAN without
   IP-discovery step. Empirical anchor: 2026-05-26 iter-4.2
   test surfaced the gap when SSH-by-hostname.local failed
   even though node was up.

   services.avahi = {
     enable = true;
     nssmdns4 = true;
     openFirewall = true;  # 5353/udp
     publish = {
       enable = true;
       addresses = true;
       workstation = true;
       domain = true;
     };
   };

Composes with iter-4.x (#5080#5083#5086#5088#5091#5093#5099) substrate. Acceptance: empirical wifi-only mini-PC bring-
up → nmtui-once at install → reboot → ssh zeta@<hostname>.local
zero-typing zero-console from operator Mac.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(B-0792 iter-5.2): per-node hostname injection — decouple hostname from role-stack (Aaron 2026-05-26)

Aaron 2026-05-26 architectural framing:

> "make any multi node changes we need to like think though
> mdns names when we have two control planes"

> "since our different roles are multi install you can be
> control plane AND gpu node AND cpu node these distinctions
> are not very elegant and host names tied to them are not
> great either"

Bug: every node installed from --flake .#control-plane gets
hostname "control-plane" (baked in flake config); two such
nodes collide on mDNS (Avahi auto-renames second to
"control-plane-2.local" but underlying NixOS hostname stays
"control-plane" — confusing in logs / journalctl / kubectl /
node-labels). And role-tied hostname pattern is architecturally
broken — a single node can be control-plane AND gpu-worker
AND storage simultaneously.

iter-5.2 fix: SEPARATE hostname identity from role-stack
selection. Three changes:

1. nixos/modules/injected-hostname.nix (NEW): NixOS module that
   reads /etc/zeta/cluster-node-id at evaluation time + overrides
   networking.hostName via lib.mkOverride 50. If file doesn't
   exist OR is empty OR invalid, the per-host flake config
   default stays in effect — backward-compatible with
   single-node zero-typing path.

2. nixos/modules/common.nix: import injected-hostname.nix so
   EVERY host (control-plane, worker-gpu, worker-template,
   future configs) gets the override capability transitively
   via common.nix's existing import-chain.

3. tools/zflash.ts: add --host <name> flag with RFC1123
   validation at flag-parse time (alphanumeric + hyphens,
   1-63 chars, no leading/trailing hyphen). When passed,
   write zeta-hostname.txt to USB ESP in the same mount
   session as zeta-authorized-keys.pub (covered by same sudo
   timestamp window; no additional Touch ID).

4. usb-nixos-installer/zeta-install.sh: new Step 6.4 (before
   nixos-install) — probe USB for zeta-hostname.txt; if
   present + valid RFC1123, write to /mnt/etc/zeta/
   cluster-node-id (mode 0644). injected-hostname.nix module
   picks it up at NixOS evaluation time. Backward-compatible:
   if no zeta-hostname.txt, flake default stays.

Empirical UX:

  # Single-node, zero-typing (today's path; unchanged):
  zflash
  # → hostname stays 'control-plane'; ssh zeta@control-plane.local

  # Multi-node, one short flag per USB:
  zflash --host pikachu      # → ssh zeta@pikachu.local
  zflash --host charizard    # → ssh zeta@charizard.local
  zflash --host bulbasaur    # → ssh zeta@bulbasaur.local
  # No flake explosion; all three install from .#control-plane
  # role-stack but each gets unique hostname + mDNS announcement.

The architectural concern Aaron raised (role-as-capability
composition; one node = control-plane AND gpu-worker AND
storage) is BEYOND iter-5.2 scope — refactor of
nixos/hosts/<role>/configuration.nix → composable
nixos/modules/role-*.nix capability modules — filed separately
as B-0793 follow-up.

Composes with iter-5.1 in same PR; together they ship
"completely self-contained USB" per Aaron's discipline:
nmtui-once-at-install for wifi, --host <name>-at-zflash for
multi-node identity, NetworkManager profile persistence +
hostname injection at install time, mDNS publishing for
zero-IP-discovery SSH-by-hostname.local from operator Mac.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(B-0792 iter-5.1+5.2): 4 Copilot findings — Step 6.5 dup, personal-name attribution, nullglob comment, SSID truncation on '='

- P1: rename Step 6.4/6.5 to Step 6.6/6.7 (existing Step 6.5 for
  iter-4.2 pubkey probe at line 229; renumber my additions to
  avoid ambiguous labels in install logs)
- P1: replace "Aaron 2026-05-26" with "the maintainer 2026-05-26"
  in 2 comment blocks (repo convention: role-based attribution
  in non-history surfaces)
- P2: update nullglob comment — code uses find not glob; describe
  that find + filter handles empty-dir naturally without
  nullglob shell option
- P2: SSID extraction from .nmconnection — replace
  `awk -F= '/^ssid=/{print $2}'` (truncates at first '=') with
  `sed -n 's/^ssid=//p'` (preserves SSIDs containing '=' per
  802.11 spec). Log accuracy fix for all SSIDs.

shellcheck clean post-fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants