Skip to content

backlog(B-0789): iter-4 v1 cluster credential substrate — hashedPassword + operator-ssh-keys scaffold (iter-4.2 ships zero-typing auto-inject)#5080

Merged
AceHack merged 1 commit into
mainfrom
otto-cli/iter4-ssh-password-substrate-b0789-2026-05-26
May 26, 2026
Merged

backlog(B-0789): iter-4 v1 cluster credential substrate — hashedPassword + operator-ssh-keys scaffold (iter-4.2 ships zero-typing auto-inject)#5080
AceHack merged 1 commit into
mainfrom
otto-cli/iter4-ssh-password-substrate-b0789-2026-05-26

Conversation

@AceHack
Copy link
Copy Markdown
Member

@AceHack AceHack commented May 26, 2026

Summary

The maintainer 2026-05-26 two-step authorization:

  1. "we can do what's going to make cluster setup eaiser for me and not users if that's ssh lets do that first cause we want to get ai running the cluster asap" — authorized iter-4 SSH+password work
  2. "i can wait for 4.2 or whatever version before we try again" — downgraded v1 from "test via re-flash" to "substrate scaffolding for iter-4.2 to build on"

iter-4 v1 ships the Nix-module + per-host-import scaffolding so iter-4.2 (zflash auto-inject + zeta-install.sh USB probe — the maintainer's actually-usable test target) is a tightly-scoped tooling PR rather than a substrate-shape PR.

Files

  • full-ai-cluster/nixos/modules/initial-password.nix (new): sha512crypt hash for zeta-change-me; operator rotates on first tty1 login via passwd zeta. Simplest-first; promote to yescrypt / agenix / sops-nix when repo goes public OR multi-operator isolation becomes load-bearing
  • full-ai-cluster/nixos/modules/operator-ssh-keys.nix (new): empty stub with edit-and-rebuild workflow documented in the comment header. iter-4.2 OVERWRITES this file at install time from the boot USB
  • full-ai-cluster/nixos/hosts/control-plane/configuration.nix: imports the two new modules; removes the prior inline empty authorizedKeys.keys declaration
  • full-ai-cluster/usb-nixos-installer/zeta-install.sh: prints initial credentials + post-install workflow block before exit (same echo block works for both v1 manual-edit fallback path and iter-4.2 zero-typing path)
  • docs/backlog/P1/B-0789-*.md (new): captures iter-4 v1 acceptance (scaffolding-only) + iter-4.2 acceptance (zflash auto-inject) + iter-4.3 multi-key extension + iter-5 per-node deploy-key + iter-5+ secret-management substrate promotion paths
  • docs/BACKLOG.md: regenerated via BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts

Initial password is zeta-change-me

Rotate immediately on first tty1 login via passwd zeta. Hash format: sha512crypt ($6$...). Generated via openssl passwd -6 'zeta-change-me'. NixOS reads via users.users.zeta.hashedPassword.

Composes with

  • B-0754 (iter-3 zero-typing USB install — iter-4 is the credential-substrate follow-on)
  • B-0759 (first-time-CLI-user persona)
  • B-0770 (Comet Pro IP-KVM — local-console-with-password becomes load-bearing for the IP-KVM substrate)
  • B-0776 / B-0786 (simplest-first discipline)
  • B-0780 (Local Loop tier-3 substrate needs reachable clusters)
  • B-0778 (commodity hardware reference)
  • .claude/rules/human-audit-and-legal-risk-acceptance-pattern-in-settings.md Shape A (hashedPassword-in-per-host-module)

Out of scope (deferred to iter-4.2+)

  • zflash auto-inject of SSH key to boot USB
  • zeta-install.sh USB probe + injection into operator-ssh-keys.nix
  • Multi-key per-context support (iter-4.3)
  • Per-node SSH keypair + GitHub deploy-key registration (iter-5)
  • agenix / sops-nix secret-management substrate (iter-5+)
  • Worker-template + worker-gpu module imports (v1.1 within this row)

Test plan

  • markdownlint clean
  • No tooling change (Nix module structure only); zeta-install.sh print block is the only operator-facing behavioral change
  • BACKLOG.md regenerated to pick up B-0789
  • CI passes (gate workflow + CodeQL)

The maintainer will NOT re-flash for v1 (per "i can wait for 4.2"); v1 is substrate-engineering housekeeping for the iter-4.2 PR to land cleanly on.

🤖 Generated with Claude Code

…ord + operator-ssh-keys scaffold (iter-4.2 ships the zero-typing auto-inject)

The maintainer 2026-05-26 surfaced two adjacent signals across recent
ticks:

1. "we can do what's going to make cluster setup eaiser for me and not
   users if that's ssh lets do that first cause we want to get ai
   running the cluster asap" — authorized iter-4 SSH+password work
2. "i can wait for 4.2 or whatever version before we try again" —
   downgraded v1 from "test via re-flash" to "substrate scaffolding
   for 4.2 to build on"

iter-4 v1 ships the Nix-module + per-host-import scaffolding so
iter-4.2 (zflash auto-inject + zeta-install.sh USB probe) is a
tightly-scoped tooling PR rather than a substrate-shape PR.

Files:

- full-ai-cluster/nixos/modules/initial-password.nix (new): sha512crypt
  hash for "zeta-change-me" via `openssl passwd -6`; operator rotates
  on first tty1 login via `passwd zeta`. Simplest-first per the
  maintainer-Mika 2026-05-25 feedback memory; promote to yescrypt /
  agenix / sops-nix when repo goes public OR multi-operator isolation
  becomes load-bearing

- full-ai-cluster/nixos/modules/operator-ssh-keys.nix (new): empty
  stub with edit-and-rebuild workflow documented in the module's
  comment header. iter-4.2 OVERWRITES this file at install time from
  the boot USB; v1 ships the stub so per-host imports compose cleanly
  with the iter-4.2 tooling change without re-architecting

- full-ai-cluster/nixos/hosts/control-plane/configuration.nix:
  imports the two new modules; removes the prior inline empty
  authorizedKeys.keys declaration (now handled by the imported module)

- full-ai-cluster/usb-nixos-installer/zeta-install.sh: prints
  initial credentials + post-install operator workflow block before
  exit. Text covers both the v1 manual-edit fallback path and the
  iter-4.2 zero-typing path so the same echo block works for either
  iteration

- docs/backlog/P1/B-0789-*.md (new): captures iter-4 v1 acceptance
  (scaffolding-only) + iter-4.2 acceptance (zflash auto-inject; the
  maintainer's actually-usable target) + iter-4.3 multi-key extension
  + iter-5 per-node deploy-key + iter-5+ secret-management substrate
  promotion paths

Composes with B-0754 (iter-3 zero-typing USB install — iter-4 is the
credential-substrate follow-on); B-0759 (first-time-CLI-user persona);
B-0770 (Comet Pro IP-KVM — local-console-with-password becomes load-
bearing for the IP-KVM substrate); B-0776 (simplest-first plugin
sequence); B-0780 (Local Loop tier-3 substrate needs reachable
clusters); B-0786 (simplest-first discipline applied at every choice);
B-0778 (commodity hardware reference); .claude/rules/human-audit-and-
legal-risk-acceptance-pattern-in-settings.md Shape A
(hashedPassword-in-per-host-module).

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 04:02
@AceHack AceHack enabled auto-merge (squash) May 26, 2026 04:02
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@AceHack AceHack merged commit ddcc988 into main May 26, 2026
30 of 31 checks passed
@AceHack AceHack deleted the otto-cli/iter4-ssh-password-substrate-b0789-2026-05-26 branch May 26, 2026 04:05
AceHack added a commit that referenced this pull request May 26, 2026
…SP + zeta-install.sh USB probe → zero-typing SSH access on first cluster boot (#5083)

The maintainer's actually-usable iter-4 path (v1 was scaffolding-only;
this PR is the workflow Aaron will test against). Per Aaron's 2026-05-26
discipline signals:

1. "we can do what's going to make cluster setup eaiser for me and not
   users if that's ssh lets do that first cause we want to get ai
   running the cluster asap" — iter-4 authorized
2. "i can wait for 4.2 or whatever version before we try again" —
   downgraded v1 to scaffolding; this PR is what Aaron flashes
3. "--no-creds is basically useless right?" — opt-out removed from
   recommended path (kept as --no-inject escape hatch only)
4. "whenever i have to ferry commands by reading and typing i'm going
   to avoid it like the plague and try to get like pictures and auto
   run and short commands pre built in" — design discipline: ALL
   diagnostics auto-fire in-place + are photo-friendly; zero operator-
   typed commands beyond `zflash`

Files:

- full-ai-cluster/tools/flash-usb.ts: added `--no-eject` flag (4 lines)
  so zflash can do post-flash ESP-mount-and-write before the USB ejects.
  Allowlist updated per the Copilot P0 catch about destructive-tool
  flag validation. Help text mentions iter-4.2 use case

- full-ai-cluster/tools/zflash.ts: extended with post-flash macOS-side
  ESP-mount-and-write step:
    * Default reads ~/.ssh/id_ed25519.pub
    * --ssh-key <path> overrides
    * --no-inject opt-out (escape hatch only)
    * Re-scans external disks post-flash (flash-usb's single-USB-only
      requirement guarantees exactly one external disk)
    * Identifies FAT/EFI partition via `diskutil list` regex match
      (DOS_FAT / EFI / MS-DOS / FAT16 / FAT32 / Windows_FAT)
    * Mounts via `diskutil mount`; gets mount point from `diskutil info`
    * Writes <mount>/zeta-authorized-keys.pub via `sudo tee` (stdin
      avoids shell-quoting hazards)
    * Unmounts + ejects when done
    * dumpDiagnostics() helper auto-runs on any failure path:
      diskutil list external + mounted /Volumes/* + "what to do next"
      suggestions. Compact + photo-friendly per the design discipline

- full-ai-cluster/usb-nixos-installer/zeta-install.sh: added step 6.5
  pre-install pubkey probe + injection:
    * Try 1: scan /iso /run /mnt /boot for zeta-authorized-keys.pub
      via `sudo find -maxdepth 5`
    * Try 2: probe USB partitions (/dev/sd? /dev/nvme?n? /dev/vd?
      /dev/mmcblk?, minus install targets) via vfat-readonly mount +
      file existence check. Partition suffix handling: 1/2 on sd/vd;
      p1/p2 on nvme/mmcblk
    * If found: writes operator-ssh-keys.nix with valid ssh-* lines
      from the file BEFORE nixos-install
    * If not found: diagnostics auto-fire (external block devices,
      install targets, full lsblk, "what to do next") + falls back to
      v1 stub
    * Post-install credentials echo branches on INJECT_OK: success
      path says "SSH works immediately"; fallback keeps v1 manual-
      edit + nixos-rebuild instructions
    * shellcheck clean (fixed SC2261 redundant stderr redirect)

- docs/backlog/P1/B-0789-*.md: updated iter-4.2 acceptance to reflect
  what shipped: [x] flash-usb.ts --no-eject; [x] zflash.ts ESP inject;
  [x] zeta-install.sh probe + inject + branched credentials echo;
  [ ] maintainer flashes + tests on PC; [ ] if failure: photo-driven
  fix-forward workflow per the maintainer's explicit design choice

Composes with PR #5080 (iter-4 v1 scaffolding: initial-password.nix +
operator-ssh-keys.nix stub + per-host imports) which this PR builds on.
The zero-typing target: `zflash` → boot USB on PC → install → SSH-able
as zeta@<hostname> from the maintainer's Mac using the existing
~/.ssh/id_ed25519 key. Failure path: photo of auto-diagnostics →
AI fixes-forward.

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
@AceHack AceHack review requested due to automatic review settings May 26, 2026 04:24
AceHack added a commit that referenced this pull request May 26, 2026
…Avahi mDNS + per-node hostname injection (decouple from role-stack) (Aaron 2026-05-26) (#5103)

* feat(B-0792 iter-5.1): persist NetworkManager profiles from live installer to installed system + enable Avahi mDNS publishing

Aaron 2026-05-26 surfaced after iter-4.2 PC1 empirical test:

> "we won't have ethernet for most machines it needs to
> remember the wifi on setup"

> "completely self contained usb we already try eth for 30
> seconds and then ask for wifi we just need to remember it
> afterwards"

REVISED iter-5.1 design (NO operator-side credential pipeline;
no keychain extract; no JSON file; no CLI flags): completely
self-contained on USB. Existing flow already works through nmtui:

1. zeta-first-boot.sh waits 30s for ethernet DHCP
2. If absent, launches nmtui (single TUI form on cluster console)
3. Operator picks wifi SSID + enters password ONCE
4. NetworkManager writes profile to
   /etc/NetworkManager/system-connections/<ssid>.nmconnection
5. Installer connects + runs zeta-install.sh
6. **BUG (today)**: nixos-install installs fresh system that
   inherits NetworkManager service but NOT the operator's
   connection profile
7. **FIX (iter-5.1)**: zeta-install.sh copies *.nmconnection
   files from live installer to /mnt before nixos-install runs
8. Reboot → installed NixOS NetworkManager loads the profile →
   wifi reconnects automatically

Two changes:

1. full-ai-cluster/usb-nixos-installer/zeta-install.sh:
   New Step 6.5 (before nixos-install): detect + copy
   /etc/NetworkManager/system-connections/*.nmconnection from
   live installer to /mnt/etc/NetworkManager/system-connections/.
   chmod 0600 + chown root:root (NM requires; else profiles
   silently ignored at boot with "permissions not strict enough"
   warning in journalctl). Photo-friendly disclosure per profile:
   "[iter-5.1]   persisted: <name>.nmconnection (ssid=<ssid>)".
   Never prints psk. Skips cleanly if /etc/NetworkManager/
   system-connections doesn't exist OR has no .nmconnection
   files (ethernet-DHCP path; no profile to copy).

2. full-ai-cluster/nixos/modules/common.nix:
   Enable services.avahi for mDNS publishing so
   `ssh zeta@control-plane.local` resolves from operator Mac
   (Bonjour) and Linux peers (nss-mdns) on the LAN without
   IP-discovery step. Empirical anchor: 2026-05-26 iter-4.2
   test surfaced the gap when SSH-by-hostname.local failed
   even though node was up.

   services.avahi = {
     enable = true;
     nssmdns4 = true;
     openFirewall = true;  # 5353/udp
     publish = {
       enable = true;
       addresses = true;
       workstation = true;
       domain = true;
     };
   };

Composes with iter-4.x (#5080#5083#5086#5088#5091#5093#5099) substrate. Acceptance: empirical wifi-only mini-PC bring-
up → nmtui-once at install → reboot → ssh zeta@<hostname>.local
zero-typing zero-console from operator Mac.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(B-0792 iter-5.2): per-node hostname injection — decouple hostname from role-stack (Aaron 2026-05-26)

Aaron 2026-05-26 architectural framing:

> "make any multi node changes we need to like think though
> mdns names when we have two control planes"

> "since our different roles are multi install you can be
> control plane AND gpu node AND cpu node these distinctions
> are not very elegant and host names tied to them are not
> great either"

Bug: every node installed from --flake .#control-plane gets
hostname "control-plane" (baked in flake config); two such
nodes collide on mDNS (Avahi auto-renames second to
"control-plane-2.local" but underlying NixOS hostname stays
"control-plane" — confusing in logs / journalctl / kubectl /
node-labels). And role-tied hostname pattern is architecturally
broken — a single node can be control-plane AND gpu-worker
AND storage simultaneously.

iter-5.2 fix: SEPARATE hostname identity from role-stack
selection. Three changes:

1. nixos/modules/injected-hostname.nix (NEW): NixOS module that
   reads /etc/zeta/cluster-node-id at evaluation time + overrides
   networking.hostName via lib.mkOverride 50. If file doesn't
   exist OR is empty OR invalid, the per-host flake config
   default stays in effect — backward-compatible with
   single-node zero-typing path.

2. nixos/modules/common.nix: import injected-hostname.nix so
   EVERY host (control-plane, worker-gpu, worker-template,
   future configs) gets the override capability transitively
   via common.nix's existing import-chain.

3. tools/zflash.ts: add --host <name> flag with RFC1123
   validation at flag-parse time (alphanumeric + hyphens,
   1-63 chars, no leading/trailing hyphen). When passed,
   write zeta-hostname.txt to USB ESP in the same mount
   session as zeta-authorized-keys.pub (covered by same sudo
   timestamp window; no additional Touch ID).

4. usb-nixos-installer/zeta-install.sh: new Step 6.4 (before
   nixos-install) — probe USB for zeta-hostname.txt; if
   present + valid RFC1123, write to /mnt/etc/zeta/
   cluster-node-id (mode 0644). injected-hostname.nix module
   picks it up at NixOS evaluation time. Backward-compatible:
   if no zeta-hostname.txt, flake default stays.

Empirical UX:

  # Single-node, zero-typing (today's path; unchanged):
  zflash
  # → hostname stays 'control-plane'; ssh zeta@control-plane.local

  # Multi-node, one short flag per USB:
  zflash --host pikachu      # → ssh zeta@pikachu.local
  zflash --host charizard    # → ssh zeta@charizard.local
  zflash --host bulbasaur    # → ssh zeta@bulbasaur.local
  # No flake explosion; all three install from .#control-plane
  # role-stack but each gets unique hostname + mDNS announcement.

The architectural concern Aaron raised (role-as-capability
composition; one node = control-plane AND gpu-worker AND
storage) is BEYOND iter-5.2 scope — refactor of
nixos/hosts/<role>/configuration.nix → composable
nixos/modules/role-*.nix capability modules — filed separately
as B-0793 follow-up.

Composes with iter-5.1 in same PR; together they ship
"completely self-contained USB" per Aaron's discipline:
nmtui-once-at-install for wifi, --host <name>-at-zflash for
multi-node identity, NetworkManager profile persistence +
hostname injection at install time, mDNS publishing for
zero-IP-discovery SSH-by-hostname.local from operator Mac.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(B-0792 iter-5.1+5.2): 4 Copilot findings — Step 6.5 dup, personal-name attribution, nullglob comment, SSID truncation on '='

- P1: rename Step 6.4/6.5 to Step 6.6/6.7 (existing Step 6.5 for
  iter-4.2 pubkey probe at line 229; renumber my additions to
  avoid ambiguous labels in install logs)
- P1: replace "Aaron 2026-05-26" with "the maintainer 2026-05-26"
  in 2 comment blocks (repo convention: role-based attribution
  in non-history surfaces)
- P2: update nullglob comment — code uses find not glob; describe
  that find + filter handles empty-dir naturally without
  nullglob shell option
- P2: SSID extraction from .nmconnection — replace
  `awk -F= '/^ssid=/{print $2}'` (truncates at first '=') with
  `sed -n 's/^ssid=//p'` (preserves SSIDs containing '=' per
  802.11 spec). Log accuracy fix for all SSIDs.

shellcheck clean post-fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant