feat(B-0792 iter-5.1+5.2): self-contained USB — NM-profile persist + Avahi mDNS + per-node hostname injection (decouple from role-stack) (Aaron 2026-05-26)#5103
Merged
AceHack merged 3 commits on May 26, 2026
…aller to installed system + enable Avahi mDNS publishing
Aaron 2026-05-26 surfaced after iter-4.2 PC1 empirical test:
> "we won't have ethernet for most machines it needs to
> remember the wifi on setup"
> "completely self contained usb we already try eth for 30
> seconds and then ask for wifi we just need to remember it
> afterwards"
REVISED iter-5.1 design (NO operator-side credential pipeline;
no keychain extract; no JSON file; no CLI flags): completely
self-contained on USB. Existing flow already works through nmtui:
1. zeta-first-boot.sh waits 30s for ethernet DHCP
2. If absent, launches nmtui (single TUI form on cluster console)
3. Operator picks wifi SSID + enters password ONCE
4. NetworkManager writes profile to
/etc/NetworkManager/system-connections/<ssid>.nmconnection
5. Installer connects + runs zeta-install.sh
6. **BUG (today)**: nixos-install installs fresh system that
inherits NetworkManager service but NOT the operator's
connection profile
7. **FIX (iter-5.1)**: zeta-install.sh copies *.nmconnection
files from live installer to /mnt before nixos-install runs
8. Reboot → installed NixOS NetworkManager loads the profile →
wifi reconnects automatically
Two changes:
1. full-ai-cluster/usb-nixos-installer/zeta-install.sh:
New Step 6.5 (before nixos-install): detect + copy
/etc/NetworkManager/system-connections/*.nmconnection from
live installer to /mnt/etc/NetworkManager/system-connections/.
chmod 0600 + chown root:root (NM requires; else profiles
silently ignored at boot with "permissions not strict enough"
warning in journalctl). Photo-friendly disclosure per profile:
"[iter-5.1] persisted: <name>.nmconnection (ssid=<ssid>)".
Never prints psk. Skips cleanly if /etc/NetworkManager/
system-connections doesn't exist OR has no .nmconnection
files (ethernet-DHCP path; no profile to copy).
2. full-ai-cluster/nixos/modules/common.nix:
Enable services.avahi for mDNS publishing so
`ssh zeta@control-plane.local` resolves from operator Mac
(Bonjour) and Linux peers (nss-mdns) on the LAN without
IP-discovery step. Empirical anchor: 2026-05-26 iter-4.2
test surfaced the gap when SSH-by-hostname.local failed
even though node was up.
    services.avahi = {
      enable = true;
      nssmdns4 = true;
      openFirewall = true;  # 5353/udp
      publish = {
        enable = true;
        addresses = true;
        workstation = true;
        domain = true;
      };
    };
Composes with iter-4.x (#5080→#5083→#5086→#5088→#5091→#5093→
#5099) substrate. Acceptance: empirical wifi-only mini-PC bring-
up → nmtui-once at install → reboot → ssh zeta@<hostname>.local
zero-typing zero-console from operator Mac.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
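The Step 6.5 copy logic described in this commit message can be sketched as a pure planning function. This is a minimal illustration in TypeScript, not the shipped code: the real step is shell inside zeta-install.sh, and the names `planProfileCopy` and `CopyAction` are hypothetical. It only shows the filter-and-skip behaviour (no directory or no `.nmconnection` files means nothing to copy) and the 0600 root:root requirement.

```typescript
// Hypothetical planning helper; the real Step 6.5 lives in zeta-install.sh (shell).
interface CopyAction {
  src: string;
  dest: string;
  mode: number; // 0o600: NetworkManager ignores profiles with looser permissions
  owner: string;
}

const LIVE_DIR = "/etc/NetworkManager/system-connections";
const TARGET_DIR = "/mnt/etc/NetworkManager/system-connections";

// entries: file names found in LIVE_DIR, or null when the directory does not exist
// (the ethernet-DHCP path, where there is no profile to copy).
function planProfileCopy(entries: string[] | null): CopyAction[] {
  if (entries === null) return [];
  return entries
    .filter((name) => name.endsWith(".nmconnection"))
    .map((name) => ({
      src: `${LIVE_DIR}/${name}`,
      dest: `${TARGET_DIR}/${name}`,
      mode: 0o600,
      owner: "root:root",
    }));
}
```

An empty result corresponds to the "skips cleanly" branch; each returned action corresponds to one persisted profile.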
…e from role-stack (Aaron 2026-05-26)

Aaron 2026-05-26 architectural framing:
> "make any multi node changes we need to like think though
> mdns names when we have two control planes"
> "since our different roles are multi install you can be
> control plane AND gpu node AND cpu node these distinctions
> are not very elegant and host names tied to them are not
> great either"

Bug: every node installed from --flake .#control-plane gets hostname "control-plane" (baked in flake config); two such nodes collide on mDNS (Avahi auto-renames the second to "control-plane-2.local" but the underlying NixOS hostname stays "control-plane", which is confusing in logs / journalctl / kubectl / node-labels). And the role-tied hostname pattern is architecturally broken: a single node can be control-plane AND gpu-worker AND storage simultaneously.

iter-5.2 fix: SEPARATE hostname identity from role-stack selection.

Four changes:

1. nixos/modules/injected-hostname.nix (NEW): NixOS module that reads /etc/zeta/cluster-node-id at evaluation time + overrides networking.hostName via lib.mkOverride 50. If the file doesn't exist OR is empty OR invalid, the per-host flake config default stays in effect; backward-compatible with single-node zero-typing path.
2. nixos/modules/common.nix: import injected-hostname.nix so EVERY host (control-plane, worker-gpu, worker-template, future configs) gets the override capability transitively via common.nix's existing import-chain.
3. tools/zflash.ts: add --host <name> flag with RFC1123 validation at flag-parse time (alphanumeric + hyphens, 1-63 chars, no leading/trailing hyphen). When passed, write zeta-hostname.txt to USB ESP in the same mount session as zeta-authorized-keys.pub (covered by same sudo timestamp window; no additional Touch ID).
4. usb-nixos-installer/zeta-install.sh: new Step 6.4 (before nixos-install): probe USB for zeta-hostname.txt; if present + valid RFC1123, write to /mnt/etc/zeta/cluster-node-id (mode 0644). The injected-hostname.nix module picks it up at NixOS evaluation time. Backward-compatible: if no zeta-hostname.txt, flake default stays.

Empirical UX:

    # Single-node, zero-typing (today's path; unchanged):
    zflash
    # → hostname stays 'control-plane'; ssh zeta@control-plane.local

    # Multi-node, one short flag per USB:
    zflash --host pikachu     # → ssh zeta@pikachu.local
    zflash --host charizard   # → ssh zeta@charizard.local
    zflash --host bulbasaur   # → ssh zeta@bulbasaur.local

    # No flake explosion; all three install from .#control-plane
    # role-stack but each gets unique hostname + mDNS announcement.

The architectural concern Aaron raised (role-as-capability composition; one node = control-plane AND gpu-worker AND storage) is BEYOND iter-5.2 scope. The refactor of nixos/hosts/<role>/configuration.nix → composable nixos/modules/role-*.nix capability modules is filed separately as B-0793 follow-up.

Composes with iter-5.1 in same PR; together they ship "completely self-contained USB" per Aaron's discipline: nmtui-once-at-install for wifi, --host <name>-at-zflash for multi-node identity, NetworkManager profile persistence + hostname injection at install time, mDNS publishing for zero-IP-discovery SSH-by-hostname.local from operator Mac.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
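The RFC1123 rule stated for the --host flag (alphanumeric + hyphens, 1-63 chars, no leading/trailing hyphen) can be expressed as a single regular expression. The sketch below is an assumption about shape, not the actual zflash.ts code: `isValidHostLabel` and `parseHostFlag` are hypothetical names, and the real flag parser may differ.

```typescript
// One RFC1123 label: starts and ends with an alphanumeric, hyphens allowed inside,
// total length 1..63. (Hypothetical constant name.)
const RFC1123_LABEL = /^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$/;

function isValidHostLabel(name: string): boolean {
  return RFC1123_LABEL.test(name);
}

// Hypothetical flag-parse step: returns the validated name, undefined when
// --host is absent, and throws at parse time when the value is missing or invalid.
function parseHostFlag(argv: string[]): string | undefined {
  const i = argv.indexOf("--host");
  if (i === -1) return undefined;
  const value = argv[i + 1];
  if (value === undefined || !isValidHostLabel(value)) {
    throw new Error(`--host requires a valid RFC1123 label, got: ${String(value)}`);
  }
  return value;
}
```

Rejecting at flag-parse time means a bad name fails before any USB write happens, which matches the "validation at flag-parse time" wording above.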
Pull request overview
This PR improves first-boot operability for wifi-only cluster installs by persisting NetworkManager connection profiles from the live installer into the installed system, and by enabling Avahi mDNS publishing so hosts are reachable via <hostname>.local without manual IP discovery.
Changes:
- Copy *.nmconnection profiles from the live ISO (/etc/NetworkManager/system-connections/) into the target system (/mnt/etc/NetworkManager/system-connections/) before nixos-install.
- Enable services.avahi with firewall opening and publishing settings in the shared NixOS common.nix module.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| full-ai-cluster/usb-nixos-installer/zeta-install.sh | Adds a pre-install step to persist NetworkManager wifi profiles into /mnt so the installed system can reconnect automatically. |
| full-ai-cluster/nixos/modules/common.nix | Enables Avahi mDNS publishing to support ssh zeta@<hostname>.local hostname-based access on the LAN. |
…-name attribution, nullglob comment, SSID truncation on '='
- P1: rename Step 6.4/6.5 to Step 6.6/6.7 (existing Step 6.5 for
iter-4.2 pubkey probe at line 229; renumber my additions to
avoid ambiguous labels in install logs)
- P1: replace "Aaron 2026-05-26" with "the maintainer 2026-05-26"
in 2 comment blocks (repo convention: role-based attribution
in non-history surfaces)
- P2: update nullglob comment — code uses find not glob; describe
that find + filter handles empty-dir naturally without
nullglob shell option
- P2: SSID extraction from .nmconnection — replace
`awk -F= '/^ssid=/{print $2}'` (truncates at first '=') with
`sed -n 's/^ssid=//p'` (preserves SSIDs containing '=' per
802.11 spec). Log accuracy fix for all SSIDs.
shellcheck clean post-fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
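The SSID truncation fixed in this commit comes from splitting on every '=' instead of stripping only the `ssid=` prefix. The TypeScript below mimics both behaviours for illustration; the shipped fix is the `sed` one-liner quoted above, and both function names here are hypothetical. For brevity each function returns only the first matching line.

```typescript
// Mimics awk -F= '/^ssid=/{print $2}': only the second '='-separated field survives.
function ssidViaFieldSplit(profile: string): string | undefined {
  for (const line of profile.split("\n")) {
    if (line.startsWith("ssid=")) return line.split("=")[1];
  }
  return undefined;
}

// Mimics sed -n 's/^ssid=//p': strip the prefix and keep the rest of the line intact.
function ssidViaPrefixStrip(profile: string): string | undefined {
  for (const line of profile.split("\n")) {
    if (line.startsWith("ssid=")) return line.slice("ssid=".length);
  }
  return undefined;
}

// Made-up profile fragment whose SSID contains '='.
const sampleProfile = ["[wifi]", "mode=infrastructure", "ssid=Cafe=Guest"].join("\n");
```

With `sampleProfile`, the field-split variant yields `Cafe` while the prefix-strip variant yields `Cafe=Guest`.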
AceHack added a commit that referenced this pull request on May 26, 2026
… operator can rename later via digital-twin (Aaron 2026-05-26) (#5107)

* feat(B-0792 iter-5.2.1): auto-generate node-<6hex> hostname when --host not specified; operator can rename later via digital-twin substrate (Aaron 2026-05-26)

The maintainer 2026-05-26 architectural framing:
> "can we have it auto generate the host name we can change
> later via digital twin after it self registers"

Composes iter-5.2 (--host injection mechanism) with B-0794 (node self-registration / digital-twin substrate). Zero-typing default for operators who don't want to think about names at flash time; rename later via editing the digital-twin node-config YAML once self-registration substrate (B-0794) ships.

Implementation: when operator runs `zflash` without --host flag, generate `node-<6hex>` from 3 random bytes (Web Crypto getRandomValues). 24-bit entropy = ~16M possible names → negligible collision risk for any homelab cluster size + mDNS uniqueness preserved per-node.

Operator-named hostnames (via --host) take priority; auto-gen only fires when --host omitted AND --no-inject NOT set (no ESP write to carry the name anyway in --no-inject path).

Generated name logged CLEARLY pre-flash so operator knows what to ssh to post-install:

    iter-5.2.1: --host not specified; auto-generated hostname: node-a3f9c2
    (rename later via digital-twin substrate per B-0794)
    cluster will be reachable as: ssh zeta@node-a3f9c2.local

Composes with iter-5.1+5.2 (#5103 merged at 6ee3a29) + B-0794 self-registration target. Future iter-5.4+ can extend the auto-gen with memorable-name dictionaries (pokemon, gemstones, fruits, etc.) once the framework is empirically validated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(iter-5.2.1): gate auto-hostname on willInject + drop concrete path to unimplemented B-0794 substrate (Copilot P0 + P1 on #5107)

- P0 race: auto-gen was running BEFORE the pubkey-existence check that sets willInject=false. If pubkey missing, willInject becomes false but auto-gen already happened + already printed "ssh zeta@node-XXXXXX.local", misleading the operator since the hostname never gets written to the USB ESP. Move auto-gen AFTER willInject is finalized + gate on willInject (was: gate on !noInject, which doesn't account for the missing-pubkey path).
- P1 misleading path: comment referenced `maintainers/<name>/cluster-nodes/<node>/` which doesn't exist yet (B-0794 substrate not implemented; current maintainers/aaron/ only has legal-entities/). Reword to point at the B-0794 backlog row instead of a concrete-but-fictional path. Same reword in the printed operator-facing line.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
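The auto-generation and its gate can be sketched as follows. This is an illustrative reconstruction under assumptions, not the zflash.ts source: `hostnameFromBytes`, `generateNodeHostname`, and `resolveHostname` are hypothetical names, and it assumes a runtime where the Web Crypto `crypto` global is available.

```typescript
// Formats 3 bytes as node-<6hex>; kept pure so it is easy to check.
function hostnameFromBytes(bytes: Uint8Array): string {
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
  return `node-${hex}`;
}

// 3 random bytes = 24 bits of entropy, via Web Crypto getRandomValues.
function generateNodeHostname(): string {
  return hostnameFromBytes(crypto.getRandomValues(new Uint8Array(3)));
}

// Hypothetical resolver. It must run AFTER willInject is finalized (after the
// pubkey-existence check), so a name is never generated or printed when nothing
// will be written to the USB ESP.
function resolveHostname(
  explicitHost: string | undefined,
  willInject: boolean,
): string | undefined {
  if (explicitHost !== undefined) return explicitHost; // operator-named wins
  return willInject ? generateNodeHostname() : undefined;
}
```

Keeping the byte formatting separate from the randomness source makes the gating logic testable without depending on random output.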
AceHack added a commit that referenced this pull request on May 26, 2026
…evice-registration substrate; production bootstrap-key-rotation deferred (informs B-0794 iter-5.4) (#5108)

* preserve(mika): Aaron + Mika 2026-05-26 homelab-first gh-auth-login device-registration; production-mode bootstrap-key-rotation deferred

Verbatim preservation of Aaron + Mika voice-mode conversation during iter-5 session. Mika is external Grok-native AI participant per .claude/rules/agent-roster-reference-card.md; co-originator of substantive substrate-engineering input.

Architectural lock-in (Aaron 2026-05-26 final decision):

- HOMELAB MODE FIRST: USB ships with NO embedded credentials; first boot prompts `gh auth login` interactively; operator's GitHub credentials register the machine + clone + set up cluster under operator's account; auto-copies operator's pubkey to authorized_keys. Zero shipped secrets.
- PRODUCTION MODE LATER: ship USB with narrow restricted "bootstrap key" / "registration key" with register-only scope; immediately rotates to per-node identity after first registration succeeds.

Two modes use DIFFERENT USBs (different flakes). Aaron: "different USBs for different audiences. But home lab is what I'm going for first, not production."

Aaron standing direction for next iteration: "we should do it like this for gh and device registration the simple homelab way first but like prod later"

THIS conversation directly informs B-0794 iter-5.4 implementation. Composes with PR #5103 (iter-5.1+5.2 substrate) + PR #5107 (iter-5.2.1 auto-hostname) + B-0792/B-0793/B-0794 backlog rows landed today. Per .claude/rules/substrate-or-it-didnt-happen.md verbatim preservation discipline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(memory): regen MEMORY.md to include Mika 2026-05-26 homelab-gh-auth-login preservation file

* fix(mika preservation): add YAML frontmatter per memory format standard + reconcile 'MERGED' wording with rows' actual status: open state (Copilot P1 ×2 on #5108)

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request on May 26, 2026
… + rows_filed_24h (Aaron 2026-05-26: "per agent so we can see helath like per trajectory") (#5115)

Aaron 2026-05-26 substrate-engineering concern:
> 'we need to make sure that decopose is happening an on going
> backlog log or else infinate backlog is just infnate debt'
> 'the decompose to action is what i want background to show
> with stats over time on the github page we have for plant
> metrics that and also prs, i want that per agent so we can
> see helath like per trajectory'

Extends tools/dashboard/generate-metrics.ts to surface per-agent PR-shipping rate + decompose-to-action ratio in demo/metrics.json (consumed by the Zeta Factory Dashboard at lucent-financial-group.github.io/Zeta/demo/index.html).

Three new per-agent fields:

- prs_merged_24h: PRs this agent merged in 24h window
- rows_filed_24h: PRs whose title matches `backlog(B-NNNN` (row-filing-only PRs, NOT action-on-rows)
- decompose_to_action_ratio: (prs_merged - rows_filed) / max(rows_filed, 1)
  → impl-PRs per row-filing-PR
  → >=1 = strong action-on-rows discipline
  → <1 = filing rows faster than shipping them = debt-accumulation signal

Attribution via branch-prefix lookup (BRANCH_PREFIX_TO_AGENT) per .claude/rules/agent-roster-reference-card.md lane discipline: otto-cli/ + otto-desktop/ + otto-vscode/ + otto/ → Otto; alexa-kiro/ + alexa/ → Alexa; riven-cursor/ + riven/ → Riven; vera-codex/ + vera/ → Vera; lior-antigravity/ + lior-gemini/ + lior/ → Lior. PRs from non-prefixed branches attribute to 'Unknown' bucket (operator-auditable as missing-attribution surface).

EMPIRICAL validation 2026-05-26 (live run):

    Otto: 57 PRs / 30 row-filing → ratio = 0.9 (nearly 1:1; debt signal!)
    Lior: 6 PRs / 0 row-filing → ratio = 6 (all action)
    Others: 0/0/0 (quiet 24h window)

Otto ratio 0.9 EMPIRICALLY VALIDATES Aaron's concern: this session filed 6 substantive rows (B-0791..B-0794, B-0796, B-0797) + shipped 4 implementation PRs (#5103 iter-5.1+5.2, #5107 iter-5.2.1, #5113 iter-5.2.2, #5110 draft), so ratio < 1. The metric now exposes the pattern continuously.

Dashboard HTML render of these new fields is follow-on substrate (small UI work). The data layer is the load-bearing first step; operator + Mika can read demo/metrics.json directly until UI lands.

Substrate-honest note: the dashboard generation itself happens on the autonomous-loop cron tick (per B-0414); per-agent stats will update on every tick going forward. Time-series tracking (today's metric vs 7d-ago, 30d-ago) is separate substrate (would need to preserve historical metrics.json snapshots; deferred to follow-on iteration).

Composes with .claude/rules/agent-roster-reference-card.md (branch-prefix attribution), .claude/rules/holding-without-named-dependency-is-standing-by-failure.md (decompose-to-action discipline), B-0797 (autonomous-loop sometimes-task; same substrate-engineering direction).

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
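The ratio formula and the branch-prefix attribution described in this commit can be sketched as two small functions. This is an assumption-laden illustration: the commit names `BRANCH_PREFIX_TO_AGENT` but not its shape, so the array-of-pairs layout and the function names `agentForBranch` and `decomposeToActionRatio` are hypothetical.

```typescript
// Prefix → agent pairs as listed in the commit message; the data layout is assumed.
const BRANCH_PREFIX_TO_AGENT: Array<[string, string]> = [
  ["otto-cli/", "Otto"], ["otto-desktop/", "Otto"], ["otto-vscode/", "Otto"], ["otto/", "Otto"],
  ["alexa-kiro/", "Alexa"], ["alexa/", "Alexa"],
  ["riven-cursor/", "Riven"], ["riven/", "Riven"],
  ["vera-codex/", "Vera"], ["vera/", "Vera"],
  ["lior-antigravity/", "Lior"], ["lior-gemini/", "Lior"], ["lior/", "Lior"],
];

// Branches without a known prefix fall into the 'Unknown' bucket.
function agentForBranch(branch: string): string {
  const hit = BRANCH_PREFIX_TO_AGENT.find(([prefix]) => branch.startsWith(prefix));
  return hit ? hit[1] : "Unknown";
}

// (prs_merged - rows_filed) / max(rows_filed, 1): implementation PRs per row-filing PR.
function decomposeToActionRatio(prsMerged: number, rowsFiled: number): number {
  return (prsMerged - rowsFiled) / Math.max(rowsFiled, 1);
}
```

Plugging in the live-run numbers reproduces the reported values: 57 merged with 30 row-filing gives 27/30 = 0.9, and 6 merged with 0 row-filing gives 6. The `max(rows_filed, 1)` denominator is what keeps the zero-rows case finite.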
AceHack added a commit that referenced this pull request on May 26, 2026
…atches dropped iter-N modules before ~15min Nix build (Aaron 2026-05-26) (#5116)

Aaron 2026-05-26: 'start wroking on the ci stuff while we iterate so you can start iterating without me' + 'any parts we can test in siolate are candidates for more unit like tests instead of full integration tests'.

This PR ships #1 of an ascending test-substrate cascade:

    #1 Source-substrate audit (this PR; ~1s; preflight)
    #2 Unit tests for zflash.ts + shell-logic (next PR)
    #3 ISO content audit (via 7z list; after ISO build)
    #4 NixOS test framework (full VM boot + install round-trip)
    #5 End-to-end CI workflow (hardware-class regression)

The maintainer 2026-05-26 USB flash empirically surfaced two related bugs the audit catches:

(1) Workflow trigger-path filter on build-ai-cluster-iso.yml was `nixos/modules/disko-shapes/**` only, so it missed iter-5.2 (PR #5103 added injected-hostname.nix) + iter-5.2.2 (PR #5113 added login-banner.nix). Result: CI didn't rebuild the ISO when those modules landed; operator downloaded an older ISO via `gh run download` that lacked the iter-5.x substrate.

(2) Even after broadening trigger paths, source-substrate audit is a FLOOR: catches "module file in repo but iter-N sentinel accidentally dropped in a fix-fwd" + "module file removed by mistake". Pure source-level grep; runs in ~1s; no Nix build needed.

Changes:

- NEW tools/ci/audit-installer-substrate.ts (~250 LOC TS):
  - REQUIRED_FILES list (10 expected installer-substrate paths)
  - REQUIRED_SENTINELS list (5 file→sentinel-strings assertions)
  - Exit codes: 0 pass / 1 missing file / 2 missing sentinel
  - Runs locally + in CI; bun tools/ci/audit-installer-substrate.ts
  - Empirical pass on current main substrate
- BROADENED .github/workflows/build-ai-cluster-iso.yml triggers: full-ai-cluster/nixos/disko-shapes/** → full-ai-cluster/nixos/** + full-ai-cluster/tools/** + tools/ci/audit-installer-substrate.ts
- ADDED preflight audit step BEFORE the ~15min nix build (fails fast if substrate is incomplete; saves CI minutes when iter-N modules accidentally get dropped)

Audits performed:

REQUIRED_FILES (10): zeta-install.sh, zeta-first-boot.sh, installer/configuration.nix, initial-password.nix, operator-ssh-keys.nix, operator-ssh-keys.txt, common.nix, injected-hostname.nix, login-banner.nix, zflash.ts

REQUIRED_SENTINELS (5 file→list pairs):

- zeta-install.sh: Step 6.5/6.6/6.7 markers, iter-5.2.2, /dev/urandom
- zeta-first-boot.sh: ETHERNET_WAIT_SECS, nmtui, zeta-install
- common.nix: imports of injected-hostname.nix + login-banner.nix, services.avahi, nssmdns4
- injected-hostname.nix: cluster-node-id, networking.hostName, lib.mkOverride
- login-banner.nix: getty greetingLine + helpLine, Hostname:, ssh zeta@

Adding new iter-N modules: append path to REQUIRED_FILES + sentinels to REQUIRED_SENTINELS in the audit tool. Future-Otto reads this header to discover the pattern.

Follow-on PRs in the test-substrate cascade (per Aaron's direction):

- Unit tests for zflash.ts parseArgs + RFC1123 validation + mountEsp method-selection (Bun test runner; no I/O)
- Docker-based zeta-install.sh test (mocked /dev devices + mocked /iso + /tmp/zeta-boot-esp; tests Step 6.6 + 6.7 logic without VM boot)
- ISO content audit (7z list of built ISO; verifies expected paths + boot config; runs AFTER nix build, before artifact upload)
- NixOS test framework (full QEMU VM boot + install round-trip; asserts pre-login banner, ssh-zero-typing, NM-profile persistence, hostname auto-gen)

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
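The audit's two-phase check and its exit codes (0 pass / 1 missing file / 2 missing sentinel) can be sketched as a pure function over an in-memory file map. This is a minimal sketch under assumptions, not the contents of tools/ci/audit-installer-substrate.ts: the `Substrate` type, `AuditResult`, and `auditSubstrate` are hypothetical, and the real tool reads files from disk.

```typescript
// path -> file content; an in-memory stand-in for reading the repo (assumption).
type Substrate = Map<string, string>;

interface AuditResult {
  exitCode: 0 | 1 | 2; // 0 pass / 1 missing file / 2 missing sentinel
  problems: string[];
}

function auditSubstrate(
  files: Substrate,
  requiredFiles: string[],
  requiredSentinels: Record<string, string[]>,
): AuditResult {
  // Phase 1: every required path must exist.
  const missingFiles = requiredFiles.filter((path) => !files.has(path));
  if (missingFiles.length > 0) {
    return { exitCode: 1, problems: missingFiles.map((p) => `missing file: ${p}`) };
  }
  // Phase 2: every sentinel string must appear in its file (plain substring match).
  const missingSentinels: string[] = [];
  for (const [path, sentinels] of Object.entries(requiredSentinels)) {
    const content = files.get(path) ?? "";
    for (const sentinel of sentinels) {
      if (!content.includes(sentinel)) {
        missingSentinels.push(`missing sentinel in ${path}: ${sentinel}`);
      }
    }
  }
  return missingSentinels.length > 0
    ? { exitCode: 2, problems: missingSentinels }
    : { exitCode: 0, problems: [] };
}
```

Because the check is plain substring matching with no Nix evaluation, it can run as a fast preflight before the long ISO build.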
Summary
Aaron 2026-05-26 architectural framings during iter-4.2 empirical test:
Ships iter-5.1 (wifi persistence + mDNS publishing) + iter-5.2 (per-node hostname decoupling) in one self-contained-USB substrate update.
iter-5.1 — NM-profile persistence + Avahi mDNS
`zeta-install.sh` copies `/etc/NetworkManager/system-connections/*.nmconnection` from live installer to `/mnt` before `nixos-install` runs → wifi credentials persist across reboot. Existing flow (eth-30s → nmtui-once → connect → install) unchanged; iter-5.1 just makes the installed system inherit the connection. `common.nix` enables Avahi mDNS publishing so `ssh zeta@<hostname>.local` resolves from operator Mac (Bonjour) + Linux peers (nss-mdns) without IP-discovery step.
iter-5.2 — per-node hostname injection (decoupled from role-stack)
Three changes:
Bug fixed: today every `--flake .#control-plane` node gets hostname "control-plane"; multi-node collision; mDNS auto-renames second to "control-plane-2.local" but underlying NixOS hostname stays "control-plane" (confusing in logs/journalctl/kubectl/labels).
Empirical UX:
```
# Single-node, zero-typing (today's path; UNCHANGED):
zflash
# → hostname stays 'control-plane'; ssh zeta@control-plane.local

# Multi-node, one short flag per USB:
zflash --host pikachu     # → ssh zeta@pikachu.local
zflash --host charizard   # → ssh zeta@charizard.local
zflash --host bulbasaur   # → ssh zeta@bulbasaur.local

# All three install from .#control-plane role-stack;
# each gets unique hostname + mDNS announcement; zero flake explosion
```
Out of scope (filed separately as B-0793)
The deeper architectural concern Aaron raised ("role-as-capability composition; one node = control-plane AND gpu-worker AND storage simultaneously") requires refactoring `nixos/hosts/<role>/configuration.nix` → composable `nixos/modules/role-*.nix` capability modules. Filed as B-0793 follow-on; substantial refactor; landing as separate iteration.
Composes with
Test plan
🤖 Generated with Claude Code