fix(B-0835 Bug 1): hostname injection — symlink + --impure so flake eval reads cluster-node-id (same bug class as Bug 3b)#5354
Merged
AceHack merged 2 commits on May 26, 2026
Pull request overview
Fixes B-0835 Bug 1 where the live-ISO nixos-install --flake ... evaluation couldn’t see the install-target’s generated /mnt/etc/zeta/cluster-node-id, causing the system to fall back to the flake default networking.hostName (e.g., control-plane) instead of the per-node node-<6hex>.
Changes:
- Pre-stages a live-ISO `/etc/zeta/cluster-node-id` symlink pointing at `/mnt/etc/zeta/cluster-node-id` before running `nixos-install`.
- Adds `--impure` to `nixos-install` so Nix evaluation can read the absolute `/etc/zeta/cluster-node-id` path via `builtins.pathExists`/`builtins.readFile`.
- Removes the created symlink after `nixos-install` completes.
…zeta + --impure so flake eval reads cluster-node-id

ROOT CAUSE: same bug class as Bug 3b (password). injected-hostname.nix reads /etc/zeta/cluster-node-id via builtins.pathExists + builtins.readFile at NixOS evaluation time (flake build-time). During nixos-install from live ISO:

- zeta-install.sh Step 6.6 writes /mnt/etc/zeta/cluster-node-id ✓
- Flake eval reads /etc/zeta/cluster-node-id (LIVE ISO context; absent)
- Module falls through to flake's hardcoded networking.hostName
- Operator gets flake-default hostname (e.g., "control-plane") instead of unique node-<6hex> that iter-5.2.2 generated

FIX (different from Bug 3b's activation-script approach because hostname CANNOT cleanly change at activation; many services bake hostname at build time):

1. Symlink /mnt/etc/zeta/cluster-node-id → /etc/zeta/cluster-node-id BEFORE nixos-install runs. Makes the file visible at the path injected-hostname.nix expects during flake eval phase.
2. Add --impure flag to nixos-install so flake pure-mode allows builtins.pathExists + builtins.readFile on the non-store path.
3. Cleanup the symlink AFTER nixos-install (no dangling reference if /mnt is unmounted before reboot).

Subsequent rebuilds on the installed system work without the symlink because /etc/zeta/cluster-node-id IS on the installed root filesystem (written by the install).

Empirical anchor: operator 2026-05-26 physical hardware-support test showed "control-plane login:" instead of unique node-<6hex>.

Safety:

- Only impure read is operator-chosen hostname (not a secret)
- Other modules (initial-password.nix per Bug 3b fix) use activation-scripts so they don't need --impure
- Symlink-then-cleanup is idempotent + reversible

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
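The path mismatch and the symlink workaround described in this commit can be reproduced in a sandbox. The following is only a sketch under stated assumptions: temporary directories stand in for the live-ISO `/etc/zeta` and the target `/mnt/etc/zeta`, `fake_eval` is a hypothetical stand-in for what `injected-hostname.nix` does at evaluation time, and the file is assumed to hold the hostname itself (`node-a1b2c3` is an invented value). It does not call `nixos-install`.

```shell
#!/bin/sh
# Sketch only: sandbox directories replace the real live-ISO and /mnt paths.
set -eu

SANDBOX="$(mktemp -d)"
LIVE_ETC="$SANDBOX/etc/zeta"        # stands in for /etc/zeta on the live ISO
TARGET_ETC="$SANDBOX/mnt/etc/zeta"  # stands in for /mnt/etc/zeta
mkdir -p "$LIVE_ETC" "$TARGET_ETC"

# Stand-in for zeta-install.sh Step 6.6: the node id is written under /mnt only.
printf 'node-a1b2c3\n' > "$TARGET_ETC/cluster-node-id"

# Hypothetical stand-in for the evaluation-time lookup: read the live path if
# it exists, otherwise fall back to the flake default hostname.
fake_eval() {
  if [ -e "$LIVE_ETC/cluster-node-id" ]; then
    cat "$LIVE_ETC/cluster-node-id"
  else
    echo "control-plane"
  fi
}

BEFORE="$(fake_eval)"   # no symlink yet: the fallback is used

# Pre-stage the symlink at the path the module reads.
ln -s "$TARGET_ETC/cluster-node-id" "$LIVE_ETC/cluster-node-id"
DURING="$(fake_eval)"   # the real step here would be nixos-install with --impure

# Remove the symlink again; the file on the target stays in place.
rm -f "$LIVE_ETC/cluster-node-id"
AFTER="$(fake_eval)"
if [ -f "$TARGET_ETC/cluster-node-id" ]; then TARGET_KEPT=yes; else TARGET_KEPT=no; fi

echo "before=$BEFORE during=$DURING after=$AFTER target_kept=$TARGET_KEPT"
rm -rf "$SANDBOX"
```

Only the middle lookup sees the per-node value, which matches the commit's point that the symlink has to exist for the duration of the evaluation and can be removed afterwards.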
…d files (cluster-node-id + operator-authorized-keys) + log/comment accuracy

4 legitimate Copilot findings on prior #5354 commit, all real:

1. **Trap-based cleanup**: prior cleanup only fired on success path. If nixos-install fails OR Ctrl-C, /etc/zeta/cluster-node-id symlink would persist + dangle when /mnt is unmounted. FIX: trap EXIT handler runs cleanup on ALL exit paths (success/failure/signal). Defense-in-depth via explicit cleanup at end too.
2. **Misleading log message**: prior "symlinking $X → /etc/zeta/..." printed even when no symlink was actually created. FIX: move log inside the maybe_symlink helper so it only prints on actual creation.
3. **Comment vs code mismatch**: prior comment said "Symlinking /mnt/etc/zeta → /etc/zeta" (directory-level) but code only handled the single cluster-node-id file. FIX: rewrote comment to match per-file approach + named all affected modules.
4. **Safety note wrong about operator-authorized-keys.nix**: prior note claimed "initial-password.nix doesn't use builtins.readFile" but didn't acknowledge that operator-authorized-keys.nix DOES use builtins.readFile on /etc/zeta/operator-authorized-keys. With --impure now active, that module ALSO needs the symlink-or-it-silently-loses-iter-5.4.0-pubkeys. FIX: extended the symlink approach to operator-authorized-keys too + updated safety note to correctly distinguish all 3 modules:
   - injected-hostname.nix → symlinked (Bug 1 fix)
   - operator-authorized-keys.nix → symlinked (sibling-bug-class)
   - initial-password.nix → activation-script (Bug 3b fix)

Helper function maybe_symlink() centralizes the "create only if file exists AND destination doesn't" logic; trap handler removes only files this script created.

Composes with PR #5351 (Bug 3b activation-script): together the two PRs fix all 3 instances of the build-time-eval-vs-install-time-write path-mismatch bug class.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
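A minimal sketch of the helper-plus-trap shape this commit describes, assuming sandbox directories in place of `/etc/zeta` and `/mnt/etc/zeta`. The name `maybe_symlink` comes from the commit message; `cleanup_links`, `run_install`, `CREATED_LINKS`, and the simulated failure are hypothetical and are not taken from `zeta-install.sh`.

```shell
#!/bin/sh
# Sketch only: sandbox paths and a simulated install failure, no real nixos-install.
set -eu

SANDBOX="$(mktemp -d)"
LIVE_ETC="$SANDBOX/etc/zeta"        # stands in for /etc/zeta on the live ISO
TARGET_ETC="$SANDBOX/mnt/etc/zeta"  # stands in for /mnt/etc/zeta
mkdir -p "$LIVE_ETC" "$TARGET_ETC"
printf 'node-a1b2c3\n' > "$TARGET_ETC/cluster-node-id"
# operator-authorized-keys is deliberately absent to exercise the skip path.

CREATED_LINKS=""   # newline-separated list of links created by this script

# Create the link only if the source exists and the destination does not,
# and log only when a link was actually created.
maybe_symlink() {
  src="$TARGET_ETC/$1"
  dst="$LIVE_ETC/$1"
  if [ -e "$src" ] && [ ! -e "$dst" ] && [ ! -L "$dst" ]; then
    ln -s "$src" "$dst"
    CREATED_LINKS="$CREATED_LINKS$dst
"
    echo "symlinked $dst -> $src"
  fi
}

# Remove only the links recorded above; pre-existing files are never touched.
cleanup_links() {
  printf '%s' "$CREATED_LINKS" | while IFS= read -r link; do
    if [ -L "$link" ]; then rm -f "$link"; fi
  done
  CREATED_LINKS=""
}

# The subshell plays the role of the installer process: its EXIT trap runs
# whether the body succeeds or fails.
run_install() {
  (
    trap cleanup_links EXIT
    maybe_symlink cluster-node-id
    maybe_symlink operator-authorized-keys   # skipped: source file is missing
    exit 1                                   # simulated nixos-install failure
  )
}

STATUS=0
run_install || STATUS=$?
if [ -L "$LIVE_ETC/cluster-node-id" ]; then LEFTOVER=yes; else LEFTOVER=no; fi
echo "install status=$STATUS leftover_symlink=$LEFTOVER"
rm -rf "$SANDBOX"
```

Recording each created link, rather than deleting fixed paths, is what lets the handler honor the "removes only files this script created" constraint from the commit message.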
8357657 to 044c34c
AceHack added a commit that referenced this pull request May 27, 2026
…ntinels (#5365)

* ci(B-0831 layer-1): extend audit-installer-substrate with iter-5.4 sentinels (gh auth setup-git, ssh-key stderr-capture, self-reg flow, ClusterNode YAML schema, MAC parsing)

Layer 1 of a 4-layer CI testing approach for the iter-5.4 substrate (B-0812 self-registration + B-0813 cluster reconciliation + B-0835 bug fixes Bug 2a + 2b on #5364):

- Layer 1 (THIS PR): source-level sentinel audit (cheap; catches regression)
- Layer 2 (next PR): behavioral test with mock gh shim on PATH
- Layer 3 (B-0833 Approach A): mock GH device-code endpoint
- Layer 4 (B-0831 cascade #6): QEMU full-install + cluster auto-join

This layer extends the existing REQUIRED_SENTINELS for full-ai-cluster/usb-nixos-installer/zeta-install.sh with 14 new substrings, organized into 3 groups:

(a) iter-5.4 flow anchors (5 sentinels):
- "Step 6.8: iter-5.4.0 homelab gh-auth + operator pubkey copy"
- "Step 6.9: iter-5.4.1 self-registration commit+push"
- "gh auth login"
- "gh ssh-key list"
- "gh repo clone Lucent-Financial-Group/Zeta"

(b) Bug 2a + 2b fix-regression catches (3 sentinels):
- "gh auth setup-git": Bug 2a fix; presence catches removal
- "SSH_KEY_ERR_FILE": Bug 2b fix; presence catches stderr-capture removal
- "admin:public_key": Bug 2b fix; presence catches scope-recovery message removal

(c) ClusterNode YAML schema sentinels (5 sentinels; catches the Copilot findings on #5352 where spec.role was scalar, spec.maintainer was at wrong path, spec.storage was a sibling instead of under hardware block):
- "apiVersion: zeta.lucent-financial-group.com/v1"
- "kind: ClusterNode"
- " roles:": spec.roles is ARRAY per B-0813
- " registration:": spec.registration block per B-0813
- " hardware:": spec.hardware block per B-0813

(d) Hardware-probe sentinels (catches MAC parsing regression from #5352):
- "/proc/cpuinfo": CPU_MODEL extraction
- "link/ether": MAC parses field after link/ether (not before)

(e) Self-reg branch-shape sentinel:
- "register-${NODE_HOSTNAME}-": iter-5.4.1 branch name pattern

Composes with:
- PR #5364 (Bug 2a + 2b fixes that this audit will catch if regressed)
- PR #5352 (iter-5.4.1 Copilot findings that this audit will catch)
- PR #5354 (Bug 1 hostname symlink fix; already covered by existing sentinels)
- B-0831 (cascade #6 full-install QEMU test; this is layer 1 of that work)
- B-0833 (interactive-login vs baked-in-keys tension; layer 3 of the cascade)

Why source-level + cheap-first:
- Workflow build-ai-cluster-iso.yml runs `bun tools/ci/audit-installer-substrate.ts` on every PR touching the installer surface
- Source-level catches substrate-regression at PR-author-time (seconds)
- vs Layer 4 QEMU full-install (~minutes; expensive; flaky)
- Layer 1 is the inner loop; Layers 2-4 are the outer loops

Per `.claude/rules/verify-existing-substrate-before-authoring.md`: substrate-inventory pass found `tools/ci/audit-installer-substrate.ts` already has the REQUIRED_SENTINELS pattern for iter-4.2 + iter-5.1 + iter-5.2 + iter-5.2.2; this PR extends with iter-5.4 sentinels rather than minting parallel substrate.

Verified: `bun tools/ci/audit-installer-substrate.ts` exits 0 ("PASS — 10 required files + 5 sentinel-file assertions OK") with the extended sentinel list against the current installer script at origin/main HEAD (commit 19d9617 from #5364).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* fix(#5365 Copilot): reflow iter-5.4.1 YAML schema sentinels comment so parenthesis closes on first line

Copilot finding on the audit-installer-substrate.ts iter-5.4 sentinel addition: the comment 'iter-5.4.1 YAML schema sentinels (catches the Copilot findings from #5352' opened a parenthesis on line 98 that didn't close until line 100 ('block)'). To a code-reader scanning line 98, the sentence reads as unfinished.

Fix: restructure as 'sentinels. Each catches a specific Copilot finding on PR #5352: ...' with no multi-line parenthesis; each schema-correction is a complete clause.

---------

Co-authored-by: Lior <lior@zeta.dev>
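The source-level sentinel idea can be illustrated with a fixed-string search. This is a shell analogue written for illustration, not the TypeScript audit in `tools/ci/audit-installer-substrate.ts`: the stand-in script contents are invented, only four of the sentinels named above are checked, and the `RESULT`, `MISSING`, and `MISSING_COUNT` variables are hypothetical.

```shell
#!/bin/sh
# Sketch only: a grep-based stand-in for a required-substring audit.
set -eu

# Invented stand-in for the installer script under audit.
SCRIPT="$(mktemp)"
cat > "$SCRIPT" <<'EOF'
gh auth login
gh auth setup-git
ip -o link show | grep link/ether
EOF

MISSING=""        # newline-separated sentinels that were not found
MISSING_COUNT=0

# Each line of the here-document is one required substring. grep -F matches it
# literally, so characters such as "/" or "$" need no escaping.
while IFS= read -r sentinel; do
  if ! grep -F -q -- "$sentinel" "$SCRIPT"; then
    MISSING="$MISSING$sentinel
"
    MISSING_COUNT=$((MISSING_COUNT + 1))
  fi
done <<'EOF'
gh auth login
gh auth setup-git
link/ether
SSH_KEY_ERR_FILE
EOF

if [ "$MISSING_COUNT" -eq 0 ]; then
  RESULT="PASS"
else
  RESULT="FAIL"
  printf 'missing sentinel(s):\n%s' "$MISSING"
fi
echo "$RESULT"
rm -f "$SCRIPT"
```

Because the stand-in script lacks `SSH_KEY_ERR_FILE`, the sketch reports that one sentinel as missing, which is the kind of removal the audit is meant to surface. A CI job would normally exit non-zero on `FAIL`; the sketch only records the result.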
Summary
Fixes B-0835 Bug 1 — login banner showed `control-plane login:` instead of unique `node-<6hex>`. Same bug class as Bug 3b (build-time-eval vs install-time-write path mismatch).
Root cause
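`injected-hostname.nix` reads `/etc/zeta/cluster-node-id` via `builtins.pathExists` + `builtins.readFile` at NixOS evaluation time. During `nixos-install` from live ISO:
- `zeta-install.sh` Step 6.6 writes `/mnt/etc/zeta/cluster-node-id`
- Flake eval reads `/etc/zeta/cluster-node-id` (live-ISO context; absent)
- The module falls through to the flake's hardcoded `networking.hostName`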
Fix
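Different from Bug 3b's activation-script approach because the hostname cannot cleanly change at activation (many services bake the hostname in at build time).
1. Symlink the live-ISO `/etc/zeta/cluster-node-id` to `/mnt/etc/zeta/cluster-node-id` before `nixos-install` runs.
2. Add `--impure` to `nixos-install` so evaluation may call `builtins.pathExists` + `builtins.readFile` on the non-store path.
3. Remove the symlink after `nixos-install`.

Subsequent rebuilds on the installed system work without the symlink, because the file is on the installed root filesystem after the install.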
Safety
Test plan
🤖 Generated with Claude Code