fix(B-0835 Bug 1): hostname injection — symlink + --impure so flake eval reads cluster-node-id (same bug class as Bug 3b)#5354

Merged
AceHack merged 2 commits into main from otto/b-0835-bug-1-hostname-injected-path-symlink-impure-fix-2026-05-26 on May 26, 2026

@AceHack AceHack commented May 26, 2026

Summary

Fixes B-0835 Bug 1 — login banner showed `control-plane login:` instead of unique `node-<6hex>`. Same bug class as Bug 3b (build-time-eval vs install-time-write path mismatch).

Root cause

`injected-hostname.nix` reads `/etc/zeta/cluster-node-id` via `builtins.pathExists` + `builtins.readFile` at NixOS evaluation time. During `nixos-install` from live ISO:

  • `zeta-install.sh` Step 6.6 writes `/mnt/etc/zeta/cluster-node-id` ✓
  • Flake eval reads `/etc/zeta/cluster-node-id` (LIVE ISO context; absent)
  • Module falls through to flake's hardcoded `networking.hostName`
  • Operator gets flake-default hostname (`control-plane`) instead of unique `node-<6hex>`
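The mismatch above can be reproduced in a sandbox without Nix. This is a minimal sketch, not the module's code: `LIVE_ROOT` and `TARGET_ROOT` are hypothetical stand-ins for `/` on the live ISO and `/mnt`, `resolve_hostname` is an illustrative shell analogue of the `builtins.pathExists` + `builtins.readFile` fallback, and `node-a1b2c3` is a made-up example ID.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for "/" (live ISO) and "/mnt" (install target).
LIVE_ROOT="$(mktemp -d)"
TARGET_ROOT="$(mktemp -d)"

# Install-time write: the installer puts the node ID under the target root.
mkdir -p "$TARGET_ROOT/etc/zeta"
printf 'node-a1b2c3\n' > "$TARGET_ROOT/etc/zeta/cluster-node-id"

# Eval-time read: use the file if it exists under the given root,
# otherwise fall back to the flake's default hostname.
resolve_hostname() {
  local id_file="$1/etc/zeta/cluster-node-id"
  if [ -f "$id_file" ]; then
    cat "$id_file"
  else
    echo "control-plane"
  fi
}

resolve_hostname "$LIVE_ROOT"     # prints control-plane (the bug)
resolve_hostname "$TARGET_ROOT"   # prints node-a1b2c3 (what the operator expected)
```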

Fix

Different from Bug 3b's activation-script approach because hostname CANNOT cleanly change at activation (many services bake hostname at build time).

  1. Create a symlink at `/etc/zeta/cluster-node-id` (live ISO) pointing to `/mnt/etc/zeta/cluster-node-id` (install target) BEFORE `nixos-install` runs
  2. Add `--impure` flag so flake evaluation (pure by default) can run `builtins.pathExists` + `builtins.readFile` on the non-store path
  3. Clean up the symlink AFTER `nixos-install` (no dangling reference if /mnt is unmounted before reboot)

Subsequent rebuilds on installed system work without symlink (file IS on installed root fs after install).
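The three steps can be sketched as follows. This is an illustration under stated assumptions, not the code in `zeta-install.sh`: `LIVE_ROOT`, `TARGET_ROOT`, `INSTALL_CMD`, and `FLAKE_REF` are hypothetical names, the roots default to temp directories so the sketch is safe to run anywhere, and the install command defaults to a dry-run `echo`.

```shell
#!/usr/bin/env bash
# Hypothetical knobs; on a real live ISO these would be "/" and "/mnt".
LIVE_ROOT="${LIVE_ROOT:-$(mktemp -d)}"
TARGET_ROOT="${TARGET_ROOT:-$(mktemp -d)}"
# Dry-run by default: prints the command instead of installing anything.
INSTALL_CMD="${INSTALL_CMD:-echo nixos-install}"
FLAKE_REF="${FLAKE_REF:-/path/to/flake#host}"   # placeholder, not a real flake

src="$TARGET_ROOT/etc/zeta/cluster-node-id"
dst="$LIVE_ROOT/etc/zeta/cluster-node-id"

# Stand-in for the installer step that writes the node ID (example value).
mkdir -p "$(dirname "$src")"
[ -f "$src" ] || printf 'node-a1b2c3\n' > "$src"

# Step 1: expose the target's file at the path the flake evaluates.
created_link=0
if [ -f "$src" ] && [ ! -e "$dst" ] && [ ! -L "$dst" ]; then
  mkdir -p "$(dirname "$dst")"
  ln -s "$src" "$dst"
  created_link=1
fi
seen_during_eval="$(cat "$dst")"   # what an eval-time read would now see

# Step 2: run the install with impure evaluation enabled.
$INSTALL_CMD --impure --flake "$FLAKE_REF" --root "$TARGET_ROOT"
install_status=$?

# Step 3: remove the symlink, but only if this script created it.
if [ "$created_link" -eq 1 ]; then
  rm -f "$dst"
fi
```

Note that this straight-line version only cleans up if execution reaches Step 3; a failure or interrupt during Step 2 in a script running under `set -e` would skip it.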

Safety

  • Only impure read is operator-chosen hostname (not a secret)
  • Other modules (initial-password.nix per Bug 3b fix) use activation-scripts so don't need --impure
  • Symlink-then-cleanup is idempotent + reversible

Test plan

  • Bash syntax OK (`bash -n` passes)
  • Idempotent (only symlinks if /etc/zeta/cluster-node-id doesn't already exist)
  • Reversible (cleanup removes symlink only if we created it)
  • No HARD LIMITS violated (no secrets in symlink target)

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 26, 2026 23:29
@AceHack AceHack enabled auto-merge (squash) May 26, 2026 23:29
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copilot AI left a comment

Pull request overview

Fixes B-0835 Bug 1 where the live-ISO `nixos-install --flake ...` evaluation couldn't see the install-target's generated `/mnt/etc/zeta/cluster-node-id`, causing the system to fall back to the flake default `networking.hostName` (e.g., `control-plane`) instead of the per-node `node-<6hex>`.

Changes:

  • Pre-stages a live-ISO `/etc/zeta/cluster-node-id` symlink pointing at `/mnt/etc/zeta/cluster-node-id` before running `nixos-install`.
  • Adds `--impure` to `nixos-install` so Nix evaluation can read the absolute `/etc/zeta/cluster-node-id` path via `builtins.pathExists`/`builtins.readFile`.
  • Removes the created symlink after `nixos-install` completes.

Comment thread full-ai-cluster/usb-nixos-installer/zeta-install.sh Outdated
Comment thread full-ai-cluster/usb-nixos-installer/zeta-install.sh Outdated
Comment thread full-ai-cluster/usb-nixos-installer/zeta-install.sh Outdated
Comment thread full-ai-cluster/usb-nixos-installer/zeta-install.sh Outdated
Lior and others added 2 commits May 26, 2026 19:33
…zeta + --impure so flake eval reads cluster-node-id

ROOT CAUSE: same bug class as Bug 3b (password). injected-hostname.nix
reads /etc/zeta/cluster-node-id via builtins.pathExists +
builtins.readFile at NixOS evaluation time (flake build-time).

During nixos-install from live ISO:
- zeta-install.sh Step 6.6 writes /mnt/etc/zeta/cluster-node-id ✓
- Flake eval reads /etc/zeta/cluster-node-id (LIVE ISO context; absent)
- Module falls through to flake's hardcoded networking.hostName
- Operator gets flake-default hostname (e.g., "control-plane") instead
  of unique node-<6hex> that iter-5.2.2 generated

FIX (different from Bug 3b's activation-script approach because
hostname CANNOT cleanly change at activation — many services bake
hostname at build time):

1. Symlink /etc/zeta/cluster-node-id (live ISO) → /mnt/etc/zeta/cluster-node-id
   (install target) BEFORE nixos-install runs. Makes the file visible at
   the path injected-hostname.nix expects during flake eval phase.
2. Add --impure flag to nixos-install so flake evaluation (pure by
   default) can run builtins.pathExists + builtins.readFile on the
   non-store path.
3. Cleanup the symlink AFTER nixos-install (no dangling reference
   if /mnt is unmounted before reboot).

Subsequent rebuilds on the installed system work without the symlink
because /etc/zeta/cluster-node-id IS on the installed root filesystem
(written by the install).

Empirical anchor: operator 2026-05-26 physical hardware-support test
showed "control-plane login:" instead of unique node-<6hex>.

Safety:
- Only impure read is operator-chosen hostname (not a secret)
- Other modules (initial-password.nix per Bug 3b fix) use
  activation-scripts so they don't need --impure
- Symlink-then-cleanup is idempotent + reversible

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d files (cluster-node-id + operator-authorized-keys) + log/comment accuracy

4 legitimate Copilot findings on prior #5354 commit, all real:

1. **Trap-based cleanup**: prior cleanup only fired on success path. If
   nixos-install fails OR Ctrl-C, /etc/zeta/cluster-node-id symlink
   would persist + dangle when /mnt is unmounted. FIX: trap EXIT
   handler runs cleanup on ALL exit paths (success/failure/signal).
   Defense-in-depth via explicit cleanup at end too.

2. **Misleading log message**: prior "symlinking $X → /etc/zeta/..."
   printed even when no symlink was actually created. FIX: move log
   inside the maybe_symlink helper so it only prints on actual creation.

3. **Comment vs code mismatch**: prior comment said "Symlinking
   /mnt/etc/zeta → /etc/zeta" (directory-level) but code only handled
   the single cluster-node-id file. FIX: rewrote comment to match
   per-file approach + named all affected modules.

4. **Safety note wrong about operator-authorized-keys.nix**: prior
   note claimed "initial-password.nix doesn't use builtins.readFile"
   but didn't acknowledge that operator-authorized-keys.nix DOES use
   builtins.readFile on /etc/zeta/operator-authorized-keys. That
   module ALSO needs the symlink during install; without it, the
   iter-5.4.0 pubkeys are silently lost. FIX: extended the symlink
   approach to operator-authorized-keys too + updated safety note
   to correctly distinguish all 3 modules:
   - injected-hostname.nix     → symlinked (Bug 1 fix)
   - operator-authorized-keys.nix → symlinked (sibling-bug-class)
   - initial-password.nix       → activation-script (Bug 3b fix)

Helper function maybe_symlink() centralizes the "create only if file
exists AND destination doesn't" logic; trap handler removes only
files this script created.
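A minimal sketch of the helper-plus-trap shape described above, assuming hypothetical `LIVE_ROOT`/`TARGET_ROOT` sandbox directories and a hypothetical `run_install_step` wrapper. The body of `maybe_symlink` here is illustrative, not the function as merged.

```shell
#!/usr/bin/env bash
# Sandbox roots (hypothetical stand-ins for "/" and "/mnt").
LIVE_ROOT="$(mktemp -d)"
TARGET_ROOT="$(mktemp -d)"
mkdir -p "$TARGET_ROOT/etc/zeta"
printf 'node-a1b2c3\n' > "$TARGET_ROOT/etc/zeta/cluster-node-id"   # example ID
# operator-authorized-keys is deliberately absent to show the skip path.

# Runs in a subshell, so the EXIT trap fires when the step ends on any path.
run_install_step() (
  created=()

  cleanup() {
    local link
    for link in "${created[@]}"; do
      # Only entries recorded by maybe_symlink are ever removed.
      [ -L "$link" ] && rm -f "$link"
    done
    return 0
  }
  trap cleanup EXIT

  maybe_symlink() {
    local src="$TARGET_ROOT/etc/zeta/$1" dst="$LIVE_ROOT/etc/zeta/$1"
    [ -f "$src" ] || return 0                       # nothing to expose
    if [ -e "$dst" ] || [ -L "$dst" ]; then return 0; fi   # never clobber
    mkdir -p "$(dirname "$dst")"
    ln -s "$src" "$dst"
    created+=("$dst")
    echo "symlinked $dst -> $src"                   # logged only on creation
  }

  maybe_symlink cluster-node-id
  maybe_symlink operator-authorized-keys

  "$@"            # the install command; a failure here still hits the trap
  status=$?
  exit "$status"
)

# Success path: the command sees the file through the symlink.
run_install_step cat "$LIVE_ROOT/etc/zeta/cluster-node-id"
# Failure path: the symlink is still removed.
run_install_step false || echo "install failed; symlink still cleaned up"
```

Recording created links in an array is one way to guarantee the trap never deletes a file that existed before the script ran.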

Composes with PR #5351 (Bug 3b activation-script) — together the two
PRs fix all 3 instances of the build-time-eval-vs-install-time-write
path-mismatch bug class.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@AceHack AceHack force-pushed the otto/b-0835-bug-1-hostname-injected-path-symlink-impure-fix-2026-05-26 branch from 8357657 to 044c34c Compare May 26, 2026 23:34
@AceHack AceHack merged commit 778225c into main May 26, 2026
29 checks passed
@AceHack AceHack deleted the otto/b-0835-bug-1-hostname-injected-path-symlink-impure-fix-2026-05-26 branch May 26, 2026 23:38
AceHack added a commit that referenced this pull request May 27, 2026
…ntinels (#5365)

* ci(B-0831 layer-1): extend audit-installer-substrate with iter-5.4 sentinels (gh auth setup-git, ssh-key stderr-capture, self-reg flow, ClusterNode YAML schema, MAC parsing)

Layer 1 of a 4-layer CI testing approach for the iter-5.4 substrate
(B-0812 self-registration + B-0813 cluster reconciliation + B-0835 bug
fixes Bug 2a + 2b on #5364):

  Layer 1 (THIS PR) — source-level sentinel audit (cheap; catches regression)
  Layer 2 (next PR) — behavioral test with mock gh shim on PATH
  Layer 3 (B-0833 Approach A) — mock GH device-code endpoint
  Layer 4 (B-0831 cascade #6) — QEMU full-install + cluster auto-join

This layer extends the existing REQUIRED_SENTINELS for
full-ai-cluster/usb-nixos-installer/zeta-install.sh with 16 new
substrings, organized into 5 groups:

(a) iter-5.4 flow anchors (5 sentinels):
  - "Step 6.8: iter-5.4.0 homelab gh-auth + operator pubkey copy"
  - "Step 6.9: iter-5.4.1 self-registration commit+push"
  - "gh auth login"
  - "gh ssh-key list"
  - "gh repo clone Lucent-Financial-Group/Zeta"

(b) Bug 2a + 2b fix-regression catches (3 sentinels):
  - "gh auth setup-git"     — Bug 2a fix; presence catches removal
  - "SSH_KEY_ERR_FILE"      — Bug 2b fix; presence catches stderr-capture removal
  - "admin:public_key"      — Bug 2b fix; presence catches scope-recovery message removal

(c) ClusterNode YAML schema sentinels (5 sentinels — catches the Copilot
findings on #5352 where spec.role was scalar, spec.maintainer was at
wrong path, spec.storage was a sibling instead of under hardware block):
  - "apiVersion: zeta.lucent-financial-group.com/v1"
  - "kind: ClusterNode"
  - "  roles:"             — spec.roles is ARRAY per B-0813
  - "  registration:"      — spec.registration block per B-0813
  - "  hardware:"          — spec.hardware block per B-0813

(d) Hardware-probe sentinels (catches MAC parsing regression from #5352):
  - "/proc/cpuinfo"   — CPU_MODEL extraction
  - "link/ether"      — MAC parses field after link/ether (not before)

(e) Self-reg branch-shape sentinel:
  - "register-${NODE_HOSTNAME}-" — iter-5.4.1 branch name pattern

Composes with:
  - PR #5364 (Bug 2a + 2b fixes that this audit will catch if regressed)
  - PR #5352 (iter-5.4.1 Copilot findings that this audit will catch)
  - PR #5354 (Bug 1 hostname symlink fix — already covered by existing sentinels)
  - B-0831 (cascade #6 full-install QEMU test; this is layer 1 of that work)
  - B-0833 (interactive-login vs baked-in-keys tension; layer 3 of the cascade)

Why source-level + cheap-first:
  - Workflow build-ai-cluster-iso.yml runs `bun tools/ci/audit-installer-substrate.ts`
    on every PR touching the installer surface
  - Source-level catches substrate-regression at PR-author-time (seconds)
  - vs Layer 4 QEMU full-install (~minutes; expensive; flaky)
  - Layer 1 is the inner loop; Layers 2-4 are the outer loops
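The source-level check amounts to fixed-string matching against the installer script. The actual audit is the TypeScript file `tools/ci/audit-installer-substrate.ts` run with bun; the snippet below is only a shell analogue of the idea, with a hypothetical `audit_sentinels` function and a throwaway demo file instead of the real installer.

```shell
#!/usr/bin/env bash
# audit_sentinels FILE SENTINEL... : succeed only if every fixed string
# appears somewhere in FILE; report each missing one on stderr.
audit_sentinels() {
  local file="$1"; shift
  local missing=0 sentinel
  for sentinel in "$@"; do
    if ! grep -qF -- "$sentinel" "$file"; then
      echo "MISSING sentinel: $sentinel" >&2
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Throwaway stand-in for the installer script (not its real contents).
demo_script="$(mktemp)"
cat > "$demo_script" <<'EOF'
gh auth login
gh auth setup-git
ip -o link | grep link/ether
EOF

audit_sentinels "$demo_script" "gh auth setup-git" "link/ether" && echo "PASS"
audit_sentinels "$demo_script" "SSH_KEY_ERR_FILE" || echo "FAIL: sentinel absent"
```

A check like this only proves the substring is present, not that the surrounding logic works, which is why the behavioral layers sit on top of it.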

Per `.claude/rules/verify-existing-substrate-before-authoring.md`:
substrate-inventory pass found `tools/ci/audit-installer-substrate.ts`
already has the REQUIRED_SENTINELS pattern for iter-4.2 + iter-5.1 +
iter-5.2 + iter-5.2.2; this PR extends with iter-5.4 sentinels rather
than minting parallel substrate.

Verified: `bun tools/ci/audit-installer-substrate.ts` exits 0
("PASS — 10 required files + 5 sentinel-file assertions OK") with the
extended sentinel list against the current installer script at
origin/main HEAD (commit 19d9617 from #5364).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* fix(#5365 Copilot): reflow iter-5.4.1 YAML schema sentinels comment so parenthesis closes on first line

Copilot finding on the audit-installer-substrate.ts iter-5.4 sentinel
addition: the comment 'iter-5.4.1 YAML schema sentinels (catches the
Copilot findings from #5352' opened a parenthesis on line 98 that
didn't close until line 100 ('block)'). To a code-reader scanning
line 98, the sentence reads as unfinished.

Fix: restructure as 'sentinels. Each catches a specific Copilot
finding on PR #5352: ...' — no multi-line parenthesis; each
schema-correction is a complete clause.

---------

Co-authored-by: Lior <lior@zeta.dev>