feat(node-register): node-e5a176 self-registers via iter-5.4.1#5380
AceHack wants to merge 1 commit into
Auto-generated by zeta-install.sh Step 6.9 on the node during install. Registers node-e5a176 under maintainers/AceHack/cluster-nodes/. ArgoCD watches maintainers/*/cluster-nodes/** and reconciles per B-0813.

flake-host: control-plane
flake-commit: dc133b4
registered-at: 2026-05-27T02:06:08Z
Pull request overview
Adds a new ClusterNode custom resource manifest to register node-e5a176 under the maintainers/AceHack/cluster-nodes/ GitOps inventory, so ArgoCD reconciliation (iter-5.4.x flow) can pick it up and bring the node into the cluster.
Changes:
- Introduces `maintainers/AceHack/cluster-nodes/node-e5a176/node.yaml` defining the `ClusterNode` resource (metadata, registration info, roles, and probed hardware summary).
…ick shard via isolated worktree (#5381)

Fresh autonomous-loop cold-boot:
- `CronList` empty (catch-43 fired) → sentinel `271e3030` re-armed as first action
- Root checkout on operator's primary `main` with 30+ untracked peer-WIP (PR-discussion files + decompose-4847-* dirs)
- Substrate written via isolated worktree off `origin/main` HEAD `46ac81c4a` per `zeta-expected-branch.md` race-window-caveat + agent-worktree-hygiene "never hold main" floor
- Tier: Normal (GraphQL 4791/5000); dotgit recovered (3 stuck procs); peer Otto-CLI active (PR #5380 ~2 min ago)
- First shard for 2026-05-27 UTC-day; ~4h gap since 2026-05-26T22:08Z (documented session-exit-non-persistence cadence)

Per `.claude/rules/holding-without-named-dependency-is-standing-by-failure.md` condition #3: concrete bounded artifact in own lane.

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
Structural fix for the Three resolution options for THIS PR's
The thread tracks the structural bug; #5385 closes it. (actor: Otto-CLI via borrowed gh auth per the B-0847 attribution-gap discipline)
…al anchors): zeta-install.sh storage probe filters 0B devices + common.nix adds gh CLI to systemPackages (#5385)

Two empirical anchors from Aaron's iter-5.4 install of `node-e5a176` (PR #5380 self-registered), where the install completed but the operator hit two distinct gaps on first login:

Bug 4: `/dev/sda 0B` zero-size storage device in node.yaml
================================================================

The storage probe in zeta-install.sh (line 781) emitted EVERY block device from lsblk, including 0-byte placeholder devices (empty SD card readers, empty optical bays, removable-media readers without media). Aaron's Intel Core Ultra 9 185H node has /dev/sda 0B (likely the laptop's empty SD card reader), which got registered as "storage". This was the Copilot P1 finding on PR #5380.

Fix: add a `$2 != "0B"` filter to the awk pipeline so zero-size placeholders are excluded from the spec.hardware.storage list.

    - STORAGE_LINES=$(lsblk -ndo NAME,SIZE,TYPE -e7 2>/dev/null |
    -   awk '$3=="disk"{print "..."}' || echo "")
    + STORAGE_LINES=$(lsblk -ndo NAME,SIZE,TYPE -e7 2>/dev/null |
    +   awk '$3=="disk" && $2!="0B"{print "..."}' || echo "")

This prevents reconcilers reading spec.hardware.storage from treating 0-byte devices as usable storage targets.

Bug 5: gh CLI not in installed system's PATH after reboot
================================================================

Operator framing: "when i log in gh command is not found"

The installer ISO had gh in PATH (used by iter-5.4.0 for `gh auth login` during Step 6.8), but common.nix systemPackages did not include gh, so after reboot the auth tokens stored in ~/.config/gh are useless without the binary. The gap surfaced empirically on Aaron's first login to the freshly installed node-e5a176.

Fix: add `gh` to common.nix environment.systemPackages so the installed system has it for ongoing operator workflows (re-auth, ssh-key sync, future register/deregister-node tooling, kubectl helpers that wrap gh, etc.).

Composes with: B-0813 (cluster-node schema), B-0817 (register-node tool), iter-5.4 install cascade.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
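The Bug 4 filter can be tried in isolation with a minimal sketch like the one below. It assumes `lsblk -ndo NAME,SIZE,TYPE` prints whitespace-separated `NAME SIZE TYPE` rows; the function names and the sample rows other than `sda 0B` are hypothetical, and the real installer's `print` format is elided in the commit, so this sketch just prints name and size.

```shell
#!/usr/bin/env bash
# Keep rows whose TYPE is "disk" and whose SIZE is not the 0B placeholder.
# Same awk condition as the fix; the output format here is illustrative only.
filter_usable_disks() {
  awk '$3 == "disk" && $2 != "0B" { print $1, $2 }'
}

# Hypothetical stand-in for: lsblk -ndo NAME,SIZE,TYPE -e7
sample_lsblk() {
  cat <<'EOF'
sda 0B disk
nvme0n1 953.9G disk
sr0 1024M rom
EOF
}

# The empty card reader (sda 0B) and the non-disk row are dropped.
sample_lsblk | filter_usable_disks
```

Feeding canned rows through the filter makes it easy to check the condition without needing hardware that exposes a zero-size device.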
…reports K8s cluster status; operator interactive-login pattern; first concrete instance of B-0847 AI-on-cluster substrate (Aaron 2026-05-26) (#5386)

Operator framing (verbatim):

> "oh shit is that pr fully automatic? can we make an claude agent
> get installed and do what you do on there but it's main goal is just
> to get it to steward the registerain pr for now and then after it's
> checked in report on the status of the k8s cluster, i can
> interactive login like gh if that works."

Direct response to PR #5380 (Aaron's `node-e5a176` self-registration) being auto-merge-armed but blocked on 1 Copilot thread. Aaron recognized that the bounded PR-stewardship work Otto-CLI does on his Mac can be done by a node-local Claude agent on the cluster itself.

Two-phase scope:
- Phase 1: steward the node's own registration PR (poll → diagnose threads → fix Copilot findings → rebase → resolve threads → auto-merge fires)
- Phase 2: after registration merged + cluster running, report on K8s cluster status (kubectl get nodes/applications/pods/events; synthesize per-tick health report to operator-visible surface)

Auth model mirrors gh: operator-interactive `claude login` via device flow (parallel to iter-5.4.0 `gh auth login`); token stored in ~/.config/claude/; per-AI identity migration composes with B-0847 when that ratifies.

Bounded scope explicit: read-only K8s queries + scoped GitHub PR actions on own-registration only; NOT arbitrary cluster mutations (no kubectl apply/delete/drain). Operator stays in loop for irreversible actions per NCI HC-8 + the autonomous-loop discipline this conversation already established.

5-phase landing:
- Phase 0 (this row): substrate landing
- Phase 1: manual install + operator interactive login + PR-stewardship validation on node-e5a176
- Phase 2: K8s health reporter scope expansion
- Phase 3: NixOS module + multi-node composability
- Phase 4: per-AI GitHub identity migration (composes B-0847)
- Phase 5: cluster-wide coordination (composes B-0796 Twilio sibling)

Composes with: B-0847 (per-AI GitHub identity; this row IS first concrete instance) · B-0794 (iter-5.4.0 interactive-login pattern) · B-0795/B-0812/B-0813 (the registration substrate this agent stewards) · B-0796 (Twilio voice-interface sibling at cluster-AI-support scope) · B-0628 (Knights Guild ratification) · B-0751 (per-agent isolated clones) · B-0835 Bug 5 (gh in systemPackages; claude-code is parallel addition).

Per the .claude/rules/algo-wink-failure-mode.md + the algo-wink-attribution memory entry: node-local Claude inherits the substrate-honest attribution discipline (token-owner ≠ actor; cross-reference Co-Authored-By trailer).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
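The Phase 2 "read-only K8s queries" boundary could be enforced by a thin wrapper, sketched below. Both function names are hypothetical and not part of the backlog row; the verb allow-list is an assumption, and `applications` only resolves on clusters where the Argo CD CRDs are installed.

```shell
#!/usr/bin/env bash
# Hypothetical guard: forward only read-only kubectl verbs, refuse the rest
# (apply/delete/drain and anything else not on the allow-list).
kubectl_readonly() {
  case "$1" in
    get|describe|logs) kubectl "$@" ;;
    *) echo "refusing non-read-only verb: $1" >&2; return 1 ;;
  esac
}

# Print one section per resource kind named in the Phase 2 scope.
# A failed query is reported inline instead of aborting the whole report.
report_cluster_status() {
  local resource
  for resource in nodes applications pods events; do
    echo "== $resource =="
    kubectl_readonly get "$resource" --all-namespaces ||
      echo "(query failed for $resource)"
  done
}
```

Running `report_cluster_status` on the node would emit the four sections in order; a call such as `kubectl_readonly delete pod example` returns non-zero without ever reaching kubectl.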
… + NetBIOS (nmbd) + DHCP-hostname; reliability for 'i can't ping it by name' (Aaron 2026-05-27) (#5387)

* fix(B-0835 Bug 6+7, Aaron 2026-05-27 name-resolution reliability ask): multi-protocol name resolution: Avahi hardening + NetBIOS via Samba's nmbd + DHCP-hostname registration; belt-and-suspenders for `i can't ping it by name`

Operator framing (verbatim):

> "my mac is ethernet connected and i connected to the same wifi as
> it but i still can't ping could it be something else or can we
> make hostname more reliable? maybe a netbios or something? i
> like ashai or whatever it is but can we make it reliable? i
> think this is looking very good."

Aaron empirically observed mDNS to be unreliable even with the operator Mac on both ethernet AND the same WiFi as node-e5a176. Diagnostic from the Mac: ping by IP works, SSH works, but `dscacheutil -q host -a name node-e5a176.local` returns empty AND a unicast mDNS query to 192.168.4.128:5353 TIMED OUT (not just connection-attempt-noise; actual no-response).

Multi-protocol additive approach (preserve the operator's preferred Avahi/Bonjour AND add fallback mechanisms with different failure modes):

Bug 6: Avahi hardening
========================

Adds:
- nssmdns6 = true (IPv6 nss-mdns; some macOS configs prefer AAAA)
- ipv4 + ipv6 explicit (vs defaults that might bind one or other)
- reflector = true (forward mDNS across subnets; composes with multi-segment LAN setups)
- publish.hinfo + publish.userServices (additional discoverability)

Bug 7: NetBIOS via Samba's nmbd (additive belt-and-suspenders)
================================================================

NetBIOS uses UDP broadcast on port 137 (vs mDNS multicast on 5353), so its failure modes differ. If the network drops IGMP/multicast but allows broadcast, `node-e5a176` resolves via NetBIOS where `node-e5a176.local` fails via mDNS.

Operator usage (any LAN host):

    nmblookup node-e5a176       # Linux/macOS NetBIOS lookup
    smbutil lookup node-e5a176  # macOS native NetBIOS
    ping node-e5a176            # if nsswitch has wins (default macOS)

Samba is enabled for NetBIOS name-advertisement ONLY (no shares declared = no SMB file-share exposure). The "disable netbios = no" + workgroup ZETA + per-host netbios-name = config.networking.hostName config matches the per-node identity from injected-hostname.nix.

DHCP-hostname registration (3rd reliability layer)
===================================================

NetworkManager already advertises the hostname via DHCP option 12 by default. Many home routers (Asus/Netgear/Eero/etc) register DHCP client hostnames as DNS names like `node-e5a176.lan`, so no NixOS config change is needed beyond the existing networking.networkmanager.

Operator now has 3 ways to find node-e5a176:
1. `node-e5a176.local` (mDNS: preferred, may flake)
2. `node-e5a176` / `nmblookup ...` (NetBIOS: different protocol)
3. `node-e5a176.lan` (or .home) (router DHCP: works for most home routers)

Plus the always-reliable:
4. IP address (192.168.4.128 in Aaron's case; via arp -a)

Composes with: B-0792 (injected-hostname); iter-5.4.1 self-registration (PR #5380 has the MAC + hostname; operator can correlate); B-0848 (node-local Claude needs reliable name resolution to act on cluster).

Diagnostic surface preserved at operator side: ssh in + run `systemctl status avahi-daemon nmbd` + `journalctl -u avahi-daemon -u nmbd --since "1 hour ago"` to see why a specific mechanism failed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(PR-5387 Copilot 3 findings: P0+P1 security + P2 name-attribution): NetBIOS-only Samba via smbd.enable=false + explicit allowedUDPPorts; replace 'Aaron' with 'operator'/'maintainer' per .github/copilot-instructions.md

3 substantive findings, all real:

P0: services.samba.openFirewall=true contradicted the "name resolution only" claim by opening 139/tcp + 445/tcp (SMB ports). Fix: openFirewall=false + explicit networking.firewall.allowedUDPPorts = [ 137 138 ] (NetBIOS-NS + NetBIOS-DGM only).

P1: comment claimed "disables SMB file-sharing entirely" but the config kept smbd active via `smb ports = "445"`. Fix: actually disable smbd via services.samba.smbd.enable = false; keep services.samba.nmbd.enable = true. Now ONLY nmbd runs: zero SMB attack surface, comment matches reality.

P2: comments contained personal name attribution ("Aaron ...") which violates .github/copilot-instructions.md "No name attribution in code, docs, or skills". Fix: replaced with "operator" / "maintainer" / "control-plane physical-hardware-support test" framings. Verbatim quotes from operator already preserved at the backlog row + PR body (history surfaces); code/module comments use role-refs only.

Substrate-honest about the security: PR #5387 as originally pushed WOULD have opened SMB ports on cluster nodes despite the stated goal. Reviewer caught it; the fix actually delivers the "NetBIOS-name-resolution-only" promise.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
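The three lookup paths listed in the #5387 commit message (`.local` via mDNS, the bare name via NetBIOS, `.lan` via router DHCP) can be probed in order from an operator machine. The sketch below is an assumption-laden illustration, not part of the PR: the function names are hypothetical, the probe is a plain single `ping`, and whether the bare name actually resolves through NetBIOS depends on how the local resolver is configured.

```shell
#!/usr/bin/env bash
# Hypothetical probe: succeeds if the name resolves and answers one ping.
# Defined separately so it can be swapped out (for example in offline tests).
probe_name() {
  ping -c 1 "$1" >/dev/null 2>&1
}

# Try each candidate name in the order given in the commit message and
# print the first one that responds; return non-zero if none does.
find_node_name() {
  local host="$1" candidate
  for candidate in "$host.local" "$host" "$host.lan"; do
    if probe_name "$candidate"; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}
```

A call such as `find_node_name node-e5a176 || echo "fall back to the IP address"` mirrors the fourth, always-reliable option from the list.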
Closing per operator 2026-05-27: 'we can close the one about device register because we are able to test registering again once claude is on there'.

With iter-5.5.0 substrate merged (#5388 + iter-5.5.1 alignment fix-fwd #5389), the next node install (or this node re-running iter-5.4.1 self-registration) will produce a clean registration PR carrying: (a) Bug 4 fix: no /dev/sda 0B entries, (b) Bug 8 fix: gh+claude credentials persisted, (c) full mise-managed runtime substrate. Pre-emptively closing this one rather than merging the pre-fix data state.

Composes with B-0848 Phase 1 (node-local Claude install): once claude is on the node, registration becomes a steward-able PR per the substrate.
Self-registration PR opened by zeta-install.sh on the node during install. Composes with B-0812 iter-5.4.1 + B-0813 iter-5.4.2 ArgoCD reconciliation. Review + merge to bring the node into the cluster.