feat(node-register): node-e5a176 self-registers via iter-5.4.1 #5380

Closed
AceHack wants to merge 1 commit into main from register-node-e5a176-20260527T020608Z

AceHack commented May 27, 2026

Self-registration PR opened by zeta-install.sh on the node during install. Composes with B-0812 iter-5.4.1 + B-0813 iter-5.4.2 ArgoCD reconciliation. Review + merge to bring the node into the cluster.

Auto-generated by zeta-install.sh Step 6.9 on the node during install.
Registers node-e5a176 under maintainers/AceHack/cluster-nodes/.
ArgoCD watches maintainers/*/cluster-nodes/** + reconciles per B-0813.

flake-host: control-plane
flake-commit: dc133b4
registered-at: 2026-05-27T02:06:08Z
Copilot AI review requested due to automatic review settings May 27, 2026 02:06

Copilot AI left a comment

Pull request overview

Adds a new ClusterNode custom resource manifest to register node-e5a176 under the maintainers/AceHack/cluster-nodes/ GitOps inventory, so ArgoCD reconciliation (iter-5.4.x flow) can pick it up and bring the node into the cluster.

Changes:

  • Introduces maintainers/AceHack/cluster-nodes/node-e5a176/node.yaml defining the ClusterNode resource (metadata, registration info, roles, and probed hardware summary).

Comment thread maintainers/AceHack/cluster-nodes/node-e5a176/node.yaml
AceHack added a commit that referenced this pull request May 27, 2026
…ick shard via isolated worktree (#5381)

Fresh autonomous-loop cold-boot:
- `CronList` empty (catch-43 fired) → sentinel `271e3030` re-armed as first action
- Root checkout on operator's primary `main` with 30+ untracked peer-WIP (PR-discussion files + decompose-4847-* dirs)
- Substrate written via isolated worktree off `origin/main` HEAD `46ac81c4a` per `zeta-expected-branch.md` race-window-caveat + agent-worktree-hygiene "never hold main" floor
- Tier: Normal (GraphQL 4791/5000); dotgit recovered (3 stuck procs); peer Otto-CLI active (PR #5380 ~2 min ago)
- First shard for 2026-05-27 UTC-day; ~4h gap since 2026-05-26T22:08Z (documented session-exit-non-persistence cadence)

Per `.claude/rules/holding-without-named-dependency-is-standing-by-failure.md` condition #3 — concrete bounded artifact in own lane.

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>

AceHack commented May 27, 2026

Structural fix for the /dev/sda 0B finding is shipping in #5385 (fix(B-0835 Bug 4+5)) — adds $2 != "0B" filter to the storage probe at zeta-install.sh:781 so future registrations exclude zero-size placeholder devices (SD card readers, empty optical bays, etc.).

Three resolution options for THIS PR's /dev/sda 0B entry:

  1. Merge as-is: a one-time data error, documented and fixed forward via #5385; future installs won't repeat it.
  2. Manually edit node.yaml to remove the /dev/sda 0B line, then push to this branch.
  3. Wait for #5385 to merge, then re-run tools/cluster/register-node.ts on node-e5a176 (or re-run the installer's iter-5.4.1 path) to regenerate a clean manifest, overwriting this PR with the current state.

The thread tracks the structural bug; #5385 closes it.

(actor: Otto-CLI via borrowed gh auth per the B-0847 attribution-gap discipline)

AceHack added a commit that referenced this pull request May 27, 2026
…al anchors): zeta-install.sh storage probe filters 0B devices + common.nix adds gh CLI to systemPackages (#5385)

Two empirical anchors from Aaron's iter-5.4 install of `node-e5a176`
(PR #5380 self-registered) where install completed but operator hit
two distinct gaps on first login:

Bug 4 — `/dev/sda 0B` zero-size storage device in node.yaml
================================================================

The storage probe in zeta-install.sh (line 781) emitted EVERY block
device from lsblk, including 0-byte placeholder devices (empty SD
card readers, empty optical bays, removable-media readers without
media). Aaron's Intel Core Ultra 9 185H node has /dev/sda 0B
(likely the laptop's empty SD card reader) which got registered as
"storage" — Copilot P1 finding on PR #5380.

Fix: add `$2 != "0B"` filter to the awk pipeline so zero-size
placeholders are excluded from the spec.hardware.storage list.

  -    STORAGE_LINES=$(lsblk -ndo NAME,SIZE,TYPE -e7 2>/dev/null |
  -      awk '$3=="disk"{print "..."}' || echo "")
  +    STORAGE_LINES=$(lsblk -ndo NAME,SIZE,TYPE -e7 2>/dev/null |
  +      awk '$3=="disk" && $2!="0B"{print "..."}' || echo "")

This prevents reconcilers reading spec.hardware.storage from treating
0-byte devices as usable storage targets.
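
A minimal sketch of how the added condition behaves, run against canned
rows instead of a live `lsblk` probe. The function name `filter_storage`
and the sample device rows are hypothetical. The real probe's print
format is elided ("...") in the diff above, so this sketch simply prints
the name and size columns.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from zeta-install.sh): keeps only rows whose
# TYPE column is "disk" and whose SIZE column is not "0B".
# Input columns mirror `lsblk -ndo NAME,SIZE,TYPE`: NAME SIZE TYPE.
filter_storage() {
  awk '$3 == "disk" && $2 != "0B" { print $1, $2 }'
}

# Canned sample rows (assumed for illustration, not read from the node).
sample_rows='sda 0B disk
nvme0n1 953.9G disk
sr0 1024M rom'

# Prints: nvme0n1 953.9G
printf '%s\n' "$sample_rows" | filter_storage
```

The comparison is a plain string match on the SIZE column, so it depends
on `lsblk` rendering an empty device as exactly `0B`, as it did in the
registration above.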

Bug 5 — gh CLI not in installed system's PATH after reboot
================================================================

Operator framing: "when i log in gh command is not found"

The installer ISO had gh in PATH (used by iter-5.4.0 for `gh auth
login` during Step 6.8) but common.nix systemPackages did not
include gh, so post-reboot the auth tokens stored in ~/.config/gh
are useless without the binary. The gap surfaced empirically on
Aaron's first login to the freshly-installed node-e5a176.

Fix: add `gh` to common.nix environment.systemPackages so the
installed system has it for ongoing operator workflows (re-auth,
ssh-key sync, future register/deregister-node tooling, kubectl
helpers that wrap gh, etc.).

Composes with: B-0813 (cluster-node schema), B-0817 (register-node
tool), iter-5.4 install cascade.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 27, 2026
…reports K8s cluster status — operator interactive-login pattern; first concrete instance of B-0847 AI-on-cluster substrate (Aaron 2026-05-26) (#5386)

Operator framing (verbatim):

  > "oh shit is that pr fully automatic?  can we make an claude agent
  > get installed and do what you do on there but it's main goal is just
  > to get it to steward the registerain pr for now and then after it's
  > checked in report on the status of the k8s cluster, i can
  > interactive login like gh if that works."

Direct response to PR #5380 (Aaron's `node-e5a176` self-registration)
being auto-merge-armed but blocked on 1 Copilot thread — Aaron's
recognition that the bounded PR-stewardship work Otto-CLI does on his
Mac can be done by a node-local Claude agent on the cluster itself.

Two-phase scope:

- Phase 1: steward the node's own registration PR (poll → diagnose
  threads → fix Copilot findings → rebase → resolve threads →
  auto-merge fires)
- Phase 2: after registration merged + cluster running, report on K8s
  cluster status (kubectl get nodes/applications/pods/events; synthesize
  per-tick health report to operator-visible surface)

Auth model mirrors gh: operator-interactive `claude login` via device
flow (parallel to iter-5.4.0 `gh auth login`); token stored in
~/.config/claude/; per-AI identity migration composes with B-0847
when that ratifies.

Bounded scope explicit: read-only K8s queries + scoped GitHub PR
actions on own-registration only; NOT arbitrary cluster mutations
(no kubectl apply/delete/drain). Operator stays in loop for
irreversible actions per NCI HC-8 + the autonomous-loop discipline
this conversation already established.

5-phase landing:

- Phase 0 (this row): substrate landing
- Phase 1: manual install + operator interactive login + PR-stewardship
  validation on node-e5a176
- Phase 2: K8s health reporter scope expansion
- Phase 3: NixOS module + multi-node composability
- Phase 4: per-AI GitHub identity migration (composes B-0847)
- Phase 5: cluster-wide coordination (composes B-0796 Twilio sibling)

Composes with: B-0847 (per-AI GitHub identity; this row IS first
concrete instance) · B-0794 (iter-5.4.0 interactive-login pattern) ·
B-0795/B-0812/B-0813 (the registration substrate this agent stewards) ·
B-0796 (Twilio voice-interface sibling at cluster-AI-support scope) ·
B-0628 (Knights Guild ratification) · B-0751 (per-agent isolated
clones) · B-0835 Bug 5 (gh in systemPackages; claude-code is parallel
addition).

Per the .claude/rules/algo-wink-failure-mode.md + the algo-wink-
attribution memory entry: node-local Claude inherits the substrate-
honest attribution discipline (token-owner ≠ actor; cross-reference
Co-Authored-By trailer).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 27, 2026
… + NetBIOS (nmbd) + DHCP-hostname; reliability for 'i can't ping it by name' (Aaron 2026-05-27) (#5387)

* fix(B-0835 Bug 6+7 — Aaron 2026-05-27 name-resolution reliability ask): multi-protocol name resolution — Avahi hardening + NetBIOS via Samba's nmbd + DHCP-hostname registration; belt-and-suspenders for `i can't ping it by name`

Operator framing (verbatim):

  > "my mac is ethernet connected and i connected to the same wifi as
  > it but i still can't ping could it be something else or can we
  > make hostname more reliable?  maybe a netbios or something?  i
  > like ashai or whatever it is but can we make it reliable?  i
  > think this is looking very good."

Aaron observed that mDNS was unreliable even with the operator's Mac
on both ethernet and the same WiFi as node-e5a176. Diagnostics from the
Mac: ping by IP works and SSH works, but `dscacheutil -q host -a name
node-e5a176.local` returns nothing, and a unicast mDNS query to
192.168.4.128:5353 timed out (no response at all, not just
connection-attempt noise).

Multi-protocol additive approach (preserve operator's preferred
Avahi/Bonjour AND add fallback mechanisms with different failure
modes):

Bug 6 — Avahi hardening
========================

Adds:
  - nssmdns6 = true (IPv6 nss-mdns; some macOS configs prefer AAAA)
  - ipv4 + ipv6 explicit (vs defaults that might bind one or other)
  - reflector = true (forward mDNS across subnets — composes with
    multi-segment LAN setups)
  - publish.hinfo + publish.userServices (additional discoverability)

Bug 7 — NetBIOS via Samba's nmbd (additive belt-and-suspenders)
================================================================

NetBIOS uses UDP broadcast on port 137 (vs mDNS multicast on 5353)
— different failure modes. If network drops IGMP/multicast but
allows broadcast, `node-e5a176` resolves via NetBIOS where
`node-e5a176.local` fails via mDNS.

Operator usage (any LAN host):
  nmblookup node-e5a176         # Linux/macOS NetBIOS lookup
  smbutil lookup node-e5a176    # macOS native NetBIOS
  ping node-e5a176              # only if the host resolver consults WINS/NetBIOS (e.g. `wins` in nsswitch.conf on Linux)

Samba is enabled for NetBIOS name-advertisement ONLY (no shares
declared = no SMB file-share exposure). The config sets
"disable netbios = no", workgroup ZETA, and a per-host netbios-name
taken from config.networking.hostName, so the advertised name matches
the per-node identity from injected-hostname.nix.

DHCP-hostname registration (3rd reliability layer)
===================================================

NetworkManager already advertises hostname via DHCP option 12 by
default. Many home routers (Asus/Netgear/Eero/etc) register DHCP
client hostnames as DNS names like `node-e5a176.lan` — no NixOS
config change needed beyond the existing networking.networkmanager.

Operator now has 3 ways to find node-e5a176:
  1. `node-e5a176.local`              (mDNS — preferred, may flake)
  2. `node-e5a176` / `nmblookup ...`  (NetBIOS — different protocol)
  3. `node-e5a176.lan` (or .home)     (router DHCP — works for most home routers)

Plus the always-reliable:
  4. IP address (192.168.4.128 in Aaron's case; via arp -a)
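
The fallback order above could be scripted on the operator side roughly
as follows. This is a sketch under stated assumptions: `first_resolvable`
and `fake_resolver` are hypothetical names, and the resolver is passed
in as a command so the example runs without a network. On a real host
it would be replaced by whatever lookup command that platform offers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the first candidate name that the given
# resolver command accepts, trying candidates in the order supplied.
# Returns non-zero if no candidate resolves.
first_resolvable() {
  _resolver=$1
  shift
  for _name in "$@"; do
    if "$_resolver" "$_name" >/dev/null 2>&1; then
      printf '%s\n' "$_name"
      return 0
    fi
  done
  return 1
}

# Stand-in resolver for the demo: pretends only the router-DHCP name resolves.
fake_resolver() { [ "$1" = "node-e5a176.lan" ]; }

# Prints: node-e5a176.lan
first_resolvable fake_resolver node-e5a176.local node-e5a176 node-e5a176.lan
```

If none of the names resolve, the function returns non-zero and the
operator falls back to the IP address.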

Composes with: B-0792 (injected-hostname); iter-5.4.1 self-
registration (PR #5380 has the MAC + hostname; operator can
correlate); B-0848 (node-local Claude needs reliable name
resolution to act on cluster).

Diagnostic surface preserved at operator side: ssh in + run
`systemctl status avahi-daemon nmbd` + `journalctl -u
avahi-daemon -u nmbd --since "1 hour ago"` to see why a
specific mechanism failed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(PR-5387 Copilot 3 findings — P0+P1 security + P2 name-attribution): NetBIOS-only Samba via smbd.enable=false + explicit allowedUDPPorts; replace 'Aaron' with 'operator'/'maintainer' per .github/copilot-instructions.md

3 substantive findings, all real:

P0 — services.samba.openFirewall=true contradicted the "name
resolution only" claim by opening 139/tcp + 445/tcp (SMB ports).
Fix: openFirewall=false + explicit networking.firewall.allowedUDPPorts
= [ 137 138 ] (NetBIOS-NS + NetBIOS-DGM only).

P1 — comment claimed "disables SMB file-sharing entirely" but the
config kept smbd active via `smb ports = "445"`. Fix: actually
disable smbd via services.samba.smbd.enable = false; keep
services.samba.nmbd.enable = true. Now ONLY nmbd runs — zero SMB
attack surface, comment matches reality.

P2 — comments contained personal name attribution ("Aaron ...") which
violates .github/copilot-instructions.md "No name attribution in code,
docs, or skills". Fix: replaced with "operator" / "maintainer" /
"control-plane physical-hardware-support test" framings. Verbatim
quotes from operator already preserved at the backlog row + PR body
(history surfaces); code/module comments use role-refs only.

Substrate-honest about the security: PR #5387 as originally pushed
WOULD have opened SMB ports on cluster nodes despite the stated
goal. Reviewer caught it; the fix actually delivers the
"NetBIOS-name-resolution-only" promise.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>

AceHack commented May 27, 2026

Closing per operator 2026-05-27: 'we can close the one about device register because we are able to test registering again once claude is on there'. With iter-5.5.0 substrate merged (#5388 + iter-5.5.1 alignment fix-fwd #5389), the next node install (or this node re-running iter-5.4.1 self-registration) will produce a clean registration PR carrying: (a) Bug 4 fix — no /dev/sda 0B entries, (b) Bug 8 fix — gh+claude credentials persisted, (c) full mise-managed runtime substrate. Pre-emptively closing this one rather than merging the pre-fix data state. Composes with B-0848 Phase 1 (node-local Claude install) — once claude is on the node, registration becomes a steward-able PR per the substrate.

AceHack closed this May 27, 2026