fix(B-0835 Bug 6+7): multi-protocol name resolution — Avahi hardening + NetBIOS (nmbd) + DHCP-hostname; reliability for 'i can't ping it by name' (Aaron 2026-05-27)#5387

Merged
AceHack merged 2 commits into main from fix-b0835-multi-protocol-name-resolution-netbios-avahi-hardening-2026-05-26-2305z on May 27, 2026

@AceHack AceHack commented May 27, 2026

Summary

Aaron 2026-05-27 (verbatim):

"my mac is ethernet connected and i connected to the same wifi as it but i still can't ping could it be something else or can we make hostname more reliable? maybe a netbios or something? i like ashai or whatever it is but can we make it reliable? i think this is looking very good."

Empirical: ping by IP works ✓, SSH works ✓, but Bonjour resolution times out AND unicast mDNS query to port 5353/udp times out (actual no-response, not connection-attempt noise). Avahi alone proved unreliable.

Multi-protocol additive approach

The operator's preferred Avahi/Bonjour path stays, and 2 fallback mechanisms are added, each using a different protocol with different failure modes:

Bug 6 — Avahi hardening

  • `nssmdns6 = true` (IPv6 nss-mdns alongside IPv4; some macOS configs prefer AAAA queries first)
  • `ipv4 + ipv6` explicit
  • `reflector = true` (forwards mDNS across subnets — composes with multi-segment LAN setups)
  • `publish.hinfo + publish.userServices` (additional discoverability)

Bug 7 — NetBIOS via Samba's nmbd (belt-and-suspenders)

NetBIOS uses UDP broadcast on port 137 (vs mDNS multicast on 5353) — different failure modes. If network drops IGMP/multicast but allows broadcast (common on home/SMB switches), `node-e5a176` resolves via NetBIOS where `node-e5a176.local` fails via mDNS.

Operator usage (any LAN host):
```bash
nmblookup node-e5a176 # Linux/macOS NetBIOS lookup
smbutil lookup node-e5a176 # macOS native NetBIOS
ping node-e5a176 # Linux: works only if the hosts line in /etc/nsswitch.conf includes wins
```

Samba is enabled for NetBIOS name-advertisement only (no shares declared = no SMB file-share exposure).

DHCP-hostname registration (3rd layer)

NetworkManager already advertises hostname via DHCP option 12 by default. Many home routers register DHCP client hostnames as DNS names (`node-e5a176.lan` from Asus/Netgear/Eero). No config change needed.

Operator now has 3 name-resolution mechanisms

| # | Lookup | Mechanism | Failure mode |
|---|--------|-----------|--------------|
| 1 | `node-e5a176.local` | mDNS multicast | IGMP filtering, multicast drop |
| 2 | `node-e5a176` (via nmblookup) | NetBIOS broadcast | Different protocol; works when mDNS fails |
| 3 | `node-e5a176.lan` | Router DHCP+DNS | Depends on router support |
| 4 | IP (192.168.4.128) | Always reliable | Need `arp -a` first if IP not memorized |
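
The name-based lookups listed above can be tried in order from any LAN host. The following is a minimal sketch, not code from this PR: the function names (`probe_mdns`, `probe_netbios`, `probe_router`, `first_working_mechanism`) are hypothetical, it assumes `ping` and Samba's `nmblookup` are installed on the client, it assumes `nmblookup` exits non-zero when the name query fails, and the `.lan` suffix is only what some routers use.

```bash
#!/usr/bin/env bash
# Sketch: report the first name-resolution mechanism that answers for a host.
# All function names here are illustrative, not taken from the PR.

# Each probe returns 0 if its mechanism resolved (and reached) the name.
probe_mdns()    { ping -c 1 "$1.local"; }   # mDNS multicast (Avahi/Bonjour)
probe_netbios() { nmblookup "$1"; }         # NetBIOS broadcast (answered by nmbd)
probe_router()  { ping -c 1 "$1.lan"; }     # router DHCP+DNS; suffix varies by router

first_working_mechanism() {
  local host="$1" mech
  for mech in mdns netbios router; do
    if "probe_${mech}" "$host" >/dev/null 2>&1; then
      printf '%s\n' "$mech"
      return 0
    fi
  done
  return 1   # nothing resolved; fall back to the IP address
}
```

For example, `first_working_mechanism node-e5a176 || echo "no name resolved; use the IP"` prints `mdns`, `netbios`, or `router` depending on which layer answers first on that LAN.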

Test plan

  • CI passes
  • Next ISO build picks up multi-protocol stack
  • On next install: validate all 3 mechanisms; document which work on operator's specific LAN

Composes with

B-0792 (injected-hostname) · iter-5.4.1 self-registration (PR #5380 carries MAC + hostname for correlation) · B-0848 (node-local Claude needs reliable name resolution)

🤖 Generated with Claude Code

…): multi-protocol name resolution — Avahi hardening + NetBIOS via Samba's nmbd + DHCP-hostname registration; belt-and-suspenders for `i can't ping it by name`

Operator framing (verbatim):

  > "my mac is ethernet connected and i connected to the same wifi as
  > it but i still can't ping could it be something else or can we
  > make hostname more reliable?  maybe a netbios or something?  i
  > like ashai or whatever it is but can we make it reliable?  i
  > think this is looking very good."

Aaron empirically observed mDNS unreliable even with operator Mac on
both ethernet AND same WiFi as node-e5a176. Diagnostic from Mac:
ping by IP works, SSH works, but `dscacheutil -q host -a name
node-e5a176.local` empty AND unicast mDNS query to 192.168.4.128:5353
TIMED OUT (not just connection-attempt-noise — actual no-response).

Multi-protocol additive approach (preserve operator's preferred
Avahi/Bonjour AND add fallback mechanisms with different failure
modes):

Bug 6 — Avahi hardening
========================

Adds:
  - nssmdns6 = true (IPv6 nss-mdns; some macOS configs prefer AAAA)
  - ipv4 + ipv6 explicit (vs defaults that might bind one or other)
  - reflector = true (forward mDNS across subnets — composes with
    multi-segment LAN setups)
  - publish.hinfo + publish.userServices (additional discoverability)

Bug 7 — NetBIOS via Samba's nmbd (additive belt-and-suspenders)
================================================================

NetBIOS uses UDP broadcast on port 137 (vs mDNS multicast on 5353)
— different failure modes. If network drops IGMP/multicast but
allows broadcast, `node-e5a176` resolves via NetBIOS where
`node-e5a176.local` fails via mDNS.

Operator usage (any LAN host):
  nmblookup node-e5a176         # Linux/macOS NetBIOS lookup
  smbutil lookup node-e5a176    # macOS native NetBIOS
  ping node-e5a176              # Linux: needs wins in nsswitch.conf hosts line

Samba is enabled for NetBIOS name-advertisement ONLY (no shares
declared = no SMB file-share exposure). The "disable netbios = no"
+ workgroup ZETA + per-host netbios-name = config.networking.hostName
config matches the per-node identity from injected-hostname.nix.

DHCP-hostname registration (3rd reliability layer)
===================================================

NetworkManager already advertises hostname via DHCP option 12 by
default. Many home routers (Asus/Netgear/Eero/etc) register DHCP
client hostnames as DNS names like `node-e5a176.lan` — no NixOS
config change needed beyond the existing networking.networkmanager.

Operator now has 3 ways to find node-e5a176:
  1. `node-e5a176.local`              (mDNS — preferred, may flake)
  2. `node-e5a176` / `nmblookup ...`  (NetBIOS — different protocol)
  3. `node-e5a176.lan` (or .home)     (router DHCP — works for most home routers)

Plus the always-reliable:
  4. IP address (192.168.4.128 in Aaron's case; via arp -a)

Composes with: B-0792 (injected-hostname); iter-5.4.1 self-
registration (PR #5380 has the MAC + hostname; operator can
correlate); B-0848 (node-local Claude needs reliable name
resolution to act on cluster).

Diagnostic surface preserved at operator side: ssh in + run
`systemctl status avahi-daemon nmbd` + `journalctl -u
avahi-daemon -u nmbd --since "1 hour ago"` to see why a
specific mechanism failed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 02:37
@AceHack AceHack enabled auto-merge (squash) May 27, 2026 02:37

Copilot AI left a comment

Pull request overview

This PR aims to make cluster-node hostname resolution more reliable on typical home/SMB LANs by keeping Avahi/Bonjour mDNS and adding additional fallback mechanisms (notably NetBIOS name advertisement via Samba).

Changes:

  • Harden Avahi configuration (IPv6 NSS, explicit v4/v6 enablement, reflector, additional publish records).
  • Enable Samba with NetBIOS-focused settings to support broadcast-based name lookup as an mDNS fallback.
  • Document DHCP hostname registration as an additional expected fallback layer.

3 comment threads on full-ai-cluster/nixos/modules/common.nix (all outdated)
…): NetBIOS-only Samba via smbd.enable=false + explicit allowedUDPPorts; replace 'Aaron' with 'operator'/'maintainer' per .github/copilot-instructions.md

3 substantive findings, all real:

P0 — services.samba.openFirewall=true contradicted the "name
resolution only" claim by opening 139/tcp + 445/tcp (SMB ports).
Fix: openFirewall=false + explicit networking.firewall.allowedUDPPorts
= [ 137 138 ] (NetBIOS-NS + NetBIOS-DGM only).

P1 — comment claimed "disables SMB file-sharing entirely" but the
config kept smbd active via `smb ports = "445"`. Fix: actually
disable smbd via services.samba.smbd.enable = false; keep
services.samba.nmbd.enable = true. Now ONLY nmbd runs — zero SMB
attack surface, comment matches reality.

P2 — comments contained personal name attribution ("Aaron ...") which
violates .github/copilot-instructions.md "No name attribution in code,
docs, or skills". Fix: replaced with "operator" / "maintainer" /
"control-plane physical-hardware-support test" framings. Verbatim
quotes from operator already preserved at the backlog row + PR body
(history surfaces); code/module comments use role-refs only.

To be candid about the security impact: PR #5387 as originally
pushed would have opened SMB ports on cluster nodes despite the
stated goal. The reviewer caught it; this fix actually delivers the
"NetBIOS-name-resolution-only" promise.
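
One way to confirm the NetBIOS-only outcome on a running node is to look for TCP listeners on the SMB ports. This is a minimal sketch, assuming Linux `ss` from iproute2; `smb_tcp_listeners` is a hypothetical helper and not part of this PR.

```bash
#!/usr/bin/env bash
# Sketch: flag TCP listeners on the SMB ports (139, 445) that smbd would open.
# Reads `ss -H -tln` style lines on stdin; column 4 is the local address:port.
smb_tcp_listeners() {
  awk '{
    n = split($4, parts, ":")        # the port is the piece after the last colon
    if (parts[n] == "139" || parts[n] == "445") print $4
  }'
}
```

Running `ss -H -tln | smb_tcp_listeners` should print nothing when only nmbd is active. nmbd itself listens on UDP 137 and 138, which `ss -H -uln` would show.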

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
AceHack pushed a commit that referenced this pull request May 27, 2026
…namic UID/GID resolution + claude config chmod + skip-comment accuracy + samba comment scope

4 real Copilot findings on PR #5388 head, fixed:

P0 (bug) — ZETA_UID=1000/ZETA_GID=100 was hardcoded; would chown to
wrong owner if installed system uses different IDs (e.g., another user
created first, or NixOS module config changes). Fix: resolve via
`chroot /mnt id -u zeta` / `id -g zeta`; fallback to 1000:100 with
loud warning if chroot fails (degraded mode).
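
The resolve-then-fall-back behaviour described above might look roughly like the following. This is a sketch under assumptions, not the code in zeta-install.sh: `resolve_zeta_ids` is a hypothetical name, and the warning text is illustrative.

```bash
#!/usr/bin/env bash
# Sketch: resolve the zeta user's UID/GID from the installed system, with a
# loud fallback. resolve_zeta_ids is an illustrative name, not the PR's code.
resolve_zeta_ids() {
  local root="${1:-/mnt}" uid gid
  if uid="$(chroot "$root" id -u zeta 2>/dev/null)" &&
     gid="$(chroot "$root" id -g zeta 2>/dev/null)"; then
    ZETA_UID="$uid"
    ZETA_GID="$gid"
  else
    # Degraded mode: chroot or id failed, so use the previously hardcoded values.
    echo "WARNING: could not resolve zeta UID/GID via chroot ${root}; falling back to 1000:100" >&2
    ZETA_UID=1000
    ZETA_GID=100
  fi
}
```

Calling `resolve_zeta_ids /mnt` after nixos-install sets `ZETA_UID` and `ZETA_GID` for the later chown steps.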

P0 (security) — ~/.config/claude was not chmod'd after `claude login`
completed; claude CLI may write tokens with default umask leaving
them group/world-readable. Fix: add explicit chown + chmod -R go-rwx
after the login step, parallel to the gh credential restriction
already present in 6.95c.
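
The ownership and permission restriction described above can be sketched as a small helper. `restrict_credentials` is a hypothetical name used only for illustration; the installer's actual step may be structured differently.

```bash
#!/usr/bin/env bash
# Sketch: hand a credential directory to its owner and strip group/other access.
# restrict_credentials is an illustrative name; arguments are dir, uid, gid.
restrict_credentials() {
  local dir="$1" uid="$2" gid="$3"
  [ -d "$dir" ] || return 0          # nothing to restrict if login was skipped
  chown -R "${uid}:${gid}" "$dir" &&
    chmod -R go-rwx "$dir"
}
```

For example, `restrict_credentials /mnt/home/zeta/.config/claude "$ZETA_UID" "$ZETA_GID"` would apply the same treatment that 6.95c gives the gh directory.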

P2 (documentation) — Skip conditions comment said GH_AUTH_OK != 1
was a skip condition but actual code never checked it. Fix: update
comment to accurately describe control-flow (install + login attempted
regardless; gh credential persistence conditional on /root/.config/gh
existing which iter-5.4.0 only sets if gh auth succeeded).

P2 (documentation) — common.nix `samba` package comment said
"composes with services.samba below" but services.samba is NOT
configured in this PR (lives in PR #5387). Fix: comment now
correctly explains that #5388 brings client-side nmblookup/smbclient
tooling, server-side nmbd config lives in #5387, two PRs compose at
merge time.

2 stale findings (verified against current HEAD; resolved no-op
per .claude/rules/blocked-green-ci-investigate-threads.md
verify-before-fix discipline):

- P1 (bug) "npm install -g as root" — STALE; Aaron's "nodejs you
  mean bun?" catch already migrated to `sudo -u "#$ZETA_UID" bun
  install --global` in commit 7f3e29f. Copilot review fired on
  older commit 843bdb4 (initial nodejs version).
- P1 (bug) "NPM_CONFIG_PREFIX = $HOME/.npm-global literal" — STALE;
  same bun-migration commit changed to BUN_INSTALL (which IS used
  literal in environment.sessionVariables but is processed correctly
  by bun, NOT by NixOS attempting $HOME expansion in a Nix string).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@AceHack AceHack merged commit 1cc6169 into main May 27, 2026
28 of 29 checks passed
@AceHack AceHack deleted the fix-b0835-multi-protocol-name-resolution-netbios-avahi-hardening-2026-05-26-2305z branch May 27, 2026 02:46
AceHack added a commit that referenced this pull request May 27, 2026
…ode + interactive claude login + gh+claude credential persistence + Zeta repo pre-clone — automatic on boot (Aaron 2026-05-27) (#5388)

* feat(iter-5.5.0 B-0848 Phase 2 + B-0835 Bug 8): install-time claude-code install + interactive `claude login` + gh+claude credential persistence + Zeta repo pre-clone — automatic on boot before first login

Operator framing 2026-05-27 (verbatim):

  > "also wanna make this automatic on boot before i even login and have
  > it save my claude code device login like gh, also make sure they are
  > all on path for me to play with when i log in?"

  > "this will be a hell of a start."

Mirrors iter-5.4.0's gh-auth pattern at install-time for the node-local
Claude Code agent. Closes B-0848 Phase 2 (manual node setup → automated
install-step) AND fixes a previously-undocumented credential-persistence
gap (Bug 8) where iter-5.4.0's gh auth tokens at /root/.config/gh in
the installer environment were NEVER copied to /mnt/home/zeta/.config/gh
on the installed system — operator had no auth post-reboot.

iter-5.5.0 = 4-part installer step (Step 6.95, runs AFTER nixos-install
when /mnt/home/zeta exists):

  6.95a — INSTALL @anthropic-ai/claude-code via npm to
          /mnt/home/zeta/.npm-global/ (per-user writable prefix; NixOS
          /nix/store is RO so global npm install goes to user-scope).
          Owned by zeta UID:GID; survives reboot.

  6.95b — INTERACTIVE `claude login` (device-flow, same shape as
          iter-5.4.0 gh auth login). Operator presses Enter to
          accept default YES, sees device-code prompt + URL, visits
          on Mac/laptop browser, approves. Credentials land at
          /mnt/home/zeta/.config/claude/.

  6.95c — PERSIST iter-5.4.0 gh credentials by copying
          /root/.config/gh → /mnt/home/zeta/.config/gh with zeta
          ownership + go-rwx restriction. Closes Bug 8 — previously
          gh auth tokens stayed in installer environment.

  6.95d — PRE-CLONE Zeta repo to /mnt/home/zeta/Zeta so first-login
          operator workflow is `cd ~/Zeta && claude` with zero
          extra setup.

common.nix additions:

  - nodejs_22 in systemPackages: installed system has npm for post-
    install updates of claude-code without bootstrapping node first
  - samba in systemPackages: NetBIOS lookup tools (nmblookup/smbclient)
    compose with services.samba from PR #5387
  - NPM_CONFIG_PREFIX env var = $HOME/.npm-global so npm respects
    the per-user prefix for global installs
  - /etc/profile.d/zeta-user-paths.sh: prepend $HOME/.npm-global/bin
    to PATH at login-shell init (claude reachable without manual
    PATH munging)

First-login operator now has on PATH (without any setup):
  gh + claude + kubectl + helm + k9s + argocd + cilium-cli + hubble
  + nmblookup + smbclient + git + nodejs/npm + standard tools

And in $HOME:
  ~/Zeta/                  (pre-cloned)
  ~/.config/gh/            (iter-5.4.0 gh auth persisted)
  ~/.config/claude/        (iter-5.5.0 claude login persisted)
  ~/.npm-global/bin/       (on PATH)

Bug 8 sibling-discovery: same gap also applies to the ssh-key pubkey
copy (iter-5.4.0 wrote /mnt/etc/zeta/operator-authorized-keys which IS
read at install time via Bug 1 symlink fix → that part already works).
The gh/claude credential gap is the auth-token-only sub-class.

Composes with: B-0794 + B-0795 + B-0812 + B-0813 (iter-5.4 install
cascade) · B-0835 (install bug cluster — Bug 4+5+6+7 + Bug 8 here)
· B-0847 (per-AI GitHub identity; this row uses borrowed gh auth
until that ratifies) · B-0848 (node-local Claude substrate;
this row IS Phase 2 automation) · PR #5387 (multi-protocol name
resolution — samba additions compose here for NetBIOS tooling).

Per .claude/rules/non-coercion-invariant.md HC-8: operator
interactive YES/n prompt preserves operator authority over
whether to auth at install time vs post-reboot. Per
.claude/rules/algo-wink-failure-mode.md: claude login is operator
device-flow consent (authorization-source), not algorithmic auth.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(PR-5388 Rule 0 — Aaron 2026-05-27 catch): nodejs → bun (Zeta canonical TS/JS runtime per `.claude/rules/rule-0-no-sh-files.md`)

Aaron caught the violation: "nodejs you mean bun?"

Per Rule 0: bun is the canonical TS/JS runtime in Zeta (not nodejs).
The earlier commit added nodejs_22 to systemPackages + used npm-global
prefix; updated to bun + bun-global prefix throughout.

Changes:

- common.nix:
  - nodejs_22 → bun in systemPackages
  - environment.sessionVariables: NPM_CONFIG_PREFIX → BUN_INSTALL
  - /etc/profile.d/zeta-user-paths.sh: ~/.npm-global/bin → ~/.bun/bin

- zeta-install.sh Step 6.95a:
  - mkdir -p $ZETA_HOME/.npm-global → $ZETA_HOME/.bun/bin
  - command -v npm → command -v bun
  - NPM_CONFIG_PREFIX → BUN_INSTALL
  - `npm install -g @anthropic-ai/claude-code` → `bun install --global @anthropic-ai/claude-code`
  - CLAUDE_BIN path: .npm-global/bin/claude → .bun/bin/claude
  - Reorganized to run via `sudo -u "#$ZETA_UID"` so ownership starts
    correct (avoid post-install chown)

- Done message: added bun to PATH-listing summary

bun has high Node-compat; claude-code's CLI surface should run identically.
If specific Node-API addons fail, separate fix-forward via fallback path.
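
For illustration, a login-shell snippet in the style of /etc/profile.d/zeta-user-paths.sh could prepend the bun bin directory as follows. This is a sketch, not the file from the PR: `prepend_path_once` is a hypothetical helper, and the duplicate check is an added assumption so that nested login shells do not grow PATH.

```bash
# Sketch of a profile.d-style snippet; the real file may differ.
# prepend_path_once is an illustrative helper name.
prepend_path_once() {
  case ":${PATH}:" in
    *":$1:"*) ;;                 # already present, leave PATH unchanged
    *) PATH="$1:${PATH}" ;;
  esac
}

# BUN_INSTALL is the variable named in the commit; ~/.bun is used if it is unset.
prepend_path_once "${BUN_INSTALL:-$HOME/.bun}/bin"
export PATH
```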

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(PR-5388 Copilot 4 real findings — 2 stale from bun migration): dynamic UID/GID resolution + claude config chmod + skip-comment accuracy + samba comment scope

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>