From 59dff338ed65728dc3c4140e3444a8980327dc1d Mon Sep 17 00:00:00 2001
From: Lior
Date: Tue, 26 May 2026 22:37:02 -0400
Subject: [PATCH 1/2] =?UTF-8?q?fix(B-0835=20Bug=206+7=20=E2=80=94=20Aaron?=
 =?UTF-8?q?=202026-05-27=20name-resolution=20reliability=20ask):=20multi-p?=
 =?UTF-8?q?rotocol=20name=20resolution=20=E2=80=94=20Avahi=20hardening=20+?=
 =?UTF-8?q?=20NetBIOS=20via=20Samba's=20nmbd=20+=20DHCP-hostname=20registr?=
 =?UTF-8?q?ation;=20belt-and-suspenders=20for=20`i=20can't=20ping=20it=20b?=
 =?UTF-8?q?y=20name`?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Operator framing (verbatim):

> "my mac is ethernet connected and i connected to the same wifi as
> it but i still can't ping could it be something else or can we
> make hostname more reliable? maybe a netbios or something? i
> like ashai or whatever it is but can we make it reliable? i
> think this is looking very good."

Aaron observed that mDNS was unreliable even with the operator's Mac
on both Ethernet and the same WiFi as node-e5a176. Diagnostics from
the Mac: ping by IP works and SSH works, but
`dscacheutil -q host -a name node-e5a176.local` returns nothing, and
a unicast mDNS query to 192.168.4.128:5353 timed out (no response at
all, not just connection-attempt noise).

Multi-protocol additive approach (preserve the operator's preferred
Avahi/Bonjour AND add fallback mechanisms with different failure
modes):

Bug 6 — Avahi hardening
========================

Adds:
- nssmdns6 = true (IPv6 nss-mdns; some macOS configs prefer AAAA)
- ipv4 + ipv6 explicit (vs defaults that might bind one or other)
- reflector = true (reflects mDNS between the node's own interfaces;
  only helps when the node is attached to more than one segment)
- publish.hinfo + publish.userServices (additional discoverability)

Bug 7 — NetBIOS via Samba's nmbd (additive belt-and-suspenders)
================================================================

NetBIOS uses UDP broadcast on port 137 (vs mDNS multicast on 5353) —
different failure modes.
If the network drops IGMP/multicast but allows broadcast,
`node-e5a176` resolves via NetBIOS where `node-e5a176.local` fails
via mDNS.

Operator usage (any LAN host):
  nmblookup node-e5a176        # Linux/macOS NetBIOS lookup
  smbutil lookup node-e5a176   # macOS native NetBIOS
  ping node-e5a176             # only if the resolver consults NetBIOS
                               # (e.g. `wins` in nsswitch.conf on Linux)

Samba is enabled for NetBIOS name-advertisement ONLY (no shares
declared = no SMB file-share exposure). The `disable netbios = no`,
`workgroup = ZETA`, and per-host
`netbios name = config.networking.hostName` settings match the
per-node identity from injected-hostname.nix.

DHCP-hostname registration (3rd reliability layer)
===================================================

NetworkManager already advertises hostname via DHCP option 12 by
default. Many home routers (Asus/Netgear/Eero/etc) register DHCP
client hostnames as DNS names like `node-e5a176.lan`, so no NixOS
config change is needed beyond the existing
networking.networkmanager.

Operator now has 3 ways to find node-e5a176:
  1. `node-e5a176.local` (mDNS — preferred, may flake)
  2. `node-e5a176` / `nmblookup ...` (NetBIOS — different protocol)
  3. `node-e5a176.lan` (or .home) (router DHCP; works on many home
     routers)
Plus the always-reliable:
  4. IP address (192.168.4.128 in Aaron's case; via arp -a)

Composes with: B-0792 (injected-hostname); iter-5.4.1
self-registration (PR #5380 has the MAC + hostname; operator can
correlate); B-0848 (node-local Claude needs reliable name resolution
to act on cluster).

Diagnostic surface preserved at operator side: ssh in and run
`systemctl status avahi-daemon nmbd` and
`journalctl -u avahi-daemon -u nmbd --since "1 hour ago"` to see why
a specific mechanism failed (if the `nmbd` unit is not found, the
NixOS unit is likely named `samba-nmbd`).
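The lookup order above can be walked from the operator side with a small script. This is a minimal sketch, assuming a POSIX shell on a host that has either `getent` (Linux) or `dscacheutil` (macOS); the helper names `candidate_names`, `resolve_name`, and `first_working_name` are hypothetical and not part of this patch.

```shell
# Hypothetical operator-side helpers (illustration only, not in this patch).

# Print the names to try for a short hostname, in the order listed above:
# mDNS (.local), bare name, then router DNS (.lan is router-dependent).
candidate_names() {
    printf '%s\n' "$1.local" "$1" "$1.lan"
}

# Succeed if the system resolver returns an address for the name.
# Uses getent where present, else macOS dscacheutil; an empty answer
# from dscacheutil is treated as a miss.
resolve_name() {
    if command -v getent >/dev/null 2>&1; then
        getent hosts "$1" >/dev/null 2>&1
    elif command -v dscacheutil >/dev/null 2>&1; then
        [ -n "$(dscacheutil -q host -a name "$1" 2>/dev/null)" ]
    else
        return 1
    fi
}

# Print the first candidate that resolves; return non-zero if none does.
first_working_name() {
    for name in $(candidate_names "$1"); do
        if resolve_name "$name"; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    return 1
}
```

Calling `first_working_name node-e5a176` prints whichever name answered first; if it fails, fall back to the IP address. The bare-name step only exercises NetBIOS where the system resolver consults it, so `nmblookup node-e5a176` remains the direct check for that layer.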

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
---
 full-ai-cluster/nixos/modules/common.nix | 69 +++++++++++++++++++++---
 1 file changed, 62 insertions(+), 7 deletions(-)

diff --git a/full-ai-cluster/nixos/modules/common.nix b/full-ai-cluster/nixos/modules/common.nix
index 32f54691ba..ea47659c23 100644
--- a/full-ai-cluster/nixos/modules/common.nix
+++ b/full-ai-cluster/nixos/modules/common.nix
@@ -46,24 +46,79 @@
   networking.networkmanager.enable = true;
   networking.firewall.enable = true;

-  # iter-5.1 (B-0792): Avahi mDNS publishing so cluster nodes resolve
-  # via `.local` from operator Mac (Bonjour) + Linux peers
-  # (nss-mdns) on the LAN without IP-discovery step. Without this,
-  # `ssh zeta@control-plane.local` fails to resolve even though the
-  # node is up. Empirical anchor: 2026-05-26 iter-4.2 PC1 test
-  # surfaced the gap.
+  # iter-5.1 (B-0792): Avahi mDNS publishing — `.local`
+  # resolution via Bonjour (macOS) + nss-mdns (Linux peers).
+  # Empirical 2026-05-27 (Aaron node-e5a176 control-plane test): mDNS
+  # alone proved unreliable — operator's Mac (en0 ethernet, also on
+  # WiFi) could ping by IP + SSH but Bonjour resolution timed out;
+  # unicast mDNS query to port 5353/udp also timed out from the Mac
+  # even though the install completed. Multi-protocol additive
+  # belt-and-suspenders below addresses the reliability gap without
+  # removing the operator's preferred Bonjour-style mechanism.
   services.avahi = {
     enable = true;
     nssmdns4 = true;
-    openFirewall = true; # firewall hole for mDNS (5353/udp)
+    nssmdns6 = true;      # IPv6 nss-mdns alongside IPv4 (some operator
+                          # macOS configs prefer AAAA queries first)
+    openFirewall = true;  # firewall hole for mDNS (5353/udp)
+    ipv4 = true;
+    ipv6 = true;
+    reflector = true;     # reflect mDNS between this node's own interfaces;
+                          # only helps if the node sits on both segments
     publish = {
       enable = true;
       addresses = true;
       workstation = true;
       domain = true;
+      hinfo = true;        # host info record — additional discoverability
+      userServices = true; # advertise user services so dns-sd browses see node
     };
   };

+  # iter-5.5 (B-0835 Bug 7 — Aaron 2026-05-27 reliability ask):
+  # NetBIOS name resolution via Samba's nmbd as additive belt-and-
+  # suspenders alongside Avahi mDNS. NetBIOS uses UDP broadcast on
+  # 137 (vs mDNS multicast on 5353) — different failure modes; if
+  # the network drops IGMP/multicast but allows broadcast,
+  # `node-e5a176` resolves via NetBIOS where `node-e5a176.local`
+  # fails via mDNS. Windows + macOS + Linux all speak NetBIOS via
+  # nmblookup / smbutil / nss-winbind.
+  #
+  # Operator usage (from any host on the LAN):
+  #   nmblookup node-e5a176        # Linux/macOS NetBIOS lookup
+  #   smbutil lookup node-e5a176   # macOS native NetBIOS
+  #   ping node-e5a176             # may work if nsswitch has wins
+  services.samba = {
+    enable = true;
+    openFirewall = true; # 137/udp (NetBIOS-NS), 138/udp (NetBIOS-DGM),
+                         # 139/tcp (NetBIOS-SSN), 445/tcp (SMB).
+                         # We only need 137/udp for name resolution;
+                         # the rest are firewalled at the file-share
+                         # layer (no shares declared = no SMB exposure).
+    settings = {
+      global = {
+        "workgroup" = "ZETA";
+        "server string" = "Zeta cluster node %h";
+        "netbios name" = config.networking.hostName;
+        # Disable SMB file-sharing entirely — this Samba instance
+        # exists ONLY for NetBIOS name advertisement, NOT file shares.
+        # No shares declared below the global section.
+        "server min protocol" = "SMB3";
+        "smb ports" = "445"; # don't bind 139
+        "disable netbios" = "no";
+        "name resolve order" = "bcast host";
+      };
+    };
+  };
+
+  # DHCP-hostname registration: NetworkManager already advertises the
+  # hostname via DHCP option 12 by default. Many home routers register
+  # DHCP client hostnames as DNS names (e.g., `node-e5a176.lan` from
+  # Asus/Netgear/Eero). This is the 3rd reliability layer — operator's
+  # router becomes a fallback name resolver for `<hostname>` and
+  # `<hostname>.lan` (or `.home`/`.localdomain` depending on router).
+  # No additional NixOS config needed beyond NetworkManager being on.
+
   services.openssh = {
     enable = true;
     settings = {

From 28652a6d39171f630874418b023a3480936d1d85 Mon Sep 17 00:00:00 2001
From: Lior
Date: Tue, 26 May 2026 22:44:07 -0400
Subject: [PATCH 2/2] =?UTF-8?q?fix(PR-5387=20Copilot=203=20findings=20?=
 =?UTF-8?q?=E2=80=94=20P0+P1=20security=20+=20P2=20name-attribution):=20Ne?=
 =?UTF-8?q?tBIOS-only=20Samba=20via=20smbd.enable=3Dfalse=20+=20explicit?=
 =?UTF-8?q?=20allowedUDPPorts;=20replace=20'Aaron'=20with=20'operator'/'ma?=
 =?UTF-8?q?intainer'=20per=20.github/copilot-instructions.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

3 substantive findings, all real:

P0 — services.samba.openFirewall=true contradicted the "name
resolution only" claim by opening 139/tcp + 445/tcp (SMB ports).
Fix: openFirewall=false + explicit
networking.firewall.allowedUDPPorts = [ 137 138 ] (NetBIOS-NS +
NetBIOS-DGM only).

P1 — comment claimed "disables SMB file-sharing entirely" but the
config kept smbd active via `smb ports = "445"`. Fix: actually
disable smbd via services.samba.smbd.enable = false; keep
services.samba.nmbd.enable = true. Now ONLY nmbd runs — zero SMB
attack surface, comment matches reality.

P2 — comments contained personal name attribution ("Aaron ...")
which violates .github/copilot-instructions.md "No name attribution
in code, docs, or skills".
Fix: replaced with "operator" / "maintainer" / "control-plane
physical-hardware-support test" framings. Verbatim quotes from
operator already preserved at the backlog row + PR body (history
surfaces); code/module comments use role-refs only.

Substrate-honest about the security: PR #5387 as originally pushed
WOULD have opened SMB ports on cluster nodes despite the stated
goal. Reviewer caught it; the fix actually delivers the
"NetBIOS-name-resolution-only" promise.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
---
 full-ai-cluster/nixos/modules/common.nix | 39 ++++++++++++++++--------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/full-ai-cluster/nixos/modules/common.nix b/full-ai-cluster/nixos/modules/common.nix
index ea47659c23..1f35c984c6 100644
--- a/full-ai-cluster/nixos/modules/common.nix
+++ b/full-ai-cluster/nixos/modules/common.nix
@@ -48,8 +48,8 @@

   # iter-5.1 (B-0792): Avahi mDNS publishing — `.local`
   # resolution via Bonjour (macOS) + nss-mdns (Linux peers).
-  # Empirical 2026-05-27 (Aaron node-e5a176 control-plane test): mDNS
-  # alone proved unreliable — operator's Mac (en0 ethernet, also on
+  # Empirical 2026-05-27 (control-plane physical-hardware-support test):
+  # mDNS alone proved unreliable — operator's Mac (en0 ethernet, also on
   # WiFi) could ping by IP + SSH but Bonjour resolution timed out;
   # unicast mDNS query to port 5353/udp also timed out from the Mac
   # even though the install completed. Multi-protocol additive
@@ -75,7 +75,7 @@
     };
   };

-  # iter-5.5 (B-0835 Bug 7 — Aaron 2026-05-27 reliability ask):
+  # iter-5.5 (B-0835 Bug 7 — operator 2026-05-27 reliability ask):
   # NetBIOS name resolution via Samba's nmbd as additive belt-and-
   # suspenders alongside Avahi mDNS. NetBIOS uses UDP broadcast on
   # 137 (vs mDNS multicast on 5353) — different failure modes; if
@@ -88,29 +88,42 @@
   #   nmblookup node-e5a176        # Linux/macOS NetBIOS lookup
   #   smbutil lookup node-e5a176   # macOS native NetBIOS
   #   ping node-e5a176             # may work if nsswitch has wins
+  #
+  # SECURITY DISCIPLINE (P0+P1 fixes from PR #5387 Copilot review):
+  # We run ONLY nmbd (NetBIOS name daemon on 137/udp + 138/udp), NOT
+  # smbd (SMB file-sharing daemon on 139/tcp + 445/tcp). This is
+  # genuinely "NetBIOS-only" — zero SMB attack surface:
+  #   - services.samba.smbd.enable = false (no smbd process)
+  #   - services.samba.nmbd.enable = true (nmbd ONLY)
+  #   - services.samba.openFirewall = false (we control firewall manually)
+  #   - networking.firewall.allowedUDPPorts = [ 137 138 ] (NetBIOS only)
+  # Reviewer caught the prior `openFirewall = true` + `smb ports = "445"`
+  # config that opened 139/tcp + 445/tcp despite the "name resolution
+  # only" claim. Now genuinely true.
   services.samba = {
     enable = true;
-    openFirewall = true; # 137/udp (NetBIOS-NS), 138/udp (NetBIOS-DGM),
-                         # 139/tcp (NetBIOS-SSN), 445/tcp (SMB).
-                         # We only need 137/udp for name resolution;
-                         # the rest are firewalled at the file-share
-                         # layer (no shares declared = no SMB exposure).
+    openFirewall = false; # we open ONLY 137/138 UDP below; no SMB ports
+    smbd.enable = false;  # NO SMB file-sharing daemon
+    nmbd.enable = true;   # NetBIOS name daemon ONLY
     settings = {
       global = {
         "workgroup" = "ZETA";
         "server string" = "Zeta cluster node %h";
         "netbios name" = config.networking.hostName;
-        # Disable SMB file-sharing entirely — this Samba instance
-        # exists ONLY for NetBIOS name advertisement, NOT file shares.
-        # No shares declared below the global section.
-        "server min protocol" = "SMB3";
-        "smb ports" = "445"; # don't bind 139
         "disable netbios" = "no";
         "name resolve order" = "bcast host";
       };
     };
   };

+  # Explicit NetBIOS-only firewall holes (P0 fix per PR #5387 review):
+  #   137/udp = NetBIOS-NS (name service queries)
+  #   138/udp = NetBIOS-DGM (datagram service for browse-list announcements)
+  # We do NOT open 139/tcp (NetBIOS-SSN) or 445/tcp (SMB) since smbd is
+  # disabled. This is genuinely "NetBIOS name resolution only" — no SMB
+  # file-share surface exposed even if smbd accidentally got re-enabled.
+  networking.firewall.allowedUDPPorts = [ 137 138 ];
+
   # DHCP-hostname registration: NetworkManager already advertises the
   # hostname via DHCP option 12 by default. Many home routers register
   # DHCP client hostnames as DNS names (e.g., `node-e5a176.lan` from