feat(infra): per-host configs control-plane + worker-gpu-01/02 (PR 3 of Addison's plan) #4899
PR 3 of Addison's NixOS-AI-cluster bootstrap plan. Adds three host configs that compose the shared modules from PR #4898:

infra/nixos/hosts/control-plane/:
- configuration.nix: imports common + k3s-server
- hardware-configuration.nix: placeholder (replaced during install by `nixos-generate-config --root /mnt`)
- README.md: install runbook, post-install verification commands

infra/nixos/hosts/worker-gpu-01/:
- configuration.nix: imports common + k3s-agent + gpu
- serverAddr points at control-plane.zeta.local:6443
- hardware-configuration.nix: placeholder

infra/nixos/hosts/worker-gpu-02/:
- identical shape to worker-gpu-01 (separate file so per-machine labels / hardware specifics are declared per host)

flake.nix:
- nixosConfigurations now exposes control-plane, worker-gpu-01, worker-gpu-02 alongside installer

Placeholder hardware-configuration.nix files ship with minimal valid stubs (not-detected.nix import + DHCP + ext4 by-label devices) so `nix flake check` succeeds in CI. Each comment block names the generator command that replaces them during real install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
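For readers unfamiliar with flakes, the `nixosConfigurations` wiring described above could look roughly like the sketch below. This is not the PR's actual flake.nix: the `mkHost` helper, the nixpkgs branch, and the `x86_64-linux` system for the cluster hosts are assumptions made for illustration (the conversation only states that the installer is x86_64-linux).

```nix
{
  # Assumption: the nixpkgs branch is illustrative; the real pin lives in flake.lock.
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }:
    let
      # Hypothetical helper: builds one NixOS system from a host directory.
      mkHost = name: nixpkgs.lib.nixosSystem {
        system = "x86_64-linux"; # assumption for the three cluster hosts
        modules = [ (./infra/nixos/hosts + "/${name}/configuration.nix") ];
      };
    in
    {
      # One attribute per host directory named in the commit message.
      nixosConfigurations = nixpkgs.lib.genAttrs
        [ "installer" "control-plane" "worker-gpu-01" "worker-gpu-02" ]
        mkHost;
    };
}
```

With a layout like this, adding a host would mean creating a new directory under `infra/nixos/hosts/` and appending its name to the list.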
e42034a into feat/addison-flake-and-modules-2026-05-24
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ee011111c
## Post-install verification

```bash
ssh zeta@control-plane
```
Add credential bootstrap before SSH verification
Following this runbook verbatim on a fresh install will fail at the first verification step because `ssh zeta@control-plane` assumes remote auth is already configured, but this commit's host config leaves `users.users.zeta.openssh.authorizedKeys.keys` empty while the shared baseline uses key-only SSH and no initial password. In that state, operators cannot complete post-install verification remotely; add an explicit pre-SSH step to install a key (or set a password) before this command.
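One way to address this finding would be to declare an operator key in the host config before installing. The fragment below is only a sketch: the key string is a placeholder, and whether the key belongs in the host config or in the shared `common.nix` module is a design choice the review does not settle.

```nix
# Sketch for infra/nixos/hosts/control-plane/configuration.nix (not the PR's actual file).
{ ... }:
{
  users.users.zeta.openssh.authorizedKeys.keys = [
    # Placeholder public key: replace with the operator's real key before install.
    "ssh-ed25519 AAAA... operator@workstation"
  ];
}
```

Because the shared baseline is described as key-only with no initial password, a key declared this way would be the only remote entry point after the first boot.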
… add-host (#4901)

PR 5 (final) of Addison's NixOS-AI-cluster bootstrap plan.

infra/README.md is the human entry point: tree diagram, bootstrap runbook (4 steps from ISO to running cluster), bootstrap order (9 steps from control-plane boot to self-managing cluster), add-a-workload flow, add-a-host flow, update procedures, secrets posture, devShell usage.

The optional scripts/build-usb.sh from the original plan is skipped per Rule 0 (no .sh outside tools/setup/). The one-liner equivalent (`nix build .#installer-iso` + `sudo dd`) is documented in the README's "Build the installer ISO" section.

This completes the file tree Addison enumerated:

✓ flake.nix (PR #4898)
✓ flake.lock (generated by `nix flake update`; not authored)
✓ .gitignore additions (PR #4898)
✓ infra/nixos/modules/{common,k3s-server,k3s-agent,gpu}.nix (PR #4898)
✓ infra/nixos/hosts/installer/configuration.nix (PR #4897)
✓ infra/nixos/hosts/control-plane/ (PR #4899)
✓ infra/nixos/hosts/worker-gpu-01/ (PR #4899)
✓ infra/nixos/hosts/worker-gpu-02/ (PR #4899)
✓ infra/k8s/bootstrap/{argocd-namespace,argocd-install,initial-orleans}.yaml (PR #4900)
✓ infra/k8s/applications/root-application.yaml (PR #4900)
✓ infra/k8s/applications/orleans/{Application,deployment,service,rbac,configmap}.yaml (PR #4900)
✓ infra/k8s/applications/{gitlab,argoworkflows,argorollouts}/Application.yaml (PR #4900)
✓ infra/README.md (this PR)
⨯ scripts/build-usb.sh (skipped per Rule 0)

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 unresolved threads from Copilot: fixed the real issues, marking the rest as outdated.

P0 fixes:

- k3s-server.nix manifests path (line 55): `../../../k8s/` resolves to `<repo-root>/k8s/` (one level above `infra/`), which doesn't exist. Module at `infra/nixos/modules/` needs `../../k8s/` to reach `infra/k8s/`. Fixed all 3 references (argocd-namespace, argocd-install, root-application).
- k3s-server.nix clusterInit (line 31): unconditional `true` is wrong for multi-server HA. Only the first control-plane node should set clusterInit; additional servers join via serverAddr. Changed to `lib.mkDefault true` and documented the per-host override pattern for HA expansion.

P1 fixes:

- k3s-server.nix kubeconfig mode (line 40): `--write-kubeconfig-mode=0644` makes the admin kubeconfig world-readable, leaking cluster-admin creds to any unprivileged user on the control-plane node. Changed to 0640 + `--write-kubeconfig-group=wheel` so the wheel group can use kubectl without sudo, but other users can't read it.
- flake.nix supportedSystems / installer-iso (line 75): the installer NixOS config is x86_64-linux only, but the package was published on all `supportedSystems` including aarch64-linux, which would fail evaluation. Gated with `nixpkgs.lib.optionalAttrs (system == "x86_64-linux")` so `packages.aarch64-linux` is empty and `packages.x86_64-linux.installer-iso` resolves cleanly. devShell + formatter remain on all systems.

Marked outdated (no fix needed):

- flake.nix line 20 (Copilot reviewed before PR #4900 landed): k8s/ directory now exists post-stack-merge.
- flake.nix line 80 (Copilot reviewed before PR #4899 landed): "Future hosts land in PR 2" comment removed by per-host PR.

Not fixed in this commit:

- flake.lock (Copilot P1 line 5): requires `nix flake update` on a machine with Nix installed; not present on the autonomous agent's workstation. First maintainer with Nix runs the update and commits the resulting lock file as a follow-up; that commit is byte-stable and reviewable in isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
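The clusterInit and kubeconfig fixes described in this commit message might take roughly the following shape in `k3s-server.nix`. This is a sketch under assumptions, not the committed module: only the option values named in the message are taken from the source, and the flags are joined with `toString` so the fragment works whether `services.k3s.extraFlags` expects a string or a list in a given nixpkgs release.

```nix
# Sketch of the relevant part of infra/nixos/modules/k3s-server.nix.
{ lib, ... }:
{
  services.k3s = {
    enable = true;
    role = "server";

    # mkDefault keeps `true` for the first control-plane node while letting a
    # later HA server override it with a plain `clusterInit = false;` in its
    # own host config (together with a serverAddr pointing at the first node).
    clusterInit = lib.mkDefault true;

    # Group-readable kubeconfig instead of world-readable 0644.
    extraFlags = toString [
      "--write-kubeconfig-mode=0640"
      "--write-kubeconfig-group=wheel"
    ];
  };
}
```

The point of `lib.mkDefault` here is priority: a host that sets the option without any modifier wins over the module's default, so no `mkForce` is needed for the HA case.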
#4898)

* feat(infra): flake.nix + shared NixOS modules (common, k3s-server, k3s-agent, gpu)

Wires the installer config from PR #4897 into a buildable flake and seeds the shared modules every cluster host will import.

flake.nix:
- nixosConfigurations.installer → builds the USB ISO that PR #4897 declared
- packages.installer-iso convenience alias
- devShells.default with cluster admin toolkit
- nixosModules.{common,k3s-server,k3s-agent,gpu} for downstream per-host configs

infra/nixos/modules/common.nix:
- Nix + flakes settings (cache, GC, trusted users)
- Locale + time defaults
- NetworkManager + firewall ON
- SSH key-only (no PermitRootLogin password, no PasswordAuthentication)
- `zeta` admin user (no initialPassword)
- Baseline package set (git/vim/htop/kubectl/k9s/etc)
- systemd-boot UEFI
- powerManagement.cpuFreqGovernor = "performance" (AI workloads)

infra/nixos/modules/k3s-server.nix:
- role=server with embedded etcd (clusterInit=true)
- Disables bundled servicelb + traefik (ArgoCD will land replacements)
- Auto-applies k8s/bootstrap/* manifests on first boot so ArgoCD self-installs and immediately starts reconciling root-application
- Firewall opens 6443/10250/2379/2380 + 8472/udp for flannel
- KUBECONFIG env baked in

infra/nixos/modules/k3s-agent.nix:
- role=agent joins via serverAddr + tokenFile
- Node label zeta.io/role=worker for placement
- Firewall opens 10250 + 8472/udp

infra/nixos/modules/gpu.nix:
- NVIDIA driver (production branch) + container toolkit
- allowUnfreePredicate scoped to nvidia + cuda packages only
- nvidia-persistenced enabled (avoids first-pod cold-start tax)
- Node label zeta.io/gpu=nvidia for `nvidia.com/gpu` pod requests
- nvtop + cudart + nvcc on the host for diagnostics

.gitignore additions:
- result, result-* (nix build outputs)
- .direnv/, .envrc.local (worker-shell flake integration)
- .nix-eval-cache/
- /hardware-configuration.nix (top-level only; per-host configs keep theirs under infra/nixos/hosts/<host>/)

Tokens are placeholder-pathed (tokenFile = /var/lib/rancher/k3s/.../token) so plaintext secrets never land in Git. sops-nix or agenix wiring lands in a follow-up PR alongside the per-host configs that need real tokens.

Bootstrap manifests referenced from k3s-server.nix (k8s/bootstrap/*) land in PR 3; until then the manifests reference resolves to a not-yet-existent path, which is fine because no host imports k3s-server.nix yet (per-host configs land in PR 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(infra): per-host configs (control-plane + worker-gpu-01/02) (#4899)

* feat(infra): k8s bootstrap + ArgoCD App-of-Apps (orleans, gitlab, argo-workflows, argo-rollouts) (#4900)

PR 4 of Addison's NixOS-AI-cluster bootstrap plan.
Lands the Kubernetes substrate that K3S auto-applies and ArgoCD then takes over reconciling.

infra/k8s/bootstrap/ (K3S auto-applies on first boot via services.k3s.manifests in k3s-server.nix):
- argocd-namespace.yaml
- argocd-install.yaml: kustomize ref to ArgoCD v2.13.2 upstream manifest (pinned for reproducibility)
- initial-orleans.yaml: minimal Orleans bootstrap StatefulSet scaled to replicas: 0 until a real silo image is published. Includes namespace, ServiceAccount, Role+RoleBinding for Kubernetes-clustering pod/endpoint discovery, headless service, client gateway service.

infra/k8s/applications/ (ArgoCD watches this dir recursively):
- root-application.yaml: App-of-Apps root; auto-applied by K3S. Selects Application.yaml at any depth via include glob.
- orleans/Application.yaml: ArgoCD-managed Orleans, supersedes the bootstrap StatefulSet once reconcile completes
- orleans/{deployment,service,rbac,configmap}.yaml: full Orleans StatefulSet (replicas: 0 placeholder), headless silo + client + dashboard services, RBAC, cluster config
- gitlab/Application.yaml: GitLab CE Helm chart with bundled cert-manager/nginx/prometheus DISABLED (cluster has its own) and runners enabled for in-cluster CI
- argoworkflows/Application.yaml: Argo Workflows 3.6 family; 7-day workflow retention; parallelism 50
- argorollouts/Application.yaml: Argo Rollouts 1.8 family with dashboard enabled for canary/blue-green inspection

Add-a-workload-to-the-cluster flow:
1. mkdir infra/k8s/applications/<name>/
2. write Application.yaml + supporting manifests
3. git commit + push to main
4. ArgoCD picks it up on next sync (~3 min)
5. K3S applies it

The flake IS the tick source. The cluster reconciles toward it.

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(infra): add infra/README.md - bootstrap runbook + add-workload + add-host (#4901)

* fix(infra): address Copilot P0/P1 review on PR #4898

* fix(infra): address 2nd Copilot review wave (P0 port, README, comments)

Second batch of Copilot review (8 new threads in addition to the 7 from the first wave already fixed in the prior commit).

P0 fix:

- k3s-server.nix firewall (line 77): missing port 9345/TCP, the K3S supervisor/registration port.
Without it, agents cannot complete the join handshake and additional server nodes can't participate in HA. Added with explanatory comment.

P1 fixes:

- infra/README.md serverAddr scheme (line 92): documented as `control-plane.zeta.local:6443` without `https://`. NixOS `services.k3s.serverAddr` requires the scheme. Updated to show the full URL and named the validation constraint.
- infra/README.md secrets section (line 125): only documented the server-role token path. Agent-role token path (/var/lib/rancher/k3s/agent/token) was missing. Added both, plus the openssl generation one-liner and the "same value to all nodes" requirement.
- infra/k8s/bootstrap/initial-orleans.yaml header (line 6): contained named attribution + anthropomorphic content. Per codebase convention for current-state infra manifests, rewritten as factual scope description.

P2 fixes:

- infra/nixos/modules/gpu.nix open-modules comment (line 51): said "Open-source kernel modules - works on RTX 20-series and newer" but default was `false` (proprietary). Comment now matches the default (proprietary chosen for hardware compatibility) and names the per-host override for newer-only nodes.
- infra/k8s/bootstrap/argocd-install.yaml ArgoCD version comment (line 25): conflated ArgoCD-version pinning with targetRevision-of-tracked-Git-ref. Rewrote to separate the two upgrade vectors clearly.

Already-resolved-by-prior-commit (no fix needed in this commit):

- k3s-server.nix line 49 P0 (write-kubeconfig-mode 0644): same finding as line 54 from the first wave; the prior commit already changed to 0640 + group=wheel.
- flake.nix line 53 P1 (aarch64-linux): same finding as the line 75 thread from the first wave; the prior commit already gated installer-iso to x86_64-linux only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): markdownlint MD007 - nested list indent in README

CI surfaced two MD007/ul-indent errors at infra/README.md lines 126-127: nested unordered list items used 4-space indent (expected 2-space). The nested list was the per-role K3S token path enumeration added in the prior 2nd-wave fix commit (PR #4898 Copilot review thread PRRT_kwDOSF9kNM6EcQFw). Restructured to fix the indent and add a blank line between the nested list and the continuation paragraph so markdownlint sees the structure cleanly.

Verified locally: `npx markdownlint-cli2 infra/README.md` now returns 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): 3rd Copilot review wave (manifest, .example refs, framing)

7 new threads after the prior fix-batch CI pass.

P1 fixes:

- k3s-server.nix manifests + README claim: README documented K3S applying initial-orleans.yaml, but it wasn't in the manifests list. ADD initial-orleans.source to manifests (the right fix: Orleans namespace + RBAC + skeleton StatefulSet should be bootstrapped alongside ArgoCD). Updated the surrounding comment to reflect the 4-manifest seed (argocd-namespace, argocd-install, initial-orleans, root-application).
- k3s-server.nix servicelb/traefik comment (line 52): said "ArgoCD will install MetalLB + ingress-nginx as Applications" but no such Applications exist under infra/k8s/applications/. Reworded to name the bootstrap-period gap (LoadBalancer Services stay Pending; use NodePort or host-network during bootstrap).
- control-plane/README.md (line 58): referenced hardware-configuration.nix.example which was removed in the fix-up that gave each host a real placeholder. Replaced with current-state description (placeholder content + generator command to replace it on real install).
- worker-gpu-01/configuration.nix import comment (line 12): same hardware-configuration.nix.example reference. Updated to match current placeholder + generator command pattern.

P2 fix:

- infra/README.md "The framing" section (line 148): contained named attribution ("Per Addison's spec") in a current-state infra doc. Per codebase convention for current-state surfaces (vs. history/roster surfaces which exempt attribution), reworded as factual design statement; kept the substrate intent (declarative desired state, drift reconciliation, single source of truth) without the attribution.

Resolved as outdated (no fix needed):

- flake.nix line 90 (Copilot reading the PR DESCRIPTION which said per-host configs land in PR 3; the stack collapsed and this PR contains them; description is historical and stale, the code is correct).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
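Putting the worker-side details from these commit messages together, a GPU worker's host config could look roughly like this. It is a sketch rather than the merged file: the import paths assume the directory layout listed in the commits, and the token file must be provisioned out of band (the messages defer sops-nix or agenix wiring to a follow-up).

```nix
# Sketch of infra/nixos/hosts/worker-gpu-01/configuration.nix.
{ ... }:
{
  imports = [
    ./hardware-configuration.nix
    ../../modules/common.nix
    ../../modules/k3s-agent.nix
    ../../modules/gpu.nix
  ];

  networking.hostName = "worker-gpu-01";

  services.k3s = {
    # Full URL including the https:// scheme, as the second review wave requires.
    serverAddr = "https://control-plane.zeta.local:6443";
    # Agent-role token path named in the README fix; the file itself never lands in Git.
    tokenFile = "/var/lib/rancher/k3s/agent/token";
  };
}
```

`worker-gpu-02` would differ only in its host name and any per-machine labels or hardware specifics.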
Summary
PR 3 of Addison's NixOS-AI-cluster bootstrap plan. Adds the three per-host configs that compose the shared modules from #4898.
Base: #4898 (will rebase to main once #4898 merges).
Files
| Host | Modules |
| --- | --- |
| control-plane | common + k3s-server |
| worker-gpu-01 | common + k3s-agent + gpu |
| worker-gpu-02 | common + k3s-agent + gpu |

Each host directory has:

- `configuration.nix`: host identity + module imports + per-host overrides
- `hardware-configuration.nix`: placeholder stub (replaced during real install by `nixos-generate-config --root /mnt`)
- `README.md` on control-plane: install runbook + post-install verification

`flake.nix` now exposes all four configs in `nixosConfigurations`: `installer`, `control-plane`, `worker-gpu-01`, `worker-gpu-02`.

Hardware config placeholders
Real `hardware-configuration.nix` is generator output specific to each target machine. Placeholders ship as minimal valid stubs (`not-detected.nix` import + DHCP + ext4 by-label fileSystems) so:

- `nix flake check` passes in CI
- `nix build .#nixosConfigurations.control-plane` succeeds at evaluation

Each placeholder has a comment block naming the generator command.
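A placeholder of the kind described here might look like the following. It is a sketch, not the file shipped in the PR: the disk labels (`nixos`, `boot`), the vfat boot partition, and the `hostPlatform` value are assumptions chosen to match a typical systemd-boot UEFI install.

```nix
# PLACEHOLDER: replace during real install with the output of
#   nixos-generate-config --root /mnt
{ lib, modulesPath, ... }:
{
  imports = [ (modulesPath + "/installer/scan/not-detected.nix") ];

  # Assumed labels; the generator emits the real devices for each machine.
  fileSystems."/" = {
    device = "/dev/disk/by-label/nixos";
    fsType = "ext4";
  };
  fileSystems."/boot" = {
    device = "/dev/disk/by-label/boot";
    fsType = "vfat";
  };

  networking.useDHCP = lib.mkDefault true;
  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
}
```

A stub like this only needs to evaluate; it is never meant to boot a real machine.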
Test plan
- `nix flake check` passes
- `nix build .#nixosConfigurations.{control-plane,worker-gpu-01,worker-gpu-02}` succeed
- `services.k3s.manifests` paths resolve cleanly

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com