feat(infra): single-file installer packages for USB stick (Addison) #4897
Adds infra/nixos/hosts/installer/configuration.nix: one file that declares every package needed on the bootable USB installer image for the NixOS-based AI cluster bootstrap (Addison's request).

Organized by install-time role so the right section is reachable when something is missing mid-install:

- Version control (git, git-lfs, gnupg, openssh)
- Editors (vim, neovim, nano)
- Shell QoL (tmux, htop, ripgrep, jq, yq-go, fzf, bat, eza, ...)
- Network (curl, nmap, networkmanager, iwd, wireguard-tools, ...)
- Disk (parted, gptfdisk, cryptsetup, zfs, lvm2, mdadm, ...)
- Hardware inspection (lshw, dmidecode, nvme-cli, lm_sensors, ...)
- GPU detection (glxinfo, vulkan-tools, clinfo)
- NixOS install tooling (nixos-install-tools, nom, nvd, nh)
- Kubernetes clients (kubectl, helm, k9s, argocd, k3s binary)
- Secrets (age, sops, ssh-to-age)
- Build helpers (gcc, gnumake, pkg-config, coreutils)
- Observability (iotop, iftop, ncdu, pv)
- Documentation (man-pages, tldr)

K3S/ArgoCD/Orleans/GitLab runtime is NOT on the stick; it lands on the target machine via `nixos-install --flake` pulling from this same Git repo. The stick is one-shot ignition; the flake-in-Git is the strange attractor that draws desired state.

Pre-stages the Zeta flake on the stick at /etc/zeta with the install runbook baked in via environment.etc, so the install works offline once the stick is dd'd.

Composes with the upcoming flake.nix (gated on Addison) which will wire `nixosConfigurations.installer` to this file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e2ace3ed9a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Pull request overview
Adds a NixOS installer ISO configuration (nixosSystem module) intended to be the single Git-tracked source of truth for what packages/tools are present on a bootable USB installer used to bootstrap the cluster.
Changes:
- Introduces a new NixOS installer configuration importing the upstream minimal install CD modules.
- Defines a large `environment.systemPackages` set for install-time workflows (disk, network, secrets, k8s CLIs, etc.).
- Adds ISO branding plus a baked `/etc/zeta/README.md` runbook.
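
As a rough illustration of the shape this overview describes, an installer module of this kind might look like the sketch below. The package subset and the runbook text are placeholders chosen for illustration and are not the contents of the actual file.

```nix
# Sketch only: a trimmed-down installer module, not the PR's configuration.nix.
{ pkgs, modulesPath, ... }:
{
  # Reuse the upstream minimal install CD profile as the base image.
  imports = [
    "${modulesPath}/installer/cd-dvd/installation-cd-minimal.nix"
  ];

  # A small, illustrative subset of the role-grouped package list.
  environment.systemPackages = with pkgs; [
    git git-lfs gnupg          # version control
    parted gptfdisk cryptsetup # disk
    kubectl k9s                # Kubernetes clients
    age sops                   # secrets
  ];

  # Bake a runbook onto the image; the text here is a placeholder.
  environment.etc."zeta/README.md".text = ''
    # Install runbook (placeholder)
    Steps would be documented here.
  '';
}
```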
Copilot P0 (line 67): firewall-off + SSH-on + PermitRootLogin=yes is
trivially remotely exploitable on any networked install.
- networking.firewall.enable = true
- services.openssh.enable = false (off by default; enable manually
for headless install)
- When enabled, key-only: PermitRootLogin=prohibit-password,
PasswordAuthentication=false, KbdInteractiveAuthentication=false
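
A minimal sketch of the settings listed above, assuming a NixOS release where these options live under `services.openssh.settings`. The use of `lib.mkForce` is an assumption, added defensively in case an imported installer profile sets the same options at normal priority.

```nix
# Sketch only: firewall on, sshd off by default, key-only auth if it is enabled.
{ lib, ... }:
{
  networking.firewall.enable = true;

  services.openssh = {
    # mkForce is an assumption; it may be unnecessary depending on the profile.
    enable = lib.mkForce false;
    settings = {
      PermitRootLogin = lib.mkForce "prohibit-password";
      PasswordAuthentication = false;
      KbdInteractiveAuthentication = false;
    };
  };
}
```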
Copilot P0 (line 77): hardcoded initialPassword="zeta" for root+nixos
plus wheelNeedsPassword=false = known-credential priv-escalation.
- Removed all initialPassword settings; rely on upstream
installation-cd-minimal.nix passwordless-root-at-console default
- Removed wheelNeedsPassword=false; sudo requires password now
- Documented headless workflow (manual passwd + systemctl start sshd
+ pre-seeded SSH key)
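
For the pre-seeded key part of the headless workflow, a sketch along these lines could be added before building the ISO. The key string and its comment are hypothetical placeholders; the `nixos` user name follows the option path quoted in the review thread.

```nix
# Sketch only: pre-seed a public key for the live installer's `nixos` user.
{
  users.users.nixos.openssh.authorizedKeys.keys = [
    # Hypothetical placeholder; replace with the operator's real public key.
    "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPLACEHOLDERPLACEHOLDERPLACEHOLDERPLACEHOLD operator@example"
  ];
}
```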
Copilot P1 (line 243): runbook referenced per-host READMEs that
don't exist yet.
- Rewrote runbook to reflect what's actually shippable today
- Tagged follow-up PRs for per-host configs explicitly
Copilot P1 (line 244): claimed "/etc/zeta pre-staged" but only
README.md was written.
- Renamed to /etc/zeta-install.md (honest filename)
- Documented that the flake isn't auto-staged; clone-from-network
is the workflow today
- Tagged future `environment.etc."zeta".source = inputs.self` as
a follow-up that requires repo-root flake.nix to land first
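
The tagged follow-up could look roughly like this once a repo-root flake.nix exists. It assumes the flake passes its inputs into modules (for example through `specialArgs = { inherit inputs; }`), which is hypothetical wiring not shown in this PR.

```nix
# Sketch only: stage the flake's own source tree at /etc/zeta on the image.
# Assumes `inputs` reaches this module via specialArgs (hypothetical wiring).
{ inputs, ... }:
{
  # `outPath` is the store path of the flake source as a string.
  environment.etc."zeta".source = inputs.self.outPath;
}
```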
Codex P1 (line 230): "Bundle the actual flake under /etc/zeta"
- Same fix as Copilot P1 (line 244) — acknowledged as follow-up
work that requires flake.nix at repo root, which is a separate PR
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ab288c8bb
    # SSH key into `users.users.nixos.openssh.authorizedKeys.keys` here
    # before building the ISO.
    services.openssh = {
      enable = false;
Open SSH firewall port for documented headless workflow
This config documents a headless path (`systemctl start sshd`) but keeps `networking.firewall.enable = true` while declaring `services.openssh.enable = false`, so TCP/22 is never opened by NixOS and remote SSH attempts to the live installer will be dropped. This only affects the headless install scenario, but in that scenario the documented instructions won't work unless the firewall is relaxed or port 22 is explicitly allowed.
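
One way to act on this suggestion, sketched under the assumption that the firewall should stay enabled, is to allow port 22 explicitly rather than relying on the OpenSSH module to open it.

```nix
# Sketch only: keep the firewall on but allow inbound SSH explicitly.
{
  networking.firewall = {
    enable = true;
    # Opens TCP/22 regardless of whether services.openssh is enabled.
    allowedTCPPorts = [ 22 ];
  };
}
```

This addresses only the firewall half of the headless path. Whether `systemctl start sshd` finds a unit to start while `services.openssh.enable = false` is worth checking separately.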
#4898)

* feat(infra): flake.nix + shared NixOS modules (common, k3s-server, k3s-agent, gpu)

Wires the installer config from PR #4897 into a buildable flake and seeds the shared modules every cluster host will import.

flake.nix:
- nixosConfigurations.installer → builds the USB ISO that PR #4897 declared
- packages.installer-iso convenience alias
- devShells.default with cluster admin toolkit
- nixosModules.{common,k3s-server,k3s-agent,gpu} for downstream per-host configs

infra/nixos/modules/common.nix:
- Nix + flakes settings (cache, GC, trusted users)
- Locale + time defaults
- NetworkManager + firewall ON
- SSH key-only (no PermitRootLogin password, no PasswordAuthentication)
- `zeta` admin user (no initialPassword)
- Baseline package set (git/vim/htop/kubectl/k9s/etc)
- systemd-boot UEFI
- powerManagement.cpuFreqGovernor = "performance" (AI workloads)

infra/nixos/modules/k3s-server.nix:
- role=server with embedded etcd (clusterInit=true)
- Disables bundled servicelb + traefik (ArgoCD will land replacements)
- Auto-applies k8s/bootstrap/* manifests on first boot so ArgoCD self-installs and immediately starts reconciling root-application
- Firewall opens 6443/10250/2379/2380 + 8472/udp for flannel
- KUBECONFIG env baked in

infra/nixos/modules/k3s-agent.nix:
- role=agent joins via serverAddr + tokenFile
- Node label zeta.io/role=worker for placement
- Firewall opens 10250 + 8472/udp

infra/nixos/modules/gpu.nix:
- NVIDIA driver (production branch) + container toolkit
- allowUnfreePredicate scoped to nvidia + cuda packages only
- nvidia-persistenced enabled (avoids first-pod cold-start tax)
- Node label zeta.io/gpu=nvidia for `nvidia.com/gpu` pod requests
- nvtop + cudart + nvcc on the host for diagnostics

.gitignore additions:
- result, result-* (nix build outputs)
- .direnv/, .envrc.local (worker-shell flake integration)
- .nix-eval-cache/
- /hardware-configuration.nix (top-level only; per-host configs keep theirs under infra/nixos/hosts/<host>/)

Tokens are placeholder-pathed (tokenFile = /var/lib/rancher/k3s/.../token) so plaintext secrets never land in Git. sops-nix or agenix wiring lands in a follow-up PR alongside the per-host configs that need real tokens.

Bootstrap manifests referenced from k3s-server.nix (k8s/bootstrap/*) land in PR 3; until then the manifests reference resolves to a not-yet-existent path, which is fine because no host imports k3s-server.nix yet (per-host configs land in PR 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(infra): per-host configs (control-plane + worker-gpu-01/02) (#4899)

PR 3 of Addison's NixOS-AI-cluster bootstrap plan. Adds three host configs that compose the shared modules from PR #4898:

infra/nixos/hosts/control-plane/:
- configuration.nix: imports common + k3s-server
- hardware-configuration.nix: placeholder (replaced during install by `nixos-generate-config --root /mnt`)
- README.md: install runbook, post-install verification commands

infra/nixos/hosts/worker-gpu-01/:
- configuration.nix: imports common + k3s-agent + gpu
- serverAddr points at control-plane.zeta.local:6443
- hardware-configuration.nix: placeholder

infra/nixos/hosts/worker-gpu-02/:
- identical shape to worker-gpu-01 (separate file so per-machine labels / hardware specifics declare per host)

flake.nix:
- nixosConfigurations now exposes control-plane, worker-gpu-01, worker-gpu-02 alongside installer

Placeholder hardware-configuration.nix files ship with minimal valid stubs (not-detected.nix import + DHCP + ext4 by-label devices) so `nix flake check` succeeds in CI. Each comment block names the generator command that replaces them during real install.

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(infra): k8s bootstrap + ArgoCD App-of-Apps (orleans, gitlab, argo-workflows, argo-rollouts) (#4900)

PR 4 of Addison's NixOS-AI-cluster bootstrap plan.
Lands the Kubernetes substrate that K3S auto-applies and ArgoCD then takes over reconciling.

infra/k8s/bootstrap/ (K3S auto-applies on first boot via services.k3s.manifests in k3s-server.nix):
- argocd-namespace.yaml
- argocd-install.yaml: kustomize ref to ArgoCD v2.13.2 upstream manifest (pinned for reproducibility)
- initial-orleans.yaml: minimal Orleans bootstrap StatefulSet scaled to replicas: 0 until a real silo image is published. Includes namespace, ServiceAccount, Role+RoleBinding for Kubernetes-clustering pod/endpoint discovery, headless service, client gateway service.

infra/k8s/applications/ (ArgoCD watches this dir recursively):
- root-application.yaml: App-of-Apps root; auto-applied by K3S. Selects Application.yaml at any depth via include glob.
- orleans/Application.yaml: ArgoCD-managed Orleans, supersedes the bootstrap StatefulSet once reconcile completes
- orleans/{deployment,service,rbac,configmap}.yaml: full Orleans StatefulSet (replicas: 0 placeholder), headless silo + client + dashboard services, RBAC, cluster config
- gitlab/Application.yaml: GitLab CE Helm chart with bundled cert-manager/nginx/prometheus DISABLED (cluster has its own) and runners enabled for in-cluster CI
- argoworkflows/Application.yaml: Argo Workflows 3.6 family; 7-day workflow retention; parallelism 50
- argorollouts/Application.yaml: Argo Rollouts 1.8 family with dashboard enabled for canary/blue-green inspection

Add-a-workload-to-the-cluster flow:
1. mkdir infra/k8s/applications/<name>/
2. write Application.yaml + supporting manifests
3. git commit + push to main
4. ArgoCD picks it up on next sync (~3 min)
5. K3S applies it

The flake IS the tick source. The cluster reconciles toward it.

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(infra): add infra/README.md - bootstrap runbook + add-workload + add-host (#4901)

PR 5 (final) of Addison's NixOS-AI-cluster bootstrap plan.
infra/README.md is the human entry point: tree diagram, bootstrap runbook (4 steps from ISO to running cluster), bootstrap order (9 steps from control-plane boot to self-managing cluster), add-a-workload flow, add-a-host flow, update procedures, secrets posture, devShell usage.

The optional scripts/build-usb.sh from the original plan is skipped per Rule 0 (no .sh outside tools/setup/). The one-liner equivalent (`nix build .#installer-iso` + `sudo dd`) is documented in the README's "Build the installer ISO" section.

This completes the file tree Addison enumerated:
✓ flake.nix (PR #4898)
✓ flake.lock (generated by `nix flake update`; not authored)
✓ .gitignore additions (PR #4898)
✓ infra/nixos/modules/{common,k3s-server,k3s-agent,gpu}.nix (PR #4898)
✓ infra/nixos/hosts/installer/configuration.nix (PR #4897)
✓ infra/nixos/hosts/control-plane/ (PR #4899)
✓ infra/nixos/hosts/worker-gpu-01/ (PR #4899)
✓ infra/nixos/hosts/worker-gpu-02/ (PR #4899)
✓ infra/k8s/bootstrap/{argocd-namespace,argocd-install,initial-orleans}.yaml (PR #4900)
✓ infra/k8s/applications/root-application.yaml (PR #4900)
✓ infra/k8s/applications/orleans/{Application,deployment,service,rbac,configmap}.yaml (PR #4900)
✓ infra/k8s/applications/{gitlab,argoworkflows,argorollouts}/Application.yaml (PR #4900)
✓ infra/README.md (this PR)
⨯ scripts/build-usb.sh (skipped, Rule 0)

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): address Copilot P0/P1 review on PR #4898

7 unresolved threads from Copilot: fixed the real issues, marking the rest as outdated.

P0 fixes:

k3s-server.nix manifests path (line 55): `../../../k8s/` resolves to `<repo-root>/k8s/` (one level above `infra/`), which doesn't exist. Module at `infra/nixos/modules/` needs `../../k8s/` to reach `infra/k8s/`. Fixed all 3 references (argocd-namespace, argocd-install, root-application).
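
To make the path arithmetic concrete: Nix resolves a relative path literal against the directory of the file that contains it. A sketch of the corrected references, assuming the module sits at `infra/nixos/modules/k3s-server.nix` and a nixpkgs release that provides `services.k3s.manifests`, might read:

```nix
# Sketch only: relative paths as seen from infra/nixos/modules/k3s-server.nix.
# ../..    -> infra/            (what the module needs)
# ../../.. -> <repo-root>/      (one level too high; there is no k8s/ there)
{
  services.k3s.manifests = {
    argocd-namespace.source = ../../k8s/bootstrap/argocd-namespace.yaml;
    argocd-install.source = ../../k8s/bootstrap/argocd-install.yaml;
    root-application.source = ../../k8s/applications/root-application.yaml;
  };
}
```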
k3s-server.nix clusterInit (line 31): unconditional `true` is wrong for multi-server HA: only the first control-plane node should set clusterInit; additional servers join via serverAddr. Changed to `lib.mkDefault true` and documented the per-host override pattern for HA expansion.

P1 fixes:

k3s-server.nix kubeconfig mode (line 40): `--write-kubeconfig-mode=0644` makes the admin kubeconfig world-readable, leaking cluster-admin creds to any unprivileged user on the control-plane node. Changed to 0640 + `--write-kubeconfig-group=wheel` so the wheel group can use kubectl without sudo, but other users can't read it.

flake.nix supportedSystems / installer-iso (line 75): the installer NixOS config is x86_64-linux only, but the package was published on all `supportedSystems` including aarch64-linux, which would fail evaluation. Gated with `nixpkgs.lib.optionalAttrs (system == "x86_64-linux")` so `packages.aarch64-linux` is empty and `packages.x86_64-linux.installer-iso` resolves cleanly. devShell + formatter remain on all systems.

Marked outdated (no fix needed):

flake.nix line 20 (Copilot reviewed before PR #4900 landed): k8s/ directory now exists post-stack-merge.

flake.nix line 80 (Copilot reviewed before PR #4899 landed): "Future hosts land in PR 2" comment removed by per-host PR.

Not fixed in this commit:

flake.lock (Copilot P1 line 5): requires `nix flake update` on a machine with Nix installed; not present on the autonomous agent's workstation. First maintainer with Nix runs the update and commits the resulting lock file as a follow-up; that commit is byte-stable and reviewable in isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): address 2nd Copilot review wave (P0 port, README, comments)

Second batch of Copilot review (8 new threads in addition to the 7 from the first wave already fixed in the prior commit).

P0 fix:

k3s-server.nix firewall (line 77): missing port 9345/TCP, the K3S supervisor/registration port. Without it, agents cannot complete the join handshake and additional server nodes can't participate in HA. Added with explanatory comment.

P1 fixes:

infra/README.md serverAddr scheme (line 92): documented as `control-plane.zeta.local:6443` without `https://`. NixOS `services.k3s.serverAddr` requires the scheme. Updated to show the full URL and named the validation constraint.

infra/README.md secrets section (line 125): only documented the server-role token path. Agent-role token path (/var/lib/rancher/k3s/agent/token) was missing. Added both, plus the openssl generation one-liner and the "same value to all nodes" requirement.

infra/k8s/bootstrap/initial-orleans.yaml header (line 6): contained named attribution + anthropomorphic content. Per codebase convention for current-state infra manifests, rewritten as factual scope description.

P2 fixes:

infra/nixos/modules/gpu.nix open-modules comment (line 51): said "Open-source kernel modules, works on RTX 20-series and newer" but default was `false` (proprietary). Comment now matches the default (proprietary chosen for hardware compatibility) and names the per-host override for newer-only nodes.

infra/k8s/bootstrap/argocd-install.yaml ArgoCD version comment (line 25): conflated ArgoCD-version pinning with targetRevision-of-tracked-Git-ref. Rewrote to separate the two upgrade vectors clearly.

Already-resolved-by-prior-commit (no fix needed in this commit):

k3s-server.nix line 49 P0 (write-kubeconfig-mode 0644): same finding as line 54 from the first wave; the prior commit already changed to 0640 + group=wheel.

flake.nix line 53 P1 (aarch64-linux): same finding as the line 75 thread from the first wave; the prior commit already gated installer-iso to x86_64-linux only.
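
Pulling together the k3s-server.nix changes described across these two review waves, the resulting module could plausibly resemble the sketch below. It is assembled from the commit text rather than copied from the file, and it passes `extraFlags` as a single string via `toString`, which is an assumption made for compatibility with older option types.

```nix
# Sketch only: server-role settings as described in the fix commits.
{ lib, ... }:
{
  services.k3s = {
    enable = true;
    role = "server";
    # mkDefault lets additional HA servers override this per host.
    clusterInit = lib.mkDefault true;
    extraFlags = toString [
      "--disable=servicelb"
      "--disable=traefik"
      # Readable by root and the wheel group only.
      "--write-kubeconfig-mode=0640"
      "--write-kubeconfig-group=wheel"
    ];
  };

  networking.firewall = {
    # 9345 is listed because the commit adds it; confirm against the
    # K3s networking requirements for the version in use.
    allowedTCPPorts = [ 6443 9345 10250 2379 2380 ];
    allowedUDPPorts = [ 8472 ]; # flannel VXLAN
  };
}
```

A second server joining an existing cluster would then set `services.k3s.clusterInit = false;` and a `serverAddr` in its own host config.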
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): markdownlint MD007 - nested list indent in README

CI surfaced two MD007/ul-indent errors at infra/README.md lines 126-127: nested unordered list items used 4-space indent (expected 2-space). The nested list was the per-role K3S token path enumeration added in the prior 2nd-wave fix commit (PR #4898 Copilot review thread PRRT_kwDOSF9kNM6EcQFw). Restructured to fix the indent and add a blank line between the nested list and the continuation paragraph so markdownlint sees the structure cleanly.

Verified locally: `npx markdownlint-cli2 infra/README.md` now returns 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): 3rd Copilot review wave (manifest, .example refs, framing)

7 new threads after the prior fix-batch CI pass.

P1 fixes:

k3s-server.nix manifests + README claim: README documented K3S applying initial-orleans.yaml, but it wasn't in the manifests list. ADD initial-orleans.source to manifests (the right fix: Orleans namespace + RBAC + skeleton StatefulSet should be bootstrapped alongside ArgoCD). Updated the surrounding comment to reflect the 4-manifest seed (argocd-namespace, argocd-install, initial-orleans, root-application).

k3s-server.nix servicelb/traefik comment (line 52): said "ArgoCD will install MetalLB + ingress-nginx as Applications" but no such Applications exist under infra/k8s/applications/. Reworded to name the bootstrap-period gap (LoadBalancer Services stay Pending; use NodePort or host-network during bootstrap).

control-plane/README.md (line 58): referenced hardware-configuration.nix.example which was removed in the fix-up that gave each host a real placeholder. Replaced with current-state description (placeholder content + generator command to replace it on real install).

worker-gpu-01/configuration.nix import comment (line 12): same hardware-configuration.nix.example reference. Updated to match current placeholder + generator command pattern.

P2 fix:

infra/README.md "The framing" section (line 148): contained named attribution ("Per Addison's spec") in a current-state infra doc. Per codebase convention for current-state surfaces (vs. history/roster surfaces which exempt attribution), reworded as factual design statement; kept the substrate intent (declarative desired state, drift reconciliation, single source of truth) without the attribution.

Resolved as outdated (no fix needed):

flake.nix line 90 (Copilot reading the PR DESCRIPTION which said per-host configs land in PR 3; the stack collapsed and this PR contains them; description is historical and stale, the code is correct).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Adds `infra/nixos/hosts/installer/configuration.nix`: one file that declares every package needed on the bootable USB installer image for the NixOS-based AI cluster bootstrap.

Addison (19, working with Aaron) asked for a single Git-tracked file containing every package the USB stick needs. This is that file.

What's on the stick (~70 packages, organized by install-time role)

| Role | Packages |
|---|---|
| Version control | `git`, `git-lfs`, `gnupg`, `openssh` |
| Editors | `vim`, `neovim`, `nano` |
| Shell QoL | `tmux`, `htop`, `ripgrep`, `jq`, `yq-go`, `fzf`, `bat`, `eza`, ... |
| Network | `curl`, `nmap`, `networkmanager`, `iwd`, `wireguard-tools`, ... |
| Disk | `parted`, `gptfdisk`, `cryptsetup`, `zfs`, `lvm2`, `mdadm`, `smartmontools` |
| Hardware inspection | `lshw`, `dmidecode`, `nvme-cli`, `lm_sensors`, ... |
| GPU detection | `glxinfo`, `vulkan-tools`, `clinfo` |
| NixOS install tooling | `nixos-install-tools`, `nom`, `nvd`, `nh` |
| Kubernetes clients | `kubectl`, `helm`, `k9s`, `argocd`, `k3s` binary |
| Secrets | `age`, `sops`, `ssh-to-age` |
| Build helpers | `gcc`, `gnumake`, `pkg-config`, coreutils, ... |
| Observability | `iotop`, `iftop`, `ncdu`, `pv` |
| Documentation | `man-pages`, `tldr` |

What's NOT on the stick

K3S / ArgoCD / Orleans / GitLab / Argo Workflows / Argo Rollouts runtime is deliberately not baked into the ISO. Those land on the target machine via `nixos-install --flake .#<host>` pulling from this same Git repo. The stick is one-shot ignition; the flake-in-Git is the strange attractor that draws desired state.

Only the Kubernetes/GitOps CLIs (`kubectl`, `helm`, `argocd`, `k9s`) ship so you can talk to a freshly-installed control plane from the live USB before reboot.

How it's built
The flake at the repo root (next file, gated on Addison) wires:
```nix
nixosConfigurations.installer = nixpkgs.lib.nixosSystem {
  # The installer config targets x86_64-linux only.
  system = "x86_64-linux";
  modules = [ ./infra/nixos/hosts/installer/configuration.nix ];
};
```
Then:
```bash
nix build .#nixosConfigurations.installer.config.system.build.isoImage
# /dev/sdX is a placeholder for the USB device; writing to it requires root.
sudo dd if=result/iso/zeta-installer-*.iso of=/dev/sdX bs=4M status=progress conv=fsync
```
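
For context, a minimal flake that carries this wiring and a shorter `installer-iso` alias could look like the sketch below. It is an illustration under assumptions (the nixpkgs channel, the `forAllSystems` helper name, and the system list are all hypothetical), not the repository's flake.nix.

```nix
# Sketch only: a minimal flake.nix exposing the installer and an ISO alias.
{
  description = "Illustrative installer wiring (not the repo's flake.nix)";

  # Assumed channel; pin whichever nixpkgs branch the cluster actually uses.
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }:
    let
      supportedSystems = [ "x86_64-linux" "aarch64-linux" ];
      # Hypothetical helper: builds an attrset keyed by system name.
      forAllSystems = nixpkgs.lib.genAttrs supportedSystems;
    in
    {
      nixosConfigurations.installer = nixpkgs.lib.nixosSystem {
        system = "x86_64-linux";
        modules = [ ./infra/nixos/hosts/installer/configuration.nix ];
      };

      # Publish the alias only where the installer can actually evaluate.
      packages = forAllSystems (system:
        nixpkgs.lib.optionalAttrs (system == "x86_64-linux") {
          installer-iso =
            self.nixosConfigurations.installer.config.system.build.isoImage;
        });
    };
}
```

With an alias like this in place, `nix build .#installer-iso` would be equivalent to the longer attribute path shown above.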
Pre-staged runbook on the ISO

`environment.etc."zeta/README.md"` bakes the install instructions onto the stick itself, so the runbook is reachable offline once booted.

Test plan

- `nix flake check` once `flake.nix` lands wiring `nixosConfigurations.installer`
- `nix build .#nixosConfigurations.installer.config.system.build.isoImage` produces an ISO
- `nixos-install --flake /etc/zeta#<host>` works against a per-host config

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>