feat(ai-cluster): cookie-cutter node provisioning via disko + Longhorn multi-disk #4950
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,134 @@ | ||
| # Provisioning a new node — cookie-cutter workflow | ||
|
|
||
| End-to-end: physical box arrives → boots into running cluster | ||
| member with replicated Longhorn capacity. Six values to change | ||
| per box, no hand-partitioning, no shell scripts. | ||
|
|
||
| ## What you need | ||
|
|
||
| - A NixOS installer USB built from this repo (`nix build .#installer-iso`) | ||
| - The new box wired to the cluster network with internet access | ||
| - The maintainer's public SSH key | ||
| - A few minutes to read off two disk serial numbers | ||
|
|
||
| ## Step 1: copy the template | ||
|
|
||
| ```bash | ||
| HOST=worker-gpu-03 # pick the next free number | ||
| cp -r full-ai-cluster/nixos/hosts/worker-template \ | ||
| full-ai-cluster/nixos/hosts/$HOST | ||
| ``` | ||
|
|
||
| ## Step 2: change the six placeholder values | ||
|
|
||
| Open `full-ai-cluster/nixos/hosts/$HOST/default.nix` and edit | ||
| each of the six clearly-marked PLACEHOLDER blocks: | ||
|
|
||
| | What | Where to get it | | ||
| |------|-----------------| | ||
| | `networking.hostName` | the name you chose above (`worker-gpu-03`) | | ||
| | `networking.hostId` | `head -c4 /dev/urandom \| od -A n -t x4 \| tr -d ' '` | | ||
| | `zeta.disko.nvme0` | On the live system: `ls -l /dev/disk/by-id/ \| grep nvme \| awk '{print $9, $11}'` — pick the disk you want to BE the boot disk (gets OS + first Longhorn data path) | | ||
| | `zeta.disko.nvme1` | Same listing, the other NVMe (becomes pure Longhorn data) | | ||
| | Network config | Static IP block if you don't use DHCP | | ||
| | `users.users.zeta.openssh.authorizedKeys.keys` | The maintainer's public SSH key | | ||
|
|
||
| ## Step 3: wire into the flake | ||
|
|
||
| Open `full-ai-cluster/flake.nix`, add an entry mirroring | ||
| `worker-template`: | ||
|
|
||
| ```nix | ||
| "worker-gpu-03" = mkSystem { | ||
| modules = [ | ||
| ./nixos/hosts/worker-gpu-03/default.nix | ||
| ]; | ||
| }; | ||
| ``` | ||
|
|
||
| Commit + push to main so the install reads from a real ref. | ||
|
|
||
| ## Step 4: boot the box on the USB | ||
|
|
||
| UEFI boot order → USB first. Network up via `nmtui` if not DHCP. | ||
|
|
||
| ```bash | ||
| # Clone Zeta to the live system's writable scratch | ||
| sudo git clone https://github.com/Lucent-Financial-Group/Zeta /mnt/etc/zeta | ||
| cd /mnt/etc/zeta/full-ai-cluster | ||
| ``` | ||
|
|
||
| ## Step 5: disko + nixos-install (the actual cookie-cutter install) | ||
|
|
||
| ```bash | ||
| # Step 5a — disko wipes + partitions + formats + mounts both disks | ||
| sudo disko --mode disko --flake .#worker-gpu-03 | ||
|
|
||
| # Step 5b — install NixOS onto the mounted layout | ||
| sudo nixos-install --flake .#worker-gpu-03 --no-root-password | ||
|
|
||
| # Step 5c — reboot. Box joins cluster on first boot. | ||
| sudo reboot | ||
| ``` | ||
|
|
||
| That's it. Subsequent boxes: repeat steps 1-5 with new placeholder | ||
| values. Each provision is ~10 minutes wall-clock, ~6 lines of | ||
| human edits, zero hand-partitioning. | ||
|
|
||
| ## What happens after first boot | ||
|
|
||
| 1. systemd-boot → kernel → NixOS userland (~30s) | ||
| 2. K3S agent service starts → contacts `control-plane.zeta.local:6443` | ||
| 3. Cluster admits the node → kubelet reports both `/var/lib/longhorn-disk1` | ||
| and `/var/lib/longhorn-disk2` as filesystem entries | ||
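| 4. Longhorn DaemonSet pod schedules. The intended flow is that a job reads | ||
| `/etc/longhorn/node-disks.yaml` and patches the Longhorn Node CR to add | ||
| both data paths, but this commit ships no such job | ||
| (`nixos/modules/longhorn-disks.nix` still marks it TODO/manual), so add | ||
| the data paths by hand and confirm them with the check below | ||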
| 5. Longhorn rebalancer notices the new capacity → starts placing | ||
| replicas of existing volumes onto this node | ||
| 6. ArgoCD reconciles any node-affinity workloads that target this | ||
| node's labels | ||
|
|
||
| Check it landed: | ||
|
|
||
| ```bash | ||
| kubectl get nodes -o wide | ||
| kubectl -n longhorn-system get nodes.longhorn.io worker-gpu-03 -o yaml | grep -A20 disks: | ||
| ``` | ||
|
|
||
| ## Disk failure recovery | ||
|
|
||
| NVMe dies → Longhorn marks the data path Unavailable → the cluster's | ||
| other replicas (default replica count 3 means 2 healthy copies | ||
| remain) keep serving the volumes → no app-visible interruption. | ||
|
|
||
| Replace the dead drive, then either: | ||
|
|
||
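| - **Hot path** (replacement drive shows up under the same by-id path): | ||
| recreate the partition layout on the fresh drive with disko, reboot, | ||
| Longhorn re-registers the data path, replicas rebuild from peers. | ||
| by-id names normally embed the drive serial, so even an identical | ||
| model usually lands on the slow path. | ||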
| - **Slow path** (drive serial changed): update the `zeta.disko.nvme0` | ||
| or `nvme1` by-id symlink in `nixos/hosts/<host>/default.nix`, | ||
| `nixos-rebuild switch --flake .#<host> --target-host <host>` from | ||
| any admin machine, then rebuild as above. | ||
|
|
||
| OS itself: the `/` partition lives on `nvme0` only, so an `nvme1` | ||
| failure leaves the node fully bootable while Longhorn capacity | ||
| drops by roughly half until repair. An `nvme0` failure takes the OS | ||
| down — reinstall via Step 5 onto the replacement disk; Longhorn | ||
| data on `nvme1` is re-imported when the rebuilt node rejoins. | ||
|
|
||
| ## Multi-shape support | ||
|
|
||
| `disko-shapes/2nvme.nix` is the shape for the current hardware. | ||
| Adding a new hardware class (e.g. 4 NVMes, or NVMe + SATA SSD mix) | ||
| means: | ||
|
|
||
| 1. Author `disko-shapes/<new-shape>.nix` matching the | ||
| `zeta.disko` options pattern | ||
| 2. Author a new host template under `hosts/<new-class>-template/` | ||
| that imports it | ||
| 3. Cookie-cutter from THAT template for boxes of the new class | ||
|
|
||
| The Longhorn module (`modules/longhorn-disks.nix`) is shape-agnostic: | ||
| it takes a list of mount paths and wires them, no | ||
| matter how many disks contributed those mounts. | ||
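
It is easy to miss one of the six Step 2 edits. As a minimal sketch, assuming the copied host file still carries the template's literal placeholder strings (`worker-template`, `00000000`, `REPLACE_ME`), a pre-commit check might look like this. `check_placeholders` is a hypothetical helper, not something this PR ships.

```bash
# Hypothetical helper, not part of this PR: report template values that were
# left unchanged in a copied host file. The body runs in a subshell so its
# variables do not leak into the caller.
check_placeholders() (
  file=$1
  status=0
  if grep -q 'REPLACE_ME' "$file"; then
    echo "disk by-id path still set to a REPLACE_ME placeholder"
    status=1
  fi
  if grep -q 'hostName = "worker-template"' "$file"; then
    echo "networking.hostName still set to worker-template"
    status=1
  fi
  if grep -q 'hostId = "00000000"' "$file"; then
    echo "networking.hostId still set to 00000000"
    status=1
  fi
  # The template ships the key list with only a commented-out example,
  # so require at least one uncommented line that starts with "ssh-.
  if ! grep -Eq '^[[:space:]]*"ssh-' "$file"; then
    echo "no uncommented SSH public key found"
    status=1
  fi
  exit "$status"
)
```

Running `check_placeholders full-ai-cluster/nixos/hosts/$HOST/default.nix` before the Step 3 commit prints one line per leftover placeholder and returns non-zero if any remain. It does not check the optional static IP block, since DHCP hosts leave that commented out.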
101 changes: 101 additions & 0 deletions
full-ai-cluster/nixos/hosts/worker-template/default.nix
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,101 @@ | ||
| # full-ai-cluster/nixos/hosts/worker-template/default.nix | ||
| # | ||
| # Cookie-cutter worker node config. Adding a new identical box: | ||
| # | ||
| # 1. cp -r nixos/hosts/worker-template nixos/hosts/worker-gpu-NN | ||
| # 2. Edit the new file — change SIX placeholder values: | ||
| # - networking.hostName (line ~30) | ||
| # - networking.hostId (line ~32; new random 8-hex) | ||
| # - networking.interfaces (per-host MAC / static IP) | ||
| # - zeta.disko.nvme0 (per-host /dev/disk/by-id) | ||
| # - zeta.disko.nvme1 (per-host /dev/disk/by-id) | ||
| # - users.users.zeta.openssh.authorizedKeys (maintainer key) | ||
| # 3. Add `worker-gpu-NN` to flake.nix nixosConfigurations | ||
| # 4. Boot the box on the installer USB, then: | ||
| # nix run github:nix-community/disko -- \ | ||
| # --mode disko \ | ||
| # --flake /mnt/etc/zeta/full-ai-cluster#worker-gpu-NN | ||
| # nixos-install --flake /mnt/etc/zeta/full-ai-cluster#worker-gpu-NN | ||
| # 5. Reboot. Node joins cluster, Longhorn picks up both disks, | ||
| # ArgoCD reconciles workloads. | ||
| # | ||
| # Hardware shape: x86_64, UEFI, 2 NVMes (any size, same shape), | ||
| # 1+ NVIDIA GPU. For AMD-only or Intel-only GPU nodes change the | ||
| # `zeta.gpu-device-plugin.vendors` setting; for non-GPU workers | ||
| # drop the GPU imports entirely. | ||
|
|
||
| { config, pkgs, lib, inputs, ... }: | ||
|
|
||
| { | ||
| imports = [ | ||
| # Declarative disk layout — disko shapes the partitions, | ||
| # longhorn-disks wires the mounts to Longhorn data paths. | ||
| inputs.disko.nixosModules.disko | ||
| ../../modules/disko-shapes/2nvme.nix | ||
| ../../modules/longhorn-disks.nix | ||
|
|
||
| # Cluster role + hardware-class modules. | ||
| ../../modules/common.nix | ||
| ../../modules/k3s-agent.nix | ||
| ../../modules/gpu.nix | ||
| ../../modules/gpu-device-plugin.nix | ||
| ../../modules/gpu-passthrough.nix | ||
| ../../modules/docker.nix | ||
| ../../modules/local-storage.nix | ||
| ]; | ||
|
|
||
| # ── PLACEHOLDER: change per-host ───────────────────────────── | ||
| networking.hostName = "worker-template"; | ||
| networking.hostId = "00000000"; # `head -c4 /dev/urandom | od -A n -t x4 | tr -d ' '` | ||
| # ───────────────────────────────────────────────────────────── | ||
|
|
||
| # ── PLACEHOLDER: change per-host (disk IDs) ────────────────── | ||
| # On the live system, run: ls -l /dev/disk/by-id/ | grep nvme | ||
| zeta.disko = { | ||
| nvme0 = "/dev/disk/by-id/nvme-REPLACE_ME_BOOT_DISK"; | ||
| nvme1 = "/dev/disk/by-id/nvme-REPLACE_ME_LONGHORN_DISK"; | ||
| # rootSize = "256G"; # default; override if needed | ||
| }; | ||
| # ───────────────────────────────────────────────────────────── | ||
|
|
||
| # ── PLACEHOLDER: per-host static IP if not using DHCP ──────── | ||
| # networking.useDHCP = false; | ||
| # networking.interfaces.eno1.ipv4.addresses = [{ | ||
| # address = "10.0.0.21"; | ||
| # prefixLength = 24; | ||
| # }]; | ||
| # networking.defaultGateway = "10.0.0.1"; | ||
| # networking.nameservers = [ "10.0.0.1" "1.1.1.1" ]; | ||
| # ───────────────────────────────────────────────────────────── | ||
|
|
||
| # K3S join target — same for every worker in the cluster. | ||
| services.k3s.serverAddr = "https://control-plane.zeta.local:6443"; | ||
|
|
||
| # GPU device plugin vendor mix. Override per-host if AMD or Intel. | ||
| zeta.gpu-device-plugin = { | ||
| enable = true; | ||
| vendors = [ "nvidia" ]; | ||
| }; | ||
|
|
||
| # VFIO passthrough off by default; enable per-host with PCI IDs. | ||
| zeta.gpu-passthrough = { | ||
| enable = false; | ||
| pciIds = [ ]; | ||
| }; | ||
|
|
||
| # Node labels — uncomment + customize per hardware spec so the | ||
| # scheduler can target nodes by GPU model / count. | ||
| services.k3s.extraFlags = lib.mkAfter [ | ||
| # "--node-label=zeta.io/gpu-model=rtx-4090" | ||
| # "--node-label=zeta.io/gpu-count=2" | ||
| # "--node-label=zeta.io/dram-gb=128" | ||
| ]; | ||
|
|
||
| # ── PLACEHOLDER: maintainer SSH keys ───────────────────────── | ||
| users.users.zeta.openssh.authorizedKeys.keys = [ | ||
| # "ssh-ed25519 AAAAC3Nz... aaron@zeta" | ||
| ]; | ||
| # ───────────────────────────────────────────────────────────── | ||
|
|
||
| system.stateVersion = "24.11"; | ||
| } |
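
The `networking.hostId` placeholder needs a fresh random value per box. The sketch below wraps the pipeline from the template comment and checks that the result is eight lowercase hex digits, which is the format the NixOS option is understood to require (treat that as an assumption and confirm against the option docs). `gen_host_id` and `is_valid_host_id` are hypothetical names.

```bash
# Generate a host ID with the pipeline from the template comment; tr also
# strips the trailing newline so the value can be pasted directly.
gen_host_id() {
  head -c4 /dev/urandom | od -A n -t x4 | tr -d ' \n'
}

# Accept exactly eight lowercase hex digits, and reject the template's
# all-zero placeholder so a forgotten edit is caught.
is_valid_host_id() {
  h='[0123456789abcdef]'
  case $1 in
    00000000) return 1 ;;
    $h$h$h$h$h$h$h$h) return 0 ;;
    *) return 1 ;;
  esac
}
```

A typical use is `id=$(gen_host_id) && is_valid_host_id "$id" && echo "$id"`, then pasting the printed value into the host file.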
This step states that a Longhorn DaemonSet will read `/etc/longhorn/node-disks.yaml` and patch Node CRs automatically, but this commit does not add any manifest or script under `full-ai-cluster/k8s/` that consumes that file (and `nixos/modules/longhorn-disks.nix` still documents the patch job as TODO/manual). As written, operators can complete the runbook believing both data disks are active while Longhorn still uses default disk config, which can misreport usable capacity and scheduling behavior immediately after node bring-up.
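
Until the Node CR patch job exists, the gap described above can at least be detected after bring-up. This is a minimal sketch. It assumes each disk in the Longhorn Node CR YAML carries a `path:` key, and that the expected mounts are the `/var/lib/longhorn-disk1` and `/var/lib/longhorn-disk2` paths named in the runbook. `registered_paths` and `missing_disk_paths` are hypothetical helpers.

```bash
# Extract every `path:` value from Node CR YAML on stdin.
registered_paths() {
  awk '$1 == "path:" { print $2 }'
}

# Print each expected path (given as arguments) that is absent from the
# newline-separated list on stdin. Runs in a subshell to keep variables local.
missing_disk_paths() (
  registered=$(cat)
  for expected in "$@"; do
    printf '%s\n' "$registered" | grep -qxF -- "$expected" || echo "$expected"
  done
)

# Example wiring (needs cluster access; node name taken from the runbook):
# kubectl -n longhorn-system get nodes.longhorn.io worker-gpu-03 -o yaml \
#   | registered_paths \
#   | missing_disk_paths /var/lib/longhorn-disk1 /var/lib/longhorn-disk2
```

Empty output means both data paths are registered. Any printed path is one Longhorn does not yet know about, which is the state the comment above warns operators could otherwise miss.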