134 changes: 134 additions & 0 deletions full-ai-cluster/PROVISIONING.md
@@ -0,0 +1,134 @@
# Provisioning a new node — cookie-cutter workflow

End-to-end: physical box arrives → boots into running cluster
member with replicated Longhorn capacity. Six values to change
per box, no hand-partitioning, no shell scripts.

## What you need

- A NixOS installer USB built from this repo (`nix build .#installer-iso`)
- The new box wired to the cluster network with internet access
- The maintainer's public SSH key
- A few minutes to read off two disk serial numbers

## Step 1: copy the template

```bash
HOST=worker-gpu-03 # pick the next free number
cp -r full-ai-cluster/nixos/hosts/worker-template \
      full-ai-cluster/nixos/hosts/$HOST
```

## Step 2: change the six placeholder values

Open `full-ai-cluster/nixos/hosts/$HOST/default.nix` and edit
each of the six clearly-marked PLACEHOLDER blocks:

| What | Where to get it |
|------|-----------------|
| `networking.hostName` | the name you chose above (`worker-gpu-03`) |
| `networking.hostId` | `head -c4 /dev/urandom \| od -A n -t x4 \| tr -d ' '` |
| `zeta.disko.nvme0` | On the live system: `ls -l /dev/disk/by-id/ \| grep nvme \| awk '{print $9, $11}'` — pick the disk you want to BE the boot disk (gets OS + first Longhorn data path) |
| `zeta.disko.nvme1` | Same listing, the other NVMe (becomes pure Longhorn data) |
| Network config | Static IP block if you don't use DHCP |
| `users.users.zeta.openssh.authorizedKeys.keys` | Maintainer key |
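
The two lookups in the table can be wrapped in small helpers so they are easier to repeat per box. This is a minimal sketch, not part of the repo: the function names `gen_host_id` and `list_nvme_ids` are hypothetical, and filtering out `-partN` and `nvme-eui.*` links is an assumed heuristic for reducing the listing to one line per physical disk.

```bash
# Generate a random 8-hex-digit value for networking.hostId
# (same pipeline as the table, with the trailing newline stripped too).
gen_host_id() {
  head -c4 /dev/urandom | od -A n -t x4 | tr -d ' \n'
}

# List whole-disk NVMe links under a by-id directory.
# $1 is the directory to scan; defaults to /dev/disk/by-id.
# Assumption: partition links (-partN) and nvme-eui.* aliases are noise here.
list_nvme_ids() {
  dir="${1:-/dev/disk/by-id}"
  for link in "$dir"/nvme-*; do
    # Skip the literal pattern when nothing matches the glob.
    [ -e "$link" ] || [ -L "$link" ] || continue
    name=$(basename "$link")
    case "$name" in
      *-part[0-9]*|nvme-eui.*) continue ;;
    esac
    printf '%s -> %s\n' "$link" "$(readlink "$link")"
  done
}
```

On the live system, `gen_host_id` prints a value to paste into `networking.hostId`, and `list_nvme_ids` prints the candidate paths for `zeta.disko.nvme0` and `zeta.disko.nvme1`.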

## Step 3: wire into the flake

Open `full-ai-cluster/flake.nix`, add an entry mirroring
`worker-template`:

```nix
"worker-gpu-03" = mkSystem {
  modules = [
    ./nixos/hosts/worker-gpu-03/default.nix
  ];
};
```

Commit + push to main so the install reads from a real ref.
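
Before committing, it may help to confirm that no template defaults survived the edit. The sketch below is an assumption about how such a check could look, not something the repo ships: `check_placeholders` is a hypothetical name, and the patterns are taken from the defaults visible in `worker-template/default.nix`. It does not detect an empty `authorizedKeys.keys` list or a missing static IP block.

```bash
# Fail (and print the offending lines) if the host file still contains
# template defaults: REPLACE_ME disk IDs, the all-zero hostId, or the
# template hostname.
check_placeholders() {
  file="$1"
  if grep -n -E 'REPLACE_ME|hostId = "00000000"|hostName = "worker-template"' "$file"; then
    echo "error: placeholder values remain in $file" >&2
    return 1
  fi
  return 0
}
```

Typical use from the repo root would be `check_placeholders full-ai-cluster/nixos/hosts/$HOST/default.nix && git commit`.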

## Step 4: boot the box on the USB

UEFI boot order → USB first. Network up via `nmtui` if not DHCP.

```bash
# Clone Zeta to the live system's writable scratch
sudo git clone https://github.com/Lucent-Financial-Group/Zeta /mnt/etc/zeta
cd /mnt/etc/zeta/full-ai-cluster
```

## Step 5: disko + nixos-install (the actual cookie-cutter install)

```bash
# Step 5a — disko wipes + partitions + formats + mounts both disks
sudo disko --mode disko --flake .#worker-gpu-03

# Step 5b — install NixOS onto the mounted layout
sudo nixos-install --flake .#worker-gpu-03 --no-root-password

# Step 5c — reboot. Box joins cluster on first boot.
sudo reboot
```

That's it. Subsequent boxes: repeat steps 1-5 with new placeholder
values. Each provision is ~10 minutes wall-clock, ~6 lines of
human edits, zero hand-partitioning.
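
Because Step 5a is destructive, a pre-flight check on the two disk paths can catch a typo before anything is wiped. This is a hedged sketch: `preflight_disks` is a hypothetical helper, and it only verifies that both paths exist and resolve to different targets, not that they are the disks you intended.

```bash
# Succeed only if both paths exist and resolve to different devices.
# $1 = value of zeta.disko.nvme0, $2 = value of zeta.disko.nvme1.
preflight_disks() {
  a="$1"
  b="$2"
  for p in "$a" "$b"; do
    if [ ! -e "$p" ]; then
      echo "error: $p does not exist" >&2
      return 1
    fi
  done
  # readlink -f follows the by-id symlink to the underlying device node.
  if [ "$(readlink -f "$a")" = "$(readlink -f "$b")" ]; then
    echo "error: both paths point at the same device" >&2
    return 1
  fi
  echo "ok: $a and $b are distinct"
}
```

Run it on the live system with the two by-id paths from the host file, and continue to disko only if it reports `ok`.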

## What happens after first boot

1. systemd-boot → kernel → NixOS userland (~30s)
2. K3S agent service starts → contacts `control-plane.zeta.local:6443`
3. Cluster admits the node → kubelet reports both `/var/lib/longhorn-disk1`
and `/var/lib/longhorn-disk2` as filesystem entries
4. Longhorn DaemonSet pod schedules → reads `/etc/longhorn/node-disks.yaml`
→ patches the Longhorn Node CR to add both data paths
Comment on lines +84 to +85

P1: Remove unsupported auto-patch step from provisioning workflow

This step states that a Longhorn DaemonSet will read /etc/longhorn/node-disks.yaml and patch Node CRs automatically, but this commit does not add any manifest or script under full-ai-cluster/k8s/ that consumes that file (and nixos/modules/longhorn-disks.nix still documents the patch job as TODO/manual). As written, operators can complete the runbook believing both data disks are active while Longhorn still uses default disk config, which can misreport usable capacity and scheduling behavior immediately after node bring-up.


5. Longhorn rebalancer notices the new capacity → starts placing
replicas of existing volumes onto this node
6. ArgoCD reconciles any node-affinity workloads that target this
node's labels

Check it landed:

```bash
kubectl get nodes -o wide
kubectl -n longhorn-system get nodes.longhorn.io worker-gpu-03 -o yaml | grep -A20 disks:
```
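
Since the review comment on this file notes that the automatic Node CR patch is not yet implemented, it is worth confirming that both data paths are actually registered rather than eyeballing the YAML. The helper below is a sketch under assumptions: `count_longhorn_paths` is a hypothetical name, and it assumes each disk appears in the Node CR as a `path:` line pointing at `/var/lib/longhorn-diskN`.

```bash
# Count the distinct /var/lib/longhorn-diskN paths in Longhorn Node CR
# YAML read from stdin. Prints a single number.
count_longhorn_paths() {
  grep -E '^[[:space:]]*path:[[:space:]]*/var/lib/longhorn-disk[0-9]+' \
    | sed -E 's/^[[:space:]]*path:[[:space:]]*//' \
    | sort -u \
    | wc -l \
    | tr -d ' '
}
```

Pipe the second command above into it, for example `kubectl -n longhorn-system get nodes.longhorn.io worker-gpu-03 -o yaml | count_longhorn_paths`. Anything below 2 suggests Longhorn is still on its default disk config and the paths need to be added manually.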

## Disk failure recovery

NVMe dies → Longhorn marks the data path Unavailable → the cluster's
other replicas (default replica count 3 means 2 healthy copies
remain) keep serving the volumes → no app-visible interruption.

Replace the dead drive, then either:

- **Hot path** (drive replaced with identical model + position):
reboot, disko recreates the partition table on the fresh drive,
Longhorn re-registers the data path, replicas rebuild from peers.
- **Slow path** (drive serial changed): update the `zeta.disko.nvme0`
or `nvme1` by-id path in `nixos/hosts/<host>/default.nix`, run
`nixos-rebuild switch --flake .#<host> --target-host <host>` from
any admin machine, then let replicas rebuild as in the hot path.

OS itself: the `/` partition lives on `nvme0` only, so an `nvme1`
failure leaves the node fully bootable + Longhorn capacity
degrades by half until repair. An `nvme0` failure takes the OS
down; reinstall via Step 5 onto the replacement disk. Step 5a as
written reformats both disks, so expect the `nvme1` replicas to
rebuild from peers once the node rejoins.
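
To decide which recovery path applies, you can compare the disk paths in the host file against what the node currently sees. A minimal sketch, assuming the `nvme0 = "...";` and `nvme1 = "...";` assignments keep the one-per-line form used in the template; `report_missing_disks` is a hypothetical helper and must run on the affected node, where the by-id links exist.

```bash
# Print "present" or "MISSING" for each nvme0/nvme1 path configured
# in a host's default.nix. $1 = path to the host file.
report_missing_disks() {
  file="$1"
  # Extract lines of the form:  nvme0 = "/dev/disk/by-id/...";
  sed -n -E 's/^[[:space:]]*(nvme[01])[[:space:]]*=[[:space:]]*"([^"]+)";.*/\1 \2/p' "$file" |
    while read -r name path; do
      if [ -e "$path" ]; then
        echo "$name present $path"
      else
        echo "$name MISSING $path"
      fi
    done
}
```

A `MISSING` line after a drive swap means the by-id path changed and the host file needs the new value.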

## Multi-shape support

`disko-shapes/2nvme.nix` is the shape for the current hardware.
Adding a new hardware class (e.g. 4 NVMes, or NVMe + SATA SSD mix)
means:

1. Author `disko-shapes/<new-shape>.nix` matching the
`zeta.disko` options pattern
2. Author a new host template under `hosts/<new-class>-template/`
that imports it
3. Cookie-cutter from THAT template for boxes of the new class

The Longhorn module (`modules/longhorn-disks.nix`) is shape-
agnostic — it takes a list of mount paths and wires them, no
matter how many disks contributed those mounts.
26 changes: 25 additions & 1 deletion full-ai-cluster/flake.nix
@@ -30,9 +30,19 @@
url = "github:nix-darwin/nix-darwin/nix-darwin-24.11";
inputs.nixpkgs.follows = "nixpkgs";
};

# disko — declarative disk partitioning + formatting + mounting.
# Together with the disko-shapes/ modules under ./nixos/modules,
# adding a new node is: copy a host template, change hostname/IP,
# commit, run disko, then `nixos-install --flake .#<host>`.
# No interactive partitioning, no per-host shell scripts.
disko = {
url = "github:nix-community/disko";
inputs.nixpkgs.follows = "nixpkgs";
};
};

outputs = { self, nixpkgs, nixos-hardware, flake-utils, nix-darwin, ... }@inputs:
outputs = { self, nixpkgs, nixos-hardware, flake-utils, nix-darwin, disko, ... }@inputs:
let
stateVersion = "24.11";

@@ -79,6 +89,18 @@
./nixos/hosts/worker-gpu/configuration.nix
];
};

# Cookie-cutter worker template — uses disko for declarative
# disk partitioning + Longhorn multi-disk wiring. Copy
# ./nixos/hosts/worker-template/ to ./nixos/hosts/worker-gpu-NN/,
# change the six placeholder values documented in the file,
# then add a `worker-gpu-NN = mkSystem { ... };` entry here
# mirroring this one. See full-ai-cluster/PROVISIONING.md.
worker-template = mkSystem {
modules = [
./nixos/hosts/worker-template/default.nix
];
};
};

# Shared NixOS modules — per-host configs import these via
@@ -93,6 +115,8 @@
gpu-device-plugin = ./nixos/modules/gpu-device-plugin.nix;
docker = ./nixos/modules/docker.nix;
local-storage = ./nixos/modules/local-storage.nix;
longhorn-disks = ./nixos/modules/longhorn-disks.nix;
disko-shape-2nvme = ./nixos/modules/disko-shapes/2nvme.nix;
};

# nix-darwin config for maintainer Macs (Apple Silicon). Enables
101 changes: 101 additions & 0 deletions full-ai-cluster/nixos/hosts/worker-template/default.nix
@@ -0,0 +1,101 @@
# full-ai-cluster/nixos/hosts/worker-template/default.nix
#
# Cookie-cutter worker node config. Adding a new identical box:
#
# 1. cp -r nixos/hosts/worker-template nixos/hosts/worker-gpu-NN
# 2. Edit the new file — change SIX placeholder values:
# - networking.hostName (line ~30)
# - networking.hostId (line ~32; new random 8-hex)
# - networking.interfaces (per-host MAC / static IP)
# - zeta.disko.nvme0 (per-host /dev/disk/by-id)
# - zeta.disko.nvme1 (per-host /dev/disk/by-id)
# - users.users.zeta.openssh.authorizedKeys.keys (maintainer key)
# 3. Add `worker-gpu-NN` to flake.nix nixosConfigurations
# 4. Boot the box on the installer USB, then:
# nix run github:nix-community/disko -- \
# --mode disko \
# --flake /mnt/etc/zeta/full-ai-cluster#worker-gpu-NN
# nixos-install --flake /mnt/etc/zeta/full-ai-cluster#worker-gpu-NN
# 5. Reboot. Node joins cluster, Longhorn picks up both disks,
# ArgoCD reconciles workloads.
#
# Hardware shape: x86_64, UEFI, 2 NVMes (any size, same shape),
# 1+ NVIDIA GPU. For AMD-only or Intel-only GPU nodes change the
# `zeta.gpu-device-plugin.vendors` setting; for non-GPU workers
# drop the GPU imports entirely.

{ config, pkgs, lib, inputs, ... }:

{
  imports = [
    # Declarative disk layout — disko shapes the partitions,
    # longhorn-disks wires the mounts to Longhorn data paths.
    inputs.disko.nixosModules.disko
    ../../modules/disko-shapes/2nvme.nix
    ../../modules/longhorn-disks.nix

    # Cluster role + hardware-class modules.
    ../../modules/common.nix
    ../../modules/k3s-agent.nix
    ../../modules/gpu.nix
    ../../modules/gpu-device-plugin.nix
    ../../modules/gpu-passthrough.nix
    ../../modules/docker.nix
    ../../modules/local-storage.nix
  ];

  # ── PLACEHOLDER: change per-host ─────────────────────────────
  networking.hostName = "worker-template";
  networking.hostId = "00000000"; # `head -c4 /dev/urandom | od -A n -t x4 | tr -d ' '`
  # ─────────────────────────────────────────────────────────────

  # ── PLACEHOLDER: change per-host (disk IDs) ──────────────────
  # On the live system, run: ls -l /dev/disk/by-id/ | grep nvme
  zeta.disko = {
    nvme0 = "/dev/disk/by-id/nvme-REPLACE_ME_BOOT_DISK";
    nvme1 = "/dev/disk/by-id/nvme-REPLACE_ME_LONGHORN_DISK";
    # rootSize = "256G"; # default; override if needed
  };
  # ─────────────────────────────────────────────────────────────

  # ── PLACEHOLDER: per-host static IP if not using DHCP ────────
  # networking.useDHCP = false;
  # networking.interfaces.eno1.ipv4.addresses = [{
  #   address = "10.0.0.21";
  #   prefixLength = 24;
  # }];
  # networking.defaultGateway = "10.0.0.1";
  # networking.nameservers = [ "10.0.0.1" "1.1.1.1" ];
  # ─────────────────────────────────────────────────────────────

  # K3S join target — same for every worker in the cluster.
  services.k3s.serverAddr = "https://control-plane.zeta.local:6443";

  # GPU device plugin vendor mix. Override per-host if AMD or Intel.
  zeta.gpu-device-plugin = {
    enable = true;
    vendors = [ "nvidia" ];
  };

  # VFIO passthrough off by default; enable per-host with PCI IDs.
  zeta.gpu-passthrough = {
    enable = false;
    pciIds = [ ];
  };

  # Node labels — uncomment + customize per hardware spec so the
  # scheduler can target nodes by GPU model / count.
  services.k3s.extraFlags = lib.mkAfter [
    # "--node-label=zeta.io/gpu-model=rtx-4090"
    # "--node-label=zeta.io/gpu-count=2"
    # "--node-label=zeta.io/dram-gb=128"
  ];

  # ── PLACEHOLDER: maintainer SSH keys ─────────────────────────
  users.users.zeta.openssh.authorizedKeys.keys = [
    # "ssh-ed25519 AAAAC3Nz... aaron@zeta"
  ];
  # ─────────────────────────────────────────────────────────────

  system.stateVersion = "24.11";
}