diff --git a/infra/README.md b/infra/README.md
new file mode 100644
index 0000000000..622074931f
--- /dev/null
+++ b/infra/README.md
@@ -0,0 +1,148 @@
+# infra/
+
+Declarative desired-state for the Zeta AI cluster. Every machine,
+every package, every Kubernetes workload reachable from this
+directory. The flake at the repo root is the entry point.
+
+```
+infra/
+├── nixos/
+│   ├── modules/                  ← shared NixOS modules
+│   │   ├── common.nix            ← baseline imported by every host
+│   │   ├── k3s-server.nix        ← K3S control-plane role
+│   │   ├── k3s-agent.nix         ← K3S worker role
+│   │   └── gpu.nix               ← NVIDIA driver + container toolkit
+│   └── hosts/                    ← per-machine configurations
+│       ├── installer/            ← USB bootable ISO
+│       ├── control-plane/        ← K3S server + ArgoCD bootstrap
+│       ├── worker-gpu-01/        ← NVIDIA AI worker
+│       └── worker-gpu-02/        ← NVIDIA AI worker
+└── k8s/
+    ├── bootstrap/                ← K3S auto-applies on first boot
+    │   ├── argocd-namespace.yaml
+    │   ├── argocd-install.yaml   ← pinned ArgoCD v2.13.2
+    │   └── initial-orleans.yaml  ← scaled-to-0 Orleans skeleton
+    └── applications/             ← ArgoCD watches recursively
+        ├── root-application.yaml ← App-of-Apps root
+        ├── orleans/              ← distributed-chron substrate
+        ├── gitlab/               ← post-bootstrap Git host
+        ├── argoworkflows/        ← DAG job scheduler
+        └── argorollouts/         ← progressive delivery
+```
+
+## Bootstrap (start to running cluster)
+
+### 1. Build the installer ISO
+
+```bash
+# From any machine with Nix installed:
+nix build .#installer-iso
+# Output at result/iso/zeta-installer-*.iso
+```
+
+### 2. Write it to a USB stick
+
+```bash
+sudo dd if=result/iso/zeta-installer-*.iso of=/dev/sdX bs=4M status=progress conv=fsync
+```
+
+Replace `/dev/sdX` with the USB device (check with `lsblk`).
+
+### 3. Boot the target machine on the USB
+
+Console root access (no password, console-only — secure default).
+Bring up the network:
+
+```bash
+nmtui
+# or:
+nmcli device wifi connect password
+```
+
+### 4. Clone Zeta + install
+
+```bash
+# Partition + mount /mnt as desired (parted / gptfdisk / cryptsetup
+# / zfs / etc — all tools are on the stick).
+git clone https://github.com/Lucent-Financial-Group/Zeta /mnt/etc/zeta
+
+# Generate per-machine hardware config:
+nixos-generate-config --root /mnt
+cp /mnt/etc/nixos/hardware-configuration.nix \
+   /mnt/etc/zeta/infra/nixos/hosts//hardware-configuration.nix
+
+# Install:
+nixos-install --flake /mnt/etc/zeta#
+
+# Reboot — done. K3S + ArgoCD + Orleans land automatically.
+```
+
+Where `` is one of `control-plane`, `worker-gpu-01`, `worker-gpu-02`,
+or any future host declared in [`/flake.nix`](../flake.nix) `nixosConfigurations`.
+
+## Bootstrap order (what the cluster does on first boot)
+
+1. **Control-plane boots** → K3S server starts with embedded etcd
+2. K3S applies `infra/k8s/bootstrap/argocd-namespace.yaml`
+3. K3S applies `infra/k8s/bootstrap/argocd-install.yaml` → ArgoCD pods come up
+4. K3S applies `infra/k8s/bootstrap/initial-orleans.yaml` → Orleans namespace + skeleton StatefulSet
+5. K3S applies `infra/k8s/applications/root-application.yaml` → App-of-Apps root
+6. ArgoCD reads root Application → discovers child Apps via include glob
+7. ArgoCD reconciles `orleans/`, `gitlab/`, `argoworkflows/`, `argorollouts/` in parallel
+8. **Workers boot** → K3S agents join via `serverAddr = control-plane.zeta.local:6443`
+9. Pods schedule onto workers based on `zeta.io/gpu=nvidia` node labels
+
+After step 9 the cluster is self-managing. Every subsequent change
+lands by committing to this repo.
+
+## Add a new workload
+
+```bash
+mkdir infra/k8s/applications//
+$EDITOR infra/k8s/applications//Application.yaml
+git add . && git commit -m "feat(infra): add " && git push
+# ArgoCD picks it up on next sync (~3 min)
+```
+
+## Add a new host
+
+1. `mkdir infra/nixos/hosts//`
+2. Author `configuration.nix` (copy from an existing worker as template)
+3. Add a `nixosConfigurations.` entry to `flake.nix`
+4. Boot the machine on the USB, generate hardware config, install
+
+## Update ArgoCD / Orleans / GitLab / Argo Workflows / Argo Rollouts
+
+Bump the `targetRevision` in the corresponding `Application.yaml` and
+commit. ArgoCD reconciles automatically.
+
+## Secrets
+
+Tokens, passwords, and certs use `sops-nix` or `agenix` (TBD —
+follow-up PR). Until then:
+
+- K3S cluster token: place at `/var/lib/rancher/k3s/server/token`
+  manually post-install
+- GitLab initial root password: create the `gitlab-initial-root-password`
+  Secret in the `gitlab` namespace before its Application syncs
+- SSH keys: add to `users.users.zeta.openssh.authorizedKeys.keys`
+  in each host's `configuration.nix`
+
+**Never commit plaintext credentials to this repo.**
+
+## devShell — admin from your workstation
+
+```bash
+nix develop
+# Brings up a shell with kubectl, helm, k9s, argocd, jq, yq, sops, age, etc.
+```
+
+## The framing
+
+Per Addison's spec: every text file in this directory is the desired
+state. The flake is the strange attractor that draws the cluster
+toward it. Drift gets reconciled. Nothing about the cluster lives
+outside this repo (after GitLab installs, post-GitLab workloads move
+to the self-hosted GitLab — but the bootstrap path stays here).
+
+The cluster is the body. The Git repo is the soul.
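
The "Add a new workload" section of the README asks for an `Application.yaml` but does not show one. The following is a minimal sketch of what such a file might look like, assuming the standard Argo CD `Application` resource (`argoproj.io/v1alpha1`). The workload name `example-workload`, the `manifests/` subdirectory, the `main` branch, and the `argocd` namespace name are assumptions for illustration and are not taken from this repo; only the repository URL comes from the README.

```yaml
# Hypothetical file: infra/k8s/applications/example-workload/Application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-workload            # hypothetical workload name
  namespace: argocd                 # assumed name of the namespace ArgoCD runs in
spec:
  project: default
  source:
    repoURL: https://github.com/Lucent-Financial-Group/Zeta
    targetRevision: main            # assumed branch; this is the field the README says to bump
    path: infra/k8s/applications/example-workload/manifests   # hypothetical manifest directory
  destination:
    server: https://kubernetes.default.svc   # the cluster ArgoCD itself runs in
    namespace: example-workload
  syncPolicy:
    automated:
      prune: true                   # delete resources that were removed from Git
      selfHeal: true                # revert manual drift back to the Git state
    syncOptions:
      - CreateNamespace=true        # create the destination namespace if it is missing
```

Pointing `path` at a subdirectory keeps the workload's manifests separate from the `Application.yaml` that the root Application discovers. Whether the file is actually picked up depends on the include glob in `root-application.yaml`, which this diff does not show.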