docs(infra): infra/README.md — bootstrap runbook (PR 5 of Addison's plan) #4901
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,148 @@ | ||
| # infra/ | ||
|
|
||
| Declarative desired-state for the Zeta AI cluster. Every machine, | ||
| every package, every Kubernetes workload reachable from this | ||
| directory. The flake at the repo root is the entry point. | ||
|
|
||
| ``` | ||
| infra/ | ||
| ├── nixos/ | ||
| │ ├── modules/ ← shared NixOS modules | ||
| │ │ ├── common.nix ← baseline imported by every host | ||
| │ │ ├── k3s-server.nix ← K3S control-plane role | ||
| │ │ ├── k3s-agent.nix ← K3S worker role | ||
| │ │ └── gpu.nix ← NVIDIA driver + container toolkit | ||
| │ └── hosts/ ← per-machine configurations | ||
| │ ├── installer/ ← USB bootable ISO | ||
| │ ├── control-plane/ ← K3S server + ArgoCD bootstrap | ||
| │ ├── worker-gpu-01/ ← NVIDIA AI worker | ||
| │ └── worker-gpu-02/ ← NVIDIA AI worker | ||
| └── k8s/ | ||
| ├── bootstrap/ ← K3S auto-applies on first boot | ||
| │ ├── argocd-namespace.yaml | ||
| │ ├── argocd-install.yaml ← pinned ArgoCD v2.13.2 | ||
| │ └── initial-orleans.yaml ← scaled-to-0 Orleans skeleton | ||
| └── applications/ ← ArgoCD watches recursively | ||
| ├── root-application.yaml ← App-of-Apps root | ||
| ├── orleans/ ← distributed-chron substrate | ||
| ├── gitlab/ ← post-bootstrap Git host | ||
| ├── argoworkflows/ ← DAG job scheduler | ||
| └── argorollouts/ ← progressive delivery | ||
| ``` | ||
|
|
||
| ## Bootstrap (start to running cluster) | ||
|
|
||
| ### 1. Build the installer ISO | ||
|
|
||
| ```bash | ||
| # From any machine with Nix installed: | ||
| nix build .#installer-iso | ||
| # Output at result/iso/zeta-installer-*.iso | ||
| ``` | ||
|
|
||
| ### 2. Write it to a USB stick | ||
|
|
||
| ```bash | ||
| sudo dd if=result/iso/zeta-installer-*.iso of=/dev/sdX bs=4M status=progress conv=fsync | ||
| ``` | ||
|
|
||
| Replace `/dev/sdX` with the USB device (check with `lsblk`). | ||
|
|
||
| ### 3. Boot the target machine on the USB | ||
|
|
||
| The installer gives root access on the console only, with no password (a secure default). | ||
| Bring up the network: | ||
|
|
||
| ```bash | ||
| nmtui | ||
| # or: | ||
| nmcli device wifi connect <SSID> password <PSK> | ||
| ``` | ||
|
|
||
| ### 4. Clone Zeta + install | ||
|
|
||
| ```bash | ||
| # Partition + mount /mnt as desired (parted / gptfdisk / cryptsetup | ||
| # / zfs / etc — all tools are on the stick). | ||
| git clone https://github.com/Lucent-Financial-Group/Zeta /mnt/etc/zeta | ||
|
|
||
| # Generate per-machine hardware config: | ||
| nixos-generate-config --root /mnt | ||
| cp /mnt/etc/nixos/hardware-configuration.nix \ | ||
| /mnt/etc/zeta/infra/nixos/hosts/<host>/hardware-configuration.nix | ||
|
|
||
| # Install: | ||
| nixos-install --flake /mnt/etc/zeta#<host> | ||
|
|
||
| # Reboot — done. K3S + ArgoCD + Orleans land automatically. | ||
| ``` | ||
|
|
||
| Where `<host>` is one of `control-plane`, `worker-gpu-01`, `worker-gpu-02`, | ||
| or any future host declared in [`/flake.nix`](../flake.nix) `nixosConfigurations`. | ||
|
|
||
| ## Bootstrap order (what the cluster does on first boot) | ||
|
|
||
| 1. **Control-plane boots** → K3S server starts with embedded etcd | ||
| 2. K3S applies `infra/k8s/bootstrap/argocd-namespace.yaml` | ||
| 3. K3S applies `infra/k8s/bootstrap/argocd-install.yaml` → ArgoCD pods come up | ||
| 4. K3S applies `infra/k8s/bootstrap/initial-orleans.yaml` → Orleans namespace + skeleton StatefulSet | ||
| 5. K3S applies `infra/k8s/applications/root-application.yaml` → App-of-Apps root | ||
| 6. ArgoCD reads root Application → discovers child Apps via include glob | ||
| 7. ArgoCD reconciles `orleans/`, `gitlab/`, `argoworkflows/`, `argorollouts/` in parallel | ||
| 8. **Workers boot** → K3S agents join via `serverAddr = control-plane.zeta.local:6443` | ||
| 9. Pods schedule onto workers based on `zeta.io/gpu=nvidia` node labels | ||
|
|
||
| After step 9 the cluster is self-managing. Every subsequent change | ||
| lands by committing to this repo. | ||
|
|
||
| ## Add a new workload | ||
|
|
||
| ```bash | ||
| mkdir infra/k8s/applications/<name>/ | ||
| $EDITOR infra/k8s/applications/<name>/Application.yaml | ||
| git add . && git commit -m "feat(infra): add <name>" && git push | ||
| # ArgoCD picks it up on next sync (~3 min) | ||
| ``` | ||
|
|
||
| ## Add a new host | ||
|
|
||
| 1. `mkdir infra/nixos/hosts/<host>/` | ||
| 2. Author `configuration.nix` (copy from an existing worker as template) | ||
| 3. Add a `nixosConfigurations.<host>` entry to `flake.nix` | ||
| 4. Boot the machine on the USB, generate hardware config, install | ||
|
|
||
| ## Update ArgoCD / Orleans / GitLab / Argo Workflows / Argo Rollouts | ||
|
|
||
| Bump the `targetRevision` in the corresponding `Application.yaml` and | ||
| commit. ArgoCD reconciles automatically. | ||
|
Comment on lines +116 to +117
The update guidance says to bump |
||
|
|
||
| ## Secrets | ||
|
|
||
| Tokens, passwords, and certs use `sops-nix` or `agenix` (TBD — | ||
| follow-up PR). Until then: | ||
|
|
||
| - K3S cluster token: place at `/var/lib/rancher/k3s/server/token` | ||
| manually post-install | ||
| - GitLab initial root password: create the `gitlab-initial-root-password` | ||
| Secret in the `gitlab` namespace before its Application syncs | ||
| - SSH keys: add to `users.users.zeta.openssh.authorizedKeys.keys` | ||
| in each host's `configuration.nix` | ||
|
|
||
| **Never commit plaintext credentials to this repo.** | ||
|
|
||
| ## devShell — admin from your workstation | ||
|
|
||
| ```bash | ||
| nix develop | ||
| # Brings up a shell with kubectl, helm, k9s, argocd, jq, yq, sops, age, etc. | ||
| ``` | ||
|
|
||
| ## The framing | ||
|
|
||
| Per Addison's spec: every text file in this directory is the desired | ||
| state. The flake is the strange attractor that draws the cluster | ||
| toward it. Drift gets reconciled. Nothing about the cluster lives | ||
| outside this repo (after GitLab installs, post-GitLab workloads move | ||
| to the self-hosted GitLab — but the bootstrap path stays here). | ||
|
|
||
| The cluster is the body. The Git repo is the soul. | ||
This step documents `infra/k8s/bootstrap/initial-orleans.yaml` as a K3S first-boot auto-apply, but `services.k3s.manifests` in `infra/nixos/modules/k3s-server.nix` only registers `argocd-namespace`, `argocd-install`, and `root-application`. That mismatch makes the runbook inaccurate during bring-up and can cause operators to troubleshoot for a bootstrap manifest that is never actually applied by K3S.
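
A mismatch like this could be caught mechanically before merge. The sketch below is one possible check, not something that exists in this PR: the function name `check_bootstrap_manifests` is hypothetical, and it assumes a manifest counts as registered when its base name appears anywhere in the text of the Nix module. It does a plain-text match and does not evaluate `services.k3s.manifests`.

```shell
#!/bin/sh
# Sketch: report bootstrap manifests that the K3S module never mentions.
# Assumption: "registered" means the manifest's base name (without .yaml)
# appears somewhere in the Nix module text. This is a text match only;
# it does not evaluate the Nix expression.
check_bootstrap_manifests() {
    manifest_dir=$1   # e.g. infra/k8s/bootstrap
    nix_module=$2     # e.g. infra/nixos/modules/k3s-server.nix
    missing=0
    for manifest in "$manifest_dir"/*.yaml; do
        # Skip the unexpanded glob when the directory has no .yaml files.
        [ -e "$manifest" ] || continue
        name=$(basename "$manifest" .yaml)
        if ! grep -qF -- "$name" "$nix_module"; then
            echo "not registered: $name"
            missing=1
        fi
    done
    # Non-zero exit status when at least one manifest is unreferenced.
    return "$missing"
}
```

Run from the repo root, it could be invoked as `check_bootstrap_manifests infra/k8s/bootstrap infra/nixos/modules/k3s-server.nix`. Because it only matches names, a manifest mentioned in a comment would still pass, so treat a clean result as a hint and not as proof.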