Merged
148 changes: 148 additions & 0 deletions infra/README.md
@@ -0,0 +1,148 @@
# infra/

Declarative desired-state for the Zeta AI cluster. Every machine,
every package, every Kubernetes workload reachable from this
directory. The flake at the repo root is the entry point.

```
infra/
├── nixos/
│ ├── modules/ ← shared NixOS modules
│ │ ├── common.nix ← baseline imported by every host
│ │ ├── k3s-server.nix ← K3S control-plane role
│ │ ├── k3s-agent.nix ← K3S worker role
│ │ └── gpu.nix ← NVIDIA driver + container toolkit
│ └── hosts/ ← per-machine configurations
│ ├── installer/ ← USB bootable ISO
│ ├── control-plane/ ← K3S server + ArgoCD bootstrap
│ ├── worker-gpu-01/ ← NVIDIA AI worker
│ └── worker-gpu-02/ ← NVIDIA AI worker
└── k8s/
├── bootstrap/ ← K3S auto-applies on first boot
│ ├── argocd-namespace.yaml
│ ├── argocd-install.yaml ← pinned ArgoCD v2.13.2
│ └── initial-orleans.yaml ← scaled-to-0 Orleans skeleton
└── applications/ ← ArgoCD watches recursively
├── root-application.yaml ← App-of-Apps root
├── orleans/ ← distributed-chron substrate
├── gitlab/ ← post-bootstrap Git host
├── argoworkflows/ ← DAG job scheduler
└── argorollouts/ ← progressive delivery
```

## Bootstrap (start to running cluster)

### 1. Build the installer ISO

```bash
# From any machine with Nix installed:
nix build .#installer-iso
# Output at result/iso/zeta-installer-*.iso
```

### 2. Write it to a USB stick

```bash
sudo dd if=result/iso/zeta-installer-*.iso of=/dev/sdX bs=4M status=progress conv=fsync
```

Replace `/dev/sdX` with the USB device (check with `lsblk`).
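Because `dd` overwrites whatever device it is pointed at, a guard that checks the target first can help. The following is a minimal sketch, not part of the repo: the function name `is_removable` is hypothetical, and it assumes `lsblk` reports the stick with `RM` set to `1`, which some USB SSD enclosures do not do.

```bash
# Hypothetical guard: succeeds only if lsblk reports the device as removable.
# -d skips partitions, -n drops the header, -o RM prints the removable flag.
is_removable() {
  rm_flag="$(lsblk -dno RM "$1" 2>/dev/null | tr -d '[:space:]')"
  [ "$rm_flag" = "1" ]
}

# Example (not executed here):
#   is_removable /dev/sdX && sudo dd if=... of=/dev/sdX bs=4M status=progress conv=fsync
```

If the check fails for a device you know is the USB stick, fall back to confirming it by size and model in the `lsblk` output.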

### 3. Boot the target machine from the USB stick

Root access is available on the local console only, with no password;
this is the secure default. Bring up the network:

```bash
nmtui
# or:
nmcli device wifi connect <SSID> password <PSK>
```

### 4. Clone Zeta + install

```bash
# Partition + mount /mnt as desired (parted / gptfdisk / cryptsetup
# / zfs / etc — all tools are on the stick).
git clone https://github.com/Lucent-Financial-Group/Zeta /mnt/etc/zeta

# Generate per-machine hardware config:
nixos-generate-config --root /mnt
cp /mnt/etc/nixos/hardware-configuration.nix \
/mnt/etc/zeta/infra/nixos/hosts/<host>/hardware-configuration.nix

# Install:
nixos-install --flake /mnt/etc/zeta#<host>

# Reboot — done. K3S + ArgoCD + Orleans land automatically.
```

Where `<host>` is one of `control-plane`, `worker-gpu-01`, `worker-gpu-02`,
or any future host declared in [`/flake.nix`](../flake.nix) `nixosConfigurations`.
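Since `nixos-install` can only build a host that the flake declares, a small guard can catch a mistyped `<host>` before the install starts. This is a sketch under the assumption that only the three hosts named above exist; the function name `is_known_host` is hypothetical and the list would need updating whenever `nixosConfigurations` changes.

```bash
# Hypothetical helper: returns 0 when the argument is one of the hosts
# listed in this README. Keep the list in sync with nixosConfigurations
# in flake.nix.
is_known_host() {
  case "$1" in
    control-plane|worker-gpu-01|worker-gpu-02) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not executed here):
#   is_known_host "$host" && nixos-install --flake "/mnt/etc/zeta#$host"
```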

## Bootstrap order (what the cluster does on first boot)

1. **Control-plane boots** → K3S server starts with embedded etcd
2. K3S applies `infra/k8s/bootstrap/argocd-namespace.yaml`
3. K3S applies `infra/k8s/bootstrap/argocd-install.yaml` → ArgoCD pods come up
4. K3S applies `infra/k8s/bootstrap/initial-orleans.yaml` → Orleans namespace + skeleton StatefulSet
Review comment (P2): Remove nonexistent bootstrap step for initial-orleans

This step documents `infra/k8s/bootstrap/initial-orleans.yaml` as a K3S first-boot auto-apply, but `services.k3s.manifests` in `infra/nixos/modules/k3s-server.nix` only registers `argocd-namespace`, `argocd-install`, and `root-application`. That mismatch makes the runbook inaccurate during bring-up and can cause operators to troubleshoot for a bootstrap manifest that is never actually applied by K3S.


5. K3S applies `infra/k8s/applications/root-application.yaml` → App-of-Apps root
6. ArgoCD reads root Application → discovers child Apps via include glob
7. ArgoCD reconciles `orleans/`, `gitlab/`, `argoworkflows/`, `argorollouts/` in parallel
8. **Workers boot** → K3S agents join via `serverAddr = control-plane.zeta.local:6443`
9. Pods schedule onto workers based on `zeta.io/gpu=nvidia` node labels

After step 9 the cluster is self-managing. Every subsequent change
lands by committing to this repo.
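The first-boot sequence above can be spot-checked from a workstation once the nodes are up. The sketch below is illustrative only: `check_bootstrap` is a hypothetical name, and it assumes ArgoCD runs in a namespace called `argocd` and that your kubeconfig already points at the cluster.

```bash
# Hypothetical post-bootstrap check. Each call maps to a stage in the list above.
check_bootstrap() {
  # ArgoCD pods created by the bootstrap install manifest
  kubectl -n argocd get pods || return 1
  # Applications discovered through the App-of-Apps root
  kubectl -n argocd get applications.argoproj.io || return 1
  # Workers that joined and carry the GPU scheduling label
  kubectl get nodes -l zeta.io/gpu=nvidia
}

# Example (not executed here):
#   check_bootstrap
```

An empty node list from the last command would suggest the workers have not joined yet or are missing the label.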

## Add a new workload

```bash
mkdir infra/k8s/applications/<name>/
$EDITOR infra/k8s/applications/<name>/Application.yaml
git add . && git commit -m "feat(infra): add <name>" && git push
# ArgoCD picks it up on next sync (~3 min)
```
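For the `$EDITOR` step, it may help to start from a skeleton rather than an empty file. The sketch below writes a minimal ArgoCD `Application` manifest; the function name `new_workload` is hypothetical, and the chart source, revision, `argocd` namespace, and automated sync policy are all placeholder assumptions to be replaced with the real values for the workload.

```bash
# Hypothetical scaffold: writes <base>/<name>/Application.yaml.
# Usage: new_workload <name> [base-dir]
new_workload() {
  name="$1"
  dir="${2:-infra/k8s/applications}/$name"
  mkdir -p "$dir" || return 1
  cat > "$dir/Application.yaml" <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: $name
  namespace: argocd            # assumption: ArgoCD's own namespace
spec:
  project: default
  source:
    repoURL: https://example.com/charts   # placeholder
    chart: $name                          # placeholder
    targetRevision: 0.0.0                 # placeholder
  destination:
    server: https://kubernetes.default.svc
    namespace: $name
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
}
```

After editing the placeholders, the commit and push steps above apply unchanged.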

## Add a new host

1. `mkdir infra/nixos/hosts/<host>/`
2. Author `configuration.nix` (copy from an existing worker as template)
3. Add a `nixosConfigurations.<host>` entry to `flake.nix`
4. Boot the machine from the USB installer, generate the hardware config, and install (Bootstrap steps 3 and 4)

## Update ArgoCD / Orleans / GitLab / Argo Workflows / Argo Rollouts

Bump the `targetRevision` in the corresponding `Application.yaml` and
commit. ArgoCD reconciles automatically.
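For workloads that do have an `Application.yaml`, the bump can be scripted. This is a sketch, not a repo tool: `bump_target_revision` is a hypothetical name, and it assumes the manifest contains a single `targetRevision:` key and that the new value contains no `|` or `&` characters. The review comment on this section notes that ArgoCD itself is pinned in `infra/k8s/bootstrap/argocd-install.yaml`, so this approach would not apply to ArgoCD.

```bash
# Hypothetical helper: rewrite the targetRevision line of one manifest,
# preserving its indentation.
# Usage: bump_target_revision <file> <new-revision>
bump_target_revision() {
  file="$1"
  new="$2"
  tmp="$(mktemp)" || return 1
  # Write to a temp file first, then copy back so the original file mode is kept.
  sed "s|^\([[:space:]]*targetRevision:\).*|\1 $new|" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Example (not executed here):
#   bump_target_revision infra/k8s/applications/<name>/Application.yaml <new-revision>
```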
Comment on lines +116 to +117
Review comment (P2): Fix ArgoCD upgrade instructions to point at real source

The update guidance says to bump `targetRevision` in the corresponding `Application.yaml` for ArgoCD, but ArgoCD in this repo is pinned via `infra/k8s/bootstrap/argocd-install.yaml` (remote manifest tag), and there is no ArgoCD `Application.yaml` to edit. As written, the documented ArgoCD upgrade path is not executable and will send maintainers to the wrong file.



## Secrets

Tokens, passwords, and certs will be managed with `sops-nix` or `agenix`
(choice TBD in a follow-up PR). Until then:

- K3S cluster token: place at `/var/lib/rancher/k3s/server/token`
manually post-install
- GitLab initial root password: create the `gitlab-initial-root-password`
Secret in the `gitlab` namespace before its Application syncs
- SSH keys: add to `users.users.zeta.openssh.authorizedKeys.keys`
in each host's `configuration.nix`

**Never commit plaintext credentials to this repo.**
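The GitLab item in the list above could be handled with a short helper like the one below. It is a sketch under assumptions: `create_gitlab_root_secret` is a hypothetical name, and the Secret key `password` is a guess, so check which key the GitLab manifests in this repo actually read. Note that `--from-literal` exposes the value in the local process list while the command runs.

```bash
# Hypothetical helper: create the namespace if needed, then the Secret.
# Usage: create_gitlab_root_secret <password>
create_gitlab_root_secret() {
  pw="$1"
  if [ -z "$pw" ]; then
    echo "usage: create_gitlab_root_secret <password>" >&2
    return 2
  fi
  # Idempotent namespace creation: render the object, then apply it.
  kubectl create namespace gitlab --dry-run=client -o yaml \
    | kubectl apply -f - || return 1
  # Key name "password" is an assumption.
  kubectl -n gitlab create secret generic gitlab-initial-root-password \
    --from-literal="password=$pw"
}
```

Run it before the `gitlab` Application syncs, and keep the password out of shell history and out of this repo.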

## devShell — admin from your workstation

```bash
nix develop
# Brings up a shell with kubectl, helm, k9s, argocd, jq, yq, sops, age, etc.
```

## The framing

Per Addison's spec: every text file in this directory is the desired
state. The flake is the strange attractor that draws the cluster
toward it. Drift gets reconciled. Nothing about the cluster lives
outside this repo, with one exception: once GitLab is installed,
post-GitLab workloads move to the self-hosted GitLab, while the
bootstrap path stays here.

The cluster is the body. The Git repo is the soul.