From 220a09b273aaa31370766b19ad6758129e33a117 Mon Sep 17 00:00:00 2001
From: Lior <lior@zeta.dev>
Date: Mon, 25 May 2026 01:43:51 -0400
Subject: [PATCH 1/5] feat(ai-cluster-bootstrap): two-directory declarative AI
 cluster scaffold
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds two clean separate top-level directories on branch
ai-cluster-bootstrap per Aaron's spec:

usb-nixos-installer/  (3 files — USB-only, nothing extra)
  - README.md scoped to USB bootstrap only
  - flake.nix that produces installer-iso
  - nixos/installer/configuration.nix with package list

full-ai-cluster/  (62 files — end-to-end cluster)
  - usb-nixos-installer/ — byte-identical copy of the standalone
    directory (the USB part as the start snippet)
  - flake.nix — cluster flake (installer + per-host configs +
    nix-darwin linux-builder for Apple Silicon maintainers)
  - README.md — full bootstrap runbook
  - nixos/modules/ — 8 modules:
      common, k3s-server, k3s-agent, gpu, gpu-passthrough,
      gpu-device-plugin (NVIDIA + AMD + Intel), docker,
      local-storage
  - nixos/hosts/control-plane, worker-gpu — configuration.nix
    + placeholder hardware-configuration.nix + README per host
  - k8s/bootstrap/ — 4 manifests K3S auto-applies on first boot:
      cilium-namespace, argocd-namespace, argocd-install (pinned),
      root-application (App-of-Apps)
  - k8s/applications/ — 30 ArgoCD Application.yaml + supporting
    manifests covering every component in the spec

Component composition:
  NixFlake layer (OS): K3S, Cilium-host-prep, Docker, local-path
    storage class, GPU drivers + container toolkit, GPU passthrough
    (VFIO), GPU device plugins for K8s (NVIDIA/AMD/Intel)
  ArgoCD layer (cluster): Cilium (KPR + Hubble + BPF MASQUERADE),
    Orleans, Temporal (TS), Dapr Actors, GitLab, Forgejo, Argo
    Workflows + Rollouts, Longhorn, CockroachDB, Hindsight, OZ,
    Hermes (+ OZ integration via env vars), Warp, Ollama, vLLM,
    Deepseek Coder, Qwen Coder, kube-prometheus-stack
    (Prometheus + Grafana), NATS, Redis, Weaviate, Loki, Tempo,
    Alloy, Mimir, Istio (base), Open Policy Agent (Gatekeeper),
    Sealed Secrets, Vault

Cilium displaces K3S's default networking — k3s-server.nix passes
--flannel-backend=none, --disable-network-policy, and
--disable-kube-proxy so Cilium owns CNI + KPR + policy end-to-end.

Ambiguous components — placeholder Application.yamls with TODO(aaron)
markers + flagged in README "Component status" + PR description:
  - OZ (OpenZiti? Auth0? Aaron-specific?)
  - Hermes (Cosmos IBC relayer? Aaron AI-agent? Hermes integrates
    with OZ + has SOPS-baked-into-image secrets + talks to Ollama/vLLM
    per spec — the deployment.yaml wires the env-var structure for
    when the image lands)
  - Warp (Cloudflare? Warp terminal? Dagger Warp engine?)
  - Hindsight (Lockheed Martin OTel processor? other?)

Sharpen each by editing repoURL + chart in the relevant
Application.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 full-ai-cluster/README.md                     | 242 ++++++++++++++++++
 full-ai-cluster/flake.nix                     | 167 ++++++++++++
 .../k8s/applications/alloy/Application.yaml   |  46 ++++
 .../argo-rollouts/Application.yaml            |  30 +++
 .../argo-workflows/Application.yaml           |  34 +++
 .../k8s/applications/cilium/Application.yaml  |  69 +++++
 .../applications/cockroachdb/Application.yaml |  43 ++++
 .../k8s/applications/dapr/Application.yaml    |  32 +++
 .../deepseek-coder/Application.yaml           |  28 ++
 .../deepseek-coder/configmap.yaml             |  18 ++
 .../k8s/applications/forgejo/Application.yaml |  32 +++
 .../k8s/applications/gitlab/Application.yaml  |  39 +++
 .../k8s/applications/hermes/Application.yaml  |  37 +++
 .../k8s/applications/hermes/deployment.yaml   |  63 +++++
 .../applications/hindsight/Application.yaml   |  35 +++
 .../k8s/applications/istio/Application.yaml   |  31 +++
 .../kube-prometheus-stack/Application.yaml    |  46 ++++
 .../k8s/applications/loki/Application.yaml    |  45 ++++
 .../applications/longhorn/Application.yaml    |  34 +++
 .../k8s/applications/mimir/Application.yaml   |  46 ++++
 .../k8s/applications/nats/Application.yaml    |  35 +++
 .../k8s/applications/ollama/Application.yaml  |  48 ++++
 .../open-policy-agent/Application.yaml        |  28 ++
 .../k8s/applications/orleans/Application.yaml |  33 +++
 .../k8s/applications/orleans/configmap.yaml   |  14 +
 .../k8s/applications/orleans/namespace.yaml   |   7 +
 .../k8s/applications/orleans/rbac.yaml        |  35 +++
 .../k8s/applications/orleans/service.yaml     |  33 +++
 .../k8s/applications/orleans/statefulset.yaml |  50 ++++
 .../k8s/applications/oz/Application.yaml      |  34 +++
 .../applications/qwen-coder/Application.yaml  |  23 ++
 .../applications/qwen-coder/configmap.yaml    |  12 +
 .../k8s/applications/redis/Application.yaml   |  39 +++
 .../sealed-secrets/Application.yaml           |  26 ++
 .../k8s/applications/tempo/Application.yaml   |  33 +++
 .../applications/temporal/Application.yaml    |  46 ++++
 .../k8s/applications/vault/Application.yaml   |  49 ++++
 .../k8s/applications/vllm/Application.yaml    |  29 +++
 .../k8s/applications/vllm/deployment.yaml     |  69 +++++
 .../k8s/applications/warp/Application.yaml    |  32 +++
 .../applications/weaviate/Application.yaml    |  39 +++
 .../k8s/bootstrap/argocd-install.yaml         |  19 ++
 .../k8s/bootstrap/argocd-namespace.yaml       |  12 +
 .../k8s/bootstrap/cilium-namespace.yaml       |  12 +
 .../k8s/bootstrap/root-application.yaml       |  41 +++
 .../nixos/hosts/control-plane/README.md       |  44 ++++
 .../hosts/control-plane/configuration.nix     |  32 +++
 .../control-plane/hardware-configuration.nix  |  37 +++
 .../nixos/hosts/worker-gpu/README.md          |  60 +++++
 .../nixos/hosts/worker-gpu/configuration.nix  |  53 ++++
 .../worker-gpu/hardware-configuration.nix     |  34 +++
 full-ai-cluster/nixos/modules/common.nix      |  66 +++++
 full-ai-cluster/nixos/modules/docker.nix      |  39 +++
 .../nixos/modules/gpu-device-plugin.nix       | 170 ++++++++++++
 .../nixos/modules/gpu-passthrough.nix         |  75 ++++++
 full-ai-cluster/nixos/modules/gpu.nix         |  55 ++++
 full-ai-cluster/nixos/modules/k3s-agent.nix   |  41 +++
 full-ai-cluster/nixos/modules/k3s-server.nix  |  73 ++++++
 .../nixos/modules/local-storage.nix           | 133 ++++++++++
 full-ai-cluster/usb-nixos-installer/README.md |  88 +++++++
 full-ai-cluster/usb-nixos-installer/flake.nix |  49 ++++
 .../nixos/installer/configuration.nix         | 204 +++++++++++++++
 usb-nixos-installer/README.md                 |  88 +++++++
 usb-nixos-installer/flake.nix                 |  49 ++++
 .../nixos/installer/configuration.nix         | 204 +++++++++++++++
 65 files changed, 3509 insertions(+)
 create mode 100644 full-ai-cluster/README.md
 create mode 100644 full-ai-cluster/flake.nix
 create mode 100644 full-ai-cluster/k8s/applications/alloy/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/argo-rollouts/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/argo-workflows/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/cilium/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/cockroachdb/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/dapr/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/deepseek-coder/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/deepseek-coder/configmap.yaml
 create mode 100644 full-ai-cluster/k8s/applications/forgejo/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/gitlab/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/hermes/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/hermes/deployment.yaml
 create mode 100644 full-ai-cluster/k8s/applications/hindsight/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/istio/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/kube-prometheus-stack/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/loki/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/longhorn/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/mimir/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/nats/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/ollama/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/open-policy-agent/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/orleans/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/orleans/configmap.yaml
 create mode 100644 full-ai-cluster/k8s/applications/orleans/namespace.yaml
 create mode 100644 full-ai-cluster/k8s/applications/orleans/rbac.yaml
 create mode 100644 full-ai-cluster/k8s/applications/orleans/service.yaml
 create mode 100644 full-ai-cluster/k8s/applications/orleans/statefulset.yaml
 create mode 100644 full-ai-cluster/k8s/applications/oz/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/qwen-coder/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/qwen-coder/configmap.yaml
 create mode 100644 full-ai-cluster/k8s/applications/redis/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/sealed-secrets/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/tempo/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/temporal/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/vault/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/vllm/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/vllm/deployment.yaml
 create mode 100644 full-ai-cluster/k8s/applications/warp/Application.yaml
 create mode 100644 full-ai-cluster/k8s/applications/weaviate/Application.yaml
 create mode 100644 full-ai-cluster/k8s/bootstrap/argocd-install.yaml
 create mode 100644 full-ai-cluster/k8s/bootstrap/argocd-namespace.yaml
 create mode 100644 full-ai-cluster/k8s/bootstrap/cilium-namespace.yaml
 create mode 100644 full-ai-cluster/k8s/bootstrap/root-application.yaml
 create mode 100644 full-ai-cluster/nixos/hosts/control-plane/README.md
 create mode 100644 full-ai-cluster/nixos/hosts/control-plane/configuration.nix
 create mode 100644 full-ai-cluster/nixos/hosts/control-plane/hardware-configuration.nix
 create mode 100644 full-ai-cluster/nixos/hosts/worker-gpu/README.md
 create mode 100644 full-ai-cluster/nixos/hosts/worker-gpu/configuration.nix
 create mode 100644 full-ai-cluster/nixos/hosts/worker-gpu/hardware-configuration.nix
 create mode 100644 full-ai-cluster/nixos/modules/common.nix
 create mode 100644 full-ai-cluster/nixos/modules/docker.nix
 create mode 100644 full-ai-cluster/nixos/modules/gpu-device-plugin.nix
 create mode 100644 full-ai-cluster/nixos/modules/gpu-passthrough.nix
 create mode 100644 full-ai-cluster/nixos/modules/gpu.nix
 create mode 100644 full-ai-cluster/nixos/modules/k3s-agent.nix
 create mode 100644 full-ai-cluster/nixos/modules/k3s-server.nix
 create mode 100644 full-ai-cluster/nixos/modules/local-storage.nix
 create mode 100644 full-ai-cluster/usb-nixos-installer/README.md
 create mode 100644 full-ai-cluster/usb-nixos-installer/flake.nix
 create mode 100644 full-ai-cluster/usb-nixos-installer/nixos/installer/configuration.nix
 create mode 100644 usb-nixos-installer/README.md
 create mode 100644 usb-nixos-installer/flake.nix
 create mode 100644 usb-nixos-installer/nixos/installer/configuration.nix

diff --git a/full-ai-cluster/README.md b/full-ai-cluster/README.md
new file mode 100644
index 0000000000..e60e2abb14
--- /dev/null
+++ b/full-ai-cluster/README.md
@@ -0,0 +1,242 @@
+# full-ai-cluster
+
+End-to-end declarative AI cluster. Starts with the USB bootstrap
+(identical snippet at `./usb-nixos-installer/`) and continues
+through every layer up to running AI workloads.
+
+## What's inside
+
+```
+full-ai-cluster/
+├── usb-nixos-installer/        ← byte-identical copy of ../usb-nixos-installer
+├── flake.nix                   ← cluster flake (host configs + linux-builder)
+├── nixos/
+│   ├── modules/                ← shared NixOS modules
+│   │   ├── common.nix
+│   │   ├── k3s-server.nix      ← K3S control-plane (flannel disabled for Cilium)
+│   │   ├── k3s-agent.nix       ← K3S worker
+│   │   ├── gpu.nix             ← NVIDIA drivers + container toolkit
+│   │   ├── gpu-passthrough.nix ← VFIO passthrough for VM workloads
+│   │   ├── gpu-device-plugin.nix ← K8s device plugin (NVIDIA/AMD/Intel)
+│   │   ├── docker.nix          ← Docker via NixFlake
+│   │   └── local-storage.nix   ← local-path-provisioner storage class
+│   └── hosts/
+│       ├── control-plane/      ← configuration.nix + hardware + README
+│       └── worker-gpu/         ← configuration.nix + hardware + README
+└── k8s/
+    ├── bootstrap/              ← K3S auto-applies on first boot
+    │   ├── argocd-namespace.yaml
+    │   ├── argocd-install.yaml
+    │   └── root-application.yaml ← App-of-Apps root
+    └── applications/           ← ArgoCD watches recursively
+        ├── cilium/             ← CNI + Hubble + KPR + BPF MASQUERADE
+        ├── orleans/            ← distributed cron #1
+        ├── temporal/           ← distributed cron #2 (TS)
+        ├── dapr/               ← distributed cron #3 (actors)
+        ├── gitlab/             ← self-hosted Git host (option A)
+        ├── forgejo/            ← self-hosted Git host (option B, lighter)
+        ├── argo-workflows/     ← DAG job scheduler
+        ├── argo-rollouts/      ← progressive delivery
+        ├── longhorn/           ← distributed block storage
+        ├── cockroachdb/        ← distributed SQL
+        ├── hindsight/          ← TODO: ambiguous component (see PR)
+        ├── oz/                 ← TODO: ambiguous component (see PR)
+        ├── hermes/             ← TODO: ambiguous component + OZ integration
+        ├── warp/               ← TODO: ambiguous component (see PR)
+        ├── ollama/             ← LLM serving (option A — local)
+        ├── vllm/               ← LLM serving (option B — high-throughput)
+        ├── deepseek-coder/     ← model deploy → Ollama or vLLM
+        ├── qwen-coder/         ← model deploy → Ollama or vLLM
+        ├── kube-prometheus-stack/ ← Prometheus + Grafana + Alertmanager
+        ├── nats/               ← messaging
+        ├── redis/              ← cache
+        ├── weaviate/           ← vector DB
+        ├── loki/               ← logs
+        ├── tempo/              ← traces
+        ├── alloy/              ← OpenTelemetry collector
+        ├── mimir/              ← long-term metrics storage
+        ├── istio/              ← service mesh
+        ├── open-policy-agent/  ← admission policy
+        ├── sealed-secrets/     ← secrets at rest in git (option A)
+        └── vault/              ← runtime secrets (option B)
+```
+
+## Two layers, two reconcilers
+
+- **OS layer** is reconciled by **Nix + NixOS**. Everything in
+  `./nixos/` lands on a target machine via `nixos-install --flake`
+  (initial install) or `nixos-rebuild switch --flake` (updates).
+- **Cluster layer** is reconciled by **ArgoCD**. K3S auto-applies
+  the three bootstrap manifests at `./k8s/bootstrap/` on first boot;
+  ArgoCD then reads `./k8s/applications/root-application.yaml`
+  (App-of-Apps) and reconciles every workload from the same Git
+  repo every ~3 minutes.
+
+This split is intentional: anything that must run BEFORE the
+cluster API exists (kernel modules, CNI host setup, container
+runtime, GPU drivers, K3S itself, base packages, storage class
+host bits) belongs in Nix. Everything else belongs in K8s manifests
+ArgoCD reconciles.
+
+## Bootstrap end-to-end
+
+### 1. Build the installer ISO (one-time, on your workstation)
+
+```bash
+cd full-ai-cluster
+nix build .#installer-iso
+# Output: ./result/iso/zeta-installer-24.11.iso (~1.5-2 GB)
+```
+
+If you're on Apple Silicon and don't yet have the linux-builder
+running, apply the nix-darwin config first:
+
+```bash
+nix run nix-darwin/nix-darwin-24.11#darwin-rebuild -- switch --flake .#zeta-mac
+```
+
+### 2. Write to USB stick
+
+```bash
+# macOS:
+diskutil list
+diskutil unmountDisk /dev/diskN          # N = your USB device number
+sudo dd if=result/iso/zeta-installer-*.iso of=/dev/rdiskN bs=4m status=progress
+diskutil eject /dev/diskN
+
+# Linux:
+lsblk
+sudo dd if=result/iso/zeta-installer-*.iso of=/dev/sdX bs=4M status=progress conv=fsync
+sync
+```
+
+(Or use Balena Etcher / Rufus for a GUI — same outcome.)
+
+### 3. Install on each target machine
+
+Boot the target on the USB stick. Then at the console:
+
+```bash
+# Network up:
+nmtui
+
+# Pick the target disk + partition (parted/gptfdisk/zfs all on the stick).
+# Example minimal layout (single ext4 + EFI):
+sgdisk --zap-all /dev/sda
+sgdisk -n 1:0:+512M -t 1:ef00 -c 1:boot /dev/sda
+sgdisk -n 2:0:0     -t 2:8300 -c 2:nixos /dev/sda
+mkfs.fat -F 32 -n boot /dev/sda1
+mkfs.ext4 -L nixos /dev/sda2
+mount /dev/disk/by-label/nixos /mnt
+mkdir -p /mnt/boot && mount /dev/disk/by-label/boot /mnt/boot
+
+# Clone the cluster flake:
+git clone https://github.com/Lucent-Financial-Group/Zeta /mnt/etc/zeta
+
+# Generate per-machine hardware config and copy into the host dir:
+nixos-generate-config --root /mnt
+cp /mnt/etc/nixos/hardware-configuration.nix \
+   /mnt/etc/zeta/full-ai-cluster/nixos/hosts/<host>/hardware-configuration.nix
+
+# Seed the K3S cluster token (control-plane only on first run):
+nixos-enter --root /mnt -- bash -c '
+  mkdir -p /var/lib/rancher/k3s/server
+  openssl rand -hex 64 > /var/lib/rancher/k3s/server/token
+  chmod 600 /var/lib/rancher/k3s/server/token
+'
+# (Copy this token to /var/lib/rancher/k3s/agent/token on every worker)
+
+# Install:
+nixos-install --flake /mnt/etc/zeta/full-ai-cluster#<host>
+# <host> = control-plane | worker-gpu | ...
+
+# Reboot. K3S, Cilium, ArgoCD, all workloads come up declaratively.
+reboot
+```
+
+### 4. Verify the cluster is alive
+
+After the control-plane reboots:
+
+```bash
+ssh zeta@control-plane.zeta.local
+sudo kubectl get nodes
+sudo kubectl -n argocd get pods
+sudo kubectl -n argocd get applications
+sudo cilium status
+sudo cilium hubble enable --ui   # if not already enabled by Helm values
+```
+
+### 5. Add more machines
+
+Repeat step 3 on each machine with the appropriate `<host>` name.
+Add new `nixosConfigurations.<host>` entries to `flake.nix` as needed.
+
+## Component status
+
+- ✅ Well-defined upstream charts (Cilium, ArgoCD, Temporal, GitLab,
+  Forgejo, Argo Workflows / Rollouts, Longhorn, CockroachDB, NATS,
+  Redis, Weaviate, Loki / Tempo / Alloy / Mimir, kube-prometheus-stack,
+  Istio, OPA, Sealed Secrets, Vault, Ollama, vLLM)
+- 🟡 Custom workloads needing your input on image + Helm chart
+  choice (Orleans Silo, Dapr Actors config, Hermes Docker image
+  with SOPS, Deepseek/Qwen Coder model deploys)
+- ⚠️ Ambiguous components that need you to confirm which upstream:
+  - **OZ** — multiple possibilities (OpenZiti? Auth0 OZ? specific
+    Aaron-built component?). Application.yaml has a TODO.
+  - **Hermes** — multiple possibilities (Cosmos IBC relayer? a
+    message broker? an Aaron-built agent?). Application.yaml has
+    a TODO with SOPS-baking-into-the-image structure already wired.
+  - **Warp** — multiple possibilities (Cloudflare Warp? Warp
+    terminal? Dagger Warp engine? an Aaron-built component?).
+  - **Hindsight** — multiple possibilities (Lockheed Martin's
+    OpenTelemetry processor? Microsoft's Hindsight? AWS?).
+
+Sharpen each ambiguous one by editing its `Application.yaml`
+to point at the right repoURL/chart/version.
+
+## Secrets
+
+- **Sealed Secrets** — store encrypted secrets directly in Git,
+  decrypted by the controller at apply time. Good for low-churn
+  config-style secrets.
+- **HashiCorp Vault** — runtime secrets injection via the Vault
+  Agent or external-secrets operator. Good for high-churn secrets
+  + rotation + audit.
+- **SOPS** — file-level encryption (age/gpg/KMS); used for
+  Hermes-image-time secrets baked at Docker build per your spec.
+
+All three coexist deliberately: different secrets have different
+lifetimes + access patterns.
+
+## Component composition
+
+| Component | NixFlake or ArgoCD | Notes |
+|---|---|---|
+| NixOS + bootloader | Nix | USB installer |
+| K3S | Nix (per-host module) | flannel + servicelb disabled (Cilium takes over) |
+| Cilium | ArgoCD | KPR, Hubble Relay + UI, BPF MASQUERADE enabled |
+| Docker | Nix (per-host module) | for non-K8s container workloads |
+| Local-path storage | Nix (per-host module) | host-path PV for stateless workloads |
+| GPU drivers (NVIDIA) | Nix (per-host module) | proprietary driver, container toolkit |
+| GPU passthrough (VFIO) | Nix (per-host module) | for VM workloads on the same hosts |
+| GPU device plugin (K8s) | Nix (per-host module) | exposes `nvidia.com/gpu`, `amd.com/gpu`, `intel.com/gpu` to pods |
+| Everything else | ArgoCD | reconciled from `k8s/applications/` |
+
+The Cilium choice DISPLACES K3S's default flannel CNI. The
+control-plane's `k3s-server.nix` passes `--flannel-backend=none`
+and `--disable-network-policy` to K3S; Cilium owns CNI, kube-proxy
+replacement, and network policy.
+
+## Updating the cluster
+
+- **OS layer** changes: edit the relevant file under `./nixos/`,
+  commit, push. Then on each target:
+  `sudo nixos-rebuild switch --flake /etc/zeta/full-ai-cluster#<host>`
+- **Cluster layer** changes: edit the relevant `Application.yaml`
+  or referenced manifest, commit, push. ArgoCD reconciles within
+  ~3 minutes.
+
+For a full cluster rebuild from scratch: this directory IS the
+desired state. Wipe everything, rerun the bootstrap, end up at
+the same place.
diff --git a/full-ai-cluster/flake.nix b/full-ai-cluster/flake.nix
new file mode 100644
index 0000000000..d388097e6f
--- /dev/null
+++ b/full-ai-cluster/flake.nix
@@ -0,0 +1,167 @@
+# full-ai-cluster/flake.nix
+#
+# End-to-end declarative AI cluster flake.
+#
+# The USB installer (./usb-nixos-installer/) is the snippet at the
+# start of this directory. After installation completes, K3S auto-
+# applies the bootstrap manifests at ./k8s/bootstrap/ which install
+# ArgoCD. ArgoCD then reconciles every other workload from
+# ./k8s/applications/.
+#
+# Bootstrap flow:
+#   1. Build USB:       nix build .#installer-iso
+#   2. Write to USB:    sudo dd if=result/iso/*.iso of=/dev/sdX bs=4M
+#   3. Boot target on USB
+#   4. Clone Zeta + nixos-install --flake .#<host>
+#   5. Reboot. K3S + Cilium + ArgoCD + everything come up declaratively.
+
+{
+  description = "Zeta full AI cluster — declarative from USB to running workloads";
+
+  inputs = {
+    nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.11";
+    nixos-hardware.url = "github:NixOS/nixos-hardware/master";
+    flake-utils.url = "github:numtide/flake-utils";
+
+    # nix-darwin pinned to matching release branch so Apple Silicon
+    # maintainers can build the x86_64-linux ISO via the linux-builder
+    # VM (Virtualization.framework + Rosetta 2).
+    nix-darwin = {
+      url = "github:nix-darwin/nix-darwin/nix-darwin-24.11";
+      inputs.nixpkgs.follows = "nixpkgs";
+    };
+  };
+
+  outputs = { self, nixpkgs, nixos-hardware, flake-utils, nix-darwin, ... }@inputs:
+    let
+      stateVersion = "24.11";
+
+      supportedSystems = [
+        "x86_64-linux"
+        "aarch64-linux"
+        "aarch64-darwin"
+      ];
+
+      isoBuildSystems = [
+        "x86_64-linux"
+        "aarch64-darwin"
+      ];
+
+      mkSystem = { system ? "x86_64-linux", modules }: nixpkgs.lib.nixosSystem {
+        inherit system;
+        specialArgs = { inherit inputs stateVersion; };
+        modules = modules;
+      };
+    in
+    {
+      # NixOS configurations: installer image + per-host targets.
+      nixosConfigurations = {
+        # USB installer ISO — identical to the standalone
+        # usb-nixos-installer/ flake at the parent level.
+        installer = mkSystem {
+          modules = [
+            ./usb-nixos-installer/nixos/installer/configuration.nix
+          ];
+        };
+
+        # Control-plane: K3S server + Cilium CNI + ArgoCD bootstrap.
+        control-plane = mkSystem {
+          modules = [
+            ./nixos/hosts/control-plane/configuration.nix
+          ];
+        };
+
+        # GPU worker template. Duplicate this entry per physical worker
+        # (worker-gpu-01, worker-gpu-02, ...) once hardware-configuration
+        # files for each are committed.
+        worker-gpu = mkSystem {
+          modules = [
+            ./nixos/hosts/worker-gpu/configuration.nix
+          ];
+        };
+      };
+
+      # Shared NixOS modules — per-host configs import these via
+      # relative path; this output exposes them so external flakes
+      # can reuse the same modules.
+      nixosModules = {
+        common = ./nixos/modules/common.nix;
+        k3s-server = ./nixos/modules/k3s-server.nix;
+        k3s-agent = ./nixos/modules/k3s-agent.nix;
+        gpu = ./nixos/modules/gpu.nix;
+        gpu-passthrough = ./nixos/modules/gpu-passthrough.nix;
+        gpu-device-plugin = ./nixos/modules/gpu-device-plugin.nix;
+        docker = ./nixos/modules/docker.nix;
+        local-storage = ./nixos/modules/local-storage.nix;
+      };
+
+      # nix-darwin config for maintainer Macs (Apple Silicon). Enables
+      # the linux-builder VM so `nix build .#installer-iso` works
+      # locally without Parallels / Lima / remote builders.
+      darwinConfigurations.zeta-mac = nix-darwin.lib.darwinSystem {
+        system = "aarch64-darwin";
+        specialArgs = { inherit inputs; };
+        modules = [
+          ({ pkgs, lib, ... }: {
+            nix.settings = {
+              experimental-features = [ "nix-command" "flakes" ];
+              trusted-users = [ "@admin" ];
+              extra-platforms = [ "x86_64-linux" ];
+            };
+            nix.linux-builder = {
+              enable = true;
+              ephemeral = false;
+              maxJobs = 4;
+              supportedFeatures = [ "kvm" "benchmark" "big-parallel" ];
+              config = {
+                virtualisation = {
+                  darwin-builder = {
+                    diskSize = 40 * 1024;
+                    memorySize = 8 * 1024;
+                  };
+                  cores = 6;
+                };
+              };
+            };
+            environment.systemPackages = with pkgs; [
+              git gh jq yq-go ripgrep fd htop
+              kubectl kubernetes-helm k9s argocd
+              age sops ssh-to-age
+              nix-output-monitor nvd nh
+            ];
+            system.stateVersion = 5;
+            nixpkgs.hostPlatform = "aarch64-darwin";
+          })
+        ];
+      };
+    } // flake-utils.lib.eachSystem supportedSystems (system:
+      let
+        pkgs = import nixpkgs { inherit system; };
+      in
+      {
+        packages = nixpkgs.lib.optionalAttrs (builtins.elem system isoBuildSystems) {
+          installer-iso =
+            self.nixosConfigurations.installer.config.system.build.isoImage;
+          default = self.packages.${system}.installer-iso;
+        };
+
+        devShells.default = pkgs.mkShell {
+          name = "zeta-ai-cluster-admin";
+          packages = with pkgs; [
+            nix-output-monitor nvd nh
+            kubectl kubernetes-helm k9s argocd
+            cilium-cli hubble
+            age sops ssh-to-age
+            git gh jq yq-go ripgrep fd
+          ];
+          shellHook = ''
+            echo "zeta-ai-cluster admin shell."
+            echo "  Build USB ISO:        nix build .#installer-iso"
+            echo "  Build host system:    nixos-rebuild build --flake .#<host>"
+            echo "  Talk to cluster:      kubectl / k9s / argocd / cilium / hubble"
+          '';
+        };
+
+        formatter = pkgs.nixpkgs-fmt;
+      });
+}
diff --git a/full-ai-cluster/k8s/applications/alloy/Application.yaml b/full-ai-cluster/k8s/applications/alloy/Application.yaml
new file mode 100644
index 0000000000..54f28525fa
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/alloy/Application.yaml
@@ -0,0 +1,46 @@
+# Alloy — Grafana's OpenTelemetry collector. Ships logs to Loki,
+# metrics to Mimir/Prometheus, traces to Tempo.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: alloy
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://grafana.github.io/helm-charts
+    chart: alloy
+    targetRevision: 0.10.1
+    helm:
+      releaseName: alloy
+      valuesObject:
+        controller:
+          type: daemonset   # collect from every node
+        alloy:
+          configMap:
+            create: true
+            content: |
+              logging {
+                level = "info"
+                format = "logfmt"
+              }
+              // Logs -> Loki
+              loki.write "loki" {
+                endpoint { url = "http://loki.loki.svc.cluster.local:3100/loki/api/v1/push" }
+              }
+              // Traces -> Tempo
+              otelcol.exporter.otlp "tempo" {
+                client { endpoint = "http://tempo.tempo.svc.cluster.local:4317" }
+              }
+              // Metrics -> Mimir (or fall back to local Prometheus)
+              prometheus.remote_write "mimir" {
+                endpoint { url = "http://mimir-distributor.mimir.svc.cluster.local:8080/api/v1/push" }
+              }
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: alloy
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/argo-rollouts/Application.yaml b/full-ai-cluster/k8s/applications/argo-rollouts/Application.yaml
new file mode 100644
index 0000000000..98205a6887
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/argo-rollouts/Application.yaml
@@ -0,0 +1,30 @@
+# Argo Rollouts — progressive delivery (canary, blue-green).
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: argo-rollouts
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://argoproj.github.io/argo-helm
+    chart: argo-rollouts
+    targetRevision: 2.39.5
+    helm:
+      releaseName: argo-rollouts
+      valuesObject:
+        controller:
+          replicas: 1
+        dashboard:
+          enabled: true
+          serviceType: ClusterIP
+        notifications:
+          enabled: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: argo-rollouts
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/argo-workflows/Application.yaml b/full-ai-cluster/k8s/applications/argo-workflows/Application.yaml
new file mode 100644
index 0000000000..9f19ab65a9
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/argo-workflows/Application.yaml
@@ -0,0 +1,34 @@
+# Argo Workflows — DAG job scheduler. Drives AI pipelines + batch jobs.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: argo-workflows
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://argoproj.github.io/argo-helm
+    chart: argo-workflows
+    targetRevision: 0.42.5
+    helm:
+      releaseName: argo-workflows
+      valuesObject:
+        controller:
+          workflowDefaults:
+            spec:
+              activeDeadlineSeconds: 86400
+              ttlStrategy: { secondsAfterCompletion: 604800 }
+              podGC: { strategy: OnPodCompletion }
+          parallelism: 50
+        server:
+          authModes: [ server ]
+        executor:
+          image: { tag: "" }
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: argo-workflows
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/cilium/Application.yaml b/full-ai-cluster/k8s/applications/cilium/Application.yaml
new file mode 100644
index 0000000000..1a008a576f
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/cilium/Application.yaml
@@ -0,0 +1,69 @@
+# Cilium — CNI + KPR + Hubble + BPF MASQUERADE.
+# Replaces K3S's default flannel + kube-proxy (disabled in k3s-server.nix).
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: cilium
+  namespace: argocd
+  finalizers:
+    - resources-finalizer.argocd.argoproj.io
+spec:
+  project: default
+  source:
+    repoURL: https://helm.cilium.io/
+    chart: cilium
+    targetRevision: 1.16.5
+    helm:
+      releaseName: cilium
+      valuesObject:
+        kubeProxyReplacement: true
+        k8sServiceHost: control-plane.zeta.local
+        k8sServicePort: 6443
+        ipam:
+          mode: cluster-pool
+          operator:
+            clusterPoolIPv4PodCIDRList: [ "10.42.0.0/16" ]
+
+        # BPF MASQUERADE (required by the spec).
+        bpf:
+          masquerade: true
+
+        # Routing: native routing is faster than VXLAN encapsulation
+        # when the underlying network supports it.
+        routingMode: native
+        ipv4NativeRoutingCIDR: "10.42.0.0/16"
+        autoDirectNodeRoutes: true
+
+        # Hubble (observability + Hubble Relay + Hubble UI).
+        hubble:
+          enabled: true
+          relay:
+            enabled: true
+          ui:
+            enabled: true
+          metrics:
+            enabled:
+              - dns
+              - drop
+              - tcp
+              - flow
+              - icmp
+              - http
+            enableOpenMetrics: true
+            serviceMonitor:
+              enabled: false   # enabled once kube-prometheus-stack lands
+
+        operator:
+          replicas: 1   # bump to 2 once a second control-plane exists
+
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: kube-system
+  syncPolicy:
+    automated:
+      prune: false   # never prune Cilium — too load-bearing
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=true
+      - ServerSideApply=true
diff --git a/full-ai-cluster/k8s/applications/cockroachdb/Application.yaml b/full-ai-cluster/k8s/applications/cockroachdb/Application.yaml
new file mode 100644
index 0000000000..53f3f65f86
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/cockroachdb/Application.yaml
@@ -0,0 +1,43 @@
+# CockroachDB — distributed SQL. Backs Temporal + GitLab + Forgejo
+# + any workload wanting strong consistency + horizontal scale.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: cockroachdb
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://charts.cockroachdb.com/
+    chart: cockroachdb
+    targetRevision: 14.0.5
+    helm:
+      releaseName: cockroachdb
+      valuesObject:
+        statefulset:
+          replicas: 3
+        storage:
+          persistentVolume:
+            enabled: true
+            size: 100Gi
+            storageClass: longhorn
+        conf:
+          single-node: false
+          cache: 25%
+          max-sql-memory: 25%
+        tls:
+          enabled: true
+          certs:
+            selfSigner:
+              enabled: true
+              caProvided: false
+        prometheus:
+          enabled: true   # for kube-prometheus-stack scrape
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: cockroachdb
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/dapr/Application.yaml b/full-ai-cluster/k8s/applications/dapr/Application.yaml
new file mode 100644
index 0000000000..4bfd9f9efd
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/dapr/Application.yaml
@@ -0,0 +1,32 @@
+# Dapr — distributed actor runtime #3.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: dapr
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://dapr.github.io/helm-charts/
+    chart: dapr
+    targetRevision: 1.14.4
+    helm:
+      releaseName: dapr
+      valuesObject:
+        global:
+          ha: { enabled: false }   # bump once multi-control-plane
+        dapr_placement:
+          replicaCount: 1
+        # Actors persistence: configure a Redis/CockroachDB state
+        # store via a Dapr Component once those land. Example:
+        # configuration:
+        #   metricSpec:
+        #     enabled: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: dapr-system
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/deepseek-coder/Application.yaml b/full-ai-cluster/k8s/applications/deepseek-coder/Application.yaml
new file mode 100644
index 0000000000..a033fbed35
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/deepseek-coder/Application.yaml
@@ -0,0 +1,28 @@
+# Deepseek Coder — model deploy via Ollama OR vLLM.
+#
+# This Application is structural — the model itself is pulled by
+# whichever serving stack you chose (Ollama via the `models.pull:`
+# in its Helm values, OR vLLM via its `--model` arg). This
+# Application just declares the namespace + an info ConfigMap so
+# operators can grep for the model deploy via the same tooling.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: deepseek-coder
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://github.com/Lucent-Financial-Group/Zeta
+    targetRevision: main
+    path: full-ai-cluster/k8s/applications/deepseek-coder
+    directory:
+      include: '{namespace,configmap}.yaml'
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: models
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/deepseek-coder/configmap.yaml b/full-ai-cluster/k8s/applications/deepseek-coder/configmap.yaml
new file mode 100644
index 0000000000..bae1b3a391
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/deepseek-coder/configmap.yaml
@@ -0,0 +1,18 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: models
+  labels: { app.kubernetes.io/part-of: zeta-ai }
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: deepseek-coder
+  namespace: models
+data:
+  model: "deepseek-coder:33b"
+  served-by: "ollama|vllm"   # set to whichever you enabled
+  endpoint-ollama: "http://ollama.ollama.svc.cluster.local:11434"
+  endpoint-vllm:   "http://vllm.vllm.svc.cluster.local:8000"
+  size-vram-gb: "24"   # rough VRAM target for q4 quant
+  license: "deepseek-license-v1"
diff --git a/full-ai-cluster/k8s/applications/forgejo/Application.yaml b/full-ai-cluster/k8s/applications/forgejo/Application.yaml
new file mode 100644
index 0000000000..d0651ebcec
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/forgejo/Application.yaml
@@ -0,0 +1,32 @@
+# Forgejo — self-hosted Git. Lighter than GitLab; pick one.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: forgejo
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://code.forgejo.org/forgejo-helm/
+    chart: forgejo
+    targetRevision: 9.0.6
+    helm:
+      releaseName: forgejo
+      valuesObject:
+        gitea:
+          admin:
+            existingSecret: forgejo-initial-admin
+        persistence:
+          enabled: true
+          size: 20Gi
+          storageClass: zeta-local-path   # or longhorn for HA
+        postgresql:
+          enabled: true   # ships its own postgres; swap to CockroachDB later
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: forgejo
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/gitlab/Application.yaml b/full-ai-cluster/k8s/applications/gitlab/Application.yaml
new file mode 100644
index 0000000000..957b1fb820
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/gitlab/Application.yaml
@@ -0,0 +1,39 @@
+# GitLab CE — self-hosted Git. Heavier than Forgejo; pick one.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: gitlab
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://charts.gitlab.io/
+    chart: gitlab
+    targetRevision: 8.7.0
+    helm:
+      releaseName: gitlab
+      valuesObject:
+        global:
+          edition: ce
+          hosts:
+            domain: gitlab.zeta.local
+          ingress: { tls: { enabled: false } }   # turn on after cert-manager
+          initialRootPassword:
+            secret: gitlab-initial-root-password
+            key: password
+        # Disable bundled components — cluster has its own.
+        certmanager:    { install: false }
+        nginx-ingress:  { enabled: false }
+        prometheus:     { install: false }
+        gitlab-runner:
+          install: true
+          runners: { privileged: false, tags: "kubernetes,zeta-cluster" }
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: gitlab
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true, SkipDryRunOnMissingResource=true ]
+    retry: { limit: 5, backoff: { duration: 30s, factor: 2, maxDuration: 5m } }
diff --git a/full-ai-cluster/k8s/applications/hermes/Application.yaml b/full-ai-cluster/k8s/applications/hermes/Application.yaml
new file mode 100644
index 0000000000..5cd833eb9e
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/hermes/Application.yaml
@@ -0,0 +1,37 @@
+# Hermes — TODO: AMBIGUOUS COMPONENT.
+#
+# Possibilities:
+#   - Cosmos Hermes IBC relayer (https://github.com/informalsystems/hermes)
+#   - Comma.ai Hermes
+#   - Hermes message broker (multiple projects with this name)
+#   - An Aaron-specific Hermes (AI agent, terminal-tooling, etc.)
+#
+# Per spec: "integrated with OZ" + "SOPS into Hermes Docker image" +
+# "Hermes access to Ollama or vLLM" — these hint at an AI-agent
+# Hermes that talks to OZ + needs Ollama/vLLM access + has secrets
+# baked at image-build via SOPS.
+#
+# This Application points at local manifests so you can drop the
+# real image + supporting config without changing the Application
+# itself.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: hermes
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://github.com/Lucent-Financial-Group/Zeta
+    targetRevision: main
+    path: full-ai-cluster/k8s/applications/hermes
+    directory:
+      include: '{namespace,deployment,service,rbac}.yaml'
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: hermes
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/hermes/deployment.yaml b/full-ai-cluster/k8s/applications/hermes/deployment.yaml
new file mode 100644
index 0000000000..1ee6d84692
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/hermes/deployment.yaml
@@ -0,0 +1,63 @@
+# Hermes deployment with SOPS-baked secrets via image build.
+#
+# The image at `ghcr.io/lucent-financial-group/zeta-hermes` is
+# expected to be built via the Docker module (NixFlake) on a
+# maintainer host with SOPS-decrypted secrets baked in at build
+# time (rather than mounted at runtime). The build pipeline:
+#
+#   1. age decrypt encrypted/* → secrets/  (locally, via sops)
+#   2. docker buildx build --secret id=hermes-secrets,src=secrets/ ...
+#   3. docker push ghcr.io/lucent-financial-group/zeta-hermes:vN.N.N
+#   4. Bump `image:` below + commit + push
+#
+# Hermes' connections (placeholders — wire after OZ + Ollama land):
+#   - OZ:        oz-router.oz.svc.cluster.local:443
+#   - Ollama:    ollama.ollama.svc.cluster.local:11434
+#   - vLLM:      vllm.vllm.svc.cluster.local:8000
+
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: hermes
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: hermes
+  namespace: hermes
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: hermes
+  namespace: hermes
+spec:
+  replicas: 0   # set >=1 once a real image exists
+  selector:
+    matchLabels: { app.kubernetes.io/name: hermes }
+  template:
+    metadata:
+      labels: { app.kubernetes.io/name: hermes }
+    spec:
+      serviceAccountName: hermes
+      containers:
+        - name: hermes
+          image: ghcr.io/lucent-financial-group/zeta-hermes:placeholder
+          env:
+            - { name: OZ_ENDPOINT,      value: "oz-router.oz.svc.cluster.local:443" }
+            - { name: OLLAMA_ENDPOINT,  value: "http://ollama.ollama.svc.cluster.local:11434" }
+            - { name: VLLM_ENDPOINT,    value: "http://vllm.vllm.svc.cluster.local:8000" }
+          resources:
+            requests: { cpu: "200m", memory: "256Mi" }
+            limits:   { cpu: "1",    memory: "1Gi"   }
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: hermes
+  namespace: hermes
+spec:
+  type: ClusterIP
+  selector: { app.kubernetes.io/name: hermes }
+  ports:
+    - { name: http, port: 80, targetPort: 8080 }
diff --git a/full-ai-cluster/k8s/applications/hindsight/Application.yaml b/full-ai-cluster/k8s/applications/hindsight/Application.yaml
new file mode 100644
index 0000000000..b211de4609
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/hindsight/Application.yaml
@@ -0,0 +1,35 @@
+# Hindsight — TODO: AMBIGUOUS COMPONENT.
+#
+# Multiple possibilities in 2026:
+#   - Hindsight (Lockheed Martin) — OpenTelemetry trace processor
+#     for tail-sampling: https://github.com/lockheed-martin/hindsight
+#   - Hindsight (Microsoft) — observability tool
+#   - Some other "Hindsight" specific to Aaron's stack
+#
+# This placeholder Application points at the Lockheed Martin
+# Hindsight (most likely match given OpenTelemetry context with
+# Loki/Tempo/Alloy/Mimir all on this cluster). UPDATE the
+# repoURL + chart fields below to whichever Hindsight you mean.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: hindsight
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    # TODO(aaron): confirm Hindsight identity. Placeholder points at
+    # the Lockheed Martin OTel trace processor manifests:
+    repoURL: https://github.com/Lucent-Financial-Group/Zeta
+    targetRevision: main
+    path: full-ai-cluster/k8s/applications/hindsight
+    directory:
+      include: '{namespace,deployment,service}.yaml'
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: hindsight
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/istio/Application.yaml b/full-ai-cluster/k8s/applications/istio/Application.yaml
new file mode 100644
index 0000000000..9f25a77fc1
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/istio/Application.yaml
@@ -0,0 +1,31 @@
+# Istio — service mesh. Sidecar OR ambient mode (ambient = no
+# sidecar per pod; lighter). Cilium handles low-level networking;
+# Istio handles L7 routing / auth / mTLS.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: istio
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://istio-release.storage.googleapis.com/charts
+    chart: base
+    targetRevision: 1.24.0
+    helm:
+      releaseName: istio-base
+      valuesObject:
+        defaultRevision: default
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: istio-system
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
+# NOTE: Istio actually needs THREE Helm releases to be useful
+# (base, istiod, gateway). This Application installs the CRDs (base);
+# add istio-istiod/Application.yaml + istio-gateway/Application.yaml
+# in follow-up dirs once base is healthy. Splitting the apps lets
+# ArgoCD sync them in the right order.
diff --git a/full-ai-cluster/k8s/applications/kube-prometheus-stack/Application.yaml b/full-ai-cluster/k8s/applications/kube-prometheus-stack/Application.yaml
new file mode 100644
index 0000000000..408c3deea8
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/kube-prometheus-stack/Application.yaml
@@ -0,0 +1,46 @@
+# Prometheus + Grafana + Alertmanager (kube-prometheus-stack).
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: kube-prometheus-stack
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://prometheus-community.github.io/helm-charts
+    chart: kube-prometheus-stack
+    targetRevision: 65.5.0
+    helm:
+      releaseName: kube-prometheus-stack
+      valuesObject:
+        prometheus:
+          prometheusSpec:
+            retention: 15d
+            storageSpec:
+              volumeClaimTemplate:
+                spec:
+                  storageClassName: longhorn
+                  accessModes: [ ReadWriteOnce ]
+                  resources: { requests: { storage: 100Gi } }
+        grafana:
+          adminPassword: changeme   # override via Sealed Secret or Vault
+          persistence:
+            enabled: true
+            storageClassName: longhorn
+            size: 10Gi
+        alertmanager:
+          alertmanagerSpec:
+            storage:
+              volumeClaimTemplate:
+                spec:
+                  storageClassName: longhorn
+                  accessModes: [ ReadWriteOnce ]
+                  resources: { requests: { storage: 10Gi } }
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: monitoring
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/loki/Application.yaml b/full-ai-cluster/k8s/applications/loki/Application.yaml
new file mode 100644
index 0000000000..7ee2b2b375
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/loki/Application.yaml
@@ -0,0 +1,45 @@
+# Loki — log aggregation.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: loki
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://grafana.github.io/helm-charts
+    chart: loki
+    targetRevision: 6.18.0
+    helm:
+      releaseName: loki
+      valuesObject:
+        deploymentMode: SimpleScalable
+        loki:
+          auth_enabled: false
+          schemaConfig:
+            configs:
+              - from: "2024-01-01"
+                store: tsdb
+                object_store: s3
+                schema: v13
+                index: { prefix: loki_index_, period: 24h }
+          storage:
+            type: s3
+            # TODO: point at an in-cluster S3 (MinIO, SeaweedFS) or
+            # external S3-compatible storage. For now, bucket lives
+            # in a placeholder values-set you wire after picking
+            # object-storage backend.
+            bucketNames: { chunks: loki-chunks, ruler: loki-ruler }
+        write: { replicas: 2 }
+        read:  { replicas: 2 }
+        backend: { replicas: 2 }
+        chunksCache: { enabled: true }
+        resultsCache: { enabled: true }
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: loki
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/longhorn/Application.yaml b/full-ai-cluster/k8s/applications/longhorn/Application.yaml
new file mode 100644
index 0000000000..bdbce9cf78
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/longhorn/Application.yaml
@@ -0,0 +1,34 @@
+# Longhorn — distributed block storage. Stateful workloads land here
+# (Postgres for GitLab, CockroachDB data dirs, Weaviate, etc.).
+# Local-path storage (NixFlake module) covers stateless workloads.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: longhorn
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://charts.longhorn.io
+    chart: longhorn
+    targetRevision: 1.7.2
+    helm:
+      releaseName: longhorn
+      valuesObject:
+        defaultSettings:
+          defaultDataPath: /var/lib/longhorn
+          defaultReplicaCount: 2   # bump to 3 once 3+ workers exist
+        persistence:
+          defaultClass: false
+          defaultClassReplicaCount: 2
+          reclaimPolicy: Retain
+        ingress:
+          enabled: false   # turn on after cert-manager + ingress-nginx
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: longhorn-system
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }   # never prune storage
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/mimir/Application.yaml b/full-ai-cluster/k8s/applications/mimir/Application.yaml
new file mode 100644
index 0000000000..312d1acdb0
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/mimir/Application.yaml
@@ -0,0 +1,46 @@
+# Mimir — long-term metrics storage. Complements kube-prometheus-stack
+# which keeps recent metrics; Mimir holds the long tail for trend
+# analysis.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: mimir
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://grafana.github.io/helm-charts
+    chart: mimir-distributed
+    targetRevision: 5.5.1
+    helm:
+      releaseName: mimir
+      valuesObject:
+        metaMonitoring:
+          serviceMonitor: { enabled: false }
+        # MinIO bundled for object storage. Swap to external S3
+        # once you stand up real object storage.
+        minio:
+          enabled: true
+          persistence:
+            storageClass: longhorn
+            size: 100Gi
+        ingester:
+          persistentVolume:
+            storageClass: longhorn
+            size: 50Gi
+        store_gateway:
+          persistentVolume:
+            storageClass: longhorn
+            size: 50Gi
+        compactor:
+          persistentVolume:
+            storageClass: longhorn
+            size: 50Gi
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: mimir
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/nats/Application.yaml b/full-ai-cluster/k8s/applications/nats/Application.yaml
new file mode 100644
index 0000000000..c2102a6b1f
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/nats/Application.yaml
@@ -0,0 +1,35 @@
+# NATS — messaging.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: nats
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://nats-io.github.io/k8s/helm/charts/
+    chart: nats
+    targetRevision: 1.2.7
+    helm:
+      releaseName: nats
+      valuesObject:
+        config:
+          jetstream:
+            enabled: true
+            fileStore:
+              pvc:
+                size: 20Gi
+                storageClassName: longhorn
+        cluster:
+          enabled: true
+          replicas: 3
+        natsBox:
+          enabled: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: nats
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/ollama/Application.yaml b/full-ai-cluster/k8s/applications/ollama/Application.yaml
new file mode 100644
index 0000000000..ab3347ebc0
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/ollama/Application.yaml
@@ -0,0 +1,48 @@
+# Ollama — LLM serving (option A).
+# Schedules onto worker-gpu nodes via nvidia.com/gpu request.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: ollama
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://otwld.github.io/ollama-helm/
+    chart: ollama
+    targetRevision: 1.6.0
+    helm:
+      releaseName: ollama
+      valuesObject:
+        ollama:
+          gpu:
+            enabled: true
+            type: nvidia
+            number: 1
+          models:
+            pull:
+              - deepseek-coder:33b
+              - qwen2.5-coder:32b
+            run:
+              - deepseek-coder:33b
+              - qwen2.5-coder:32b
+        persistentVolume:
+          enabled: true
+          size: 200Gi
+          storageClass: longhorn
+        nodeSelector:
+          zeta.io/gpu: nvidia
+        resources:
+          requests: { cpu: "2", memory: "8Gi", "nvidia.com/gpu": 1 }
+          limits:   { cpu: "8", memory: "32Gi", "nvidia.com/gpu": 1 }
+        service:
+          type: ClusterIP
+          port: 11434
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: ollama
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/open-policy-agent/Application.yaml b/full-ai-cluster/k8s/applications/open-policy-agent/Application.yaml
new file mode 100644
index 0000000000..5e6cecd10c
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/open-policy-agent/Application.yaml
@@ -0,0 +1,28 @@
+# Open Policy Agent (Gatekeeper) — admission policy.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: open-policy-agent
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://open-policy-agent.github.io/gatekeeper/charts
+    chart: gatekeeper
+    targetRevision: 3.18.1
+    helm:
+      releaseName: gatekeeper
+      valuesObject:
+        replicas: 3
+        controllerManager:
+          metricsPort: 8888
+        validatingWebhookFailurePolicy: Fail
+        mutatingWebhookFailurePolicy: Ignore   # safer default while iterating
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: gatekeeper-system
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/orleans/Application.yaml b/full-ai-cluster/k8s/applications/orleans/Application.yaml
new file mode 100644
index 0000000000..58613554b6
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/orleans/Application.yaml
@@ -0,0 +1,33 @@
+# Orleans — distributed actor runtime / "distributed cron" #1.
+#
+# Orleans doesn't have an official Helm chart. The pattern is to
+# build a custom Silo image embedding your grain code and deploy it
+# as a StatefulSet. This Application points at the manifests in
+# the same directory; bump replicas to >=1 and replace the image
+# once you have a published silo image.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: orleans
+  namespace: argocd
+  finalizers:
+    - resources-finalizer.argocd.argoproj.io
+spec:
+  project: default
+  source:
+    repoURL: https://github.com/Lucent-Financial-Group/Zeta
+    targetRevision: main
+    path: full-ai-cluster/k8s/applications/orleans
+    directory:
+      include: '{namespace,rbac,configmap,service,statefulset}.yaml'
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: orleans
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=true
+      - ServerSideApply=true
diff --git a/full-ai-cluster/k8s/applications/orleans/configmap.yaml b/full-ai-cluster/k8s/applications/orleans/configmap.yaml
new file mode 100644
index 0000000000..6b8f808a0f
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/orleans/configmap.yaml
@@ -0,0 +1,14 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: orleans-config
+  namespace: orleans
+data:
+  cluster.json: |
+    {
+      "serviceId": "zeta",
+      "clusterId": "zeta-prod",
+      "silo": { "siloPort": 11111, "gatewayPort": 30000 },
+      "clustering": { "provider": "kubernetes", "namespace": "orleans" },
+      "telemetry": { "dashboard": { "enabled": true, "port": 8080 } }
+    }
diff --git a/full-ai-cluster/k8s/applications/orleans/namespace.yaml b/full-ai-cluster/k8s/applications/orleans/namespace.yaml
new file mode 100644
index 0000000000..d74fa38ee5
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/orleans/namespace.yaml
@@ -0,0 +1,7 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: orleans
+  labels:
+    app.kubernetes.io/part-of: zeta
+    zeta.io/distributed-cron: "true"
diff --git a/full-ai-cluster/k8s/applications/orleans/rbac.yaml b/full-ai-cluster/k8s/applications/orleans/rbac.yaml
new file mode 100644
index 0000000000..e2ecf6d6c0
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/orleans/rbac.yaml
@@ -0,0 +1,35 @@
+# Orleans needs to discover sibling silos via the K8s API
+# (Orleans Kubernetes clustering provider).
+
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: orleans-silo
+  namespace: orleans
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: orleans-silo
+  namespace: orleans
+rules:
+  - apiGroups: [""]
+    resources: ["pods", "endpoints"]
+    verbs: ["get", "list", "watch"]
+  - apiGroups: ["apps"]
+    resources: ["statefulsets"]
+    verbs: ["get", "list", "watch"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  name: orleans-silo
+  namespace: orleans
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: Role
+  name: orleans-silo
+subjects:
+  - kind: ServiceAccount
+    name: orleans-silo
+    namespace: orleans
diff --git a/full-ai-cluster/k8s/applications/orleans/service.yaml b/full-ai-cluster/k8s/applications/orleans/service.yaml
new file mode 100644
index 0000000000..2a62577556
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/orleans/service.yaml
@@ -0,0 +1,33 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: orleans-silo
+  namespace: orleans
+spec:
+  clusterIP: None
+  selector: { app.kubernetes.io/name: orleans-silo }
+  ports:
+    - { name: silo, port: 11111, targetPort: silo }
+    - { name: gateway, port: 30000, targetPort: gateway }
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: orleans-gateway
+  namespace: orleans
+spec:
+  type: ClusterIP
+  selector: { app.kubernetes.io/name: orleans-silo }
+  ports:
+    - { name: gateway, port: 30000, targetPort: gateway }
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: orleans-dashboard
+  namespace: orleans
+spec:
+  type: ClusterIP
+  selector: { app.kubernetes.io/name: orleans-silo }
+  ports:
+    - { name: http, port: 8080, targetPort: dashboard }
diff --git a/full-ai-cluster/k8s/applications/orleans/statefulset.yaml b/full-ai-cluster/k8s/applications/orleans/statefulset.yaml
new file mode 100644
index 0000000000..46fb815da5
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/orleans/statefulset.yaml
@@ -0,0 +1,50 @@
+# Orleans Silo StatefulSet. Replicas start at 0 until you publish
+# a real silo image. Replace `image:` and bump `replicas:` >= 1.
+
+apiVersion: apps/v1
+kind: StatefulSet
+metadata:
+  name: orleans-silo
+  namespace: orleans
+spec:
+  serviceName: orleans-silo
+  replicas: 0
+  selector:
+    matchLabels: { app.kubernetes.io/name: orleans-silo }
+  podManagementPolicy: Parallel
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: orleans-silo
+        orleans.io/serviceId: zeta
+        orleans.io/clusterId: zeta-prod
+    spec:
+      serviceAccountName: orleans-silo
+      topologySpreadConstraints:
+        - maxSkew: 1
+          topologyKey: kubernetes.io/hostname
+          whenUnsatisfiable: ScheduleAnyway
+          labelSelector:
+            matchLabels: { app.kubernetes.io/name: orleans-silo }
+      containers:
+        - name: silo
+          image: ghcr.io/lucent-financial-group/zeta-orleans-silo:latest
+          ports:
+            - { name: silo, containerPort: 11111 }
+            - { name: gateway, containerPort: 30000 }
+            - { name: dashboard, containerPort: 8080 }
+          env:
+            - { name: ORLEANS_SERVICE_ID, value: zeta }
+            - { name: ORLEANS_CLUSTER_ID, value: zeta-prod }
+            - { name: POD_NAME, valueFrom: { fieldRef: { fieldPath: metadata.name } } }
+            - { name: POD_NAMESPACE, valueFrom: { fieldRef: { fieldPath: metadata.namespace } } }
+            - { name: POD_IP, valueFrom: { fieldRef: { fieldPath: status.podIP } } }
+          volumeMounts:
+            - { name: config, mountPath: /etc/orleans, readOnly: true }
+          resources:
+            requests: { cpu: "500m", memory: "512Mi" }
+            limits:   { cpu: "2",    memory: "2Gi"  }
+          livenessProbe:  { tcpSocket: { port: silo }, initialDelaySeconds: 30, periodSeconds: 15 }
+          readinessProbe: { tcpSocket: { port: gateway }, initialDelaySeconds: 10, periodSeconds: 5 }
+      volumes:
+        - { name: config, configMap: { name: orleans-config } }
diff --git a/full-ai-cluster/k8s/applications/oz/Application.yaml b/full-ai-cluster/k8s/applications/oz/Application.yaml
new file mode 100644
index 0000000000..e9a1a1772b
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/oz/Application.yaml
@@ -0,0 +1,34 @@
+# OZ — TODO: AMBIGUOUS COMPONENT.
+#
+# "OZ" maps to several possibilities in 2026:
+#   - OpenZiti (zero-trust networking): https://openziti.io/
+#   - An Aaron-specific AI agent named OZ
+#   - Auth0's OZ component
+#
+# This placeholder uses the local k8s/applications/oz/ path so you
+# can drop in real manifests by editing the files in this directory.
+# UPDATE repoURL + chart once the identity is confirmed.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: oz
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    # TODO(aaron): confirm OZ identity. If OpenZiti, switch to:
+    #   repoURL: https://docs.openziti.io/helm-charts/
+    #   chart: ziti-controller
+    repoURL: https://github.com/Lucent-Financial-Group/Zeta
+    targetRevision: main
+    path: full-ai-cluster/k8s/applications/oz
+    directory:
+      include: '{namespace,deployment,service}.yaml'
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: oz
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/qwen-coder/Application.yaml b/full-ai-cluster/k8s/applications/qwen-coder/Application.yaml
new file mode 100644
index 0000000000..2968f0df9d
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/qwen-coder/Application.yaml
@@ -0,0 +1,23 @@
+# Qwen Coder — same pattern as deepseek-coder/. Served by Ollama
+# or vLLM via the same endpoints.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: qwen-coder
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://github.com/Lucent-Financial-Group/Zeta
+    targetRevision: main
+    path: full-ai-cluster/k8s/applications/qwen-coder
+    directory:
+      include: '{namespace,configmap}.yaml'
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: models
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/qwen-coder/configmap.yaml b/full-ai-cluster/k8s/applications/qwen-coder/configmap.yaml
new file mode 100644
index 0000000000..c7e4dba477
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/qwen-coder/configmap.yaml
@@ -0,0 +1,12 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: qwen-coder
+  namespace: models
+data:
+  model: "qwen2.5-coder:32b"
+  served-by: "ollama|vllm"
+  endpoint-ollama: "http://ollama.ollama.svc.cluster.local:11434"
+  endpoint-vllm:   "http://vllm.vllm.svc.cluster.local:8000"
+  size-vram-gb: "24"
+  license: "tongyi-qianwen-license"
diff --git a/full-ai-cluster/k8s/applications/redis/Application.yaml b/full-ai-cluster/k8s/applications/redis/Application.yaml
new file mode 100644
index 0000000000..7c1750676f
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/redis/Application.yaml
@@ -0,0 +1,39 @@
+# Redis — cache. Bitnami chart for the standard ergonomics.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: redis
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://charts.bitnami.com/bitnami
+    chart: redis
+    targetRevision: 20.5.0
+    helm:
+      releaseName: redis
+      valuesObject:
+        architecture: replication
+        auth:
+          enabled: true
+          existingSecret: redis-auth   # create via Sealed Secret / Vault
+          existingSecretPasswordKey: password
+        master:
+          persistence:
+            enabled: true
+            storageClass: longhorn
+            size: 10Gi
+        replica:
+          replicaCount: 2
+          persistence:
+            enabled: true
+            storageClass: longhorn
+            size: 10Gi
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: redis
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/sealed-secrets/Application.yaml b/full-ai-cluster/k8s/applications/sealed-secrets/Application.yaml
new file mode 100644
index 0000000000..df31cc3a41
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/sealed-secrets/Application.yaml
@@ -0,0 +1,26 @@
+# Sealed Secrets — encrypted secrets in Git, decrypted by the
+# controller at apply time.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: sealed-secrets
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://bitnami-labs.github.io/sealed-secrets
+    chart: sealed-secrets
+    targetRevision: 2.16.2
+    helm:
+      releaseName: sealed-secrets
+      valuesObject:
+        fullnameOverride: sealed-secrets-controller
+        keyrenewperiod: 720h   # 30 days
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: kube-system
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }   # never prune — would orphan secrets
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/tempo/Application.yaml b/full-ai-cluster/k8s/applications/tempo/Application.yaml
new file mode 100644
index 0000000000..87dc33d692
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/tempo/Application.yaml
@@ -0,0 +1,33 @@
+# Tempo — distributed traces.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: tempo
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://grafana.github.io/helm-charts
+    chart: tempo
+    targetRevision: 1.18.0
+    helm:
+      releaseName: tempo
+      valuesObject:
+        tempo:
+          storage:
+            trace:
+              backend: local
+              local:
+                path: /var/tempo/traces
+        persistence:
+          enabled: true
+          storageClassName: longhorn
+          size: 50Gi
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: tempo
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/temporal/Application.yaml b/full-ai-cluster/k8s/applications/temporal/Application.yaml
new file mode 100644
index 0000000000..4d16e8b6e7
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/temporal/Application.yaml
@@ -0,0 +1,46 @@
+# Temporal (TS workers) — distributed cron #2.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: temporal
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://go.temporal.io/helm-charts
+    chart: temporal
+    targetRevision: 0.59.0
+    helm:
+      releaseName: temporal
+      valuesObject:
+        server:
+          replicaCount: 1
+        cassandra:
+          enabled: false
+        elasticsearch:
+          enabled: false
+        prometheus:
+          enabled: false
+        grafana:
+          enabled: false
+        # CockroachDB (via its ArgoCD app) is the persistence layer.
+        # Wire after cockroachdb/Application.yaml comes up:
+        # server:
+        #   config:
+        #     persistence:
+        #       default:
+        #         driver: sql
+        #         sql:
+        #           driver: postgres12
+        #           host: cockroachdb-public.cockroachdb.svc.cluster.local
+        #           port: 26257
+        #           database: temporal
+        #           user: root
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: temporal
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/vault/Application.yaml b/full-ai-cluster/k8s/applications/vault/Application.yaml
new file mode 100644
index 0000000000..6a0ffa492e
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/vault/Application.yaml
@@ -0,0 +1,49 @@
+# HashiCorp Vault — runtime secrets engine + Vault Agent injector
+# for pods. Pair with Sealed Secrets (above): Sealed for config-style
+# secrets, Vault for dynamic + rotated secrets + audit trail.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: vault
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://helm.releases.hashicorp.com
+    chart: vault
+    targetRevision: 0.29.1
+    helm:
+      releaseName: vault
+      valuesObject:
+        global:
+          enabled: true
+          tlsDisable: false   # generate TLS via cert-manager or self-signed
+        server:
+          ha:
+            enabled: true
+            replicas: 3
+            raft:
+              enabled: true
+              setNodeId: true
+          dataStorage:
+            enabled: true
+            storageClass: longhorn
+            size: 20Gi
+          auditStorage:
+            enabled: true
+            storageClass: longhorn
+            size: 10Gi
+        injector:
+          enabled: true
+          replicas: 2
+        ui:
+          enabled: true
+          serviceType: ClusterIP
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: vault
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/vllm/Application.yaml b/full-ai-cluster/k8s/applications/vllm/Application.yaml
new file mode 100644
index 0000000000..730b584f16
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/vllm/Application.yaml
@@ -0,0 +1,29 @@
+# vLLM — LLM serving (option B). Higher throughput than Ollama for
+# production-shape inference; harder to swap models on the fly.
+# Choose one of Ollama / vLLM (or run both targeting different model
+# tiers — Ollama for quick interactive, vLLM for high-concurrency).
+#
+# No official Helm chart from the vLLM project. We point at a
+# community chart (substratusai/llm-operator-style) or hand-rolled
+# manifests in this directory.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: vllm
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://github.com/Lucent-Financial-Group/Zeta
+    targetRevision: main
+    path: full-ai-cluster/k8s/applications/vllm
+    directory:
+      include: '{namespace,deployment,service}.yaml'
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: vllm
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/vllm/deployment.yaml b/full-ai-cluster/k8s/applications/vllm/deployment.yaml
new file mode 100644
index 0000000000..bb27f19fe3
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/vllm/deployment.yaml
@@ -0,0 +1,69 @@
+# vLLM serving deployment. Replicas start at 0 — choose a model
+# (image tag) and bump to 1+ once you've picked between Ollama and
+# vLLM (or which models each one serves).
+
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: vllm
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: vllm
+  namespace: vllm
+spec:
+  replicas: 0
+  selector:
+    matchLabels: { app.kubernetes.io/name: vllm }
+  template:
+    metadata:
+      labels: { app.kubernetes.io/name: vllm }
+    spec:
+      nodeSelector:
+        zeta.io/gpu: nvidia
+      containers:
+        - name: vllm
+          image: vllm/vllm-openai:latest
+          args:
+            # Defaults — override per model by tweaking these args:
+            - --model
+            - deepseek-ai/deepseek-coder-33b-instruct
+            - --tensor-parallel-size
+            - "1"
+            - --host
+            - "0.0.0.0"
+            - --port
+            - "8000"
+          ports:
+            - { containerPort: 8000, name: http }
+          resources:
+            requests: { cpu: "4", memory: "16Gi", "nvidia.com/gpu": 1 }
+            limits:   { cpu: "8", memory: "64Gi", "nvidia.com/gpu": 1 }
+          volumeMounts:
+            - { name: cache, mountPath: /root/.cache/huggingface }
+      volumes:
+        - name: cache
+          persistentVolumeClaim: { claimName: vllm-cache }
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: vllm-cache
+  namespace: vllm
+spec:
+  accessModes: [ ReadWriteOnce ]
+  storageClassName: longhorn
+  resources:
+    requests: { storage: 200Gi }
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: vllm
+  namespace: vllm
+spec:
+  type: ClusterIP
+  selector: { app.kubernetes.io/name: vllm }
+  ports:
+    - { name: http, port: 8000, targetPort: 8000 }
diff --git a/full-ai-cluster/k8s/applications/warp/Application.yaml b/full-ai-cluster/k8s/applications/warp/Application.yaml
new file mode 100644
index 0000000000..3e3f1f0a74
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/warp/Application.yaml
@@ -0,0 +1,32 @@
+# Warp — TODO: AMBIGUOUS COMPONENT.
+#
+# Possibilities:
+#   - Cloudflare Warp (consumer VPN)
+#   - Warp Terminal (https://www.warp.dev/) — IDE-like terminal
+#   - Dagger Warp engine (build system component)
+#   - HashiCorp Waypoint (formerly considered)
+#   - An Aaron-specific Warp (AI agent, custom orchestrator, etc.)
+#
+# Placeholder; replace repoURL + chart with the real one.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: warp
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    # TODO(aaron): confirm Warp identity.
+    repoURL: https://github.com/Lucent-Financial-Group/Zeta
+    targetRevision: main
+    path: full-ai-cluster/k8s/applications/warp
+    directory:
+      include: '{namespace,deployment,service}.yaml'
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: warp
+  syncPolicy:
+    automated: { prune: true, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/weaviate/Application.yaml b/full-ai-cluster/k8s/applications/weaviate/Application.yaml
new file mode 100644
index 0000000000..f7ac07186d
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/weaviate/Application.yaml
@@ -0,0 +1,39 @@
+# Weaviate — vector DB for RAG + embeddings.
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: weaviate
+  namespace: argocd
+  finalizers: [ resources-finalizer.argocd.argoproj.io ]
+spec:
+  project: default
+  source:
+    repoURL: https://weaviate.github.io/weaviate-helm
+    chart: weaviate
+    targetRevision: 17.6.0
+    helm:
+      releaseName: weaviate
+      valuesObject:
+        replicas: 1   # bump to 3 for HA once cluster has the headroom
+        storage:
+          size: 100Gi
+          storageClassName: longhorn
+        modules:
+          # Use the cluster's own LLM serving for vectorization +
+          # generative endpoints. Swap to text-embeddings-inference
+          # if you want a separate embedding service.
+          text2vec-ollama:
+            enabled: true
+            apiEndpoint: http://ollama.ollama.svc.cluster.local:11434
+            modelId: nomic-embed-text
+          generative-ollama:
+            enabled: true
+            apiEndpoint: http://ollama.ollama.svc.cluster.local:11434
+            modelId: qwen2.5-coder:32b
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: weaviate
+  syncPolicy:
+    automated: { prune: false, selfHeal: true }
+    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/bootstrap/argocd-install.yaml b/full-ai-cluster/k8s/bootstrap/argocd-install.yaml
new file mode 100644
index 0000000000..5b8bf7ea35
--- /dev/null
+++ b/full-ai-cluster/k8s/bootstrap/argocd-install.yaml
@@ -0,0 +1,19 @@
+# full-ai-cluster/k8s/bootstrap/argocd-install.yaml
+#
+# ArgoCD install via kustomize remote reference. K3S auto-applies
+# on first boot. Pinned to ArgoCD v2.13.2 — bump in lockstep with
+# any ArgoCD-Application charts that target a different API.
+
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+namespace: argocd
+
+resources:
+  - https://raw.githubusercontent.com/argoproj/argo-cd/v2.13.2/manifests/install.yaml
+
+# Site-specific patches go here:
+#   - cmd-params for ingress / TLS termination
+#   - resource limits for argocd-server on small control-planes
+#   - SSO config (Dex, GitHub OAuth, etc.)
+patches: []
diff --git a/full-ai-cluster/k8s/bootstrap/argocd-namespace.yaml b/full-ai-cluster/k8s/bootstrap/argocd-namespace.yaml
new file mode 100644
index 0000000000..7ba39d4b7c
--- /dev/null
+++ b/full-ai-cluster/k8s/bootstrap/argocd-namespace.yaml
@@ -0,0 +1,12 @@
+# full-ai-cluster/k8s/bootstrap/argocd-namespace.yaml
+#
+# ArgoCD namespace. Applied by K3S on first boot — must apply
+# before argocd-install.yaml.
+
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: argocd
+  labels:
+    app.kubernetes.io/name: argocd
+    app.kubernetes.io/part-of: argocd
diff --git a/full-ai-cluster/k8s/bootstrap/cilium-namespace.yaml b/full-ai-cluster/k8s/bootstrap/cilium-namespace.yaml
new file mode 100644
index 0000000000..d7957e2896
--- /dev/null
+++ b/full-ai-cluster/k8s/bootstrap/cilium-namespace.yaml
@@ -0,0 +1,12 @@
+# full-ai-cluster/k8s/bootstrap/cilium-namespace.yaml
+#
+# Cilium runs in kube-system by convention. This file just ensures
+# the namespace exists before ArgoCD's Cilium Application tries to
+# create resources in it. Applied by K3S on first boot.
+
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: kube-system
+  labels:
+    app.kubernetes.io/managed-by: zeta-bootstrap
diff --git a/full-ai-cluster/k8s/bootstrap/root-application.yaml b/full-ai-cluster/k8s/bootstrap/root-application.yaml
new file mode 100644
index 0000000000..76092eae01
--- /dev/null
+++ b/full-ai-cluster/k8s/bootstrap/root-application.yaml
@@ -0,0 +1,41 @@
+# full-ai-cluster/k8s/bootstrap/root-application.yaml
+#
+# App-of-Apps root. K3S auto-applies after ArgoCD is running.
+# ArgoCD then watches `k8s/applications/` for Application CRs and
+# reconciles everything declared there.
+#
+# Adding a workload to the cluster:
+#   1. mkdir full-ai-cluster/k8s/applications/<name>/
+#   2. Author Application.yaml + any supporting manifests
+#   3. git commit + push to main
+#   4. ArgoCD picks it up on the next sync (~3 min)
+
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: zeta-root
+  namespace: argocd
+  finalizers:
+    - resources-finalizer.argocd.argoproj.io
+spec:
+  project: default
+  source:
+    repoURL: https://github.com/Lucent-Financial-Group/Zeta
+    targetRevision: main
+    path: full-ai-cluster/k8s/applications
+    directory:
+      recurse: true
+      # Only pick up Application CRs (one per workload directory).
+      # Supporting manifests in the same dirs are reconciled by the
+      # specific app's Application, not by the root.
+      include: '{*/Application.yaml,Application.yaml}'
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: argocd
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=true
+      - ServerSideApply=true
diff --git a/full-ai-cluster/nixos/hosts/control-plane/README.md b/full-ai-cluster/nixos/hosts/control-plane/README.md
new file mode 100644
index 0000000000..5e50858db8
--- /dev/null
+++ b/full-ai-cluster/nixos/hosts/control-plane/README.md
@@ -0,0 +1,44 @@
+# control-plane
+
+K3S server + Cilium CNI bootstrap + ArgoCD reconciler. No GPU.
+
+## What it runs
+
+- K3S server with embedded etcd, flannel + kube-proxy disabled
+  (Cilium takes over)
+- Cilium CNI (Helm install via ArgoCD's first reconcile)
+- ArgoCD itself (auto-applied from `k8s/bootstrap/` via
+  `services.k3s.manifests` in `k3s-server.nix`)
+- Local-path storage class for stateless workloads
+- Docker (for any non-K8s container tooling)
+
+## What it does NOT run
+
+- No GPU workloads (those go on `worker-gpu` hosts)
+- No big AI models locally (LLMs serve from worker-gpu via Ollama/vLLM)
+
+## Install
+
+See the parent [`../../README.md`](../../README.md) bootstrap flow.
+This host's `<host>` name when installing is `control-plane`:
+
+```bash
+nixos-install --flake /mnt/etc/zeta/full-ai-cluster#control-plane
+```
+
+## Post-install verification
+
+```bash
+ssh zeta@control-plane.zeta.local
+sudo kubectl get nodes
+sudo kubectl -n kube-system get pods                    # cilium pods
+sudo kubectl -n argocd get pods
+sudo kubectl -n argocd get applications
+sudo cilium status
+sudo cilium hubble enable --ui
+```
+
+## Hardware config
+
+`hardware-configuration.nix` ships as a placeholder. Replace at
+install time per the parent README.
diff --git a/full-ai-cluster/nixos/hosts/control-plane/configuration.nix b/full-ai-cluster/nixos/hosts/control-plane/configuration.nix
new file mode 100644
index 0000000000..1d6e3ec790
--- /dev/null
+++ b/full-ai-cluster/nixos/hosts/control-plane/configuration.nix
@@ -0,0 +1,32 @@
+# full-ai-cluster/nixos/hosts/control-plane/configuration.nix
+#
+# K3S server + ArgoCD bootstrap. Cilium CNI takes over from flannel.
+# No GPU on this host — control-plane stays lean.
+
+{ config, pkgs, lib, ... }:
+
+{
+  imports = [
+    ./hardware-configuration.nix
+    ../../modules/common.nix
+    ../../modules/k3s-server.nix
+    ../../modules/docker.nix
+    ../../modules/local-storage.nix
+  ];
+
+  networking.hostName = "control-plane";
+
+  # Static IP recommended so worker nodes have a stable serverAddr.
+  # Per-site override here:
+  #   networking.interfaces.eth0.ipv4.addresses = [{
+  #     address = "192.168.1.10";
+  #     prefixLength = 24;
+  #   }];
+  #   networking.defaultGateway = "192.168.1.1";
+  #   networking.nameservers = [ "1.1.1.1" "9.9.9.9" ];
+
+  # Add maintainer SSH keys for the `zeta` admin user:
+  users.users.zeta.openssh.authorizedKeys.keys = [
+    # "ssh-ed25519 AAAAC3Nz... aaron@zeta"
+  ];
+}
diff --git a/full-ai-cluster/nixos/hosts/control-plane/hardware-configuration.nix b/full-ai-cluster/nixos/hosts/control-plane/hardware-configuration.nix
new file mode 100644
index 0000000000..1e4222d2da
--- /dev/null
+++ b/full-ai-cluster/nixos/hosts/control-plane/hardware-configuration.nix
@@ -0,0 +1,37 @@
+# full-ai-cluster/nixos/hosts/control-plane/hardware-configuration.nix
+#
+# PLACEHOLDER — replace during real install via:
+#   nixos-generate-config --root /mnt
+#   cp /mnt/etc/nixos/hardware-configuration.nix \
+#      /mnt/etc/zeta/full-ai-cluster/nixos/hosts/control-plane/hardware-configuration.nix
+#
+# This stub exists so `nix flake check` succeeds in CI before the
+# host is provisioned. Real generator output replaces all values.
+
+{ config, lib, modulesPath, ... }:
+
+{
+  imports = [
+    (modulesPath + "/installer/scan/not-detected.nix")
+  ];
+
+  boot.initrd.availableKernelModules = [ "xhci_pci" "ahci" "nvme" "usb_storage" "sd_mod" ];
+  boot.initrd.kernelModules = [ ];
+  boot.kernelModules = [ "kvm-intel" "kvm-amd" ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" = lib.mkDefault {
+    device = "/dev/disk/by-label/nixos";
+    fsType = "ext4";
+  };
+
+  fileSystems."/boot" = lib.mkDefault {
+    device = "/dev/disk/by-label/boot";
+    fsType = "vfat";
+  };
+
+  swapDevices = lib.mkDefault [ ];
+
+  networking.useDHCP = lib.mkDefault true;
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
diff --git a/full-ai-cluster/nixos/hosts/worker-gpu/README.md b/full-ai-cluster/nixos/hosts/worker-gpu/README.md
new file mode 100644
index 0000000000..21d95350e4
--- /dev/null
+++ b/full-ai-cluster/nixos/hosts/worker-gpu/README.md
@@ -0,0 +1,60 @@
+# worker-gpu
+
+K3S worker template + NVIDIA GPU + container toolkit + K8s device
+plugin + VFIO passthrough (optional).
+
+## What it runs
+
+- K3S agent joining `https://control-plane.zeta.local:6443`
+- NVIDIA proprietary driver + `nvidia-container-toolkit` (so K3S
+  pods can request `nvidia.com/gpu` resources)
+- NVIDIA Kubernetes device plugin DaemonSet (advertises GPUs)
+- Docker (for non-K8s container workloads on this host)
+- Local-path storage class
+
+## Optional: GPU passthrough for VM workloads
+
+If this host hosts VMs that need a dedicated GPU (e.g. running
+a Windows VM with passthrough alongside K8s workloads on the
+remaining GPUs), enable VFIO in `configuration.nix`:
+
+```nix
+zeta.gpu-passthrough = {
+  enable = true;
+  pciIds = [ "10de:2204" "10de:1aef" ];   # find via `lspci -nn`
+};
+```
+
+## Mixed-vendor hosts
+
+If this worker has AMD or Intel GPUs alongside NVIDIA, edit the
+`zeta.gpu-device-plugin.vendors` list:
+
+```nix
+zeta.gpu-device-plugin = {
+  enable = true;
+  vendors = [ "nvidia" "amd" "intel" ];
+};
+```
+
+Each enabled vendor gets its own DaemonSet advertising the
+appropriate resource name to K8s.
+
+## Per-physical-worker scaling
+
+This file is a **template**. For each additional GPU worker:
+
+1. Copy `worker-gpu/` to `worker-gpu-NN/`
+2. Update `networking.hostName` in the new copy
+3. Drop the per-host `hardware-configuration.nix` from
+   `nixos-generate-config`
+4. Add `nixosConfigurations.worker-gpu-NN = ...` entry to
+   `../../flake.nix`
+5. Install: `nixos-install --flake /mnt/etc/zeta/full-ai-cluster#worker-gpu-NN`
+
+## Install
+
+See parent [`../../README.md`](../../README.md) bootstrap flow.
+Important: write the K3S cluster token (from the control-plane)
+to `/var/lib/rancher/k3s/agent/token` BEFORE running
+`nixos-install`. K3S refuses to start without it.
diff --git a/full-ai-cluster/nixos/hosts/worker-gpu/configuration.nix b/full-ai-cluster/nixos/hosts/worker-gpu/configuration.nix
new file mode 100644
index 0000000000..3e0973c6a1
--- /dev/null
+++ b/full-ai-cluster/nixos/hosts/worker-gpu/configuration.nix
@@ -0,0 +1,53 @@
+# full-ai-cluster/nixos/hosts/worker-gpu/configuration.nix
+#
+# Worker template. Per physical worker, duplicate this file under
+# nixos/hosts/worker-gpu-NN/, add a per-host hardware-configuration,
+# and add a nixosConfigurations.worker-gpu-NN entry to flake.nix.
+#
+# This template runs: NVIDIA GPU + container-toolkit + K8s device
+# plugin + Docker + local storage. VFIO passthrough OFF by default
+# (enable per-host).
+
+{ config, pkgs, lib, ... }:
+
+{
+  imports = [
+    ./hardware-configuration.nix
+    ../../modules/common.nix
+    ../../modules/k3s-agent.nix
+    ../../modules/gpu.nix
+    ../../modules/gpu-device-plugin.nix
+    ../../modules/gpu-passthrough.nix
+    ../../modules/docker.nix
+    ../../modules/local-storage.nix
+  ];
+
+  networking.hostName = "worker-gpu";
+
+  # Cluster join target. Override per-site.
+  services.k3s.serverAddr = "https://control-plane.zeta.local:6443";
+
+  # Vendor mix for the K8s device plugin. Override per-host if
+  # this worker has AMD or Intel GPUs alongside (or instead of) NVIDIA.
+  zeta.gpu-device-plugin = {
+    enable = true;
+    vendors = [ "nvidia" ];
+  };
+
+  # VFIO passthrough disabled by default. Enable + list PCI IDs
+  # per-host when you want a GPU bound to vfio-pci for VM workloads.
+  zeta.gpu-passthrough = {
+    enable = false;
+    pciIds = [ ];   # e.g. [ "10de:2204" "10de:1aef" ]
+  };
+
+  # Per-host node labels — let the scheduler target hardware specs.
+  services.k3s.extraFlags = lib.mkAfter [
+    # "--node-label=zeta.io/gpu-model=rtx-4090"
+    # "--node-label=zeta.io/gpu-count=2"
+  ];
+
+  users.users.zeta.openssh.authorizedKeys.keys = [
+    # "ssh-ed25519 AAAAC3Nz... aaron@zeta"
+  ];
+}
diff --git a/full-ai-cluster/nixos/hosts/worker-gpu/hardware-configuration.nix b/full-ai-cluster/nixos/hosts/worker-gpu/hardware-configuration.nix
new file mode 100644
index 0000000000..d3287a4d11
--- /dev/null
+++ b/full-ai-cluster/nixos/hosts/worker-gpu/hardware-configuration.nix
@@ -0,0 +1,34 @@
+# full-ai-cluster/nixos/hosts/worker-gpu/hardware-configuration.nix
+#
+# PLACEHOLDER — replace at install time via:
+#   nixos-generate-config --root /mnt
+#   cp /mnt/etc/nixos/hardware-configuration.nix \
+#      /mnt/etc/zeta/full-ai-cluster/nixos/hosts/worker-gpu/hardware-configuration.nix
+
+{ config, lib, modulesPath, ... }:
+
+{
+  imports = [
+    (modulesPath + "/installer/scan/not-detected.nix")
+  ];
+
+  boot.initrd.availableKernelModules = [ "xhci_pci" "ahci" "nvme" "usb_storage" "sd_mod" ];
+  boot.initrd.kernelModules = [ ];
+  boot.kernelModules = [ "kvm-intel" "kvm-amd" ];
+  boot.extraModulePackages = [ ];
+
+  fileSystems."/" = lib.mkDefault {
+    device = "/dev/disk/by-label/nixos";
+    fsType = "ext4";
+  };
+
+  fileSystems."/boot" = lib.mkDefault {
+    device = "/dev/disk/by-label/boot";
+    fsType = "vfat";
+  };
+
+  swapDevices = lib.mkDefault [ ];
+
+  networking.useDHCP = lib.mkDefault true;
+  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
+}
diff --git a/full-ai-cluster/nixos/modules/common.nix b/full-ai-cluster/nixos/modules/common.nix
new file mode 100644
index 0000000000..51fd0270a6
--- /dev/null
+++ b/full-ai-cluster/nixos/modules/common.nix
@@ -0,0 +1,66 @@
+# full-ai-cluster/nixos/modules/common.nix
+#
+# Shared baseline every cluster host imports.
+
+{ config, pkgs, lib, stateVersion ? "24.11", ... }:
+
+{
+  nix.settings = {
+    experimental-features = [ "nix-command" "flakes" ];
+    auto-optimise-store = true;
+    trusted-users = [ "root" "@wheel" ];
+    substituters = [
+      "https://cache.nixos.org"
+      "https://nix-community.cachix.org"
+    ];
+    trusted-public-keys = [
+      "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
+      "nix-community.cachix.org-1:mB9FSh9qf2dCimDSUo8Zy7bkq5CX+/rkCWyvRCYg3Fs="
+    ];
+  };
+
+  nix.gc = {
+    automatic = true;
+    dates = "weekly";
+    options = "--delete-older-than 30d";
+  };
+
+  time.timeZone = lib.mkDefault "America/New_York";
+  i18n.defaultLocale = "en_US.UTF-8";
+
+  networking.networkmanager.enable = true;
+  networking.firewall.enable = true;
+
+  services.openssh = {
+    enable = true;
+    settings = {
+      PermitRootLogin = lib.mkDefault "prohibit-password";
+      PasswordAuthentication = lib.mkDefault false;
+      KbdInteractiveAuthentication = lib.mkDefault false;
+    };
+  };
+
+  users.users.zeta = {
+    isNormalUser = true;
+    extraGroups = [ "wheel" "networkmanager" ];
+  };
+  security.sudo.wheelNeedsPassword = lib.mkDefault true;
+
+  environment.systemPackages = with pkgs; [
+    git vim htop btop tmux ripgrep jq yq-go curl wget rsync tree
+    file unzip iproute2 iputils dnsutils nmap tcpdump mtr
+    pciutils usbutils lshw nvme-cli smartmontools lm_sensors
+    skopeo
+    kubectl kubernetes-helm k9s argocd
+    cilium-cli hubble
+  ];
+
+  boot.loader = {
+    systemd-boot.enable = lib.mkDefault true;
+    efi.canTouchEfiVariables = lib.mkDefault true;
+  };
+
+  powerManagement.cpuFreqGovernor = lib.mkDefault "performance";
+
+  system.stateVersion = lib.mkDefault stateVersion;
+}
diff --git a/full-ai-cluster/nixos/modules/docker.nix b/full-ai-cluster/nixos/modules/docker.nix
new file mode 100644
index 0000000000..51982fd8da
--- /dev/null
+++ b/full-ai-cluster/nixos/modules/docker.nix
@@ -0,0 +1,39 @@
+# full-ai-cluster/nixos/modules/docker.nix
+#
+# Docker daemon for non-K8s container workloads (local builds,
+# devcontainers, the Hermes image build with SOPS-baked secrets,
+# any tooling that needs a real Docker socket).
+#
+# K3S uses containerd under the hood — this is separate.
+
+{ config, pkgs, lib, ... }:
+
+{
+  virtualisation.docker = {
+    enable = true;
+
+    # rootless-by-default avoids accidental privileged-container
+    # surprises. Maintainers can still use `sudo docker` for cases
+    # that need the system daemon.
+    rootless = {
+      enable = true;
+      setSocketVariable = true;
+    };
+
+    # Enable on-host BuildKit so the SOPS-baking Hermes image build
+    # uses build-secrets and cache-mounts.
+    daemon.settings = {
+      features = { buildkit = true; };
+      "experimental" = false;
+    };
+  };
+
+  # Tooling: docker CLI, compose, buildx.
+  environment.systemPackages = with pkgs; [
+    docker
+    docker-compose
+    docker-buildx
+  ];
+
+  users.users.zeta.extraGroups = [ "docker" ];
+}
diff --git a/full-ai-cluster/nixos/modules/gpu-device-plugin.nix b/full-ai-cluster/nixos/modules/gpu-device-plugin.nix
new file mode 100644
index 0000000000..2e87a38aa1
--- /dev/null
+++ b/full-ai-cluster/nixos/modules/gpu-device-plugin.nix
@@ -0,0 +1,170 @@
+# full-ai-cluster/nixos/modules/gpu-device-plugin.nix
+#
+# Exposes GPUs to Kubernetes pods via the appropriate device plugin
+# DaemonSet. The plugins themselves run as K8s DaemonSets (deployed
+# by NixOS into the K3S manifests directory so they come up at first
+# boot, before ArgoCD takes over).
+#
+# Each vendor's plugin advertises a different K8s resource name:
+#   NVIDIA → `nvidia.com/gpu`
+#   AMD    → `amd.com/gpu`
+#   Intel  → `gpu.intel.com/i915`  (Xe) or  `gpu.intel.com/xe`
+#
+# Per-host config sets `zeta.gpu-device-plugin.vendors = [ "nvidia" "amd" ];`
+# Plugins only get installed for the vendors enabled.
+
+{ config, pkgs, lib, ... }:
+
+let
+  cfg = config.zeta.gpu-device-plugin;
+in
+{
+  options.zeta.gpu-device-plugin = {
+    enable = lib.mkEnableOption "K8s GPU device plugins";
+
+    vendors = lib.mkOption {
+      type = lib.types.listOf (lib.types.enum [ "nvidia" "amd" "intel" ]);
+      default = [ ];
+      description = ''
+        Which vendor device plugins to install on this host.
+        Multiple can coexist on the same node if it has mixed GPUs.
+      '';
+      example = [ "nvidia" "amd" ];
+    };
+
+    nvidiaVersion = lib.mkOption {
+      type = lib.types.str;
+      default = "v0.17.4";
+      description = "NVIDIA k8s-device-plugin chart version.";
+    };
+
+    amdVersion = lib.mkOption {
+      type = lib.types.str;
+      default = "v1.31.0";
+      description = "AMD k8s-device-plugin chart version.";
+    };
+
+    intelVersion = lib.mkOption {
+      type = lib.types.str;
+      default = "v0.32.1";
+      description = "Intel device-plugins-for-kubernetes version.";
+    };
+  };
+
+  config = lib.mkIf cfg.enable {
+    # Drop K3S manifests for each enabled vendor. K3S applies them on
+    # first boot. ArgoCD then takes ownership via the
+    # k8s/applications/gpu-device-plugin/Application.yaml so updates
+    # become git-driven (see that file).
+    services.k3s.manifests = lib.mkMerge [
+      (lib.mkIf (lib.elem "nvidia" cfg.vendors) {
+        nvidia-device-plugin.source = pkgs.writeText "nvidia-device-plugin.yaml" ''
+          apiVersion: apps/v1
+          kind: DaemonSet
+          metadata:
+            name: nvidia-device-plugin-daemonset
+            namespace: kube-system
+          spec:
+            selector:
+              matchLabels:
+                name: nvidia-device-plugin-ds
+            updateStrategy:
+              type: RollingUpdate
+            template:
+              metadata:
+                labels:
+                  name: nvidia-device-plugin-ds
+              spec:
+                tolerations:
+                  - { key: CriticalAddonsOnly, operator: Exists }
+                  - { key: nvidia.com/gpu, operator: Exists, effect: NoSchedule }
+                nodeSelector:
+                  zeta.io/gpu: nvidia
+                priorityClassName: system-node-critical
+                containers:
+                  - image: nvcr.io/nvidia/k8s-device-plugin:${cfg.nvidiaVersion}
+                    name: nvidia-device-plugin-ctr
+                    env:
+                      - { name: FAIL_ON_INIT_ERROR, value: "false" }
+                    securityContext:
+                      allowPrivilegeEscalation: false
+                      capabilities: { drop: ["ALL"] }
+                    volumeMounts:
+                      - { name: device-plugin, mountPath: /var/lib/kubelet/device-plugins }
+                volumes:
+                  - { name: device-plugin, hostPath: { path: /var/lib/kubelet/device-plugins } }
+        '';
+      })
+
+      (lib.mkIf (lib.elem "amd" cfg.vendors) {
+        amd-device-plugin.source = pkgs.writeText "amd-device-plugin.yaml" ''
+          apiVersion: apps/v1
+          kind: DaemonSet
+          metadata:
+            name: amdgpu-device-plugin-daemonset
+            namespace: kube-system
+          spec:
+            selector:
+              matchLabels:
+                name: amdgpu-dp-ds
+            template:
+              metadata:
+                labels:
+                  name: amdgpu-dp-ds
+              spec:
+                nodeSelector:
+                  zeta.io/gpu: amd
+                priorityClassName: system-node-critical
+                containers:
+                  - image: rocm/k8s-device-plugin:${cfg.amdVersion}
+                    name: amdgpu-dp-cntr
+                    securityContext:
+                      allowPrivilegeEscalation: false
+                      capabilities: { drop: ["ALL"] }
+                    volumeMounts:
+                      - { name: dp, mountPath: /var/lib/kubelet/device-plugins }
+                      - { name: sys, mountPath: /sys }
+                volumes:
+                  - { name: dp, hostPath: { path: /var/lib/kubelet/device-plugins } }
+                  - { name: sys, hostPath: { path: /sys } }
+        '';
+      })
+
+      (lib.mkIf (lib.elem "intel" cfg.vendors) {
+        intel-device-plugin.source = pkgs.writeText "intel-device-plugin.yaml" ''
+          apiVersion: apps/v1
+          kind: DaemonSet
+          metadata:
+            name: intel-gpu-plugin
+            namespace: kube-system
+          spec:
+            selector:
+              matchLabels:
+                app: intel-gpu-plugin
+            template:
+              metadata:
+                labels:
+                  app: intel-gpu-plugin
+              spec:
+                nodeSelector:
+                  zeta.io/gpu: intel
+                priorityClassName: system-node-critical
+                containers:
+                  - name: intel-gpu-plugin
+                    image: intel/intel-gpu-plugin:${cfg.intelVersion}
+                    securityContext:
+                      allowPrivilegeEscalation: false
+                      capabilities: { drop: ["ALL"] }
+                    volumeMounts:
+                      - { name: devfs, mountPath: /dev/dri, readOnly: true }
+                      - { name: sysfs, mountPath: /sys/class/drm, readOnly: true }
+                      - { name: kubeletsockets, mountPath: /var/lib/kubelet/device-plugins }
+                volumes:
+                  - { name: devfs, hostPath: { path: /dev/dri } }
+                  - { name: sysfs, hostPath: { path: /sys/class/drm } }
+                  - { name: kubeletsockets, hostPath: { path: /var/lib/kubelet/device-plugins } }
+        '';
+      })
+    ];
+  };
+}
diff --git a/full-ai-cluster/nixos/modules/gpu-passthrough.nix b/full-ai-cluster/nixos/modules/gpu-passthrough.nix
new file mode 100644
index 0000000000..62d1960ec4
--- /dev/null
+++ b/full-ai-cluster/nixos/modules/gpu-passthrough.nix
@@ -0,0 +1,75 @@
+# full-ai-cluster/nixos/modules/gpu-passthrough.nix
+#
+# VFIO GPU passthrough setup. Lets a host bind one or more GPUs to
+# vfio-pci at boot so they can be assigned to virtual machines
+# (libvirt/QEMU/Cloud-Hypervisor) running on the same host alongside
+# K3S workloads.
+#
+# Per-host override required: set the PCI vendor:device IDs for the
+# GPUs you want VFIO-bound. Find them with `lspci -nn | grep VGA`.
+# Example:  10de:2204  (NVIDIA RTX 3090)
+
+{ config, pkgs, lib, ... }:
+
+let
+  cfg = config.zeta.gpu-passthrough;
+in
+{
+  options.zeta.gpu-passthrough = {
+    enable = lib.mkEnableOption "VFIO GPU passthrough";
+
+    pciIds = lib.mkOption {
+      type = lib.types.listOf lib.types.str;
+      default = [ ];
+      description = ''
+        PCI vendor:device IDs to bind to vfio-pci at boot. Find via
+        `lspci -nn | grep VGA`. Example: [ "10de:2204" "10de:1aef" ]
+        for a 3090 + its audio function.
+      '';
+      example = [ "10de:2204" "10de:1aef" ];
+    };
+  };
+
+  config = lib.mkIf cfg.enable {
+    # IOMMU on. AMD: amd_iommu=on. Intel: intel_iommu=on.
+    boot.kernelParams = [
+      "intel_iommu=on"        # safe on AMD too — ignored if no Intel iommu
+      "amd_iommu=on"
+      "iommu=pt"
+    ] ++ lib.optional (cfg.pciIds != [ ]) "vfio-pci.ids=${lib.concatStringsSep "," cfg.pciIds}";
+
+    boot.kernelModules = [
+      "vfio_pci"
+      "vfio"
+      "vfio_iommu_type1"
+    ];
+
+    # Bind early so the NVIDIA driver doesn't grab the device first.
+    boot.initrd.kernelModules = [
+      "vfio_pci"
+      "vfio"
+      "vfio_iommu_type1"
+    ];
+
+    # Libvirt + QEMU stack for hosting passthrough VMs.
+    virtualisation.libvirtd = {
+      enable = true;
+      qemu = {
+        package = pkgs.qemu_kvm;
+        runAsRoot = true;
+        ovmf = {
+          enable = true;
+          packages = [ pkgs.OVMFFull.fd ];
+        };
+      };
+    };
+    users.users.zeta.extraGroups = [ "libvirtd" "kvm" ];
+
+    environment.systemPackages = with pkgs; [
+      virt-manager
+      virt-viewer
+      OVMFFull
+      qemu_kvm
+    ];
+  };
+}
diff --git a/full-ai-cluster/nixos/modules/gpu.nix b/full-ai-cluster/nixos/modules/gpu.nix
new file mode 100644
index 0000000000..9facff91a6
--- /dev/null
+++ b/full-ai-cluster/nixos/modules/gpu.nix
@@ -0,0 +1,55 @@
+# full-ai-cluster/nixos/modules/gpu.nix
+#
+# NVIDIA driver + container toolkit for AI worker nodes.
+# AMD ROCm + Intel oneAPI live in sibling modules (TODO when first
+# AMD/Intel cards land).
+
+{ config, pkgs, lib, ... }:
+
+{
+  nixpkgs.config.allowUnfreePredicate = pkg:
+    let name = lib.getName pkg; in
+    builtins.elem name [
+      "nvidia-x11"
+      "nvidia-settings"
+      "nvidia-persistenced"
+      "nvidia-docker"
+      "nvidia-container-toolkit"
+    ]
+    || lib.hasPrefix "cuda" name
+    || lib.hasPrefix "libcu" name
+    || lib.hasPrefix "libnv" name
+    || lib.hasPrefix "libnp" name
+    || name == "cuda-merged";
+
+  services.xserver.videoDrivers = [ "nvidia" ];
+
+  hardware.nvidia = {
+    package = config.boot.kernelPackages.nvidiaPackages.production;
+    modesetting.enable = true;
+    nvidiaPersistenced = true;
+    powerManagement.enable = false;
+    powerManagement.finegrained = false;
+    open = lib.mkDefault false;
+  };
+
+  hardware.graphics = {
+    enable = true;
+    enable32Bit = true;
+  };
+
+  hardware.nvidia-container-toolkit.enable = true;
+
+  environment.systemPackages = with pkgs; [
+    nvtopPackages.nvidia
+    cudaPackages.cuda_cudart
+    cudaPackages.cuda_nvcc
+    glxinfo
+    vulkan-tools
+    clinfo
+  ];
+
+  services.k3s.extraFlags = lib.mkAfter [
+    "--node-label=zeta.io/gpu=nvidia"
+  ];
+}
diff --git a/full-ai-cluster/nixos/modules/k3s-agent.nix b/full-ai-cluster/nixos/modules/k3s-agent.nix
new file mode 100644
index 0000000000..d230bf98e5
--- /dev/null
+++ b/full-ai-cluster/nixos/modules/k3s-agent.nix
@@ -0,0 +1,41 @@
+# full-ai-cluster/nixos/modules/k3s-agent.nix
+#
+# K3S worker. Matches the server's CNI takeover (no flannel,
+# no kube-proxy, Cilium owns the network).
+
+{ config, pkgs, lib, ... }:
+
+{
+  services.k3s = {
+    enable = true;
+    role = "agent";
+    serverAddr = lib.mkDefault "https://control-plane.zeta.local:6443";
+    tokenFile = lib.mkDefault "/var/lib/rancher/k3s/agent/token";
+
+    extraFlags = [
+      "--node-label=zeta.io/role=worker"
+
+      # Same CNI takeover settings as the server. The agent must
+      # NOT bring up flannel or kube-proxy — Cilium handles both.
+      "--flannel-backend=none"
+      "--disable-network-policy"
+      "--disable-kube-proxy"
+    ];
+  };
+
+  networking.firewall = {
+    allowedTCPPorts = [
+      10250   # kubelet
+      4244    # Hubble server
+      8472    # legacy VXLAN
+    ];
+    allowedUDPPorts = [
+      8472
+    ];
+    trustedInterfaces = [ "cilium_host" "cilium_net" "cni0" "lxc+" ];
+  };
+
+  systemd.tmpfiles.rules = [
+    "d /var/lib/rancher/k3s 0755 root root - -"
+  ];
+}
diff --git a/full-ai-cluster/nixos/modules/k3s-server.nix b/full-ai-cluster/nixos/modules/k3s-server.nix
new file mode 100644
index 0000000000..d012f34e7e
--- /dev/null
+++ b/full-ai-cluster/nixos/modules/k3s-server.nix
@@ -0,0 +1,73 @@
+# full-ai-cluster/nixos/modules/k3s-server.nix
+#
+# K3S control-plane configured for Cilium CNI takeover.
+#
+# K3S ships with flannel (CNI), kube-proxy, network-policy, servicelb,
+# and traefik. Cilium replaces flannel + kube-proxy + network-policy.
+# We disable all five so Cilium owns the network layer end-to-end.
+
+{ config, pkgs, lib, ... }:
+
+{
+  services.k3s = {
+    enable = true;
+    role = "server";
+    tokenFile = lib.mkDefault "/var/lib/rancher/k3s/server/token";
+    clusterInit = lib.mkDefault true;
+
+    extraFlags = [
+      "--write-kubeconfig-mode=0640"
+      "--write-kubeconfig-group=wheel"
+
+      # CNI takeover by Cilium — disable flannel + kube-proxy + the
+      # built-in network-policy controller. Cilium handles all three.
+      "--flannel-backend=none"
+      "--disable-network-policy"
+      "--disable-kube-proxy"
+
+      # Disable bundled servicelb + traefik — ArgoCD installs MetalLB
+      # + ingress-nginx (or Istio gateway) declaratively.
+      "--disable=servicelb"
+      "--disable=traefik"
+
+      # Cluster CIDR — give Cilium a /16 to work with.
+      "--cluster-cidr=10.42.0.0/16"
+      "--service-cidr=10.43.0.0/16"
+    ];
+
+    # K3S applies these manifests on first boot. We seed only what's
+    # required to get Cilium + ArgoCD running. ArgoCD takes over and
+    # reconciles every other workload from k8s/applications/.
+    manifests = {
+      cilium-namespace.source = ../../k8s/bootstrap/cilium-namespace.yaml;
+      argocd-namespace.source = ../../k8s/bootstrap/argocd-namespace.yaml;
+      argocd-install.source = ../../k8s/bootstrap/argocd-install.yaml;
+      root-application.source = ../../k8s/bootstrap/root-application.yaml;
+    };
+  };
+
+  networking.firewall = {
+    allowedTCPPorts = [
+      6443    # K3S API
+      9345    # K3S supervisor/join
+      10250   # kubelet
+      2379    # etcd client
+      2380    # etcd peer
+      4244    # Hubble server
+      4245    # Hubble Relay
+      8472    # legacy flannel/VXLAN (kept for safety)
+    ];
+    allowedUDPPorts = [
+      8472    # VXLAN (Cilium can also run native-routing)
+    ];
+    trustedInterfaces = [ "cilium_host" "cilium_net" "cni0" "lxc+" ];
+  };
+
+  environment.variables = {
+    KUBECONFIG = "/etc/rancher/k3s/k3s.yaml";
+  };
+
+  systemd.tmpfiles.rules = [
+    "d /var/lib/rancher/k3s 0755 root root - -"
+  ];
+}
diff --git a/full-ai-cluster/nixos/modules/local-storage.nix b/full-ai-cluster/nixos/modules/local-storage.nix
new file mode 100644
index 0000000000..cf05b16a83
--- /dev/null
+++ b/full-ai-cluster/nixos/modules/local-storage.nix
@@ -0,0 +1,133 @@
+# full-ai-cluster/nixos/modules/local-storage.nix
+#
+# Local-path storage class for K8s. Provisions hostPath PVs out of
+# /var/lib/zeta-local-storage/ on whichever node a pod lands on.
+#
+# Good for stateless workloads, scratch, cache. NOT for anything
+# that needs to survive node failure — Longhorn (via ArgoCD) handles
+# distributed storage for stateful workloads.
+#
+# Installed as a K3S auto-applied manifest so it's available before
+# ArgoCD comes up.
+
+{ config, pkgs, lib, ... }:
+
+{
+  # Ensure the host directory exists with correct ownership.
+  systemd.tmpfiles.rules = [
+    "d /var/lib/zeta-local-storage 0755 root root - -"
+  ];
+
+  # Install rancher/local-path-provisioner via K3S manifest. K3S
+  # actually ships this by default but we re-declare it explicitly
+  # so the path + storage class name match across the cluster.
+  services.k3s.manifests = {
+    local-path-provisioner.source = pkgs.writeText "local-path-provisioner.yaml" ''
+      apiVersion: v1
+      kind: Namespace
+      metadata:
+        name: local-path-storage
+      ---
+      apiVersion: storage.k8s.io/v1
+      kind: StorageClass
+      metadata:
+        name: zeta-local-path
+        annotations:
+          storageclass.kubernetes.io/is-default-class: "true"
+      provisioner: rancher.io/local-path
+      reclaimPolicy: Delete
+      volumeBindingMode: WaitForFirstConsumer
+      ---
+      apiVersion: v1
+      kind: ConfigMap
+      metadata:
+        name: local-path-config
+        namespace: local-path-storage
+      data:
+        config.json: |
+          {
+            "nodePathMap": [
+              {
+                "node": "DEFAULT_PATH_FOR_NON_LISTED_NODES",
+                "paths": ["/var/lib/zeta-local-storage"]
+              }
+            ]
+          }
+        setup: |-
+          #!/bin/sh
+          path=$VOL_DIR
+          mkdir -m 0777 -p $path
+        teardown: |-
+          #!/bin/sh
+          path=$VOL_DIR
+          rm -rf $path
+      ---
+      apiVersion: apps/v1
+      kind: Deployment
+      metadata:
+        name: local-path-provisioner
+        namespace: local-path-storage
+      spec:
+        replicas: 1
+        selector:
+          matchLabels: { app: local-path-provisioner }
+        template:
+          metadata:
+            labels: { app: local-path-provisioner }
+          spec:
+            serviceAccountName: local-path-provisioner-service-account
+            containers:
+              - name: local-path-provisioner
+                image: rancher/local-path-provisioner:v0.0.30
+                imagePullPolicy: IfNotPresent
+                command:
+                  - local-path-provisioner
+                  - start
+                  - --config
+                  - /etc/config/config.json
+                volumeMounts:
+                  - { name: config-volume, mountPath: /etc/config/ }
+                env:
+                  - { name: POD_NAMESPACE, valueFrom: { fieldRef: { fieldPath: metadata.namespace } } }
+            volumes:
+              - { name: config-volume, configMap: { name: local-path-config } }
+      ---
+      apiVersion: v1
+      kind: ServiceAccount
+      metadata:
+        name: local-path-provisioner-service-account
+        namespace: local-path-storage
+      ---
+      apiVersion: rbac.authorization.k8s.io/v1
+      kind: ClusterRole
+      metadata:
+        name: local-path-provisioner-role
+      rules:
+        - apiGroups: [""]
+          resources: ["nodes", "persistentvolumeclaims", "configmaps", "pods", "pods/log"]
+          verbs: ["get", "list", "watch"]
+        - apiGroups: [""]
+          resources: ["persistentvolumes"]
+          verbs: ["get", "list", "watch", "create", "patch", "update", "delete"]
+        - apiGroups: [""]
+          resources: ["events"]
+          verbs: ["create", "patch"]
+        - apiGroups: ["storage.k8s.io"]
+          resources: ["storageclasses"]
+          verbs: ["get", "list", "watch"]
+      ---
+      apiVersion: rbac.authorization.k8s.io/v1
+      kind: ClusterRoleBinding
+      metadata:
+        name: local-path-provisioner-bind
+      roleRef:
+        apiGroup: rbac.authorization.k8s.io
+        kind: ClusterRole
+        name: local-path-provisioner-role
+      subjects:
+        - kind: ServiceAccount
+          name: local-path-provisioner-service-account
+          namespace: local-path-storage
+    '';
+  };
+}
diff --git a/full-ai-cluster/usb-nixos-installer/README.md b/full-ai-cluster/usb-nixos-installer/README.md
new file mode 100644
index 0000000000..34bbb0e54e
--- /dev/null
+++ b/full-ai-cluster/usb-nixos-installer/README.md
@@ -0,0 +1,88 @@
+# usb-nixos-installer
+
+**Scope: ONLY the USB bootstrap portion.**
+
+This directory contains exactly the four things needed to produce a
+bootable NixOS USB installer that can install the target operating
+system on a new machine over USB or Ethernet:
+
+1. **NixOS declarative configuration** — `nixos/installer/configuration.nix`
+2. **NixFlakes for packages** — `flake.nix` at the directory root
+3. **Git for version text storage** — every file here lives in git;
+   the flake.nix references inputs by pinned revision
+4. **The OS Flake on a USB stick** — `nix build .#installer-iso`
+   produces a bootable ISO image you `dd` to a USB stick. The same
+   ISO supports Ethernet install (boot the target on the stick,
+   then `nixos-install --flake <git-url>#<host>` over the network).
+
+**This directory is intentionally minimal.** It does NOT contain
+K3S, ArgoCD, Orleans, GitLab, observability, GPU runtime, or any
+cluster workload. Those live in `../full-ai-cluster/`.
+
+For the full end-to-end AI cluster (including this USB bootstrap
+as its starting snippet), see `../full-ai-cluster/`.
+
+## Build the USB stick
+
+From any machine with Nix installed:
+
+```bash
+cd usb-nixos-installer
+nix build .#installer-iso
+# Output: result/iso/zeta-installer-*.iso (~1.5-2 GB)
+```
+
+## Write the ISO to a USB stick
+
+### macOS
+
+```bash
+diskutil list                      # find the USB device (e.g. /dev/disk4)
+diskutil unmountDisk /dev/disk4    # replace 4 with your USB device number
+sudo dd if=result/iso/zeta-installer-*.iso of=/dev/rdisk4 bs=4m status=progress
+diskutil eject /dev/disk4
+```
+
+### Linux
+
+```bash
+lsblk                              # find the USB device (e.g. /dev/sdb)
+sudo dd if=result/iso/zeta-installer-*.iso of=/dev/sdb bs=4M status=progress conv=fsync
+sync
+```
+
+## Install on a target machine
+
+1. Boot the target on the USB stick.
+2. Log in at the console as `root` (no password — upstream NixOS
+   installer default; console-only).
+3. Bring up the network with `nmtui` (interactive) or
+   `nmcli device wifi connect <SSID> password <PSK>`.
+4. Identify the target disk with `lsblk`.
+5. Partition + mount as desired (parted/gptfdisk/cryptsetup/zfs
+   are all on the stick).
+6. Generate per-machine hardware config:
+   `nixos-generate-config --root /mnt`
+7. Install:
+   `nixos-install --flake <git-url>#<host>` where `<host>` is one
+   of the names declared in `flake.nix` `nixosConfigurations`.
+   (This minimal installer only declares `installer` itself —
+   target-machine hosts live in `../full-ai-cluster/flake.nix`.)
+8. Reboot.
+
+## What's on the stick
+
+The complete package list lives in
+[`nixos/installer/configuration.nix`](nixos/installer/configuration.nix)
+under `environment.systemPackages`. Categories include:
+
+- Version control: git, git-lfs, gnupg, openssh
+- Editors: vim, neovim, nano
+- Shell QoL: tmux, htop, ripgrep, jq, yq-go, fzf, bat
+- Network: curl, wget, nmap, networkmanager, iwd, wireguard-tools
+- Disk: parted, gptfdisk, cryptsetup, zfs, lvm2, mdadm
+- Hardware inspection: lshw, dmidecode, nvme-cli, lm_sensors
+- NixOS install tooling: nixos-install-tools, nix-output-monitor
+
+The flake itself is the tick source. Every subsequent install
+reconciles toward the desired state declared here.
diff --git a/full-ai-cluster/usb-nixos-installer/flake.nix b/full-ai-cluster/usb-nixos-installer/flake.nix
new file mode 100644
index 0000000000..1b8e80afe3
--- /dev/null
+++ b/full-ai-cluster/usb-nixos-installer/flake.nix
@@ -0,0 +1,49 @@
+# usb-nixos-installer/flake.nix
+#
+# USB-only flake. Produces a bootable NixOS installer ISO.
+# Builds on Linux x86_64 natively; on Apple Silicon Macs use
+# the linux-builder pattern from ../full-ai-cluster/.
+
+{
+  description = "Zeta USB installer — NixOS bootable image for AI-cluster bootstrap";
+
+  inputs = {
+    nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.11";
+    flake-utils.url = "github:numtide/flake-utils";
+  };
+
+  outputs = { self, nixpkgs, flake-utils, ... }@inputs:
+    let
+      stateVersion = "24.11";
+    in
+    {
+      nixosConfigurations.installer = nixpkgs.lib.nixosSystem {
+        system = "x86_64-linux";
+        specialArgs = { inherit inputs stateVersion; };
+        modules = [
+          ./nixos/installer/configuration.nix
+        ];
+      };
+    } // flake-utils.lib.eachSystem [ "x86_64-linux" ] (system:
+      let
+        pkgs = import nixpkgs { inherit system; };
+      in
+      {
+        packages = {
+          installer-iso =
+            self.nixosConfigurations.installer.config.system.build.isoImage;
+          default = self.packages.${system}.installer-iso;
+        };
+
+        devShells.default = pkgs.mkShell {
+          name = "zeta-usb-installer";
+          packages = with pkgs; [
+            git
+            nix-output-monitor
+            nh
+          ];
+        };
+
+        formatter = pkgs.nixpkgs-fmt;
+      });
+}
diff --git a/full-ai-cluster/usb-nixos-installer/nixos/installer/configuration.nix b/full-ai-cluster/usb-nixos-installer/nixos/installer/configuration.nix
new file mode 100644
index 0000000000..8181c9fd5a
--- /dev/null
+++ b/full-ai-cluster/usb-nixos-installer/nixos/installer/configuration.nix
@@ -0,0 +1,204 @@
+# usb-nixos-installer/nixos/installer/configuration.nix
+#
+# Single-file declarative installer image. Contains ONLY what's
+# needed to boot a target machine and run `nixos-install --flake`
+# against a host config from this repo.
+
+{ config, pkgs, lib, modulesPath, ... }:
+
+{
+  imports = [
+    "${modulesPath}/installer/cd-dvd/installation-cd-minimal.nix"
+    "${modulesPath}/installer/cd-dvd/channel.nix"
+  ];
+
+  networking.hostName = "zeta-installer";
+  time.timeZone = "America/New_York";
+  i18n.defaultLocale = "en_US.UTF-8";
+
+  nix.settings = {
+    experimental-features = [ "nix-command" "flakes" ];
+    auto-optimise-store = true;
+    trusted-users = [ "root" "nixos" ];
+    substituters = [
+      "https://cache.nixos.org"
+      "https://nix-community.cachix.org"
+    ];
+    trusted-public-keys = [
+      "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
+      "nix-community.cachix.org-1:mB9FSh9qf2dCimDSUo8Zy7bkq5CX+/rkCWyvRCYg3Fs="
+    ];
+  };
+
+  networking.networkmanager.enable = true;
+  networking.wireless.enable = lib.mkForce false;
+  networking.firewall.enable = true;
+
+  # SSH off by default; console-only install. Enable manually for
+  # headless install with `sudo passwd nixos; sudo systemctl start sshd`.
+  services.openssh = {
+    enable = lib.mkForce false;
+    settings = {
+      PermitRootLogin = lib.mkForce "prohibit-password";
+      PasswordAuthentication = lib.mkForce false;
+      KbdInteractiveAuthentication = lib.mkForce false;
+    };
+  };
+
+  users.users.nixos = {
+    isNormalUser = true;
+    extraGroups = [ "wheel" "networkmanager" ];
+  };
+
+  environment.systemPackages = with pkgs; [
+    # Version control: pull the cluster flake onto the target
+    git
+    git-lfs
+    gnupg
+    openssh
+
+    # Editors
+    vim
+    neovim
+    nano
+
+    # Shell quality of life
+    bash
+    zsh
+    tmux
+    screen
+    htop
+    btop
+    tree
+    ripgrep
+    fd
+    fzf
+    bat
+    eza
+    jq
+    yq-go
+    less
+    file
+    which
+    unzip
+    zip
+    p7zip
+    rsync
+
+    # Network
+    curl
+    wget
+    iproute2
+    iputils
+    inetutils
+    dnsutils
+    nmap
+    tcpdump
+    mtr
+    ethtool
+    bind
+    networkmanager
+    iwd
+    wpa_supplicant
+    openvpn
+    wireguard-tools
+
+    # Disk / partitioning / filesystems
+    parted
+    gptfdisk
+    util-linux
+    cryptsetup
+    dosfstools
+    e2fsprogs
+    xfsprogs
+    btrfs-progs
+    zfs
+    lvm2
+    mdadm
+    smartmontools
+
+    # Hardware inspection
+    pciutils
+    usbutils
+    lshw
+    dmidecode
+    hwinfo
+    inxi
+    lm_sensors
+    nvme-cli
+    hdparm
+
+    # GPU detection (drivers come in per-host on installed system)
+    glxinfo
+    vulkan-tools
+    clinfo
+
+    # NixOS install tooling
+    nixos-install-tools
+    nix-output-monitor
+    nvd
+    nh
+
+    # Secrets management
+    age
+    sops
+    ssh-to-age
+
+    # Build helpers
+    coreutils
+    findutils
+    gawk
+    gnused
+    gnugrep
+    diffutils
+    patch
+    gcc
+    gnumake
+    pkg-config
+
+    # Observability of the install itself
+    iotop
+    iftop
+    ncdu
+    pv
+    progress
+
+    # Documentation on the stick
+    man-pages
+    man-pages-posix
+    tldr
+  ];
+
+  isoImage = {
+    isoName = lib.mkForce "zeta-installer-${config.system.nixos.release}.iso";
+    volumeID = lib.mkForce "ZETA_INSTALL";
+    makeEfiBootable = true;
+    makeUsbBootable = true;
+  };
+
+  environment.etc."zeta-install.md".text = ''
+    Zeta USB installer
+    ==================
+
+    1. Boot this USB on the target machine.
+    2. Log in at the console as `root` (no password — upstream
+       installer default; only usable from the local TTY).
+    3. Bring up the network:
+         nmtui                       # interactive, or
+         nmcli device wifi connect <SSID> password <PSK>
+    4. Identify the target disk:
+         lsblk
+    5. Partition + mount /mnt as desired.
+    6. Generate hardware config:
+         nixos-generate-config --root /mnt
+    7. Clone the full cluster flake (or this minimal USB flake):
+         git clone <git-url> /mnt/etc/zeta
+    8. Install:
+         nixos-install --flake /mnt/etc/zeta/full-ai-cluster#<host>
+       or for USB-only:
+         nixos-install --flake /mnt/etc/zeta/usb-nixos-installer#installer
+    9. Reboot.
+  '';
+
+  system.stateVersion = "24.11";
+}
diff --git a/usb-nixos-installer/README.md b/usb-nixos-installer/README.md
new file mode 100644
index 0000000000..34bbb0e54e
--- /dev/null
+++ b/usb-nixos-installer/README.md
@@ -0,0 +1,88 @@
+# usb-nixos-installer
+
+**Scope: ONLY the USB bootstrap portion.**
+
+This directory contains exactly the four things needed to produce a
+bootable NixOS USB installer that can install the target operating
+system on a new machine over USB or Ethernet:
+
+1. **NixOS declarative configuration** — `nixos/installer/configuration.nix`
+2. **NixFlakes for packages** — `flake.nix` at the directory root
+3. **Git for version text storage** — every file here lives in git;
+   the flake.nix references inputs by pinned revision
+4. **The OS Flake on a USB stick** — `nix build .#installer-iso`
+   produces a bootable ISO image you `dd` to a USB stick. The same
+   ISO supports Ethernet install (boot the target on the stick,
+   then `nixos-install --flake <git-url>#<host>` over the network).
+
+**This directory is intentionally minimal.** It does NOT contain
+K3S, ArgoCD, Orleans, GitLab, observability, GPU runtime, or any
+cluster workload. Those live in `../full-ai-cluster/`.
+
+For the full end-to-end AI cluster (including this USB bootstrap
+as its starting snippet), see `../full-ai-cluster/`.
+
+## Build the USB stick
+
+From any machine with Nix installed:
+
+```bash
+cd usb-nixos-installer
+nix build .#installer-iso
+# Output: result/iso/zeta-installer-*.iso (~1.5-2 GB)
+```
+
+## Write the ISO to a USB stick
+
+### macOS
+
+```bash
+diskutil list                      # find the USB device (e.g. /dev/disk4)
+diskutil unmountDisk /dev/disk4    # replace 4 with your USB device number
+sudo dd if=result/iso/zeta-installer-*.iso of=/dev/rdisk4 bs=4m status=progress
+diskutil eject /dev/disk4
+```
+
+### Linux
+
+```bash
+lsblk                              # find the USB device (e.g. /dev/sdb)
+sudo dd if=result/iso/zeta-installer-*.iso of=/dev/sdb bs=4M status=progress conv=fsync
+sync
+```
+
+## Install on a target machine
+
+1. Boot the target on the USB stick.
+2. Log in at the console as `root` (no password — upstream NixOS
+   installer default; console-only).
+3. Bring up the network with `nmtui` (interactive) or
+   `nmcli device wifi connect <SSID> password <PSK>`.
+4. Identify the target disk with `lsblk`.
+5. Partition + mount as desired (parted/gptfdisk/cryptsetup/zfs
+   are all on the stick).
+6. Generate per-machine hardware config:
+   `nixos-generate-config --root /mnt`
+7. Install:
+   `nixos-install --flake <git-url>#<host>` where `<host>` is one
+   of the names declared in `flake.nix` `nixosConfigurations`.
+   (This minimal installer only declares `installer` itself —
+   target-machine hosts live in `../full-ai-cluster/flake.nix`.)
+8. Reboot.
+
+## What's on the stick
+
+The complete package list lives in
+[`nixos/installer/configuration.nix`](nixos/installer/configuration.nix)
+under `environment.systemPackages`. Categories include:
+
+- Version control: git, git-lfs, gnupg, openssh
+- Editors: vim, neovim, nano
+- Shell QoL: tmux, htop, ripgrep, jq, yq-go, fzf, bat
+- Network: curl, wget, nmap, networkmanager, iwd, wireguard-tools
+- Disk: parted, gptfdisk, cryptsetup, zfs, lvm2, mdadm
+- Hardware inspection: lshw, dmidecode, nvme-cli, lm_sensors
+- NixOS install tooling: nixos-install-tools, nix-output-monitor
+
+The flake itself is the tick source. Every subsequent install
+reconciles toward the desired state declared here.
diff --git a/usb-nixos-installer/flake.nix b/usb-nixos-installer/flake.nix
new file mode 100644
index 0000000000..1b8e80afe3
--- /dev/null
+++ b/usb-nixos-installer/flake.nix
@@ -0,0 +1,49 @@
+# usb-nixos-installer/flake.nix
+#
+# USB-only flake. Produces a bootable NixOS installer ISO.
+# Builds on Linux x86_64 natively; on Apple Silicon Macs use
+# the linux-builder pattern from ../full-ai-cluster/.
+
+{
+  description = "Zeta USB installer — NixOS bootable image for AI-cluster bootstrap";
+
+  inputs = {
+    nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.11";
+    flake-utils.url = "github:numtide/flake-utils";
+  };
+
+  outputs = { self, nixpkgs, flake-utils, ... }@inputs:
+    let
+      stateVersion = "24.11";
+    in
+    {
+      nixosConfigurations.installer = nixpkgs.lib.nixosSystem {
+        system = "x86_64-linux";
+        specialArgs = { inherit inputs stateVersion; };
+        modules = [
+          ./nixos/installer/configuration.nix
+        ];
+      };
+    } // flake-utils.lib.eachSystem [ "x86_64-linux" ] (system:
+      let
+        pkgs = import nixpkgs { inherit system; };
+      in
+      {
+        packages = {
+          installer-iso =
+            self.nixosConfigurations.installer.config.system.build.isoImage;
+          default = self.packages.${system}.installer-iso;
+        };
+
+        devShells.default = pkgs.mkShell {
+          name = "zeta-usb-installer";
+          packages = with pkgs; [
+            git
+            nix-output-monitor
+            nh
+          ];
+        };
+
+        formatter = pkgs.nixpkgs-fmt;
+      });
+}
diff --git a/usb-nixos-installer/nixos/installer/configuration.nix b/usb-nixos-installer/nixos/installer/configuration.nix
new file mode 100644
index 0000000000..8181c9fd5a
--- /dev/null
+++ b/usb-nixos-installer/nixos/installer/configuration.nix
@@ -0,0 +1,204 @@
+# usb-nixos-installer/nixos/installer/configuration.nix
+#
+# Single-file declarative installer image. Contains ONLY what's
+# needed to boot a target machine and run `nixos-install --flake`
+# against a host config from this repo.
+
+{ config, pkgs, lib, modulesPath, ... }:
+
+{
+  imports = [
+    "${modulesPath}/installer/cd-dvd/installation-cd-minimal.nix"
+    "${modulesPath}/installer/cd-dvd/channel.nix"
+  ];
+
+  networking.hostName = "zeta-installer";
+  time.timeZone = "America/New_York";
+  i18n.defaultLocale = "en_US.UTF-8";
+
+  nix.settings = {
+    experimental-features = [ "nix-command" "flakes" ];
+    auto-optimise-store = true;
+    trusted-users = [ "root" "nixos" ];
+    substituters = [
+      "https://cache.nixos.org"
+      "https://nix-community.cachix.org"
+    ];
+    trusted-public-keys = [
+      "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
+      "nix-community.cachix.org-1:mB9FSh9qf2dCimDSUo8Zy7bkq5CX+/rkCWyvRCYg3Fs="
+    ];
+  };
+
+  networking.networkmanager.enable = true;
+  networking.wireless.enable = lib.mkForce false;
+  networking.firewall.enable = true;
+
+  # SSH off by default; console-only install. Enable manually for
+  # headless install with `sudo passwd nixos; sudo systemctl start sshd`.
+  services.openssh = {
+    enable = lib.mkForce false;
+    settings = {
+      PermitRootLogin = lib.mkForce "prohibit-password";
+      PasswordAuthentication = lib.mkForce false;
+      KbdInteractiveAuthentication = lib.mkForce false;
+    };
+  };
+
+  users.users.nixos = {
+    isNormalUser = true;
+    extraGroups = [ "wheel" "networkmanager" ];
+  };
+
+  environment.systemPackages = with pkgs; [
+    # Version control: pull the cluster flake onto the target
+    git
+    git-lfs
+    gnupg
+    openssh
+
+    # Editors
+    vim
+    neovim
+    nano
+
+    # Shell quality of life
+    bash
+    zsh
+    tmux
+    screen
+    htop
+    btop
+    tree
+    ripgrep
+    fd
+    fzf
+    bat
+    eza
+    jq
+    yq-go
+    less
+    file
+    which
+    unzip
+    zip
+    p7zip
+    rsync
+
+    # Network
+    curl
+    wget
+    iproute2
+    iputils
+    inetutils
+    dnsutils
+    nmap
+    tcpdump
+    mtr
+    ethtool
+    bind
+    networkmanager
+    iwd
+    wpa_supplicant
+    openvpn
+    wireguard-tools
+
+    # Disk / partitioning / filesystems
+    parted
+    gptfdisk
+    util-linux
+    cryptsetup
+    dosfstools
+    e2fsprogs
+    xfsprogs
+    btrfs-progs
+    zfs
+    lvm2
+    mdadm
+    smartmontools
+
+    # Hardware inspection
+    pciutils
+    usbutils
+    lshw
+    dmidecode
+    hwinfo
+    inxi
+    lm_sensors
+    nvme-cli
+    hdparm
+
+    # GPU detection (drivers come in per-host on installed system)
+    glxinfo
+    vulkan-tools
+    clinfo
+
+    # NixOS install tooling
+    nixos-install-tools
+    nix-output-monitor
+    nvd
+    nh
+
+    # Secrets management
+    age
+    sops
+    ssh-to-age
+
+    # Build helpers
+    coreutils
+    findutils
+    gawk
+    gnused
+    gnugrep
+    diffutils
+    patch
+    gcc
+    gnumake
+    pkg-config
+
+    # Observability of the install itself
+    iotop
+    iftop
+    ncdu
+    pv
+    progress
+
+    # Documentation on the stick
+    man-pages
+    man-pages-posix
+    tldr
+  ];
+
+  isoImage = {
+    isoName = lib.mkForce "zeta-installer-${config.system.nixos.release}.iso";
+    volumeID = lib.mkForce "ZETA_INSTALL";
+    makeEfiBootable = true;
+    makeUsbBootable = true;
+  };
+
+  environment.etc."zeta-install.md".text = ''
+    Zeta USB installer
+    ==================
+
+    1. Boot this USB on the target machine.
+    2. Log in at the console as `root` (no password — upstream
+       installer default; only usable from the local TTY).
+    3. Bring up the network:
+         nmtui                       # interactive, or
+         nmcli device wifi connect <SSID> password <PSK>
+    4. Identify the target disk:
+         lsblk
+    5. Partition + mount /mnt as desired.
+    6. Generate hardware config:
+         nixos-generate-config --root /mnt
+    7. Clone the full cluster flake (or this minimal USB flake):
+         git clone <git-url> /mnt/etc/zeta
+    8. Install:
+         nixos-install --flake /mnt/etc/zeta/full-ai-cluster#<host>
+       or for USB-only:
+         nixos-install --flake /mnt/etc/zeta/usb-nixos-installer#installer
+    9. Reboot.
+  '';
+
+  system.stateVersion = "24.11";
+}

From 803fcbe07fd6bb2e85c5bb8601f43fe22032a8b0 Mon Sep 17 00:00:00 2001
From: Lior <lior@zeta.dev>
Date: Mon, 25 May 2026 02:02:56 -0400
Subject: [PATCH 2/5] fix(ai-cluster): apply Aaron's 4-component clarifications
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

  1. OZ      → OpenZiti (real chart wired)
  2. Hermes  → custom + cloud-only (Aaron: "we don't care about
               local right now only cloud"); local LLM endpoints
               kept commented-out for later phase
  3. Warp    → removed entirely (Aaron: "we do not need it")
  4. Hindsight → standalone helm chart for ArgoCD, agent persistent
                 memory for Hermes; TODO awaiting chart URL

Changes:

oz/Application.yaml:
  - Real OpenZiti controller helm chart
    (docs.openziti.io/helm-charts/ziti-controller v1.4.5)
  - longhorn persistence; ClusterIP service endpoints
  - Note: ziti-router lands in a sibling app when first edge router added

hermes/deployment.yaml:
  - Reoriented to cloud LLM providers (Anthropic / OpenAI / Bedrock)
  - API keys baked into image at build time via SOPS (per spec)
  - OZ transport: ziti-controller.openziti.svc.cluster.local:443
  - Hindsight memory: hindsight.hindsight.svc.cluster.local
  - Ollama/vLLM env vars kept commented for when local models return

warp/:
  - Directory + Application.yaml removed

hindsight/Application.yaml:
  - TODO comment block requesting:
      repoURL (Helm repo)
      chart name
      version
  - Example shape provided in the comment so the swap is trivial
  - namespace.yaml placeholder declares the namespace
    (`zeta.io/integrates-with: hermes`)

README.md:
  - Updated component-tree annotations to reflect the 4 changes
  - "Component status" reorganized into ✅ confirmed / 🟡 custom /
    ⏳ deferred (local models) / ❓ awaiting input (Hindsight chart)

The 29-Application set still covers the original 30-component spec
(Warp removed by maintainer request).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 full-ai-cluster/README.md                     | 48 ++++++++---------
 .../k8s/applications/hermes/deployment.yaml   | 43 ++++++++++-----
 .../applications/hindsight/Application.yaml   | 39 +++++++++-----
 .../k8s/applications/hindsight/namespace.yaml | 10 ++++
 .../k8s/applications/oz/Application.yaml      | 53 +++++++++++--------
 .../k8s/applications/warp/Application.yaml    | 32 -----------
 6 files changed, 120 insertions(+), 105 deletions(-)
 create mode 100644 full-ai-cluster/k8s/applications/hindsight/namespace.yaml
 delete mode 100644 full-ai-cluster/k8s/applications/warp/Application.yaml

diff --git a/full-ai-cluster/README.md b/full-ai-cluster/README.md
index e60e2abb14..2661696d3e 100644
--- a/full-ai-cluster/README.md
+++ b/full-ai-cluster/README.md
@@ -39,14 +39,13 @@ full-ai-cluster/
         ├── argo-rollouts/      ← progressive delivery
         ├── longhorn/           ← distributed block storage
         ├── cockroachdb/        ← distributed SQL
-        ├── hindsight/          ← TODO: ambiguous component (see PR)
-        ├── oz/                 ← TODO: ambiguous component (see PR)
-        ├── hermes/             ← TODO: ambiguous component + OZ integration
-        ├── warp/               ← TODO: ambiguous component (see PR)
-        ├── ollama/             ← LLM serving (option A — local)
-        ├── vllm/               ← LLM serving (option B — high-throughput)
-        ├── deepseek-coder/     ← model deploy → Ollama or vLLM
-        ├── qwen-coder/         ← model deploy → Ollama or vLLM
+        ├── hindsight/          ← agent persistent memory for Hermes (chart URL TBD)
+        ├── oz/                 ← OpenZiti zero-trust overlay
+        ├── hermes/             ← custom AI agent (cloud LLMs via SOPS-baked keys, OZ transport, Hindsight memory)
+        ├── ollama/             ← LLM serving (option A — local — DEFERRED)
+        ├── vllm/               ← LLM serving (option B — high-throughput — DEFERRED)
+        ├── deepseek-coder/     ← model deploy → Ollama or vLLM (DEFERRED with local)
+        ├── qwen-coder/         ← model deploy → Ollama or vLLM (DEFERRED with local)
         ├── kube-prometheus-stack/ ← Prometheus + Grafana + Alertmanager
         ├── nats/               ← messaging
         ├── redis/              ← cache
@@ -177,23 +176,22 @@ Add new `nixosConfigurations.<host>` entries to `flake.nix` as needed.
 - ✅ Well-defined upstream charts (Cilium, ArgoCD, Temporal, GitLab,
   Forgejo, Argo Workflows / Rollouts, Longhorn, CockroachDB, NATS,
   Redis, Weaviate, Loki / Tempo / Alloy / Mimir, kube-prometheus-stack,
-  Istio, OPA, Sealed Secrets, Vault, Ollama, vLLM)
-- 🟡 Custom workloads needing your input on image + Helm chart
-  choice (Orleans Silo, Dapr Actors config, Hermes Docker image
-  with SOPS, Deepseek/Qwen Coder model deploys)
-- ⚠️ Ambiguous components that need you to confirm which upstream:
-  - **OZ** — multiple possibilities (OpenZiti? Auth0 OZ? specific
-    Aaron-built component?). Application.yaml has a TODO.
-  - **Hermes** — multiple possibilities (Cosmos IBC relayer? a
-    message broker? an Aaron-built agent?). Application.yaml has
-    a TODO with SOPS-baking-into-the-image structure already wired.
-  - **Warp** — multiple possibilities (Cloudflare Warp? Warp
-    terminal? Dagger Warp engine? an Aaron-built component?).
-  - **Hindsight** — multiple possibilities (Lockheed Martin's
-    OpenTelemetry processor? Microsoft's Hindsight? AWS?).
-
-Sharpen each ambiguous one by editing its `Application.yaml`
-to point at the right repoURL/chart/version.
+  Istio, OPA, Sealed Secrets, Vault, OpenZiti)
+- 🟡 Custom workloads needing maintainer input:
+  - **Hermes** — Aaron-built AI agent oriented at cloud LLM APIs
+    (Anthropic, OpenAI, etc.) with SOPS-baked keys + OZ transport
+    + Hindsight memory backend. Image build + push are maintainer
+    responsibility; the manifest scaffold + env vars are wired.
+  - **Orleans Silo** — custom Silo image embedding your grain code.
+- ⏳ Deferred (local-models phase — wait for now per "we only care about cloud right now"):
+  - Ollama, vLLM, Deepseek Coder, Qwen Coder Applications stay
+    in the tree at `replicas: 0` so the topology is preserved.
+    Bump replicas + rebuild Hermes against local endpoints when
+    the local-models phase comes back online.
+- ❓ Awaiting maintainer input:
+  - **Hindsight** — confirmed as standalone helm chart for agent
+    persistent memory for Hermes. `Application.yaml` has TODO
+    awaiting `repoURL` + chart name + version.
 
 ## Secrets
 
diff --git a/full-ai-cluster/k8s/applications/hermes/deployment.yaml b/full-ai-cluster/k8s/applications/hermes/deployment.yaml
index 1ee6d84692..41f995723e 100644
--- a/full-ai-cluster/k8s/applications/hermes/deployment.yaml
+++ b/full-ai-cluster/k8s/applications/hermes/deployment.yaml
@@ -1,19 +1,28 @@
-# Hermes deployment with SOPS-baked secrets via image build.
+# Hermes — custom AI agent oriented at CLOUD LLM endpoints.
 #
-# The image at `ghcr.io/lucent-financial-group/zeta-hermes` is
-# expected to be built via the Docker module (NixFlake) on a
-# maintainer host with SOPS-decrypted secrets baked in at build
-# time (rather than mounted at runtime). The build pipeline:
+# The image at `ghcr.io/lucent-financial-group/zeta-hermes` is built
+# via the Docker module (NixFlake) on a maintainer host. SOPS-decrypted
+# CLOUD API KEYS are baked into the image at build time so the running
+# container doesn't need access to the SOPS keys.
 #
-#   1. age decrypt encrypted/* → secrets/  (locally, via sops)
+# Build pipeline:
+#   1. sops -d encrypted/cloud-keys.env > secrets/cloud-keys.env
 #   2. docker buildx build --secret id=hermes-secrets,src=secrets/ ...
 #   3. docker push ghcr.io/lucent-financial-group/zeta-hermes:vN.N.N
 #   4. Bump `image:` below + commit + push
 #
-# Hermes' connections (placeholders — wire after OZ + Ollama land):
-#   - OZ:        oz-router.oz.svc.cluster.local:443
-#   - Ollama:    ollama.ollama.svc.cluster.local:11434
-#   - vLLM:      vllm.vllm.svc.cluster.local:8000
+# Hermes' connections (cloud-only for now; local models deferred):
+#   - OpenZiti (OZ) for zero-trust transport:
+#       ziti-controller.openziti.svc.cluster.local:443
+#   - Cloud LLM APIs via baked-in keys:
+#       Anthropic Claude
+#       OpenAI
+#       (add more as the SOPS file grows)
+#   - Hindsight (memory plugin) — once its Application lands, point at
+#       hindsight.hindsight.svc.cluster.local
+#
+# Local LLM serving (Ollama/vLLM) endpoints are kept in commented-out
+# form below for when local models come back online.
 
 apiVersion: v1
 kind: Namespace
@@ -44,9 +53,17 @@ spec:
         - name: hermes
           image: ghcr.io/lucent-financial-group/zeta-hermes:placeholder
           env:
-            - { name: OZ_ENDPOINT,      value: "oz-router.oz.svc.cluster.local:443" }
-            - { name: OLLAMA_ENDPOINT,  value: "http://ollama.ollama.svc.cluster.local:11434" }
-            - { name: VLLM_ENDPOINT,    value: "http://vllm.vllm.svc.cluster.local:8000" }
+            # OpenZiti transport
+            - { name: OZ_CONTROLLER_URL,
+                value: "https://ziti-controller.openziti.svc.cluster.local:443" }
+            # Cloud LLM providers — API keys baked in at image build via SOPS
+            - { name: LLM_PROVIDER, value: "anthropic" }  # or "openai" / "bedrock"
+            # Hindsight memory backend (Application lands separately)
+            - { name: HINDSIGHT_URL,
+                value: "http://hindsight.hindsight.svc.cluster.local" }
+            # Local LLM endpoints — kept commented for when local models are re-enabled:
+            # - { name: OLLAMA_ENDPOINT, value: "http://ollama.ollama.svc.cluster.local:11434" }
+            # - { name: VLLM_ENDPOINT,   value: "http://vllm.vllm.svc.cluster.local:8000" }
           resources:
             requests: { cpu: "200m", memory: "256Mi" }
             limits:   { cpu: "1",    memory: "1Gi"   }
diff --git a/full-ai-cluster/k8s/applications/hindsight/Application.yaml b/full-ai-cluster/k8s/applications/hindsight/Application.yaml
index b211de4609..c400c3fd07 100644
--- a/full-ai-cluster/k8s/applications/hindsight/Application.yaml
+++ b/full-ai-cluster/k8s/applications/hindsight/Application.yaml
@@ -1,15 +1,15 @@
-# Hindsight — TODO: AMBIGUOUS COMPONENT.
+# Hindsight — agent persistent memory system for Hermes.
+# Standalone Helm chart deployed via ArgoCD (per Aaron 2026-05-25).
 #
-# Multiple possibilities in 2026:
-#   - Hindsight (Lockheed Martin) — OpenTelemetry trace processor
-#     for tail-sampling: https://github.com/lockheed-martin/hindsight
-#   - Hindsight (Microsoft) — observability tool
-#   - Some other "Hindsight" specific to Aaron's stack
+# TODO(aaron): provide the Helm chart URL + chart name + version.
+# Several projects use the name "Hindsight" — confirm which one:
+#   - Is it a public OSS chart? (helm repo URL?)
+#   - A private chart hosted somewhere? (repoURL + auth?)
+#   - An Aaron-built chart in a sibling repo?
 #
-# This placeholder Application points at the Lockheed Martin
-# Hindsight (most likely match given OpenTelemetry context with
-# Loki/Tempo/Alloy/Mimir all on this cluster). UPDATE the
-# repoURL + chart fields below to whichever Hindsight you mean.
+# Once you provide repoURL + chart name + version, this Application
+# wires up directly. Until then, this placeholder declares the namespace
+# + intent so the structure is in place.
 
 apiVersion: argoproj.io/v1alpha1
 kind: Application
@@ -20,16 +20,27 @@ metadata:
 spec:
   project: default
   source:
-    # TODO(aaron): confirm Hindsight identity. Placeholder points at
-    # the Lockheed Martin OTel trace processor manifests:
+    # TODO(aaron): replace with the real Helm repo + chart name.
+    # Example shape:
+    #   repoURL: https://your-org.github.io/hindsight-chart/
+    #   chart: hindsight
+    #   targetRevision: 1.0.0
+    #   helm:
+    #     releaseName: hindsight
+    #     valuesObject:
+    #       persistence:
+    #         storageClass: longhorn
+    #         size: 20Gi
+    #       hermesIntegration:
+    #         enabled: true
     repoURL: https://github.com/Lucent-Financial-Group/Zeta
     targetRevision: main
     path: full-ai-cluster/k8s/applications/hindsight
     directory:
-      include: '{namespace,deployment,service}.yaml'
+      include: 'namespace.yaml'
   destination:
     server: https://kubernetes.default.svc
     namespace: hindsight
   syncPolicy:
-    automated: { prune: true, selfHeal: true }
+    automated: { prune: false, selfHeal: true }
     syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/hindsight/namespace.yaml b/full-ai-cluster/k8s/applications/hindsight/namespace.yaml
new file mode 100644
index 0000000000..25ba85ad9a
--- /dev/null
+++ b/full-ai-cluster/k8s/applications/hindsight/namespace.yaml
@@ -0,0 +1,10 @@
+# Namespace placeholder. Replaced by real Hindsight manifests
+# once the chart URL is provided.
+
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: hindsight
+  labels:
+    app.kubernetes.io/part-of: zeta
+    zeta.io/integrates-with: hermes
diff --git a/full-ai-cluster/k8s/applications/oz/Application.yaml b/full-ai-cluster/k8s/applications/oz/Application.yaml
index e9a1a1772b..5e6403d737 100644
--- a/full-ai-cluster/k8s/applications/oz/Application.yaml
+++ b/full-ai-cluster/k8s/applications/oz/Application.yaml
@@ -1,34 +1,45 @@
-# OZ — TODO: AMBIGUOUS COMPONENT.
+# OZ → OpenZiti. Zero-trust overlay network with per-connection
+# authentication + L4 mTLS.
 #
-# "OZ" maps to several possibilities in 2026:
-#   - OpenZiti (zero-trust networking): https://openziti.io/
-#   - An Aaron-specific AI agent named OZ
-#   - Auth0's OZ component
-#
-# This placeholder uses the local k8s/applications/oz/ path so you
-# can drop in real manifests by editing the files in this directory.
-# UPDATE repoURL + chart once the identity is confirmed.
+# OpenZiti splits into multiple components; this Application installs
+# the controller. Add ziti-router/Application.yaml in a sibling dir
+# when the first edge router lands.
 
 apiVersion: argoproj.io/v1alpha1
 kind: Application
 metadata:
-  name: oz
+  name: openziti-controller
   namespace: argocd
   finalizers: [ resources-finalizer.argocd.argoproj.io ]
 spec:
   project: default
   source:
-    # TODO(aaron): confirm OZ identity. If OpenZiti, switch to:
-    #   repoURL: https://docs.openziti.io/helm-charts/
-    #   chart: ziti-controller
-    repoURL: https://github.com/Lucent-Financial-Group/Zeta
-    targetRevision: main
-    path: full-ai-cluster/k8s/applications/oz
-    directory:
-      include: '{namespace,deployment,service}.yaml'
+    repoURL: https://docs.openziti.io/helm-charts/
+    chart: ziti-controller
+    targetRevision: 1.4.5
+    helm:
+      releaseName: ziti-controller
+      valuesObject:
+        ctrlPlane:
+          service:
+            type: ClusterIP
+        clientApi:
+          service:
+            type: ClusterIP
+        persistence:
+          enabled: true
+          storageClass: longhorn
+          size: 5Gi
+        # Bootstrap admin password — change before going to production.
+        # In production source from Vault or a Sealed Secret.
+        adminPassword: "changeme"
   destination:
     server: https://kubernetes.default.svc
-    namespace: oz
+    namespace: openziti
   syncPolicy:
-    automated: { prune: true, selfHeal: true }
-    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
+    automated:
+      prune: false   # never prune — would break the trust fabric
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=true
+      - ServerSideApply=true
diff --git a/full-ai-cluster/k8s/applications/warp/Application.yaml b/full-ai-cluster/k8s/applications/warp/Application.yaml
deleted file mode 100644
index 3e3f1f0a74..0000000000
--- a/full-ai-cluster/k8s/applications/warp/Application.yaml
+++ /dev/null
@@ -1,32 +0,0 @@
-# Warp — TODO: AMBIGUOUS COMPONENT.
-#
-# Possibilities:
-#   - Cloudflare Warp (consumer VPN)
-#   - Warp Terminal (https://www.warp.dev/) — IDE-like terminal
-#   - Dagger Warp engine (build system component)
-#   - HashiCorp Waypoint (formerly considered)
-#   - An Aaron-specific Warp (AI agent, custom orchestrator, etc.)
-#
-# Placeholder; replace repoURL + chart with the real one.
-
-apiVersion: argoproj.io/v1alpha1
-kind: Application
-metadata:
-  name: warp
-  namespace: argocd
-  finalizers: [ resources-finalizer.argocd.argoproj.io ]
-spec:
-  project: default
-  source:
-    # TODO(aaron): confirm Warp identity.
-    repoURL: https://github.com/Lucent-Financial-Group/Zeta
-    targetRevision: main
-    path: full-ai-cluster/k8s/applications/warp
-    directory:
-      include: '{namespace,deployment,service}.yaml'
-  destination:
-    server: https://kubernetes.default.svc
-    namespace: warp
-  syncPolicy:
-    automated: { prune: true, selfHeal: true }
-    syncOptions: [ CreateNamespace=true, ServerSideApply=true ]

From 34869abd66b66804e427cd4bd9c35070b40d4484 Mon Sep 17 00:00:00 2001
From: Lior <lior@zeta.dev>
Date: Mon, 25 May 2026 02:10:07 -0400
Subject: [PATCH 3/5] fix(ai-cluster): 10 Copilot review findings on PR #4910
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

P1 — Cilium chicken-and-egg
  k3s-server.nix disables flannel + kube-proxy so K3S brings up NO
  CNI on boot. Without a CNI no pods can schedule (including ArgoCD
  itself), so ArgoCD-installs-Cilium-later is impossible.

  Fix: added k8s/bootstrap/cilium-install.yaml (kustomize ref to
  Cilium v1.16.5 install manifest) + wired into K3S manifests list
  BEFORE argocd-install. Boot sequence is now:
    cilium-namespace → cilium-install → argocd-namespace →
    argocd-install → root-application
  ArgoCD's own Cilium Application then adopts the running install
  and reconciles ongoing changes.

P1 — k3s-agent.nix server-only flags
  --flannel-backend=none, --disable-kube-proxy, and
  --disable-network-policy are SERVER-side flags. K3S rejects them
  on agents with "flag not supported". Removed from agent extraFlags;
  documented in a comment that the server-side flags disable CNI
  cluster-wide.

P1 — README path errors (3 locations)
  full-ai-cluster/usb-nixos-installer/README.md and
  full-ai-cluster/usb-nixos-installer/flake.nix both said
  `../full-ai-cluster/` which from inside full-ai-cluster/usb-nixos-installer/
  resolves to full-ai-cluster/full-ai-cluster/ (non-existent).
  Replaced both with absolute GitHub URLs that work regardless of
  which copy of the file is being read. Mirrored to the top-level
  usb-nixos-installer/ copy to keep them byte-identical.

P1 — full-ai-cluster/README.md missing cilium-namespace.yaml in tree
  Tree view at line 27 listed argocd-namespace + argocd-install +
  root-application but skipped cilium-namespace.yaml. Added
  cilium-namespace AND the new cilium-install with ordering comment.

P1 — full-ai-cluster/README.md "Cluster layer" path typo
  Said `./k8s/applications/root-application.yaml` but the file is
  at `./k8s/bootstrap/root-application.yaml`. Fixed.

P1 — usb-nixos-installer/README.md "pinned by revision" claim
  README claimed inputs were pinned by revision, but no flake.lock
  is committed. Updated to say flake.nix references inputs by Git
  branch + explicit instruction for the first maintainer with Nix
  to run `nix flake update` and commit the resulting flake.lock.
  Mirrored to the duplicate.

P1 (security) — Grafana adminPassword: changeme hardcoded
  Replaced with `grafana.admin.existingSecret: grafana-admin-credentials`
  reference + commented-out kubectl create / Sealed Secret pattern.

P1 (security) — local-path-provisioner unquoted $VOL_DIR
  setup + teardown scripts had `path=$VOL_DIR` and `mkdir/rm` calls
  using unquoted vars. Added quoting + non-empty check + allowlist
  check that VOL_DIR resolves under /var/lib/zeta-local-storage/
  before mkdir or rm. Defense against an empty/exotic VOL_DIR
  causing /var/lib/zeta-local-storage/ wide-open writes or rms.

P1 (security) — docker module added zeta to `docker` group
  Removed `users.users.zeta.extraGroups = [ "docker" ];`. Membership
  in the docker group is root-on-host equivalent via the daemon
  socket. With rootless Docker enabled (already in the module),
  zeta gets their own rootless socket at $XDG_RUNTIME_DIR/docker.sock
  — that's sufficient for normal use. For maintainer tasks needing
  the system daemon: `sudo docker`.

P1 — gpu-device-plugin.nix referenced non-existent ArgoCD app
  Comment claimed ArgoCD would take ownership via
  k8s/applications/gpu-device-plugin/Application.yaml but no such
  directory exists. Rewrote the comment to reflect the actual
  state: NixOS-managed manifest only, bumping versions is a
  nixos-rebuild operation.

Stale findings (no fix needed, will resolve as outdated):
  - warp/Application.yaml: file was deleted in the previous commit;
    Copilot flagged the now-non-existent file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 full-ai-cluster/README.md                     | 13 +++--
 .../kube-prometheus-stack/Application.yaml    | 12 ++++-
 .../k8s/bootstrap/cilium-install.yaml         | 47 +++++++++++++++++++
 full-ai-cluster/nixos/modules/docker.nix      |  7 ++-
 .../nixos/modules/gpu-device-plugin.nix       | 10 ++--
 full-ai-cluster/nixos/modules/k3s-agent.nix   | 12 +++--
 full-ai-cluster/nixos/modules/k3s-server.nix  |  7 +++
 .../nixos/modules/local-storage.nix           | 14 ++++--
 full-ai-cluster/usb-nixos-installer/README.md | 13 +++--
 full-ai-cluster/usb-nixos-installer/flake.nix |  5 +-
 usb-nixos-installer/README.md                 | 13 +++--
 usb-nixos-installer/flake.nix                 |  5 +-
 12 files changed, 129 insertions(+), 29 deletions(-)
 create mode 100644 full-ai-cluster/k8s/bootstrap/cilium-install.yaml

diff --git a/full-ai-cluster/README.md b/full-ai-cluster/README.md
index 2661696d3e..b717409cc6 100644
--- a/full-ai-cluster/README.md
+++ b/full-ai-cluster/README.md
@@ -24,7 +24,9 @@ full-ai-cluster/
 │       ├── control-plane/      ← configuration.nix + hardware + README
 │       └── worker-gpu/         ← configuration.nix + hardware + README
 └── k8s/
-    ├── bootstrap/              ← K3S auto-applies on first boot
+    ├── bootstrap/              ← K3S auto-applies on first boot (in this order)
+    │   ├── cilium-namespace.yaml
+    │   ├── cilium-install.yaml ← CNI must exist before any pods (incl. ArgoCD)
     │   ├── argocd-namespace.yaml
     │   ├── argocd-install.yaml
     │   └── root-application.yaml ← App-of-Apps root
@@ -66,10 +68,11 @@ full-ai-cluster/
   `./nixos/` lands on a target machine via `nixos-install --flake`
   (initial install) or `nixos-rebuild switch --flake` (updates).
 - **Cluster layer** is reconciled by **ArgoCD**. K3S auto-applies
-  the three bootstrap manifests at `./k8s/bootstrap/` on first boot;
-  ArgoCD then reads `./k8s/applications/root-application.yaml`
-  (App-of-Apps) and reconciles every workload from the same Git
-  repo every ~3 minutes.
+  the bootstrap manifests at `./k8s/bootstrap/` on first boot
+  (Cilium → ArgoCD → root Application); ArgoCD then reads
+  `./k8s/bootstrap/root-application.yaml` (App-of-Apps) and
+  reconciles every workload under `./k8s/applications/` from the
+  same Git repo every ~3 minutes.
 
 This split is intentional: anything that must run BEFORE the
 cluster API exists (kernel modules, CNI host setup, container
diff --git a/full-ai-cluster/k8s/applications/kube-prometheus-stack/Application.yaml b/full-ai-cluster/k8s/applications/kube-prometheus-stack/Application.yaml
index 408c3deea8..c4440bae66 100644
--- a/full-ai-cluster/k8s/applications/kube-prometheus-stack/Application.yaml
+++ b/full-ai-cluster/k8s/applications/kube-prometheus-stack/Application.yaml
@@ -25,7 +25,17 @@ spec:
                   accessModes: [ ReadWriteOnce ]
                   resources: { requests: { storage: 100Gi } }
         grafana:
-          adminPassword: changeme   # override via Sealed Secret or Vault
+          # Admin password sourced from a Secret rather than
+          # hardcoded. Create the secret BEFORE this app syncs:
+          #   kubectl -n monitoring create secret generic grafana-admin-credentials \
+          #     --from-literal=admin-user=admin \
+          #     --from-literal=admin-password="$(openssl rand -hex 16)"
+          # Or use a Sealed Secret committed to Git so the credential
+          # is reproducible from the cluster definition.
+          admin:
+            existingSecret: grafana-admin-credentials
+            userKey: admin-user
+            passwordKey: admin-password
           persistence:
             enabled: true
             storageClassName: longhorn
diff --git a/full-ai-cluster/k8s/bootstrap/cilium-install.yaml b/full-ai-cluster/k8s/bootstrap/cilium-install.yaml
new file mode 100644
index 0000000000..fccf049f1f
--- /dev/null
+++ b/full-ai-cluster/k8s/bootstrap/cilium-install.yaml
@@ -0,0 +1,47 @@
+# full-ai-cluster/k8s/bootstrap/cilium-install.yaml
+#
+# Cilium CNI install applied by K3S on first boot, BEFORE the
+# ArgoCD Cilium Application takes over reconciliation.
+#
+# Why bootstrap-install instead of leaving everything to ArgoCD:
+# k3s-server.nix passes `--flannel-backend=none --disable-kube-proxy`
+# so K3S brings up no CNI at all. Without a CNI, no pods (including
+# ArgoCD's own pods) can be scheduled. ArgoCD has to be reachable to
+# install anything else — so something must install the CNI first.
+#
+# K3S applies this manifest before ArgoCD comes up. Cilium installs,
+# pods can schedule, ArgoCD comes up, and then ArgoCD's own Cilium
+# Application (`k8s/applications/cilium/Application.yaml`) takes over
+# the ongoing reconciliation.
+
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+namespace: kube-system
+
+resources:
+  # Cilium v1.16.5 install manifest. Generated via:
+  #   helm template cilium cilium/cilium --version 1.16.5 \
+  #     --namespace kube-system \
+  #     --set kubeProxyReplacement=true \
+  #     --set k8sServiceHost=control-plane.zeta.local \
+  #     --set k8sServicePort=6443 \
+  #     --set ipam.mode=cluster-pool \
+  #     --set ipam.operator.clusterPoolIPv4PodCIDRList[0]="10.42.0.0/16" \
+  #     --set bpf.masquerade=true \
+  #     --set routingMode=native \
+  #     --set ipv4NativeRoutingCIDR="10.42.0.0/16" \
+  #     --set autoDirectNodeRoutes=true \
+  #     --set hubble.enabled=true \
+  #     --set hubble.relay.enabled=true \
+  #     --set hubble.ui.enabled=true
+  #
+  # Re-generate + commit when bumping the version. ArgoCD's
+  # Application.yaml (in k8s/applications/cilium/) keeps the
+  # same chart pinned, so after the bootstrap install ArgoCD
+  # treats this as already-installed and reconciles in place.
+  - https://raw.githubusercontent.com/cilium/cilium/v1.16.5/install/kubernetes/cilium/templates.yaml
+
+# Site-specific patches: none yet. Add cmd-params for
+# load-balancer IP pools, BGP config, etc. as they come up.
+patches: []
diff --git a/full-ai-cluster/nixos/modules/docker.nix b/full-ai-cluster/nixos/modules/docker.nix
index 51982fd8da..d9a650a4b2 100644
--- a/full-ai-cluster/nixos/modules/docker.nix
+++ b/full-ai-cluster/nixos/modules/docker.nix
@@ -35,5 +35,10 @@
     docker-buildx
   ];
 
-  users.users.zeta.extraGroups = [ "docker" ];
+  # Intentionally NOT adding `zeta` to the `docker` group.
+  # Membership in `docker` is effectively root-on-host because the
+  # docker socket can mount any path. With rootless Docker enabled
+  # above, the `zeta` user gets its OWN rootless daemon socket at
+  # $XDG_RUNTIME_DIR/docker.sock — that's the only docker they need.
+  # For maintainer tasks requiring the system daemon, use `sudo docker`.
 }
diff --git a/full-ai-cluster/nixos/modules/gpu-device-plugin.nix b/full-ai-cluster/nixos/modules/gpu-device-plugin.nix
index 2e87a38aa1..6a02435681 100644
--- a/full-ai-cluster/nixos/modules/gpu-device-plugin.nix
+++ b/full-ai-cluster/nixos/modules/gpu-device-plugin.nix
@@ -53,9 +53,13 @@ in
 
   config = lib.mkIf cfg.enable {
     # Drop K3S manifests for each enabled vendor. K3S applies them on
-    # first boot. ArgoCD then takes ownership via the
-    # k8s/applications/gpu-device-plugin/Application.yaml so updates
-    # become git-driven (see that file).
+    # first boot so GPU resources are advertised to the scheduler
+    # before ArgoCD comes up. These manifests are static (no upgrade
+    # via ArgoCD today) — bumping the device-plugin version means
+    # editing the `*Version` options below and re-applying the host's
+    # nixos-rebuild. A future `k8s/applications/gpu-device-plugin/`
+    # Application could take over reconciliation, but it doesn't
+    # exist yet — the K3S-manifest path is the only one.
     services.k3s.manifests = lib.mkMerge [
       (lib.mkIf (lib.elem "nvidia" cfg.vendors) {
         nvidia-device-plugin.source = pkgs.writeText "nvidia-device-plugin.yaml" ''
diff --git a/full-ai-cluster/nixos/modules/k3s-agent.nix b/full-ai-cluster/nixos/modules/k3s-agent.nix
index d230bf98e5..b9db08277e 100644
--- a/full-ai-cluster/nixos/modules/k3s-agent.nix
+++ b/full-ai-cluster/nixos/modules/k3s-agent.nix
@@ -15,11 +15,13 @@
     extraFlags = [
       "--node-label=zeta.io/role=worker"
 
-      # Same CNI takeover settings as the server. The agent must
-      # NOT bring up flannel or kube-proxy — Cilium handles both.
-      "--flannel-backend=none"
-      "--disable-network-policy"
-      "--disable-kube-proxy"
+      # NOTE: server-only flags like `--flannel-backend=none`,
+      # `--disable-kube-proxy`, and `--disable-network-policy`
+      # are NOT set here — they're server-side and the agent
+      # inherits the network configuration from the server. K3S
+      # rejects them on agents with a `flag not supported` error.
+      # Cilium owns CNI on both sides; the server-side flags are
+      # what disables flannel cluster-wide.
     ];
   };
 
diff --git a/full-ai-cluster/nixos/modules/k3s-server.nix b/full-ai-cluster/nixos/modules/k3s-server.nix
index d012f34e7e..874db88049 100644
--- a/full-ai-cluster/nixos/modules/k3s-server.nix
+++ b/full-ai-cluster/nixos/modules/k3s-server.nix
@@ -39,9 +39,16 @@
     # required to get Cilium + ArgoCD running. ArgoCD takes over and
     # reconciles every other workload from k8s/applications/.
     manifests = {
+      # CNI MUST come first — without it no pods can schedule,
+      # including ArgoCD's own pods. Cilium installs here; ArgoCD's
+      # cilium Application (k8s/applications/cilium/) takes over
+      # reconciliation once it's healthy.
       cilium-namespace.source = ../../k8s/bootstrap/cilium-namespace.yaml;
+      cilium-install.source = ../../k8s/bootstrap/cilium-install.yaml;
+      # Then ArgoCD itself.
       argocd-namespace.source = ../../k8s/bootstrap/argocd-namespace.yaml;
       argocd-install.source = ../../k8s/bootstrap/argocd-install.yaml;
+      # Finally the App-of-Apps that hands off to ArgoCD.
       root-application.source = ../../k8s/bootstrap/root-application.yaml;
     };
   };
diff --git a/full-ai-cluster/nixos/modules/local-storage.nix b/full-ai-cluster/nixos/modules/local-storage.nix
index cf05b16a83..6f77dbaacd 100644
--- a/full-ai-cluster/nixos/modules/local-storage.nix
+++ b/full-ai-cluster/nixos/modules/local-storage.nix
@@ -55,12 +55,18 @@
           }
         setup: |-
           #!/bin/sh
-          path=$VOL_DIR
-          mkdir -m 0777 -p $path
+          set -eu
+          path="$VOL_DIR"
+          [ -n "$path" ] || { echo "VOL_DIR empty; refusing to mkdir"; exit 1; }
+          case "$path" in /var/lib/zeta-local-storage/*) ;; *) echo "VOL_DIR outside allowed root: $path"; exit 1 ;; esac
+          mkdir -m 0777 -p "$path"
         teardown: |-
           #!/bin/sh
-          path=$VOL_DIR
-          rm -rf $path
+          set -eu
+          path="$VOL_DIR"
+          [ -n "$path" ] || { echo "VOL_DIR empty; refusing to rm"; exit 1; }
+          case "$path" in /var/lib/zeta-local-storage/*) ;; *) echo "VOL_DIR outside allowed root: $path"; exit 1 ;; esac
+          rm -rf "$path"
       ---
       apiVersion: apps/v1
       kind: Deployment
diff --git a/full-ai-cluster/usb-nixos-installer/README.md b/full-ai-cluster/usb-nixos-installer/README.md
index 34bbb0e54e..86d444c5c7 100644
--- a/full-ai-cluster/usb-nixos-installer/README.md
+++ b/full-ai-cluster/usb-nixos-installer/README.md
@@ -9,7 +9,12 @@ system on a new machine over USB or Ethernet:
 1. **NixOS declarative configuration** — `nixos/installer/configuration.nix`
 2. **NixFlakes for packages** — `flake.nix` at the directory root
 3. **Git for version text storage** — every file here lives in git;
-   the flake.nix references inputs by pinned revision
+   `flake.nix` references inputs by Git branch. **Run
+   `nix flake update` and commit the resulting `flake.lock`** to
+   pin to specific revisions for fully-reproducible builds. The
+   lock file isn't committed yet (no maintainer with Nix has run
+   `nix flake update` on this branch yet); first maintainer to
+   build the ISO should commit it.
 4. **The OS Flake on a USB stick** — `nix build .#installer-iso`
    produces a bootable ISO image you `dd` to a USB stick. The same
    ISO supports Ethernet install (boot the target on the stick,
@@ -17,10 +22,12 @@ system on a new machine over USB or Ethernet:
 
 **This directory is intentionally minimal.** It does NOT contain
 K3S, ArgoCD, Orleans, GitLab, observability, GPU runtime, or any
-cluster workload. Those live in `../full-ai-cluster/`.
+cluster workload. Those live in the `full-ai-cluster/` directory
+at the repo root.
 
 For the full end-to-end AI cluster (including this USB bootstrap
-as its starting snippet), see `../full-ai-cluster/`.
+as its starting snippet), see
+<https://github.com/Lucent-Financial-Group/Zeta/tree/main/full-ai-cluster>.
 
 ## Build the USB stick
 
diff --git a/full-ai-cluster/usb-nixos-installer/flake.nix b/full-ai-cluster/usb-nixos-installer/flake.nix
index 1b8e80afe3..715791f2a9 100644
--- a/full-ai-cluster/usb-nixos-installer/flake.nix
+++ b/full-ai-cluster/usb-nixos-installer/flake.nix
@@ -1,8 +1,9 @@
 # usb-nixos-installer/flake.nix
 #
 # USB-only flake. Produces a bootable NixOS installer ISO.
-# Builds on Linux x86_64 natively; on Apple Silicon Macs use
-# the linux-builder pattern from ../full-ai-cluster/.
+# Builds on Linux x86_64 natively; on Apple Silicon Macs use the
+# nix-darwin linux-builder pattern (see the cluster flake at
+# https://github.com/Lucent-Financial-Group/Zeta/tree/main/full-ai-cluster).
 
 {
   description = "Zeta USB installer — NixOS bootable image for AI-cluster bootstrap";
diff --git a/usb-nixos-installer/README.md b/usb-nixos-installer/README.md
index 34bbb0e54e..86d444c5c7 100644
--- a/usb-nixos-installer/README.md
+++ b/usb-nixos-installer/README.md
@@ -9,7 +9,12 @@ system on a new machine over USB or Ethernet:
 1. **NixOS declarative configuration** — `nixos/installer/configuration.nix`
 2. **NixFlakes for packages** — `flake.nix` at the directory root
 3. **Git for version text storage** — every file here lives in git;
-   the flake.nix references inputs by pinned revision
+   `flake.nix` references inputs by Git branch. **Run
+   `nix flake update` and commit the resulting `flake.lock`** to
+   pin to specific revisions for fully-reproducible builds. The
+   lock file isn't committed yet (no maintainer with Nix has run
+   `nix flake update` on this branch yet); first maintainer to
+   build the ISO should commit it.
 4. **The OS Flake on a USB stick** — `nix build .#installer-iso`
    produces a bootable ISO image you `dd` to a USB stick. The same
    ISO supports Ethernet install (boot the target on the stick,
@@ -17,10 +22,12 @@ system on a new machine over USB or Ethernet:
 
 **This directory is intentionally minimal.** It does NOT contain
 K3S, ArgoCD, Orleans, GitLab, observability, GPU runtime, or any
-cluster workload. Those live in `../full-ai-cluster/`.
+cluster workload. Those live in the `full-ai-cluster/` directory
+at the repo root.
 
 For the full end-to-end AI cluster (including this USB bootstrap
-as its starting snippet), see `../full-ai-cluster/`.
+as its starting snippet), see
+<https://github.com/Lucent-Financial-Group/Zeta/tree/main/full-ai-cluster>.
 
 ## Build the USB stick
 
diff --git a/usb-nixos-installer/flake.nix b/usb-nixos-installer/flake.nix
index 1b8e80afe3..715791f2a9 100644
--- a/usb-nixos-installer/flake.nix
+++ b/usb-nixos-installer/flake.nix
@@ -1,8 +1,9 @@
 # usb-nixos-installer/flake.nix
 #
 # USB-only flake. Produces a bootable NixOS installer ISO.
-# Builds on Linux x86_64 natively; on Apple Silicon Macs use
-# the linux-builder pattern from ../full-ai-cluster/.
+# Builds on Linux x86_64 natively; on Apple Silicon Macs use the
+# nix-darwin linux-builder pattern (see the cluster flake at
+# https://github.com/Lucent-Financial-Group/Zeta/tree/main/full-ai-cluster).
 
 {
   description = "Zeta USB installer — NixOS bootable image for AI-cluster bootstrap";

From 042997e45f65a6e80c674fc984d4eb0ac0c6fb7f Mon Sep 17 00:00:00 2001
From: Lior <lior@zeta.dev>
Date: Mon, 25 May 2026 02:15:04 -0400
Subject: [PATCH 4/5] fix(bootstrap): switch argocd + cilium installs to K3S
 HelmChart CRs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Codex P1: services.k3s.manifests applies files as plain Kubernetes
resources — it does NOT run kustomize build. Both
argocd-install.yaml and cilium-install.yaml used
kustomize.config.k8s.io/v1beta1 Kustomization which is a BUILD-TIME
spec, not a K8s API resource. K3S would try to apply a Kustomization
CR (no such CRD installed in a fresh cluster), fail, and the actual
ArgoCD + Cilium would never come up.

Switched both to K3S's native helm.cattle.io/v1 HelmChart CR. K3S's
built-in Helm Controller fetches the chart at install time and
applies the rendered manifests automatically. Same end result as
kustomize-remote-resource, but in a format K3S's manifest applier
actually understands.

Cilium pinned to chart v1.16.5 with full KPR + Hubble (relay + UI)
+ BPF MASQUERADE + native routing — matches the ArgoCD Cilium
Application values exactly, so ArgoCD's eventual adoption sees a
matching configuration.

ArgoCD pinned to chart v7.7.10 (argo-cd helm chart) with
ClusterIP service + single-replica controller for the
single-control-plane bootstrap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../k8s/bootstrap/argocd-install.yaml         | 53 ++++++++----
 .../k8s/bootstrap/cilium-install.yaml         | 81 +++++++++----------
 2 files changed, 75 insertions(+), 59 deletions(-)

diff --git a/full-ai-cluster/k8s/bootstrap/argocd-install.yaml b/full-ai-cluster/k8s/bootstrap/argocd-install.yaml
index 5b8bf7ea35..968857634d 100644
--- a/full-ai-cluster/k8s/bootstrap/argocd-install.yaml
+++ b/full-ai-cluster/k8s/bootstrap/argocd-install.yaml
@@ -1,19 +1,40 @@
 # full-ai-cluster/k8s/bootstrap/argocd-install.yaml
 #
-# ArgoCD install via kustomize remote reference. K3S auto-applies
-# on first boot. Pinned to ArgoCD v2.13.2 — bump in lockstep with
-# any ArgoCD-Application charts that target a different API.
+# K3S HelmChart CR (`helm.cattle.io/v1`) — K3S's Helm Controller
+# resolves the chart at install time and applies the rendered
+# manifests. This is the native K3S pattern for "install Helm
+# chart at boot." Plain-resources format that
+# `services.k3s.manifests` actually understands.
 
-apiVersion: kustomize.config.k8s.io/v1beta1
-kind: Kustomization
-
-namespace: argocd
-
-resources:
-  - https://raw.githubusercontent.com/argoproj/argo-cd/v2.13.2/manifests/install.yaml
-
-# Site-specific patches go here:
-#   - cmd-params for ingress / TLS termination
-#   - resource limits for argocd-server on small control-planes
-#   - SSO config (Dex, GitHub OAuth, etc.)
-patches: []
+apiVersion: helm.cattle.io/v1
+kind: HelmChart
+metadata:
+  name: argocd
+  namespace: kube-system
+spec:
+  chart: argo-cd
+  repo: https://argoproj.github.io/argo-helm
+  version: 7.7.10
+  targetNamespace: argocd
+  createNamespace: false   # argocd-namespace.yaml creates it explicitly
+  valuesContent: |-
+    global:
+      domain: argocd.zeta.local
+    server:
+      service:
+        type: ClusterIP   # ingress lands separately
+    configs:
+      params:
+        server.insecure: true   # TLS terminated upstream once ingress lands
+    notifications:
+      enabled: false
+    dex:
+      enabled: false   # SSO wires in via a Sealed Secret + values patch
+    redis-ha:
+      enabled: false   # single-control-plane bootstrap
+    controller:
+      replicas: 1
+    repoServer:
+      replicas: 1
+    applicationSet:
+      enabled: true
diff --git a/full-ai-cluster/k8s/bootstrap/cilium-install.yaml b/full-ai-cluster/k8s/bootstrap/cilium-install.yaml
index fccf049f1f..c7d7100158 100644
--- a/full-ai-cluster/k8s/bootstrap/cilium-install.yaml
+++ b/full-ai-cluster/k8s/bootstrap/cilium-install.yaml
@@ -1,47 +1,42 @@
 # full-ai-cluster/k8s/bootstrap/cilium-install.yaml
 #
-# Cilium CNI install applied by K3S on first boot, BEFORE the
-# ArgoCD Cilium Application takes over reconciliation.
+# K3S HelmChart CR — K3S's Helm Controller installs Cilium at
+# first boot. This is the only thing that gets a CNI running before
+# ArgoCD's own pods need to schedule (chicken-and-egg avoidance).
 #
-# Why bootstrap-install instead of leaving everything to ArgoCD:
-# k3s-server.nix passes `--flannel-backend=none --disable-kube-proxy`
-# so K3S brings up no CNI at all. Without a CNI, no pods (including
-# ArgoCD's own pods) can be scheduled. ArgoCD has to be reachable to
-# install anything else — so something must install the CNI first.
-#
-# K3S applies this manifest before ArgoCD comes up. Cilium installs,
-# pods can schedule, ArgoCD comes up, and then ArgoCD's own Cilium
-# Application (`k8s/applications/cilium/Application.yaml`) takes over
-# the ongoing reconciliation.
-
-apiVersion: kustomize.config.k8s.io/v1beta1
-kind: Kustomization
-
-namespace: kube-system
-
-resources:
-  # Cilium v1.16.5 install manifest. Generated via:
-  #   helm template cilium cilium/cilium --version 1.16.5 \
-  #     --namespace kube-system \
-  #     --set kubeProxyReplacement=true \
-  #     --set k8sServiceHost=control-plane.zeta.local \
-  #     --set k8sServicePort=6443 \
-  #     --set ipam.mode=cluster-pool \
-  #     --set ipam.operator.clusterPoolIPv4PodCIDRList[0]="10.42.0.0/16" \
-  #     --set bpf.masquerade=true \
-  #     --set routingMode=native \
-  #     --set ipv4NativeRoutingCIDR="10.42.0.0/16" \
-  #     --set autoDirectNodeRoutes=true \
-  #     --set hubble.enabled=true \
-  #     --set hubble.relay.enabled=true \
-  #     --set hubble.ui.enabled=true
-  #
-  # Re-generate + commit when bumping the version. ArgoCD's
-  # Application.yaml (in k8s/applications/cilium/) keeps the
-  # same chart pinned, so after the bootstrap install ArgoCD
-  # treats this as already-installed and reconciles in place.
-  - https://raw.githubusercontent.com/cilium/cilium/v1.16.5/install/kubernetes/cilium/templates.yaml
+# After cluster is up, the ArgoCD Cilium Application at
+# k8s/applications/cilium/Application.yaml adopts the running
+# install and reconciles ongoing changes.
 
-# Site-specific patches: none yet. Add cmd-params for
-# load-balancer IP pools, BGP config, etc. as they come up.
-patches: []
+apiVersion: helm.cattle.io/v1
+kind: HelmChart
+metadata:
+  name: cilium
+  namespace: kube-system
+spec:
+  chart: cilium
+  repo: https://helm.cilium.io
+  version: 1.16.5
+  targetNamespace: kube-system
+  valuesContent: |-
+    kubeProxyReplacement: true
+    k8sServiceHost: control-plane.zeta.local
+    k8sServicePort: 6443
+    ipam:
+      mode: cluster-pool
+      operator:
+        clusterPoolIPv4PodCIDRList: [ "10.42.0.0/16" ]
+    bpf:
+      masquerade: true
+    routingMode: native
+    ipv4NativeRoutingCIDR: "10.42.0.0/16"
+    autoDirectNodeRoutes: true
+    hubble:
+      enabled: true
+      relay: { enabled: true }
+      ui: { enabled: true }
+      metrics:
+        enabled: [ dns, drop, tcp, flow, icmp, http ]
+        enableOpenMetrics: true
+    operator:
+      replicas: 1

From dc8d92a9b73e25d6f028439b55faf270f69b66a4 Mon Sep 17 00:00:00 2001
From: Lior <lior@zeta.dev>
Date: Mon, 25 May 2026 02:36:13 -0400
Subject: [PATCH 5/5] fix(ai-cluster): 7 review findings (security +
 deferred-models + gating)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

P1 (security) — OZ adminPassword hardcoded:
  oz/Application.yaml `adminPassword: "changeme"` replaced with
  `adminSecret: { name: ziti-admin-credentials, key: password }`.
  Maintainer creates the secret BEFORE syncing this app (one-liner
  kubectl + openssl in comment).

P1 — hindsight TODO named attribution:
  TODO(aaron) → TODO(maintainer); "per Aaron" → removed.
  Conforms to repo convention "no name attribution in current-state
  surfaces" — same fix shape as a prior PR's same finding.

P1 — k3s-server.nix references non-existent MetalLB/ingress-nginx apps:
  Comment around --disable=servicelb said "ArgoCD installs MetalLB +
  ingress-nginx as Applications" but no such apps exist in this PR.
  Rewrote to name the bootstrap-period gap (LoadBalancer Services
  stay Pending; use NodePort or port-forward during bootstrap) and
  defer the MetalLB/ingress-nginx Application authoring to a
  future commit.

P1 (security) — k3s-server.nix opened etcd ports at host firewall:
  2379 + 2380 were in allowedTCPPorts. K3S embedded etcd binds
  127.0.0.1 by default — opening them at the host firewall risks
  exposing etcd to the LAN if the bind address ever drifts.
  Removed both; added a comment explaining the rationale and the
  per-host-scoped pattern for multi-server HA.

P1 — Ollama auto-pulled 33B/32B models on first sync:
  Application enabled `automated: true` + had `models.pull` +
  `models.run` configured for deepseek-coder:33b + qwen2.5-coder:32b
  + replicas left at default (= 1). On first sync ArgoCD would
  immediately spin up GPU-resourced pods + pull ~60+ GB of model
  weights — directly conflicting with the README's "DEFERRED
  local-models phase" status.

  Fix:
    - replicaCount: 0 in chart values
    - models.pull / models.run lists emptied (kept as commented
      examples for when the local phase comes back)
    - syncPolicy set to manual-only (no `automated:` block).
      ArgoCD won't reconcile until `argocd app sync ollama`
      is run explicitly by a maintainer.
  Same gating applied to vllm/Application.yaml (also deferred).

P2 — Both GitLab and Forgejo would reconcile simultaneously:
  Root App-of-Apps picks up every */Application.yaml. Without
  gating, both GitLab and Forgejo would spin up + fight over
  cluster resources. Set GitLab as default-on; Forgejo gets
  `automated:` stripped (manual sync only). Header comment on
  Forgejo names the swap procedure. Root-application.yaml header
  documents the either/or pattern for future maintainer reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../k8s/applications/forgejo/Application.yaml |  9 +++++-
 .../applications/hindsight/Application.yaml   | 20 ++++++-------
 .../k8s/applications/ollama/Application.yaml  | 29 ++++++++++++++-----
 .../k8s/applications/oz/Application.yaml      | 12 ++++++--
 .../k8s/applications/vllm/Application.yaml    |  5 +++-
 .../k8s/bootstrap/root-application.yaml       |  7 +++++
 full-ai-cluster/nixos/modules/k3s-server.nix  | 17 ++++++++---
 7 files changed, 73 insertions(+), 26 deletions(-)

diff --git a/full-ai-cluster/k8s/applications/forgejo/Application.yaml b/full-ai-cluster/k8s/applications/forgejo/Application.yaml
index d0651ebcec..ff06180de1 100644
--- a/full-ai-cluster/k8s/applications/forgejo/Application.yaml
+++ b/full-ai-cluster/k8s/applications/forgejo/Application.yaml
@@ -1,4 +1,11 @@
 # Forgejo — self-hosted Git. Lighter than GitLab; pick one.
+#
+# DEFAULT: NOT reconciled. The repo ships both gitlab/ and forgejo/
+# Application definitions for review; only one should run at a time.
+# GitLab is the default-on; Forgejo is manual-sync. To switch:
+#   1. Move `automated` block from gitlab/Application.yaml here
+#   2. Remove `automated` from gitlab/Application.yaml
+#   3. argocd app delete gitlab + argocd app sync forgejo
 
 apiVersion: argoproj.io/v1alpha1
 kind: Application
@@ -27,6 +34,6 @@ spec:
   destination:
     server: https://kubernetes.default.svc
     namespace: forgejo
+  # Manual sync only by default — see header comment for swap procedure.
   syncPolicy:
-    automated: { prune: false, selfHeal: true }
     syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/hindsight/Application.yaml b/full-ai-cluster/k8s/applications/hindsight/Application.yaml
index c400c3fd07..defa46109a 100644
--- a/full-ai-cluster/k8s/applications/hindsight/Application.yaml
+++ b/full-ai-cluster/k8s/applications/hindsight/Application.yaml
@@ -1,15 +1,15 @@
 # Hindsight — agent persistent memory system for Hermes.
-# Standalone Helm chart deployed via ArgoCD (per Aaron 2026-05-25).
+# Standalone Helm chart deployed via ArgoCD.
 #
-# TODO(aaron): provide the Helm chart URL + chart name + version.
-# Several projects use the name "Hindsight" — confirm which one:
-#   - Is it a public OSS chart? (helm repo URL?)
-#   - A private chart hosted somewhere? (repoURL + auth?)
-#   - An Aaron-built chart in a sibling repo?
+# TODO(maintainer): provide the Helm chart URL + chart name + version.
+# Confirm which Hindsight chart this refers to:
+#   - public OSS chart (helm repo URL)
+#   - private chart (repoURL + auth)
+#   - in-repo chart (sibling repo URL + path)
 #
-# Once you provide repoURL + chart name + version, this Application
-# wires up directly. Until then, this placeholder declares the namespace
-# + intent so the structure is in place.
+# Once repoURL + chart name + version are provided, this Application
+# wires up directly. Until then, this placeholder declares the
+# namespace + intent so the structure is in place.
 
 apiVersion: argoproj.io/v1alpha1
 kind: Application
@@ -20,7 +20,7 @@ metadata:
 spec:
   project: default
   source:
-    # TODO(aaron): replace with the real Helm repo + chart name.
+    # TODO(maintainer): replace with the real Helm repo + chart name.
     # Example shape:
     #   repoURL: https://your-org.github.io/hindsight-chart/
     #   chart: hindsight
diff --git a/full-ai-cluster/k8s/applications/ollama/Application.yaml b/full-ai-cluster/k8s/applications/ollama/Application.yaml
index ab3347ebc0..61ba170ada 100644
--- a/full-ai-cluster/k8s/applications/ollama/Application.yaml
+++ b/full-ai-cluster/k8s/applications/ollama/Application.yaml
@@ -1,5 +1,13 @@
 # Ollama — LLM serving (option A).
 # Schedules onto worker-gpu nodes via nvidia.com/gpu request.
+#
+# DEFERRED: local-models phase is on hold per "we only care about
+# cloud right now." The Application is shipped in scaled-to-zero
+# form so the topology is preserved + the chart pin is reviewable,
+# but NO pods schedule + NO models pull until a maintainer:
+#   1. Removes the `replicaCount: 0` override below
+#   2. Re-enables automated sync (currently set to manual)
+#   3. Adds models back to `ollama.models.pull` + `.run`
 
 apiVersion: argoproj.io/v1alpha1
 kind: Application
@@ -16,18 +24,22 @@ spec:
     helm:
       releaseName: ollama
       valuesObject:
+        replicaCount: 0   # ← deferred local-models phase
         ollama:
           gpu:
             enabled: true
             type: nvidia
             number: 1
           models:
-            pull:
-              - deepseek-coder:33b
-              - qwen2.5-coder:32b
-            run:
-              - deepseek-coder:33b
-              - qwen2.5-coder:32b
+            # Pull + run lists intentionally empty — no model pull
+            # at deploy time. Re-enable when the local phase comes
+            # back online (example values left as comments):
+            pull: [ ]
+              # - deepseek-coder:33b
+              # - qwen2.5-coder:32b
+            run: [ ]
+              # - deepseek-coder:33b
+              # - qwen2.5-coder:32b
         persistentVolume:
           enabled: true
           size: 200Gi
@@ -43,6 +55,9 @@ spec:
   destination:
     server: https://kubernetes.default.svc
     namespace: ollama
+  # syncPolicy without `automated` — manual-sync-only so ArgoCD
+  # doesn't reconcile this app on its own during the deferred
+  # phase. Maintainer triggers via `argocd app sync ollama` when
+  # ready to enable local serving.
   syncPolicy:
-    automated: { prune: true, selfHeal: true }
     syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/applications/oz/Application.yaml b/full-ai-cluster/k8s/applications/oz/Application.yaml
index 5e6403d737..c898fadffa 100644
--- a/full-ai-cluster/k8s/applications/oz/Application.yaml
+++ b/full-ai-cluster/k8s/applications/oz/Application.yaml
@@ -30,9 +30,15 @@ spec:
           enabled: true
           storageClass: longhorn
           size: 5Gi
-        # Bootstrap admin password — change before going to production.
-        # In production source from Vault or a Sealed Secret.
-        adminPassword: "changeme"
+        # Admin password sourced from a Secret rather than hardcoded.
+        # Create the secret BEFORE this app syncs:
+        #   kubectl -n openziti create secret generic ziti-admin-credentials \
+        #     --from-literal=password="$(openssl rand -hex 24)"
+        # Or use a Sealed Secret committed to Git so the credential is
+        # reproducible from the cluster definition.
+        adminSecret:
+          name: ziti-admin-credentials
+          key: password
   destination:
     server: https://kubernetes.default.svc
     namespace: openziti
diff --git a/full-ai-cluster/k8s/applications/vllm/Application.yaml b/full-ai-cluster/k8s/applications/vllm/Application.yaml
index 730b584f16..2a82341557 100644
--- a/full-ai-cluster/k8s/applications/vllm/Application.yaml
+++ b/full-ai-cluster/k8s/applications/vllm/Application.yaml
@@ -24,6 +24,9 @@ spec:
   destination:
     server: https://kubernetes.default.svc
     namespace: vllm
+  # DEFERRED: local-models phase on hold. The deployment.yaml ships
+  # replicas: 0 (already) AND syncPolicy is manual-only so ArgoCD
+  # doesn't reconcile until a maintainer explicitly triggers
+  # `argocd app sync vllm`. Mirrors the ollama Application's gating.
   syncPolicy:
-    automated: { prune: true, selfHeal: true }
     syncOptions: [ CreateNamespace=true, ServerSideApply=true ]
diff --git a/full-ai-cluster/k8s/bootstrap/root-application.yaml b/full-ai-cluster/k8s/bootstrap/root-application.yaml
index 76092eae01..8d3d3763cc 100644
--- a/full-ai-cluster/k8s/bootstrap/root-application.yaml
+++ b/full-ai-cluster/k8s/bootstrap/root-application.yaml
@@ -9,6 +9,13 @@
 #   2. Author Application.yaml + any supporting manifests
 #   3. git commit + push to main
 #   4. ArgoCD picks it up on the next sync (~3 min)
+#
+# Either/or gating: some directories ship multiple alternatives
+# (e.g. gitlab/ vs forgejo/, ollama/ vs vllm/). The root App-of-Apps
+# picks them ALL up — gating happens at the per-Application
+# `syncPolicy` level. Default-on apps keep an `automated:` block;
+# alternatives omit it (manual sync only via `argocd app sync`).
+# Each alternative's Application.yaml header names the swap procedure.
 
 apiVersion: argoproj.io/v1alpha1
 kind: Application
diff --git a/full-ai-cluster/nixos/modules/k3s-server.nix b/full-ai-cluster/nixos/modules/k3s-server.nix
index 874db88049..a418b02156 100644
--- a/full-ai-cluster/nixos/modules/k3s-server.nix
+++ b/full-ai-cluster/nixos/modules/k3s-server.nix
@@ -25,8 +25,12 @@
       "--disable-network-policy"
       "--disable-kube-proxy"
 
-      # Disable bundled servicelb + traefik — ArgoCD installs MetalLB
-      # + ingress-nginx (or Istio gateway) declaratively.
+      # Disable bundled servicelb + traefik. No replacement L4
+      # load-balancer or ingress is declared in this PR — Services
+      # of type LoadBalancer will stay Pending until a maintainer
+      # commits a MetalLB + ingress-nginx Application under
+      # k8s/applications/. Bootstrap-period workloads needing
+      # external traffic should use NodePort or `kubectl port-forward`.
       "--disable=servicelb"
       "--disable=traefik"
 
@@ -58,11 +62,16 @@
       6443    # K3S API
       9345    # K3S supervisor/join
       10250   # kubelet
-      2379    # etcd client
-      2380    # etcd peer
       4244    # Hubble server
       4245    # Hubble Relay
       8472    # legacy flannel/VXLAN (kept for safety)
+      # etcd ports 2379/2380 intentionally NOT in this list.
+      # K3S embedded etcd binds 127.0.0.1 by default. Opening
+      # those ports at the host firewall would risk exposing etcd
+      # to the LAN if the bind address ever drifts. For multi-
+      # server HA, add 2379/2380 to a host-specific override that
+      # ALSO scopes them with `interfacesIn`/source-IP filtering to
+      # the other control-plane nodes only.
     ];
     allowedUDPPorts = [
       8472    # VXLAN (Cilium can also run native-routing)