
feat(ai-cluster-bootstrap): two-directory declarative AI cluster scaffold #4910

Merged
AceHack merged 5 commits into main from ai-cluster-bootstrap on May 25, 2026


@AceHack AceHack commented May 25, 2026

Summary

Two clean, separate top-level directories on branch `ai-cluster-bootstrap`, per Addison/Aaron's spec.

`usb-nixos-installer/` — USB-only, nothing extra

3 files:

| File | Purpose |
| --- | --- |
| `README.md` | Scope statement: USB bootstrap ONLY |
| `flake.nix` | Produces `installer-iso` |
| `nixos/installer/configuration.nix` | Single-file package list for the stick |

`full-ai-cluster/` — end-to-end cluster

62 files. First, a byte-identical copy of the USB directory (the bootstrap snippet). Then the full stack:

NixFlake layer (OS):

  • K3S server + K3S agent (Cilium takeover: `--flannel-backend=none`, `--disable-kube-proxy`, `--disable-network-policy`)
  • Cilium-host-prep (firewall, trusted-interfaces)
  • Docker via NixFlake (separate from K3S containerd)
  • local-path storage class as a K3S auto-applied manifest
  • NVIDIA driver + container toolkit
  • GPU passthrough (VFIO) for VM workloads on the same hosts
  • GPU device plugin for K8s — NVIDIA + AMD + Intel
  • per-host `configuration.nix` for `control-plane` + `worker-gpu` (+ template for additional workers)

ArgoCD layer (cluster — 30 Application.yamls):

  • Cilium (KPR + Hubble Relay + Hubble UI + BPF MASQUERADE per spec)
  • Orleans, Temporal (TS), Dapr Actors — three distributed-cron substrates
  • GitLab + Forgejo (both shipped, pick one)
  • Argo Workflows + Argo Rollouts
  • Longhorn (distributed block storage)
  • CockroachDB (distributed SQL)
  • Ollama + vLLM (LLM serving)
  • Deepseek Coder + Qwen Coder (model deploys → Ollama or vLLM)
  • kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
  • NATS, Redis, Weaviate
  • Loki, Tempo, Alloy, Mimir (Grafana observability stack)
  • Istio, Open Policy Agent, Sealed Secrets, Vault
  • Hindsight, OZ, Hermes, Warp — placeholder Application.yamls (see "Ambiguous components" below)

Bootstrap flow

```
nix build .#installer-iso → dd to USB → boot target → partition + clone Zeta →
nixos-install --flake ...#control-plane → reboot
↓ K3S starts
↓ K3S auto-applies cilium-namespace.yaml + argocd-{namespace,install}.yaml + root-application.yaml
↓ ArgoCD starts
↓ ArgoCD reconciles every Application.yaml under k8s/applications/
↓ Cluster running every workload declared
```

Ambiguous components (need your confirmation)

These 4 components map to multiple possible upstreams. I shipped placeholder Application.yamls with `TODO(aaron)` markers — please confirm which upstream each refers to and I'll sharpen them:

| Component | Possibilities |
| --- | --- |
| OZ | OpenZiti (zero-trust networking) / Auth0 OZ / Aaron-specific component |
| Hermes | Cosmos IBC relayer / message broker / Aaron-AI-agent (the spec's "integrated with OZ" + "SOPS into Hermes Docker image" + "Hermes access to Ollama or vLLM" hints suggest an Aaron-built agent; the placeholder deployment.yaml wires the env-var structure for OZ + Ollama + vLLM endpoints) |
| Warp | Cloudflare Warp / Warp Terminal / Dagger Warp engine / Aaron-specific |
| Hindsight | Lockheed Martin OTel tail-sampling processor / Microsoft Hindsight / other |

Build the USB (your Mac)

```bash

# 1. Clone (one-time)
cd ~/Documents/src/repos/Zeta
git fetch origin
git checkout ai-cluster-bootstrap

# 2. (Apple Silicon only; one-time linux-builder setup)
nix run nix-darwin/nix-darwin-24.11#darwin-rebuild -- switch \
  --flake full-ai-cluster#zeta-mac

# 3. Build the installer ISO
cd full-ai-cluster
nix build .#installer-iso
ls -lh result/iso/zeta-installer-*.iso

# 4. Write to USB (macOS; replace diskN with YOUR USB device number from `diskutil list`)
diskutil unmountDisk /dev/diskN
sudo dd if=result/iso/zeta-installer-*.iso of=/dev/rdiskN bs=4m status=progress
diskutil eject /dev/diskN
```

Install on a target machine

```bash

# Boot the target on the USB. At the console:

# Network up:
nmtui

# Partition (example: single ext4 + EFI; replace /dev/sda with your target disk):
sgdisk --zap-all /dev/sda
sgdisk -n 1:0:+512M -t 1:ef00 -c 1:boot /dev/sda
sgdisk -n 2:0:0 -t 2:8300 -c 2:nixos /dev/sda
mkfs.fat -F 32 -n boot /dev/sda1
mkfs.ext4 -L nixos /dev/sda2
mount /dev/disk/by-label/nixos /mnt
mkdir -p /mnt/boot && mount /dev/disk/by-label/boot /mnt/boot

# Clone cluster flake:
git clone https://github.com/Lucent-Financial-Group/Zeta /mnt/etc/zeta

# Per-machine hardware config (must be copied into the host dir):
nixos-generate-config --root /mnt
cp /mnt/etc/nixos/hardware-configuration.nix \
  /mnt/etc/zeta/full-ai-cluster/nixos/hosts//hardware-configuration.nix

# K3S cluster token (control-plane only on first install; save the token for workers):
nixos-enter --root /mnt -- bash -c '
  mkdir -p /var/lib/rancher/k3s/server
  openssl rand -hex 64 > /var/lib/rancher/k3s/server/token
  chmod 600 /var/lib/rancher/k3s/server/token
'
cat /mnt/var/lib/rancher/k3s/server/token # ← save this; needed on every worker

# Install:
nixos-install --flake /mnt/etc/zeta/full-ai-cluster#
# = control-plane | worker-gpu | ...

# Set zeta user password + reboot:
nixos-enter --root /mnt -- passwd zeta
reboot
```

For each worker, repeat — but instead of `openssl rand`, write the control-plane's token to `/var/lib/rancher/k3s/agent/token` (chmod 600).
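
The worker-side token step can be sketched as a small helper. This is a minimal sketch and not part of the PR: the function name `install_agent_token` and its two arguments are hypothetical, and it assumes the target root is already mounted (for example at `/mnt`) and that you saved the control-plane token earlier.

```shell
# Hypothetical helper (not from the PR): write the control-plane's K3S token
# into a worker's target root before running nixos-install.
install_agent_token() {
  root="$1"    # mounted target root, e.g. /mnt
  token="$2"   # token saved from the control-plane install

  if [ -z "$token" ]; then
    echo "refusing to write an empty token" >&2
    return 1
  fi

  dir="$root/var/lib/rancher/k3s/agent"
  mkdir -p "$dir"
  # Create the file under a restrictive umask so it is never world-readable,
  # then set the mode explicitly in case the file already existed.
  ( umask 077 && printf '%s\n' "$token" > "$dir/token" )
  chmod 600 "$dir/token"
}
```

A call might look like `install_agent_token /mnt "$(cat ./control-plane-token)"`, where `./control-plane-token` is a placeholder for wherever the saved token lives.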

Verify after first reboot

```bash
ssh zeta@control-plane.zeta.local
sudo kubectl get nodes
sudo kubectl -n kube-system get pods # cilium pods
sudo kubectl -n argocd get pods
sudo kubectl -n argocd get applications # all 30 should appear, gradually Healthy
sudo cilium status
sudo cilium hubble enable --ui
```
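
Because the Applications become Healthy gradually, it can help to list only the ones that are not there yet. The following is a rough sketch under stated assumptions: `unhealthy_apps` is a hypothetical helper name, and the wiring in the comment assumes `kubectl` is pointed at the new cluster and that each Application reports its health under `.status.health.status`.

```shell
# Hypothetical helper (not from the PR): read "name health" pairs on stdin
# and print the names of Applications whose health is anything but Healthy.
# An Application with no health reported yet (only a name on the line) is
# treated as not Healthy.
unhealthy_apps() {
  awk 'NF > 0 && $2 != "Healthy" { print $1 }'
}

# Example wiring (assumption: run on the control-plane after first reboot):
#   sudo kubectl -n argocd get applications \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.health.status}{"\n"}{end}' \
#     | unhealthy_apps
```

An empty result would suggest every Application has reached Healthy.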

File structure summary

```
usb-nixos-installer/ 3 files
└── README + flake + installer config

full-ai-cluster/ 62 files
├── usb-nixos-installer/ (identical copy, 3 files)
├── README + flake + 2 hosts (6 files) + 8 modules + 4 bootstrap + 30 apps (+ supporting manifests for Orleans, Hermes, vLLM, model configmaps)
```

Test plan

  • markdownlint passes
  • `nix flake check` passes on both flakes
  • Reviewer confirms ambiguous components or marks them OK to ship as placeholders
  • Post-merge: build ISO, boot on a test machine, run through the install flow

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fold

Adds two clean separate top-level directories on branch
ai-cluster-bootstrap per Aaron's spec:

usb-nixos-installer/  (3 files — USB-only, nothing extra)
  - README.md scoped to USB bootstrap only
  - flake.nix that produces installer-iso
  - nixos/installer/configuration.nix with package list

full-ai-cluster/  (62 files — end-to-end cluster)
  - usb-nixos-installer/ — byte-identical copy of the standalone
    directory (the USB part as the start snippet)
  - flake.nix — cluster flake (installer + per-host configs +
    nix-darwin linux-builder for Apple Silicon maintainers)
  - README.md — full bootstrap runbook
  - nixos/modules/ — 8 modules:
      common, k3s-server, k3s-agent, gpu, gpu-passthrough,
      gpu-device-plugin (NVIDIA + AMD + Intel), docker,
      local-storage
  - nixos/hosts/control-plane, worker-gpu — configuration.nix
    + placeholder hardware-configuration.nix + README per host
  - k8s/bootstrap/ — 4 manifests K3S auto-applies on first boot:
      cilium-namespace, argocd-namespace, argocd-install (pinned),
      root-application (App-of-Apps)
  - k8s/applications/ — 30 ArgoCD Application.yaml + supporting
    manifests covering every component in the spec

Component composition:
  NixFlake layer (OS): K3S, Cilium-host-prep, Docker, local-path
    storage class, GPU drivers + container toolkit, GPU passthrough
    (VFIO), GPU device plugins for K8s (NVIDIA/AMD/Intel)
  ArgoCD layer (cluster): Cilium (KPR + Hubble + BPF MASQUERADE),
    Orleans, Temporal (TS), Dapr Actors, GitLab, Forgejo, Argo
    Workflows + Rollouts, Longhorn, CockroachDB, Hindsight, OZ,
    Hermes (+ OZ integration via env vars), Warp, Ollama, vLLM,
    Deepseek Coder, Qwen Coder, kube-prometheus-stack
    (Prometheus + Grafana), NATS, Redis, Weaviate, Loki, Tempo,
    Alloy, Mimir, Istio (base), Open Policy Agent (Gatekeeper),
    Sealed Secrets, Vault

Cilium displaces K3S's default networking — k3s-server.nix passes
--flannel-backend=none, --disable-network-policy, and
--disable-kube-proxy so Cilium owns CNI + KPR + policy end-to-end.

Ambiguous components — placeholder Application.yamls with TODO(aaron)
markers + flagged in README "Component status" + PR description:
  - OZ (OpenZiti? Auth0? Aaron-specific?)
  - Hermes (Cosmos IBC relayer? Aaron AI-agent? Hermes integrates
    with OZ + has SOPS-baked-into-image secrets + talks to Ollama/vLLM
    per spec — the deployment.yaml wires the env-var structure for
    when the image lands)
  - Warp (Cloudflare? Warp terminal? Dagger Warp engine?)
  - Hindsight (Lockheed Martin OTel processor? other?)

Sharpen each by editing repoURL + chart in the relevant
Application.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 25, 2026 05:44

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 220a09b273

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread full-ai-cluster/nixos/modules/k3s-server.nix
Comment thread full-ai-cluster/nixos/modules/k3s-agent.nix Outdated

Copilot AI left a comment


Pull request overview

Adds a two-directory, declarative Nix-based AI cluster scaffold: a minimal USB NixOS installer flake and a full end-to-end cluster flake that bootstraps K3S + ArgoCD and declaratively installs a broad set of workloads via ArgoCD Applications.

Changes:

  • Introduces a standalone usb-nixos-installer/ flake for building a bootable NixOS installer ISO.
  • Adds full-ai-cluster/ flake with NixOS host/modules for control-plane + GPU workers, plus K3S bootstrap manifests for Cilium/ArgoCD.
  • Adds ArgoCD “App-of-Apps” structure with many workload Application.yaml definitions and a few placeholder/custom components.

Reviewed changes

Copilot reviewed 65 out of 65 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| usb-nixos-installer/README.md | Documents the minimal USB installer flow and contents. |
| usb-nixos-installer/nixos/installer/configuration.nix | NixOS installer ISO configuration and package set. |
| usb-nixos-installer/flake.nix | Standalone flake producing installer-iso and a devshell. |
| full-ai-cluster/usb-nixos-installer/README.md | Copy of USB installer README bundled under full cluster. |
| full-ai-cluster/usb-nixos-installer/nixos/installer/configuration.nix | Copy of installer ISO configuration under full cluster. |
| full-ai-cluster/usb-nixos-installer/flake.nix | Copy of USB installer flake under full cluster. |
| full-ai-cluster/README.md | End-to-end bootstrap and architecture documentation for the full cluster. |
| full-ai-cluster/nixos/modules/local-storage.nix | Declares local-path-provisioner storage class as a K3S manifest. |
| full-ai-cluster/nixos/modules/k3s-server.nix | K3S server configuration for Cilium takeover + bootstrap manifests. |
| full-ai-cluster/nixos/modules/k3s-agent.nix | K3S agent configuration aligned with Cilium takeover. |
| full-ai-cluster/nixos/modules/gpu.nix | NVIDIA driver + container toolkit + node labeling. |
| full-ai-cluster/nixos/modules/gpu-passthrough.nix | VFIO/libvirt/QEMU plumbing for optional GPU passthrough VMs. |
| full-ai-cluster/nixos/modules/gpu-device-plugin.nix | Installs vendor GPU device-plugin DaemonSets via K3S manifests. |
| full-ai-cluster/nixos/modules/docker.nix | Enables Docker (rootless) and related CLI tooling. |
| full-ai-cluster/nixos/modules/common.nix | Shared baseline configuration for all cluster hosts. |
| full-ai-cluster/nixos/hosts/worker-gpu/README.md | Worker template documentation and scaling instructions. |
| full-ai-cluster/nixos/hosts/worker-gpu/hardware-configuration.nix | Placeholder hardware config for worker template. |
| full-ai-cluster/nixos/hosts/worker-gpu/configuration.nix | Worker template host config wiring modules together. |
| full-ai-cluster/nixos/hosts/control-plane/README.md | Control-plane documentation and verification steps. |
| full-ai-cluster/nixos/hosts/control-plane/hardware-configuration.nix | Placeholder hardware config for control-plane. |
| full-ai-cluster/nixos/hosts/control-plane/configuration.nix | Control-plane host config wiring server/bootstrap modules. |
| full-ai-cluster/k8s/bootstrap/root-application.yaml | ArgoCD root Application (App-of-Apps) pointing at workload Applications. |
| full-ai-cluster/k8s/bootstrap/cilium-namespace.yaml | Ensures required namespace exists before Cilium app sync. |
| full-ai-cluster/k8s/bootstrap/argocd-namespace.yaml | Creates the ArgoCD namespace for bootstrap install. |
| full-ai-cluster/k8s/bootstrap/argocd-install.yaml | Bootstraps ArgoCD via pinned remote manifest reference. |
| full-ai-cluster/k8s/applications/weaviate/Application.yaml | Weaviate Helm install with Ollama integration values. |
| full-ai-cluster/k8s/applications/warp/Application.yaml | Placeholder ArgoCD app for ambiguous “Warp” component. |
| full-ai-cluster/k8s/applications/vllm/deployment.yaml | Hand-rolled vLLM deployment/PVC/service manifests (replicas default 0). |
| full-ai-cluster/k8s/applications/vllm/Application.yaml | ArgoCD app pointing at the vLLM hand-rolled manifests. |
| full-ai-cluster/k8s/applications/vault/Application.yaml | Vault Helm install configuration (HA + raft). |
| full-ai-cluster/k8s/applications/temporal/Application.yaml | Temporal Helm install with persistence wiring stubbed for CockroachDB. |
| full-ai-cluster/k8s/applications/tempo/Application.yaml | Tempo Helm install with Longhorn-backed persistence. |
| full-ai-cluster/k8s/applications/sealed-secrets/Application.yaml | Sealed Secrets controller Helm install. |
| full-ai-cluster/k8s/applications/redis/Application.yaml | Redis Helm install expecting an existing auth Secret. |
| full-ai-cluster/k8s/applications/qwen-coder/configmap.yaml | Model metadata ConfigMap for Qwen Coder in models namespace. |
| full-ai-cluster/k8s/applications/qwen-coder/Application.yaml | ArgoCD app for the Qwen Coder metadata manifests. |
| full-ai-cluster/k8s/applications/oz/Application.yaml | Placeholder ArgoCD app for ambiguous “OZ” component. |
| full-ai-cluster/k8s/applications/orleans/statefulset.yaml | Skeleton Orleans silo StatefulSet (replicas default 0). |
| full-ai-cluster/k8s/applications/orleans/service.yaml | Services for Orleans silo/gateway/dashboard. |
| full-ai-cluster/k8s/applications/orleans/rbac.yaml | RBAC for Orleans Kubernetes clustering provider. |
| full-ai-cluster/k8s/applications/orleans/namespace.yaml | Orleans namespace with cluster labeling. |
| full-ai-cluster/k8s/applications/orleans/configmap.yaml | Orleans cluster config ConfigMap. |
| full-ai-cluster/k8s/applications/orleans/Application.yaml | ArgoCD app pointing at Orleans manifests. |
| full-ai-cluster/k8s/applications/open-policy-agent/Application.yaml | Gatekeeper (OPA) Helm install configuration. |
| full-ai-cluster/k8s/applications/ollama/Application.yaml | Ollama Helm install configured for NVIDIA GPU scheduling. |
| full-ai-cluster/k8s/applications/nats/Application.yaml | NATS Helm install with JetStream persistence. |
| full-ai-cluster/k8s/applications/mimir/Application.yaml | Mimir distributed Helm install (bundled MinIO enabled). |
| full-ai-cluster/k8s/applications/longhorn/Application.yaml | Longhorn Helm install as distributed block storage. |
| full-ai-cluster/k8s/applications/loki/Application.yaml | Loki Helm install scaffold configured for S3 storage. |
| full-ai-cluster/k8s/applications/kube-prometheus-stack/Application.yaml | kube-prometheus-stack Helm install with persistence settings. |
| full-ai-cluster/k8s/applications/istio/Application.yaml | Istio base chart install (CRDs) with follow-up apps noted. |
| full-ai-cluster/k8s/applications/hindsight/Application.yaml | Placeholder ArgoCD app for ambiguous “Hindsight” component. |
| full-ai-cluster/k8s/applications/hermes/deployment.yaml | Hermes placeholder deployment/service with env wiring for OZ/Ollama/vLLM. |
| full-ai-cluster/k8s/applications/hermes/Application.yaml | ArgoCD app pointing at Hermes manifests. |
| full-ai-cluster/k8s/applications/gitlab/Application.yaml | GitLab chart install values scaffold (runner enabled). |
| full-ai-cluster/k8s/applications/forgejo/Application.yaml | Forgejo chart install values scaffold. |
| full-ai-cluster/k8s/applications/deepseek-coder/configmap.yaml | Creates models namespace + Deepseek Coder metadata ConfigMap. |
| full-ai-cluster/k8s/applications/deepseek-coder/Application.yaml | ArgoCD app for Deepseek Coder metadata manifests. |
| full-ai-cluster/k8s/applications/dapr/Application.yaml | Dapr Helm install values scaffold. |
| full-ai-cluster/k8s/applications/cockroachdb/Application.yaml | CockroachDB Helm install values scaffold (3 replicas, TLS). |
| full-ai-cluster/k8s/applications/cilium/Application.yaml | Cilium Helm install values for KPR/Hubble/BPF masquerade. |
| full-ai-cluster/k8s/applications/argo-workflows/Application.yaml | Argo Workflows Helm install values scaffold. |
| full-ai-cluster/k8s/applications/argo-rollouts/Application.yaml | Argo Rollouts Helm install values scaffold. |
| full-ai-cluster/k8s/applications/alloy/Application.yaml | Grafana Alloy Helm install with inline collector config. |
| full-ai-cluster/flake.nix | Full cluster flake: installer + host configs + reusable modules + darwin linux-builder. |

Comment thread full-ai-cluster/usb-nixos-installer/README.md
Comment thread full-ai-cluster/README.md
Comment thread full-ai-cluster/README.md Outdated
Comment thread usb-nixos-installer/README.md
Comment thread full-ai-cluster/nixos/modules/local-storage.nix Outdated
Comment thread full-ai-cluster/nixos/modules/docker.nix
Comment thread full-ai-cluster/k8s/applications/warp/Application.yaml Outdated
Comment thread full-ai-cluster/nixos/modules/gpu-device-plugin.nix Outdated
Comment thread full-ai-cluster/usb-nixos-installer/flake.nix
  1. OZ      → OpenZiti (real chart wired)
  2. Hermes  → custom + cloud-only (Aaron: "we don't care about
               local right now only cloud"); local LLM endpoints
               kept commented-out for later phase
  3. Warp    → removed entirely (Aaron: "we do not need it")
  4. Hindsight → standalone helm chart for ArgoCD, agent persistent
                 memory for Hermes; TODO awaiting chart URL

Changes:

oz/Application.yaml:
  - Real OpenZiti controller helm chart
    (docs.openziti.io/helm-charts/ziti-controller v1.4.5)
  - longhorn persistence; ClusterIP service endpoints
  - Note: ziti-router lands in a sibling app when first edge router added

hermes/deployment.yaml:
  - Reoriented to cloud LLM providers (Anthropic / OpenAI / Bedrock)
  - API keys baked into image at build time via SOPS (per spec)
  - OZ transport: ziti-controller.openziti.svc.cluster.local:443
  - Hindsight memory: hindsight.hindsight.svc.cluster.local
  - Ollama/vLLM env vars kept commented for when local models return

warp/:
  - Directory + Application.yaml removed

hindsight/Application.yaml:
  - TODO comment block requesting:
      repoURL (Helm repo)
      chart name
      version
  - Example shape provided in the comment so the swap is trivial
  - namespace.yaml placeholder declares the namespace
    (`zeta.io/integrates-with: hermes`)

README.md:
  - Updated component-tree annotations to reflect the 4 changes
  - "Component status" reorganized into ✅ confirmed / 🟡 custom /
    ⏳ deferred (local models) / ❓ awaiting input (Hindsight chart)

The 29-Application set covers the original 30-component spec minus Warp
(removed by maintainer request).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 803fcbe07f


Comment thread full-ai-cluster/k8s/bootstrap/argocd-install.yaml Outdated
P1 — Cilium chicken-and-egg
  k3s-server.nix disables flannel + kube-proxy so K3S brings up NO
  CNI on boot. Without a CNI no pods can schedule (including ArgoCD
  itself), so ArgoCD-installs-Cilium-later is impossible.

  Fix: added k8s/bootstrap/cilium-install.yaml (kustomize ref to
  Cilium v1.16.5 install manifest) + wired into K3S manifests list
  BEFORE argocd-install. Boot sequence is now:
    cilium-namespace → cilium-install → argocd-namespace →
    argocd-install → root-application
  ArgoCD's own Cilium Application then adopts the running install
  and reconciles ongoing changes.

P1 — k3s-agent.nix server-only flags
  --flannel-backend=none, --disable-kube-proxy, and
  --disable-network-policy are SERVER-side flags. K3S rejects them
  on agents with "flag not supported". Removed from agent extraFlags;
  documented in a comment that the server-side flags disable CNI
  cluster-wide.

P1 — README path errors (3 locations)
  full-ai-cluster/usb-nixos-installer/README.md and
  full-ai-cluster/usb-nixos-installer/flake.nix both said
  `../full-ai-cluster/` which from inside full-ai-cluster/usb-nixos-installer/
  resolves to full-ai-cluster/full-ai-cluster/ (non-existent).
  Replaced both with absolute GitHub URLs that work regardless of
  which copy of the file is being read. Mirrored to the top-level
  usb-nixos-installer/ copy to keep them byte-identical.

P1 — full-ai-cluster/README.md missing cilium-namespace.yaml in tree
  Tree view at line 27 listed argocd-namespace + argocd-install +
  root-application but skipped cilium-namespace.yaml. Added
  cilium-namespace AND the new cilium-install with ordering comment.

P1 — full-ai-cluster/README.md "Cluster layer" path typo
  Said `./k8s/applications/root-application.yaml` but the file is
  at `./k8s/bootstrap/root-application.yaml`. Fixed.

P1 — usb-nixos-installer/README.md "pinned by revision" claim
  README claimed inputs were pinned by revision, but no flake.lock
  is committed. Updated to say flake.nix references inputs by Git
  branch + explicit instruction for the first maintainer with Nix
  to run `nix flake update` and commit the resulting flake.lock.
  Mirrored to the duplicate.

P1 (security) — Grafana adminPassword: changeme hardcoded
  Replaced with `grafana.admin.existingSecret: grafana-admin-credentials`
  reference + commented-out kubectl create / Sealed Secret pattern.

P1 (security) — local-path-provisioner unquoted $VOL_DIR
  setup + teardown scripts had `path=$VOL_DIR` and `mkdir/rm` calls
  using unquoted vars. Added quoting + non-empty check + allowlist
  check that VOL_DIR resolves under /var/lib/zeta-local-storage/
  before mkdir or rm. Defense against an empty/exotic VOL_DIR
  causing /var/lib/zeta-local-storage/ wide-open writes or rms.
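
The quoting, non-empty, and allowlist checks described above can be sketched as follows. This is an illustrative sketch, not the PR's actual script: the function name `validate_vol_dir` is hypothetical, and the check is purely lexical (prefix match plus rejection of `..` components) rather than a symlink-resolving one.

```shell
# Hypothetical helper (not from the PR): accept VOL_DIR only when it is
# non-empty, free of ".." components, and strictly below the allowed root.
ALLOWED_ROOT="/var/lib/zeta-local-storage"

validate_vol_dir() {
  vol_dir="$1"

  # Reject an empty value outright.
  [ -n "$vol_dir" ] || return 1

  # Reject any ".." path component so the prefix check cannot be escaped.
  case "$vol_dir" in
    */../*|*/..|../*|..) return 1 ;;
  esac

  # Require at least one character after "$ALLOWED_ROOT/", so the root
  # directory itself is never a valid target for mkdir or rm.
  case "$vol_dir" in
    "$ALLOWED_ROOT"/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example wiring (every expansion quoted):
#   validate_vol_dir "$VOL_DIR" && mkdir -m 0777 -p "$VOL_DIR"
```

A lexical check keeps the sketch portable to minimal shells; a stricter variant could resolve symlinks first, at the cost of depending on a tool such as `realpath`.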

P1 (security) — docker module added zeta to `docker` group
  Removed `users.users.zeta.extraGroups = [ "docker" ];`. Membership
  in the docker group is root-on-host equivalent via the daemon
  socket. With rootless Docker enabled (already in the module),
  zeta gets their own rootless socket at $XDG_RUNTIME_DIR/docker.sock
  — that's sufficient for normal use. For maintainer tasks needing
  the system daemon: `sudo docker`.

P1 — gpu-device-plugin.nix referenced non-existent ArgoCD app
  Comment claimed ArgoCD would take ownership via
  k8s/applications/gpu-device-plugin/Application.yaml but no such
  directory exists. Rewrote the comment to reflect the actual
  state: NixOS-managed manifest only, bumping versions is a
  nixos-rebuild operation.

Stale findings (no fix needed, will resolve as outdated):
  - warp/Application.yaml: file was deleted in the previous commit;
    Copilot flagged the now-non-existent file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 25, 2026 06:10
Codex P1: services.k3s.manifests applies files as plain Kubernetes
resources — it does NOT run kustomize build. Both
argocd-install.yaml and cilium-install.yaml used
kustomize.config.k8s.io/v1beta1 Kustomization which is a BUILD-TIME
spec, not a K8s API resource. K3S would try to apply a Kustomization
CR (no such CRD installed in a fresh cluster), fail, and the actual
ArgoCD + Cilium would never come up.

Switched both to K3S's native helm.cattle.io/v1 HelmChart CR. K3S's
built-in Helm Controller fetches the chart at install time and
applies the rendered manifests automatically. Same end result as
kustomize-remote-resource, but in a format K3S's manifest applier
actually understands.
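
For readers unfamiliar with the `HelmChart` resource, its general shape can be sketched with a small generator. This is not the PR's `cilium-install.yaml`, whose contents are not shown here: the function name `emit_cilium_helmchart`, the `kube-system` placement, and the values under `valuesContent` are assumptions for illustration, and a working install would likely need further values (for example the API server host and port, and a native-routing CIDR).

```shell
# Hypothetical sketch (not the PR's file): print a K3S HelmChart resource
# for Cilium to stdout. The chart version defaults to the one named in the
# commit message; all values below are illustrative assumptions.
emit_cilium_helmchart() {
  version="${1:-1.16.5}"
  cat <<EOF
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: cilium
  namespace: kube-system
spec:
  repo: https://helm.cilium.io/
  chart: cilium
  version: ${version}
  targetNamespace: kube-system
  valuesContent: |-
    kubeProxyReplacement: true
    routingMode: native
    bpf:
      masquerade: true
    hubble:
      relay:
        enabled: true
      ui:
        enabled: true
EOF
}
```

Because this is a real API resource handled by the K3S Helm controller, it can be applied as a plain manifest, which is the property the kustomize-based files lacked.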

Cilium pinned to chart v1.16.5 with full KPR + Hubble (relay + UI)
+ BPF MASQUERADE + native routing — matches the ArgoCD Cilium
Application values exactly, so ArgoCD's eventual adoption sees a
matching configuration.

ArgoCD pinned to chart v7.7.10 (argo-cd helm chart) with
ClusterIP service + single-replica controller for the
single-control-plane bootstrap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 66 out of 66 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

full-ai-cluster/k8s/bootstrap/cilium-install.yaml:43

  • The comment says this bootstrap manifest is generated via helm template with specific settings (kube-proxy replacement, k8sServiceHost/Port, native routing, etc.), but the kustomize resource points at the upstream templates.yaml URL. To keep bootstrap behavior reproducible (and aligned with the required K3S flags like --disable-kube-proxy), either commit the rendered manifest that matches those values or update the comments/approach so it’s clear what configuration is actually being applied at bootstrap time.

Comment thread full-ai-cluster/k8s/applications/oz/Application.yaml Outdated
Comment thread full-ai-cluster/k8s/applications/hindsight/Application.yaml Outdated
Comment thread full-ai-cluster/nixos/modules/k3s-server.nix Outdated
Comment thread full-ai-cluster/nixos/modules/k3s-server.nix
Comment thread full-ai-cluster/k8s/applications/ollama/Application.yaml Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 042997e45f


Comment thread full-ai-cluster/k8s/applications/ollama/Application.yaml Outdated
Comment thread full-ai-cluster/k8s/bootstrap/root-application.yaml
@AceHack AceHack enabled auto-merge (squash) May 25, 2026 06:27
P1 (security) — OZ adminPassword hardcoded:
  oz/Application.yaml `adminPassword: "changeme"` replaced with
  `adminSecret: { name: ziti-admin-credentials, key: password }`.
  Maintainer creates the secret BEFORE syncing this app (one-liner
  kubectl + openssl in comment).
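
The secret-creation step can be sketched like this. It is a guess at the shape of the one-liner, not a copy of the comment in `oz/Application.yaml`: the function name `create_ziti_admin_secret` and the `KUBECTL` override are hypothetical, and the `openziti` namespace is inferred from the `ziti-controller.openziti.svc.cluster.local` address used elsewhere in this PR.

```shell
# Hypothetical helper (not from the PR): create the ziti-admin-credentials
# Secret with a "password" key before the OZ Application is synced.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command.
create_ziti_admin_secret() {
  namespace="${1:-openziti}"
  # Generate a random password only when the caller did not supply one.
  password="${2:-$(openssl rand -hex 32)}"

  "${KUBECTL:-kubectl}" create secret generic ziti-admin-credentials \
    --namespace "$namespace" \
    --from-literal=password="$password"
}
```

Passing the password as a command-line argument is convenient for a one-off bootstrap but exposes it to the local process list; a Sealed Secret avoids that.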

P1 — hindsight TODO named attribution:
  TODO(aaron) → TODO(maintainer); "per Aaron" → removed.
  Conforms to repo convention "no name attribution in current-state
  surfaces" — same fix shape as a prior PR's same finding.

P1 — k3s-server.nix references non-existent MetalLB/ingress-nginx apps:
  Comment around --disable=servicelb said "ArgoCD installs MetalLB +
  ingress-nginx as Applications" but no such apps exist in this PR.
  Rewrote to name the bootstrap-period gap (LoadBalancer Services
  stay Pending; use NodePort or port-forward during bootstrap) and
  defer the MetalLB/ingress-nginx Application authoring to a
  future commit.

P1 (security) — k3s-server.nix opened etcd ports at host firewall:
  2379 + 2380 were in allowedTCPPorts. K3S embedded etcd binds
  127.0.0.1 by default — opening them at the host firewall risks
  exposing etcd to the LAN if the bind address ever drifts.
  Removed both; added a comment explaining the rationale and the
  per-host-scoped pattern for multi-server HA.

P1 — Ollama auto-pulled 33B/32B models on first sync:
  Application enabled `automated: true` + had `models.pull` +
  `models.run` configured for deepseek-coder:33b + qwen2.5-coder:32b
  + replicas left at default (= 1). On first sync ArgoCD would
  immediately spin up GPU-resourced pods + pull ~60+ GB of model
  weights — directly conflicting with the README's "DEFERRED
  local-models phase" status.

  Fix:
    - replicaCount: 0 in chart values
    - models.pull / models.run lists emptied (kept as commented
      examples for when the local phase comes back)
    - syncPolicy set to manual-only (no `automated:` block).
      ArgoCD won't reconcile until `argocd app sync ollama`
      is run explicitly by a maintainer.
  Same gating applied to vllm/Application.yaml (also deferred).

P2 — Both GitLab and Forgejo would reconcile simultaneously:
  Root App-of-Apps picks up every */Application.yaml. Without
  gating, both GitLab and Forgejo would spin up + fight over
  cluster resources. Set GitLab as default-on; Forgejo gets
  `automated:` stripped (manual sync only). Header comment on
  Forgejo names the swap procedure. Root-application.yaml header
  documents the either/or pattern for future maintainer reference.
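
One way to audit which Applications are gated behind manual sync is to list the files that have no active `automated:` line. This is a rough sketch under assumptions: `manual_sync_apps` is a hypothetical name, and it relies on the `*/Application.yaml` layout described above and on commented-out `automated:` lines starting with `#`.

```shell
# Hypothetical helper (not from the PR): print every Application.yaml under
# the given directory that has no uncommented "automated:" line, i.e. the
# apps that ArgoCD will only sync when asked to.
manual_sync_apps() {
  apps_dir="$1"
  # -L lists files with no matching line; the anchored pattern skips
  # lines where "automated:" is commented out.
  grep -L -E '^[[:space:]]*automated:' "$apps_dir"/*/Application.yaml
}

# Example (assumption: run from the full-ai-cluster/ directory):
#   manual_sync_apps k8s/applications
```

After this commit the list would be expected to include at least the Ollama, vLLM, and Forgejo Applications.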

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AceHack AceHack merged commit 93bbf04 into main May 25, 2026
26 checks passed
@AceHack AceHack deleted the ai-cluster-bootstrap branch May 25, 2026 06:40

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dc8d92a9b7


Comment on lines +23 to +25
initialRootPassword:
  secret: gitlab-initial-root-password
  key: password

P2: Provide GitLab root-password secret in bootstrap state

On a fresh cluster bootstrap, this app is auto-synced by k8s/bootstrap/root-application.yaml, but gitlab is configured to read global.initialRootPassword.secret from gitlab-initial-root-password without any manifest in this commit creating that Secret. In that default path, the chart cannot fully reconcile until an operator manually creates the Secret, so the initial declarative bring-up is left degraded/non-reproducible. Either commit a SealedSecret/Vault-backed secret resource for this name or gate GitLab behind manual sync until credentials are provisioned.
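
For reference, the chart setting quoted above expects a Secret of roughly this shape. This is a sketch only: the `gitlab` namespace is an assumption, and a plaintext value should not be committed to Git. In a declarative bootstrap the same name and key would be produced by a SealedSecret or a Vault-backed resource, as the comment suggests.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: gitlab-initial-root-password   # must match global.initialRootPassword.secret
  namespace: gitlab                    # assumption
type: Opaque
stringData:
  password: "<set-out-of-band>"        # key must match global.initialRootPassword.key
```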


AceHack added a commit that referenced this pull request May 25, 2026
…trate-stale-superseded empirical anchor (#4914)

* shard(2026-05-25/1009Z): cold-boot — sentinel-fired-AGAIN + lior-substrate-stale-superseded empirical anchor

Second 2026-05-25 fresh-session in this lane (after PR #4911 0613Z anchor).
Substantively new observations:

- Sentinel empty at cold-boot AGAIN (catch-43 fired AGAIN, ~4h after PR #4911
  also re-armed at 0613Z); pattern: per-session non-persistence is the dominant
  mechanism, NOT 3-day auto-expire
- 0 stuck git procs sustained ~30h since 2026-05-24 0407Z first-0-procs reading;
  dotgit-recovered remains stable
- Cold-boot landed on peer Lior's `lior-pr-preservation-rebased` (7th+ occurrence
  of "lands on whoever-was-last-active's branch" failure mode)
- NEW anchor: Lior's branch stages 70 `full-ai-cluster/*` files that ALREADY
  landed on origin/main via PRs #4910/#4912/#4913 — substrate-drift class
  observation; pr-triage-tiers.md Tier 1 disposition applies if Lior's branch
  is ever pushed as a PR

Disposition: shard via isolated worktree on origin/main (lane-discipline preserved
Lior's branch); did NOT touch Lior's WIP.

Composes with the multi-anchor saturation-recovery series + the substrate-drift
class documented in pr-triage-tiers.md + the catch-43 reliability of the
autonomous-loop discipline.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(shard): correct relative-path depth (4→6) + MD038 code-span trailing-space

- 7 tick-shard relative-paths used `../../../../` (4 levels) but tick shards
  at `docs/hygiene-history/ticks/YYYY/MM/DD/` are 6 levels deep; fixed to
  `../../../../../../` to match the canonical pattern
- MD038 markdownlint: `A ` code span had trailing space inside backticks;
  reworded to `\`A\` + space` to preserve the semantic (git status 2-char
  code: first char `A`, second char space = staged-added with working tree
  matching index) without violating MD038

Both lint failures verified clean locally via markdownlint-cli2.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(shard): prepend 6-column pipe-row schema header + correct 105→70 staged count

Addresses 2 real reviewer findings on PR #4914:

1. P0 Copilot — schema violation: first non-empty line was H1, but per
   docs/hygiene-history/ticks/README.md the validator at
   tools/hygiene/check-tick-history-shard-schema.ts requires a 6-column
   pipe-row first. Prepended the canonical pipe-row carrying ISO timestamp,
   model id, cron sentinel, body summary, PR ref, and observation. H1 body
   preserved below (hybrid format per B-0529 Recommendation Option 3).

2. P2 Codex + Copilot — arithmetic discrepancy: shard reported "105 staged
   full-ai-cluster/* additions" but the verification table later reported
   70 staged. Re-counted via `git status --short | awk | sort | uniq -c`:
   70 staged-A (all full-ai-cluster/*) + 35 untracked-?? (other Lior WIP) +
   8 modified-M (7 PR-disc + 1 settings.json) = 113 total. The 105 was an
   arithmetic error (113 - 8 = 105, i.e. 70 staged + 35 untracked, all
   counted as staged). Corrected to substrate-honest breakdown.

Both fixes verified locally:
- Schema validator: shard passes (453 pre-existing violations in other
  shards — not introduced by this PR)
- markdownlint: clean

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(shard): split provenance commands per table metric

Addresses P2 Codex finding on PR #4914: prior shard text claimed both table
rows were verified via `git ls-tree -r origin/main full-ai-cluster/`, but
that command only counts files on origin/main (first row). The second row
(staged on Lior's branch) requires a different command (`git status --short
| grep "full-ai-cluster" | wc -l` run against the contested root checkout
still on lior-pr-preservation-rebased).

Replaced single-command claim with a per-row verification column that names
the actual command used for each metric. Both verified counts already
landed correctly (70 each); only the provenance attribution was off.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
AceHack pushed a commit that referenced this pull request May 25, 2026