feat(ai-cluster-bootstrap): two-directory declarative AI cluster scaffold#4910
Adds two clean separate top-level directories on branch
ai-cluster-bootstrap per Aaron's spec:
usb-nixos-installer/ (3 files — USB-only, nothing extra)
- README.md scoped to USB bootstrap only
- flake.nix that produces installer-iso
- nixos/installer/configuration.nix with package list
full-ai-cluster/ (62 files — end-to-end cluster)
- usb-nixos-installer/ — byte-identical copy of the standalone
directory (the USB part as the start snippet)
- flake.nix — cluster flake (installer + per-host configs +
nix-darwin linux-builder for Apple Silicon maintainers)
- README.md — full bootstrap runbook
- nixos/modules/ — 8 modules:
common, k3s-server, k3s-agent, gpu, gpu-passthrough,
gpu-device-plugin (NVIDIA + AMD + Intel), docker,
local-storage
- nixos/hosts/control-plane, worker-gpu — configuration.nix
+ placeholder hardware-configuration.nix + README per host
- k8s/bootstrap/ — 4 manifests K3S auto-applies on first boot:
cilium-namespace, argocd-namespace, argocd-install (pinned),
root-application (App-of-Apps)
- k8s/applications/ — 30 ArgoCD Application.yaml + supporting
manifests covering every component in the spec
Component composition:
NixFlake layer (OS): K3S, Cilium-host-prep, Docker, local-path
storage class, GPU drivers + container toolkit, GPU passthrough
(VFIO), GPU device plugins for K8s (NVIDIA/AMD/Intel)
ArgoCD layer (cluster): Cilium (KPR + Hubble + BPF MASQUERADE),
Orleans, Temporal (TS), Dapr Actors, GitLab, Forgejo, Argo
Workflows + Rollouts, Longhorn, CockroachDB, Hindsight, OZ,
Hermes (+ OZ integration via env vars), Warp, Ollama, vLLM,
Deepseek Coder, Qwen Coder, kube-prometheus-stack
(Prometheus + Grafana), NATS, Redis, Weaviate, Loki, Tempo,
Alloy, Mimir, Istio (base), Open Policy Agent (Gatekeeper),
Sealed Secrets, Vault
Cilium displaces K3S's default networking — k3s-server.nix passes
--flannel-backend=none, --disable-network-policy, and
--disable-kube-proxy so Cilium owns CNI + KPR + policy end-to-end.
Ambiguous components — placeholder Application.yamls with TODO(aaron)
markers + flagged in README "Component status" + PR description:
- OZ (OpenZiti? Auth0? Aaron-specific?)
- Hermes (Cosmos IBC relayer? Aaron AI-agent? Hermes integrates
with OZ + has SOPS-baked-into-image secrets + talks to Ollama/vLLM
per spec — the deployment.yaml wires the env-var structure for
when the image lands)
- Warp (Cloudflare? Warp terminal? Dagger Warp engine?)
- Hindsight (Lockheed Martin OTel processor? other?)
Sharpen each by editing repoURL + chart in the relevant
Application.yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 220a09b273
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Pull request overview
Adds a two-directory, declarative Nix-based AI cluster scaffold: a minimal USB NixOS installer flake and a full end-to-end cluster flake that bootstraps K3S + ArgoCD and declaratively installs a broad set of workloads via ArgoCD Applications.
Changes:
- Introduces a standalone `usb-nixos-installer/` flake for building a bootable NixOS installer ISO.
- Adds `full-ai-cluster/` flake with NixOS host/modules for control-plane + GPU workers, plus K3S bootstrap manifests for Cilium/ArgoCD.
- Adds ArgoCD “App-of-Apps” structure with many workload `Application.yaml` definitions and a few placeholder/custom components.
Reviewed changes
Copilot reviewed 65 out of 65 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| usb-nixos-installer/README.md | Documents the minimal USB installer flow and contents. |
| usb-nixos-installer/nixos/installer/configuration.nix | NixOS installer ISO configuration and package set. |
| usb-nixos-installer/flake.nix | Standalone flake producing installer-iso and a devshell. |
| full-ai-cluster/usb-nixos-installer/README.md | Copy of USB installer README bundled under full cluster. |
| full-ai-cluster/usb-nixos-installer/nixos/installer/configuration.nix | Copy of installer ISO configuration under full cluster. |
| full-ai-cluster/usb-nixos-installer/flake.nix | Copy of USB installer flake under full cluster. |
| full-ai-cluster/README.md | End-to-end bootstrap and architecture documentation for the full cluster. |
| full-ai-cluster/nixos/modules/local-storage.nix | Declares local-path-provisioner storage class as a K3S manifest. |
| full-ai-cluster/nixos/modules/k3s-server.nix | K3S server configuration for Cilium takeover + bootstrap manifests. |
| full-ai-cluster/nixos/modules/k3s-agent.nix | K3S agent configuration aligned with Cilium takeover. |
| full-ai-cluster/nixos/modules/gpu.nix | NVIDIA driver + container toolkit + node labeling. |
| full-ai-cluster/nixos/modules/gpu-passthrough.nix | VFIO/libvirt/QEMU plumbing for optional GPU passthrough VMs. |
| full-ai-cluster/nixos/modules/gpu-device-plugin.nix | Installs vendor GPU device-plugin DaemonSets via K3S manifests. |
| full-ai-cluster/nixos/modules/docker.nix | Enables Docker (rootless) and related CLI tooling. |
| full-ai-cluster/nixos/modules/common.nix | Shared baseline configuration for all cluster hosts. |
| full-ai-cluster/nixos/hosts/worker-gpu/README.md | Worker template documentation and scaling instructions. |
| full-ai-cluster/nixos/hosts/worker-gpu/hardware-configuration.nix | Placeholder hardware config for worker template. |
| full-ai-cluster/nixos/hosts/worker-gpu/configuration.nix | Worker template host config wiring modules together. |
| full-ai-cluster/nixos/hosts/control-plane/README.md | Control-plane documentation and verification steps. |
| full-ai-cluster/nixos/hosts/control-plane/hardware-configuration.nix | Placeholder hardware config for control-plane. |
| full-ai-cluster/nixos/hosts/control-plane/configuration.nix | Control-plane host config wiring server/bootstrap modules. |
| full-ai-cluster/k8s/bootstrap/root-application.yaml | ArgoCD root Application (App-of-Apps) pointing at workload Applications. |
| full-ai-cluster/k8s/bootstrap/cilium-namespace.yaml | Ensures required namespace exists before Cilium app sync. |
| full-ai-cluster/k8s/bootstrap/argocd-namespace.yaml | Creates the ArgoCD namespace for bootstrap install. |
| full-ai-cluster/k8s/bootstrap/argocd-install.yaml | Bootstraps ArgoCD via pinned remote manifest reference. |
| full-ai-cluster/k8s/applications/weaviate/Application.yaml | Weaviate Helm install with Ollama integration values. |
| full-ai-cluster/k8s/applications/warp/Application.yaml | Placeholder ArgoCD app for ambiguous “Warp” component. |
| full-ai-cluster/k8s/applications/vllm/deployment.yaml | Hand-rolled vLLM deployment/PVC/service manifests (replicas default 0). |
| full-ai-cluster/k8s/applications/vllm/Application.yaml | ArgoCD app pointing at the vLLM hand-rolled manifests. |
| full-ai-cluster/k8s/applications/vault/Application.yaml | Vault Helm install configuration (HA + raft). |
| full-ai-cluster/k8s/applications/temporal/Application.yaml | Temporal Helm install with persistence wiring stubbed for CockroachDB. |
| full-ai-cluster/k8s/applications/tempo/Application.yaml | Tempo Helm install with Longhorn-backed persistence. |
| full-ai-cluster/k8s/applications/sealed-secrets/Application.yaml | Sealed Secrets controller Helm install. |
| full-ai-cluster/k8s/applications/redis/Application.yaml | Redis Helm install expecting an existing auth Secret. |
| full-ai-cluster/k8s/applications/qwen-coder/configmap.yaml | Model metadata ConfigMap for Qwen Coder in models namespace. |
| full-ai-cluster/k8s/applications/qwen-coder/Application.yaml | ArgoCD app for the Qwen Coder metadata manifests. |
| full-ai-cluster/k8s/applications/oz/Application.yaml | Placeholder ArgoCD app for ambiguous “OZ” component. |
| full-ai-cluster/k8s/applications/orleans/statefulset.yaml | Skeleton Orleans silo StatefulSet (replicas default 0). |
| full-ai-cluster/k8s/applications/orleans/service.yaml | Services for Orleans silo/gateway/dashboard. |
| full-ai-cluster/k8s/applications/orleans/rbac.yaml | RBAC for Orleans Kubernetes clustering provider. |
| full-ai-cluster/k8s/applications/orleans/namespace.yaml | Orleans namespace with cluster labeling. |
| full-ai-cluster/k8s/applications/orleans/configmap.yaml | Orleans cluster config ConfigMap. |
| full-ai-cluster/k8s/applications/orleans/Application.yaml | ArgoCD app pointing at Orleans manifests. |
| full-ai-cluster/k8s/applications/open-policy-agent/Application.yaml | Gatekeeper (OPA) Helm install configuration. |
| full-ai-cluster/k8s/applications/ollama/Application.yaml | Ollama Helm install configured for NVIDIA GPU scheduling. |
| full-ai-cluster/k8s/applications/nats/Application.yaml | NATS Helm install with JetStream persistence. |
| full-ai-cluster/k8s/applications/mimir/Application.yaml | Mimir distributed Helm install (bundled MinIO enabled). |
| full-ai-cluster/k8s/applications/longhorn/Application.yaml | Longhorn Helm install as distributed block storage. |
| full-ai-cluster/k8s/applications/loki/Application.yaml | Loki Helm install scaffold configured for S3 storage. |
| full-ai-cluster/k8s/applications/kube-prometheus-stack/Application.yaml | kube-prometheus-stack Helm install with persistence settings. |
| full-ai-cluster/k8s/applications/istio/Application.yaml | Istio base chart install (CRDs) with follow-up apps noted. |
| full-ai-cluster/k8s/applications/hindsight/Application.yaml | Placeholder ArgoCD app for ambiguous “Hindsight” component. |
| full-ai-cluster/k8s/applications/hermes/deployment.yaml | Hermes placeholder deployment/service with env wiring for OZ/Ollama/vLLM. |
| full-ai-cluster/k8s/applications/hermes/Application.yaml | ArgoCD app pointing at Hermes manifests. |
| full-ai-cluster/k8s/applications/gitlab/Application.yaml | GitLab chart install values scaffold (runner enabled). |
| full-ai-cluster/k8s/applications/forgejo/Application.yaml | Forgejo chart install values scaffold. |
| full-ai-cluster/k8s/applications/deepseek-coder/configmap.yaml | Creates models namespace + Deepseek Coder metadata ConfigMap. |
| full-ai-cluster/k8s/applications/deepseek-coder/Application.yaml | ArgoCD app for Deepseek Coder metadata manifests. |
| full-ai-cluster/k8s/applications/dapr/Application.yaml | Dapr Helm install values scaffold. |
| full-ai-cluster/k8s/applications/cockroachdb/Application.yaml | CockroachDB Helm install values scaffold (3 replicas, TLS). |
| full-ai-cluster/k8s/applications/cilium/Application.yaml | Cilium Helm install values for KPR/Hubble/BPF masquerade. |
| full-ai-cluster/k8s/applications/argo-workflows/Application.yaml | Argo Workflows Helm install values scaffold. |
| full-ai-cluster/k8s/applications/argo-rollouts/Application.yaml | Argo Rollouts Helm install values scaffold. |
| full-ai-cluster/k8s/applications/alloy/Application.yaml | Grafana Alloy Helm install with inline collector config. |
| full-ai-cluster/flake.nix | Full cluster flake: installer + host configs + reusable modules + darwin linux-builder. |
1. OZ → OpenZiti (real chart wired)
2. Hermes → custom + cloud-only (Aaron: "we don't care about
local right now only cloud"); local LLM endpoints
kept commented-out for later phase
3. Warp → removed entirely (Aaron: "we do not need it")
4. Hindsight → standalone helm chart for ArgoCD, agent persistent
memory for Hermes; TODO awaiting chart URL
Changes:
oz/Application.yaml:
- Real OpenZiti controller helm chart
(docs.openziti.io/helm-charts/ziti-controller v1.4.5)
- longhorn persistence; ClusterIP service endpoints
- Note: ziti-router lands in a sibling app when first edge router added
hermes/deployment.yaml:
- Reoriented to cloud LLM providers (Anthropic / OpenAI / Bedrock)
- API keys baked into image at build time via SOPS (per spec)
- OZ transport: ziti-controller.openziti.svc.cluster.local:443
- Hindsight memory: hindsight.hindsight.svc.cluster.local
- Ollama/vLLM env vars kept commented for when local models return
warp/:
- Directory + Application.yaml removed
hindsight/Application.yaml:
- TODO comment block requesting:
repoURL (Helm repo)
chart name
version
- Example shape provided in the comment so the swap is trivial
- namespace.yaml placeholder declares the namespace
(`zeta.io/integrates-with: hermes`)
README.md:
- Updated component-tree annotations to reflect the 4 changes
- "Component status" reorganized into ✅ confirmed / 🟡 custom /
⏳ deferred (local models) / ❓ awaiting input (Hindsight chart)
The remaining 29 Applications cover the original 30-component spec
minus Warp (removed by maintainer request).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 803fcbe07f
P1 — Cilium chicken-and-egg
k3s-server.nix disables flannel + kube-proxy so K3S brings up NO
CNI on boot. Without a CNI no pods can schedule (including ArgoCD
itself), so ArgoCD-installs-Cilium-later is impossible.
Fix: added k8s/bootstrap/cilium-install.yaml (kustomize ref to
Cilium v1.16.5 install manifest) + wired into K3S manifests list
BEFORE argocd-install. Boot sequence is now:
cilium-namespace → cilium-install → argocd-namespace →
argocd-install → root-application
ArgoCD's own Cilium Application then adopts the running install
and reconciles ongoing changes.
P1 — k3s-agent.nix server-only flags
--flannel-backend=none, --disable-kube-proxy, and
--disable-network-policy are SERVER-side flags. K3S rejects them
on agents with "flag not supported". Removed from agent extraFlags;
documented in a comment that the server-side flags disable CNI
cluster-wide.
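The agent-side rule above can be checked mechanically. A minimal sketch, assuming a hypothetical lint helper named `check_agent_flags` that receives the agent's flags as arguments; the list of server-only flags is taken from the fix description, and the helper itself is not part of this PR:

```shell
#!/bin/sh
# Server-only K3S flags named in the fix description above.
SERVER_ONLY_FLAGS="--flannel-backend --disable-kube-proxy --disable-network-policy"

# Hypothetical lint helper (not part of the PR): returns non-zero if any
# argument is one of the server-only flags, with or without "=value".
check_agent_flags() {
  status=0
  for flag in "$@"; do
    name=${flag%%=*}   # strip any "=value" suffix before comparing
    for banned in $SERVER_ONLY_FLAGS; do
      if [ "$name" = "$banned" ]; then
        echo "server-only flag on agent: $flag" >&2
        status=1
      fi
    done
  done
  return "$status"
}
```

For example, `check_agent_flags --node-label=gpu=true` succeeds, while `check_agent_flags --flannel-backend=none` fails and names the offending flag on stderr.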
P1 — README path errors (3 locations)
full-ai-cluster/usb-nixos-installer/README.md and
full-ai-cluster/usb-nixos-installer/flake.nix both said
`../full-ai-cluster/` which from inside full-ai-cluster/usb-nixos-installer/
resolves to full-ai-cluster/full-ai-cluster/ (non-existent).
Replaced both with absolute GitHub URLs that work regardless of
which copy of the file is being read. Mirrored to the top-level
usb-nixos-installer/ copy to keep them byte-identical.
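The path arithmetic behind this finding can be reproduced with a small, purely lexical normalizer. This is a sketch under stated assumptions: `normalize_path` is a hypothetical helper, it only collapses `.` and `..` components, and it does not touch the filesystem or follow symlinks:

```shell
#!/bin/sh
# Hypothetical helper: lexically collapse "." and ".." in a relative path.
# Uses plain (global) variables to stay POSIX-sh compatible.
normalize_path() {
  out=""
  old_ifs=$IFS
  IFS=/
  set -f                      # no globbing while splitting on "/"
  for part in $1; do
    case "$part" in
      ''|.) ;;                # skip empty and "." components
      ..)                     # drop the last component, if any
        case "$out" in
          */*) out=${out%/*} ;;
          *)   out="" ;;
        esac ;;
      *) out=${out:+$out/}$part ;;
    esac
  done
  set +f
  IFS=$old_ifs
  printf '%s\n' "$out"
}

# The broken reference, as seen from inside full-ai-cluster/usb-nixos-installer/:
normalize_path 'full-ai-cluster/usb-nixos-installer/../full-ai-cluster/'
```

The call prints `full-ai-cluster/full-ai-cluster`, the non-existent directory described above, which is why absolute URLs were used instead.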
P1 — full-ai-cluster/README.md missing cilium-namespace.yaml in tree
Tree view at line 27 listed argocd-namespace + argocd-install +
root-application but skipped cilium-namespace.yaml. Added
cilium-namespace AND the new cilium-install with ordering comment.
P1 — full-ai-cluster/README.md "Cluster layer" path typo
Said `./k8s/applications/root-application.yaml` but the file is
at `./k8s/bootstrap/root-application.yaml`. Fixed.
P1 — usb-nixos-installer/README.md "pinned by revision" claim
README claimed inputs were pinned by revision, but no flake.lock
is committed. Updated to say flake.nix references inputs by Git
branch + explicit instruction for the first maintainer with Nix
to run `nix flake update` and commit the resulting flake.lock.
Mirrored to the duplicate.
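Whether a flake directory is actually pinned can be checked by looking for a committed `flake.lock` next to `flake.nix`. A minimal sketch, assuming a hypothetical helper named `flake_is_pinned`; it only tests for the files and does not inspect the lock contents:

```shell
#!/bin/sh
# Hypothetical check: a flake directory counts as pinned only when a
# flake.lock sits next to its flake.nix.
flake_is_pinned() {
  dir=${1:-.}
  [ -f "$dir/flake.nix" ] && [ -f "$dir/flake.lock" ]
}

# Example use (not executed here):
#   flake_is_pinned usb-nixos-installer || echo "run 'nix flake update' and commit flake.lock"
```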
P1 (security) — Grafana adminPassword: changeme hardcoded
Replaced with `grafana.admin.existingSecret: grafana-admin-credentials`
reference + commented-out kubectl create / Sealed Secret pattern.
P1 (security) — local-path-provisioner unquoted $VOL_DIR
setup + teardown scripts had `path=$VOL_DIR` and `mkdir/rm` calls
using unquoted vars. Added quoting + non-empty check + allowlist
check that VOL_DIR resolves under /var/lib/zeta-local-storage/
before mkdir or rm. Defense against an empty/exotic VOL_DIR
causing /var/lib/zeta-local-storage/ wide-open writes or rms.
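The three guards described above (quoting, non-empty check, allowlist check) can be sketched as follows. This is an illustration, not the PR's script: `validate_vol_dir` and `STORAGE_ROOT` are hypothetical names, and the check is lexical only, so it rejects `..` components rather than resolving symlinks:

```shell
#!/bin/sh
STORAGE_ROOT="/var/lib/zeta-local-storage"

# Hypothetical guard: succeed only if the argument is a non-empty path
# strictly below STORAGE_ROOT with no "." or ".." components.
validate_vol_dir() {
  vol_dir=$1
  # 1. Non-empty check.
  [ -n "$vol_dir" ] || return 1
  # 2. Reject traversal components such as "/../".
  case "/$vol_dir/" in
    */../*|*/./*) return 1 ;;
  esac
  # 3. Allowlist: root, a slash, then at least one more character.
  case "$vol_dir" in
    "$STORAGE_ROOT"/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use in a setup script (always quote the variable):
#   validate_vol_dir "$VOL_DIR" && mkdir -p "$VOL_DIR"
```

Requiring at least one character after the root means the storage root itself is never accepted as a volume directory, so a teardown cannot remove the whole tree.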
P1 (security) — docker module added zeta to `docker` group
Removed `users.users.zeta.extraGroups = [ "docker" ];`. Membership
in the docker group is root-on-host equivalent via the daemon
socket. With rootless Docker enabled (already in the module),
zeta gets their own rootless socket at $XDG_RUNTIME_DIR/docker.sock
— that's sufficient for normal use. For maintainer tasks needing
the system daemon: `sudo docker`.
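Pointing the Docker CLI at the per-user rootless socket is a matter of setting `DOCKER_HOST`. A minimal sketch, assuming a hypothetical helper named `rootless_docker_host`; the `/run/user/<uid>` fallback is an assumption for sessions where `XDG_RUNTIME_DIR` is unset:

```shell
#!/bin/sh
# Hypothetical helper: print the DOCKER_HOST value for the current user's
# rootless Docker socket.
rootless_docker_host() {
  # Assumed fallback when XDG_RUNTIME_DIR is not set by the session.
  runtime_dir=${XDG_RUNTIME_DIR:-/run/user/$(id -u)}
  printf 'unix://%s/docker.sock\n' "$runtime_dir"
}

# Example use (not executed here):
#   export DOCKER_HOST="$(rootless_docker_host)"
#   docker info
```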
P1 — gpu-device-plugin.nix referenced non-existent ArgoCD app
Comment claimed ArgoCD would take ownership via
k8s/applications/gpu-device-plugin/Application.yaml but no such
directory exists. Rewrote the comment to reflect the actual
state: NixOS-managed manifest only, bumping versions is a
nixos-rebuild operation.
Stale findings (no fix needed, will resolve as outdated):
- warp/Application.yaml: file was deleted in the previous commit;
Copilot flagged the now-non-existent file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex P1: services.k3s.manifests applies files as plain Kubernetes resources — it does NOT run kustomize build.
Both argocd-install.yaml and cilium-install.yaml used kustomize.config.k8s.io/v1beta1 Kustomization, which is a BUILD-TIME spec, not a K8s API resource. K3S would try to apply a Kustomization CR (no such CRD installed in a fresh cluster), fail, and the actual ArgoCD + Cilium would never come up.
Switched both to K3S's native helm.cattle.io/v1 HelmChart CR. K3S's built-in Helm Controller fetches the chart at install time and applies the rendered manifests automatically. Same end result as kustomize-remote-resource, but in a format K3S's manifest applier actually understands.
Cilium pinned to chart v1.16.5 with full KPR + Hubble (relay + UI) + BPF MASQUERADE + native routing — matches the ArgoCD Cilium Application values exactly, so ArgoCD's eventual adoption sees a matching configuration.
ArgoCD pinned to chart v7.7.10 (argo-cd helm chart) with ClusterIP service + single-replica controller for the single-control-plane bootstrap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
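For orientation, the general shape of a K3S HelmChart resource is sketched below as a shell function that prints the manifest. This is an illustration, not the PR's file: the function name `render_cilium_helmchart`, the chart repo URL, the `kube-system` namespaces, and the abbreviated values are all assumptions; only the `helm.cattle.io/v1` HelmChart kind, the 1.16.5 pin, and the feature list come from the commit message:

```shell
#!/bin/sh
# Hypothetical renderer: prints a HelmChart manifest in the shape K3S's
# Helm Controller consumes. Values are abbreviated for illustration.
render_cilium_helmchart() {
  cat <<'EOF'
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: cilium
  namespace: kube-system        # assumed: namespace watched by the Helm Controller
spec:
  repo: https://helm.cilium.io/ # assumed chart repository
  chart: cilium
  version: 1.16.5
  targetNamespace: kube-system  # assumed install namespace
  valuesContent: |-
    kubeProxyReplacement: true
    routingMode: native
    bpf:
      masquerade: true
    hubble:
      relay:
        enabled: true
      ui:
        enabled: true
EOF
}
```

Unlike a Kustomization, this is a real API object: K3S ships the HelmChart CRD, so the manifest applier can create it on a fresh cluster and the controller then performs the chart install.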
Pull request overview
Copilot reviewed 66 out of 66 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
full-ai-cluster/k8s/bootstrap/cilium-install.yaml:43
- The comment says this bootstrap manifest is generated via `helm template` with specific settings (kube-proxy replacement, k8sServiceHost/Port, native routing, etc.), but the kustomize resource points at the upstream `templates.yaml` URL. To keep bootstrap behavior reproducible (and aligned with the required K3S flags like `--disable-kube-proxy`), either commit the rendered manifest that matches those values or update the comments/approach so it’s clear what configuration is actually being applied at bootstrap time.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 042997e45f
P1 (security) — OZ adminPassword hardcoded:
oz/Application.yaml `adminPassword: "changeme"` replaced with
`adminSecret: { name: ziti-admin-credentials, key: password }`.
Maintainer creates the secret BEFORE syncing this app (one-liner
kubectl + openssl in comment).
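The pre-sync step can be sketched as a small wrapper around `kubectl create secret generic`. Assumptions are labeled: `create_admin_secret` and the `KUBECTL` override variable are hypothetical, and the `openziti` namespace in the usage line is inferred from the service DNS name quoted elsewhere in this PR rather than stated for this secret:

```shell
#!/bin/sh
# Hypothetical wrapper: create a generic Secret holding one "password" key.
# KUBECTL can be overridden (for example with "echo") for a dry look at
# the command that would run.
create_admin_secret() {
  namespace=$1
  name=$2
  password=$3
  "${KUBECTL:-kubectl}" create secret generic "$name" \
    --namespace "$namespace" \
    --from-literal=password="$password"
}

# Example use before syncing the app (not executed here):
#   create_admin_secret openziti ziti-admin-credentials "$(openssl rand -base64 32)"
```

Generating the password inline keeps it out of Git entirely; the Application only ever references the Secret by name and key.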
P1 — hindsight TODO named attribution:
TODO(aaron) → TODO(maintainer); "per Aaron" → removed.
Conforms to repo convention "no name attribution in current-state
surfaces" — same fix shape as a prior PR's same finding.
P1 — k3s-server.nix references non-existent MetalLB/ingress-nginx apps:
Comment around --disable=servicelb said "ArgoCD installs MetalLB +
ingress-nginx as Applications" but no such apps exist in this PR.
Rewrote to name the bootstrap-period gap (LoadBalancer Services
stay Pending; use NodePort or port-forward during bootstrap) and
defer the MetalLB/ingress-nginx Application authoring to a
future commit.
P1 (security) — k3s-server.nix opened etcd ports at host firewall:
2379 + 2380 were in allowedTCPPorts. K3S embedded etcd binds
127.0.0.1 by default — opening them at the host firewall risks
exposing etcd to the LAN if the bind address ever drifts.
Removed both; added a comment explaining the rationale and the
per-host-scoped pattern for multi-server HA.
P1 — Ollama auto-pulled 33B/32B models on first sync:
Application enabled `automated: true` + had `models.pull` +
`models.run` configured for deepseek-coder:33b + qwen2.5-coder:32b
+ replicas left at default (= 1). On first sync ArgoCD would
immediately spin up GPU-resourced pods + pull ~60+ GB of model
weights — directly conflicting with the README's "DEFERRED
local-models phase" status.
Fix:
- replicaCount: 0 in chart values
- models.pull / models.run lists emptied (kept as commented
examples for when the local phase comes back)
- syncPolicy set to manual-only (no `automated:` block).
ArgoCD won't reconcile until `argocd app sync ollama`
is run explicitly by a maintainer.
Same gating applied to vllm/Application.yaml (also deferred).
P2 — Both GitLab and Forgejo would reconcile simultaneously:
Root App-of-Apps picks up every */Application.yaml. Without
gating, both GitLab and Forgejo would spin up + fight over
cluster resources. Set GitLab as default-on; Forgejo gets
`automated:` stripped (manual sync only). Header comment on
Forgejo names the swap procedure. Root-application.yaml header
documents the either/or pattern for future maintainer reference.
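The manual-sync gating applied to Ollama, vLLM, and Forgejo can be guarded with a simple text check. A minimal sketch, assuming a hypothetical helper named `is_manual_sync`; it only looks for an uncommented `automated:` key and does not parse YAML:

```shell
#!/bin/sh
# Hypothetical guard: succeed when an Application manifest has no
# uncommented "automated:" key, i.e. it syncs only on explicit request.
is_manual_sync() {
  [ -f "$1" ] || return 2          # missing file is an error, not "manual"
  if grep -Eq '^[[:space:]]*automated:' "$1"; then
    return 1
  fi
  return 0
}

# Example use from full-ai-cluster/ (not executed here):
#   for app in ollama vllm forgejo; do
#     is_manual_sync "k8s/applications/$app/Application.yaml" \
#       || echo "$app would auto-sync" >&2
#   done
```

Commented-out examples such as `# automated:` do not trip the check, so the kept-for-later blocks in the deferred apps stay valid.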
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dc8d92a9b7
    initialRootPassword:
      secret: gitlab-initial-root-password
      key: password
Provide GitLab root-password secret in bootstrap state
On a fresh cluster bootstrap, this app is auto-synced by k8s/bootstrap/root-application.yaml, but gitlab is configured to read global.initialRootPassword.secret from gitlab-initial-root-password without any manifest in this commit creating that Secret. In that default path, the chart cannot fully reconcile until an operator manually creates the Secret, so the initial declarative bring-up is left degraded/non-reproducible. Either commit a SealedSecret/Vault-backed secret resource for this name or gate GitLab behind manual sync until credentials are provisioned.
…trate-stale-superseded empirical anchor (#4914)
* shard(2026-05-25/1009Z): cold-boot — sentinel-fired-AGAIN + lior-substrate-stale-superseded empirical anchor
Second 2026-05-25 fresh-session in this lane (after PR #4911 0613Z anchor). Substantively new observations:
- Sentinel empty at cold-boot AGAIN (catch-43 fired AGAIN, ~4h after PR #4911 also re-armed at 0613Z); pattern: per-session non-persistence is the dominant mechanism, NOT 3-day auto-expire
- 0 stuck git procs sustained ~30h since 2026-05-24 0407Z first-0-procs reading; dotgit-recovered remains stable
- Cold-boot landed on peer Lior's `lior-pr-preservation-rebased` (7th+ occurrence of "lands on whoever-was-last-active's branch" failure mode)
- NEW anchor: Lior's branch stages 70 `full-ai-cluster/*` files that ALREADY landed on origin/main via PRs #4910/#4912/#4913 — substrate-drift class observation; pr-triage-tiers.md Tier 1 disposition applies if Lior's branch ever pushed as PR
Disposition: shard via isolated worktree on origin/main (lane-discipline preserved Lior's branch); did NOT touch Lior's WIP.
Composes with the multi-anchor saturation-recovery series + the substrate-drift class documented in pr-triage-tiers.md + the catch-43 reliability of the autonomous-loop discipline.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(shard): correct relative-path depth (4→6) + MD038 code-span trailing-space
- 7 tick-shard relative-paths used `../../../../` (4 levels) but tick shards at `docs/hygiene-history/ticks/YYYY/MM/DD/` are 6 levels deep; fixed to `../../../../../../` to match the canonical pattern
- MD038 markdownlint: `A ` code span had trailing space inside backticks; reworded to `\`A\` + space` to preserve the semantic (git status 2-char code: first char `A`, second char space = staged-added with working tree matching index) without violating MD038
Both lint failures verified clean locally via markdownlint-cli2.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(shard): prepend 6-column pipe-row schema header + correct 105→70 staged count
Addresses 2 real reviewer findings on PR #4914:
1. P0 Copilot — schema violation: first non-empty line was H1, but per docs/hygiene-history/ticks/README.md the validator at tools/hygiene/check-tick-history-shard-schema.ts requires a 6-column pipe-row first. Prepended the canonical pipe-row carrying ISO timestamp, model id, cron sentinel, body summary, PR ref, and observation. H1 body preserved below (hybrid format per B-0529 Recommendation Option 3).
2. P2 Codex + Copilot — arithmetic discrepancy: shard reported "105 staged full-ai-cluster/* additions" but the verification table later reported 70 staged. Re-counted via `git status --short | awk | sort | uniq -c`: 70 staged-A (all full-ai-cluster/*) + 35 untracked-?? (other Lior WIP) + 8 modified-M (7 PR-disc + 1 settings.json) = 113 total. The 105 was an arithmetic error (113 - 8). Corrected to substrate-honest breakdown.
Both fixes verified locally:
- Schema validator: shard passes (453 pre-existing violations in other shards — not introduced by this PR)
- markdownlint: clean
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(shard): split provenance commands per table metric
Addresses P2 Codex finding on PR #4914: prior shard text claimed both table rows were verified via `git ls-tree -r origin/main full-ai-cluster/`, but that command only counts files on origin/main (first row). The second row (staged on Lior's branch) requires a different command (`git status --short | grep "full-ai-cluster" | wc -l` run against the contested root checkout still on lior-pr-preservation-rebased). Replaced single-command claim with a per-row verification column that names the actual command used for each metric. Both verified counts already landed correctly (70 each); only the provenance attribution was off.
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
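The re-count quoted above elides the awk program. One way to get the same per-status tally is sketched here; `count_status_codes` is a hypothetical helper that reads `git status --short` output on stdin and groups lines by their two-character status code, and it is not the command the commit author ran:

```shell
#!/bin/sh
# Hypothetical tally: count lines per two-character status code
# ("A ", "??", " M", ...) from `git status --short` output on stdin.
count_status_codes() {
  awk '{ counts[substr($0, 1, 2)]++ }
       END { for (c in counts) printf "[%s] %d\n", c, counts[c] }' \
    | LC_ALL=C sort
}

# Example use (not executed here):
#   git status --short | count_status_codes
```

With the figures reported in the commit message, the three groups would be 70 staged, 35 untracked, and 8 modified, which sum to 113.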
Summary
Two clean separate top-level directories on branch `ai-cluster-bootstrap` per Addison/Aaron's spec.
`usb-nixos-installer/` — USB-only, nothing extra
3 files:
`full-ai-cluster/` — end-to-end cluster
62 files. First, a byte-identical copy of the USB directory (the bootstrap snippet). Then the full stack:
NixFlake layer (OS):
ArgoCD layer (cluster — 30 Application.yamls):
Bootstrap flow
```
nix build .#installer-iso → dd to USB → boot target → partition + clone Zeta →
nixos-install --flake ...#control-plane → reboot
↓ K3S starts
↓ K3S auto-applies cilium-namespace.yaml + argocd-{namespace,install}.yaml + root-application.yaml
↓ ArgoCD starts
↓ ArgoCD reconciles every Application.yaml under k8s/applications/
↓ Cluster running every workload declared
```
Ambiguous components (need your confirmation)
These 4 components map to multiple possible upstreams. I shipped placeholder Application.yamls with `TODO(aaron)` markers — please confirm which upstream each refers to and I'll sharpen them:
Build the USB (your Mac)
```bash
# 1. Clone (one-time)
cd ~/Documents/src/repos/Zeta
git fetch origin
git checkout ai-cluster-bootstrap
# 2. (Apple Silicon only — one-time linux-builder setup)
nix run nix-darwin/nix-darwin-24.11#darwin-rebuild -- switch \
  --flake full-ai-cluster#zeta-mac
# 3. Build the installer ISO
cd full-ai-cluster
nix build .#installer-iso
ls -lh result/iso/zeta-installer-*.iso
# 4. Write to USB (macOS — replace diskN with YOUR USB device number from `diskutil list`)
diskutil unmountDisk /dev/diskN
sudo dd if=result/iso/zeta-installer-*.iso of=/dev/rdiskN bs=4m status=progress
diskutil eject /dev/diskN
```
Install on a target machine
```bash
# Boot the target on the USB. At the console:
# Network up:
nmtui
# Partition (example: single ext4 + EFI — replace /dev/sda with your target disk):
sgdisk --zap-all /dev/sda
sgdisk -n 1:0:+512M -t 1:ef00 -c 1:boot /dev/sda
sgdisk -n 2:0:0 -t 2:8300 -c 2:nixos /dev/sda
mkfs.fat -F 32 -n boot /dev/sda1
mkfs.ext4 -L nixos /dev/sda2
mount /dev/disk/by-label/nixos /mnt
mkdir -p /mnt/boot && mount /dev/disk/by-label/boot /mnt/boot
# Clone cluster flake:
git clone https://github.com/Lucent-Financial-Group/Zeta /mnt/etc/zeta
# Per-machine hardware config (must be copied into the host dir,
# e.g. hosts/control-plane/ or hosts/worker-gpu/):
nixos-generate-config --root /mnt
cp /mnt/etc/nixos/hardware-configuration.nix \
  /mnt/etc/zeta/full-ai-cluster/nixos/hosts//hardware-configuration.nix
# K3S cluster token (control-plane only on first install — save the token for workers):
nixos-enter --root /mnt -- bash -c '
mkdir -p /var/lib/rancher/k3s/server
openssl rand -hex 64 > /var/lib/rancher/k3s/server/token
chmod 600 /var/lib/rancher/k3s/server/token
'
cat /mnt/var/lib/rancher/k3s/server/token # ← save this; needed on every worker
# Install (the attribute after '#' selects the host: control-plane | worker-gpu | ...):
nixos-install --flake /mnt/etc/zeta/full-ai-cluster#
# Set zeta user password + reboot:
nixos-enter --root /mnt -- passwd zeta
reboot
```
For each worker, repeat — but instead of `openssl rand`, write the control-plane's token to `/var/lib/rancher/k3s/agent/token` (chmod 600).
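The worker-side token step can be sketched as a small helper. Assumptions are labeled: `write_agent_token` is a hypothetical name, the first argument is the install root (for example `/mnt` during installation), and the token value is whatever was saved from the control-plane:

```shell
#!/bin/sh
# Hypothetical helper: write the control-plane's token to the agent token
# path under the given root, readable by its owner only.
write_agent_token() {
  root=$1
  token=$2
  dir="$root/var/lib/rancher/k3s/agent"
  mkdir -p "$dir"
  # Create the file with a restrictive umask so it is never world-readable,
  # then set the mode explicitly in case the file already existed.
  ( umask 077; printf '%s\n' "$token" > "$dir/token" )
  chmod 600 "$dir/token"
}

# Example use on a worker during install (not executed here; the token
# file name is a placeholder):
#   write_agent_token /mnt "$(cat saved-control-plane-token.txt)"
```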
Verify after first reboot
```bash
ssh zeta@control-plane.zeta.local
sudo kubectl get nodes
sudo kubectl -n kube-system get pods # cilium pods
sudo kubectl -n argocd get pods
sudo kubectl -n argocd get applications # all 30 should appear, gradually Healthy
sudo cilium status
sudo cilium hubble enable --ui
```
File structure summary
```
usb-nixos-installer/ 3 files
└── README + flake + installer config
full-ai-cluster/ 62 files
├── usb-nixos-installer/ (identical copy, 3 files)
├── README + flake + 2 hosts (6 files) + 8 modules + 4 bootstrap + 30 apps (+ supporting manifests for Orleans, Hermes, vLLM, model configmaps)
```
Test plan
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>