docs(research): cluster bare-metal substrate architecture decision (NixOS + bare-metal k8s + Argo CD; no hypervisor for primary stack) #4808
Merged
AceHack merged 1 commit on May 24, 2026
…cision (NixOS + bare-metal k8s + Argo CD; no hypervisor for primary stack)

Architecture decisions for basement cluster build:
- 20 GPUs + 20 phones via Cellhasher + Pi cluster + AI hats

Aaron's substrate-engineering authority calls captured (verbatim quotes preserved):

PRIMARY STACK DECIDED:
- NixOS 24.11+ flake-based (declarative OS; DST-aligned)
- Argo CD for GitOps (over Flux despite Flux being lighter; explicit operator preference)
- Bare-metal Kubernetes (no hypervisor for primary stack)
- containerd / Cilium CNI / Longhorn CSI + ZFS / NVIDIA k8s device plugin / systemd-boot / nixos-anywhere provisioning

DEFERRED (backlog):
- Talos Linux as alternative for k8s control-plane subset
- KubeVirt as k8s extension for VM workloads if needed
- Proxmox for separate experimental tier (outside framework DST)
- k3s vs kubeadm decision

REJECTED (with reasoning):
- Guix System: FSF free-software-fundamentalism may block NVIDIA
- Ubuntu/Debian/Fedora: mutable; not DST-aligned
- Fedora CoreOS / Silverblue / Flatcar / Bottlerocket: less expressive than Nix; container-host shape only
- Proxmox primary: imperative web-UI breaks DST + 3 layers vs 1
- ESXi / XCP-ng / Harvester: see body
- Flux: operator preference for Argo

HETEROGENEOUS COMPUTE ARCHITECTURE:
- GPU compute nodes (NVIDIA + k8s workers)
- Phone-orchestrator node (Cellhasher management; NOT k8s worker; phones are workload-substrate, not k8s control plane)
- Pi-cluster + AI hats (may or may not run k8s depending on AI workload)

Maps each architecture choice to specific framework substrate-engineering disciplines: DST, substrate-or-it-didn't-happen, glass-halo, NCI floor, additive-not-zero-sum, m/acc-multi-oracle, bandwidth-served falsifier.

8 open architecture questions captured for future decision.

Authored via git plumbing fallback.
AceHack added a commit that referenced this pull request on May 24, 2026
… Manager + k3d + Headscale + lend-resources pattern) (#4809)

* docs(research): bundle-file dev-PC substrate architecture (Nix + Home Manager + k3d + Headscale + lend-resources pattern)

Sibling to PR #4808 (cluster substrate). Per Aaron 2026-05-24 'yes bundle-file it (shadow*)' confirmation.

PRIMARY STACK DECIDED (lightweight-first per Aaron-stated principle 'Lets do whatever is lightweigh now and ease into more heavy weight stuff'):

LAYER 1 - Reproducible dev-PC substrate (Nix):
- macOS: Determinate Systems Nix installer + nix-darwin + Home Manager
- Linux: Nix package manager + Home Manager (on existing distro)
- Windows: WSL2 + Nix in WSL2 + Home Manager
- One flake repo covers cluster + dev PCs + every user's home directory

LAYER 2 - Local k8s for testing:
- k3d (lighter than kind) on each dev PC for manifest testing + GitOps practice WITHOUT touching production cluster

LAYER 3 - Background service (lend-resources pattern):
- Aaron framing: 'Dev boxes can be like lending resources to cluster'
- Lightweight Bun/Node daemon polling NATS queue for opt-in work
- NOT first-class k8s nodes (avoid trust-boundary issues)
- Heavier alternative (k3s agent, Liqo federation) deferred

LAYER 4 - Network substrate (Headscale + Tailscale):
- Aaron framing: 'Tailscale is good but we also want headscale'
- Tailscale clients on each device
- Self-hosted Headscale control plane (sovereignty over user/device/ACL state; no commercial dependency; free at any node count)
- DERP relay optional for NAT-traversal fallback

DEFERRED (heavyweight ease-into-later):
- Liqo federation
- KubeFed v2
- k3s agent per dev PC
- Custom DERP relays
- Native Nix on Windows (when ships)
- Full NixOS desktop on dev Linux box

5 open questions captured: Headscale deployment location, background-service queue tech, authentication boundary, lending workload-class restrictions, Addison's preferences (pending direct articulation per observation-not-fact consent discipline).

Maps each choice to framework discipline (DST, glass-halo, NCI floor, m/acc-multi-oracle, bandwidth-served, additive, Aaron lightweight-first principle, Addison observation-not-fact discipline).

Composes with cluster substrate archive + Addison consent archive + 9 framework rules.

Authored via git plumbing fallback.

* fix(PR #4809): correct impossible decision timestamp + consent-file date-prefix

Two factual corrections caught by Codex P2 + Copilot:

1. Line 3: "Date decided: 2026-05-24 (~03:30Z)" was ~1.5h in the future relative to commit time (02:03Z). Corrected to ~02:03Z matching `gh pr view 4809 --json commits` last committed date.
2. Line 4: consent-file reference `addison-consent-pattern-observation-not-fact-discipline-aaron-otto.md` missing date prefix; actual file on disk is `2026-05-24-addison-consent-pattern-observation-not-fact-discipline-aaron-otto.md`. Added date prefix; reference now resolves.

Mechanical fixes only.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
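
The timestamp correction in the fix commit above comes down to simple arithmetic, and a check along these lines could catch the same class of error before review. This is only a sketch: the helper name `future_skew` is hypothetical, and the minute-level values are taken from the approximate times quoted in the commit message (~03:30Z, 02:03Z), not from the repository itself.

```python
from datetime import datetime, timedelta, timezone


def future_skew(decided: datetime, committed: datetime) -> timedelta:
    """Return how far `decided` lies after `committed` (zero if it does not)."""
    return max(decided - committed, timedelta(0))


# Approximate values quoted in the fix commit message (assumed to the minute).
committed = datetime(2026, 5, 24, 2, 3, tzinfo=timezone.utc)
original = datetime(2026, 5, 24, 3, 30, tzinfo=timezone.utc)
corrected = datetime(2026, 5, 24, 2, 3, tzinfo=timezone.utc)

print(future_skew(original, committed))   # 1:27:00, i.e. roughly 1.5h in the future
print(future_skew(corrected, committed))  # 0:00:00
```

A non-zero result means the recorded decision time is later than the commit that records it, which is the impossibility the reviewers flagged.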
Summary
Architecture decision record for the bare-metal substrate layer below Kubernetes in the framework's basement cluster build (20 GPUs + 20 phones via Cellhasher + Pi cluster + AI hats).
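Primary stack DECIDED
NixOS 24.11+ (flake-based) / Argo CD for GitOps / bare-metal Kubernetes (no hypervisor) / containerd / Cilium CNI / Longhorn CSI + ZFS / NVIDIA k8s device plugin / systemd-boot / nixos-anywhere provisioning.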
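Deferred (backlog)
Talos Linux for a k8s control-plane subset / KubeVirt for VM workloads if needed / Proxmox for a separate experimental tier / k3s vs kubeadm decision.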
Rejected with reasoning
Guix System / Ubuntu/Debian/Fedora / Fedora CoreOS / Flatcar / Bottlerocket / Proxmox primary / ESXi / XCP-ng / Harvester / Flux — each with explicit reasoning.
Heterogeneous compute architecture
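Three node classes via NixOS per-node-class modules from one flake:
- GPU compute nodes (NVIDIA + k8s workers)
- Phone-orchestrator node (Cellhasher management; not a k8s worker)
- Pi cluster + AI hats (may or may not run k8s depending on AI workload)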
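
As a rough illustration of the three node classes named in the commit message, the sketch below models them as plain data and separates the class that joins Kubernetes as a worker from the one whose membership is still undecided. It is not the flake itself: the class names (`gpu-compute`, `phone-orchestrator`, `pi-cluster`) and the `joins_k8s` field are hypothetical labels chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeClass:
    name: str                   # hypothetical identifier, not taken from the repo
    role: str
    joins_k8s: Optional[bool]   # None means the source leaves this undecided


NODE_CLASSES = (
    NodeClass("gpu-compute", "NVIDIA GPU compute, k8s worker", True),
    NodeClass("phone-orchestrator", "Cellhasher management host", False),
    NodeClass("pi-cluster", "Pi cluster with AI hats", None),
)


def k8s_workers(classes):
    """Names of node classes that are stated to run as k8s workers."""
    return [c.name for c in classes if c.joins_k8s is True]


def undecided(classes):
    """Names of node classes whose k8s membership is still open."""
    return [c.name for c in classes if c.joins_k8s is None]


print(k8s_workers(NODE_CLASSES))  # ['gpu-compute']
print(undecided(NODE_CLASSES))    # ['pi-cluster']
```

Keeping the undecided case as an explicit `None` rather than defaulting it to `False` mirrors the source, which leaves the Pi cluster's Kubernetes membership dependent on the AI workload.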
Framework alignment
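Maps each architecture choice to specific framework disciplines: DST, substrate-or-it-didn't-happen, glass-halo, NCI floor, additive-not-zero-sum, m/acc-multi-oracle, bandwidth-served falsifier.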
8 open architecture questions captured
k3s vs kubeadm / Pi hardware specs / GPU class / storage backplane / network fabric / PXE infra / secret management / observability stack.
Test plan