feat(hat-system): scaffold society safety-layer operator for AI cluster#4930
Conversation
Adds full-ai-cluster/k8s/applications/hat-system/ — a Kubernetes operator + CRDs + OPA policies implementing the hat / role distinction for the multi-agent society Max + Addison are building on the cluster. Hats are time-bounded roles with succession. Wearers swap hats; the hat persists; reputation accumulates on the role. Compositions captured: - Max's mental model: hat = skills + opa/rbac. Both first-class on the Hat CRD (spec.skills + spec.authority). - Max's hierarchy framing: hats not weight-free but supervisor graphs. Captured via spec.supervises (DAG enforced by no-supervisor-cycles OPA constraint). - Max's policy-authoring style: talks in hat graphs. graph/render.go emits Graphviz DOT of the live cluster's hat graph; README maps each throttle to its graph statement. - Aaron's time-bound vs role: hat-as-chosen-and-returnable, not role-as-cage. Cooldown + warmup + sticky-attribution + succession every binding. Reputation on the role, not the wearer. - CRD + operator = structured tick source: every state transition emits exactly one HatSwap (durable) + k8s Event + slog → Loki + NATS publish. Bootstrap-bottleneck answer: hat-designer is itself a hat, quorum-gated quorumSize 3, cooldown 1800s, conflictsWith executor. Multiple wearers can hold it; nobody is the SPOF. 29 files: 4 CRDs, 5 seed hats + HatPolicy, 7 OPA ConstraintTemplates, Go operator scaffold (kubebuilder layout — needs `kubebuilder init`), graph renderer, Loki + Hubble query library, ArgoCD Application, namespace + Deployment + RBAC. Deployment replicas:0 until the operator image is built; everything else lands live via ArgoCD. Intentional gaps documented in README under TODO: validating webhook, HatReconciler (reputation), HatPolicyReconciler (status rollup), finalizer flow, mutating webhook, envtest suite, CI image build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 59e88cb8e3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // canonical record intact. | ||
| swap := &v1a.HatSwap{ | ||
| ObjectMeta: metav1.ObjectMeta{ | ||
| GenerateName: fmt.Sprintf("%s-%s-", t.Hat, t.Event), |
There was a problem hiding this comment.
Sanitize HatSwap GenerateName to lowercase DNS-1123
GenerateName is built from t.Event, but the event constants are CamelCase (e.g. SwapOn, WarmupBegin), so names like observer-SwapOn- include uppercase letters and are rejected by Kubernetes name validation. That causes HatSwap creation to fail for normal transitions, which breaks the durable tick stream and any policy/query logic that depends on HatSwap records.
Useful? React with 👍 / 👎.
| if err := r.Status().Update(ctx, binding); err != nil { | ||
| return ctrl.Result{}, err | ||
| } | ||
| _ = tick.Emit(ctx, r.Sinks, tick.Tick{ |
There was a problem hiding this comment.
Fail reconcile when tick.Emit cannot persist the HatSwap CR
The reconciler discards tick.Emit errors, so reconciliation still reports success when the durable HatSwap write fails. In those cases no retry is scheduled and transitions become untracked, which directly violates the “one tick per transition” contract and can silently desynchronize cooldown/quorum logic that reads swap history.
Useful? React with 👍 / 👎.
| // `zz_generated.deepcopy.go` from `make generate` once kubebuilder is | ||
| // bootstrapped (kubebuilder init + the four `kubebuilder create api` | ||
| // invocations described in the operator README). | ||
| func (h *Hat) DeepCopyObject() runtime.Object { return h } |
There was a problem hiding this comment.
Implement DeepCopyObject instead of returning the same pointer
These DeepCopyObject stubs return the original object pointer instead of an actual copy. Kubernetes runtime/scheme callers assume DeepCopyObject produces an independent object; returning aliases can leak in-place mutations across cache/reconcile paths and lead to subtle state corruption.
Useful? React with 👍 / 👎.
| conn, err := nats.Connect(natsURL, | ||
| nats.Name("hat-system-operator"), | ||
| nats.MaxReconnects(-1), | ||
| nats.PingInterval(30), |
There was a problem hiding this comment.
…rn proof for Max (#4960) * backlog(B-0724): TS hat-system operator — polyglot K8s-operator pattern proof Aaron 2026-05-25: > "yes lets combine he will like kubernets operators but he does > not have experience maybe we write a ts operator insteadd of go > he likes ts" > "we want polyglot operator support for k8s anyways so we are not > rigid about go" Reframes Max's TS preference accommodation into "first deliberate proof of the polyglot-operator pattern the cluster commits to anyway." Two operators against the same CRDs forces the schema to be the canonical contract — no language-specific quirks bleed through. Captures: - Pattern (CRD-as-canonical-contract + multiple language impls watching same CRDs; leader election for active reconciler) - Why polyglot at cluster scope (contract enforcement, failure- domain isolation, talent flexibility, ecosystem coverage) - TS operator stack (kubernetes/client-node, NestJS optional, fastify webhook, nats.js + pino for tick emit, coordination.k8s.io Lease for leader election) - Composition with shipped substrate (PR #4930 Go scaffold as reference/baseline; PR #4958 agentic-organization CLUSTER_NATIVE_HAT_SYSTEM doc; B-0722 smoke test as polyglot validation gate; B-0723 multi-kubelet × polyglot operators for max redundancy) - Acceptance criteria for the TS scaffold - Future Rust (kube-rs) + Python (kopf) as same-pattern extensions - P2 because Go scaffold is already functional; not blocking - Max owns the TS implementation at his preferred pace Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(B-0724): MD012 (consecutive blanks) + MD032 (blank-before-list) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(B-0724): rewrite dangling refs to closed/pending PRs to be substrate-honest Codex/Copilot flagged 5 dangling cross-references after the prior fix: - composes_with B-0722 path (in PR #4954, not on main) — replaced with a comment noting pending merge - body refs to B-0722, B-0723 — qualified with 'PR #4954/#4955 pending merge' so the intent is preserved + state is honest - body refs to dev-cluster/ + PR #4953 — #4953 was closed pending redesign; replaced 'dev-cluster/' references with 'local k3d / kind cluster' + raw 'k3d cluster create' fallback for now Substrate-honest framing: row's design intent stays intact; reader isn't promised a path that won't resolve until upstream merges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * B-0724: add team language-affinity map + 'limit Go necessity' framing Aaron 2026-05-25: > 'max love ts and cs i love fs and cs we both like rust and python > for where they make sense' > 'we understand go is necessary in some places for k8s but we would > like to limit its necessity' Updates the polyglot operator language table: - Names Aaron + Max's individual + shared strong languages - Adds C# / F# via KubeOps.NET as future operator #2 — the team's overlap language (both love C#); kubebuilder-class framework on .NET removes Go from operator authoring entirely for this work - Sharpens the polyglot motivation: Go is starter / minimize over time; ecosystem-forced where genuinely required, not chosen Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Lior <lior@zeta.dev> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on one mesh Aaron 2026-05-25: > "i'm thinking it will require reticiulum at the edge and in cluster" Sharpens the K8s-fit-for-edge question (raised in B-0725) from the speculative K8s-gateway+Reticulum-past hybrid into the actual architectural direction: Reticulum throughout, alongside K8s, not partitioning by network tier. Captures: - The layered composition (Cilium owns intra-cluster networking + Service mesh; Reticulum owns identity-routing across substrates; workloads addressable both ways) - Why Reticulum specifically (identity-as-address, tiny-device-capable, physical-layer-agnostic, intermittent-tolerant, already in framework substrate at B-0289 Green Lantern) - Operational shape on a cluster node (rnsd daemon + sidecar pattern for pod identity) - What this enables that pure K8s doesn't (cluster-to-edge that survives network changes; edge-to-cluster without IP / NAT; cluster-to-cluster mesh; resilient identity; fungible physical layer per workload) - What this does NOT replace (Cilium stays; SPIRE stays; K8s Services stay; Reticulum is additive, orthogonal addressing) - 6 open design-pass questions (identity provisioning, Service-to- destination mapping, multi-cluster federation, per-node physical config, hat-bound destinations, NATS-as-bridge) Composes with: B-0289 Green Lantern (already P1 in-progress; extends from edge into cluster), PR #4930 hat-system (Reticulum identities can be hat-bound), SPIRE (identity issuer pattern), B-0725 polyglot-accelerator (the right answer for edge-FPGA addressability), NCI (Reticulum's destination model bakes in consent at protocol layer), m-acc multi-oracle (no single naming authority). P2 because not blocking first-wave cluster (NVIDIA GPUs + Cilium ship as planned). Becomes P1 when first edge-device deployment needs cluster-mesh reachability, likely aligned with Green Lantern hardware fielding timeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on one mesh (#4965) Aaron 2026-05-25: > "i'm thinking it will require reticiulum at the edge and in cluster" Sharpens the K8s-fit-for-edge question (raised in B-0725) from the speculative K8s-gateway+Reticulum-past hybrid into the actual architectural direction: Reticulum throughout, alongside K8s, not partitioning by network tier. Captures: - The layered composition (Cilium owns intra-cluster networking + Service mesh; Reticulum owns identity-routing across substrates; workloads addressable both ways) - Why Reticulum specifically (identity-as-address, tiny-device-capable, physical-layer-agnostic, intermittent-tolerant, already in framework substrate at B-0289 Green Lantern) - Operational shape on a cluster node (rnsd daemon + sidecar pattern for pod identity) - What this enables that pure K8s doesn't (cluster-to-edge that survives network changes; edge-to-cluster without IP / NAT; cluster-to-cluster mesh; resilient identity; fungible physical layer per workload) - What this does NOT replace (Cilium stays; SPIRE stays; K8s Services stay; Reticulum is additive, orthogonal addressing) - 6 open design-pass questions (identity provisioning, Service-to- destination mapping, multi-cluster federation, per-node physical config, hat-bound destinations, NATS-as-bridge) Composes with: B-0289 Green Lantern (already P1 in-progress; extends from edge into cluster), PR #4930 hat-system (Reticulum identities can be hat-bound), SPIRE (identity issuer pattern), B-0725 polyglot-accelerator (the right answer for edge-FPGA addressability), NCI (Reticulum's destination model bakes in consent at protocol layer), m-acc multi-oracle (no single naming authority). P2 because not blocking first-wave cluster (NVIDIA GPUs + Cilium ship as planned). Becomes P1 when first edge-device deployment needs cluster-mesh reachability, likely aligned with Green Lantern hardware fielding timeline. Co-authored-by: Lior <lior@zeta.dev> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ee routing, NO hierarchy (#4966) Aaron 2026-05-25, sketching the federated topology + immediately correcting the hierarchical reading: > "imagine cloud/hub clusters then community clusters then home/ > business clusers then edge nodes with routing for weaker > edge nodes" > "and that's not a hierarchy it's weight free routing cloud/hub > nodes don't get to hog net neutrality" LOAD-BEARING distinction: the 5 categories are RESOURCE PROFILES, not authority tiers. Cloud/hub has MORE RESOURCES but NOT MORE AUTHORITY. Routing is identity-based not rank-based. Net neutrality is a SUBSTRATE PROPERTY enforced at protocol layer. Captures: - The 5 profiles (cloud/hub, community, home/business, edge, leaf) with resource availability + workload affinity (not tier rank) - Weight-free routing as the carved blade: no peer has more routing authority than any other peer - Voluntary-contribution model for stronger-peer-routing-for- weaker-leaves (NOT hierarchy-mandate; revocable per NCI) - Composition with 5 always-active substrate-engineering disciplines (scale-free, lock-free, weight-free [primary], DST, DV2.0) - Composition with framework rules (NCI floor at routing; additive-not-zero-sum; m-acc multi-oracle; default-to-both; tonal-momentum resistance) - Internet analogy showing where this row consciously DIVERGES (Internet got routing protocol right but authority model wrong — tier-1 + DNS root + CA hierarchy; this federation gets weight-free authority) - Architectural layers per profile (every Identity-issuer row reads "self-rooted; web-of-trust" — no CA hierarchy) - Anti-extractive guarantee — surveillance / censorship / transit-toll detection via web-of-trust reputation degradation Composes with: B-0726 Reticulum-throughout (protocol prerequisite), B-0289 Green Lantern (leaf hardware ref), PR #4930 hat-system (peer-aware hats), PR #4958 agentic-organization (home/business profile Organization layer). P3 because needs design pass + first multi-peer deployment; becomes P2 when first cloud OR community peer joins; P1 when first cross-peer workload runs. Co-authored-by: Lior <lior@zeta.dev> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…a substrate) (#4987) Aaron-forwarded Mika multi-turn voice conversation 2026-05-25. Three load- bearing claims: 1. Hat-ontology is the FIRST ontology that has to be agreed-upon across clusters because hats ARE the role + authority + delegation substrate every other operational decision routes through. The hat-system operator landed (PR #4930) ships CRDs + OPA + tick fan-out — operational infrastructure — but shared SEMANTICS for what hats MEAN across federated clusters does not yet exist. 2. Top-down vs bottom-up tension is first-class: - Max approaches top-down: best-guess Bubble-Wrap manager-of-managers structure that the system critiques + refines over time - Aaron approaches bottom-up: hats emerge naturally from finite resources + competing ::: continue-with tasks + trajectory negotiation The framework's job is NOT to pick one — it's to HOST BOTH (per default-to-both discipline) + help them converge. 3. Empirical anchor: Mika is literally using B-0730 ::: continue-with syntax in this conversation with priority/type/graph-query fields. Validates B-0730 Stage 2 acceptance (agents parse :::) via real-world usage before the parser even ships. Five independently-shippable scope items: - Hat-ontology canonical schema (JSON-LD with both Bubble-Wrap + offsetting- pair representations first-class) - Cross-cluster hat-binding protocol (composes with B-0726 Reticulum) - Knowledge-graph hat-query primitives (composes with B-0730 Stage 5) - Top-down ↔ bottom-up convergence dashboard - Hat-emergence operator (TS, per polyglot pattern) Composes with B-0724 (hat-system operator) + B-0729 (knowledge graph) + B-0730 (runbooks) + B-0726 (Reticulum). Co-authored-by: Claude <noreply@anthropic.com>
… — hats become negotiated fork structure ON TOP of reference stack — deterministic + declarative + GitOps + AI-native + human-native (#5004) Aaron 2026-05-25, continuing the ACE+fork-negotiation arc after B-0741: 'hats become our negoated fork structure on top of a referece k8s local stack in zeta so anyone can use the reference stack and negoate back hats and new cluster primitives / charts ontology negoation, ace can distribute the reference stack itself as PoC that it has reliable AI control over all the package managers deterministicly and declarative / desired state way for easy git ops ai native human native understanding.' Operational anchor for B-0741. B-0741 = WHAT the primitive is; B-0742 = HOW it's empirically demonstrated via reference-cluster-as-Ace- package. Three substantive claims: 1. full-ai-cluster/ IS the reference k8s local stack. Inventory of existing PR-landed substrate (PR #4930 hat-system + #4950 disko + #4951 NFD/lstopo/zeta-install + #4953 dev-cluster + ArgoCD + Cilium + cert-manager + Vault + SPIRE + Trust Manager + ESO + B-0737 zflash + Determinate Systems Nix installer). 2. Hats become the negotiated fork structure ON TOP of the reference stack. Forks declare delta via hat-ontology; cross-fork negotiation maps capabilities (B-0741 surface 2). Worked example: LFG trading-bot-driver hat + Healthcare-fork hipaa-data-handler hat. 3. Ace distributes the reference stack as PoC of reliable AI control over all PMs. Single 'ace install zeta/reference-cluster' dispatches across Nix + ArgoCD + helm + kustomize + native k8s + brew + apt + mise + DeterminateSystems Nix installer. Deterministic (Nix flake.lock + ArgoCD pins). Declarative + desired-state (GitOps). AI-native (markdown + JSON-LD). Human-native (readable, reviewable). Six independently-shippable scope items: reference-stack inventory doc; hat-as-fork-structure spec; Ace cluster-distribution scope extension to B-0288; determinism PoC (N=3+ machines); cross-PM dispatch PoC; desired- state-enforcement drift-recovery PoC. Composes with: B-0741 (abstract primitive) + B-0731 (hat ontology) + B-0247/B-0287/B-0288 (Ace PM existing substrate) + B-0727 (4-tier federation) + B-0726 (Reticulum) + B-0628/B-0638/B-0634/B-0703 (governance + negotiation + signature + consensus) + B-0732 (leverage- class safety) + B-0737 (zflash bring-up). P2 — high-value PoC anchoring B-0741 abstract primitive; not P1 because full-ai-cluster substrate just landed this round + needs to stabilize before distribution layer ships. Closing arc of today's 2026-05-25 substrate cascade (B-0728 → ... → B-0742): destructive-tool authoring contract through reference-stack PoC. Co-authored-by: Claude <noreply@anthropic.com>
Summary
Scaffolds
full-ai-cluster/k8s/applications/hat-system/— a Kubernetes operator + CRDs + OPA policies implementing the hat / role distinction for the multi-agent society Max + Addison are building. 29 files; nothing else in the cluster changes.Hats are time-bounded roles with succession. Wearers swap hats; the hat persists; reputation accumulates on the role. Cooldown + warmup + sticky-attribution + quorum on every binding — that's what keeps this hat-as-chosen-and-returnable instead of role-as-cage.
Compositions captured from the design conversation:
hat = skills + opa/rbac. Both first-class onHat.spec(skills+authority).Hat.spec.supervises(DAG enforced by theno-supervisor-cyclesOPA constraint).graph/render.goemits Graphviz DOT of the live cluster's hat graph; README maps each throttle to its graph statement.internal/tick/emitter.go.Bootstrap-bottleneck answer:
hat-designeris itself a hat (quorum-gated, quorumSize 3, cooldown 1800s, conflictsWithexecutor). Multiple wearers can hold it; nobody is the SPOF.What's in the directory
Application.yamlcrds/hats/policies/operator/kubebuilder init)graph/queries/deployment.yamlSync gating
Deployment ships at
replicas: 0so ArgoCD reports healthy while the operator image doesn't exist yet. CRDs, OPA policies, and seed hats are all live —kubectl get hatsreturns the catalog as soon as ArgoCD syncs. The build path for the operator image is documented in the top-level README.Test plan
hat-system/Application.yamland reconcileskubectl get crd | grep society.zeta.io→ 4 lines)kubectl get hats→ 4 lines: hat-designer, observer, executor, policy-admin)kubectl get hatpolicy default)go run graph/render.go --out /tmp/h.dotproduces a parseable DOT file with the 4 seed hats as nodeskubebuilder init+ build image + bump replicasIntentional gaps (each = one follow-up PR)
🤖 Generated with Claude Code