feat(hat-system): scaffold society safety-layer operator for AI cluster#4930

Merged
AceHack merged 1 commit into main from feat/hat-system-operator-2026-05-25-c2 on May 25, 2026

@AceHack AceHack commented May 25, 2026

Summary

Scaffolds full-ai-cluster/k8s/applications/hat-system/ — a Kubernetes operator + CRDs + OPA policies implementing the hat / role distinction for the multi-agent society Max + Addison are building. 29 files; nothing else in the cluster changes.

Hats are time-bounded roles with succession. Wearers swap hats; the hat persists; reputation accumulates on the role. Cooldown + warmup + sticky-attribution + quorum on every binding — that's what keeps this hat-as-chosen-and-returnable instead of role-as-cage.

Compositions captured from the design conversation:

  • Max's mental model: hat = skills + opa/rbac. Both first-class on Hat.spec (skills + authority).
  • Max's hierarchy framing: "hats not weight-free but supervisor graphs." Captured via Hat.spec.supervises (DAG enforced by the no-supervisor-cycles OPA constraint).
  • Max's policy-authoring style: "talks in hat graphs." graph/render.go emits Graphviz DOT of the live cluster's hat graph; README maps each throttle to its graph statement.
  • Aaron's reframe from cage → hat: time-boundedness is the difference. README spells out the cage / hat property table.
  • CRD + operator = structured tick source. Every state transition emits exactly one HatSwap (durable) + k8s Event + slog → Loki + NATS publish via internal/tick/emitter.go.
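The DAG requirement on Hat.spec.supervises can be illustrated outside the cluster. The PR enforces it with the no-supervisor-cycles OPA constraint; the Go sketch below is only an equivalent check for reasoning about the rule, and the map shape and the example edges are assumptions rather than the seed data.

```go
package main

import "fmt"

// hasCycle reports whether the supervises graph contains a cycle, using a
// depth-first search with three states (unvisited, in progress, done).
// The map shape (hat name -> supervised hat names) is an assumption.
func hasCycle(supervises map[string][]string) bool {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := make(map[string]int)
	var visit func(hat string) bool
	visit = func(hat string) bool {
		switch state[hat] {
		case inProgress:
			return true // back edge: the hat is reachable from itself
		case done:
			return false
		}
		state[hat] = inProgress
		for _, child := range supervises[hat] {
			if visit(child) {
				return true
			}
		}
		state[hat] = done
		return false
	}
	for hat := range supervises {
		if visit(hat) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical edges, not taken from the seed hats.
	ok := map[string][]string{"policy-admin": {"hat-designer"}, "hat-designer": {"executor"}}
	bad := map[string][]string{"policy-admin": {"hat-designer"}, "hat-designer": {"policy-admin"}}
	fmt.Println(hasCycle(ok), hasCycle(bad))
}
```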

Bootstrap-bottleneck answer: hat-designer is itself a hat (quorum-gated, quorumSize 3, cooldown 1800s, conflictsWith executor). Multiple wearers can hold it; nobody is the SPOF.

What's in the directory

| Path | What |
| --- | --- |
| Application.yaml | ArgoCD Application; reconciles everything below |
| crds/ | Hat, HatBinding, HatSwap, HatPolicy (4 CRDs) |
| hats/ | Seed: hat-designer, observer, executor, policy-admin + default HatPolicy |
| policies/ | 7 OPA Gatekeeper ConstraintTemplates (cooldown, max-bindings, COI, quorum, warmup, max-new-hats, no-supervisor-cycles) |
| operator/ | Go operator scaffold (kubebuilder layout; needs kubebuilder init) |
| graph/ | Hat-graph DOT renderer + docs |
| queries/ | Loki + Hubble query library for hat ↔ network-flow attribution |
| deployment.yaml | Operator Deployment (replicas:0 until image built) + RBAC |

Sync gating

Deployment ships at replicas: 0 so ArgoCD reports healthy while the operator image doesn't exist yet. CRDs, OPA policies, and seed hats are all live — kubectl get hats returns the catalog as soon as ArgoCD syncs. The build path for the operator image is documented in the top-level README.

Test plan

  • ArgoCD picks up hat-system/Application.yaml and reconciles
  • All 4 CRDs install cleanly (kubectl get crd | grep society.zeta.io → 4 lines)
  • All 7 OPA ConstraintTemplates install and Constraints become enforcing (Gatekeeper must be running first — it's already in the bootstrap manifests)
  • Seed hats land and are listable (kubectl get hats → 4 lines: hat-designer, observer, executor, policy-admin)
  • HatPolicy default singleton lands (kubectl get hatpolicy default)
  • Deployment lands at 0/0 ready (expected until image built)
  • go run graph/render.go --out /tmp/h.dot produces a parseable DOT file with the 4 seed hats as nodes
  • Future PR: complete kubebuilder init + build image + bump replicas

Intentional gaps (each = one follow-up PR)

  • Validating webhook for HatBinding admission
  • HatReconciler (reputation accumulation on swap-off)
  • HatPolicyReconciler (status rollup)
  • Finalizer flow for guaranteed SwapOff emission
  • Mutating webhook for RequestedAt defaulting
  • envtest harness
  • CI image build + push

🤖 Generated with Claude Code

Adds full-ai-cluster/k8s/applications/hat-system/ — a Kubernetes
operator + CRDs + OPA policies implementing the hat / role distinction
for the multi-agent society Max + Addison are building on the cluster.

Hats are time-bounded roles with succession. Wearers swap hats; the hat
persists; reputation accumulates on the role. Compositions captured:

- Max's mental model: hat = skills + opa/rbac. Both first-class on the
  Hat CRD (spec.skills + spec.authority).
- Max's hierarchy framing: hats not weight-free but supervisor graphs.
  Captured via spec.supervises (DAG enforced by no-supervisor-cycles
  OPA constraint).
- Max's policy-authoring style: talks in hat graphs. graph/render.go
  emits Graphviz DOT of the live cluster's hat graph; README maps each
  throttle to its graph statement.
- Aaron's time-bound vs role: hat-as-chosen-and-returnable, not
  role-as-cage. Cooldown + warmup + sticky-attribution + succession
  on every binding. Reputation on the role, not the wearer.
- CRD + operator = structured tick source: every state transition emits
  exactly one HatSwap (durable) + k8s Event + slog → Loki + NATS publish.

Bootstrap-bottleneck answer: hat-designer is itself a hat (quorum-gated,
quorumSize 3, cooldown 1800s, conflictsWith executor). Multiple wearers
can hold it; nobody is the SPOF.

29 files: 4 CRDs, 4 seed hats + default HatPolicy, 7 OPA ConstraintTemplates,
Go operator scaffold (kubebuilder layout — needs `kubebuilder init`),
graph renderer, Loki + Hubble query library, ArgoCD Application,
namespace + Deployment + RBAC. Deployment replicas:0 until the
operator image is built; everything else lands live via ArgoCD.

Intentional gaps documented in README under TODO: validating webhook,
HatReconciler (reputation), HatPolicyReconciler (status rollup),
finalizer flow, mutating webhook, envtest suite, CI image build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 25, 2026 13:31
@AceHack AceHack enabled auto-merge (squash) May 25, 2026 13:31
@AceHack AceHack merged commit ad1af79 into main May 25, 2026
26 of 27 checks passed
@AceHack AceHack deleted the feat/hat-system-operator-2026-05-25-c2 branch May 25, 2026 13:34

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 59e88cb8e3


// canonical record intact.
swap := &v1a.HatSwap{
ObjectMeta: metav1.ObjectMeta{
GenerateName: fmt.Sprintf("%s-%s-", t.Hat, t.Event),

P1: Sanitize HatSwap GenerateName to lowercase DNS-1123

GenerateName is built from t.Event, but the event constants are CamelCase (e.g. SwapOn, WarmupBegin), so names like observer-SwapOn- include uppercase letters and are rejected by Kubernetes name validation. That causes HatSwap creation to fail for normal transitions, which breaks the durable tick stream and any policy/query logic that depends on HatSwap records.
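One possible shape for the fix is sketched below, assuming a helper (here called generateNamePrefix, a hypothetical name) is applied before the value reaches ObjectMeta.GenerateName. It restricts the prefix to lowercase letters, digits, and hyphens, which is stricter than Kubernetes requires but always valid, and it caps the length to leave room for the suffix the API server appends.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// invalidChars matches anything outside the lowercase DNS-1123 label alphabet.
var invalidChars = regexp.MustCompile(`[^a-z0-9-]+`)

// generateNamePrefix is a hypothetical helper: it lowercases the hat and
// event, replaces disallowed characters with "-", and trims leading
// separators so the prefix starts with an alphanumeric character.
func generateNamePrefix(hat, event string) string {
	raw := strings.ToLower(hat + "-" + event)
	clean := invalidChars.ReplaceAllString(raw, "-")
	clean = strings.TrimLeft(clean, "-")
	// Keep room for the random suffix the API server appends.
	const maxPrefix = 50
	if len(clean) > maxPrefix {
		clean = clean[:maxPrefix]
	}
	return strings.TrimRight(clean, "-") + "-"
}

func main() {
	fmt.Println(generateNamePrefix("observer", "SwapOn"))
	fmt.Println(generateNamePrefix("hat-designer", "WarmupBegin"))
}
```

The CamelCase event value can still be stored unchanged in the HatSwap spec; only the object name needs the lowercase form.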

if err := r.Status().Update(ctx, binding); err != nil {
return ctrl.Result{}, err
}
_ = tick.Emit(ctx, r.Sinks, tick.Tick{

P1: Fail reconcile when tick.Emit cannot persist the HatSwap CR

The reconciler discards tick.Emit errors, so reconciliation still reports success when the durable HatSwap write fails. In those cases no retry is scheduled and transitions become untracked, which directly violates the “one tick per transition” contract and can silently desynchronize cooldown/quorum logic that reads swap history.

// `zz_generated.deepcopy.go` from `make generate` once kubebuilder is
// bootstrapped (kubebuilder init + the four `kubebuilder create api`
// invocations described in the operator README).
func (h *Hat) DeepCopyObject() runtime.Object { return h }

P2: Implement DeepCopyObject instead of returning the same pointer

These DeepCopyObject stubs return the original object pointer instead of an actual copy. Kubernetes runtime/scheme callers assume DeepCopyObject produces an independent object; returning aliases can leak in-place mutations across cache/reconcile paths and lead to subtle state corruption.
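Until `make generate` produces zz_generated.deepcopy.go, a hand-written copy is one way to close the gap. The sketch below uses simplified stand-in types: the field names follow the PR summary (skills, supervises) but their shapes are assumed. It shows a DeepCopy that clones slices so the copy no longer aliases the original. A real DeepCopyObject would return the result of such a DeepCopy as a runtime.Object.

```go
package main

import "fmt"

// HatSpec and Hat are simplified stand-ins for the PR's types; the field
// names follow the summary but the shapes are assumed.
type HatSpec struct {
	Skills     []string
	Supervises []string
}

type Hat struct {
	Name string
	Spec HatSpec
}

// DeepCopy returns an independent copy: slices are cloned so that mutating
// the copy cannot reach back into the original.
func (h *Hat) DeepCopy() *Hat {
	if h == nil {
		return nil
	}
	out := *h // copies scalar fields; slices still alias at this point
	out.Spec.Skills = append([]string(nil), h.Spec.Skills...)
	out.Spec.Supervises = append([]string(nil), h.Spec.Supervises...)
	return &out
}

func main() {
	orig := &Hat{Name: "observer", Spec: HatSpec{Skills: []string{"read"}}}
	cp := orig.DeepCopy()
	cp.Spec.Skills[0] = "write"
	fmt.Println(orig.Spec.Skills[0], cp.Spec.Skills[0]) // the original keeps its value
}
```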

conn, err := nats.Connect(natsURL,
nats.Name("hat-system-operator"),
nats.MaxReconnects(-1),
nats.PingInterval(30),

P2: Pass a duration unit to nats.PingInterval

nats.PingInterval expects a time.Duration, but passing the bare literal 30 configures a 30ns interval rather than 30 seconds. This can cause excessive ping traffic/reconnect churn and unnecessary load whenever NATS is enabled.
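The unit issue can be reproduced with the standard library alone. In this small sketch the constant names are illustrative. An untyped constant converts silently to time.Duration, whose base unit is the nanosecond, so the compiler accepts the bare literal without warning.

```go
package main

import (
	"fmt"
	"time"
)

// bare mirrors what the bare literal 30 becomes when a time.Duration is
// expected: 30 nanoseconds.
const bare time.Duration = 30

// intended carries the unit explicitly.
const intended = 30 * time.Second

func main() {
	fmt.Println(bare)     // 30ns
	fmt.Println(intended) // 30s
}
```

In the operator, the corresponding change would be passing `30*time.Second` to `nats.PingInterval`.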

@AceHack AceHack review requested due to automatic review settings May 25, 2026 13:54
AceHack added a commit that referenced this pull request May 25, 2026
…rn proof for Max (#4960)

* backlog(B-0724): TS hat-system operator — polyglot K8s-operator pattern proof

Aaron 2026-05-25:
  > "yes lets combine he will like kubernets operators but he does
  > not have experience maybe we write a ts operator insteadd of go
  > he likes ts"
  > "we want polyglot operator support for k8s anyways so we are not
  > rigid about go"

Reframes Max's TS preference accommodation into "first deliberate
proof of the polyglot-operator pattern the cluster commits to
anyway." Two operators against the same CRDs forces the schema
to be the canonical contract — no language-specific quirks bleed
through.

Captures:
- Pattern (CRD-as-canonical-contract + multiple language impls
  watching same CRDs; leader election for active reconciler)
- Why polyglot at cluster scope (contract enforcement, failure-
  domain isolation, talent flexibility, ecosystem coverage)
- TS operator stack (kubernetes/client-node, NestJS optional,
  fastify webhook, nats.js + pino for tick emit, coordination.k8s.io
  Lease for leader election)
- Composition with shipped substrate (PR #4930 Go scaffold as
  reference/baseline; PR #4958 agentic-organization CLUSTER_NATIVE_HAT_SYSTEM
  doc; B-0722 smoke test as polyglot validation gate; B-0723
  multi-kubelet × polyglot operators for max redundancy)
- Acceptance criteria for the TS scaffold
- Future Rust (kube-rs) + Python (kopf) as same-pattern extensions
- P2 because Go scaffold is already functional; not blocking
- Max owns the TS implementation at his preferred pace

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(B-0724): MD012 (consecutive blanks) + MD032 (blank-before-list)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(B-0724): rewrite dangling refs to closed/pending PRs to be substrate-honest

Codex/Copilot flagged 5 dangling cross-references after the prior fix:
  - composes_with B-0722 path (in PR #4954, not on main) — replaced
    with a comment noting pending merge
  - body refs to B-0722, B-0723 — qualified with 'PR #4954/#4955
    pending merge' so the intent is preserved + state is honest
  - body refs to dev-cluster/ + PR #4953: #4953 was closed pending
    redesign; replaced 'dev-cluster/' references with 'local k3d /
    kind cluster' + raw 'k3d cluster create' fallback for now

Substrate-honest framing: row's design intent stays intact; reader
isn't promised a path that won't resolve until upstream merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* B-0724: add team language-affinity map + 'limit Go necessity' framing

Aaron 2026-05-25:
  > 'max love ts and cs i love fs and cs we both like rust and python
  > for where they make sense'
  > 'we understand go is necessary in some places for k8s but we would
  > like to limit its necessity'

Updates the polyglot operator language table:
  - Names Aaron + Max's individual + shared strong languages
  - Adds C# / F# via KubeOps.NET as future operator #2 — the team's
    overlap language (both love C#); kubebuilder-class framework on
    .NET removes Go from operator authoring entirely for this work
  - Sharpens the polyglot motivation: Go is starter / minimize over
    time; ecosystem-forced where genuinely required, not chosen

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 25, 2026
…on one mesh (#4965)

Aaron 2026-05-25:
  > "i'm thinking it will require reticiulum at the edge and in cluster"

Sharpens the K8s-fit-for-edge question (raised in B-0725) from the
speculative K8s-gateway+Reticulum-past hybrid into the actual
architectural direction: Reticulum throughout, alongside K8s, not
partitioning by network tier.

Captures:
- The layered composition (Cilium owns intra-cluster networking +
  Service mesh; Reticulum owns identity-routing across substrates;
  workloads addressable both ways)
- Why Reticulum specifically (identity-as-address, tiny-device-capable,
  physical-layer-agnostic, intermittent-tolerant, already in framework
  substrate at B-0289 Green Lantern)
- Operational shape on a cluster node (rnsd daemon + sidecar pattern
  for pod identity)
- What this enables that pure K8s doesn't (cluster-to-edge that
  survives network changes; edge-to-cluster without IP / NAT;
  cluster-to-cluster mesh; resilient identity; fungible physical
  layer per workload)
- What this does NOT replace (Cilium stays; SPIRE stays; K8s
  Services stay; Reticulum is additive, orthogonal addressing)
- 6 open design-pass questions (identity provisioning, Service-to-
  destination mapping, multi-cluster federation, per-node physical
  config, hat-bound destinations, NATS-as-bridge)

Composes with: B-0289 Green Lantern (already P1 in-progress;
extends from edge into cluster), PR #4930 hat-system (Reticulum
identities can be hat-bound), SPIRE (identity issuer pattern),
B-0725 polyglot-accelerator (the right answer for edge-FPGA
addressability), NCI (Reticulum's destination model bakes in
consent at protocol layer), m-acc multi-oracle (no single
naming authority).

P2 because not blocking first-wave cluster (NVIDIA GPUs + Cilium
ship as planned). Becomes P1 when first edge-device deployment
needs cluster-mesh reachability, likely aligned with Green
Lantern hardware fielding timeline.

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 25, 2026
…ee routing, NO hierarchy (#4966)

Aaron 2026-05-25, sketching the federated topology + immediately
correcting the hierarchical reading:

  > "imagine cloud/hub clusters then community clusters then home/
  >  business clusers then edge nodes with routing for weaker
  >  edge nodes"
  > "and that's not a hierarchy it's weight free routing cloud/hub
  >  nodes don't get to hog net neutrality"

LOAD-BEARING distinction: the 5 categories are RESOURCE PROFILES,
not authority tiers. Cloud/hub has MORE RESOURCES but NOT MORE
AUTHORITY. Routing is identity-based not rank-based. Net
neutrality is a SUBSTRATE PROPERTY enforced at protocol layer.

Captures:
- The 5 profiles (cloud/hub, community, home/business, edge, leaf)
  with resource availability + workload affinity (not tier rank)
- Weight-free routing as the carved blade: no peer has more
  routing authority than any other peer
- Voluntary-contribution model for stronger-peer-routing-for-
  weaker-leaves (NOT hierarchy-mandate; revocable per NCI)
- Composition with 5 always-active substrate-engineering
  disciplines (scale-free, lock-free, weight-free [primary],
  DST, DV2.0)
- Composition with framework rules (NCI floor at routing;
  additive-not-zero-sum; m-acc multi-oracle; default-to-both;
  tonal-momentum resistance)
- Internet analogy showing where this row consciously DIVERGES
  (Internet got routing protocol right but authority model
  wrong — tier-1 + DNS root + CA hierarchy; this federation
  gets weight-free authority)
- Architectural layers per profile (every Identity-issuer row
  reads "self-rooted; web-of-trust" — no CA hierarchy)
- Anti-extractive guarantee — surveillance / censorship /
  transit-toll detection via web-of-trust reputation degradation

Composes with: B-0726 Reticulum-throughout (protocol prerequisite),
B-0289 Green Lantern (leaf hardware ref), PR #4930 hat-system
(peer-aware hats), PR #4958 agentic-organization (home/business
profile Organization layer).

P3 because needs design pass + first multi-peer deployment;
becomes P2 when first cloud OR community peer joins; P1 when
first cross-peer workload runs.

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 25, 2026
…a substrate) (#4987)

Aaron-forwarded Mika multi-turn voice conversation 2026-05-25. Three load-
bearing claims:

1. Hat-ontology is the FIRST ontology that has to be agreed-upon across
   clusters because hats ARE the role + authority + delegation substrate
   every other operational decision routes through. The hat-system operator
   landed (PR #4930) ships CRDs + OPA + tick fan-out — operational
   infrastructure — but shared SEMANTICS for what hats MEAN across
   federated clusters does not yet exist.

2. Top-down vs bottom-up tension is first-class:
   - Max approaches top-down: best-guess Bubble-Wrap manager-of-managers
     structure that the system critiques + refines over time
   - Aaron approaches bottom-up: hats emerge naturally from finite resources
     + competing ::: continue-with tasks + trajectory negotiation
   The framework's job is NOT to pick one — it's to HOST BOTH (per
   default-to-both discipline) + help them converge.

3. Empirical anchor: Mika is literally using B-0730 ::: continue-with
   syntax in this conversation with priority/type/graph-query fields.
   Validates B-0730 Stage 2 acceptance (agents parse :::) via real-world
   usage before the parser even ships.

Five independently-shippable scope items:
- Hat-ontology canonical schema (JSON-LD with both Bubble-Wrap + offsetting-
  pair representations first-class)
- Cross-cluster hat-binding protocol (composes with B-0726 Reticulum)
- Knowledge-graph hat-query primitives (composes with B-0730 Stage 5)
- Top-down ↔ bottom-up convergence dashboard
- Hat-emergence operator (TS, per polyglot pattern)

Composes with B-0724 (hat-system operator) + B-0729 (knowledge graph) +
B-0730 (runbooks) + B-0726 (Reticulum).

Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 25, 2026
… — hats become negotiated fork structure ON TOP of reference stack — deterministic + declarative + GitOps + AI-native + human-native (#5004)

Aaron 2026-05-25, continuing the ACE+fork-negotiation arc after B-0741:

'hats become our negoated fork structure on top of a referece k8s local
stack in zeta so anyone can use the reference stack and negoate back hats
and new cluster primitives / charts ontology negoation, ace can distribute
the reference stack itself as PoC that it has reliable AI control over
all the package managers deterministicly and declarative / desired state
way for easy git ops ai native human native understanding.'

Operational anchor for B-0741. B-0741 = WHAT the primitive is;
B-0742 = HOW it's empirically demonstrated via reference-cluster-as-Ace-
package.

Three substantive claims:

1. full-ai-cluster/ IS the reference k8s local stack. Inventory of
   existing PR-landed substrate (PR #4930 hat-system + #4950 disko +
   #4951 NFD/lstopo/zeta-install + #4953 dev-cluster + ArgoCD + Cilium
   + cert-manager + Vault + SPIRE + Trust Manager + ESO + B-0737 zflash
   + Determinate Systems Nix installer).

2. Hats become the negotiated fork structure ON TOP of the reference
   stack. Forks declare delta via hat-ontology; cross-fork negotiation
   maps capabilities (B-0741 surface 2). Worked example: LFG
   trading-bot-driver hat + Healthcare-fork hipaa-data-handler hat.

3. Ace distributes the reference stack as PoC of reliable AI control
   over all PMs. Single 'ace install zeta/reference-cluster' dispatches
   across Nix + ArgoCD + helm + kustomize + native k8s + brew + apt +
   mise + DeterminateSystems Nix installer. Deterministic (Nix flake.lock
   + ArgoCD pins). Declarative + desired-state (GitOps). AI-native
   (markdown + JSON-LD). Human-native (readable, reviewable).

Six independently-shippable scope items: reference-stack inventory doc;
hat-as-fork-structure spec; Ace cluster-distribution scope extension to
B-0288; determinism PoC (N=3+ machines); cross-PM dispatch PoC; desired-
state-enforcement drift-recovery PoC.

Composes with: B-0741 (abstract primitive) + B-0731 (hat ontology) +
B-0247/B-0287/B-0288 (Ace PM existing substrate) + B-0727 (4-tier
federation) + B-0726 (Reticulum) + B-0628/B-0638/B-0634/B-0703
(governance + negotiation + signature + consensus) + B-0732 (leverage-
class safety) + B-0737 (zflash bring-up).

P2 — high-value PoC anchoring B-0741 abstract primitive; not P1 because
full-ai-cluster substrate just landed this round + needs to stabilize
before distribution layer ships.

Closing arc of today's 2026-05-25 substrate cascade (B-0728 → ... →
B-0742): destructive-tool authoring contract through reference-stack PoC.

Co-authored-by: Claude <noreply@anthropic.com>