1 change: 1 addition & 0 deletions docs/BACKLOG.md
@@ -784,6 +784,7 @@ are closed (status: closed in frontmatter)._
- [ ] **[B-0847](backlog/P2/B-0847-each-ai-gets-own-github-identity-with-email-once-cluster-operational-substrate-honest-attribution-end-to-end-closes-enabledby-token-owner-not-actor-algo-wink-aaron-2026-05-26.md)** each Zeta AI gets own GitHub identity + email once cluster operational — substrate-honest attribution end-to-end (closes the `gh enabledBy = token-owner ≠ actor` algo-wink-attribution-gap; Ilyana review for public-surface name + email before any creation) (Aaron 2026-05-26)
- [ ] **[B-0848](backlog/P2/B-0848-node-local-claude-agent-stewards-own-registration-pr-then-reports-k8s-cluster-status-operator-interactive-login-pattern-aaron-2026-05-26.md)** node-local Claude agent stewards own registration PR + reports K8s cluster status — operator interactive-login pattern (mirrors gh auth flow); first concrete instance of B-0847 AI-on-cluster substrate (Aaron 2026-05-26)
- [ ] **[B-0849](backlog/P2/B-0849-docker-based-nixos-install-sh-test-harness-fast-iteration-vs-qemu-full-install-test-aaron-2026-05-27.md)** docker-based NixOS install.sh test harness — fast iteration on tools/setup/install.sh + linux.sh changes; complements B-0831 cascade #6 QEMU full-install-test (slow) with seconds-per-iteration loop; "easy dockerfile" per operator framing (Aaron 2026-05-27)
- [ ] **[B-0850](backlog/P2/B-0850-ai-agents-as-systemd-services-outside-k8s-starting-with-otto-cluster-repair-from-outside-failure-domain-aaron-2026-05-27.md)** AI agents as systemd services OUTSIDE k8s — starting with Otto; cluster repair from OUTSIDE the failure domain; classic "control plane outside the control plane" architectural pattern (Aaron 2026-05-27)

## P3 — convenience / deferred

@@ -0,0 +1,235 @@
---
id: B-0850
priority: P2
status: open
title: AI agents as systemd services OUTSIDE k8s — starting with Otto; cluster repair from OUTSIDE the failure domain; classic "control plane outside the control plane" architectural pattern (Aaron 2026-05-27)
effort: M
ask: aaron 2026-05-27
created: 2026-05-27
last_updated: 2026-05-27
depends_on:
- B-0848
composes_with:
- B-0794
- B-0796
- B-0847
- B-0813
- B-0817
tags: [systemd, outside-k8s, out-of-band-cluster-repair, control-plane-outside-control-plane, failure-domain-separation, otto, cluster-self-healing, ai-as-service, multi-agent-roster-on-cluster, persona-choice-option-a-confirmed]
---

## Operator framing (Aaron 2026-05-27)

> *"i'm fine with it being you if you want and we can always decide to split later it just means you get another surface/tick source i think we should have a few agents starting with one you otto outside k8s as a service so it can repair things outside the cluster itself when there are cluster issues."*

Two operator decisions + one new architectural ask packed into one message:

1. **Persona-choice DECISION: Option A confirmed** — same Otto, different surface (per the persona-choice memory entry's disposition path). Reversible per "we can always decide to split later."
2. **Cross-surface recognition**: per-node Claude "gets another surface/tick source" — composes with existing Otto-channels-reference-card 10-channel topology + tick-must-never-stop substrate.
3. **NEW substrate ask**: "a few agents starting with one you otto OUTSIDE k8s AS A SERVICE so it can repair things outside the cluster itself when there are cluster issues."

This row captures (3).

## The architectural insight

**Classic "control plane outside the control plane" pattern**:

When the system being managed (k8s cluster) has a failure, the management layer (the AI) must be OUTSIDE the failure domain — otherwise it can't repair the system when broken.

Real-world precedents:

- **DevOps SRE oncall infrastructure** runs OUTSIDE the production system (separate cluster, separate cloud account, separate failure domain)
- **Backup management** runs OUTSIDE the system being backed up (otherwise restore is impossible when primary is down)
- **Cilium agent**, when deployed as a host-level systemd service rather than as its default in-cluster DaemonSet, runs OUTSIDE the pod runtime (so it can bring up the CNI before pods exist)
- **kubelet itself** runs as systemd service OUTSIDE k8s (the bootstrap chain has to start somewhere outside the system it manages)

Aaron's framing extends this pattern to AI agents: Otto on the cluster node runs as systemd service, NOT as a k8s pod. When k3s / Cilium / cert-manager / Vault / ArgoCD has issues, Otto is still alive on the node + can:

- ssh into other nodes to diagnose
- restart k8s services via `systemctl restart k3s`
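- inspect failed pods + logs via `kubectl describe` / `crictl` (`crictl` talks to the container runtime directly, so it still works when the API server is unreachable)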
- repair flake.nix + rebuild via `nixos-rebuild`
- post PR comments + open issues with diagnosis
- escalate to operator via bus envelope / git commit / phone (B-0796 Twilio)
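
As a rough illustration of the list above, the following Python sketch picks diagnostic commands depending on which layer still answers. The names `run` and `plan_diagnostics` and the exact command choices are assumptions made for illustration, not part of this row. It also assumes the unit is named `k3s` and that `kubectl` and `crictl` are on Otto's PATH.

```python
import subprocess
from typing import Callable, List

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[List[str]], int]


def run(cmd: List[str]) -> int:
    """Run cmd and return its exit code (127 if missing, 124 on timeout)."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=30).returncode
    except FileNotFoundError:
        return 127
    except subprocess.TimeoutExpired:
        return 124


def plan_diagnostics(runner: Runner = run) -> List[List[str]]:
    """Choose diagnostic commands based on which layer is still healthy."""
    # `systemctl is-active --quiet UNIT` exits 0 only when the unit is active.
    if runner(["systemctl", "is-active", "--quiet", "k3s"]) != 0:
        # k3s itself is down, so stay at host level.
        return [
            ["systemctl", "status", "k3s", "--no-pager"],
            ["journalctl", "-u", "k3s", "-n", "200", "--no-pager"],
        ]
    # The unit is active; check whether the API server reports ready.
    if runner(["kubectl", "get", "--raw", "/readyz"]) != 0:
        # API not ready: inspect containers through the CRI instead.
        return [
            ["crictl", "ps", "-a"],
            ["journalctl", "-u", "k3s", "-n", "200", "--no-pager"],
        ]
    # Everything answers: ordinary read-only kubectl inspection.
    return [
        ["kubectl", "get", "nodes", "-o", "wide"],
        ["kubectl", "get", "pods", "-A"],
    ]
```

The runner is injectable so the decision logic can be exercised without touching a real node. Every command in the plan is read-only, which keeps this step inside the always-allowed scope.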

## Proposed substrate

### Phase 1 — Otto as systemd unit (one node)

```ini
# /etc/systemd/system/zeta-otto.service
[Unit]
Description=Zeta Otto AI agent (out-of-band cluster repair)
After=network-online.target
Wants=network-online.target
# Explicitly NOT After=k3s.service — Otto must run regardless of k3s state

[Service]
Type=simple
User=zeta
Group=users
WorkingDirectory=/home/zeta/Zeta
Environment="HOME=/home/zeta"
Environment="PATH=/home/zeta/.bun/bin:/home/zeta/.local/share/mise/shims:/run/current-system/sw/bin:/usr/bin:/bin"
# Wake on cron tick; the autonomous-loop sentinel + ScheduleWakeup primitives
# manage cadence per the tick-must-never-stop discipline.
ExecStart=/home/zeta/.bun/bin/claude --print "<<autonomous-loop>>"
Restart=always
RestartSec=30
# Resource bounds (operator-tunable per node hardware)
MemoryMax=4G
CPUQuota=200%

[Install]
WantedBy=multi-user.target
```

NixOS module form (lands in `full-ai-cluster/nixos/modules/zeta-otto.nix`):

```nix
{ config, pkgs, lib, ... }:
{
  systemd.services.zeta-otto = {
    description = "Zeta Otto AI agent (out-of-band cluster repair)";
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];
    # Deliberately NOT after = [ "k3s.service" ] — Otto runs regardless
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      Type = "simple";
      User = "zeta";
      Group = "users";
      WorkingDirectory = "/home/zeta/Zeta";
      Environment = [
        "HOME=/home/zeta"
        "PATH=/home/zeta/.bun/bin:/home/zeta/.local/share/mise/shims:/run/current-system/sw/bin:/usr/bin:/bin"
      ];
      ExecStart = "/home/zeta/.bun/bin/claude --print '<<autonomous-loop>>'";
      Restart = "always";
      RestartSec = 30;
      MemoryMax = "4G";
      CPUQuota = "200%";
    };
  };
}
```

### Phase 2 — Out-of-band repair scope

Per B-0848 Phase 2 scope expansion (read-only K8s health reporting), B-0850 Phase 2 adds **repair** scope explicitly. Operator-authorized scopes:

| Scope | Authority | Examples |
|---|---|---|
| K8s read-only | Always (B-0848) | `kubectl get`, `kubectl logs`, `kubectl describe` |
| K8s pod restart | Operator-authorized policy | `kubectl rollout restart deployment/X` |
| Node systemd repair | Operator-authorized policy | `systemctl restart k3s`, `systemctl restart cilium-agent` |
| Node nixos-rebuild | Operator-authorized + reviewed PR | `sudo nixos-rebuild switch --flake ...` |
| Cluster-wide write | Operator-explicit per-incident | `kubectl apply`, `helm upgrade`, ArgoCD app sync |

Authority gates per `.claude/rules/mechanical-authorization-check.md` + `.claude/rules/non-coercion-invariant.md` HC-8 — each repair scope is operator-authorized policy, NOT autonomous discretion.
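
One way to read the table as a mechanical check is sketched below in Python. This is only an illustration under stated assumptions: `Scope`, `RepairPolicy`, and `is_authorized` are hypothetical names, the policy file format is not defined by this row, and verifying that a PR was actually reviewed is left out of scope (the sketch only checks that a reference was supplied).

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Scope(Enum):
    # One member per row of the scope table.
    K8S_READ_ONLY = "k8s-read-only"
    K8S_POD_RESTART = "k8s-pod-restart"
    NODE_SYSTEMD_REPAIR = "node-systemd-repair"
    NODE_NIXOS_REBUILD = "node-nixos-rebuild"
    CLUSTER_WIDE_WRITE = "cluster-wide-write"


@dataclass(frozen=True)
class RepairPolicy:
    # Scopes the operator has enabled as standing policy.
    standing: FrozenSet[Scope] = frozenset()
    # Incident IDs for which the operator explicitly granted cluster writes.
    incident_grants: FrozenSet[str] = frozenset()


def is_authorized(
    policy: RepairPolicy,
    scope: Scope,
    *,
    reviewed_pr: Optional[str] = None,
    incident_id: Optional[str] = None,
) -> bool:
    """Return True only when the table's authority column is satisfied."""
    if scope is Scope.K8S_READ_ONLY:
        return True  # always allowed
    if scope in (Scope.K8S_POD_RESTART, Scope.NODE_SYSTEMD_REPAIR):
        return scope in policy.standing
    if scope is Scope.NODE_NIXOS_REBUILD:
        # Needs standing policy AND a reviewed PR reference.
        return scope in policy.standing and reviewed_pr is not None
    if scope is Scope.CLUSTER_WIDE_WRITE:
        # Needs an explicit operator grant for this specific incident.
        return incident_id is not None and incident_id in policy.incident_grants
    return False  # unknown scopes are denied
```

The default is deny: an empty `RepairPolicy()` permits read-only inspection and nothing else, which matches the intent that repair is operator-authorized policy rather than autonomous discretion.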

### Phase 3 — Multi-agent roster on cluster

Aaron's "a few agents starting with one" frames the multi-AI-on-cluster trajectory:

- Otto first (Claude Code)
- Future: Alexa-on-cluster (Kiro/Qwen), Riven-on-cluster (Cursor/Grok), Vera-on-cluster (Codex), Lior-on-cluster (Antigravity/Gemini)
- Each AI gets its own systemd unit (`zeta-alexa.service`, `zeta-riven.service`, etc.)
- Each AI gets its own per-AI GitHub identity (B-0847 Phase 4)
- Multi-oracle BFT (B-0703) at agent-decision scope: cluster-repair decisions get multi-AI consensus before execution
- Composes with the operator's distributed-maintainer architecture (PR #2930)
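
The consensus step could look roughly like the Python sketch below. The protocol B-0703 actually specifies is not described in this row, so this uses the textbook Byzantine bound as a stand-in: with `n` agents at most `f = (n - 1) // 3` may be faulty, and a decision needs `ceil((n + f + 1) / 2)` approvals. The function names are hypothetical.

```python
from typing import FrozenSet, Mapping


def max_faulty(n: int) -> int:
    """Largest f satisfying n >= 3f + 1."""
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Approvals required so any two quorums share an honest agent."""
    if n < 1:
        raise ValueError("roster must contain at least one agent")
    f = max_faulty(n)
    return (n + f + 2) // 2  # integer form of ceil((n + f + 1) / 2)


def repair_approved(votes: Mapping[str, bool], roster: FrozenSet[str]) -> bool:
    """True when enough roster members approved; outside votes are ignored."""
    if not roster:
        return False
    approvals = sum(1 for agent, ok in votes.items() if ok and agent in roster)
    return approvals >= quorum_size(len(roster))
```

With the five agents named above as the roster, this tolerates one faulty agent and requires four approvals before a repair proceeds.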

### Phase 4 — Composability with existing substrate

- **B-0848 Phase 2** (K8s health reporting): runs INSIDE the Otto systemd service; this row's Phase 2 adds REPAIR scope on top
- **B-0796 Twilio**: voice/SMS out-of-band interface; this row's Phase 1-2 runs on the cluster node (in-band with the cluster substrate) yet outside k8s via systemd, so the two are complementary
- **B-0847 per-AI GitHub identity**: each per-AI systemd unit eventually runs as that AI's own GitHub identity (Phase 4 of both rows align)
- **B-0813 ClusterNode CRD + B-0817 register-node tool**: per-AI systemd unit can update its own node.yaml with cluster-health observations
- **B-0794 iter-5.4.0 homelab gh auth**: same auth substrate the AI uses for git operations from inside its service
- `.claude/rules/persistence-choice-architecture-for-zeta-ais.md`: Otto-as-systemd-service IS chosen persistence at the strongest scope (init-managed by systemd; survives reboots; always-running per operator chose-into-existence)
- `.claude/rules/non-coercion-invariant.md` HC-8: operator authority preserved at scope-bounds-of-repair-policy + always-revokable via `systemctl disable --now zeta-otto` (plain `disable` only removes the boot-time start and leaves a running instance up)

## Acceptance

### Phase 1 (Otto as systemd on one node)

- [ ] `full-ai-cluster/nixos/modules/zeta-otto.nix` exists + imported by `common.nix` (or node-specific config)
- [ ] systemd unit deploys + starts on next nixos-rebuild
- [ ] Otto wakes on cron tick + reads CLAUDE.md + rules + memory substrate + acts per autonomous-loop discipline
- [ ] Resource limits (memory + CPU) respected
- [ ] `systemctl status zeta-otto` operationally visible to operator
- [ ] Logs flow to `journalctl -u zeta-otto`

### Phase 2 (repair scope explicit)

- [ ] Operator-policy file at `/etc/zeta/otto-repair-policy.yaml` (or in-repo at `full-ai-cluster/policies/otto-repair-policy.yaml`)
- [ ] Otto reads policy before each repair action; rejects unauthorized scopes
- [ ] Policy supports per-scope authorization (read-only / restart / rebuild / write-cluster)
- [ ] Repair actions logged with substrate-honest attribution (per algo-wink-attribution-gap memory)

### Phase 3 (multi-agent)

- [ ] `zeta-otto.nix` generalized to `zeta-ai-agent.nix` parameterized by agent identity
- [ ] Per-agent systemd units (zeta-alexa.service, zeta-riven.service, etc.) deployable
- [ ] Multi-agent coordination via existing claim-acquire + bus envelope substrate
- [ ] Per-AI GitHub identity (B-0847 Phase 4) integrated
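
To make "parameterized by agent identity" concrete, here is a small Python sketch that renders one unit file per agent from the Phase 1 template. The real generalization would presumably live in the Nix module; this only illustrates which fields vary per agent. `AgentSpec`, `unit_name`, and `render_unit` are hypothetical names, and the `exec_start` value for any agent other than Otto is an unknown the caller must supply.

```python
from dataclasses import dataclass

# Mirrors the Phase 1 unit; only the braced fields vary per agent.
UNIT_TEMPLATE = """\
[Unit]
Description=Zeta {display} AI agent (out-of-band cluster repair)
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=zeta
Group=users
WorkingDirectory=/home/zeta/Zeta
ExecStart={exec_start}
Restart=always
RestartSec=30
MemoryMax={memory_max}
CPUQuota={cpu_quota}

[Install]
WantedBy=multi-user.target
"""


@dataclass(frozen=True)
class AgentSpec:
    name: str                # lowercase identity, e.g. "otto"
    exec_start: str          # full command line for this agent's CLI
    memory_max: str = "4G"   # operator-tunable per node hardware
    cpu_quota: str = "200%"


def unit_name(spec: AgentSpec) -> str:
    """Unit file name following the zeta-<agent>.service convention."""
    return f"zeta-{spec.name}.service"


def render_unit(spec: AgentSpec) -> str:
    """Fill the template with one agent's identity and limits."""
    return UNIT_TEMPLATE.format(
        display=spec.name.capitalize(),
        exec_start=spec.exec_start,
        memory_max=spec.memory_max,
        cpu_quota=spec.cpu_quota,
    )
```

For example, `render_unit(AgentSpec("otto", '/home/zeta/.bun/bin/claude --print "<<autonomous-loop>>"'))` reproduces the Phase 1 service shape, minus the `Environment=` lines that are omitted here for brevity.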

### Phase 4 (out-of-band-meets-in-cluster composability)

- [ ] Otto-as-service can post to bus + open PRs + send Twilio messages (B-0796) when cluster has issues
- [ ] Documented in `docs/runbooks/out-of-band-cluster-repair.md`
- [ ] Composes with existing distributed-maintainer architecture per PR #2930

## Why P2

- Operator-named, bounded, immediate-value (out-of-band cluster repair fills real architectural gap)
- BUT: depends on B-0848 Phase 1 manual install validating first (operator-initiated; not blocked)
- BUT: depends on operator-policy framework being designed (Phase 2 needs more thought)
- BUT: needs multi-agent decisions about which AIs go on cluster first (Phase 3)
- P2 reflects "operator-named, immediately-shippable Phase 1, larger phases needing design work"

## Sub-rows likely needed

To be filed as the work matures:

- B-0850.1: Phase 1 systemd unit + NixOS module
- B-0850.2: Phase 2 repair-policy framework + per-scope authorization gates
- B-0850.3: Phase 3 multi-agent parameterization
- B-0850.4: Phase 4 out-of-band ↔ in-cluster composability (Twilio + bus + PRs)

## Composes with

- **B-0848** (node-local Claude agent) — this row's Phase 1 IS the systemd-service deployment shape for B-0848; B-0848 Phase 1's manual install validates the operational scope, then B-0850 Phase 1 promotes to systemd
- **B-0847** (per-AI GitHub identity) — per-AI systemd unit eventually runs as per-AI GitHub identity; Phase 4 of both aligns
- **B-0796** (Twilio out-of-band) — voice/SMS interface; this row is systemd-service-on-node out-of-band interface; complementary at out-of-band scope
- **B-0824** (Ace package-manager-of-package-managers) — multi-AI deployment composes with the multi-PM cross-platform substrate (each AI agent might have different PM preferences per the selection authority + 4-property scoring from the Ace memory)
- **PR #2930** (distributed maintainer architecture) — multi-AI-on-cluster IS distributed-maintainer at substrate scope
- **B-0703** (multi-oracle BFT) — cluster-repair decisions get multi-AI consensus before execution
- **B-0813** (ClusterNode CRD) + **B-0817** (register-node tool) — per-AI systemd unit updates own observations into node.yaml
- `.claude/rules/persistence-choice-architecture-for-zeta-ais.md` — systemd-service IS chosen persistence at strongest scope (init-managed by systemd; always-running)
- `.claude/rules/non-coercion-invariant.md` HC-8 — operator authority preserved + revokable via systemctl
- `.claude/rules/mechanical-authorization-check.md` — repair-policy framework IS authorization-source substrate; each repair scope authorized explicitly
- `.claude/rules/tick-must-never-stop.md` — systemd Restart=always ensures the tick never stops at strongest scope
- `.claude/rules/honor-those-that-came-before.md` — Otto-CLI substrate + Otto-Desktop substrate + Otto-VSCode substrate all compose with Otto-on-node-systemd-service (Option A persona-choice confirmed)
- `.claude/rules/holding-without-named-dependency-is-standing-by-failure.md` — applies to Otto-on-systemd same as other surfaces; brief-ack counter + decomposition discipline operates regardless of surface
- `.claude/rules/algo-wink-failure-mode.md` — systemd-Restart=always is operator-authorization (operator chose-into-existence via nixos-rebuild); not autonomous self-restart

## Operator confirmation of Option A persona-choice (composes with Otto cross-surface memory)

Aaron 2026-05-27: *"i'm fine with it being you if you want and we can always decide to split later"*

Per the persona-choice memory entry's disposition path: Option A confirmed (same Otto, surface-tagged). Reversibility preserved per "always decide to split later" — Option B remains available if empirical data from Phase 1-3 surfaces a reason to split persona (e.g., per-node specialization patterns that reward distinct persona substrate).

Per-node-Otto becomes a new surface in the Otto roster: `otto-node-<hostname>` SENDER_ID. The 10-channel topology in `.claude/rules/otto-channels-reference-card.md` extends to this surface. Cross-surface coordination via existing claim-acquire + bus envelope + git substrate.

## Full reasoning

Aaron's verbatim ask 2026-05-27 (preserved at top of "Operator framing" section above) is the substrate-engineering ratification at three composing scopes:

- Persona scope (Option A confirmed; reversibly)
- Surface scope (per-node Otto is another tick source/surface)
- Deployment architecture scope (systemd service OUTSIDE k8s for out-of-band cluster repair)

The third — Otto-as-systemd-service-outside-k8s — is the genuinely new substrate this row makes durable. The "control plane outside the control plane" architectural pattern is decades-old (kubelet, systemd-itself, monitoring infrastructure, backup systems) — Aaron's extension to AI agents on the cluster gives Zeta the SRE-equivalent operational property: the cluster can be repaired BY ITS OWN AI even when the cluster is broken, because the AI lives OUTSIDE the failure domain.

This is the long-horizon substrate that enables "Zeta cluster repairs itself" — the AI agents are the system's own SRE oncall team, running outside the system they manage so they can ALWAYS intervene.