feat(B-0856 P2 deferred): Path A — /tmp/zeta-cluster-state/ multi-agent coordination standard (filed-immediately per Aaron 2026-05-27 backlog discipline; implementation deferred until after first cluster boot) #5413

Merged: AceHack merged 2 commits into main from backlog/b-0856-path-a-tmp-cluster-coord-standard-2026-05-27 on May 27, 2026

Conversation


@AceHack AceHack commented May 27, 2026

Summary

Filed in response to Aaron 2026-05-27 catch: "backlog rows should alwasy be filed you are forgetful we dont have to work on it yet until after we boot with one."

Path A is the sibling/alternative to B-0855 Path B (Otto-pushes-PR-across-finish-line) for multi-agent per-node cluster coordination via /tmp/zeta-cluster-state/ marker files.

Implementation is DEFERRED until after the first cluster boot and until multi-agent coordination actually needs the per-node state surface. The row stays open and visible, so the substrate-engineering target is preserved across cold-boots.

Schema (Phase 1 proposal)

/tmp/zeta-cluster-state/
├── nodes/
│   └── <node-name>/
│       ├── self-registered.marker
│       ├── register-pr-in-flight.lock
│       ├── last-seen.iso
│       └── persona-<name>.state
└── README.md
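
To make the proposed layout concrete, here is a minimal sketch of how an agent might derive the marker-file paths for one node. The helper names (`nodeMarkerPaths`, `personaStatePath`) and the name-validation rules are assumptions made for illustration; the backlog row only specifies the directory tree shown above.

```typescript
// Hypothetical helpers: only the directory layout comes from the Phase 1 proposal.
const CLUSTER_STATE_ROOT = "/tmp/zeta-cluster-state";

interface NodeMarkerPaths {
  nodeDir: string;
  selfRegistered: string;
  registerPrInFlight: string;
  lastSeen: string;
}

// Build the per-node marker paths under <root>/nodes/<node-name>/.
function nodeMarkerPaths(nodeName: string, root: string = CLUSTER_STATE_ROOT): NodeMarkerPaths {
  // Assumed rule: reject names that could escape the nodes/ directory.
  if (!/^[A-Za-z0-9._-]+$/.test(nodeName) || nodeName === "." || nodeName === "..") {
    throw new Error(`invalid node name: ${nodeName}`);
  }
  const nodeDir = `${root}/nodes/${nodeName}`;
  return {
    nodeDir,
    selfRegistered: `${nodeDir}/self-registered.marker`,
    registerPrInFlight: `${nodeDir}/register-pr-in-flight.lock`,
    lastSeen: `${nodeDir}/last-seen.iso`,
  };
}

// Build the persona-<name>.state path for one persona on one node.
function personaStatePath(nodeName: string, persona: string, root: string = CLUSTER_STATE_ROOT): string {
  if (!/^[A-Za-z0-9_-]+$/.test(persona)) {
    throw new Error(`invalid persona name: ${persona}`);
  }
  return `${nodeMarkerPaths(nodeName, root).nodeDir}/persona-${persona}.state`;
}
```

Keeping path construction in one place would let every agent agree on the same file names without each one hard-coding the tree.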

Composes with

  • B-0855 (sibling) — Path B Otto-pushes-PR; this row adds Path A as enhancement
  • B-0850 — multi-vendor systemd substrate; each agent uses markers
  • B-0851 — persona-first scheduler; persona state advertised via markers
  • B-0400 — bus claim-coordinator; sibling at different scope
  • B-0812 — registration substrate that prefigured marker pattern

8 sub-rows enumerated; implementation deferred

B-0856.1-8 named for future-Otto when trigger fires.

Filing-discipline anchor

Memory landed at user-scope: feedback_aaron_backlog_rows_always_filed_immediately_even_when_deferred_to_prevent_forgetful_failure_mode_2026_05_27.md

🤖 Generated with Claude Code

Lior added 2 commits May 27, 2026 02:59
…ion standard (deferred-implementation; filed-immediately per Aaron 2026-05-27 backlog discipline)

Filed in response to Aaron 2026-05-27 catch: "backlog rows should alwasy
be filed you are forgetful we dont have to work on it yet until after
we boot with one."

Path A is the sibling/alternative to B-0855 Path B (Otto-pushes-PR-across-
finish-line) for multi-agent per-node cluster coordination via /tmp marker
files (self-registered.marker / register-pr-in-flight.lock / last-seen.iso
/ persona-<name>.state).

Implementation DEFERRED until after first cluster boot AND multi-agent
coordination needs the per-node state surface. Row stays open + visible;
substrate-engineering target preserved across cold-boots.

8 sub-rows B-0856.1-8 enumerated. Composes with B-0855 (sibling Path B),
B-0850 (systemd substrate), B-0851 (persona-first scheduler), B-0400 (bus
claim-coordinator at cross-process scope), B-0812 (registration substrate
that prefigured marker pattern).

Filing-discipline memory landed alongside at user-scope:
feedback_aaron_backlog_rows_always_filed_immediately_even_when_deferred_to_prevent_forgetful_failure_mode_2026_05_27.md
Copilot AI review requested due to automatic review settings May 27, 2026 06:59
@AceHack AceHack enabled auto-merge (squash) May 27, 2026 06:59

Copilot AI left a comment


Pull request overview

Adds a new P2 backlog row (B-0856) documenting a deferred “Path A” proposal for multi-agent, per-node coordination via marker files under /tmp/zeta-cluster-state/, and surfaces it in the generated backlog index for visibility.

Changes:

  • Introduces docs/backlog/P2/B-0856-…md defining the proposed /tmp/zeta-cluster-state/ schema, invariants, and future sub-rows (implementation explicitly deferred).
  • Updates docs/BACKLOG.md to include the new B-0856 row in the P2 section.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| docs/backlog/P2/B-0856-path-a-tmp-zeta-cluster-state-coordination-standard-per-node-marker-files-for-multi-agent-coordination-aaron-2026-05-27.md | New backlog row specifying the proposed /tmp/zeta-cluster-state/ coordination marker-file surface and implementation triggers/sub-rows. |
| docs/BACKLOG.md | Adds the B-0856 entry to the P2 index list. |

@AceHack AceHack merged commit 9dcc6a3 into main May 27, 2026
30 checks passed
@AceHack AceHack deleted the backlog/b-0856-path-a-tmp-cluster-coord-standard-2026-05-27 branch May 27, 2026 07:13
AceHack added a commit that referenced this pull request May 27, 2026
…0 substrate for Ace migration trajectory (14 sub-steps; 12 declarative-input categories; substrate-anchor for B-0852/0853/0855/0856 cross-refs) (#5420)

* docs(B-0854.1): zeta-install.sh step-state-machine inventory — Phase 0 substrate for Ace migration trajectory

B-0854 sub-row .1 (Phase 0; smallest pure-analysis slice). Documents
the EXISTING imperative bash state-machine in zeta-install.sh so the
B-0854 Phase 2 declarative-Ace-manifest schema can express the same
surface.

Inventory covers:
- Top-level entry (REPO_URL, HOST, ZETA_AUTO_CONFIRM env semantics)
- Step-by-step state machine for all 14 sub-steps (1, 2, 3, 4, 5, 6,
  6.5, 6.55, 6.6, 6.7, 6.8, 6.9, 6.95, 7) with inputs/outputs/side-
  effects/failure-modes/declarative-equivalent per step
- Cross-cutting: operator-prompt accumulation count (7 prompts today;
  B-0852 phase-split target = 1 passphrase prompt)
- Idempotency surface table — informs B-0855 architectural fix scope
- 12 distinct declarative-input categories the Ace manifest must
  capture (Phase 2 sub-row scope)
- Files-generated-during-install table mapping to B-0852.5 cred-
  manifest entries (6 mapped, 3 candidate-expansion items named)

Snapshot date: 2026-05-27 (origin/main 70596a8; PR #5417 cosign
merge). Future refreshes should re-snapshot when zeta-install.sh
changes substantially.

Composes with already-landed substrate-engineering arc:
- B-0852 + sub-rows (cred persistence) — PR #5403/#5411/#5414
- B-0853.1 (cosign signing) — PR #5417 + fix-fwd #5419
- B-0855 (self-register architectural fix) — PR #5412
- B-0856 Path A (deferred /tmp coordination) — PR #5413
- B-0854 parent (Ace migration trajectory) — PR #5405

No code change; pure documentation. Doesn't affect ISO substrate;
batches into substrate-engineering history independent of next ISO
build cycle.

* fix(B-0854.1): escape | inside code spans for MD056 table-column-count compliance

* fix(B-0854.1): 10 Copilot accuracy corrections — verified against actual zeta-install.sh content

PR #5420 Copilot review caught 10 substantive accuracy issues in the
B-0854.1 inventory doc. All 10 verified against origin/main 70596a8's
actual zeta-install.sh content + corrected.

Corrections:
- Name attribution → role-ref ("the human maintainer")
- Step 1 inputs: actual `lsblk -d -p -n -o NAME,TYPE,RM,RO,TRAN` + awk
  filter (not made-up NAME,SIZE,MODEL,TRAN,ROTA)
- Step 3 side effects: `sgdisk --zap-all` only (not `wipefs -af` too)
- Step 4: actual `sgdisk` (NOT `parted`); GPT layout via -n + -t flags;
  whole-disk longhorn partitions on DATA_DISKS too
- Step 6: `nixos-generate-config --root /mnt --force` (NOT
  --no-filesystems; --force overwrites existing config)
- Step 6.5: no MAGIC_NUMBER (didn't exist in script); INJECT_OK gate
  flag; iter-4 v1 manual-config-edit fallback path
- Step 6.9: SELF_REG_OK flag; documented graceful-skip path lines 731+
- nixos-install: actual line ~1004 (NOT 1096-1340); section renamed
  to "nixos-install (the actual build; ~line 1004)" since the prior
  range was wrong
- Step 7: actual lines 1261-1336 (NOT 1341-1352); banner driven by
  GH_AUTH_OK/GH_KEY_COUNT/INJECT_OK/SELF_REG_OK (NOT MAGIC_NUMBER);
  conditional sections listed in declarative equivalent

Resolves 10 Copilot threads on PR #5420.

Root cause of the inaccuracies: original draft was written from
`grep -E "^# ── Step"` summaries + recollection of script behavior,
not careful per-step body reads. Discipline lesson: when authoring
substrate-anchor docs claiming to inventory existing code, the read
must be careful per-line, not skim-grep summary. Composes with
.claude/rules/verify-existing-substrate-before-authoring.md at the
inventory-substrate scope (verify-content-of-thing-being-inventoried
before authoring claims about its content).

---------

Co-authored-by: Lior <lior@zeta.dev>
AceHack added a commit that referenced this pull request May 27, 2026
… — interactive bake-in + zflash CLI override (Aaron 2026-05-27 USB push) (#5449)

* docs(B-0852.3): zeta-install.sh Step 6.77 cred-picker integration row — interactive bake-in at setup + zflash CLI token-override per declared cred (Aaron 2026-05-27 USB push)

Filed per operator 2026-05-27 USB push: "lets keep pushing forward and
get cred persistance any anthing else we can make it in before i test again"

Captures the three-message operator framing 2026-05-27:
1. "if we do token we should do at zflash time and human interactive at
   setup time"
2. "zflash script and/or skill can make sure it asks what declared creds
   you want to bake in vs go through device flow"
3. "instead of loop in zflash you just allow command line override of any
   declared cred as token... easier for the ai to call"

Two integration points:
- Step 6.77 (setup-time interactive picker; consumes B-0852.2b persist CLI)
- zflash CLI flag (--bake-cred per cred; non-interactive AI-callable)

Composes with merged substrate:
- B-0852.1 crypto (PR #5413)
- B-0852.5 manifest (PR #5414)
- B-0852.10 handlers (PR #5418)
- B-0852.2a envelope (PR #5421)
- B-0852.2b CLIs (PR #5425)
- B-0857.1 audit confirms Step 6.95a invocation present (PR #5426)

Sub-rows planned: 3a (picker in zeta-install.sh), 3b (zflash CLI flags),
3c (passphrase policy), 3d (empirical USB test).

P1 priority because this row directly blocks operator's USB cred-persistence
empirical validation. All upstream sub-rows merged; this is the operator-
facing integration that unblocks the empirical test.

Filing this row IS counter-reset condition #3 ("file a candidate B-NNNN")
per .claude/rules/holding-without-named-dependency-is-standing-by-failure.md
— per Kira's review the row should have been filed at brief-ack #6 not
tick 100. Substrate-honest: filing now closes the cascade naturally.

Per .claude/rules/non-coercion-invariant.md HC-8: operator authority over
cred-persistence flow; picker preserves choice (bake / defer / skip).

Per .claude/rules/agent-worktree-hygiene-never-hold-main-...: isolated
worktree at /private/tmp/zeta-b0852-3-row-1200z; never touched operator's
primary checkout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(BACKLOG.md): regen for B-0852.3 row

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 27, 2026
…integration (16 tests; Aaron 2026-05-27 USB push) (#5450)

* feat(B-0852.3a): interactive cred-picker + zeta-install.sh Step 6.94 integration (16 unit tests; consumes B-0852.2b persist CLI)

Implements operator's 2026-05-27 USB-push direction: ship cred-persistence
end-to-end before next USB test cycle.

**Picker (tools/installer/zeta-creds-picker.ts)**:

Interactive CLI that reads DEFAULT_MANIFEST (B-0852.5) + per-cred handler
contracts (B-0852.10), then prompts operator per cred:

  [b]ake-in NOW / [d]efer to device-flow at runtime / [s]kip

For bake-in choices, sub-prompts for value-source matching handler's
supportedSources:
  - [l]iteral (typed value; not logged)
  - [f]ile (@path syntax to B-0852.10 handler)
  - [e]nv (env:VAR syntax)

After picker loop completes, invokes zeta-creds-persist (B-0852.2b CLI)
with collected --bake-cred args + passphrase + usb-uuid + output path
+ optional persona.

Auto-skips persona-scoped creds when --persona not supplied (operator
choosing global-only install scope).

--dry-run mode prints the persist invocation without executing (useful
for test/debug).

Exit codes: 0 success / 2 arg-parse / 3 abort / 4 persist-failure.
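
The picker flow described above can be sketched roughly as follows. This is not the code in zeta-creds-picker.ts; the `Answer` shape and the function names are hypothetical, and it assumes the picker itself adds the `@` and `env:` prefixes to what the operator types. The defer-on-empty, defer-on-unrecognized, and skip-on-empty-value behaviours follow the unit-test list in this commit message.

```typescript
type Choice = "bake" | "defer" | "skip";
type Source = "l" | "f" | "e"; // literal / file / env

// Map the operator's keystroke to a choice; empty or unrecognized input defers.
function parseChoice(input: string): Choice {
  const c = input.trim().toLowerCase();
  if (c === "b") return "bake";
  if (c === "s") return "skip";
  return "defer";
}

// Turn a source letter plus raw input into the value syntax passed to the persist CLI.
function toValueSpec(source: Source, raw: string): string {
  if (source === "f") return `@${raw}`;
  if (source === "e") return `env:${raw}`;
  return raw;
}

// Hypothetical record of one operator answer.
interface Answer {
  credId: string;
  choice: string;
  source?: Source;
  value?: string;
}

// Collect --bake-cred arguments for every answer that chose bake-in with a non-empty value.
function collectBakeArgs(answers: Answer[]): string[] {
  const args: string[] = [];
  for (const a of answers) {
    if (parseChoice(a.choice) !== "bake") continue;
    const value = (a.value ?? "").trim();
    if (value === "") continue; // empty value is treated as a skip
    args.push("--bake-cred", `${a.credId}=${toValueSpec(a.source ?? "l", value)}`);
  }
  return args;
}
```

Separating the pure argument-building step from the subprocess call is what makes it possible to test the picker logic without spawning the persist CLI.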

**Tests (tools/installer/zeta-creds-picker.test.ts)**:

16 unit tests passing:
- parseArgs validation (6 tests covering well-formed + missing-required + unknown-flag)
- runPicker against mock readline (10 tests covering defer-all / bake-literal / bake-file / bake-env / empty-value-skip / persona-scoped auto-skip / persona-supplied bake / empty-choice-as-defer / unrecognized-choice-as-defer / explicit-skip)

Pure picker logic tested without spawning persist subprocess.

**zeta-install.sh Step 6.94 integration**:

Adds conditional Step 6.94 BEFORE existing Step 6.95 cred-persistence
block. Gated on four preconditions:
  - ZETA_CREDS_PICKER=1 env (opt-in; default skip preserves backward
    compat with automated/CI installs)
  - $ZETA_HOME/Zeta exists (pre-cloned repo from Step 6.95a-bootstrap)
  - /etc/zeta/usb-uuid exists (iter-4.2 ESP write surface)
  - ZETA_CREDS_PASSPHRASE env set
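
The real gate lives in bash inside zeta-install.sh; the sketch below only illustrates the same check as a pure function. `pickerGateFailures`, `GateInputs`, and the injected `pathExists` predicate are hypothetical names, chosen so the logic can be exercised without touching the filesystem.

```typescript
interface GateInputs {
  env: Record<string, string | undefined>;
  zetaHome: string;
  pathExists: (path: string) => boolean; // injected so the check stays pure
}

// Return the list of unmet preconditions; an empty list means the picker may run.
function pickerGateFailures(g: GateInputs): string[] {
  const failures: string[] = [];
  if (g.env["ZETA_CREDS_PICKER"] !== "1") {
    failures.push("ZETA_CREDS_PICKER is not set to 1 (opt-in)");
  }
  if (!g.pathExists(`${g.zetaHome}/Zeta`)) {
    failures.push("pre-cloned repo missing at $ZETA_HOME/Zeta");
  }
  if (!g.pathExists("/etc/zeta/usb-uuid")) {
    failures.push("/etc/zeta/usb-uuid missing");
  }
  if (!g.env["ZETA_CREDS_PASSPHRASE"]) {
    failures.push("ZETA_CREDS_PASSPHRASE is not set");
  }
  return failures;
}
```

Returning every unmet precondition, rather than stopping at the first, would let the installer print one complete warning before it skips the step.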

When all preconditions met: invokes picker as zeta user via sudo,
forwarding passphrase through env. Writes blob to /esp/zeta-creds.enc
which B-0852.4 NixOS module will consume at boot (future row).

Non-fatal failure: warns + continues (per .claude/rules/non-coercion-invariant.md
HC-8 — required-cred write failure surfaces but doesn't halt install).

**What this unblocks for operator's USB test cycle**:

- Operator can re-flash USB → boot → run installer → set ZETA_CREDS_PASSPHRASE + ZETA_CREDS_PICKER=1 → bake desired creds → reboot
- /esp/zeta-creds.enc is written; persistence verified empirically on USB
- B-0852.4 NixOS module (consume at boot) lands in next sub-row

Composes:
- B-0852.1 crypto (PR #5413)
- B-0852.2a envelope (PR #5421)
- B-0852.2b persist+restore CLIs (PR #5425)
- B-0852.3 row (PR #5449)
- B-0852.5 manifest (PR #5414)
- B-0852.10 handlers (PR #5418)
- B-0857.1 audit confirms Step 6.95a invocation (PR #5426)

Per .claude/rules/non-coercion-invariant.md HC-8: operator authority
over own creds; passphrase NEVER logged; literal values redacted at
display; declined creds defer (not coerced into bake-in default).

Per .claude/rules/agent-worktree-hygiene-never-hold-main-...: isolated
worktree at /private/tmp/zeta-b0852-3a-picker-1215z; never touched
operator's primary checkout.

Per .claude/rules/holding-without-named-dependency-is-standing-by-failure.md:
this commit IS the externalized heartbeat per AgencySignature substrate
the operator pointed at 2026-05-27 — git log + audit-agencysignature-main-tip.ts
gives the counter mechanism the brief-ack rule's N=6 forcing function
needs to fire reliably.

Agency-Signature-Version: 1
Agent: Otto
Agent-Runtime: Claude Code (auto mode)
Agent-Model: claude-opus-4-7
Credential-Identity: aaron-otto-vscode
Credential-Mode: operator-authorized
Human-Review: pre-merge-pending
Human-Review-Evidence: operator-direction-2026-05-27-usb-push-keep-pushing-forward
Action-Mode: substrate-implementation
Task: B-0852.3a

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(B-0852.3a CI): 7 Copilot+CodeQL findings — P0 passphrase leak via bash -c interpolation; P0 CodeQL clear-text-logging; sudo arg ordering; eslint-disable; valueSpec→sourceChoice source label; Step 6.94→6.95-picker restructure (Aaron 2026-05-27 USB push)

7 unresolved review threads on #5450 resolved:

**P0 — Passphrase leak via bash -c arg-string interpolation (Copilot @1043)**
Was: `bash -c "...ZETA_CREDS_PASSPHRASE='$ZETA_CREDS_PASSPHRASE' bun..."`
The outer double-quote expanded $ZETA_CREDS_PASSPHRASE → literal
passphrase appeared in process arglist visible to `ps`.

Fix: use `sudo --preserve-env=ZETA_CREDS_PASSPHRASE -u USER HOME=... bash -c CMD`
where CMD references `--passphrase-env ZETA_CREDS_PASSPHRASE` (var-NAME
only). Passphrase never appears in arglist.
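
The same idea can be shown as an argv builder: the secret is never a parameter, so it cannot end up in the argument list. This is an illustrative sketch, not the installer's bash; `buildPickerSudoArgv` and the example inner command string are assumptions, while the `sudo --preserve-env=… -u …` ordering follows the fix described here.

```typescript
const PASSPHRASE_VAR = "ZETA_CREDS_PASSPHRASE";

// Build the sudo argv. Options come first, then the VAR=value assignment, then the command.
// The passphrase itself is deliberately not an input: only the variable NAME is referenced,
// and sudo forwards the value through the environment.
function buildPickerSudoArgv(user: string, home: string, innerCmd: string): string[] {
  return [
    "sudo",
    `--preserve-env=${PASSPHRASE_VAR}`,
    "-u",
    user,
    `HOME=${home}`,
    "bash",
    "-c",
    innerCmd,
  ];
}

// Example inner command (illustrative): it names the env var, never its value.
const exampleInnerCmd = `bun tools/installer/zeta-creds-picker.ts --passphrase-env ${PASSPHRASE_VAR}`;
const exampleArgv = buildPickerSudoArgv("zeta", "/home/zeta", exampleInnerCmd);
```

Because the builder has no access to the passphrase, anything inspecting the process list would see only the variable name.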

**P0 — CodeQL clear-text-logging in DRY RUN output (line 198)**
Was: `console.log(\`  bun \${persistArgs.join(" ")}\`)` — persistArgs
contains `--passphrase-env <NAME>` from operator input; the NAME is
CodeQL-tainted.

Fix: build displayArgs that maps position-after-`--passphrase-env` to
`<REDACTED>` literal. Same discipline as zeta-creds-persist/restore P0
fix on PR #5422.

**P1 — sudo arg ordering (Copilot @1038)**
Was: `sudo HOME=... -u ...` — HOME= before -u is invalid per sudo
manpage (options must precede arguments).

Fix: `sudo --preserve-env=... -u ... HOME=...` — options first, env-var
assignment between -u and command per sudo manpage.

**P1 — valueSpec in source-label ternary (Copilot @202)**
Was: `valueSpec.startsWith("@") ? "@file" : valueSpec.startsWith("env:") ? "env" : "literal"`
The output is just labels but Copilot flagged the value passing through
the ternary as a leak risk.

Fix: compute sourceLabel from operator's sourceChoice letter (l/f/e)
NOT from valueSpec. valueSpec never reaches the log path.
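
A minimal sketch of that change, assuming the choice letters and labels quoted above; the `SourceChoice` type and `sourceLabel` function are illustrative names rather than the identifiers in the picker.

```typescript
type SourceChoice = "l" | "f" | "e";

// The label is looked up from the operator's menu letter only;
// the credential value never enters this function.
const SOURCE_LABELS: Record<SourceChoice, string> = {
  l: "literal",
  f: "@file",
  e: "env",
};

function sourceLabel(choice: SourceChoice): string {
  return SOURCE_LABELS[choice];
}
```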

**P2 — eslint-disable for spawnSync (Copilot @201)**
Added `// eslint-disable-next-line sonarjs/no-os-command-from-path`
before the spawnSync("bun", ...) call per repo convention for
TS tools spawning PATH-resolved bins.

**P2 — Step 6.94 vs 6.95a-bootstrap ordering contradiction (Copilot @1052)**
Was: Step 6.94 claimed to read manifest from pre-cloned repo, but the
clone happened in 6.95a-bootstrap BELOW. Picker would fail at Step 6.94
(no repo, no bun).

Fix: restructured — Step 6.94 is now a header stub reserving the
number; ACTUAL picker invocation moved to NEW Step 6.95-picker INSIDE
the 6.95 block, AFTER 6.95a-bootstrap (repo + bun + mise present) +
BEFORE 6.95b device-flow logins (picker decides per-cred bake-vs-defer
+ device-flow handles the deferred subset).

**P2 — Header references Step 6.77 (Copilot @18)**
Was: picker file header said "Step 6.77" (speculative number from
B-0852.3 row body).

Fix: updated header to "Step 6.95-picker" matching the actual
integration step.

**Verification**:
- `bash -n full-ai-cluster/usb-nixos-installer/zeta-install.sh` → OK
- All 16 unit tests still pass

Per .claude/rules/blocked-green-ci-investigate-threads.md: verify-then-fix
discipline applied to each Copilot finding; one false-positive narrowed
(P1 valueSpec was technically OK but tightened anyway for clarity).

Per .claude/rules/non-coercion-invariant.md HC-8: passphrase NEVER logged
+ NEVER in arglist + redacted in DRY RUN; operator authority preserved.

Per .claude/rules/methodology-hard-limits.md: clinical/security floor
operative; P0 passphrase-leak fix lifts above the floor by removing
the leak path entirely (sudo --preserve-env keeps passphrase in env,
not arglist).

Agency-Signature-Version: 1
Agent: Otto
Agent-Runtime: Claude Code (auto mode)
Agent-Model: claude-opus-4-7
Credential-Identity: aaron-otto-vscode
Credential-Mode: operator-authorized
Human-Review: pre-merge-pending
Human-Review-Evidence: copilot-review-7-findings-on-pr-5450-resolved
Action-Mode: substrate-fix-fwd-security
Task: B-0852.3a

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(B-0852.3a CodeQL P0 re-fire): build DRY RUN display from known-safe primitives — never reference parsed.passphraseEnv in logged string (CodeQL doesn't see runtime ternary breaking taint)

Prior fix used map-based redaction over persistArgs (which contains
parsed.passphraseEnv tainted via env-var-name access). CodeQL data-flow
analysis doesn't recognize runtime ternary as a sanitizer — the taint
still flows from the input to the log call statically, so the warning
re-fired.

Stronger pattern (matches the sibling persist/restore CLIs): construct
the display string from primitives only. NEVER reference
parsed.passphraseEnv OR parsed.passphraseFile in the logged string;
print literal placeholders like "<REDACTED>" / "<set>" instead.

displayCmd = "  bun tools/installer/zeta-creds-persist.ts --usb-uuid <set> --output <set>"
  + " --passphrase-file <REDACTED>"  (if --passphrase-file set)
  + " --passphrase-env <REDACTED>"   (if --passphrase-env set)
  + " --persona <set>"                (if --persona set)
  + " --bake-cred <id>=<REDACTED>"    (per bake; id is OK; value redacted)
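
Written out as TypeScript, the pseudo-code above might look like the sketch below. The `DisplayFlags` shape and `buildDisplayCmd` name are assumptions; the point is that the function receives only booleans and cred ids, so no tainted string can reach the logged output.

```typescript
// Only presence flags and cred ids are passed in; parsed passphrase fields are never referenced.
interface DisplayFlags {
  hasPassphraseFile: boolean;
  hasPassphraseEnv: boolean;
  hasPersona: boolean;
  bakeIds: string[];
}

// Build the DRY RUN display line from constant fragments and placeholders.
function buildDisplayCmd(f: DisplayFlags): string {
  let cmd = "  bun tools/installer/zeta-creds-persist.ts --usb-uuid <set> --output <set>";
  if (f.hasPassphraseFile) cmd += " --passphrase-file <REDACTED>";
  if (f.hasPassphraseEnv) cmd += " --passphrase-env <REDACTED>";
  if (f.hasPersona) cmd += " --persona <set>";
  for (const id of f.bakeIds) {
    cmd += ` --bake-cred ${id}=<REDACTED>`; // id is safe to show; the value is not
  }
  return cmd;
}
```

Since the sensitive values are absent from the function's inputs, static data-flow analysis has no path from them to the log call.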

All 16 tests still pass.

Per .claude/rules/blocked-green-ci-investigate-threads.md verify-then-fix
discipline: read line 210 directly, confirm the redaction was runtime-
only (CodeQL doesn't sanitize), rewrite to static-safety pattern.

Per .claude/rules/non-coercion-invariant.md HC-8: passphrase NEVER in
log path; operator authority over what gets logged preserved by total
redaction; <set>/<REDACTED> placeholders confirm presence without
revealing content.

Agency-Signature-Version: 1
Agent: Otto
Agent-Runtime: Claude Code (auto mode)
Agent-Model: claude-opus-4-7
Credential-Identity: aaron-otto-vscode
Credential-Mode: operator-authorized
Human-Review: pre-merge-pending
Human-Review-Evidence: codeql-re-fire-on-line-210-after-prior-redaction-insufficient
Action-Mode: substrate-fix-fwd-security
Task: B-0852.3a

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(B-0852.3a CI Copilot): activate mise + BUN_INSTALL in picker bash -c — match sibling 6.95a install steps (Copilot @1164)

Copilot finding: the picker invocation at Step 6.95-picker bash -c
didn't activate mise the way sibling 6.95a-claude/gemini/codex steps
do (lines 1119-1121 / 1129-1131 / 1139-1141 all
`eval "$(mise activate bash 2>/dev/null || true)"; bun ...` inside
the bash -c, with `BUN_INSTALL="$ZETA_HOME/.bun"` set). Without
mise activate, `bun` is not on the subshell PATH because mise installs
bun via shims; activate sets the PATH entry. Picker would fail with
"bun: command not found" at Step 6.95-picker time.

Fix: mirror the sibling pattern exactly:
- Add `BUN_INSTALL="$ZETA_HOME/.bun"` to sudo env prefix
- Add `set -o pipefail; eval "$(mise activate bash 2>/dev/null || true)";`
  prefix to bash -c
- Preserve --preserve-env=ZETA_CREDS_PASSPHRASE for passphrase forward

Verification: `bash -n full-ai-cluster/usb-nixos-installer/zeta-install.sh`
returns syntax OK.

Per .claude/rules/blocked-green-ci-investigate-threads.md verify-then-fix:
read the sibling step patterns at lines 1119-1141, confirm they all
follow same eval-mise-then-bun convention, apply the same to picker.

Agency-Signature-Version: 1
Agent: Otto
Agent-Runtime: Claude Code (auto mode)
Agent-Model: claude-opus-4-7
Credential-Identity: aaron-otto-vscode
Credential-Mode: operator-authorized
Human-Review: pre-merge-pending
Human-Review-Evidence: copilot-thread-PRRT_kwDOSF9kNM6FHfK8-on-pr-5450
Action-Mode: substrate-fix-fwd-correctness
Task: B-0852.3a

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 27, 2026
…luster common.nix imports — last gate for end-to-end USB cred-persistence test (Aaron 2026-05-27 USB priority) (#5476)

* feat(B-0852.4a): NixOS module zeta-creds-restore.nix — boot-time decrypt from ESP via systemd service (Aaron 2026-05-27 USB push; sibling to zeta-self-register.nix per B-0855.1)

Implements the boot-time consumer for the install-time picker (B-0852.3a
PR #5450). Composes with zeta-self-register.service which already
declares `after = "zeta-creds-restore.service"` per B-0855.1 module —
the dependency was wired upstream; this row makes the target service
actually exist.

**Module: full-ai-cluster/nixos/modules/zeta-creds-restore.nix**

NixOS module providing systemd service `zeta-creds-restore.service`:

- Disabled by default (`zeta.credsRestore.enable = false`); opt-in
  per host config (matches zeta-self-register sibling pattern)
- Ordering: `wantedBy=multi-user.target`, `after=local-fs.target` +
  `wants=local-fs.target` (ESP mounted before fire); B-0855.1 enforces
  `after=zeta-creds-restore.service` from its side
- ConditionPathExists guard: blob + USB UUID + restore CLI + bun shim
  must all exist (clean skip when picker wasn't run at install)
- Two passphrase modes (operator-configurable):
  - **file** (default): read from /run/zeta-creds-passphrase
    (operator pre-stages); deleted by ExecStopPost
  - **interactive**: systemd-ask-password on tty1 (300s timeout);
    writes zeta-readable temp file; deleted by ExecStopPost
- Invokes B-0852.2b restore CLI as zeta user via sudo with proper
  HOME + PATH + --target-root=/
- Optional --persona passthrough for per-persona-scoped creds
- Restart=on-failure with 30s backoff (per .claude/rules/non-coercion-invariant.md
  HC-8: required-cred failure surfaces honestly)

**Verification**: `nix-instantiate --parse` returns PARSE OK.

**What this unblocks for operator's USB test**:

End-to-end persist → restore → use chain now possible on real USB:
1. Operator reflashes USB
2. Boots, runs installer with ZETA_CREDS_PICKER=1 + ZETA_CREDS_PASSPHRASE=...
3. Picker writes /esp/zeta-creds.enc (B-0852.3a / PR #5450)
4. Operator enables zeta.credsRestore.enable=true + passphraseMode in
   host common.nix (B-0852.4d wiring; next sub-row)
5. Reboot → systemd fires zeta-creds-restore.service → blob decrypts →
   per-cred files populated in /home/zeta
6. zeta-self-register.service fires next per B-0855.1 ordering

Composes:
- B-0852.1 crypto (PR #5413; decrypt envelope)
- B-0852.2a envelope (PR #5421; parse blob format)
- B-0852.2b restore CLI (PR #5425; the binary this module wraps)
- B-0852.3a picker (PR #5450; produces the blob)
- B-0852.4 row (PR #5454; this is sub-row 4a)
- B-0852.5 manifest (PR #5414; drives per-cred path resolution)
- B-0855.1 zeta-self-register.nix (the sibling module that already
  expects this service to exist)
- B-0857 install.sh universal entry (install-time companion)

Sub-row status (per B-0852.4 row):
- 4b: interactive-mode is implemented in this PR (both modes ship together)
- 4c: file-mode is implemented in this PR (the default mode)
- 4d: wire into common.nix (next PR; simple imports list add)
- 4e: empirical USB end-to-end test (validates full chain on hardware)

Per .claude/rules/agent-worktree-hygiene-never-hold-main-...: isolated
worktree at /private/tmp/zeta-b0852-4a-module-1250z; operator primary
checkout untouched.

Per .claude/rules/non-coercion-invariant.md HC-8: operator authority
over creds preserved; passphrase NEVER logged; interactive prompt
operator-driven; file-mode operator-staged; failure surfaces via
journalctl + restart policy.

Per .claude/rules/methodology-hard-limits.md: clinical/security floor
operative; cred-restore is purely defensive operator-data-recovery
substrate; no offensive use.

Heartbeat-via-commit per CLAUDE.md (PR #5451): this commit IS the
externalized counter tick; AgencySignature v1 trailer below; named
bounded-wait is #5450 build-iso completion.

Agency-Signature-Version: 1
Agent: Otto
Agent-Runtime: Claude Code (auto mode)
Agent-Model: claude-opus-4-7
Credential-Identity: aaron-otto-vscode
Credential-Mode: operator-authorized
Human-Review: pre-merge-pending
Human-Review-Evidence: operator-direction-2026-05-27-usb-push-keep-pushing-forward
Action-Mode: substrate-implementation
Task: B-0852.4a

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(B-0852.4d): wire zeta-creds-restore.nix into cluster common.nix imports — last gate before end-to-end USB test (Aaron 2026-05-27 USB priority)

Adds `./zeta-creds-restore.nix` to `full-ai-cluster/nixos/modules/common.nix`
imports list right after `./zeta-self-register.nix` — matches the
ordering B-0855.1 documents (zeta-self-register declares
`after = "zeta-creds-restore.service"`; both share import position).

Disabled-by-default (per the module's mkEnableOption); host configs
opt in via `zeta.credsRestore.enable = true;` AND operator pre-stages
a passphrase source. Imported here so every cluster-node type
(control-plane / worker-gpu) inherits the same module surface; the
opt-in flip lives at host-config level not common.nix level.

Composes:
- B-0852.4a (this PR's earlier commit ef45b4f) — the module file itself
- B-0852.3a picker (PR #5450) — install-time blob writer
- B-0852.4 row (PR #5454 merged) — substrate-engineering parent
- B-0855.1 zeta-self-register.nix — already declares `after = "zeta-creds-restore.service"`
- iter-5.5.0 install flow — picker writes blob during install; module restores at boot

**Empirical USB test path now complete end-to-end**:
1. Reflash USB with ISO carrying these changes
2. Boot, run installer with ZETA_CREDS_PICKER=1 + ZETA_CREDS_PASSPHRASE=...
3. Step 6.95-picker writes /esp/zeta-creds.enc (B-0852.3a)
4. Operator enables `zeta.credsRestore.enable = true;` in host config
   + pre-stages /run/zeta-creds-passphrase
5. Reboot → zeta-creds-restore.service fires → blob decrypted →
   per-cred files populated in /home/zeta
6. zeta-self-register.service fires next per B-0855.1 ordering

Verification:
- `nix-instantiate --parse full-ai-cluster/nixos/modules/common.nix` → PARSE OK
- `nix-instantiate --parse full-ai-cluster/nixos/modules/zeta-creds-restore.nix` → PARSE OK

Per .claude/rules/non-coercion-invariant.md HC-8: opt-in default
preserves operator authority over per-host enablement; importing
the module surface doesn't activate it.

Per .claude/rules/agent-worktree-hygiene-never-hold-main-...: isolated
worktree at /private/tmp/zeta-b0852-4a-module-1250z; operator primary
checkout untouched.

Agency-Signature-Version: 1
Agent: Otto
Agent-Runtime: Claude Code (auto mode)
Agent-Model: claude-opus-4-7
Credential-Identity: aaron-otto-vscode
Credential-Mode: operator-authorized
Human-Review: pre-merge-pending
Human-Review-Evidence: operator-direction-2026-05-27-back-to-usb-after-heartbeat-iteration
Action-Mode: substrate-implementation-final-usb-gate
Task: B-0852.4d

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(B-0852.4a): 3 Copilot findings — P0 root-write for /etc paths + P0 ExecStopPost-never-fires + P1 USB UUID newline trim

3 Copilot threads on PR #5476:

**P0 (@180): sudo -u ${cfg.user} can't write to /etc paths.**
The default cred manifest includes /etc/zeta/operator-authorized-keys
+ /etc/ssh/ssh_host_* (root-owned paths zeta user can't write).
Fix: run restore CLI AS ROOT directly (drop the sudo -u zeta drop).
Post-restore find ${cfg.home} -user root -exec chown zeta:users
to fix ownership on user-facing creds (~/.config/gh, ~/.config/claude,
~/.gemini, ~/.codex). Operator's pre-existing configs (already
zeta-owned) untouched by the -user root filter.

**P0 (@189): RemainAfterExit=true + Type=oneshot means
ExecStopPost never fires on successful boot.**
The unit stays "active" after ExecStart returns; systemd doesn't
treat that as a "stop" event so ExecStopPost is skipped. Passphrase
cleanup never runs. Fix: move cleanup to bash EXIT trap inside
ExecStart — fires on ANY exit path (success or failure), unaffected
by RemainAfterExit semantics. Removed standalone ExecStopPost.
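
The fix itself is a bash EXIT trap inside ExecStart. As an analogy only, the same "cleanup on every exit path" guarantee can be expressed in TypeScript with try/finally; `withPassphraseCleanup` is a hypothetical helper and is not part of the module.

```typescript
// Run the restore step and always run cleanup afterwards,
// whether the step returns normally or throws.
function withPassphraseCleanup<T>(run: () => T, cleanup: () => void): T {
  try {
    return run();
  } finally {
    cleanup(); // analogous to the EXIT trap: fires on success and on failure
  }
}
```

Tying the cleanup to the code path that used the passphrase removes the dependency on when, or whether, the service manager considers the unit stopped.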

**P1 (@140): USB_UUID trailing newline from cat.**
`cat /etc/zeta/usb-uuid` includes trailing \n if file ends with one.
Fix: `tr -d '[:space:]' < ${cfg.usbUuidPath}` strips all whitespace
(safer than just newlines; covers \r\n + leading whitespace too).
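
For comparison, a rough TypeScript equivalent of `tr -d '[:space:]'` is sketched below. `normalizeUsbUuid` is a hypothetical name; the character class lists the six characters in the POSIX `[:space:]` class for the C locale rather than using `\s`, which also matches Unicode spaces.

```typescript
// Remove every space, tab, newline, vertical tab, form feed, and carriage return,
// wherever it appears in the string.
function normalizeUsbUuid(raw: string): string {
  return raw.replace(/[ \t\n\v\f\r]/g, "");
}
```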

Per .claude/rules/blocked-green-ci-investigate-threads.md verify-then-fix:
each Copilot finding read against actual file content; all 3 real
findings; bundled fix with rationale per finding.

Verification: `nix-instantiate --parse full-ai-cluster/nixos/modules/zeta-creds-restore.nix`
returns PARSE OK.

Per .claude/rules/non-coercion-invariant.md HC-8: operator authority
preserved (chown only touches root-owned files; pre-existing
zeta-owned files untouched).

Agency-Signature-Version: 1
Agent: Otto
Agent-Runtime: Claude Code (auto mode)
Agent-Model: claude-opus-4-7
Credential-Identity: aaron-otto-vscode
Credential-Mode: operator-authorized
Human-Review: pre-merge-pending
Human-Review-Evidence: copilot-3-findings-on-pr-5476-2-p0-1-p1
Action-Mode: substrate-fix-fwd-security-plus-correctness
Task: B-0852.4a

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>