feat(B-0812 / B-0835 Bug 4): iter-5.4.1 self-registration commit+push (Step 6.9); opens registration PR per node-install; fixes CORE REQUIREMENT failure #5352
Merged
AceHack merged 1 commit on May 26, 2026
Implements B-0812 (iter-5.4.1; B-0794 sub-target 3 full) per the
operator's CORE REQUIREMENT from 2026-05-26 physical hardware-support
test: "post-boot fully-operational chain without operator login".
Adds Step 6.9 to zeta-install.sh (after Step 6.8 iter-5.4.0 gh-auth,
before nixos-install). Conditional on GH_AUTH_OK=1 (composes additively
with iter-5.4.0; cascade-skips if gh-auth was skipped).
Step 6.9 substrate:
1. Resolve operator GH user via `gh api /user --jq .login`
2. Resolve node hostname from $HOSTNAME_DST (iter-5.2 substrate) OR
fallback to flake-host $HOST with WARN
3. Hardware probe (CPU model + memory + cores + GPU + storage devices
+ IP + MAC) — emits ClusterNode-schema-conformant YAML fragment
per B-0812 Sub-target 1
4. Compose ClusterNode resource (apiVersion/kind/metadata/spec/hardware)
matching B-0794 sub-target 2 provisional CRD schema
5. Clone Zeta repo to tempdir via `gh repo clone --depth 1`
6. Write to maintainers/<operator-gh-user>/cluster-nodes/<hostname>/node.yaml
7. Configure git user.{name,email} from gh-auth'd operator identity
(commit-author = operator; clean attribution chain; no shipped
credentials)
8. Commit + push to fresh branch `register-<hostname>-<UTC-timestamp>`
9. Open PR via `gh pr create` per B-0812 Sub-target 3 (default; safer
than direct-to-main; operator reviews node-config before ArgoCD
reconciles)
10. Surface PR URL in install-complete banner per Sub-target 4
("phone-merge OK — no laptop kubectl required")
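The naming conventions in items 6 and 8 above can be sketched as two small helpers. This is a minimal illustration, not the code in zeta-install.sh: the function names are hypothetical, and the timestamp format (`YYYYMMDDTHHMMSSZ`) is an assumption, since the source only says `<UTC-timestamp>`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative and not taken from zeta-install.sh.

# registration_branch HOSTNAME [TIMESTAMP]
# Builds the per-node branch name: register-<hostname>-<UTC-timestamp>.
# The timestamp format is an assumption; pass one explicitly for reproducible output.
registration_branch() {
  local hostname="$1"
  local ts="${2:-$(date -u +%Y%m%dT%H%M%SZ)}"
  printf 'register-%s-%s\n' "$hostname" "$ts"
}

# registration_path GH_USER HOSTNAME
# Builds the repo-relative path that the node manifest is written to.
registration_path() {
  local gh_user="$1" hostname="$2"
  printf 'maintainers/%s/cluster-nodes/%s/node.yaml\n' "$gh_user" "$hostname"
}

# Example values only ("octocat" and "node-01" are placeholders).
registration_branch node-01 20260526T120000Z
registration_path octocat node-01
```

Because the branch name includes both hostname and timestamp, two installs of the same node never collide on a branch, which matches the "no main collision; mergeable independently" property listed under HARD LIMITS.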
Sub-targets satisfied:
- [x] Sub-target 1: hardware-probe shell function emits valid YAML
- [x] Sub-target 2: node.yaml conforms to provisional ClusterNode schema
- [x] Sub-target 3: commit+push opens a PR
- [x] Sub-target 4: install banner shows registration PR URL
- [ ] Sub-target 5 (empirical end-to-end): requires next physical test
with this code; deferred to operator's re-test cycle
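For Sub-target 1, a hardware probe that emits a YAML fragment might look roughly like the sketch below. It is an assumption-laden illustration: the function names and the YAML field names (`cpuModel`, `cores`, `memoryKiB`) are hypothetical and are not confirmed by the ClusterNode schema; only the use of `/proc/cpuinfo` for the CPU model comes from the source.

```shell
#!/usr/bin/env bash
# Hypothetical probe helpers; field names below are NOT the real ClusterNode schema.

# cpu_model_from FILE: first "model name" value from a /proc/cpuinfo-style file.
cpu_model_from() {
  awk '/^model name/ { sub(/^[^:]*:[ \t]*/, ""); print; exit }' "$1"
}

# mem_kib_from FILE: MemTotal (in KiB) from a /proc/meminfo-style file.
mem_kib_from() {
  awk '/^MemTotal:/ { print $2; exit }' "$1"
}

# emit_hardware_fragment CPU_MODEL MEM_KIB CORES
# Prints an unindented YAML fragment; the caller indents it to fit the manifest.
# Note: a CPU model containing a double quote would need escaping first.
emit_hardware_fragment() {
  local cpu_model="$1" mem_kib="$2" cores="$3"
  printf 'hardware:\n'
  printf '  cpuModel: "%s"\n' "$cpu_model"
  printf '  cores: %s\n' "$cores"
  printf '  memoryKiB: %s\n' "$mem_kib"
}

# Usage on a Linux host (skipped silently where /proc is unavailable).
if [ -r /proc/cpuinfo ] && [ -r /proc/meminfo ]; then
  emit_hardware_fragment \
    "$(cpu_model_from /proc/cpuinfo)" \
    "$(mem_kib_from /proc/meminfo)" \
    "$(grep -c '^processor' /proc/cpuinfo)"
fi
```

Taking the file path as a parameter keeps the parsers testable against fixture files, without depending on the machine that runs the test.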
Composes_with substrate:
- B-0794 parent (sub-target 3 minimum-viable → full implementation)
- B-0789 cluster-as-PR-author (homelab-first instance using operator's
gh-auth from iter-5.4.0; no per-node deploy key)
- B-0790 zero-dev-machines (operator can review+merge from phone)
- B-0782 cluster-IS-DIO (bridges "node booted" → "node IS git-native
cluster substrate")
- B-0813 iter-5.4.2 (ArgoCD reconciles maintainers/*/cluster-nodes/**;
separate row; this row's commit+push triggers that reconciliation)
- B-0835 Bug 4 (CRITICAL self-registration failure; this row IS the
fix for that bug)
Git-as-source-of-truth + CockroachDB-repopulates-from-git architecture
(operator 2026-05-26): this row writes the source-of-truth node.yaml
to git; CockroachDB ingests from git when operational; Addison's
hardware-inventory SQL queries run against CockroachDB which can be
rebuilt from git anytime.
No HARD LIMITS violated:
- NO credentials baked on ISO (uses operator's gh-auth from iter-5.4.0)
- NO secrets in commit (only hardware specs + operator identity)
- Commit author = operator (clean attribution)
- Branch is per-node (no main collision; mergeable independently)
- Cleanup: tempdir removed at end of Step 6.9
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
Adds installer-time node self-registration so a freshly installed cluster node can create a Git-backed registration PR after successful GitHub authentication.
Changes:
- Adds Step 6.9 to probe hardware, compose `node.yaml`, commit/push a branch, and open a registration PR.
- Adds install-complete banner output for the self-registration PR URL or fallback instructions.
- Integrates the flow with the existing Step 6.8 `gh auth login` success path.
AceHack added a commit that referenced this pull request on May 26, 2026
…+ buying-decisions substrate (no more buying willy nilly) (#5353)
Per operator 2026-05-26, composed from two messages:
1. "git for source of truth and coackroach can be repopulated from"
2. "we will also have an inventory for every machine and know if some are missing registration when she is done with her hardware inventory work. and know what and how we need to expand so we are not buying willy nilly anymore."
Files B-0836 as P1 substrate-engineering target. 4-phase decomposition:
- Phase 1: Addison's CSV → DuckDB ingestion (immediate; doesn't need cluster)
- Phase 2: tools/cluster/reconcile-inventory-vs-cluster.ts (this row core; surfaces 3 gap types: missing-registration / phantom-node / expansion-buying-decision)
- Phase 3: CockroachDB ingestion when cluster operational (materialized view from git source of truth; repopulates anytime)
- Phase 4: tools/cluster/buying-recommendations.ts (closes the loop; data-driven purchase decisions)
Architecture: git (B-0812 cluster-nodes + Addison inventory) is source of truth; CockroachDB is materialized view; reconciliation diffs both sides; buying decisions informed by capacity-gap analysis.
Composes_with:
- B-0812 iter-5.4.1 (cluster-side data source; PR #5352 in flight)
- B-0794 parent (full GitOps cluster bring-up)
- B-0782 cluster-IS-DIO (git source of truth)
- B-0789 cluster-as-PR-author
- Addison's hardware-inventory paper-audit work (Phase 1 ingestion)
- 2026-05-26 physical hardware-support test session
Highest-value operator outcome: shifts hardware-purchase decisions from "guess what we need" to "data says we need N more of make/model X for workload Y." Materially affects operator cost-management.
Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
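The Phase 2 reconciliation idea (diff the inventory against the registered nodes) can be illustrated with a plain set difference. This sketch is not the planned `reconcile-inventory-vs-cluster.ts` tool: it assumes each side has already been reduced to a file of one hostname per line, the function name is hypothetical, and it covers only two of the three gap types (missing-registration and phantom-node).

```shell
#!/usr/bin/env bash
# Hypothetical sketch; input format (one hostname per line) is an assumption.

# reconcile INVENTORY_FILE REGISTERED_FILE
# Prints "missing-registration <host>" for hosts only in the inventory and
# "phantom-node <host>" for hosts only in the registered set.
reconcile() {
  local inv reg
  inv="$(mktemp)"
  reg="$(mktemp)"
  # comm requires both inputs sorted under the same collation, so pin LC_ALL=C.
  LC_ALL=C sort -u "$1" > "$inv"
  LC_ALL=C sort -u "$2" > "$reg"
  LC_ALL=C comm -23 "$inv" "$reg" | sed 's/^/missing-registration /'
  LC_ALL=C comm -13 "$inv" "$reg" | sed 's/^/phantom-node /'
  rm -f "$inv" "$reg"
}
```

In this framing, the registered set would come from listing `maintainers/*/cluster-nodes/*/node.yaml` in git, which is consistent with git being the source of truth and CockroachDB being a rebuildable view.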
All 5 Copilot findings addressed in follow-up PR #5355 (bundle-fix): subshell error handling (CRITICAL; would kill installer), MAC parsing, schema alignment (roles[]/registration.maintainer/hardware.storage per B-0813 + B-0817), comment-name redaction.
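The subshell error-handling concern can be shown with a small, generic wrapper. This is a sketch of the pattern only (subshell-local `set +e` plus an outer `|| true`), not the installer's actual Step 6.9 code; `run_optional_step` is a hypothetical name. The source describes the installer as running under `set -euo pipefail`; the sketch enables only `-e` and `-u`, since errexit is the option that matters for this failure mode.

```shell
#!/usr/bin/env bash
set -eu   # errexit is what would otherwise abort the caller on any failure

# run_optional_step CMD [ARGS...]
# Runs a warning-only step so that its failure cannot abort the calling script.
run_optional_step() {
  (
    set +e   # subshell-local: failures inside no longer terminate the subshell
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
      echo "WARN: optional step failed (rc=$rc); continuing" >&2
    fi
    exit 0
  ) || true  # defense in depth: even an unexpected subshell exit is swallowed
}

run_optional_step false          # fails, emits a WARN on stderr
echo "installer continues"       # still reached
```

The design point is that a skippable step reports its failure but always returns success to the caller, so later mandatory steps still run.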
AceHack added a commit that referenced this pull request on May 26, 2026
…ing + subshell error handling + comment-name redaction (#5355)
5 legitimate findings on PR #5352 (iter-5.4.1 self-registration), all real bugs that would block end-to-end self-registration:
1. **CRITICAL: subshell could kill installer** (line 806 of #5352): subshell inherited `set -euo pipefail`; ANY failure inside (git push permission denied, gh pr create scope missing, network drop) would propagate out + abort the installer BEFORE nixos-install runs. Step 6.9 is documented warning-only/skippable so it MUST never abort. FIX: subshell-local `set +e` + outer `|| true` defense-in-depth + explicit success/fail handling around git push + gh pr create with WARN-to-stderr on failure.
2. **MAC parsing wrong** (line 730 of #5352): `awk ... $(NF-2)` extracted `brd` not the MAC. `ip -o link` outputs `link/ether <MAC> brd <broadcast>`. FIX: `awk '{for(i=1;i<=NF;i++) if($i=="link/ether"){print $(i+1); exit}}'` parses the field AFTER `link/ether` correctly.
3. **Schema mismatch: spec.roles array** (line 749 of #5352): had `spec.role: $HOST` (scalar) but B-0813 ClusterNode CRD defines `spec.roles` as ARRAY. FIX: `spec.roles:\n - $HOST`.
4. **Schema mismatch: spec.registration.maintainer** (line 749 of #5352): had `spec.maintainer: $MAINTAINER` (top-level) but B-0817 places maintainer under `spec.registration.maintainer` (B-0813 CRD doesn't allow arbitrary spec fields; the reconciler reads `spec.registration.*` for maintenance metadata). FIX: nested under `spec.registration:` with timestamp + flake-commit + flake-host + registered-via siblings. Also added `metadata.labels` for the standard `zeta.lucent-financial-group.com/maintainer` label to support kubectl grouping.
5. **Schema mismatch: spec.hardware.storage** (line 761 of #5352): had `storage:` as sibling of `hardware:` but B-0813 places storage UNDER hardware block. FIX: indent storage 6 spaces (under hardware:) instead of 4 (sibling). Storage lines indented to 8 spaces accordingly. Same for network block (moved under hardware).
6. **Name attribution in comment** (line 691 of #5352): comment had "maintainers/aaron/cluster-nodes/", a direct maintainer name in current-state script. Per AGENT-BEST-PRACTICES no-name-attribution in .claude/rules/** + code/docs. FIX: replaced with placeholder "maintainers/<operator>/cluster-nodes/".
No new substrate; all 5 fixes preserve existing structure + intent. Bash syntax OK.
Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
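The MAC-parsing fix can be exercised against a sample line. The awk program below is the one quoted in the commit message; the wrapper name `mac_from_link_line` and the sample `ip -o link` line (including the trailing `altname` field) are illustrative assumptions. Indexing from the end of the line depends on how many trailing fields a given interface prints, whereas keying on `link/ether` does not.

```shell
#!/usr/bin/env bash
# mac_from_link_line: reads `ip -o link`-style lines on stdin and prints the
# field that follows "link/ether" on the first line that has one.
# (Wrapper name is hypothetical; the awk body is quoted from the fix.)
mac_from_link_line() {
  awk '{for(i=1;i<=NF;i++) if($i=="link/ether"){print $(i+1); exit}}'
}

# Illustrative sample line; real output varies by interface and iproute2 version.
sample='2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000\    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff altname enp1s0'

printf '%s\n' "$sample" | mac_from_link_line
```

A loopback line carries `link/loopback` rather than `link/ether`, so the function prints nothing for it and moves on to the next line.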
This was referenced May 27, 2026
AceHack pushed a commit that referenced this pull request on May 27, 2026
…o parenthesis closes on first line
Copilot finding on the audit-installer-substrate.ts iter-5.4 sentinel addition: the comment 'iter-5.4.1 YAML schema sentinels (catches the Copilot findings from #5352' opened a parenthesis on line 98 that didn't close until line 100 ('block)'). To a code-reader scanning line 98, the sentence reads as unfinished.
Fix: restructure as 'sentinels. Each catches a specific Copilot finding on PR #5352: ...' with no multi-line parenthesis; each schema-correction is a complete clause.
AceHack added a commit that referenced this pull request on May 27, 2026
…low (asserts logical relationships between Bug 2a + 2b fix elements, ClusterNode YAML schema, iter-5.4.1 cascade gating) (#5367)
Layer 2a of the 4-layer CI testing approach for iter-5.4 substrate:
- Layer 1 (#5365): source-level sentinel audit (substring presence)
- Layer 2a (THIS PR): structural-behavioral test (logical relationships)
- Layer 2b (future PR): true mock-gh shim execution (refactor iter-5.4 into sourceable bash function; test against mock gh on PATH with success/scope-error/empty modes)
- Layer 3 (B-0833 App A): mock GH device-code endpoint
- Layer 4 (B-0831): QEMU full-install + cluster auto-join
What this layer catches that Layer 1 doesn't:
1. `gh auth setup-git` is INSIDE the SUCCESS branch of `if gh auth login; then` (not just present somewhere in the script; placement matters: if setup-git ended up outside the success branch, it'd run on auth failure too).
2. setup-git is called BEFORE the ssh-key fetch (ordering matters: the git credential helper must be wired before any git push attempt).
3. SSH_KEY_ERR_FILE is wired AS the stderr redirect to `gh ssh-key list` (Bug 2b: if the file is created but not used as stderr, scope-error discrimination silently fails).
4. 3 distinct WARN paths exist (scope-error, empty-no-keys, pipe-broke) with their substrate-honest recovery messages (recovery commands for scope-error; settings/keys URL for empty-no-keys).
5. GH_AUTH_OK=1 is set EXACTLY ONCE, in the success branch of gh auth login (not in any failure or skip path).
6. iter-5.4.1 self-reg is gated on `GH_AUTH_OK = 1` (cascade-skip discipline; runs only if iter-5.4.0 succeeded).
7. iter-5.4.1 subshell uses `set +e` + the subshell wrapper closes with `|| true` (Copilot finding on #5352: outer set -euo pipefail would propagate subshell failure out of the install).
8. ClusterNode YAML schema sentinels (catches the 3 Copilot findings on #5352: spec.role was scalar instead of array; spec.maintainer was at flat path instead of nested under spec.registration; spec.storage was sibling of hardware instead of nested under it).
9. MAC parsing extracts the field AFTER `link/ether` (prior bug was `$(NF-2)` extracting `brd` instead of the MAC).
10. Self-reg branch name shape matches `register-<HOSTNAME>-<UTCTS>` (catches accidental rename that would break the cluster-side ArgoCD pattern watching register-* branches).
Test approach: parse zeta-install.sh as text; extract iter-5.4.0 and iter-5.4.1 blocks by step-header boundaries; assert regex relationships within each block. 23 tests, 35 expect() calls, ~150ms runtime.
Layer 2b deferred: requires refactoring iter-5.4.0 + iter-5.4.1 into a sourceable bash function so we can mock `gh` on PATH and assert behavior across the 4 modes (success/scope-error/empty/pipe-broke). That's a bigger refactor and a separate PR. Structural-behavioral catches the same failure modes at much lower cost as the inner-loop test.
Composes with:
- PR #5364 (Bug 2a + 2b fixes; this layer asserts the fixes' STRUCTURE not just their presence)
- PR #5352 (Copilot YAML schema findings; this layer asserts the schema corrections held)
- PR #5365 (Layer 1 sentinels; composes; same workflow runs both)
- B-0831 (cascade #6 full-install QEMU; this is layer 2a)
- B-0833 (interactive-login vs baked-in-keys tension; layer 3 of cascade)
Wired into .github/workflows/build-ai-cluster-iso.yml as a fast preflight (runs BEFORE the ~15-min Nix build; fails fast if iter-5.4 substrate has regressed structurally).
Verified locally:
$ bun test tools/ci/test-iter-54-install-flow.test.ts
bun test v1.3.13
23 pass
0 fail
35 expect() calls
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Lior <lior@zeta.dev>
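The ordering checks described in this commit (for example, setup-git before the ssh-key fetch) amount to comparing where two strings first occur in the script text. The real tests are written in TypeScript and run under `bun test`; the shell sketch below only illustrates the same idea, with hypothetical function names and first-occurrence line numbers as a simplifying assumption.

```shell
#!/usr/bin/env bash
# Hypothetical structural check; not the actual test-iter-54-install-flow test.

# first_line_of NEEDLE FILE: line number of the first fixed-string match,
# or empty output if NEEDLE does not occur.
first_line_of() {
  { grep -n -F -- "$1" "$2" || true; } | head -n 1 | cut -d: -f1
}

# appears_before A B FILE: succeeds only if both strings occur in FILE and
# the first occurrence of A is on an earlier line than the first of B.
appears_before() {
  local a b
  a="$(first_line_of "$1" "$3")"
  b="$(first_line_of "$2" "$3")"
  [ -n "$a" ] && [ -n "$b" ] && [ "$a" -lt "$b" ]
}
```

For instance, `appears_before 'gh auth setup-git' 'gh ssh-key list' zeta-install.sh` would express item 2 above. A presence-only sentinel cannot catch a reordering, which is the gap this layer is meant to close.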
AceHack added a commit that referenced this pull request on May 27, 2026
…ntinels (#5365)
* ci(B-0831 layer-1): extend audit-installer-substrate with iter-5.4 sentinels (gh auth setup-git, ssh-key stderr-capture, self-reg flow, ClusterNode YAML schema, MAC parsing)
Layer 1 of a 4-layer CI testing approach for the iter-5.4 substrate (B-0812 self-registration + B-0813 cluster reconciliation + B-0835 bug fixes Bug 2a + 2b on #5364):
- Layer 1 (THIS PR): source-level sentinel audit (cheap; catches regression)
- Layer 2 (next PR): behavioral test with mock gh shim on PATH
- Layer 3 (B-0833 Approach A): mock GH device-code endpoint
- Layer 4 (B-0831 cascade #6): QEMU full-install + cluster auto-join
This layer extends the existing REQUIRED_SENTINELS for full-ai-cluster/usb-nixos-installer/zeta-install.sh with 14 new substrings, organized into 3 groups:
(a) iter-5.4 flow anchors (5 sentinels):
- "Step 6.8: iter-5.4.0 homelab gh-auth + operator pubkey copy"
- "Step 6.9: iter-5.4.1 self-registration commit+push"
- "gh auth login"
- "gh ssh-key list"
- "gh repo clone Lucent-Financial-Group/Zeta"
(b) Bug 2a + 2b fix-regression catches (3 sentinels):
- "gh auth setup-git": Bug 2a fix; presence catches removal
- "SSH_KEY_ERR_FILE": Bug 2b fix; presence catches stderr-capture removal
- "admin:public_key": Bug 2b fix; presence catches scope-recovery message removal
(c) ClusterNode YAML schema sentinels (5 sentinels; catches the Copilot findings on #5352 where spec.role was scalar, spec.maintainer was at wrong path, spec.storage was a sibling instead of under hardware block):
- "apiVersion: zeta.lucent-financial-group.com/v1"
- "kind: ClusterNode"
- " roles:": spec.roles is ARRAY per B-0813
- " registration:": spec.registration block per B-0813
- " hardware:": spec.hardware block per B-0813
(d) Hardware-probe sentinels (catches MAC parsing regression from #5352):
- "/proc/cpuinfo": CPU_MODEL extraction
- "link/ether": MAC parses field after link/ether (not before)
(e) Self-reg branch-shape sentinel:
- "register-${NODE_HOSTNAME}-": iter-5.4.1 branch name pattern
Composes with:
- PR #5364 (Bug 2a + 2b fixes that this audit will catch if regressed)
- PR #5352 (iter-5.4.1 Copilot findings that this audit will catch)
- PR #5354 (Bug 1 hostname symlink fix; already covered by existing sentinels)
- B-0831 (cascade #6 full-install QEMU test; this is layer 1 of that work)
- B-0833 (interactive-login vs baked-in-keys tension; layer 3 of the cascade)
Why source-level + cheap-first:
- Workflow build-ai-cluster-iso.yml runs `bun tools/ci/audit-installer-substrate.ts` on every PR touching the installer surface
- Source-level catches substrate-regression at PR-author-time (seconds)
- vs Layer 4 QEMU full-install (~minutes; expensive; flaky)
- Layer 1 is the inner loop; Layers 2-4 are the outer loops
Per `.claude/rules/verify-existing-substrate-before-authoring.md`: substrate-inventory pass found `tools/ci/audit-installer-substrate.ts` already has the REQUIRED_SENTINELS pattern for iter-4.2 + iter-5.1 + iter-5.2 + iter-5.2.2; this PR extends with iter-5.4 sentinels rather than minting parallel substrate.
Verified: `bun tools/ci/audit-installer-substrate.ts` exits 0 ("PASS — 10 required files + 5 sentinel-file assertions OK") with the extended sentinel list against the current installer script at origin/main HEAD (commit 19d9617 from #5364).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
* fix(#5365 Copilot): reflow iter-5.4.1 YAML schema sentinels comment so parenthesis closes on first line
Copilot finding on the audit-installer-substrate.ts iter-5.4 sentinel addition: the comment 'iter-5.4.1 YAML schema sentinels (catches the Copilot findings from #5352' opened a parenthesis on line 98 that didn't close until line 100 ('block)'). To a code-reader scanning line 98, the sentence reads as unfinished.
Fix: restructure as 'sentinels. Each catches a specific Copilot finding on PR #5352: ...' with no multi-line parenthesis; each schema-correction is a complete clause.
---------
Co-authored-by: Lior <lior@zeta.dev>
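A source-level sentinel audit is essentially a fixed-string presence check over the installer script. The actual audit is `tools/ci/audit-installer-substrate.ts`; the sketch below is a shell approximation of the same idea, with a hypothetical function name, and it makes no claim about how the TypeScript tool is structured.

```shell
#!/usr/bin/env bash
# Hypothetical approximation of a sentinel audit; not the real audit tool.

# audit_sentinels FILE SENTINEL...
# Reports every sentinel string that is absent from FILE on stderr and
# returns non-zero if at least one is missing.
audit_sentinels() {
  local file="$1" missing=0 s
  shift
  for s in "$@"; do
    # -F: match literally (no regex); --: allow sentinels that start with "-".
    if ! grep -q -F -- "$s" "$file"; then
      echo "MISSING sentinel: $s" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as `audit_sentinels zeta-install.sh 'gh auth setup-git' 'SSH_KEY_ERR_FILE' 'link/ether'` would fail fast if any of those strings were removed. Matching literally with `-F` matters because several sentinels in the list above contain regex metacharacters such as `.` and `$`.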
Summary
Implements B-0812 (iter-5.4.1; B-0794 sub-target 3 full) per the operator's CORE REQUIREMENT from 2026-05-26 physical hardware-support test: "post-boot fully-operational chain without operator login."
Adds Step 6.9 to zeta-install.sh — conditional on GH_AUTH_OK=1 (composes additively with Step 6.8 iter-5.4.0; cascade-skips if gh-auth was skipped).
10-step Step 6.9 substrate
All 4 B-0812 sub-targets satisfied
Git-as-source-of-truth + CockroachDB architecture
Per operator 2026-05-26: "git for source of truth and coackroach can be repopulated from". This row writes the source-of-truth node.yaml to git; CockroachDB ingests from git when operational; Addison's hardware-inventory SQL queries run against CockroachDB which can be rebuilt from git anytime.
HARD LIMITS preserved
Test plan
🤖 Generated with Claude Code